Follow the Anti-scraping protections course in Apify Academy for a complete guide with the latest tips & tricks.
Some websites adopt anti-scraping protections. However, it is sometimes still reasonable and fair (and, based on a recent US court ruling, also legal) to extract data from them. In this article, we will go through the most commonly used anti-scraping protection techniques and show you how to bypass them.
There are four main categories of anti-scraping tools:
IP detection
IP rate limiting
Browser detection
Tracking user behavior
Anti-scraping protections based on IP detection
Some protection techniques deny access to content based on the location of your IP address. These websites only want to serve their content to users from specific countries.
Other protection techniques block access based on the IP range your address belongs to. This kind of anti-scraping protection usually aims at reducing the amount of non-human traffic. For instance, websites will deny access to IP ranges of Amazon Web Services and other commonly known ranges. It can often be easily bypassed by the use of a proxy server.
On the Apify platform, you can use our pool of proxy servers based in the United States, you can ask us to provide you with a custom dedicated pool from the countries you need, or you can use your own proxy servers from services like Oxylabs or Bright Data (formerly Luminati).
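For example, here is a minimal sketch of launching Puppeteer through a proxy server with the Apify SDK; the proxy URL is just a placeholder, and the snippet assumes a version of the SDK where Apify.launchPuppeteer() accepts a proxyUrl option:
const Apify = require('apify');

Apify.main(async () => {
    // Route all browser traffic through a proxy server.
    // Replace the placeholder URL with your own proxy or an Apify Proxy URL.
    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'http://username:password@proxy.example.com:8000',
    });
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    // The target website now sees the proxy's IP address instead of yours
    await page.close();
    await browser.close();
});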
Anti-scraping protections based on IP rate limiting
The second most common anti-scraping protection technique is to limit access based on the number of requests made from a single IP address in a certain period of time.
This kind of anti-scraping protection can be either manual (meaning a human is checking logs, and if they see large volumes of traffic from the same IP address, they block it) or automatic.
For example, for google.com, you can typically make only around 300 requests per day, and if you reach this limit, you will run into a CAPTCHA instead of search results. Another example could be a website that allows ten requests per minute and throws an error for anything above this threshold. These anti-scraping protection techniques can be temporary or permanent.
There are two ways to work around rate limiting. One option is to limit the maximum concurrency and, once you are down to a concurrency of 1, possibly also introduce delays into the execution to slow the crawling down. The second option is to use proxy servers and rotate IP addresses after a certain number of requests.
To lower the concurrency when using our SDK, pass the maxConcurrency option to the crawler setup (see the sketch below). If you use scrapers from our Store, you can usually set the maximum concurrency in the input. If even maxConcurrency: 1 is too fast, you can add delays on top of that, although that is rarely necessary.
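For instance, here is a minimal sketch of a PuppeteerCrawler limited to a single concurrent browser tab; the start URL is just a placeholder, and it assumes a version of the Apify SDK that provides Apify.openRequestList():
const Apify = require('apify');

Apify.main(async () => {
    // A request list with a single placeholder start URL
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://www.example.com' },
    ]);
    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        // Process at most one page at a time to stay under the rate limit
        maxConcurrency: 1,
        handlePageFunction: async ({ page, request }) => {
            // Do your scraping...
        },
    });
    await crawler.run();
});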
Here is how you can add such a delay in Web Scraper.
async function pageFunction(context) {
    // Just wait 5 seconds on each page
    await context.waitFor(5000);
    // Do your scraping...
}
With an Apify Actor, you can introduce a delay before execution using the sleep() function from the Apify SDK as follows:
const Apify = require('apify');

Apify.main(async () => {
    await Apify.utils.sleep(10 * 1000);
    // Any code below will be delayed by 10 seconds...
});
To use the second method and rotate proxy servers in your Apify Actor or task, you can simply pass the proxyConfiguration either to the input or to the crawler setup.
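As a rough sketch, assuming a newer version of the Apify SDK that provides Apify.createProxyConfiguration(), rotating IP addresses through Apify Proxy in a crawler might look like this:
const Apify = require('apify');

Apify.main(async () => {
    // Proxy configuration backed by Apify Proxy; the crawler picks proxy URLs
    // from it, so requests are spread across different IP addresses
    const proxyConfiguration = await Apify.createProxyConfiguration();

    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://www.example.com' }, // placeholder start URL
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        proxyConfiguration,
        handlePageFunction: async ({ page, request }) => {
            // Do your scraping...
        },
    });

    await crawler.run();
});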
Anti-scraping protections based on browser detection
Another relatively pervasive form of anti-scraping protection is based on the web browser that you are using.
User agents
Some websites use the detection of User-Agent HTTP headers to block access from specific devices. You can use a rotation of user agents to overcome this limit, but you should also be careful, as many libraries contain outdated user agents that can make the situation worse.
For now, the Apify SDK doesn't provide its own user agent rotation, until we figure out the best solution. However, both the Apify.launchPuppeteer() function and PuppeteerCrawler accept a userAgent parameter. Here is an example of launching Puppeteer with a random user agent using the modern-random-ua NPM package:
const Apify = require('apify');
const randomUA = require('modern-random-ua');

Apify.main(async () => {
    // Set one random modern user agent for the entire browser
    const browser = await Apify.launchPuppeteer({
        userAgent: randomUA.generate(),
    });
    const page = await browser.newPage();
    // Or you can set a user agent for a specific page
    await page.setUserAgent(randomUA.get());
    // And work on your code here
    await page.close();
    await browser.close();
});
Blocked PhantomJS
Old Apify crawlers used PhantomJS to open web pages, but when you open a web page in PhantomJS, it adds variables to the window object that make it easy for browser detection libraries to figure out that the connection is automated and not from a real person. Websites that employ protection against PhantomJS will usually either block these connections or, even worse, mark the used IP address as a robot and ban it. Some scraping technologies are still based on PhantomJS.
The only way to crawl websites with this kind of anti-scraping protection is to switch to a standard web browser, like headless Chrome or Firefox. That's one of the reasons why we launched Apify Actors. All our Actors in Apify Store and our SDK use headless or headful Chrome.
Blocked headless Chrome with Puppeteer
Puppeteer is essentially a Node.js API to headless Chrome. Although it is a relatively new library, there are already anti-scraping solutions on the market that can detect its usage based on a variable it puts into the browser's window.navigator property.
As a start, we developed a solution that removes the property from the web browser, and thus prevents this kind of anti-scraping protection from figuring out that the browser is automated. This feature later expanded into a stealth module that encompasses many useful tricks to make the browser look more human-like.
Here is an example of how to use it with Puppeteer and headless Chrome:
const Apify = require('apify');

Apify.main(async () => {
    const browser = await Apify.launchPuppeteer({ stealth: true });
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    // Add rest of your code here...
    await page.close();
    await browser.close();
});
Browser fingerprinting
Another option sometimes used by anti-scraping protection tools is to create a unique fingerprint of the web browser and link it, via a cookie, to the browser's IP address. If the IP address then changes but the cookie with the fingerprint stays the same, the website will block the request.
In this way, sites are also able to track or ban fingerprints that are commonly used by scraping solutions - for example, Chromium with the default window size running in headless mode.
The best way to fight this type of protection is to remove cookies, change your browser's parameters on each run, and switch to a real Chrome browser instead of Chromium.
Here is an example of how to launch Puppeteer with Chrome instead of Chromium using Apify SDK:
const browser = await Apify.launchPuppeteer({
    useChrome: true,
});
const page = await browser.newPage();
This example shows how to remove cookies from the current page object:
// Get current cookies from the page for certain URL
const cookies = await page.cookies('https://www.example.com');
// And remove them
await page.deleteCookie(...cookies);
Note that the snippet above needs to be run before you call page.goto()!
And this is how you can randomly change the size of the Puppeteer window using the page.setViewport() function:
await page.setViewport({
    width: 1024 + Math.floor(Math.random() * 100),
    height: 768 + Math.floor(Math.random() * 100),
});
Finally, you can use Apify's base Docker image called Node.JS 8 + Chrome + Xvfb on Debian to make Puppeteer use a normal Chrome in non-headless mode using the X virtual framebuffer (Xvfb).
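For instance, a rough sketch of an Actor built on that base image might launch Chrome like this; it assumes the headless option is passed through to Puppeteer, while the Xvfb in the image provides the virtual display Chrome needs:
const Apify = require('apify');

Apify.main(async () => {
    // Launch a real Chrome (not Chromium) in headful mode;
    // the base image's Xvfb provides the virtual display
    const browser = await Apify.launchPuppeteer({
        useChrome: true,
        headless: false,
    });
    const page = await browser.newPage();
    await page.goto('https://www.example.com');
    // Add the rest of your code here...
    await page.close();
    await browser.close();
});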
Tracking user behavior
The most advanced anti-scraping protection technique is to track the behavior of the user in order to detect anything that is not done by humans, like clicking on a link without actually moving a mouse cursor. This kind of protection is commonly implemented together with browser fingerprinting and IP rate limiting by the most advanced anti-scraping solutions.
Bypassing this cannot be easily done with a simple piece of code, but we have noticed that there are some patterns to look for, and if you find those, then bypassing it is possible. Here's what you need to do:
1) Check the website to see if it's saving data about your browser
You can do that by opening Chrome DevTools in your browser and going to the Network tab. Then switch to either the XHR or Img filter, as websites sometimes disguise tracking requests as image loads. Check whether any POST requests are made when you open the page or carry out some action on it. If you find a request with strangely encoded data, you've hit the jackpot.
You can check the request's payload on a site like base64decode.org, and if it contains data about your browser, you've found the tracking request.
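If you prefer to decode such a payload locally instead of using an online tool, a few lines of Node.js are enough; the payload string below is just a made-up placeholder standing in for the value you would copy from DevTools:
// Decode a base64-encoded request payload to see what it contains
const payload = 'eyJicm93c2VyIjoiQ2hyb21lIn0='; // placeholder copied from DevTools
const decoded = Buffer.from(payload, 'base64').toString('utf8');
console.log(decoded); // prints '{"browser":"Chrome"}' for this placeholder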
2) Block the tracking requests
The next step is to disable the tracking. To do that, go back to the full list of requests and check the Initiator column for the tracking request; it usually points to the JavaScript file that initiated the call.
You will need to disable this file in order to block the anti-scraping protection. Here's an example of how to do that in Puppeteer:
// Tell Puppeteer that you want to be able to block
// requests on this page
await page.setRequestInterception(true);

page.on('request', (request) => {
    const url = request.url();
    // Check whether the request is for the file that we want to block.
    // If it is, abort it; otherwise let it continue.
    if (url.endsWith('main.min.js')) request.abort();
    else request.continue();
});
Now try to run your Apify Actor. If everything works, you've successfully bypassed the anti-scraping protection. If the page stops working properly, it means that the file contained other functions bundled with the protection. In that case, you can use the same approach but block only the requests that carry your browser data instead of the file that creates them, as sketched below.
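Here is a rough sketch of that variant; the 'fingerprint' substring is just a hypothetical marker that you would replace with a distinctive piece of the real payload you found in DevTools:
await page.setRequestInterception(true);

page.on('request', (request) => {
    const postData = request.postData() || '';
    // Abort only POST requests whose payload contains the tracking data
    // and let everything else through so the page keeps working
    if (request.method() === 'POST' && postData.includes('fingerprint')) {
        request.abort();
    } else {
        request.continue();
    }
});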
And that's all. If you find a website that still does not work even if you follow all these steps, let us know at support@apify.com - we love new challenges :)
Happy crawling!