In the latest version of Puppeteer, request interception disables the native cache and significantly slows down the actor. Therefore, it's not recommended to follow the examples shown in this article. Puppeteer now uses a native cache that should work well enough for most use cases.

When running crawlers that go through a single website, each open page has to load all resources again (sadly, headless browsers don't use a cache). The problem is that each resource needs to be downloaded over the network, which can be slow and unstable (especially when proxies are used).

For this reason, in this article, we will take a look at how to cache responses in Puppeteer in memory (only those that contain the "cache-control" header with a "max-age" value above 0).

In this example, we will use this actor, which goes through the top stories on the CNN website and takes a screenshot of each opened page (the actor is very slow because it waits until all network requests finish and because the posts contain videos).

If the actor runs with caching disabled, the following statistics show up at the end of the run:

Screenshot of statistics that will show at the end of the run.

As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page. We also stored all the screenshots, which you can find here.

From the screenshot above, it's clear that most of the traffic comes from script files (124MB) and documents (22.8MB). In this kind of situation, it's always good to check whether the content of the page is cacheable. You can do that using Chrome's Developer Tools.

If we go to the CNN website, open Developer Tools, and go to the "Network" tab, we will find an option to disable caching.

Screenshot of the developer tools bar.

Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.

Screenshot of the bottom of the developer tools, showing how much data is transferred when we open the page.

If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.

Screenshot of the bottom of the developer tools, showing how much data is transferred when we open the page without caching.

Comparing the two numbers, caching reduces the data transfer by roughly 88%.
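
If you prefer to check this programmatically instead of in Developer Tools, the short sketch below logs the "cache-control" header of every response during a single page load (the URL and the waitUntil option are only example values):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Print the cache-control header of every response the page receives.
    page.on('response', (response) => {
        const cacheControl = response.headers()['cache-control'];
        if (cacheControl) console.log(response.url(), '->', cacheControl);
    });

    await page.goto('https://edition.cnn.com', { waitUntil: 'networkidle2' });
    await browser.close();
})();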

We can now emulate this and cache responses in Puppeteer. All we have to do is check, when a response is received, whether it contains the "cache-control" header and whether its max-age is higher than 0. If so, we save the response's status, headers, and body to memory, and on the next request we check whether the requested URL is already stored in the cache.

The code will look like this:

// On top of your code
const cache = {};

// The code below should go between the newPage() call and the goto() call.

await page.setRequestInterception(true);

// Serve previously cached responses instead of downloading them again.
page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

// Store cacheable responses (cache-control with max-age above 0) in memory.
page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;
    if (maxAge) {
        // Skip responses that are already cached and have not expired yet.
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // Some responses (e.g. redirects) do not contain a buffer and do not need to be cached.
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
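
For context, here is a minimal sketch of how the snippet fits into a standalone Puppeteer script (the URL, the waitUntil option, and the screenshot path are just example values):

const puppeteer = require('puppeteer');

// On top of your code
const cache = {};

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // ... the setRequestInterception() call and the request/response handlers from above go here ...

    await page.goto('https://edition.cnn.com', { waitUntil: 'networkidle2' });
    await page.screenshot({ path: 'screenshot.png' });
    await browser.close();
})();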

After implementing this code, we can run the actor again.

Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, a 92% reduction in data transfer. The related screenshots can be found here.

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
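
If the screenshots don't need to capture fully loaded pages, you can also shorten the wait by telling page.goto() to resolve once the DOM is parsed instead of waiting for the network to go idle (a sketch; url stands for whatever address you are opening):

// Resolves once the HTML has been parsed, without waiting for tracking
// and analytics requests to finish.
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });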

Bonus: Since most of our users use our SDK, here is a small example of how you can use this functionality with our Apify.PuppeteerCrawler:

const Apify = require('apify');
const cache = {};

Apify.main(async () => {

    const requestList = await Apify.openRequestList('request-list', [
        'https://apify.com/store',
        'https://apify.com',
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        gotoFunction: async ({ page, request }) => {
            await page.setRequestInterception(true);

            page.on('request', async (request) => {
                const url = request.url();
                if (cache[url] && cache[url].expires > Date.now()) {
                    await request.respond(cache[url]);
                    return;
                }
                request.continue();
            });

            page.on('response', async (response) => {
                const url = response.url();
                const headers = response.headers();
                const cacheControl = headers['cache-control'] || '';
                const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
                const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

                if (maxAge) {
                    // Skip responses that are already cached and have not expired yet.
                    if (cache[url] && cache[url].expires > Date.now()) return;

                    let buffer;
                    try {
                        buffer = await response.buffer();
                    } catch (error) {
                        // Some responses (e.g. redirects) do not contain a buffer and do not need to be cached.
                        return;
                    }

                    cache[url] = {
                        status: response.status(),
                        headers: response.headers(),
                        body: buffer,
                        expires: Date.now() + (maxAge * 1000),
                    };
                }
            });

            return page.goto(request.url, { waitUntil: 'domcontentloaded', timeout: 60000 });
        },

        handlePageFunction: async ({ page, request }) => {
            await Apify.pushData({
                title: await page.title(),
                url: request.url,
                succeeded: true,
            });
        },

        handleFailedRequestFunction: async ({ request }) => {
            await Apify.pushData({
                url: request.url,
                succeeded: false,
                errors: request.errorMessages,
            });
        },
    });

    await crawler.run();
});

Hopefully, this short tutorial helps you with your solutions!