TL;DR
Use these helper functions to wait for data:

  • page.waitFor in Puppeteer (and Puppeteer Scraper)
  • context.waitFor in Web Scraper

Pass in a time in milliseconds or a selector to wait for.

Examples:

await page.waitFor(10000) - waits for 10 seconds.
await context.waitFor('my-selector') - waits for my-selector to appear on the page.

Many websites load some of their data in the background via XHR requests. Usually this is tracking data, ads, images, and other content that is not essential for the page to load or that is exchanged periodically. Sometimes, though, these requests carry the core page data that you need.

How page loading works

Before looking at code examples that solve this problem, let's review what the page loading process looks like:

  1. The HTML document is loaded (domcontentloaded event) - This document contains the HTML as it was rendered on the website's server. It also includes all the JavaScript that is executed in the next step. This HTML is what you get when you use http-request or Cheerio Scraper (CheerioCrawler class).
  2. JavaScript is executed and the page is rendered (load event) - The page is fully rendered, but may still lack dynamically loaded data.
  3. Network XHR requests are loaded and rendered (networkidle0 or networkidle2 in Puppeteer) - Some websites load essential data this way. These requests may also depend on user behavior, as with infinite scroll. This is the state in which you get the page in Web Scraper or Puppeteer Scraper (PuppeteerCrawler class). Be careful: some pages send tracking requests so often that the network never becomes idle and the load never finishes.
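
In Puppeteer, these stages correspond to the waitUntil option of page.goto. The following is a minimal sketch, not a prescribed approach: the helper name openPage, the 60-second timeout and the example URL are assumptions made for illustration.

```javascript
// Hypothetical helper: opens a URL and resolves once the chosen stage is reached.
// `page` is a Puppeteer Page; `stage` is one of Puppeteer's waitUntil values:
// 'domcontentloaded', 'load', 'networkidle0' or 'networkidle2'.
const openPage = (page, url, stage = 'networkidle2') => {
    // A finite timeout matters for pages that never stop sending requests.
    return page.goto(url, { waitUntil: stage, timeout: 60000 });
};

// Usage inside an async function, with a real Puppeteer page:
// await openPage(page, 'https://example.com', 'domcontentloaded');
```

networkidle0 waits until there are no open network connections for at least 500 ms, while networkidle2 tolerates up to two, which makes it the safer choice on pages with constant background tracking.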

How to wait for dynamic content

http-request / Cheerio Scraper
Very often, all the essential data is present in the initial HTML, and scraping it without a browser (Puppeteer) is much more efficient. That is why we created the Cheerio Scraper. But even if the data is rendered via JavaScript or loaded dynamically, there are advanced techniques that allow you to reverse engineer this data and still retain Cheerio's efficiency.

Web Scraper / Puppeteer Scraper / Puppeteer
In 95% of cases, the JavaScript-rendered page that you get with Puppeteer is enough. But if you actually need to wait for dynamic content, Puppeteer has a plethora of helper functions, the most important being page.waitForSelector, page.waitForResponse, page.waitForNavigation, page.waitForFunction and the generic page.waitFor. The waitFor helper method is also available in Web Scraper as context.waitFor, and we will explore it next.

waitFor function

Let's take a closer look at the waitFor function, which is available as page.waitFor in Puppeteer and context.waitFor in Web Scraper. It is a generic function that accepts one of three kinds of argument:

  • Number in milliseconds - await page.waitFor(10000) (waits for 10 seconds)
  • Selector string - await page.waitFor('my-selector') - The same as page.waitForSelector (waits until that selector appears on the page, but times out with an error after 30 seconds)
  • Predicate function - await page.waitFor(functionThatReturnsTrueOrFalse) - The same as page.waitForFunction (you pass an arbitrary function that is executed periodically, and the code waits until it returns true)
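
To make the three cases concrete, here is a rough sketch of how a generic wait can dispatch on the type of its argument. It is an illustration only, not Puppeteer's actual implementation, and waitForAnything is a hypothetical name. The methods it calls, page.waitForSelector and page.waitForFunction, are real Puppeteer methods; later Puppeteer releases deprecated the generic page.waitFor in favor of these dedicated ones.

```javascript
// Hypothetical helper, for illustration only.
const waitForAnything = (page, target, options = {}) => {
    if (typeof target === 'number') {
        // Number: plain sleep for the given number of milliseconds.
        return new Promise((resolve) => setTimeout(resolve, target));
    }
    if (typeof target === 'string') {
        // String: treated as a CSS selector.
        return page.waitForSelector(target, options);
    }
    if (typeof target === 'function') {
        // Function: evaluated repeatedly in the browser until it returns a truthy value.
        return page.waitForFunction(target, options);
    }
    throw new Error(`Unsupported wait target: ${typeof target}`);
};

// Usage inside an async function:
// await waitForAnything(page, 'my-selector', { timeout: 10000 });
```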

Testing it

If you need to add waiting logic to your code, start by simply waiting 10 seconds. If that doesn't help, try 30 seconds. If it still doesn't work, the problem is elsewhere, and you should debug it using logs and screenshots. If your code does work, you know that dynamically loaded data was indeed causing your problem. You can then replace the fixed 10-second wait with a wait for a selector, which is more efficient.
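
For the debugging step, a small helper that logs the current URL and saves a screenshot is often enough to spot a block page or a redirect. This is only a sketch: saveDebugInfo and the file naming are assumptions for this example, and it writes to the local disk, which may not suit every environment.

```javascript
// Hypothetical helper: records where the browser ended up and what it rendered.
const saveDebugInfo = async (page, label) => {
    const file = `${label}.png`;
    // page.url() shows whether the browser was redirected somewhere unexpected.
    console.log(`[${label}] current URL: ${page.url()}`);
    // fullPage captures the whole scrollable page, not just the viewport.
    await page.screenshot({ path: file, fullPage: true });
    return file;
};

// Usage inside an async function:
// await page.waitFor(10000);
// await saveDebugInfo(page, 'after-wait');
```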

Timeout and errors

By default, waitFor times out after 30 seconds with an error. Usually this means that something else is preventing the selector from appearing: the selector itself may be wrong, or your browser may have been blocked or redirected to a different version of the website. Most of the time, if the selector doesn't appear in the first 5 seconds, it won't appear at all. You can avoid wasteful waiting by lowering the timeout: await page.waitFor('my-selector', { timeout: 10000 })

The selector version of waitFor throws an error once it reaches the timeout. That is usually a good thing, because you don't want a missing element to go unnoticed. But if the data is not that important, or you want to fall back to some other solution, you can easily catch the error:

await page.waitFor('my-selector', { timeout: 10000 })
    .catch(() => console.log('Waiting for my-selector timed out'));

The code will then continue.
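
If you need this pattern in several places, it can be wrapped in a helper that returns null instead of throwing, so the caller can branch on the result. The sketch below is one possible shape; waitForOptional and the .promo-banner selector are hypothetical names.

```javascript
// Hypothetical helper: resolves to the element handle, or to null if waiting fails.
// Note that it swallows every error from waitForSelector, not only timeouts.
const waitForOptional = async (page, selector, timeout = 10000) => {
    try {
        return await page.waitForSelector(selector, { timeout });
    } catch (err) {
        console.log(`Waiting for ${selector} failed: ${err.message}`);
        return null;
    }
};

// Usage inside an async function:
// const banner = await waitForOptional(page, '.promo-banner', 5000);
// if (banner) { /* scrape it */ } else { /* fall back to another source */ }
```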

Advanced use-cases

So far we have only scratched the surface of this topic. Let's have a quick look at some more advanced cases. We have not yet covered the third usage of waitFor, which is equivalent to waitForFunction.

waitForFunction

If a simple selector is not enough, we can write an arbitrary function that is evaluated in the browser context and tells us whether the page is ready. Let's imagine that we know the page needs to load 24 products, but for some reason they load gradually. We can define a simple function to check for that:

// Let's assume jQuery is injected
const has24Products = () => {
    const numberOfProducts = $('.my-products').length;
    return numberOfProducts === 24;
};

Now we simply pass it to waitFor or waitForFunction:

// In Puppeteer you need to inject jQuery first with
// await Apify.utils.puppeteer.injectJQuery(page);
await page.waitFor(has24Products);
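
The same check can be written without jQuery by using the browser's own document.querySelectorAll and passing the selector and count as arguments. This is a sketch under assumptions: hasAtLeast and waitForCount are hypothetical names, and it uses >= rather than === so that it still resolves if the page ends up with more items than expected.

```javascript
// Runs inside the browser, so it may only use browser globals and its own arguments.
const hasAtLeast = (selector, count) => document.querySelectorAll(selector).length >= count;

// Hypothetical helper: waits until at least `count` elements match `selector`.
const waitForCount = (page, selector, count, timeout = 30000) => {
    // Arguments after the options object are passed on to the predicate.
    return page.waitForFunction(hasAtLeast, { timeout }, selector, count);
};

// Usage inside an async function:
// await waitForCount(page, '.my-products', 24);
```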

waitForResponse

Sometimes it may be handy to work directly with the response of the XHR request. There are a few reasons for that:

  • It is faster - You don't need to wait for the rendering of the element
  • It may contain nicely structured JSON data

Keep in mind that waitForResponse is not one of the cases covered by waitFor, so it doesn't work in Web Scraper. If you want to explore the responses, you can look through them in your browser's developer tools. In Chrome, open the Network tab and select the XHR filter.

We can catch this response by checking its URL and method (we have to check the method because the same URL is also requested with the OPTIONS method). The function returns true or false depending on whether it is the response we want. Conveniently, waitForResponse then gives us the matching response back.

const correctResponse = await page.waitForResponse((response) => {
    const url = response.url();
    const method = response.request().method();
    // The same URL is also requested with OPTIONS, so check the method too
    return url.includes('/visit-data') && method === 'POST';
});

Now we simply extract the JSON from it.

const data = await correctResponse.json();
const userAgent = data.user_agent;
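
One detail worth keeping in mind is that waitForResponse only sees responses that arrive after it starts listening, so it should be set up before the action that triggers the request. The sketch below pulls the predicate into a named function and adds a timeout; isVisitDataPost, getVisitData, the 15-second timeout and the #load-more selector are assumptions for illustration.

```javascript
// Hypothetical predicate: true only for the POST request to the wanted endpoint.
const isVisitDataPost = (response) => {
    const isWantedUrl = response.url().includes('/visit-data');
    const isPost = response.request().method() === 'POST';
    return isWantedUrl && isPost;
};

// Hypothetical helper: waits for the matching response and parses its JSON body.
const getVisitData = async (page, timeout = 15000) => {
    const response = await page.waitForResponse(isVisitDataPost, { timeout });
    return response.json();
};

// Usage inside an async function: start waiting before triggering the request.
// const [data] = await Promise.all([
//     getVisitData(page),
//     page.click('#load-more'),
// ]);
```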

Custom waiting functions

You don't have to rely on Puppeteer's built-in functions for everything. You can implement your own "waiters" with a simple loop and add whatever functionality you need. For example, here is a version of waitForSelector that logs how long it has been waiting:

const waitAndLog = async (page, selector, timeout = 30000) => {
    const start = Date.now();
    let myElement = await page.$(selector);
    while (!myElement) {
        await page.waitFor(500); // wait 0.5 s between checks
        const alreadyWaitingFor = Date.now() - start;
        if (alreadyWaitingFor > timeout) {
            throw new Error(`Waiting for ${selector} timed out after ${timeout} ms`);
        }
        console.log(`Waiting for ${selector} for ${alreadyWaitingFor} ms`);
        myElement = await page.$(selector);
    }
    console.log(`Selector ${selector} appeared on the page!`);
    return myElement;
};

await waitAndLog(page, 'my-selector'); // You can use the element handle it returns
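
The same loop can be generalized so that it waits for any condition, not just a selector. The following is a minimal sketch; pollUntil and sleep are hypothetical helpers, and a plain setTimeout-based sleep is used so that the helper does not depend on any particular Puppeteer version.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `check` until it returns a truthy value or time runs out.
const pollUntil = async (check, { interval = 500, timeout = 30000 } = {}) => {
    const start = Date.now();
    for (;;) {
        const result = await check();
        if (result) return result;
        if (Date.now() - start >= timeout) {
            throw new Error(`Condition not met after ${timeout} ms`);
        }
        await sleep(interval);
    }
};

// Usage inside an async function:
// const handle = await pollUntil(() => page.$('my-selector'), { timeout: 10000 });
```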
