Debugging is an essential programming skill. But even if you would not call yourself a programmer, basic debugging skills will make building and maintaining scrapers and integration actors on Apify much easier. You don't have to hire an expensive developer if you can understand and fix the problem yourself, with the help of simple tools, in a few minutes.

In this article, we will go through the absolute basics. We will discuss the most common problems that can occur and the simplest tools for analyzing them. In the next articles, we will look more closely at actual solutions.

Possible problems

Beginners usually don't understand the full scope of what can go wrong. They assume that once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of the web (and web scraping). Websites change and introduce new anti-scraping technologies, programming tools change, and we as people make mistakes. Let's look at the most common causes that can break your working solution.

  • The website changed its layout or data feed.
  • The website changes its layout depending on location or uses A/B testing.
  • The website started to block you (it recognizes you as a bot).
  • The website loads its data dynamically after the initial page load, so the code works only sometimes, when you are slow enough (lucky).
  • You made a mistake when updating your code.
  • Your code worked locally but not on the Apify platform.
  • You have lost access to Apify Proxy (your proxy trial is over).
  • You have upgraded your dependencies (other software that you rely upon) and the new versions no longer work (this is harder to debug).

As you can see, there is a wide range of possible problems, and this list is by no means complete. The good news is that if you use the right tools and keep the most common causes in mind, you can discover the actual problem very quickly. The solution can then be quite trivial.

Analysis

You came to collect your data like every other day, but for some reason the dataset is empty or corrupted. You take a quick look at the log and see a bunch of errors that mean nothing to you. You would like to try fixing the problem but you have no idea where to start. You contact Apify support and they give you advice, but you are not sure how to apply it, so you have to ask a developer to fix it for you. This is a typical situation for a non-developer.

Even programmers struggle a lot. Web scraping and automation is a very specific type of programming. Programmers are used to the same code always producing the same results. If it doesn't, something is wrong and they run a specialized debugger to find the bug. This approach has its place in web scraping too, but it is very limited because it is impossible to mock the web. A lot of problems in web scraping are edge cases that happen on just 1 out of 1000 pages. Others are time-dependent: the website starts blocking you after a certain number of requests. You cannot rely on simple determinism.

1. The power of logging

Logging is the most essential tool for any programmer. That is even more true for a web scraper that runs in the cloud at an arbitrary time. Logs will help you capture a surprising amount of information if you use them correctly. Don't forget that logs on Apify are not infinite. They have a cap on lines per second and on overall size. The limits are generous, but if you see a log message saying that some lines were skipped, you should tone down your logging. General rules for logging are:

  • Generally, a lot of logs is better than no logs.
  • Try to put more information into one line, rather than spawning multiple short lines. It helps to reduce the overall size of the log.
  • Focus on numbers. Log how many items you extracted from the page, etc.
  • Structure your logs. Use the same structure for all your logs.
  • Append the page URL to each log. That will give you a chance to immediately open that page and review it.

Example of a structured log:

[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store

We start with the type of the page. Usually, we use labels like CATEGORY and DETAIL. Then we log important numbers and other information. Finally, we add a URL of the page so we can check if the log is correct.
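
To keep every log line in the same shape, you can build it with a small helper. The following is only a minimal sketch: formatLog is a hypothetical function name (it is not part of the Apify SDK), and the label, statistics and URL are simply passed in by your own code.

```javascript
// Hypothetical helper (not part of the Apify SDK):
// formats one structured log line as "[LABEL]: key: value, ... --- url"
function formatLog(label, stats, url) {
    const details = Object.entries(stats)
        .map(([name, value]) => `${name}: ${value}`)
        .join(', ');
    return `[${label}]: ${details} --- ${url}`;
}

// Reproduces the example line shown above
const line = formatLog(
    'CATEGORY',
    { Products: 20, 'Unique products': 4, 'Next page': true },
    'https://apify.com/store',
);
console.log(line);
```

Because the whole message is assembled into one line, it also follows the rule about preferring one longer line over several short ones.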

Errors
Errors require a slightly different approach. If your code crashes, your normal logs won't be called and the code runs into exception handlers. These will print your error, but that is usually an ugly message with a stack trace that only an Apify expert will understand. You can overcome this by putting a few try/catch blocks into your code. In the catch block, you explain what happened and re-throw the error (so the request is automatically retried).

try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed so you can explain here
    console.error('Request failed during login with an error:');
    throw error;
}

You can read more information about logging and error handling in our public wiki about developer best practices.
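
If you end up writing the same try/catch in several places, one option is to wrap it in a reusable function. This is just a sketch under assumptions: withErrorContext is a hypothetical helper name (not an Apify function), and the failing step in the usage example is made up for illustration.

```javascript
// Hypothetical helper (not part of the Apify SDK):
// runs an async step, explains where it failed, and re-throws the original error
async function withErrorContext(description, step) {
    try {
        return await step();
    } catch (error) {
        console.error(`Request failed during ${description} with an error: ${error.message}`);
        throw error;
    }
}

// Example usage with a made-up failing step
withErrorContext('login', async () => {
    throw new Error('Login button not found');
}).catch(() => {
    // Swallowed here only to keep the demo quiet. In a real scraper,
    // let the error propagate so the request is retried.
});
```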

2. Saving snapshots

Error handling with try/catch brings us to our second essential tool: snapshotting. By snapshots we mean screenshots (if you use a browser with Puppeteer) and HTML saved into the key-value store, which you can easily display in your browser.

Snapshots are useful throughout your code but are especially important in error handling. Keep in mind the point made earlier: an error can happen on only a few pages and look completely random. There is not much you can do other than save and analyze a snapshot. Snapshots can tell you that:

  • Website changed its layout. This can also mean A/B testing or different content for different locations.
  • You have been blocked. The snapshot shows a captcha or an Access Denied page.
  • Data loads later dynamically. The page is empty.
  • Page was redirected. The content is different.

How to save a snapshot
If you use Apify Scrapers (Web Scraper, Cheerio Scraper or Puppeteer Scraper), you can use the built-in context.saveSnapshot() function. Once you call it, it will save a screenshot and HTML into the key-value store of the run. You can easily open them with a single click.

If you build your own actors with Puppeteer, or the actor exposes the Apify SDK package, you can use the more powerful utils.puppeteer.saveSnapshot() helper function. It allows you to choose a name for the screenshot so you can identify it.

For Cheerio-based actors, we don't have a helper function because you can do it with a one-liner. You already have the HTML, so you just need to save it with the correct content type:

await Apify.setValue('SNAPSHOT', html, { contentType: 'text/html' });

Now that you know how to save a snapshot, the next question is where to use it. The most common approach is to save on errors. We can simply enhance our previous try/catch block:

// storeId is ID of current key value store where we save snapshots
const storeId = Apify.getEnv().defaultKeyValueStoreId;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await Apify.utils.puppeteer.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // You know where the code crashed so you can explain here
    console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
    throw error;
}

I try to make the error snapshot name descriptive, so I call it ERROR-LOGIN. Then I add a random number. If I didn't, all ERROR-LOGIN snapshots would overwrite each other, but I want to see all of them. If you can use an ID of some sort, that is even better.

Be careful with these 2 things:

  • The name (key) of the snapshot can only contain letters, numbers, dots and dashes. Otherwise, saving it will fail with an error. That is why a random number is a safe pick.
  • Don't go crazy with snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots costs some resources.
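
If you want to build the key from an ID instead of a random number, the ID may contain characters that are not allowed. Below is a minimal sketch of one way to handle that, based only on the rule stated above (letters, numbers, dots and dashes). makeSnapshotKey is a hypothetical helper, not an Apify function, and the sample ID is made up.

```javascript
// Hypothetical helper (not part of the Apify SDK):
// builds a snapshot key that contains only letters, numbers, dots and dashes
function makeSnapshotKey(prefix, id) {
    const raw = `${prefix}-${id}`;
    // Replace every character outside the allowed set with a dash
    return raw.replace(/[^a-zA-Z0-9.-]/g, '-');
}

// Made-up ID for illustration; the underscore and @ are replaced with dashes
console.log(makeSnapshotKey('ERROR-LOGIN', 'user_42@example.com'));
```

Keep in mind that two different IDs can end up as the same key after the replacement, so this only helps when your IDs are already mostly made of safe characters.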

3. Error reporting

Logging and snapshotting are great tools, but once your runs reach a certain size, it may be hard to read through all the logs and snapshots. For a large project, it is handy to create a more sophisticated reporting system. We will cover it in more detail in the next articles. For now, let's just look at simple dataset reporting.

We will extend our previous snapshot solution by creating a named dataset (named datasets have infinite retention) where we will accumulate error reports. Those reports will explain what happened and link to a saved snapshot so we can quickly check it visually.

// Let's create a reporting dataset. If you already have one, new reports will be added to it.
const reportingDataset = await Apify.openDataset('REPORTING');

// storeId is ID of current key value store where we save snapshots
const storeId = Apify.getEnv().defaultKeyValueStoreId;

// We can also capture IDs of the actor and run to have easy access in the reporting dataset
const { actorId, actorRunId } = Apify.getEnv();
const linkToRun = `https://my.apify.com/actors/${actorId}#/runs/${actorRunId}`;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await Apify.utils.puppeteer.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),
        // You will have to adjust the keys if you save the snapshots in a non-standard way
        htmlSnapshot: `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.html?disableRedirect=true`,
        screenshot: screenshotLink,
        run: linkToRun,
    };

    // And we push the report
    await reportingDataset.pushData(report);

    // You know where the code crashed so you can explain here
    console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
    throw error;
}
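
Once reports accumulate, a quick count per error type can show where to look first. The sketch below assumes you have already loaded the report objects into a plain array. countByErrorType is a hypothetical helper, and the sample reports are invented for illustration only.

```javascript
// Hypothetical helper (not part of the Apify SDK):
// counts reports per errorType so the most frequent failure stands out
function countByErrorType(reports) {
    const counts = {};
    for (const { errorType } of reports) {
        counts[errorType] = (counts[errorType] || 0) + 1;
    }
    return counts;
}

// Invented sample data, shaped like the report object above
const sampleReports = [
    { errorType: 'login', errorMessage: 'Error: Timeout' },
    { errorType: 'login', errorMessage: 'Error: Selector not found' },
    { errorType: 'pagination', errorMessage: 'Error: Timeout' },
];
console.log(countByErrorType(sampleReports));
```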
