Before we get to the real topic, let's do some quick time travel into our past, into the era of the Apify Crawler. This will help us understand the concept of compute units (CUs).
In the Crawler era, everything was simple. The only thing you had to care about when it came to pricing was how many pages you opened. You could usually estimate how many pages your project would need to open to scrape the data you needed. It was predictable. Users, and our sales team, loved it.
When we were designing actors, we flipped the Crawler design on its head. We started with the basics: we let you run an actor (a program) on our servers, and you pay for the size of the server allocated (memory) and the duration it is allocated. Multiply those two things and you get compute units. This way, our costs for the servers and your costs for CUs are directly proportional. There is nothing to abuse. And users who can use the resources efficiently will be much better off than with the old Crawler.
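As a rough sketch of that multiplication, assuming the convention that one compute unit corresponds to 1 GB of memory allocated for one hour (the function name is made up for illustration):

```javascript
// Compute units = memory (GB) x duration (hours).
// Assumption: 1 CU = 1 GB of memory allocated for 1 hour.
function computeUnits(memoryMbytes, durationSecs) {
    const memoryGbytes = memoryMbytes / 1024;
    const durationHours = durationSecs / 3600;
    return memoryGbytes * durationHours;
}

// A 4096 MB actor that runs for 15 minutes:
console.log(computeUnits(4096, 15 * 60)); // 1
```

Halving either the memory or the run time halves the consumption, which is why efficient actors pay less.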
CUs are clear and fair, but the simplicity of estimating crawled pages has been lost. The goal of this article is to shed light on how you can get pricing estimates similar to those you were used to with the Crawler.
Main factors determining consumption
We know some average numbers for CU usage, but first we need to look at where your project stands regarding the main factors driving usage. We'll get to real numbers at the end of this article. In order of importance, the deciding factors are:
- Browser (Puppeteer) vs Cheerio (plain HTTP) based solution - a Cheerio-based solution can be as much as 20 times faster than a browser-based one. Always go with Cheerio if possible.
- Few bigger runs vs a lot of smaller runs - Bigger runs can utilize the full scaling of resources and don't suffer the few seconds of actor start-up every time. One run with 1000 URLs can be up to 10 times more efficient than 1000 runs with one URL each. Always choose bigger batches if possible.
- Heavy pages vs lighter pages - Big pages like Amazon or Facebook will take more time to load regardless of whether you use a browser or Cheerio. You usually can't choose a different website to scrape, but it is important to realize that it has an impact. Bigger pages can take up to 3 times more resources to load and parse than average ones.
Browser vs Cheerio
Is all the data I need in the HTML?
Usually (about 90% of the time) the answer is yes. So let's celebrate? Actually, it's a slightly tricky question. You can get to the data in various ways and some of them are rather complex. The question is not only whether you can get the data without a browser, but also how difficult it is to set up. Sometimes you have to weigh the cost of developing the solution (a one-time fee) against run-time consumption (a monthly fee).
Where to find the data?
We cannot use a browser to test where the data is, because the browser carries out all the extra steps that we don't want. We can use simple tools like cURL, but I prefer Postman because it can render the HTML, so you can easily check whether the data is there.
So where exactly can the data be found?
1. Directly in the HTML - This means you will see the data rendered by Postman.
If I want to check whether Amazon delivers all the important data in the HTML, I can easily visualize it with Postman.
And the answer is yes, more than 95% of the data is present visually on the page.
2. In JSON embedded in the HTML - Some pages carry their data as a JSON object inside the HTML. Unfortunately, Amazon doesn't have such a JSON with all the data, but this approach is ideal for AliExpress.
I searched for the price of the product and found a huge JSON that has everything I need. Parsing this JSON is pretty simple: you locate the <script> tag where it is defined and then extract it with the help of a regular expression (or you can try to eval() some part of it, but then you need to be more careful that it doesn't crash on undefined variables).
3. By doing other requests - Even if the previous solutions are not enough, you can usually get the data by making additional requests using the information you have from the initial HTML. This gets us closer to what a real browser does, but usually we need just one or a few requests to get all the data, so it doesn't come with a significant performance hit. To find out which requests deliver the data you need, open your browser's dev tools, switch to the network tab, and filter for XHR requests. Then load the page and examine the responses to find your data. This is an advanced technique and we will explore it in depth in future articles very soon.
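For the case where the data sits as JSON inside a <script> tag, here is a minimal sketch of the regular-expression approach. The HTML snippet, the variable name window.productData and the function name are all made up for illustration; a real page will use its own names and structure, so the pattern needs adjusting per site.

```javascript
// Hypothetical HTML; real pages use a different variable name and structure.
const html = `
<html><body>
<script>
  window.productData = {"title":"USB cable","price":{"value":3.49,"currency":"USD"}};
</script>
</body></html>`;

function extractEmbeddedJson(pageHtml, variableName) {
    // Escape regex metacharacters in the variable name (e.g. the dot).
    const escaped = variableName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // Non-greedy match up to the first "};" after the assignment.
    // Caveat: this breaks if a string value inside the JSON contains "};".
    const pattern = new RegExp(`${escaped}\\s*=\\s*(\\{[\\s\\S]*?\\});`);
    const match = pageHtml.match(pattern);
    if (!match) return null;
    try {
        return JSON.parse(match[1]);
    } catch (err) {
        // The matched text was not valid JSON (e.g. a JS object literal).
        return null;
    }
}

const product = extractEmbeddedJson(html, 'window.productData');
console.log(product.price.value); // 3.49
```

Using JSON.parse instead of eval() means that a payload which isn't strict JSON returns null here rather than executing page code.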
Average browser consumption - 300 pages per 1 CU
Average Cheerio consumption - 3000 pages per 1 CU
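Those two averages are enough for a first back-of-the-envelope estimate. The sketch below simply divides the page count by them. The function and constant names are illustrative, and real projects can land well above or below these averages.

```javascript
// Averages quoted above; treat them as a starting point, not a guarantee.
const PAGES_PER_CU = { browser: 300, cheerio: 3000 };

function estimateComputeUnits(pageCount, crawlerType) {
    const pagesPerCu = PAGES_PER_CU[crawlerType];
    if (!pagesPerCu) throw new Error(`Unknown crawler type: ${crawlerType}`);
    return pageCount / pagesPerCu;
}

// 60,000 pages per month:
console.log(estimateComputeUnits(60000, 'browser')); // 200
console.log(estimateComputeUnits(60000, 'cheerio')); // 20
```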
Bigger runs vs smaller runs
Unless you really have to run a single URL periodically, you should always try to put at least a few hundred URLs in a single run so you can fully utilize Apify's autoscaling. Autoscaling is a system that runs within all Apify SDK crawler classes (don't confuse these with the old Apify Crawler product). Its goal is to find an optimal concurrency for your tasks that maximally utilizes the resources (mainly CPU, but also memory, API accesses, etc.). You can set up this scaling to start at a concurrency higher than 1, but it will still take at least half a minute to find an ideal concurrency.
How to measure the impact of small vs big runs?
Mainly, you have to measure your actual use case. If you have a use case where you want to scrape just a single simple page, you can write quick prototype code that will just open/download that page, set up reasonable memory (128 - 512 MB for Cheerio, 1024 MB for Puppeteer), run it, and you will see how many compute units it consumed. You can then easily calculate your monthly usage.
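The monthly calculation is a simple multiplication. A minimal sketch, where the per-run consumption is whatever you measured and the numbers in the example are hypothetical:

```javascript
// cuPerRun: compute units consumed by one measured run.
// runsPerDay: how often the run is scheduled.
function monthlyComputeUnits(cuPerRun, runsPerDay, daysPerMonth = 30) {
    return cuPerRun * runsPerDay * daysPerMonth;
}

// Hypothetical measurement: one run consumed 0.002 CU and is scheduled hourly.
console.log(monthlyComputeUnits(0.002, 24).toFixed(2)); // "1.44"
```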
For longer runs, you cannot run a single URL like before and then simply multiply the usage. That would give you too pessimistic an estimate. What you can do is create a RequestList of 1000 duplicate URLs, give each of them a random uniqueKey, and run it. For longer runs I recommend increasing the memory so you get more speed. I would test it with 1024 - 4096 MB for Cheerio and 4096 - 16384 MB for Puppeteer. You'll get a pretty good idea of what your full load will consume.
It is also important to note that what mostly matters is the page load/render and its CPU cost. Those are (and should be) your bottlenecks. Scraping code is usually tiny and doesn't cost you any performance. It only matters if you do a lot of clicking and page manipulation, but that is pretty rare.
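A minimal sketch of building such a list of duplicates in plain Node.js. The helper name is made up. The resulting array of { url, uniqueKey } objects is meant to be used as the sources of a RequestList, and the exact call for that depends on your SDK version.

```javascript
const crypto = require('crypto');

// Build `count` copies of the same URL, each with a random uniqueKey,
// so that deduplication does not collapse them into a single request.
function buildDuplicateSources(url, count) {
    return Array.from({ length: count }, () => ({
        url,
        uniqueKey: crypto.randomBytes(8).toString('hex'),
    }));
}

const sources = buildDuplicateSources('https://example.com', 1000);
console.log(sources.length); // 1000
// Assumption: `sources` is then passed as the sources of a RequestList.
```

Without the random uniqueKey, the default key is derived from the URL, so the 1000 entries would be treated as one request.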
If you are a developer, I would suggest you write some simple generic code that you can run on any website as an actor and generate estimated CU usage for 1000 pages.
Heavy vs light pages
And lastly, we have to take into account how heavy the pages are. If you carried out the exact measurements from the previous section, you are basically done. If you don't want to measure, you will have to account for variance between sites.
We will get to detailed benchmarks in the next article, but to illustrate the difference between three types of page, I have run 1000 pages with CheerioCrawler with aggressive scaling. The results are:
- example.com - simple text HTML - 34 seconds
- apify.com - average complexity website - 82 seconds
- amazon.com (product page) - very big and complex page - 258 seconds
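To put those timings side by side, the sketch below converts them into throughput and into a ratio against the average site (apify.com). It uses only the three measurements above. The memory setting of the benchmark is not stated here, so the sketch does not attempt to convert the timings into CUs.

```javascript
// Timings for 1000 pages with CheerioCrawler, taken from the list above.
const benchmarks = [
    { site: 'example.com', secs: 34 },
    { site: 'apify.com', secs: 82 },
    { site: 'amazon.com (product page)', secs: 258 },
];
const PAGES = 1000;
const baseline = benchmarks.find((b) => b.site === 'apify.com');

const summary = benchmarks.map(({ site, secs }) => ({
    site,
    pagesPerSec: PAGES / secs,
    relativeToAverage: secs / baseline.secs,
}));

for (const row of summary) {
    console.log(
        `${row.site}: ${row.pagesPerSec.toFixed(1)} pages/s, `
        + `${row.relativeToAverage.toFixed(2)}x the time of apify.com`,
    );
}
```

The Amazon product page comes out at a little over 3 times the time of the average site, which matches the "up to 3 times" rule of thumb for heavy pages.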
Different use cases
What we have discussed so far was mainly relevant for scraping websites. Apify actors can do much more, so how do we estimate those jobs? Partly by experience and partly by testing.
Here are a few other common actor usages:
- downloading images - Extremely fast, more than 10k images downloaded and uploaded per CU
- simple web workflow with login - Usually takes 15-30 seconds with 1024 MB to load a page, log in, and click a few buttons. So one run will cost about 0.01 CU.
- processing data - Computing itself is usually instant unless we're dealing with tens of millions of items. Most of the time in those cases is spent loading and saving data. If we want to process 1k items from a dataset and save it back, it will take just a few seconds on low memory. But if we want to process 1M items, loading will take up to a minute, processing will still be close to instant, but saving back to the dataset is pretty much impossible so it needs to be dealt with differently.