If you are running actors on Apify, performance directly affects your wallet (or rather your bank account). The slower and heavier your program is, the more compute units and the higher subscription plan you will need. The goal of optimization is simple: make the code run as fast as possible while using as few resources as possible. On Apify, the resources are memory and CPU (don't forget that the more memory you allocate to a run, the proportionally bigger share of CPU you get). Memory alone should never be a bottleneck, though. If it is, you have either a bug (a memory leak) or a bad program architecture (you need to split the computation into smaller parts). So in the rest of this article, we will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU.
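To make the link between speed, memory and cost concrete, here is a minimal sketch of the arithmetic. It assumes that one compute unit equals 1 GB of memory used for 1 hour and that 4096 MB of memory corresponds to one full CPU core. Both numbers are assumptions about the platform, so check them against the current Apify documentation. The function names and the sample runs are hypothetical.

```python
# Assumption: 1 compute unit = 1 GB of memory for 1 hour.
# Assumption: 4096 MB of memory gives the run one full CPU core.
MB_PER_GB = 1024
SECONDS_PER_HOUR = 3600
ASSUMED_MB_PER_CPU_CORE = 4096


def compute_units(memory_mb: float, runtime_seconds: float) -> float:
    """Compute units consumed by one run under the assumed definition."""
    return (memory_mb / MB_PER_GB) * (runtime_seconds / SECONDS_PER_HOUR)


def cpu_share(memory_mb: float) -> float:
    """Fraction of a CPU core the run gets, proportional to its memory."""
    return memory_mb / ASSUMED_MB_PER_CPU_CORE


# Hypothetical heavy run: 4 GB of memory for one hour.
heavy_run = compute_units(memory_mb=4096, runtime_seconds=3600)

# Hypothetical light run: 1 GB of memory for six minutes.
light_run = compute_units(memory_mb=1024, runtime_seconds=360)

print(f"heavy run: {heavy_run:.2f} CU with {cpu_share(4096):.2f} CPU core")
print(f"light run: {light_run:.2f} CU with {cpu_share(1024):.2f} CPU core")
```

Under these assumptions, cutting both the memory and the runtime multiplies the savings, which is why the tool you choose matters so much for the bill.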
There is one more thing. Optimization has its own cost: development time. You should always consider how much time you can afford to spend on it and whether it is worth it.
Before we get to the practical part, let's digress with an analogy that helps us think about the performance of scrapers.
Game development analogy
Modern games are extremely complicated beasts. Every frame (usually 60 times a second), the game has to calculate the physics of the world, run AI, process user input and render everything into a beautiful scene. You can imagine that running all of that every 16 ms in a complicated game is a developer's nightmare. That's why a significant portion of game development is spent on optimization. Every little bit of waste matters.
This is especially true in the programming heart of the game: the engine. The engine is responsible for the heavy lifting in performance-critical parts like physics, animation, AI and rendering. Once the engine is built, you can design the game on top of it. You can add different spells, conversation chains, items, animations etc. to make your game cool. Those extra things may not run every frame and don't need to be optimized as heavily as the engine itself.
Now, if you want to build your own game and you are not a C/C++ veteran with a team, you will likely use an existing engine (like Unreal or Unity) and focus on the design of the game environment itself. Unless you go crazy, the game will likely run just fine, since those engines have already been optimized for you. Your job is to choose an appropriate engine and use it well.
Back to scrapers
What are the engines of the scraping world? A browser, an HTTP library, an HTML parser and a JSON parser. The CPU spends more than 99% of its time in these libraries. As with game engines, you are probably not going to write them from scratch (although we did write our own HTTP library on top of the popular Got). What matters is how you use these tools. The little code you write in the pageFunction (or handlePageFunction) is insignificant compared to what runs inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. If you want to optimize your scrapers, you need to choose the most lightweight tool that gets the job done and use it as little as possible. An actor scraping only a JSON API can be as much as 50 times faster and cheaper than a browser-based solution.
Ranking of the tools from the most efficient to the least efficient:
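One way to check where the time goes in your own scraper is to time the library call and your own extraction code separately. The sketch below does that for a JSON parser using only the Python standard library. The payload is synthetic and the helper names (parse, extract, timed) are made up for illustration. Timings vary by machine, so no expected numbers are shown.

```python
import json
import time

# Synthetic API response: 2,000 items with 20 fields each (made-up data).
payload = json.dumps({
    "items": [
        {"id": i, "title": f"Item {i}", **{f"field_{k}": "x" * 20 for k in range(18)}}
        for i in range(2000)
    ]
})


def parse(raw):
    """The 'engine' work: turn the raw response text into objects."""
    return json.loads(raw)


def extract(data):
    """The scraper's own code: pick the two fields we care about."""
    return [{"id": item["id"], "title": item["title"]} for item in data["items"]]


def timed(fn, arg):
    """Run fn(arg) and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(arg)
    return result, time.perf_counter() - start


data, parse_seconds = timed(parse, payload)
rows, extract_seconds = timed(extract, data)

print(f"parse:   {parse_seconds * 1000:.2f} ms")
print(f"extract: {extract_seconds * 1000:.2f} ms")
```

The same pattern works for an HTTP request or an HTML parse: wrap the library call and your own code in separate timers before deciding what to optimize.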
- JSON API (HTTP call + JSON parse) - Scraping an API (public or internal) is the best option. The response is usually smaller than the HTML page, and the data is already structured and cheap to parse. Usable for about 30% of websites.
- Pure HTML (HTTP call + HTML parse) - All the data is in the single main HTML page. Often the HTML contains scripts and JSON data that are rich and nicely structured. Some pages can be quite big, and parsing HTML is slower than parsing JSON, but it is still 10-20 times faster than a browser. Usable for about 90% of websites.
- Browser (hundreds of HTTP calls, script execution, rendering) - Browsers are huge beasts. They do so much work to allow for smooth human interaction, which makes them really inefficient for scraping. Use a browser only if it helps you bypass anti-scraping protection or you need to interact with the page.
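The "Pure HTML" option often comes down to finding the JSON that the page already embeds in a script tag. Here is a minimal sketch using only the Python standard library. The sample HTML, the script type application/json and the class and function names are assumptions for illustration. Real pages differ, so inspect the page source to find where its data lives.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonExtractor(HTMLParser):
    """Collects and parses the content of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self.payloads.append(json.loads("".join(self._chunks)))
            self._in_json_script = False


# Hypothetical page: the product data is embedded as JSON in the HTML.
SAMPLE_HTML = """<html><body><h1>Blue mug</h1>
<script type="application/json" id="product-data">{"name": "Blue mug", "price": 12.5}</script>
</body></html>"""


def extract_embedded_json(html):
    """Return every JSON payload embedded in the given HTML string."""
    parser = EmbeddedJsonExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads


products = extract_embedded_json(SAMPLE_HTML)
print(products)
```

No browser is involved here: one HTTP response and one pass over the HTML are enough to get structured data.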
In the next articles, we will explore how to apply different techniques to scrape all data directly from the page HTML or JSON APIs.