An actor is a microservice (cloud app/function) that can perform a web automation or data extraction job. It takes an input configuration, runs the job and saves results. An actor can perform anything from a simple action, such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. An actor can be used directly (from UI, API or Scheduler) or via tasks. You can find ready-made actors in Apify Store, write your own actor, or request one from Apify Marketplace.
Example: You can use the Content checker actor to monitor a web page for content changes and get a notification when something changes. All you need to do is define an input (URL to monitor, content selector, and notification email address) and schedule it to run periodically (e.g. every hour). You don't need to download anything - it all runs on the Apify cloud platform.
Sometimes you want to use an actor with more than one input configuration. That's when tasks come in handy. Tasks are saved input configurations for actors. Tasks can also be set up for actors made by someone else. Just search Apify Store to find an actor you want to use and create a task for it with a single click of the "Try actor" button. As with actors, you can run tasks from the UI, via the API or with the Scheduler.
Example: If you want to monitor two websites using the Content checker actor, just create two tasks for the actor. Each will have its own input (URL to monitor, content selector, and notification email address).
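One way to picture this: a task is a reference to an actor plus a saved input. The sketch below models the two tasks as plain objects. The actor identifier, the input field names (`url`, `contentSelector`, `notificationEmail`) and the example values are illustrative assumptions, not the Content checker actor's real input schema.

```javascript
// Hypothetical model of two tasks that share one actor but carry different inputs.
// The actor name and the input field names are assumptions made for illustration.
const ACTOR = 'content-checker';

function createTask(name, input) {
    // A task stores the actor it belongs to plus a saved input configuration.
    return { name, actor: ACTOR, input: { ...input } };
}

const tasks = [
    createTask('watch-first-site', {
        url: 'https://example.com/pricing',
        contentSelector: '#price',
        notificationEmail: 'me@example.com',
    }),
    createTask('watch-second-site', {
        url: 'https://example.org/news',
        contentSelector: '.headline',
        notificationEmail: 'me@example.com',
    }),
];

// Both tasks point at the same actor, yet each keeps its own input.
console.log(tasks.map((t) => `${t.name} -> ${t.actor} (${t.input.url})`).join('\n'));
```

Keeping the input inside the task is what lets the same actor be reused without retyping its configuration for every run.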
Use schedules to automatically run your actors or tasks at specific times. Each schedule can be associated with a number of actors and tasks, and you can override the settings (input configuration and run options) of each actor or task. Apart from the associated actors and tasks, all you need to set up a schedule is a Cron expression. You can use shortcuts like @daily to trigger the schedule every day at the same time, or a full Cron expression like */3 * * * * to run it every third minute.
Example: If you want to monitor the first website using the Content checker actor on a daily basis, create a schedule for your task with a @daily Cron expression. If the second website should be monitored on a weekly basis, create another schedule for the second task with a @weekly Cron expression.
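To make the Cron notation concrete, here is a minimal sketch in plain JavaScript. It expands the two shortcuts used above into the five-field form they conventionally stand for in standard cron implementations, and checks minute values against a minute field. The function names are made up for this example, the matcher only handles `*`, `*/n` and a single number, and none of this is meant to describe how the Apify scheduler is implemented.

```javascript
// Conventional expansions of the shortcuts (as in standard cron implementations).
const SHORTCUTS = {
    '@daily': '0 0 * * *',  // every day at midnight
    '@weekly': '0 0 * * 0', // every Sunday at midnight
};

// Hypothetical helper: turn a shortcut into its five-field form, pass others through.
function expandCron(expression) {
    return SHORTCUTS[expression] || expression;
}

// Simplified matcher for the minute field: supports "*", "*/n" and a single number.
function minuteMatches(field, minute) {
    if (field === '*') return true;
    if (field.startsWith('*/')) return minute % Number(field.slice(2)) === 0;
    return Number(field) === minute;
}

// Which minutes between 0 and 9 would "*/3 * * * *" fire on?
const minuteField = expandCron('*/3 * * * *').split(' ')[0];
const firing = [];
for (let m = 0; m < 10; m++) {
    if (minuteMatches(minuteField, m)) firing.push(m);
}
console.log(firing); // [ 0, 3, 6, 9 ]
```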
Every time you start an actor or task, a run object is created. A run is associated with an actor (directly or via a task) and has its own parameters (input, output, allocated memory, timeout, etc.). Once it has started, you can watch its log file and, typically, see data being stored in a Key-value store or a Dataset.
Example: Here's an example of a Content checker run. You can see a log file there with links to the screenshots and text content. You can also check other basic data like input, timestamps, memory and CPU usage, etc.
Platform usage at Apify is charged mainly on the basis of compute unit consumption. An actor running with 1 GB of allocated memory for 1 hour consumes 1 compute unit (CU). You can see the exact CU usage for each run under the "Run details" tab. Overall CU consumption in the current billing period can be seen on the Dashboard.
Example: The example run above for the Content checker actor ran for 22 seconds with 512 MB of memory and thus consumed about 0.003 CUs. With a daily schedule, it would consume approximately 0.09 CUs per month (our free plan includes 10 CUs per month, which means you can monitor about 100 pages daily at Apify for free).
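The arithmetic behind these numbers can be sketched as follows. The function name is made up for this example, and the sketch assumes 1 GB = 1024 MB and a 30-day month; the 10 CU free allowance is the figure quoted above.

```javascript
// CU = allocated memory in GB multiplied by run duration in hours (definition from the text).
// Assumes 1 GB = 1024 MB, so 512 MB counts as 0.5 GB.
function computeUnits(memoryMb, durationSeconds) {
    return (memoryMb / 1024) * (durationSeconds / 3600);
}

const perRun = computeUnits(512, 22);            // about 0.003 CU
const perMonth = perRun * 30;                    // one run a day, assuming a 30-day month
const FREE_PLAN_CUS = 10;                        // free plan allowance quoted in the text
const freePages = Math.floor(FREE_PLAN_CUS / perMonth);

console.log(perRun.toFixed(4), perMonth.toFixed(3), freePages); // 0.0031 0.092 109
```

Without the intermediate rounding the result is 109 pages, which matches the "about 100 pages" estimate.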
Apify provides various data stores designed for web scraping and automation use cases. Every run has its own Key-value store, which holds its input. Actors that work with data (typically scrapers) use Datasets to store structured data, which can be easily downloaded in various formats like CSV, JSON, XLSX, and more. They also use a request queue to manage URLs during the crawl. You can also use "named" stores, which can be shared across runs.
Example: The Content checker actor stores the screenshot and content in a "named" Key-value store (to check them against the previous run for changes). You can find links to the previous and current screenshot and content in a run's log file.
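The idea of comparing against the previous run can be sketched in a few lines. Here a plain `Map` stands in for a "named" Key-value store that survives between runs; it is an illustration only, not the Apify storage API, and the function name, key and sample contents are hypothetical.

```javascript
// A Map stands in for a "named" Key-value store that persists between runs.
// Illustration only: this is not the Apify storage API.
const namedStore = new Map();

function checkForChange(store, key, currentContent) {
    const previousContent = store.get(key); // undefined on the very first run
    store.set(key, currentContent);         // remember the content for the next run
    // A change is reported only when there is a previous value and it differs.
    return previousContent !== undefined && previousContent !== currentContent;
}

const run1 = checkForChange(namedStore, 'content', 'Price: $10'); // first run, nothing to compare
const run2 = checkForChange(namedStore, 'content', 'Price: $10'); // same content as before
const run3 = checkForChange(namedStore, 'content', 'Price: $12'); // content differs
console.log(run1, run2, run3); // false false true
```

A store tied to a single run could not support this check, because the previous content would not be available to the next run.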
Example: If you want to scrape the top link from Hacker News, you can create a task for Cheerio Scraper, set StartURL to https://news.ycombinator.com/ and add this line of code to the Page function (in the return statement):
topLink: $('.title a').eq(0).attr('href')
Then you can hit the run button, wait for a couple of seconds and get this result.
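For context, here is a sketch of how that line might sit inside a complete Page function. It assumes the scraper's convention of passing a context object with `$` (the Cheerio handle for the loaded page) and `request`. The `fakeCheerio` helper and the example link are stand-ins invented so the sketch runs without any dependencies; on the platform the scraper supplies the real context.

```javascript
// Sketch of a complete Page function around the line shown above.
// Assumed convention: the scraper passes a context object with `$` and `request`.
async function pageFunction(context) {
    const { $, request } = context;
    return {
        url: request.url,
        topLink: $('.title a').eq(0).attr('href'),
    };
}

// Stand-in for Cheerio so the sketch runs locally (not the real library).
// It ignores the selector and serves hrefs from a fixed list.
function fakeCheerio(hrefs) {
    return () => ({
        eq: (index) => ({ attr: () => hrefs[index] }),
    });
}

pageFunction({
    $: fakeCheerio(['https://example.com/top-story']),
    request: { url: 'https://news.ycombinator.com/' },
}).then((result) => console.log(result));
```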
Back in 2015, Apify launched with a single product - the Apify Crawler. It was easy to use and had features similar to our current scrapers, but it was hard to integrate and you could get stuck on more complex use cases. Apify is a much more versatile platform now, and our scrapers cover all of the old crawler's functionality. If you miss the old crawler, you can still use it as the Legacy PhantomJS Crawler (we've essentially turned our first product into an actor).