The most common integration of Apify with your system is usually very simple: you run an actor or task, wait for it to finish, and then collect the data. With all the features Apify provides, new users may not be sure of the standard (and easiest) way to implement this. So let's dive in and show that it is actually pretty simple.

Don't forget to check the full API documentation, which includes examples in different languages and a live API console. I also recommend testing the API with a nice desktop client like Postman.

We will go through the 3 major steps chronologically:

  • Run actor/task
  • Wait for it to finish
  • Collect the data into your system

1. Run actor/task

The API endpoints and their usage are basically the same for actors and tasks. If you are still not sure of the difference between an actor and a task, read about it in the tasks docs. In short, tasks are just pre-saved inputs for actors, nothing more.

To call (that's how we say "to run") an actor/task, you will need a few things:

  • Name or ID of the actor/task. The name is in the format username~actorName or username~taskName.
  • Your API token (make sure it doesn't leak anywhere!)
  • Possibly an input or other settings if you want to change the default values (like memory, build, etc.)

The template URL for a POST request to run the actor looks like this:
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN

For tasks, we just switch the path from acts to actor-tasks:
https://api.apify.com/v2/actor-tasks/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN

If we send a correct POST request to this endpoint, the actor/task will start just as if we had pressed the Run button in the web app.
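
For illustration, a minimal sketch of that POST request in TypeScript (Node.js 18+ with the built-in fetch) could look like this; the actor name and the APIFY_TOKEN environment variable are placeholders you need to supply:

// Minimal sketch: start an actor run through the Apify API.
// Assumes Node.js 18+ (built-in fetch); actor name and token are placeholders.
const ACTOR_NAME_OR_ID = 'username~actorName';
const TOKEN = process.env.APIFY_TOKEN; // keep the token out of your source code

async function startRun() {
  const response = await fetch(
    `https://api.apify.com/v2/acts/${ACTOR_NAME_OR_ID}/runs?token=${TOKEN}`,
    { method: 'POST' },
  );
  // The API wraps the run info JSON in a `data` property.
  const { data: run } = await response.json();
  console.log(run.id, run.status); // e.g. READY or RUNNING
}

startRun();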

Additional settings

We can also add any settings (these will override the default settings) as additional query parameters. So if you want to change how much memory to allocate and which build to run, simply add these as parameters separated by & .

https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN&memory=8192&build=beta

This works almost identically for actors and tasks. However, for tasks there is no point in providing a build, since a task is already tied to one specific actor build.

Input JSON

Most actors wouldn't be much use if you could not pass any input to change their behavior. And even though each task already has an input, it is handy to be able to overwrite it in the API call.

The input of an actor or task can be arbitrary JSON, so its structure really depends only on the specific actor. This input JSON should be sent as the body of the POST request.

If you want to run one of the major actors from Apify Store, you usually don't need to provide all possible fields in the input. Good actors have reasonable defaults for most of them.

Let's try to run the most popular actor, the generic Web Scraper.

The full input with all possible fields is pretty long and ugly so we won't show it here. As it has default values for most of its fields, we can provide just a simple JSON input.

We will send a POST request to
https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN
and add the JSON as a body.

This is how it can look in Postman.

If we press Send, it will immediately return some info about the run. The status will be either READY (which means that it is waiting to be allocated on a server) or RUNNING (99% of cases).

We will later use this run info JSON to retrieve the data. You can also get this info about the run with another call to the GET RUN endpoint.
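
If you prefer code to Postman, the same request could look roughly like this TypeScript sketch. The startUrls and pageFunction fields follow Web Scraper's input schema, but treat them as an example and check the actor's documentation for the exact fields it currently accepts:

// Sketch: run Web Scraper with a simple JSON input (Node.js 18+, built-in fetch).
// The input fields below are illustrative; consult the actor's input schema.
const TOKEN = process.env.APIFY_TOKEN;

const input = {
  startUrls: [{ url: 'https://example.com' }],
  // pageFunction runs in the browser, so `document` is available there.
  pageFunction: `async function pageFunction(context) {
    return { url: context.request.url, title: document.title };
  }`,
};

async function runWebScraper() {
  const response = await fetch(
    `https://api.apify.com/v2/acts/apify~web-scraper/runs?token=${TOKEN}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  );
  const { data: run } = await response.json();
  console.log(run.id, run.status);
}

runWebScraper();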

2. Wait for finish

There may be cases where we need to simply run the actor and go away. But in any kind of integration, we are usually interested in its output. We have three basic options for how to wait for the actor/task to finish.

  • Synchronous call
  • Webhooks
  • Polling

Synchronous call

For simple and short actor runs, the synchronous call is the easiest one to implement. You can make the POST request wait by simply adding the waitForFinish parameter, which can have a value from 0 to 300, i.e., a wait time in seconds (the maximum wait time is 5 minutes). The example URL can be extended like this:

https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN&waitForFinish=300

Again, the final response will be the run info object, but now its status should be SUCCEEDED or FAILED. If the run exceeds the waitForFinish time, the status will still be RUNNING.
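
A quick sketch of the same call with the wait parameter (again TypeScript with the built-in fetch; the token and input are placeholders as before):

// Sketch: start the run and wait up to 300 seconds for it to finish.
const TOKEN = process.env.APIFY_TOKEN;

async function runAndWait(input: object) {
  const response = await fetch(
    `https://api.apify.com/v2/acts/apify~web-scraper/runs?token=${TOKEN}&waitForFinish=300`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  );
  const { data: run } = await response.json();
  // SUCCEEDED or FAILED if it finished in time, otherwise still RUNNING.
  console.log(run.status);
  return run;
}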

Run-sync endpoint
There is also one special endpoint with a more limited use case. The Apify API provides a special run-sync endpoint for actors and tasks that waits just as in the previous case. The advantage over the waitForFinish parameter is that you get the data back right away, along with the info JSON, in the response. This saves you one more call. The disadvantage is that it only works if the data is stored in the key-value store of the run. Most of the time, you store the data in a dataset, where this endpoint doesn't help.
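
For completeness, a hedged sketch of such a call; the exact shape of the response is best checked against the run-sync endpoint docs:

// Sketch: the run-sync endpoint starts the run, waits, and returns the data
// stored by the run right away. Replace ACTOR_NAME_OR_ID with your own actor.
const TOKEN = process.env.APIFY_TOKEN;

async function runSync(input: object) {
  const response = await fetch(
    `https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/run-sync?token=${TOKEN}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  );
  // The data produced by the run; see the run-sync docs for the exact shape.
  return response.json();
}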

Webhooks

If you have a server, webhooks are the most elegant and flexible solution. You can simply set up a webhook for any actor or task, and that webhook sends a POST request to your server after some event happens. Usually this event is a successfully finished run, but you can also set up a different webhook for failed runs, etc.

The webhook will send you a pretty complicated JSON, but usually you are only interested in the resource object, which is basically the run info JSON from the previous sections. For our use case, you can leave the payload template as is, since it contains exactly what we need.

Once you receive this request from the webhook, you know the event happened and you can ask for the complete data. Don't forget to respond to the webhook with a 200 status. Otherwise, it will ping you again.
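
For illustration, a minimal webhook receiver using Node's built-in http module could look like the sketch below; the port (8080) and the URL you register for the webhook are up to you, they are not prescribed by Apify:

// Sketch: a tiny server that receives the Apify webhook, reads the `resource`
// object (the run info JSON), and replies with a 200 status so the webhook
// is not retried. Port 8080 is an arbitrary choice.
import { createServer } from 'node:http';

createServer((req, res) => {
  let body = '';
  req.on('data', (chunk: Buffer) => { body += chunk.toString(); });
  req.on('end', () => {
    const payload = JSON.parse(body);
    const run = payload.resource; // run info JSON: id, status, defaultDatasetId, ...
    console.log(`Run ${run.id} finished with status ${run.status}`);
    res.statusCode = 200;
    res.end('OK');
  });
}).listen(8080);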

Polling

However, there are cases where you don't have a server and the run is too long for a synchronous call. Periodic polling of the run status is then the solution.

You start the actor with the usual call shown at the beginning of this article, which gives you back the run info JSON. You need to extract the id field from this JSON; it is the ID of the actor run you just started. Then you set up an interval that polls the Apify API (let's say every 5 seconds). On each tick, you call the GET RUN endpoint to retrieve the run's status, simply replacing RUN_ID with that id in the following URL:

https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs/RUN_ID

Once it returns a status of SUCCEEDED or FAILED, you know the run has finished, and you can cancel the interval and ask for the data.
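
Put together, a polling loop in TypeScript could look like this sketch (it uses a simple sleep between requests instead of setInterval, and passes the token the same way as the earlier calls):

// Sketch: poll the GET RUN endpoint every 5 seconds until the run finishes.
// Replace ACTOR_NAME_OR_ID with your actor, as in the URL above.
const TOKEN = process.env.APIFY_TOKEN;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForRun(runId: string): Promise<string> {
  while (true) {
    const response = await fetch(
      `https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs/${runId}?token=${TOKEN}`,
    );
    const { data: run } = await response.json();
    if (run.status === 'SUCCEEDED' || run.status === 'FAILED') {
      return run.status; // finished, stop polling and go collect the data
    }
    await sleep(5000);
  }
}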

3. Collect the data

Unless you have used the special run-sync endpoint mentioned above, you will have to make one additional request to the API to retrieve the data. The run info JSON also contains the IDs of the default dataset and key-value store that are allocated separately for each run. This is usually everything you need. The fields are called defaultDatasetId and defaultKeyValueStoreId.

Collecting dataset
If you are scraping products or basically any list of items with similar fields, the dataset is the storage of choice. Don't forget that dataset items are immutable: you can only push to the dataset, not change its content.

Retrieving the data is simple: send a GET request to the GET ITEMS endpoint and pass the defaultDatasetId in the URL. For a GET request to the default dataset, no token is needed.

https://api.apify.com/v2/datasets/DATASET_ID/items

By default, it will return the data in JSON format with some metadata. The actual data are in the items array.

There are plenty of additional parameters that you can use. Learning about them is not the focus of this article, so check the docs. We will only mention that you can pass a format parameter that transforms the response into any popular format, like CSV, XML, Excel, RSS, etc. The items are also paginated, which means you can ask for only a subset of the data; this is specified with the limit and offset parameters. There is an overall limit of 250,000 items that the endpoint can return per request, so to retrieve more, you need to send additional requests, incrementing the offset.

https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&offset=250000
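
As a sketch, downloading a whole dataset with this pagination could look like the TypeScript below. The code accepts either a plain array or an object with an items array, because the exact response shape depends on the parameters you use; check the docs for details:

// Sketch: download all dataset items, paginating with limit & offset.
// The page size mirrors the 250,000-items-per-request cap mentioned above.
async function fetchAllItems(datasetId: string): Promise<unknown[]> {
  const limit = 250000;
  const items: unknown[] = [];
  for (let offset = 0; ; offset += limit) {
    const response = await fetch(
      `https://api.apify.com/v2/datasets/${datasetId}/items?format=json&limit=${limit}&offset=${offset}`,
    );
    const body = await response.json();
    // Depending on parameters, the response is either a plain array of items
    // or an object with an `items` array; handle both here.
    const page: unknown[] = Array.isArray(body) ? body : body.items;
    items.push(...page);
    if (page.length < limit) break; // last (possibly empty) page reached
  }
  return items;
}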

Collecting files from key value store

The key-value store is mainly useful if you have a single output or any kind of files that cannot be stringified, such as images, PDFs, etc. When you want to retrieve anything from a key-value store, the defaultKeyValueStoreId alone is not enough. You also need to know the name of the record you want to retrieve.

If you have a single output JSON, the convention is to save it as a record named OUTPUT in the default key-value store. To retrieve the content of the record, call the GET RECORD endpoint. Again, there is no need for a token for simple GET requests.

https://api.apify.com/v2/key-value-stores/STORE_ID/records/RECORD_KEY

If you don't know the keys (names) of the records in advance, you can retrieve just the keys with the LIST KEYS endpoint. Just keep in mind that you can get a maximum of 1,000 keys per request, so you will need to paginate using the exclusiveStartKey parameter if you have more than 1,000 keys. Basically, after each call, you take the last record key and provide it as the exclusiveStartKey parameter for the next one. You can repeat this until you get 0 keys back.

https://api.apify.com/v2/key-value-stores/STORE_ID/keys?exclusiveStartKey=myLastRecordKey
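
A sketch of this key pagination in TypeScript; the response shape assumed here (a data object with an items array of { key } objects) should be double-checked against the LIST KEYS documentation:

// Sketch: list every record key in a key-value store by paginating with
// exclusiveStartKey; the records can then be fetched one by one via GET RECORD.
async function listAllKeys(storeId: string): Promise<string[]> {
  const keys: string[] = [];
  let lastKey: string | undefined;
  while (true) {
    const url = new URL(`https://api.apify.com/v2/key-value-stores/${storeId}/keys`);
    if (lastKey !== undefined) url.searchParams.set('exclusiveStartKey', lastKey);
    const { data } = await (await fetch(url)).json();
    if (data.items.length === 0) break; // 0 keys back means we have them all
    for (const item of data.items as { key: string }[]) keys.push(item.key);
    lastKey = keys[keys.length - 1]; // start the next page after the last key
  }
  return keys;
}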

Summary

We have reviewed the basic integration process with all of its main options. Of course, there are plenty of parameters and features that you can use to make your integration smoother. Check our help section for more guides.
