This article is part of a series of articles about the deprecation of the Apify Crawler product and its replacement with the apify/legacy-phantomjs-crawler actor. Since Apify Crawler uses Apify API version 1 and Actors use API version 2, as part of the migration it is necessary to update your integration with the Apify API. In this article, you'll learn how to map specific API version 1 endpoints to version 2.
Crawlers ➡️ Actor tasks
Instead of creating, updating and deleting crawlers using the Crawlers API version 1 endpoints, after the migration you will create, update and delete tasks for the apify/legacy-phantomjs-crawler actor using the Actor tasks API version 2 endpoints. Each task contains the full input configuration of the apify/legacy-phantomjs-crawler actor, whose fields are equivalent to those of the legacy crawler configuration, such as
For example, instead of creating a new crawler by sending an HTTP POST request with the JSON configuration of the crawler to the Create crawler API version 1 endpoint:
you will create a new task by sending an HTTP POST request to the Create task API version 2 endpoint:
The request must have the Content-Type: application/json header, and the POST payload must be a JSON object that contains information about the actor task:
"description": "This actor task was migrated from legacy crawler Google Business Listing.\n\nGet the Google Business Listing data like Business name , address , category , timings ,contact and website.",
Note that the field actId: "YPh5JENjSSR6vBf2E" identifies the apify/legacy-phantomjs-crawler actor. The actual configuration of the crawler is stored under the
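As a rough illustration of the request described above, the following Python sketch builds (but does not send) a Create task call using only the standard library. The base URL, the /actor-tasks path and the token query parameter are assumptions based on the public Apify API version 2 reference and should be verified there. The name and description values are hypothetical, and the crawler configuration itself is left out.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of Apify API version 2; verify against the API reference.
API_BASE = "https://api.apify.com/v2"

# Actor ID of apify/legacy-phantomjs-crawler, as given in the article.
LEGACY_CRAWLER_ACT_ID = "YPh5JENjSSR6vBf2E"


def build_create_task_request(token, payload):
    """Build (but do not send) the POST request that creates an actor task."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/actor-tasks?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical task: "name" and "description" are illustrative values, and
# the crawler configuration is intentionally omitted from this sketch.
request = build_create_task_request(
    "MY_API_TOKEN",
    {
        "actId": LEGACY_CRAWLER_ACT_ID,
        "name": "my-migrated-crawler",
        "description": "Migrated from a legacy crawler.",
    },
)
```

Passing the request to urllib.request.urlopen would actually send it; the sketch stops short of that so it can be inspected without network access.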
Start crawler execution ➡️ Run actor task
Instead of running a crawler using the Start execution API version 1 endpoint:
you will run an actor task using the Run task API version 2 endpoint:
Both API endpoints have practically the same interface: they both accept an HTTP POST request whose payload is a JSON object with overridden crawler or task input configuration properties. Both endpoints require the Content-Type: application/json header, and they support the waitForFinish query parameter to synchronously wait for the crawler/task to finish.
As a response from the Run task API version 2 endpoint, you'll receive a JSON object containing details about the actor run object. The interesting fields are:
id - ID of newly created actor run
actId - ID of the apify/legacy-phantomjs-crawler actor
defaultDatasetId - ID of the default dataset that contains the crawling results. You will need to pass this ID to the Get items API endpoint in order to download the crawler results. Read below for more details.
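The following Python sketch shows one way the Run task call and the three fields listed above might be handled. The /actor-tasks/{taskId}/runs path, the token parameter and the "data" wrapper around the run object are assumptions drawn from the public Apify API version 2 reference. The input override field and the waitForFinish value of 60 are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of Apify API version 2; verify against the API reference.
API_BASE = "https://api.apify.com/v2"


def build_run_task_request(task_id, token, input_overrides=None, wait_for_finish=None):
    """Build (but do not send) the POST request that runs an actor task."""
    query = {"token": token}
    if wait_for_finish is not None:
        query["waitForFinish"] = wait_for_finish
    safe_task_id = urllib.parse.quote(task_id, safe="")
    url = f"{API_BASE}/actor-tasks/{safe_task_id}/runs?{urllib.parse.urlencode(query)}"
    return urllib.request.Request(
        url,
        data=json.dumps(input_overrides or {}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_run_fields(response_body):
    """Pick the fields the article calls out from a parsed run response.

    Assumption: the run object may be wrapped in a top-level "data" key;
    if it is not, the body itself is treated as the run object.
    """
    run = response_body.get("data", response_body)
    return {key: run[key] for key in ("id", "actId", "defaultDatasetId")}


# "someInputField" is a placeholder, not a real configuration property.
run_request = build_run_task_request(
    "TASK_ID", "MY_API_TOKEN", {"someInputField": "value"}, wait_for_finish=60
)
```

The defaultDatasetId returned by extract_run_fields is the value to keep for downloading results later.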
Stop crawler execution ➡️ Abort actor run
Instead of stopping the crawler using the Stop execution API version 1 endpoint, you'll be stopping the actor task run using the Abort run API version 2 endpoint:
Simply send an HTTP POST request to that endpoint and the actor run will be aborted. The endpoint will return a JSON object with details about the actor run.
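A minimal sketch of the abort call, assuming the run is addressed as /actor-runs/{runId}/abort under the same base URL and authenticated with a token query parameter. Both are assumptions taken from the public Apify API version 2 reference, not from this article, so check the path against the endpoint documentation before relying on it.

```python
import urllib.parse
import urllib.request

# Assumed base URL of Apify API version 2; verify against the API reference.
API_BASE = "https://api.apify.com/v2"


def build_abort_run_request(run_id, token):
    """Build (but do not send) the POST request that aborts an actor run."""
    safe_run_id = urllib.parse.quote(run_id, safe="")
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/actor-runs/{safe_run_id}/abort?{query}"
    # The endpoint needs no payload, so an empty body is sent with the POST.
    return urllib.request.Request(url, data=b"", method="POST")


abort_request = build_abort_run_request("RUN_ID", "MY_API_TOKEN")
```

The run ID here is the id field from the response received when the task run was started.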
Get execution results ➡️ Get dataset items
Instead of using the Get execution results API version 1 endpoint to obtain the crawler results, you'll use the Get items API version 2 endpoint to get items stored in the default dataset of the actor task run:
Note that the DATASET_ID must be the defaultDatasetId value received in the JSON response after starting the actor run (regardless of whether the actor was run in a task or alone); see above for details. The Get items API endpoint accepts an HTTP GET request and supports most of the query parameters of the legacy API endpoint:
The Get items API endpoint doesn’t support the hideUrl=1 parameter, which removes the URL field from simplified results. However, you can achieve the same effect by adding the omit=URL,url query parameter.
XML format uses different tag names for root and item elements. Crawler results in XML format have the following structure:
<?xml version="1.0" encoding="UTF-8"?>
but dataset items are formatted as follows:
<?xml version="1.0" encoding="UTF-8"?>
To get results in the original form, add the following query parameters to the dataset URL
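For the hideUrl replacement described above, a dataset items URL carrying the omit parameter could be assembled roughly as in the following Python sketch. The /datasets/{datasetId}/items path and the format parameter are assumptions based on the public Apify API version 2 reference. The extra_params argument is a hypothetical hook for any further query parameters you need.

```python
import urllib.parse

# Assumed base URL of Apify API version 2; verify against the API reference.
API_BASE = "https://api.apify.com/v2"


def build_dataset_items_url(dataset_id, fmt="json", omit=None, extra_params=None):
    """Return the GET URL for downloading items from a dataset."""
    query = {"format": fmt}
    if omit:
        # Field names are comma-separated, e.g. omit=URL,url
        query["omit"] = ",".join(omit)
    if extra_params:
        query.update(extra_params)
    safe_dataset_id = urllib.parse.quote(dataset_id, safe="")
    # safe="," keeps the comma readable instead of encoding it as %2C.
    encoded = urllib.parse.urlencode(query, safe=",")
    return f"{API_BASE}/datasets/{safe_dataset_id}/items?{encoded}"


# DATASET_ID stands for the defaultDatasetId of the actor run.
items_url = build_dataset_items_url("DATASET_ID", omit=["URL", "url"])
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen(items_url).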
Last crawler execution results ➡️ Last actor run dataset items
Instead of the Get last execution results API version 1 endpoint, you can use the Get last actor run's dataset API version 2 endpoint to get the crawling results from the last run:
Note that the API endpoint also supports the status query parameter to only consider runs that succeeded. Just beware that actor runs have slightly different run statuses than crawler executions, e.g. instead of status STOPPED the actors are using status
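As an illustration of the last-run lookup, the Python sketch below builds the URL for the dataset items of a task's most recent run. The /actor-tasks/{taskId}/runs/last/dataset/items path, the token parameter and the SUCCEEDED status value are assumptions based on the public Apify API version 2 reference and should be checked there.

```python
import urllib.parse

# Assumed base URL of Apify API version 2; verify against the API reference.
API_BASE = "https://api.apify.com/v2"


def build_last_run_items_url(task_id, token, status=None, fmt="json"):
    """Return the GET URL for dataset items of the task's last run."""
    query = {"token": token, "format": fmt}
    if status is not None:
        # Restricts the lookup to the last run that ended with this status.
        query["status"] = status
    safe_task_id = urllib.parse.quote(task_id, safe="")
    encoded = urllib.parse.urlencode(query)
    return f"{API_BASE}/actor-tasks/{safe_task_id}/runs/last/dataset/items?{encoded}"


# "SUCCEEDED" is the assumed status name for runs that finished successfully.
last_items_url = build_last_run_items_url("TASK_ID", "MY_API_TOKEN", status="SUCCEEDED")
```

Leaving status unset returns the items of the most recent run regardless of how it ended.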
If you have any questions or something is not clear, please contact firstname.lastname@example.org
Happy crawling with the new actor!