This article is part of a series about the deprecation of the Apify Crawler product and its replacement with the apify/legacy-phantomjs-crawler actor. Since Apify Crawler uses Apify API version 1 and actors use API version 2, the migration requires you to update your integration with the Apify API. In this article, you'll learn how to map specific API version 1 endpoints to their version 2 counterparts.

Crawlers  ➡️  Actor tasks

Instead of creating, updating and deleting crawlers using the Crawlers API version 1 endpoints, after the migration you will create, update and delete tasks for the apify/legacy-phantomjs-crawler actor using the Actor tasks API version 2 endpoints. Each task contains the full input configuration of the apify/legacy-phantomjs-crawler actor, whose fields are equivalent to those of the legacy crawler configuration, such as startUrls or pageFunction.

For example, instead of creating a new crawler by sending an HTTP POST request with the JSON configuration of the crawler to the Create crawler API version 1 endpoint:

https://api.apify.com/v1/[USER_ID]/crawlers?token=[API_TOKEN]

you will create a new task by sending an HTTP POST request to the Create task API version 2 endpoint:

https://api.apify.com/v2/actor-tasks?token=[API_TOKEN]

The request must have the Content-Type: application/json  header and the POST payload must be a JSON object that contains information about the actor task:

{
  "actId": "YPh5JENjSSR6vBf2E",
  "name": "my-new-crawler-task",
  "description": "This actor task was migrated from legacy crawler Google Business Listing.\n\nGet the Google Business Listing data like Business name , address , category , timings ,contact and website.",
  "options": {
    "build": "latest",
    "timeoutSecs": 600000,
    "memoryMbytes": 2048
  },
  "input": {
    "startUrls": [{
      "key": "START",
      "value": "https://www.google.com/search?q=test"
    }],
    "maxParallelRequests": 1
  }
}

Note that the field actId: "YPh5JENjSSR6vBf2E" identifies the apify/legacy-phantomjs-crawler actor. The actual configuration of the crawler is stored under the input field.
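As a rough illustration of the Create task call described above, the following Python sketch only prepares the HTTP request and does not send it. The helper name build_create_task_request and the MY_API_TOKEN placeholder are made up for this example, and treating the options field as optional is an assumption.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# ID of the apify/legacy-phantomjs-crawler actor, as given in this article.
LEGACY_CRAWLER_ACTOR_ID = "YPh5JENjSSR6vBf2E"


def build_create_task_request(token, name, crawler_input, options=None):
    """Prepare (but do not send) the POST request that creates an actor task."""
    payload = {"actId": LEGACY_CRAWLER_ACTOR_ID, "name": name, "input": crawler_input}
    if options is not None:  # assumption: "options" may be left out
        payload["options"] = options
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}/actor-tasks?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


create_request = build_create_task_request(
    token="MY_API_TOKEN",  # placeholder, replace with your own token
    name="my-new-crawler-task",
    crawler_input={
        "startUrls": [{"key": "START", "value": "https://www.google.com/search?q=test"}],
        "maxParallelRequests": 1,
    },
    options={"build": "latest", "memoryMbytes": 2048},
)
print(create_request.get_method(), create_request.full_url)
```

To actually create the task, you would pass the prepared object to urllib.request.urlopen and read the JSON response.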

Start crawler execution  ➡️  Run actor task

Instead of running a crawler using the Start execution API version 1 endpoint:

https://api.apify.com/v1/[USER_ID]/crawlers/[CRAWLER_ID]/execute?token=[API_TOKEN]

you will run an actor task using the Run task API version 2 endpoint:

https://api.apify.com/v2/actor-tasks/[ACTOR_TASK_ID]/runs?token=[API_TOKEN]

Both API endpoints have practically the same interface: each accepts an HTTP POST request whose payload is a JSON object with the crawler or task input configuration properties to override. Both endpoints require the Content-Type: application/json header, and both support the waitForFinish query parameter to synchronously wait for the crawler or task to finish.

As a response from the Run task API version 2 endpoint, you'll receive a JSON object containing details about the actor run object. The interesting fields are:

  • id - ID of the newly created actor run
  • actId - ID of the apify/legacy-phantomjs-crawler actor
  • defaultDatasetId - ID of the default dataset that contains the crawling results. You will need to pass this ID to the Get items API endpoint in order to download the crawler results. Read below for more details.
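The Run task call and the fields listed above could be handled along these lines in Python. This is a sketch under a few assumptions: the helper names are invented, the waitForFinish value is taken to be a number of seconds, and the run details are assumed to possibly be wrapped in a top-level "data" property. The sample response is fabricated purely to show the shape.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_task_request(task_id, token, input_overrides=None, wait_for_finish=None):
    """Prepare (but do not send) the POST request that runs an actor task."""
    query = {"token": token}
    if wait_for_finish is not None:
        query["waitForFinish"] = wait_for_finish  # assumed to be in seconds
    url = f"{API_BASE}/actor-tasks/{task_id}/runs?{urllib.parse.urlencode(query)}"
    body = json.dumps(input_overrides or {}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def extract_run_fields(response_text):
    """Pick the fields of interest out of the run details JSON."""
    body = json.loads(response_text)
    run = body.get("data", body)  # assumption: details may sit under "data"
    return {key: run.get(key) for key in ("id", "actId", "defaultDatasetId")}


run_request = build_run_task_request(
    "MY_TASK_ID", "MY_API_TOKEN", {"maxParallelRequests": 5}, wait_for_finish=60
)

# Made-up response used only to illustrate the parsing step.
sample_response = '{"data": {"id": "RUN_1", "actId": "YPh5JENjSSR6vBf2E", "defaultDatasetId": "DATASET_1"}}'
run_fields = extract_run_fields(sample_response)
print(run_request.full_url)
print(run_fields["defaultDatasetId"])
```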

Stop crawler execution  ➡️  Abort actor run

Instead of stopping the crawler using the Stop execution API version 1 endpoint, you'll be stopping the actor task run using the Abort run API version 2 endpoint:

https://api.apify.com/v2/acts/[ACTOR_ID]/runs/[RUN_ID]/abort?token=[API_TOKEN]

Simply send an HTTP POST request to that endpoint and the actor run will be aborted. The endpoint will return a JSON object with the details about the actor run.
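A minimal Python sketch of the abort call, assuming you stored the id and actId values returned when the run was started. The helper name and the placeholder values are hypothetical, and the request is only prepared, not sent.

```python
import urllib.parse
import urllib.request


def build_abort_run_request(actor_id, run_id, token):
    """Prepare (but do not send) the POST request that aborts an actor run."""
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs/{run_id}/abort?{query}"
    # The endpoint needs no payload, so only the HTTP method is set.
    return urllib.request.Request(url, method="POST")


abort_request = build_abort_run_request("YPh5JENjSSR6vBf2E", "MY_RUN_ID", "MY_API_TOKEN")
print(abort_request.get_method(), abort_request.full_url)
```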

Get execution results  ➡️  Get dataset items

Instead of using the Get execution results API version 1 endpoint to obtain the crawler results, you'll use the Get items API version 2 endpoint to get items stored in the default dataset of the actor task run:

https://api.apify.com/v2/datasets/[DATASET_ID]/items

Note that the DATASET_ID must be the defaultDatasetId value received in the JSON response after starting the actor run (regardless of whether the actor was run in a task or alone) - see above for details. The Get items API endpoint accepts an HTTP GET request and supports most of the query parameters of the legacy API endpoint:

  • format 
  • simplified 
  • offset
  • limit 
  • desc 
  • attachment 
  • delimiter 
  • bom 
  • xmlRoot 
  • xmlRow 
  • skipHeaderRow 
  • skipFailedPages 

The Get items API endpoint doesn’t support the hideUrl=1 parameter, which removes the URL  field from simplified results. However, you can achieve the same effect by adding the omit=URL,url  query parameter.
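For instance, a Get items URL that mimics the legacy hideUrl=1 behavior could be assembled as follows. This is only a sketch: build_dataset_items_url is a hypothetical helper and MY_DATASET_ID is a placeholder for the defaultDatasetId of your run.

```python
from urllib.parse import urlencode


def build_dataset_items_url(dataset_id, **params):
    """Build the Get items URL with optional query parameters."""
    url = f"https://api.apify.com/v2/datasets/{dataset_id}/items"
    return f"{url}?{urlencode(params)}" if params else url


# omit=URL,url stands in for the unsupported hideUrl=1 parameter.
items_url = build_dataset_items_url(
    "MY_DATASET_ID", format="json", simplified=1, omit="URL,url"
)
print(items_url)
```

Note that urlencode percent-encodes the comma in the omit value as %2C, which is the standard URL encoding of the same value.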

The XML format uses different tag names for the root and item elements. Crawler results in XML format have the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result>
    ...
  </result>
  ...
</results>

but dataset items are formatted as follows:

<?xml version="1.0" encoding="UTF-8"?>
<items>
  <item>
    ...
  </item>
  ...
</items>

To get the results in the original form, add the following query parameters to the dataset URL: xmlRoot=results&xmlRow=result
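If your existing code parses the legacy XML structure, requesting the dataset with these two parameters should let it keep working. The Python sketch below shows the idea with hypothetical helper names; the sample document and its title field are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def legacy_xml_items_url(dataset_id):
    """Build a Get items URL that asks for the legacy XML tag names."""
    params = {"format": "xml", "xmlRoot": "results", "xmlRow": "result"}
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


def parse_legacy_results(xml_bytes):
    """Turn <results><result>...</result></results> into a list of dicts."""
    root = ET.fromstring(xml_bytes)
    return [{child.tag: child.text for child in row} for row in root.findall("result")]


# Invented sample in the legacy shape; real results have your own fields.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result><title>Example</title></result>
  <result><title>Another</title></result>
</results>"""

xml_url = legacy_xml_items_url("MY_DATASET_ID")
rows = parse_legacy_results(sample_xml)
print(xml_url)
print(rows)
```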

Last crawler execution results  ➡️  Last actor run dataset items

Instead of the Get last execution results API version 1 endpoint, you can use the Get last actor run's dataset API version 2 endpoint to get the crawling results from the last run:

https://api.apify.com/v2/acts/[ACTOR_ID]/runs/last/dataset/items?token=[API_TOKEN]&status=SUCCEEDED

Note that the API endpoint also supports the status query parameter, used above to only consider runs that succeeded. Just beware that actor runs have slightly different statuses than crawler executions, e.g. actors use the status ABORTED instead of STOPPED.
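If your integration still works with legacy status names, a small translation step can be applied before building the URL. In this Python sketch the only mapping taken from this article is STOPPED to ABORTED; passing every other status through unchanged is an assumption, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Only the STOPPED -> ABORTED difference is named in this article.
LEGACY_TO_ACTOR_STATUS = {"STOPPED": "ABORTED"}


def last_run_items_url(actor_id, token, legacy_status="SUCCEEDED"):
    """Build the URL for dataset items of the last run with a given status."""
    # Assumption: statuses not listed in the mapping are passed through as-is.
    status = LEGACY_TO_ACTOR_STATUS.get(legacy_status, legacy_status)
    query = urlencode({"token": token, "status": status})
    return f"https://api.apify.com/v2/acts/{actor_id}/runs/last/dataset/items?{query}"


succeeded_url = last_run_items_url("MY_ACTOR_ID", "MY_API_TOKEN")
stopped_url = last_run_items_url("MY_ACTOR_ID", "MY_API_TOKEN", legacy_status="STOPPED")
print(succeeded_url)
print(stopped_url)
```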

If you have any questions or something is not clear, please contact support@apify.com.

Happy crawling with the new actor!
