Note: If you just want to import a simple URL list from a spreadsheet, check this article.

Typically a scraper on Apify uses one or more Start URLs to start the crawling process. You have several options for defining these URLs:

  • define a static list of Start URLs in basic settings on the scraper configuration page
  • POST an array of Start URLs when starting the scraper via the API and thus handle these settings dynamically from your application (guide for API integration here)
  • fetch a list of URLs from an external source via HTTP from the Page function or actor source code

We're going to take a look at the third option.

First, you have to prepare an external source of URLs. All it needs is a URL from which the data can be fetched with an HTTP request from the Page function or actor code. It can be a CSV file, a database, an application, a cloud service, a text file, etc.

Uploading directly via file/link upload

The Start URLs field of the various scrapers has a neat option to upload an arbitrary file from your computer or a link from the web. The scraper will scan the provided resource and extract all URLs from it. You don't even have to structure the file because the URLs are automatically recognized with regular expressions.

Fetching any resource from a code

There are countless ways to fetch something from code. We will explore a few main options below.

Web Scraper

Web Scraper is specific in that it doesn't allow you to access any external library (except jQuery) and all the code executes in the browser context. To access an external resource, you have to use built-in browser functions like fetch or jQuery's $.ajax. Also, don't forget to switch on the Ignore CORS and CSP field in the Proxy and browser configuration tab. Otherwise, the browser may complain that you tried to access unauthorized external resources.

Let's say we want to fetch a CSV file that we have somewhere on the web. The fetch function is available out of the box.

// We are inside pageFunction
const csvUrl = 'https://my-site.com/my-file.csv';
// We fetch the csv and convert the response to a string
const csvString = await fetch(csvUrl).then((response) => response.text());

// Now we can parse the rows by splitting on each newline character and trimming them
const rows = csvString.split('\n').map((row) => row.trim());

// And we loop over the rows, check whether each contains a valid URL, and enqueue it
for (const row of rows) {
    if (row.startsWith('http')) {
        await context.enqueueRequest({ url: row });
    }
}

And that's it. We downloaded a CSV file from our URL, parsed it, checked which rows contain valid URLs, and enqueued those for later scraping. Don't forget that in a real use case, you don't want to load this CSV on each page (in every pageFunction), so you should label your requests, as shown in the sketch below. Labeling is well explained in our scraping tutorial.
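
For illustration, here is a minimal sketch of how such labeling might look. The START and DETAIL labels, and using userData to carry them, are just a convention for this example; pick whatever names suit your scraper.

async function pageFunction(context) {
    const { request } = context;

    // Only the request labeled START (e.g. your single Start URL) loads the CSV
    if (request.userData.label === 'START') {
        const csvUrl = 'https://my-site.com/my-file.csv';
        const csvString = await fetch(csvUrl).then((response) => response.text());
        const rows = csvString.split('\n').map((row) => row.trim());
        for (const row of rows) {
            if (row.startsWith('http')) {
                // Enqueue with a different label so the CSV is loaded only once
                await context.enqueueRequest({ url: row, userData: { label: 'DETAIL' } });
            }
        }
        return;
    }

    // Requests labeled DETAIL are the pages from the CSV - scrape them here
    if (request.userData.label === 'DETAIL') {
        // ... your scraping logic ...
    }
}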

Actors

In other actors and scrapers, you will likely have access to the Apify library, so you can use our own advanced HTTP client, requestAsBrowser. The usage is similar to fetch.

// Actor or Scraper code with access to Apify (context.Apify in Cheerio or Puppeteer scrapers)

const csvUrl = 'https://my-site.com/my-file.csv';
// We fetch the csv and convert the response to a string
const csvString = await Apify.utils.requestAsBrowser({ url: csvUrl })
    .then((response) => response.body);

The rest of the code is the same.

Dynamically loading from Google Sheets, Google Drive and other services

You can also check out our public store for actors that help you integrate with various services. These actors handle a lot of complexity for you and give you a simple interface to use. Actors can be "called" from anywhere in the code to load data, export data or perform some action. 

If we want to load spreadsheet data from the code, we can do that by calling the Google Sheets actor. This actor allows us to use either public or authenticated access (if we don't want to share our sheets publicly). For more information, check the actor's readme.

const sheetsActorInput = {
    mode: 'read',
    spreadsheetId: '1anU4EeWKxHEj2mAnB0tgfxGnkTdqXBSB76a7-FRLytr',
    publicSpreadsheet: true // switch to false for authorized access
};

const sheetRows = await Apify.call('lukaskrivka/google-sheets', sheetsActorInput);

// Do what you need with your data
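
For example, assuming the call returns an array of row objects and your sheet has a url column (adjust the field name to match your sheet), you could enqueue the URLs from a scraper's pageFunction like this:

// A sketch only - assumes sheetRows is an array of row objects with a "url" column
for (const row of sheetRows) {
    if (row.url && row.url.startsWith('http')) {
        await context.enqueueRequest({ url: row.url });
    }
}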

In Web Scraper, the Apify library is not available, so you call the actor through the Apify API using fetch; an example is shown in the readme and a rough sketch follows below.
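
For illustration only, here is what such a call could look like, using the Apify API's run-sync endpoint. The <YOUR_API_TOKEN> placeholder and the assumption that the actor returns the rows directly as its OUTPUT are ours; double-check the actor's readme for the exact call.

// A sketch of calling the Google Sheets actor from Web Scraper via the Apify API.
// <YOUR_API_TOKEN> is a placeholder; the exact output format is described in the actor's readme.
const response = await fetch('https://api.apify.com/v2/acts/lukaskrivka~google-sheets/run-sync?token=<YOUR_API_TOKEN>', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
        mode: 'read',
        spreadsheetId: '1anU4EeWKxHEj2mAnB0tgfxGnkTdqXBSB76a7-FRLytr',
        publicSpreadsheet: true,
    }),
});
const sheetRows = await response.json();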

Similar functionality is provided by the Google Drive actor and many more. If you would like us to add more integration actors, just ping us at support@apify.com.
