Apify provides a number of actors, such as Web Scraper (apify/web-scraper), Cheerio Scraper (apify/cheerio-scraper) or Puppeteer Scraper (apify/puppeteer-scraper), that make it really simple to crawl web pages and extract data from them. These actors start with a pre-defined list of Start URLs and then optionally recursively follow links to find new pages.
You can enter the start URLs manually one by one, link a remote text file containing the URLs, or upload the file directly.
Let's say you have your start URLs to crawl entered in a Google Sheets spreadsheet, such as this one:
Of course, you could export the spreadsheet to a comma-separated values (CSV) file and then upload the file to the Start URLs control. However, with this approach, changes in the spreadsheet are not automatically propagated to the actor, and you'd need to upload the file again after every change. That's not very flexible.
Fortunately, there's a better way. Add the following query parameter to the base part of the Google Sheets URL, right after the long string that identifies the document:
And you'll get a URL that automatically exports the spreadsheet to CSV. Then you just need to click Link remote text file and paste the URL there:
IMPORTANT: Make sure the document can be viewed by anyone with the link, otherwise the actor will not be able to access it!
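The URL construction described above can be sketched in a few lines of Python. This is only an illustration: the `/gviz/tq?tqx=out:csv` suffix is an assumption (one commonly used CSV export form for Google Sheets), the function name `build_csv_export_url` is hypothetical, and the document ID in the usage example is a placeholder.

```python
import re

# Assumption: the document identifier is the path segment that follows
# "/spreadsheets/d/" in a Google Sheets URL.
SHEET_ID_PATTERN = re.compile(r"/spreadsheets/d/([a-zA-Z0-9_-]+)")

# Assumption: one commonly used CSV export suffix for Google Sheets.
# Check it against the parameter you actually use.
CSV_EXPORT_SUFFIX = "/gviz/tq?tqx=out:csv"


def build_csv_export_url(sheet_url: str) -> str:
    """Return a CSV export URL for the given Google Sheets document URL."""
    match = SHEET_ID_PATTERN.search(sheet_url)
    if match is None:
        raise ValueError(f"Not a Google Sheets document URL: {sheet_url}")
    # Keep only the base part of the URL (up to and including the document
    # identifier) and append the export suffix right after it.
    base = f"https://docs.google.com/spreadsheets/d/{match.group(1)}"
    return base + CSV_EXPORT_SUFFIX


if __name__ == "__main__":
    # EXAMPLE_DOCUMENT_ID is a placeholder, not a real spreadsheet.
    print(build_csv_export_url(
        "https://docs.google.com/spreadsheets/d/EXAMPLE_DOCUMENT_ID/edit#gid=0"
    ))
```

The resulting URL is what you would paste into the Link remote text file dialog.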
And that's it, now the actor will simply download the content of the spreadsheet with up-to-date URLs whenever it starts.
Note that the spreadsheet should have a simple structure, so that Apify can easily find the URLs in it. It should also contain only one sheet.
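To see why a simple structure helps, here is a rough sketch of how URLs might be picked out of exported CSV text. It is not Apify's actual parsing logic; the function name `extract_urls` and the simplified URL pattern are assumptions made for illustration. With one URL per cell, a scan like this finds every URL unambiguously.

```python
import csv
import io
import re
from typing import List

# Simplified URL pattern, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s,\"']+")


def extract_urls(csv_text: str) -> List[str]:
    """Collect unique URLs from CSV text, preserving first-seen order."""
    urls: List[str] = []
    seen = set()
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            for url in URL_PATTERN.findall(cell):
                if url not in seen:
                    seen.add(url)
                    urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "url,note\nhttps://example.com/a,first\nhttps://example.com/b,second\n"
    for url in extract_urls(sample):
        print(url)
```

You can run a check like this on your own export before starting the actor, to confirm that the sheet contains the URLs you expect.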
Happy crawling of URLs from your Google Sheets document!