Check out our Sitemap Sniffer – it can discover sitemaps in hidden locations.

A website's sitemap is an XML file that lists the site's pages, typically the ones the owner wants search engines to find. We can use it to generate a list of pages to crawl.

  1. Locate the sitemap, either at the /sitemap.xml path or with the Sitemap Sniffer.

  2. Identify the URLs for the pages you want to scrape and create a regular expression to capture them.

  3. Import the URLs into the Apify SDK's RequestList.

  4. Use the created RequestList in PuppeteerCrawler and save the results to your dataset.
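Steps 1 and 2 can be sketched in plain JavaScript, without the SDK. This is a minimal illustration, not the code from the full tutorial: the sample sitemap, the example.com URLs, the product URL pattern, and the helper names extractSitemapUrls and filterUrls are all assumptions made up for the example. It also ignores XML escaping (such as &amp;amp; inside URLs) and nested sitemap index files.

```javascript
// Hypothetical sample sitemap; a real one would be downloaded from the site.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/red-shoes</loc></url>
  <url><loc>https://example.com/products/blue-hat</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Collect the text of every <loc> element in the sitemap XML.
function extractSitemapUrls(xml) {
    const urls = [];
    const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
    for (const match of xml.matchAll(locPattern)) {
        urls.push(match[1]);
    }
    return urls;
}

// Keep only the URLs that match the given regular expression.
// Pass a pattern without the "g" flag, because test() on a global
// regex keeps state between calls and can skip matches.
function filterUrls(urls, pattern) {
    return urls.filter((url) => pattern.test(url));
}

// Assumed pattern: product detail pages one level below /products/.
const productPattern = /^https:\/\/example\.com\/products\/[^/]+$/;

const productUrls = filterUrls(extractSitemapUrls(sampleSitemap), productPattern);
console.log(productUrls);
// prints the two /products/ URLs from the sample sitemap
```

The resulting array of URLs is the kind of list that step 3 imports into the RequestList; see the documentation for the exact SDK calls.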

See the full tutorial and code examples in our documentation.

