Check out our Sitemap Sniffer – it can discover sitemaps in hidden locations.
A website's sitemap is an XML file that lists the site's pages, so we can use it to generate the list of URLs to crawl.
- Locate the sitemap, either at the /sitemap.xml path or with the Sitemap Sniffer.
- Identify the URLs for the pages you want to scrape and create a regular expression to capture them.
- Import the URLs into the Apify SDK's RequestList.
- Use the created RequestList in PuppeteerCrawler and save the results to your dataset.
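The first two steps can be sketched in plain JavaScript before any crawler is involved. This is a minimal illustration, not the Apify SDK's own implementation: the sample sitemap, the example.com URLs, and the helper names extractSitemapUrls and filterUrls are all assumptions made up for this example. It also assumes a simple sitemap with plain-text loc entries (no CDATA sections or XML entity decoding).

```javascript
// Hypothetical sample sitemap. In practice this text would be
// downloaded from the site's /sitemap.xml path.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/red-shoes</loc></url>
  <url><loc>https://example.com/products/blue-hat</loc></url>
  <url><loc>https://example.com/blog/hello-world</loc></url>
</urlset>`;

// Collect the text of every <loc> element in the sitemap.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  for (const match of xml.matchAll(locPattern)) {
    urls.push(match[1]);
  }
  return urls;
}

// Keep only the URLs that match the pages we want to scrape.
// The pattern should not use the "g" flag, because test() on a
// global regex keeps state between calls.
function filterUrls(urls, pattern) {
  return urls.filter((url) => pattern.test(url));
}

// Assumed pattern: product detail pages only.
const productPattern = /^https:\/\/example\.com\/products\/[^/]+$/;
const productUrls = filterUrls(extractSitemapUrls(sampleSitemap), productPattern);

console.log(productUrls);
```

The resulting array of URLs is what would then be imported into the RequestList in the third step.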
See the full tutorial and code examples in our documentation.