Finding how many proxy IP addresses you need to scrape a website with optimal performance always involves a bit of trial and error. Here are some tips to help you get started.
To find out approximately how many proxy IPs you'll need, answer four questions:
How many pages do I need to scrape from a single domain?
How fast do I need to scrape the pages?
How often do I need to scrape the pages?
Am I scraping a very popular website?
Then, you calculate the number this way (see the code sketch after the list):
Take your number of pages for a single domain.
Divide by:
10,000 if you can spread the scrape over a whole month.
1,000 if you want to scrape within a day.
100 if you need to scrape within an hour.
Multiply by:
1 if you need to scrape monthly.
10 if you need to scrape daily.
100 if you need to scrape hourly.
Multiply by 10 if it's a very popular website like Google, CNN or Twitter.
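If you prefer to see the same estimate as code, here's a minimal sketch in Python. The function name estimate_proxy_ips and its parameters are just illustrative, not part of any library:

```python
import math

def estimate_proxy_ips(pages_per_domain: int,
                       scrape_within: str,    # "month", "day" or "hour"
                       repeat_every: str,     # "month", "day" or "hour"
                       very_popular: bool = False) -> int:
    """Rough estimate of how many proxy IPs a scrape of one domain needs."""
    # Divide by how much time you can spread a single scrape over.
    spread_divisor = {"month": 10_000, "day": 1_000, "hour": 100}[scrape_within]
    # Multiply by how often the scrape repeats.
    frequency_multiplier = {"month": 1, "day": 10, "hour": 100}[repeat_every]
    estimate = pages_per_domain / spread_divisor * frequency_multiplier
    # Multiply by 10 for very popular websites like Google, CNN or Twitter.
    if very_popular:
        estimate *= 10
    # You always need at least one IP; round fractions up.
    return max(1, math.ceil(estimate))
```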
Examples
Example 1
You want to scrape 10,000 pages monthly from a local e-commerce website, but you always need the data on the first of the month. The calculation goes like this:
Number of pages: 10,000
Scrape within a day: divide by 1,000
Scrape monthly: multiply by 1
Very popular: no
10,000 / 1,000 * 1 = 10 IPs
You will need 10 proxy IPs, but only if you spread the scrape throughout the whole day. If you want the scraper to run faster, you will need more IPs.
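If you plug this into the sketch above (using its hypothetical parameter names), you get the same number:

```python
estimate_proxy_ips(10_000, scrape_within="day", repeat_every="month")  # 10
```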
Example 2
You need to scrape 5,000,000 products from a popular online marketplace every month, and you will spread the scrape throughout the month.
Number of pages: 5,000,000
Scrape within a month: divide by 10,000
Scrape monthly: multiply by 1
Very popular: yes
5,000,000 / 10,000 * 1 * 10 = 5,000 IPs
5,000 might seem like a lot, but the catch is that you will be using them every day. If they had time to cool down, you would need fewer, but for long-term, reliable scraping of a popular website, you will need around this number.
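In the sketch above, the popularity factor is the very_popular flag, so this example would look like:

```python
estimate_proxy_ips(5_000_000, scrape_within="month", repeat_every="month",
                   very_popular=True)  # 5000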
Example 3
You want to scrape 100 pages from a thousand websites every day. The key here is that you're scraping many websites, but not a lot of pages from each of them:
Number of pages: 100 (the number of pages per domain)
Scrape within a day: divide by 1,000
Scrape daily: multiply by 10
Very popular: no
100 / 1,000 * 10 = 1 IP
Yes, you can scrape 100 pages from a thousand websites every day with just a single IP if you spread the scrape and if the websites don't use Cloudflare or another distributed IP protection system.
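The same example through the sketch, which clamps the result to at least one IP:

```python
estimate_proxy_ips(100, scrape_within="day", repeat_every="day")  # 1
```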
Conclusion
The numbers above are ballpark figures, and on some websites they could be far off, but they should give you a good start for your own experimentation. If your crawlers perform well over multiple scrapes, you can try reducing the number of IPs. On the other hand, if you see blocking, try increasing it.