Dynamic page scraping intervals

Takeru · June 19, 2024, 6:35pm

What is the default scraping interval for dynamic pages?
Is it set to, for example, run randomly between 1 and 5 seconds?

Tim · June 20, 2024, 4:02am

It’s not a random interval. If it’s set to 5s, it will always wait exactly that long.

Takeru · June 20, 2024, 6:56am

Is there any need to randomise?
Is there a problem with consistent intervals?

Tim · June 20, 2024, 7:20am

In general, not really.

Once you are on the page, it takes a random amount of time to load that page, scrape its content, and write it to the hard drive, and then for SCM to clean up and do it all again.

The scraping wait time is consistent, but the time to scrape and finish a page is always random.

Sites can rate limit you to stop you from essentially carrying out a DoS attack.

They generally don’t care whether you are accessing the server 1 second apart or 3 seconds apart, etc. I.e. it’s not the interval between requests that matters but the number of requests over a time period.
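To illustrate the point about counting requests per time period, here is a minimal sketch. It is not how any particular site or SCM works; the 60-second window, the timestamps, and the function name `max_requests_in_window` are all made up for the example. It shows that evenly spaced and irregularly spaced requests look the same to a limiter that only counts requests inside a window.

```python
from collections import deque


def max_requests_in_window(timestamps, window_seconds):
    """Return the largest number of requests that fall inside any sliding window.

    A request at time t shares a window with every earlier request that is
    newer than t - window_seconds.
    """
    window = deque()
    peak = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop requests that have aged out of the window ending at t.
        while window[0] <= t - window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak


# Hypothetical traffic: 20 requests exactly 3 s apart ...
fixed = [i * 3.0 for i in range(20)]
# ... versus 20 requests at irregular spacing over the same span.
irregular = [0, 1, 5, 6, 11, 13, 18, 19, 24, 28,
             30, 35, 37, 42, 44, 49, 50, 55, 56, 57]

fixed_peak = max_requests_in_window(fixed, 60)
irregular_peak = max_requests_in_window(irregular, 60)
print(fixed_peak, irregular_peak)
```

Both patterns put the same number of requests into one minute, so a count-based limit would treat them identically. Randomising the gaps changes nothing unless it also lowers the total.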

If you are worried about being detected as a ‘bot’, most sites will throw up JavaScript checks and captchas instead of checking your access intervals. I know Cloudflare can check your browser to see if you are a bot. However, from what I know it’s just a simple JavaScript check, which will stop the static scraper but, most of the time, not the dynamic scraper.

However, rate limits are how they stop you in the end.

Takeru · June 20, 2024, 7:24am

I see.
So we do not need to specify an interval then?
So what does this setting mean?

Tim · June 20, 2024, 7:26am

It adds a wait time between opening each new browser.

It’s an optional setting.

If you are trying to scrape 500 URLs from the same domain, adding this wait time will slow down the number of requests to the same server.

But the wait time is fixed, not randomized.
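As a rough picture of what a fixed wait between browser launches looks like, here is a small sketch. It is not SCM’s actual code: `scrape_all`, `scrape_page`, and `wait_seconds` are hypothetical names, and skipping the wait before the very first URL is an assumption made for the example.

```python
import time


def scrape_all(urls, scrape_page, wait_seconds=0.0, sleep=time.sleep):
    """Scrape each URL in order, pausing a fixed time before every launch after the first.

    scrape_page is a hypothetical callable standing in for "open a browser,
    load the page, scrape it". sleep is injectable so the pause can be faked.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0 and wait_seconds > 0:
            sleep(wait_seconds)  # same value every time, never randomised
        results.append(scrape_page(url))
    return results


# Demo with a fake sleep that only records the requested pauses,
# so nothing actually blocks here.
recorded_waits = []
pages = scrape_all(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    scrape_page=lambda url: url.rsplit("/", 1)[-1],
    wait_seconds=5,
    sleep=recorded_waits.append,
)
print(pages, recorded_waits)
```

With the setting left at 0 the loop runs back to back, which matches the setting being optional.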

Takeru · June 20, 2024, 7:27am

When is it necessary to add waiting time?

Tim · June 20, 2024, 7:28am

Only if you get some form of 403 after accessing the same domain multiple times.

It’s rare that you would need it though.

Even using a task to access 500 URLs on the same domain, the fact that each page has to fully load and then wait for elements to appear means you can’t really go through more than 1 URL every 5 secs.
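For a sense of scale, here is the arithmetic behind that estimate as a quick sketch. The 5 seconds per page and the 500 URLs come from the post above; the extra 5-second wait in the second call and the function name `crawl_estimate` are only assumptions for illustration.

```python
def crawl_estimate(url_count, seconds_per_page, extra_wait=0.0):
    """Return (total minutes, page loads per minute) for a sequential crawl."""
    per_url = seconds_per_page + extra_wait
    total_minutes = url_count * per_url / 60
    loads_per_minute = 60 / per_url
    return total_minutes, loads_per_minute


# 500 URLs at roughly 5 s each, with no added wait.
no_wait = crawl_estimate(500, 5)
# The same crawl with a hypothetical extra 5 s wait before each page.
with_wait = crawl_estimate(500, 5, extra_wait=5)

print(f"no wait:   {no_wait[0]:.1f} min, {no_wait[1]:.0f} page loads/min")
print(f"with wait: {with_wait[0]:.1f} min, {with_wait[1]:.0f} page loads/min")
```

So the natural page-load time already caps the crawl at about 12 page loads a minute, and adding a wait only lowers that further.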

Takeru · June 20, 2024, 7:28am

Understood.