Request throttle for npm package website-scraper

February 14, 2020

In one of my projects, I needed to create an archived copy of any given website. So I decided quite early on to build a Node script that would handle the fetching and scraping. At first I was going to build something that would just wrap wget or curl (basically just run a CLI command).

But looking around, I quickly stumbled upon the npm package website-scraper (as of this writing, version 4.2.0). All in all the scraper works perfectly, and has almost all of the options that I need.

One problem that I ran into was that the scraper is a bit too efficient. It fetches a page, scans it for links, and then starts downloading those links. Rinse, repeat.

As a result you’re going to hit the target server quite hard, especially if you have a lot of bandwidth. In my case, the scraper accidentally triggered an anti-DoS protection system, and subsequently, I got banned. Rightfully so.

Anyway, website-scraper doesn’t natively support throttling requests. However, it does offer quite an extensive plugin interface, with hooks into most of the processing points.

Throttling plugin

Consequently, I built my own throttler, as a plugin. It hooks into the beforeRequest hook. The plugin holds every request (with async/await) by wrapping it in a timeout function, which in turn is placed in a queue. The queue is then worked through, never running more than the configured number of delays at once, until it is empty.

Flow diagram for website-scraper throttle plugin

I’ve added a bunch of configurable options for minimum timeout, maximum timeout (the script randomizes a timeout between these two values) and concurrent requests (i.e. how many queued requests can be counting down their delay at the same time; a request is released when its delay ends, and the plugin does not track when the download itself finishes). As a result you can fine-tune the throttle quite easily.
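As a rough aid for picking values, the sustained request rate can be estimated from the three options. The sketch below assumes the delay is spread evenly between the minimum and the maximum, and that the queue never runs dry, so each concurrent slot releases one request per average delay. `estimateRequestsPerSecond` is a hypothetical helper, not part of the plugin or of website-scraper.

```javascript
// Hypothetical helper: rough upper bound on released requests per second.
// Assumes evenly distributed delays and a queue that never runs dry.
function estimateRequestsPerSecond(minMs, maxMs, maxConcurrent) {
    const meanDelayMs = (minMs + maxMs) / 2;

    if (meanDelayMs === 0) {
        // No delay configured: the throttle imposes no rate limit.
        return Infinity;
    }

    // Each slot releases one request per mean delay.
    return maxConcurrent / (meanDelayMs / 1000);
}

// 150 ms minimum, 1500 ms maximum, 5 concurrent slots.
console.log(estimateRequestsPerSecond(150, 1500, 5).toFixed(2)); // prints "6.06"
```

This is only an estimate of how fast requests are released. The actual load on the target server also depends on how long each download takes.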

The code

You should place the following class in its own file.

class ScrapeThrottler {
    constructor(minThrottle = 0, maxThrottle = 0, maxConcurrentRequests = 10) {
        this.minThrottle = minThrottle;
        this.maxThrottle = maxThrottle;
        this.maxConcurrentRequests = maxConcurrentRequests;
        this.processing = 0;
        this.requestQueue = [];
    }

    // Pick a random delay (ms) spread evenly across [min, max].
    generateTimeoutMs(min, max) {
        if (max <= min) {
            return min;
        }

        return Math.round(min + Math.random() * (max - min));
    }

    // Start the next queued request, unless the concurrency limit is reached.
    handleQueue() {
        const reachedMaxConcurrentRequests = this.processing >= this.maxConcurrentRequests;
        const queueIsEmpty = this.requestQueue.length === 0;

        if (reachedMaxConcurrentRequests || queueIsEmpty) {
            return;
        }

        // shift() takes the oldest entry, so requests leave in arrival order (FIFO).
        const nextRequest = this.requestQueue.shift();

        nextRequest();
    }

    delay(callback, timeout) {
        this.processing++;

        setTimeout(
            () => {
                this.processing--;
                this.handleQueue();

                callback();
            },
            timeout
        );
    }

    // Queue a request; the returned promise resolves once its delay has elapsed.
    throttleRequest(timeout) {
        return new Promise(resolve => {
            this.requestQueue.push(() => {
                this.delay(resolve, timeout);
            });

            this.handleQueue();
        });
    }

    apply(registerAction) {
        registerAction('beforeRequest', async ({requestOptions}) => {
            const timeout = this.generateTimeoutMs(this.minThrottle, this.maxThrottle);

            await this.throttleRequest(timeout);

            return {requestOptions};
        });
    }
}

module.exports = ScrapeThrottler;
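To check the timing without hitting a real server, the plugin can be driven by hand through the same `apply(registerAction)` interface used above. The following is a minimal sketch under that assumption. `collectHandlers`, `measureReleaseTimes` and `FixedDelayPlugin` are hypothetical names made up for this example, and the handler is called with `resource: null`, which is a simplification of what the scraper passes in.

```javascript
// Collect the handlers a plugin registers, keyed by action name.
function collectHandlers(plugin) {
    const handlers = {};

    plugin.apply((actionName, handler) => {
        handlers[actionName] = handler;
    });

    return handlers;
}

// Fire `count` simulated requests at once and report when each one
// was released, in ms since the start.
async function measureReleaseTimes(plugin, count) {
    const { beforeRequest } = collectHandlers(plugin);
    const start = Date.now();

    const calls = Array.from({ length: count }, async () => {
        await beforeRequest({ resource: null, requestOptions: {} });
        return Date.now() - start;
    });

    return Promise.all(calls);
}

// Hypothetical stand-in plugin so the sketch runs on its own.
// Swap in `new ScrapeThrottler(...)` to inspect the real plugin.
class FixedDelayPlugin {
    constructor(delayMs) {
        this.delayMs = delayMs;
    }

    apply(registerAction) {
        registerAction('beforeRequest', async ({ requestOptions }) => {
            await new Promise(resolve => setTimeout(resolve, this.delayMs));
            return { requestOptions };
        });
    }
}

measureReleaseTimes(new FixedDelayPlugin(20), 3)
    .then(times => console.log('released after (ms):', times));
```

With the throttler in place of the stand-in, the release times should arrive in batches no larger than the concurrency limit, spaced by the configured delays.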

Usage

Using the throttler is quite easy. You just add a plugin instance to the plugins array in the options object that you pass to website-scraper.

const scrape = require('website-scraper');
const ScrapeThrottler = require('./ScrapeThrottler');

const options = {
    [...]
    plugins: [
        new ScrapeThrottler(
            150,   // Minimum delay before a request, ms
            1500,  // Maximum delay before a request, ms
            5      // Concurrent requests
        )
    ]
};

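// await is not available at the top level of a CommonJS script, so wrap the call.
(async () => {
    await scrape(options);
})();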

And you’re all set.
