How to process file and image downloads? #2222
Hello @johannesstricker and thank you for your interest in the Crawlee project! There are multiple ways to download linked files with Crawlee - at least two.

**1) The easy way**

In the `requestHandler`, you can use the `sendRequest` helper to download the files directly:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, sendRequest }) => {
        // Get file links from your target page here (returns a list of urls)
        const links = $('img').toArray().map((e) => $(e).attr('src')).filter((e) => e);

        // Download your files one by one using the sendRequest helper
        for (const link of links) {
            const { headers, body, rawBody } = await sendRequest({ url: new URL(link, request.url).href });
            // Do whatever you want with the response
            console.log(headers);
        }
    },
});
```

While this is quite straightforward (and perfectly fine for smaller scrapers), it's not very resilient - if you expect the crawler to fail and restart, all the file requests get processed again and again. It also downloads the files one by one, which can become a performance bottleneck if you have many files.

**2) The "proper" way**

With modern versions of Crawlee, you can run multiple crawlers in parallel, which makes downloading files much more idiomatic and declarative. Consider the following example:

```js
import { BasicCrawler, CheerioCrawler, Configuration } from 'crawlee';

const startUrls = ['...'];
// Create a separate crawler for downloading files
const filesCrawler = new BasicCrawler({
    requestHandler: async ({ sendRequest }) => {
        // Call to sendRequest gets you your file data...
        const { headers, rawBody, body } = await sendRequest();
        // Process your data however you like.
        console.log(headers['content-type']);
    },
    // keepAlive keeps the crawler running until you call .teardown()
    keepAlive: true,
}, new Configuration({
    // Set the default key-value store, dataset and request queue to 'files',
    // so the crawlers don't interfere with each other
    defaultKeyValueStoreId: 'files',
    defaultDatasetId: 'files',
    defaultRequestQueueId: 'files',
}));
// Your basic crawler for crawling the pages.
const mainCrawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        console.log(`Processing ${request.url}`);

        // Get file links from your target page here (returns a list of urls)
        const links = $('img').toArray().map((e) => $(e).attr('src')).filter((e) => e);

        // Enqueue the file links to the filesCrawler
        await filesCrawler.addRequests(
            links.map((e) => new URL(e, request.url).href),
            { waitForAllRequestsToBeAdded: true },
        );
    },
});
// Start the files crawler and wait for new file requests to process.
filesCrawler.run();
await mainCrawler.run(startUrls);
const interval = setInterval(async () => {
    // Stop the files crawler once the mainCrawler is done and the file queue is empty
    if ((await filesCrawler.requestQueue?.isEmpty()) ?? true) {
        await filesCrawler.teardown();
        clearInterval(interval);
    } else {
        console.log('Waiting for the file downloads to finish...');
    }
}, 5000);
```

As you can see, we're mirroring Scrapy's file pipelines with a separate crawler dedicated to downloading files. This has multiple benefits: the file requests are parallelized automatically, each request is processed independently of the others, and no file request is ever processed more than once (the request queue deduplicates requests by URL).
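Neither snippet above actually stores the downloaded data anywhere, so here is a minimal sketch of how the `filesCrawler`'s `requestHandler` could persist each file into a named key-value store. This is just one possible approach, not an official Crawlee pipeline - the store name `files`, the key-sanitizing regex and the fallback content type are arbitrary choices, the `Configuration` from the example above is omitted for brevity, and if you specifically need S3 you would call an S3 client (e.g. the AWS SDK) here instead:

```js
import { BasicCrawler, KeyValueStore } from 'crawlee';

const filesCrawler = new BasicCrawler({
    requestHandler: async ({ request, sendRequest }) => {
        // rawBody is the response as a Buffer, headers carry the content type
        const { headers, rawBody } = await sendRequest();

        // Key-value store keys only allow a limited character set, so sanitize the URL
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');

        // Persist the raw file contents into a named key-value store
        const store = await KeyValueStore.open('files');
        await store.setValue(key, rawBody, {
            contentType: headers['content-type'] ?? 'application/octet-stream',
        });
    },
    keepAlive: true,
});
```

With the default local storage, the saved files end up under `./storage/key_value_stores/files`.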
Does this answer your question? Mind that I'm not an active Scrapy user, so maybe I got confused at some point or digressed. If so, please let me know... and if not, but you still have some details you want to discuss, let me know as well :) Thanks!
Scrapy has file and image pipelines, which allow you to download and process files or images - see https://docs.scrapy.org/en/latest/topics/media-pipeline.html for details. Notably, they let you save images directly to an S3 bucket.
Is there something similar in Crawlee? How do you handle file downloads?