How to process file and image downloads? #2222
Hello @johannesstricker and thank you for your interest in the Crawlee project! There are multiple ways to download linked files with Crawlee - at least two.

**1) The easy way**

In the `requestHandler`, you can use the `sendRequest` helper to download the files directly:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, sendRequest }) => {
        // Get file links from your target page here (returns a list of urls)
        const links = $('img').toArray().map((e) => $(e).attr('src')).filter((e) => e);

        // Download your files one by one using the sendRequest helper
        for (const link of links) {
            const { headers, body, rawBody } = await sendRequest({ url: new URL(link, request.url).href });
            // Do whatever you want with the response
            console.log(headers);
        }
    },
});
```

While this is quite straightforward (and perfectly fine for smaller scrapers), it's not very resilient - if you expect the crawler to fail and restart, all the file requests get processed again and again. It also downloads the files one by one, which can become a performance bottleneck if you have many files.

**2) The "proper" way**

With modern versions of Crawlee, you can run multiple crawlers in parallel, which makes downloading files much more idiomatic and declarative. Consider the following example:

```js
import { BasicCrawler, CheerioCrawler, Configuration } from 'crawlee';

const startUrls = ['...'];
// Create a separate crawler for downloading files
const filesCrawler = new BasicCrawler({
    requestHandler: async ({ sendRequest }) => {
        // Call to sendRequest gets you your file data...
        const { headers, rawBody, body } = await sendRequest();
        // Process your data however you like.
        console.log(headers['content-type']);
    },
    // keepAlive keeps the crawler running until you call .teardown()
    keepAlive: true,
}, new Configuration({
    // Set the default key-value store, dataset and request queue to 'files',
    // so the crawlers don't interfere with each other
    defaultKeyValueStoreId: 'files',
    defaultDatasetId: 'files',
    defaultRequestQueueId: 'files',
}));
// Your basic crawler for crawling the pages.
const mainCrawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        console.log(`Processing ${request.url}`);

        // Get file links from your target page here (returns a list of urls)
        const links = $('img').toArray().map((e) => $(e).attr('src')).filter((e) => e);

        // Enqueue the file links to the filesCrawler
        await filesCrawler.addRequests(
            links.map((e) => new URL(e, request.url).href),
            { waitForAllRequestsToBeAdded: true },
        );
    },
});
// Start the files crawler and wait for new file requests to process.
filesCrawler.run();
await mainCrawler.run(startUrls);
const interval = setInterval(async () => {
    // Stop the files crawler once the mainCrawler is done and the file queue is empty
    if ((await filesCrawler.requestQueue?.isEmpty()) ?? true) {
        await filesCrawler.teardown();
        clearInterval(interval);
    } else {
        console.log('Waiting for the file downloads to finish...');
    }
}, 5000);
```

As you can see, we're mirroring Scrapy's file pipelines with a separate crawler dedicated to downloading files. This has multiple benefits: the file requests are parallelized automatically, each request is processed independently of the others, and no file request is ever processed more than once (the request queue deduplicates requests by URL).
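Neither snippet above actually stores the downloaded data anywhere, so here is a minimal sketch of how the `filesCrawler`'s `requestHandler` could persist each file into a named key-value store. This is just one possible approach, not an official Crawlee pipeline - the store name `files`, the key-sanitizing regex and the fallback content type are arbitrary choices, the `Configuration` from the example above is omitted for brevity, and if you specifically need S3 you would call an S3 client (e.g. the AWS SDK) here instead:

```js
import { BasicCrawler, KeyValueStore } from 'crawlee';

const filesCrawler = new BasicCrawler({
    requestHandler: async ({ request, sendRequest }) => {
        // rawBody is the response as a Buffer, headers carry the content type
        const { headers, rawBody } = await sendRequest();

        // Key-value store keys only allow a limited character set, so sanitize the URL
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');

        // Persist the raw file contents into a named key-value store
        const store = await KeyValueStore.open('files');
        await store.setValue(key, rawBody, {
            contentType: headers['content-type'] ?? 'application/octet-stream',
        });
    },
    keepAlive: true,
});
```

With the default local storage, the saved files end up under `./storage/key_value_stores/files`.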
Does this answer your question? Mind that I'm not an active Scrapy user, so maybe I got confused at some point or digressed. If so, please let me know... and if not, but you still have some details you want to discuss, let me know as well :) Thanks!
Scrapy has file and image pipelines, which allow you to download and process files or images - see https://docs.scrapy.org/en/latest/topics/media-pipeline.html for details. Notably, they let you save images directly to an S3 bucket.
Is there something similar in Crawlee? How do you handle file downloads?