
limit concurrency for input downloads #45

Closed
rabernat opened this issue Jan 23, 2021 · 2 comments · Fixed by #557
Labels
design question (A question of the design of Pangeo Forge); executors (Related to executors and pipelines)

Comments

@rabernat (Contributor) opened this issue on Jan 23, 2021:

NOAA NCEI might not like it if we fire off hundreds of simultaneous requests to their servers. We would like to limit the concurrency of this step if possible.

From an API perspective, the question is:

  • Should a user have to specify concurrency limits as part of the recipe?
  • Alternatively, should we try to auto-detect if flows access certain resources (e.g. specific FTP servers) and then automatically enforce concurrency limits?

In terms of implementation, Prefect Cloud has a built-in solution, task concurrency limiting: https://docs.prefect.io/orchestration/concepts/task-concurrency-limiting.html

However, this only works with Prefect Cloud. Some questions about this option are:

  • Are we okay with getting locked into a Prefect Cloud feature?
  • What convention do we use for the task tags to indicate concurrency? For example, I could imagine a tag like www.ncei.noaa.gov, allowing us to limit concurrency for all requests to that server from all flows simultaneously! That would be pretty useful.
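One way to make the tag convention above concrete is to derive the tag from the URL's hostname, so a single limit covers every flow hitting the same server. The helper below is a hypothetical sketch of that convention, not existing Pangeo Forge code:

```python
# Hypothetical sketch: derive a concurrency-limit tag from a URL's hostname,
# so e.g. every request to www.ncei.noaa.gov shares one limit across flows.
from urllib.parse import urlparse


def concurrency_tag(url: str) -> str:
    """Return the hostname of `url` for use as a shared concurrency tag."""
    return urlparse(url).hostname
```

With this convention, `concurrency_tag("https://www.ncei.noaa.gov/data/file.nc")` yields `"www.ncei.noaa.gov"`, and the same tag would be produced for FTP URLs on that host.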

If we don't want to get locked into Cloud features, @jcrist made the following suggestion on the Prefect Slack:

I'd handle this with a distributed.Semaphore within your tasks for now. Alternatively, you could make use of dask's worker resources. Tasks tagged with tags of the form dask-resource:KEY=N will each take N amount of KEY resource. So you could limit active download tasks by creating a resource for downloading then tagging download tasks to mark that they require that resource.
That would mean that the total concurrency limit scales with the number of workers (so it isn't absolute across the whole run), but would also work and wouldn't block other tasks from running like the Semaphore would.
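The semaphore pattern in that suggestion can be sketched as follows. In a real flow you would use `distributed.Semaphore` (which exposes the same context-manager interface); `threading.Semaphore` is used here only so the sketch is self-contained, and the limit, URLs, and `limited_download` helper are illustrative assumptions:

```python
# Sketch of the semaphore approach: only MAX_CONCURRENT_DOWNLOADS tasks may
# be inside the `with` block at once; the rest wait for a free slot.
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DOWNLOADS = 3  # illustrative limit, not a recommended value
download_slots = threading.Semaphore(MAX_CONCURRENT_DOWNLOADS)


def limited_download(url: str) -> str:
    # Block until one of the MAX_CONCURRENT_DOWNLOADS slots is free.
    with download_slots:
        # Placeholder for the real HTTP/FTP fetch-and-cache logic.
        return f"downloaded {url}"


if __name__ == "__main__":
    urls = [f"https://www.ncei.noaa.gov/file{i}" for i in range(10)]
    # Even with 10 workers, at most 3 downloads run concurrently.
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(limited_download, urls))
```

As the comment notes, swapping in `distributed.Semaphore` makes the limit cluster-wide rather than per-process, at the cost of workers idling while they hold no lease.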

@rabernat added the design question and executors labels on Jan 23, 2021
@rabernat (Contributor, Author) commented on Mar 4, 2022:

The situation described in pangeo-forge/staged-recipes#108 (comment) adds another dimension to the concurrency story. That recipe pulls data over OPeNDAP, and with OPeNDAP the data loading happens during the store_chunk stage, not the cache_input stage.

If we follow the path outlined in #245, we may end up making significant changes to how Pangeo Forge works internally. That should give us the ability to attach a concurrency restriction on any stage of the pipeline. In pseudocode it may look something like

recipe = source | subset({'time': 100}) | load_with_xarray(concurrency=5) | zarr_destination
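The pipe-composed recipe above is pseudocode; a minimal sketch of how `|`-composable stages carrying a per-stage `concurrency` setting might be wired up is shown below. All of these class and stage names are hypothetical and do not exist in Pangeo Forge:

```python
# Hypothetical sketch: stages compose with `|` into a pipeline, and each
# stage can optionally carry its own concurrency limit for the executor.
class Stage:
    def __init__(self, name, concurrency=None):
        self.name = name
        self.concurrency = concurrency  # None means no limit for this stage

    def __or__(self, other):
        return Pipeline([self]) | other


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def __or__(self, other):
        return Pipeline(self.stages + [other])


recipe = (
    Stage("source")
    | Stage("subset")
    | Stage("load_with_xarray", concurrency=5)
    | Stage("zarr_destination")
)
```

An executor walking `recipe.stages` could then throttle only the stages that declare a limit, leaving the rest unconstrained.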

@cisaacstern (Member) commented:

xref #389, which is a duplicate I opened not remembering this was already here.

Aiming to fix this in #557
