
Limit concurrency for caching #389

Closed

cisaacstern opened this issue Jul 22, 2022 · 8 comments

@cisaacstern (Member) commented Jul 22, 2022

On the first production run of https://github.com/pangeo-forge/terraclimate-feedstock, Dataflow autoscaled the cluster to 1000 workers, in response to the slow throughput of caching ~882 inputs (totaling ~1.9 TB).

We should be able to limit concurrency for caching, given that the source file servers will generally be bandwidth-constrained. Dataflow provides a max_num_workers option to cap the size of the worker pool, but this issue is separate from that concern: concurrency should be limited only for the caching step, and then we should support larger scale-out after data is cached.

There must be a more formal discussion of this somewhere in the Beam docs, but for now the most direct discussion I've found is in the replies to https://stackoverflow.com/a/65634538, which suggest GroupByKey might be used to achieve this.
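
For illustration, a minimal sketch of that GroupByKey idea (MAX_CONCURRENCY, cache_one, and the example URLs are all made-up names, not pangeo-forge-recipes code): assign each URL a key from a small keyspace, group by that key, and cache each group's values sequentially. Since each group's values arrive as a single iterable processed by one worker, parallelism is bounded by the number of distinct keys.

import random

import apache_beam as beam

MAX_CONCURRENCY = 10  # hypothetical cap on simultaneous downloads


def cache_one(url):
    ...  # placeholder for the actual download-into-cache call


def cache_group(keyed_urls):
    _, group = keyed_urls
    for url in group:   # one key's values are consumed by a single worker,
        cache_one(url)  # so these downloads run sequentially
        yield url


with beam.Pipeline() as p:
    (
        p
        | beam.Create([f"https://example.com/file{i}.nc" for i in range(100)])
        | "AssignKey" >> beam.Map(lambda url: (random.randrange(MAX_CONCURRENCY), url))
        | "FanIn" >> beam.GroupByKey()
        | "CacheSequentially" >> beam.FlatMap(cache_group)
    )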

I believe this will require pulling caching out from OpenURLWithFSSpec. Currently, if a cache argument is provided to OpenURLWithFSSpec, the input is cached and then immediately opened from the cache:

cache.cache_file(url, secrets, **kw)         # download the source file into the cache
open_file = cache.open_file(url, mode="rb")  # then open it back out of the cache

In order to limit concurrency for the caching, but not for the opening, I believe caching will need to be its own transform, the output of which is then passed to OpenURLWithFSSpec, which does not do any caching.
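
As a sketch of what that split might look like (CacheURL is a hypothetical name, not an existing transform; it reuses the cache.cache_file call shown above):

import apache_beam as beam


class CacheURL(beam.PTransform):
    """Hypothetical transform: cache each URL, then emit it unchanged."""

    def __init__(self, cache, secrets=None, open_kwargs=None):
        super().__init__()
        self.cache = cache
        self.secrets = secrets
        self.open_kwargs = open_kwargs or {}

    def _cache(self, url):
        # download into the cache; the element itself passes through
        self.cache.cache_file(url, self.secrets, **self.open_kwargs)
        return url

    def expand(self, pcoll):
        return pcoll | beam.Map(self._cache)


# usage sketch:
#   cached = urls | CacheURL(cache=my_cache)   # concurrency-limited stage
#   opened = cached | OpenURLWithFSSpec()      # no cache argument; scales freely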

cc @rabernat @alxmrs, xref #376

@alxmrs (Contributor) commented Jul 22, 2022

Something that I just learned about in Beam is resource hints (https://beam.apache.org/documentation/runtime/resource-hints/). It sounds like this could pair really well with breaking out file caching.
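
For reference, hints attach per transform in recent Python SDKs, so a split-out caching step could request different resources than the rest of the pipeline. A minimal sketch (cache_one and the "4GiB" value are placeholders):

import apache_beam as beam


def cache_one(url):
    ...  # stand-in for the caching work
    return url


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(["https://example.com/a.nc"])  # placeholder input
        # the hint applies only to this step, not the whole pipeline
        | "Cache" >> beam.Map(cache_one).with_resource_hints(min_ram="4GiB")
    )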

@alxmrs (Contributor) commented Jul 22, 2022

After taking a look at OpenURLWithFSSpec, I agree that caching should be a separate PTransform.

@cisaacstern (Member, Author) commented

Noting that pangeo-forge/paleo-pism-feedstock#2 is blocked by this. Looks like we will not be able to deploy a production run of that feedstock until we have some way to limit concurrency during the caching stage. cc @jkingslake

@alxmrs (Contributor) commented Sep 8, 2022

Here's an example that could help: for this kind of problem, we use this RateLimit transform: https://github.com/google/weather-tools/blob/main/weather_mv/loader_pipeline/util.py#L282
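
(Not reproducing that implementation here, but the general shape, building on the GroupByKey sketch above, is to pace requests within a fixed number of groups; NUM_SHARDS and SECONDS_PER_REQUEST are made-up knobs, and cache_one is a placeholder.)

import random
import time

import apache_beam as beam

NUM_SHARDS = 5             # hypothetical number of concurrent "lanes"
SECONDS_PER_REQUEST = 1.0  # hypothetical pacing within each lane


def cache_one(url):
    ...  # stand-in for the download-into-cache call


def paced_cache(keyed_urls):
    _, group = keyed_urls
    for url in group:
        cache_one(url)
        # aggregate rate is bounded by NUM_SHARDS / SECONDS_PER_REQUEST
        time.sleep(SECONDS_PER_REQUEST)
        yield url


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([f"https://example.com/file{i}.nc" for i in range(20)])
        | beam.Map(lambda url: (random.randrange(NUM_SHARDS), url))
        | beam.GroupByKey()
        | beam.FlatMap(paced_cache)
    )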

@cisaacstern (Member, Author) commented

Belated thanks for sharing this example, @alxmrs. 🙏

@jkingslake commented

Hi all,
As always, thanks for all the work you do with these tools!

Any updates on this issue? As noted above, it is stopping progress on pangeo-forge/paleo-pism-feedstock#2

@cisaacstern (Member, Author) commented

For those still following this, I am now working on a fix in #557.

@cisaacstern (Member, Author) commented

This is fixed by #557. @jkingslake, please feel free to ping me on your recipe thread if you'd like to work together to revive it in light of this fix. And thanks for your patience!
