
Limit concurrency for caching #389

Closed

cisaacstern opened this issue Jul 22, 2022 · 8 comments

@cisaacstern (Member) commented Jul 22, 2022

On the first production run of https://github.com/pangeo-forge/terraclimate-feedstock, Dataflow autoscaled the cluster to 1000 workers, in response to the slow throughput of caching ~882 inputs (totaling ~1.9 TB).

We should be able to limit concurrency for caching, given that the source file servers will generally be bandwidth-constrained. Dataflow provides a max_num_workers option to cap the size of the worker pool, but this issue is separate from that concern: concurrency should be limited only for the caching step, and then we should support larger scale-out after data is cached.

There must be a more formal discussion of this somewhere in the Beam docs, but for now the most direct discussion I've found is in the replies to https://stackoverflow.com/a/65634538, which suggest GroupByKey might be used to achieve this.
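
For illustration, a minimal sketch of that GroupByKey idea (MAX_CONCURRENCY, cache_one, and the example URLs are all made-up names, not pangeo-forge-recipes code): assign each URL a key from a small keyspace, group by that key, and cache each group's values sequentially. Since each group's values arrive as a single iterable processed by one worker, parallelism is bounded by the number of distinct keys.

import random

import apache_beam as beam

MAX_CONCURRENCY = 10  # hypothetical cap on simultaneous downloads


def cache_one(url):
    ...  # placeholder for the actual download-into-cache call


def cache_group(keyed_urls):
    _, group = keyed_urls
    for url in group:   # one key's values are consumed by a single worker,
        cache_one(url)  # so these downloads run sequentially
        yield url


with beam.Pipeline() as p:
    (
        p
        | beam.Create([f"https://example.com/file{i}.nc" for i in range(100)])
        | "AssignKey" >> beam.Map(lambda url: (random.randrange(MAX_CONCURRENCY), url))
        | "FanIn" >> beam.GroupByKey()
        | "CacheSequentially" >> beam.FlatMap(cache_group)
    )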

I believe this will require pulling caching out from OpenURLWithFSSpec. Currently, if a cache argument is provided to OpenURLWithFSSpec, the input is cached and then immediately opened from the cache:

cache.cache_file(url, secrets, **kw)         # download the source file into the cache
open_file = cache.open_file(url, mode="rb")  # then open it back out of the cache

In order to limit concurrency for the caching, but not for the opening, I believe caching will need to be its own transform, the output of which is then passed to OpenURLWithFSSpec, which does not do any caching.
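
As a sketch of what that split might look like (CacheURL is a hypothetical name, not an existing transform; it reuses the cache.cache_file call shown above):

import apache_beam as beam


class CacheURL(beam.PTransform):
    """Hypothetical transform: cache each URL, then emit it unchanged."""

    def __init__(self, cache, secrets=None, open_kwargs=None):
        super().__init__()
        self.cache = cache
        self.secrets = secrets
        self.open_kwargs = open_kwargs or {}

    def _cache(self, url):
        # download into the cache; the element itself passes through
        self.cache.cache_file(url, self.secrets, **self.open_kwargs)
        return url

    def expand(self, pcoll):
        return pcoll | beam.Map(self._cache)


# usage sketch:
#   cached = urls | CacheURL(cache=my_cache)   # concurrency-limited stage
#   opened = cached | OpenURLWithFSSpec()      # no cache argument; scales freely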

cc @rabernat @alxmrs, xref #376

@alxmrs (Contributor) commented Jul 22, 2022

Something that I just learned about in Beam is resource hints (https://beam.apache.org/documentation/runtime/resource-hints/). It sounds like this could pair really well with breaking out file caching.
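
For reference, hints attach per transform in recent Python SDKs, so a split-out caching step could request different resources than the rest of the pipeline. A minimal sketch (cache_one and the "4GiB" value are placeholders):

import apache_beam as beam


def cache_one(url):
    ...  # stand-in for the caching work
    return url


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(["https://example.com/a.nc"])  # placeholder input
        # the hint applies only to this step, not the whole pipeline
        | "Cache" >> beam.Map(cache_one).with_resource_hints(min_ram="4GiB")
    )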

@alxmrs (Contributor) commented Jul 22, 2022

After taking a look at OpenURLWithFSSpec, I agree that caching should be a separate PTransform.

@cisaacstern (Member, Author) commented

Noting that pangeo-forge/paleo-pism-feedstock#2 is blocked by this. Looks like we will not be able to deploy a production run of that feedstock until we have some way to limit concurrency during the caching stage. cc @jkingslake

@alxmrs (Contributor) commented Sep 8, 2022

Here's an example that could help: for this kind of problem, we use this RateLimit transform: https://github.com/google/weather-tools/blob/main/weather_mv/loader_pipeline/util.py#L282
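
(Not reproducing that implementation here, but the general shape, building on the GroupByKey sketch above, is to pace requests within a fixed number of groups; NUM_SHARDS and SECONDS_PER_REQUEST are made-up knobs, and cache_one is a placeholder.)

import random
import time

import apache_beam as beam

NUM_SHARDS = 5             # hypothetical number of concurrent "lanes"
SECONDS_PER_REQUEST = 1.0  # hypothetical pacing within each lane


def cache_one(url):
    ...  # stand-in for the download-into-cache call


def paced_cache(keyed_urls):
    _, group = keyed_urls
    for url in group:
        cache_one(url)
        # aggregate rate is bounded by NUM_SHARDS / SECONDS_PER_REQUEST
        time.sleep(SECONDS_PER_REQUEST)
        yield url


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([f"https://example.com/file{i}.nc" for i in range(20)])
        | beam.Map(lambda url: (random.randrange(NUM_SHARDS), url))
        | beam.GroupByKey()
        | beam.FlatMap(paced_cache)
    )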

@cisaacstern (Member, Author) commented

Belated thanks for sharing this example, @alxmrs. 🙏

@jkingslake commented

Hi all,
As always, thanks for all the work you do with these tools!

Any updates on this issue? As noted above, it is stopping progress on pangeo-forge/paleo-pism-feedstock#2

@cisaacstern (Member, Author) commented

For those still following this, I am now working on a fix in #557.

@cisaacstern (Member, Author) commented

This is fixed by #557. @jkingslake, please feel free to ping me on your recipe thread if you'd like to work together to revive it in light of this fix. And thanks for your patience!
