
Optimisation/best-practice xarray and dask programming patterns #210

Open
aidanheerdegen opened this issue Aug 28, 2020 · 5 comments
@aidanheerdegen (Collaborator) commented Aug 28, 2020

Many people report problems with running calculations on large datasets, and would like some general advice on the best approaches for tackling large problems.

There are lots of parameters that determine the success/efficiency of a calculation:

  1. Order of operations
  2. Calculating intermediate results
  3. Dask chunking
  4. netCDF chunking on disk
  5. Number of dask workers (or not using a scheduler/dask at all)
  6. Number of threads and amount of memory per worker

It becomes very complex very quickly.
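To make this concrete, here is a minimal sketch of where several of these knobs appear in code. The file path, variable name, chunk sizes, and cluster settings are placeholders rather than recommendations; good values depend on the data and the machine.

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

# Knobs 5 and 6: number of workers, and threads/memory per worker.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="8GB")
client = Client(cluster)

# Knob 3: dask chunking, ideally aligned with the netCDF chunking
# on disk (knob 4) so each dask task reads whole disk chunks.
ds = xr.open_mfdataset("output*/ocean_temp.nc", chunks={"time": 1, "st_ocean": 7})

# Knobs 1 and 2: order of operations and intermediate results.
# Reducing early (here, a time mean) keeps the task graph small, and
# persist() materialises the intermediate result in worker memory.
temp_mean = ds["temp"].mean("time").persist()
```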

One approach is to have some representative test calculations that can be used as targets for optimisation. These test calculations can then be run whenever there are infrastructure or algorithm changes, to check that performance has not degraded and to see whether it can be further improved.
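As a sketch of what such a check could look like (nothing like this exists in the cookbook yet; the 1.5× threshold is an arbitrary placeholder):

```python
import time

def run_benchmark(name, compute, reference_seconds):
    """Time a representative calculation and warn about regressions."""
    start = time.perf_counter()
    compute()  # e.g. lambda: result.load(), forcing the dask computation
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s (reference {reference_seconds:.1f}s)")
    if elapsed > 1.5 * reference_seconds:
        print(f"WARNING: {name} regressed by more than 50%")
```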

If that sounds like a useful idea, then we need people to propose calculations that they know to be strenuous as candidates for optimisation/best-practicification*. Ideally these would be fairly compact, reproducible chunks of code.

ping @AndyHoggANU @aekiss @adele-morrison @navidcy @angus-g

  * not a real word
@navidcy (Collaborator) commented Sep 14, 2021

OK, here's one!

https://gist.github.com/navidcy/b12e5469d1a809cc4c9b447456da1fe5

(better viewed in nbviewer)

cc: @ongqingyee and @angus-g. @angus-g, this is the one I was chatting with you about yesterday.

I'm guessing that I should save the interpolated fields and reload them... But this might be just my random (or semi-educated) guess...
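A sketch of that save-and-reload pattern, assuming `interpolated` is an xarray Dataset holding the expensive intermediate (the file name and chunk sizes are placeholders):

```python
import xarray as xr

# Writing the intermediate to disk forces the computation once.
interpolated.to_netcdf("interpolated_fields.nc")

# Reloading lazily gives downstream steps a short, fresh task graph.
interpolated = xr.open_dataset("interpolated_fields.nc", chunks={"time": 1})
```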

navidcy pinned this issue Sep 14, 2021
@navidcy (Collaborator) commented Sep 14, 2021

Actually, I've now noticed that this MnWE might not be as relevant here, since it doesn't use the cookbook... Oh well...

@angus-g (Collaborator) commented Sep 14, 2021

The cookbook only really wraps the act of getting the data in the first place, so it's the actual (attempted) computation that's more important, IMO. Thanks for the example! I'll take a look.

micaeljtoliveira unpinned this issue Jun 9, 2022
@access-hive-bot commented

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-cookbook-updating-needs/130/2
