
Optimisation/best-practice xarray and dask programming patterns #210

Open
aidanheerdegen opened this issue Aug 28, 2020 · 5 comments
@aidanheerdegen (Collaborator) commented Aug 28, 2020

Many people report problems with running calculations on large datasets, and would like some general advice on the best approaches for tackling large problems.

There are lots of parameters that determine the success/efficiency of a calculation:

  1. Order of operations
  2. Calculating intermediate results
  3. Dask chunking
  4. netCDF chunking on disk
  5. Number of dask workers (or not using a scheduler/dask at all)
  6. Number of threads and amount of memory per worker

It becomes very complex very quickly.
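To make this concrete, here is a minimal sketch of where several of these knobs appear in code. The file path, variable name, chunk sizes, and cluster settings are placeholders rather than recommendations; good values depend on the data and the machine.

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

# Knobs 5 and 6: number of workers, and threads/memory per worker.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="8GB")
client = Client(cluster)

# Knob 3: dask chunking, ideally aligned with the netCDF chunking
# on disk (knob 4) so each dask task reads whole disk chunks.
ds = xr.open_mfdataset("output*/ocean_temp.nc", chunks={"time": 1, "st_ocean": 7})

# Knobs 1 and 2: order of operations and intermediate results.
# Reducing early (here, a time mean) keeps the task graph small, and
# persist() materialises the intermediate result in worker memory.
temp_mean = ds["temp"].mean("time").persist()
```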

One approach is to have some representative test calculations that can be used as targets for optimisation. These test calculations can then be run whenever there are infrastructure or algorithm changes, to check that performance has not degraded and to see whether it can be further improved.
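As a sketch of what such a check could look like (nothing like this exists in the cookbook yet; the 1.5× threshold is an arbitrary placeholder):

```python
import time

def run_benchmark(name, compute, reference_seconds):
    """Time a representative calculation and warn about regressions."""
    start = time.perf_counter()
    compute()  # e.g. lambda: result.load(), forcing the dask computation
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s (reference {reference_seconds:.1f}s)")
    if elapsed > 1.5 * reference_seconds:
        print(f"WARNING: {name} regressed by more than 50%")
```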

If that sounds like a useful idea, then we need people to propose calculations that they know to be strenuous as candidates for optimisation/best-practicification*. Ideally these would be fairly compact, reproducible chunks of code.

ping @AndyHoggANU @aekiss @adele-morrison @navidcy @angus-g

  * not a real word
@navidcy (Collaborator) commented Sep 14, 2021

OK, here's one!

https://gist.github.com/navidcy/b12e5469d1a809cc4c9b447456da1fe5

(better viewed in nbviewer)

cc: @ongqingyee and @angus-g. @angus-g, this is the one I was chatting with you about yesterday.

I'm guessing that I should save the interpolated fields and reload them... But this might be just my random (or semi-educated) guess...
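A sketch of that save-and-reload pattern, assuming `interpolated` is an xarray Dataset holding the expensive intermediate (the file name and chunk sizes are placeholders):

```python
import xarray as xr

# Writing the intermediate to disk forces the computation once.
interpolated.to_netcdf("interpolated_fields.nc")

# Reloading lazily gives downstream steps a short, fresh task graph.
interpolated = xr.open_dataset("interpolated_fields.nc", chunks={"time": 1})
```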

navidcy pinned this issue Sep 14, 2021
@navidcy (Collaborator) commented Sep 14, 2021

Actually, I've now noticed that this MnWE might not be as relevant here, since it doesn't use the cookbook... Oh well...

@angus-g (Collaborator) commented Sep 14, 2021

The cookbook only really wraps the act of getting the data in the first place, so it's the actual (attempted) computation that's more important, IMO. Thanks for the example! I'll take a look.

micaeljtoliveira unpinned this issue Jun 9, 2022
@access-hive-bot commented

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-cookbook-updating-needs/130/2
