
Explorer tools performance #305

Open

aidanheerdegen opened this issue Jul 14, 2022 · 2 comments
@aidanheerdegen (Collaborator)

The time it takes to load the Database Explorer seems to have increased markedly recently.

import cosima_cookbook as cc
session = cc.database.create_session(db='/g/data/ik11/databases/cosima_master_2022-07-01.db')

%%time
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx
CPU times: user 1min 16s, sys: 16 s, total: 1min 32s
Wall time: 4min 16s

Previous timings, from when the explorer was retooled because it was taking a minute to start up, suggest this is a recent and significant slowdown:

#274 (comment)

TL;DR: it used to take 17s.

@angus-g (Collaborator) commented Jul 14, 2022

I don't know if it's a red herring, but I'm still getting 17s with the database on tmpfs (disk latency seems important...), after changing:

diff --git a/cosima_cookbook/querying.py b/cosima_cookbook/querying.py
index 8ff2e47..05cf3c1 100644
--- a/cosima_cookbook/querying.py
+++ b/cosima_cookbook/querying.py
@@ -210,7 +210,7 @@ def get_variables(
         q = q.outerjoin(subq, subq.c.ncvar_id == NCVar.id)
 
     q = q.order_by(NCFile.frequency, CFVariable.name, NCFile.time_start, NCFile.ncfile)
-    q = q.group_by(CFVariable, NCFile.frequency)
+    q = q.group_by(CFVariable.id, NCFile.frequency)
 
     if experiment is not None:
         q = q.group_by(subq.c.value)
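For context (my reading of the mechanism, not something stated in the cookbook code): passing a mapped entity to group_by() makes SQLAlchemy expand it to all of that table's columns, so the old form groups on every CFVariable column while the new form groups on the primary key alone. A minimal, self-contained sketch with a hypothetical Var model shows the difference in the emitted SQL:

# Minimal sketch (hypothetical Var model; SQLAlchemy 1.4+ assumed) showing
# how an entity passed to group_by() expands to all of its columns.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Var(Base):
    __tablename__ = "var"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    units = Column(String)

engine = create_engine("sqlite://")

with Session(engine) as s:
    # Entity form compiles to: ... GROUP BY var.id, var.name, var.units
    print(s.query(Var).group_by(Var))
    # Column form compiles to: ... GROUP BY var.id
    print(s.query(Var).group_by(Var.id))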

Before the change it was about 30s:

before:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
32.3 s ± 559 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

after:
>>> %timeit -r3 dbx = cc.explore.DatabaseExplorer(session=session)
17.1 s ± 60.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

I think any latency beyond that is down to serving the database over Lustre, which is unavoidable...
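
For anyone reproducing the tmpfs comparison, something like the following should work; the /dev/shm destination path is an assumption (it is RAM-backed tmpfs on Linux nodes):

# Rough sketch: copy the database off Lustre onto tmpfs and open the copy.
import shutil
import cosima_cookbook as cc

tmp_db = shutil.copy(
    "/g/data/ik11/databases/cosima_master_2022-07-01.db",
    "/dev/shm/cosima_master.db",
)
session = cc.database.create_session(db=tmp_db)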

@aidanheerdegen (Collaborator, Author)

Nice detective work, and a good idea to take the disk latency out of the equation. I don't know why it used to be so much faster. It was on the OOD, so perhaps the I/O was being saturated. I know that it has always been slower on the cloud infrastructure. Now I am intrigued by the option of loading the DB into memory to improve performance (see the sketch below).
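
A rough sketch of the in-memory idea, using the stdlib sqlite3 backup API (Python 3.7+); wiring the resulting engine into the cookbook's session machinery is left out, since that depends on cosima_cookbook internals:

# Sketch only: copy the on-disk database into an in-memory one, then hand
# SQLAlchemy the single in-memory connection via a StaticPool.
import sqlite3
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

disk = sqlite3.connect("/g/data/ik11/databases/cosima_master_2022-07-01.db")
mem = sqlite3.connect(":memory:", check_same_thread=False)
disk.backup(mem)  # one-shot copy of every page into RAM
disk.close()

engine = create_engine(
    "sqlite://",
    creator=lambda: mem,  # always hand back the same in-memory connection
    poolclass=StaticPool,
)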

A couple of quick tests on gadi suggest a fair amount of variability, and still fairly long disk access times:

$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    2m38.213s
user    0m56.782s
sys     0m21.232s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    1m30.493s
user    0m50.484s
sys     0m19.030s
$ time python -c 'import cosima_cookbook as cc; dbx = cc.explore.DatabaseExplorer()'
real    2m36.051s
user    0m59.151s
sys     0m21.248s

The second run was fast because the database was still in cache; the third was after a ~30 min break, so similar to the first.

Sounds like that code change is worth making if it halves the time.
