
Hyperthreading #32

Open
itamblyn opened this issue Sep 10, 2020 · 11 comments
Labels: documentation (Improvements or additions to documentation)

@itamblyn (Contributor)

The command `top` shows 32 CPUs. Are the Trixie compute nodes dual socket, i.e. do they have two Xeon 6130 processors? If so, I don't understand why we aren't seeing 64 CPUs in `top` (because of hyperthreading).
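For reference, `lscpu` reports the socket/core/thread layout directly. A sketch of what a dual-socket Xeon 6130 node with hyperthreading disabled would report (the values shown are assumed, not captured from a Trixie node):

```sh
# Hypothetical output for a dual-socket Xeon Gold 6130 node with
# hyperthreading disabled (2 sockets x 16 cores x 1 thread = 32 CPUs):
$ lscpu | grep -E 'Thread|Core|Socket'
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             2
```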

@itamblyn added the documentation label on Sep 10, 2020
@joeydumont

There are two sockets with 16 cores each. Hyperthreading is disabled.

@itamblyn (Contributor, Author)

...why is hyperthreading disabled?

@joeydumont

Hyperthreading is typically disabled on HPC clusters, as each core is expected to be almost always fully utilized. If you oversubscribe threads to cores, you can lose performance due to constant context switching.

Of course, it depends a lot on what the typical workload looks like. For CPU-bound workloads, hyperthreading is better left disabled. It might be different for other workloads.
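As an aside, on recent Linux kernels (4.19+) you can inspect, and sometimes toggle, SMT at runtime through sysfs. A sketch, assuming the kernel exposes the SMT control interface and the firmware permits it:

```sh
# Is simultaneous multithreading (SMT) currently active? (1 = yes, 0 = no)
$ cat /sys/devices/system/cpu/smt/active
0
# Control state: "on", "off", "forceoff", or "notsupported".
$ cat /sys/devices/system/cpu/smt/control
off
# If the firmware leaves SMT on, root can disable it at runtime:
# echo off | sudo tee /sys/devices/system/cpu/smt/control
```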

@itamblyn changed the title from "Hardware specifications" to "Hyperthreading" on Sep 10, 2020
@itamblyn (Contributor, Author)

Ok, I think this needs to be revisited. Trixie is not a "general" HPC machine: it was designed for GPU and data-intensive workloads, so we should be tuning it for that purpose.

Hyperthreading should be turned back on, as I suspect we are bottlenecking the cards right now (or at least we have the potential to).

@SamuelLarkin (Collaborator)

Have you looked at `nvidia-smi -l` on one of your nodes? I see 4 V100s at 100%, which to me indicates that there isn't a bottleneck where the GPUs are waiting for data from the CPUs.
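A sketch of two ways to watch this continuously (both flags are standard `nvidia-smi` options; the one-second interval is just an example):

```sh
# Refresh the full summary every second:
$ nvidia-smi -l 1
# Or log only GPU and memory utilization per device, in CSV form:
$ nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory \
             --format=csv -l 1
```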

@itamblyn (Contributor, Author)

That just indicates there isn't a bottleneck with that particular model.

I am not aware of any vendor supplying deep-learning gear with hyperthreading disabled. The burden of proof runs the other way here.

@joeydumont

I am fairly sure this is how the cluster was first delivered. If you want to enable hyperthreading, we can queue that work for the next compute node refresh. We'll probably want to involve the working group on that decision.

My own experience is with CPU-based workloads, so I won't argue for/against hyperthreading here. We can implement whatever the working group thinks is best for the cluster.

@kryczko commented Sep 10, 2020

On the Niagara supercomputer, if you request 40 CPUs you get 40 physical cores and the hyperthreads that go along with them (80 logical CPUs in total). The user can effectively turn hyperthreading off on the fly with OMP_NUM_THREADS=1 (more generally, by running one thread per physical core).
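A sketch of that pattern for a full node, assuming a hypothetical OpenMP binary `./my_app` and 40 physical cores; `OMP_PLACES` and `OMP_PROC_BIND` are standard OpenMP environment variables:

```sh
# One OpenMP thread per physical core; the hyperthreads stay idle.
export OMP_NUM_THREADS=40     # assumed physical-core count for the node
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=close    # pin threads to consecutive places
./my_app                      # hypothetical OpenMP binary
```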

@ddamoursNRC (Collaborator)

Documentation has been updated to reflect that hyperthreading is currently off.
We are developing a plan to benchmark hyperthreading on and off for some of our workloads.
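A minimal sketch of such an A/B comparison, assuming a representative workload script `./train.sh` (hypothetical) and the runtime SMT toggle mentioned above:

```sh
# Time the same workload with SMT off and then on (requires root and a
# kernel/firmware combination that allows runtime SMT control).
for smt in off on; do
    echo "$smt" | sudo tee /sys/devices/system/cpu/smt/control
    /usr/bin/time -v -o "time_smt_${smt}.log" ./train.sh
done
```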

@ddamoursNRC (Collaborator)

Agreement has been reached to turn hyperthreading on for all compute nodes during the next scheduled maintenance window. Details will be communicated once this occurs.

@itamblyn (Contributor, Author) commented Aug 4, 2022

Did this happen? Can this issue be closed?
