Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polaris app_run changes #357

Merged
merged 21 commits into from
Aug 4, 2023
Merged

Polaris app_run changes #357

merged 21 commits into from
Aug 4, 2023

Conversation

cms21
Copy link
Contributor

@cms21 cms21 commented Jun 1, 2023

This PR modifies the way cpu_bind is set to make it consistent with the cpus assigned to the job by the node manager. A few items/questions that should be noted:

  • The meaning of -d depends on the value of cpu-bind. If cpu-bind=cores, -d is the number of physical cores per rank. If it is depth or numa, it is the number of hardware threads. We have a check for this now in _build_cmdline.
  • There are some print statements that probably need to be removed, but may be helpful for testing and review of the PR.
  • This PR reorganized and changed get_cpus_per_rank(). It now uses the ComputeNode cpu_ids in the multimode case, and it has been reorganized to be more readable.
  • Should we also set OMP_PLACES and OMP_PROC_BIND for the user?

cpu_ids = self._node_spec.cpu_ids[0]
cpus_per_node = len(cpu_ids)
if not cpu_ids:
compute_node = ComputeNode(self._node_spec.node_ids[0], self._node_spec.hostnames[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to watch out for here is that ComputeNode will always have an empty list of cpu_ids since it's the generic base class. Only at runtime do the launchers load the specific compute node class from the site configuration.

I think it should be safe for the NodeManager to include the full set of CPU IDs in the NodeSpec of a multinode job. This can be done by updating the method NodeManager._assign_multi_node.

That would keep the abstraction intact and avoid the need to pass an extra piece of information (which subclass of ComputeNode) from the launcher to the AppRun.

https://github.com/argonne-lcf/balsam/blob/main/balsam/site/launcher/node_manager.py#L68

Copy link
Contributor Author

@cms21 cms21 Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing that (modifying _assign_multi_node) but the way the NodeSpec object is structured, cpu_ids and gpu_ids will then be lists of a bunch of identical lists stored in memory. That seemed a bit silly. I did wonder why cpu_ids and gpu_ids have to be lists of lists? They only time they contain non-empty lists in the single node case. Having a list for every node does not seem to be used in any functional way.

@masalim2
Copy link
Contributor

masalim2 commented Jun 2, 2023 via email

@cms21 cms21 self-assigned this Aug 4, 2023
@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2023

Codecov Report

Patch coverage: 14.28% and project coverage change: -0.23% ⚠️

Comparison is base (51b9c7d) 60.91% compared to head (3890bfa) 60.69%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #357      +/-   ##
==========================================
- Coverage   60.91%   60.69%   -0.23%     
==========================================
  Files         157      157              
  Lines        9627     9677      +50     
  Branches     1259     1271      +12     
==========================================
+ Hits         5864     5873       +9     
- Misses       3502     3544      +42     
+ Partials      261      260       -1     
Files Changed Coverage Δ
balsam/platform/app_run/app_run.py 71.52% <10.00%> (-2.27%) ⬇️
balsam/platform/app_run/polaris.py 15.68% <11.36%> (-21.82%) ⬇️
balsam/platform/compute_node/alcf_polaris_node.py 41.46% <100.00%> (+1.46%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cms21 cms21 merged commit 10f5d27 into main Aug 4, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants