Support setting CUDA_VISIBLE_DEVICES env variable #113

Open
ryanamazon wants to merge 1 commit into master

Conversation


@ryanamazon ryanamazon commented Aug 19, 2022

I'm resubmitting this as an update to #105.

I attempted to validate the patch in light of the comments on that PR. Please let me know if I'm misunderstanding.

To evaluate this comment:
"For example, if you run with 2 ranks, and set CUDA_VISIBLE_DEVICES to "0,1,2,3" on one rank and "4,5,6,7" and the other rank, then launch with only 2 GPUs per rank, users would expect to use 0,1 and 4,5 but instead we'd use 0,1 and 6,7."

I applied the proposed patch and added logging (zerodev) to show the offset of the actual device from zero. I launched with two nodes, one process per node, and two GPUs per process, which results in 2 GPUs per rank.

ubuntu@ip-172-31-30-49:~$ ./dompi
ip-172-31-21-32 CUDA_VISIBLE_DEVICES=4,5,6,7
ip-172-31-30-49 CUDA_VISIBLE_DEVICES=0,1,2,3
# nThread 1 nGpus 2 minBytes 8 maxBytes 8 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  36533 on ip-172-31-30-49 device  0 [0x00] devid=0x16 zerodev=0 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  1 Pid  36533 on ip-172-31-30-49 device  1 [0x00] devid=0x17 zerodev=1 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  2 Pid  33645 on ip-172-31-21-32 device  0 [0x00] devid=0x1a zerodev=4 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  3 Pid  33645 on ip-172-31-21-32 device  1 [0x00] devid=0x1b zerodev=5 domid=0x0 Tesla V100-SXM2-32GB
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    163.9    0.00    0.00  1e-07    163.3    0.00    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.33647e-05
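
(For reference, a minimal sketch of what the zerodev value above could correspond to; this is not the actual patch or its logging code. It assumes CUDA_VISIBLE_DEVICES is a plain comma-separated list of integer indices, so the system-wide index of a logical CUDA device can be recovered by parsing the variable; the helper name zerodevIndex is mine.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the actual patch): map a logical CUDA device index
 * to the system-wide GPU index it refers to under CUDA_VISIBLE_DEVICES.
 * Assumes the variable is a plain comma-separated list of integers
 * (e.g. "4,5,6,7"); UUID entries are not handled in this sketch. */
static int zerodevIndex(int logicalDev) {
  const char *env = getenv("CUDA_VISIBLE_DEVICES");
  if (env == NULL) return logicalDev;          /* all GPUs visible: identity map */
  char *copy = strdup(env);
  int idx = 0, result = logicalDev;
  for (char *tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ","), idx++) {
    if (idx == logicalDev) { result = atoi(tok); break; }
  }
  free(copy);
  return result;
}

int main(void) {
  /* With CUDA_VISIBLE_DEVICES=4,5,6,7, logical device 1 maps to system GPU 5
   * (matching zerodev=5 for rank 3 in the output above). */
  printf("logical 1 -> zerodev %d\n", zerodevIndex(1));
  return 0;
}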

To evaluate this comment:
"Also, today if you run on a 4 GPUs system and launch 8 ranks or one rank with -g 8 it will error out as "invalid CUDA numeral" or something similar. With this patch it would try to re-use each GPU twice (and generate and error later during NCCL init)."

When I run the patch with four devices but eight GPUs requested, I get an argument error.

ubuntu@ip-172-31-30-49:~$ CUDA_VISIBLE_DEVICES=7,6,5,4 nccl-tests/build/all_reduce_perf -b 8 -e 8 -f 2 -g 8 -c 1 -n 100
invalid number of GPUs specified (8), only for 4 GPUs exist
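
(This is the kind of up-front check that produces such an error; the sketch below is illustrative only, not the code from this PR. It assumes the requested GPU count is compared against what cudaGetDeviceCount reports, which already accounts for CUDA_VISIBLE_DEVICES.)

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical check, not the actual nccl-tests code: reject a request for
 * more GPUs per process than CUDA can see, instead of failing later with an
 * invalid device ordinal during initialization. cudaGetDeviceCount() only
 * counts devices visible under the current CUDA_VISIBLE_DEVICES. */
static int checkGpuCount(int nGpusRequested) {
  int visible = 0;
  if (cudaGetDeviceCount(&visible) != cudaSuccess) return -1;
  if (nGpusRequested > visible) {
    fprintf(stderr, "invalid number of GPUs specified (%d), only %d GPUs exist\n",
            nGpusRequested, visible);
    return -1;
  }
  return 0;
}

int main(void) {
  /* With 4 visible devices, asking for 8 prints the error and returns non-zero. */
  return checkGpuCount(8) ? 1 : 0;
}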

@ryanamazon
Author

@sjeaugey @AddyLaddy Thank you for looking at this previously; I'd like to continue the discussion on this PR if possible.

@sjeaugey
Member

Hi Ryan. As I mentioned, the code was old and I didn't want to base your changes on that old version, as it would not work well when we wanted to update it.
That's why we didn't make progress on this PR: we were preparing an update so that we could work on the current code. Sorry it took us some time, but it's done now.
I'll try to revisit this PR based on the new code soon.

@ryanamazon ryanamazon reopened this Sep 16, 2022
@ryanamazon
Author

ryanamazon commented Sep 16, 2022

NCCL and NCCL-tests have been updated; I also updated this PR to resolve build errors against the previous code.

@ryanamazon
Author

Thank you for the update. Given the recent updates of NCCL and NCCL Tests, is this getting to a point where it could move forward?

@sjeaugey
Member

Sorry, the changes I wanted were missed in the update. So I'm working on pushing another patch.

@ryanamazon
Author

Thank you for the update!

@sjeaugey
Member

sjeaugey commented Sep 26, 2022

I updated the NCCL tests. Now there is no need to replicate the computation of GPU numbers, as it's computed once at the beginning. Plus NCCL_TESTS_DEVICE supports multiple GPUs.

So I looked again at the goal and the pros/cons of the different approaches. The problem is that we want to be able to work in an environment where CUDA_VISIBLE_DEVICES is set to restrict the set of GPUs each rank is seeing (each rank seeing different devices), while the NCCL tests assume each rank sees the same set of GPUs and hence compute absolute CUDA devices for each rank.

The two approaches are:

  1. Detect the number of visible CUDA devices and apply a modulo to the computed device number.
  2. Set NCCL_TESTS_DEVICE=0 to tell the perf tests to start counting at 0 for all ranks.

I did not find a case where the second approach would not be able to do everything approach 1 does. Moreover, there is still the case I mentioned before where option 2 works better than option 1: when we set CUDA_VISIBLE_DEVICES to multiple GPUs but each rank does not use all of them.
For example: we partitioned the system between two ranks, so that rank 0 has CUDA_VISIBLE_DEVICES set to 0,1,2,3 while rank 1 has 4,5,6,7. In that case, if we run the tests with -g 1 or -g 2, instead of using its first 1 or 2 GPUs, rank 1 will (with the modulo approach) be using GPU 5 (-g 1) or GPUs 6,7 (-g 2). With NCCL_TESTS_DEVICE=0, rank 1 would be using GPU 4, or GPUs 4,5, which I believe is what a user would expect.
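
(To make the difference concrete, here is a small standalone sketch of the two selection schemes for that partitioned example; the variable names and formulas are illustrative, not the tests' actual code, and "rank" here is the local rank on the node.)

#include <stdio.h>

/* Illustrative only: compare the devices each rank would end up using in the
 * example above, where an 8-GPU node is split between two ranks via
 * CUDA_VISIBLE_DEVICES (rank 0 sees 0,1,2,3 and rank 1 sees 4,5,6,7). */
int main(void) {
  const int visiblePerRank = 4;   /* entries in each rank's CUDA_VISIBLE_DEVICES */
  const int gpusPerRank    = 2;   /* -g 2 */
  for (int rank = 0; rank < 2; rank++) {
    for (int g = 0; g < gpusPerRank; g++) {
      /* Option 1: compute an absolute device number, then wrap with a modulo. */
      int moduloDev = (rank * gpusPerRank + g) % visiblePerRank;
      /* Option 2: NCCL_TESTS_DEVICE=0 makes every rank count from its own 0. */
      int offsetDev = 0 + g;
      /* Translate back to system GPU numbers (rank 1's visible list starts at 4). */
      printf("rank %d gpu %d: modulo -> system GPU %d, NCCL_TESTS_DEVICE=0 -> system GPU %d\n",
             rank, g, rank * visiblePerRank + moduloDev, rank * visiblePerRank + offsetDev);
    }
  }
  return 0;
}

Running it prints system GPUs 6,7 for rank 1 under the modulo scheme versus 4,5 with NCCL_TESTS_DEVICE=0, matching the behavior described above; in practice the latter corresponds to exporting NCCL_TESTS_DEVICE=0 in each rank's environment alongside its CUDA_VISIBLE_DEVICES setting.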

So I'm still not sure that the modulo is the right approach, and setting NCCL_TESTS_DEVICE=0 seems like a natural way to tell the tests that each rank is running inside a different CUDA_VISIBLE_DEVICES set.

Is there something I'm missing?
