slow ranks search improvements #99

Open
wants to merge 4 commits into
base: master

Conversation

dmonakhov

No description provided.

Dmitry Monakhov added 4 commits December 15, 2021 14:03
This allows us to simulate slow ranks and deadlocks.
New options:
    -S/--slowrank <rank>
    -D/--slowrank_delay <usec>
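
A minimal sketch of how the two options could behave, written in Python for illustration only. It is not the patch's code: the helper names `parse_args` and `maybe_delay`, the default values, and the convention that `-1` disables the feature are all assumptions. Only the flag names and the microsecond unit come from the commit message.

```python
import argparse
import time


def parse_args(argv):
    """Parse the two slow-rank options named in the commit message."""
    parser = argparse.ArgumentParser(description="slow-rank simulation sketch")
    # Assumption: -1 means "no slow rank"; the patch's real default is not shown.
    parser.add_argument("-S", "--slowrank", type=int, default=-1,
                        help="rank to slow down")
    parser.add_argument("-D", "--slowrank_delay", type=int, default=0,
                        help="delay in microseconds")
    return parser.parse_args(argv)


def maybe_delay(rank, args, sleep=time.sleep):
    """Sleep only on the designated slow rank; return the delay in seconds."""
    if rank == args.slowrank and args.slowrank_delay > 0:
        delay_sec = args.slowrank_delay / 1_000_000  # usec -> sec
        sleep(delay_sec)
        return delay_sec
    return 0.0
```

Calling `maybe_delay(rank, args)` before each communication step would stall a single rank. A very large delay then approximates a deadlocked rank.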
Currently sendrecv can send data only to local peers. Let's introduce a distance metric
for peers, so one can test different cycles.

For example, ./sendrecv -r -1 will iterate over all possible distances,
so all NxN communication routes are tested in only N iterations.
This is a good diagnostic tool for various network issues.
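
The coverage claim can be checked with a small model. This sketch assumes the distance is a modular (ring) offset, so that a rank sends to `rank + distance` and receives from `rank - distance`, both modulo the number of ranks. The patch's actual definition is not shown here, and `peers_at_distance` and `routes_covered` are hypothetical names.

```python
def peers_at_distance(rank, nranks, distance):
    """Return (send_to, recv_from) for a rank, assuming a ring-style offset."""
    send_to = (rank + distance) % nranks
    recv_from = (rank - distance) % nranks
    return send_to, recv_from


def routes_covered(nranks):
    """Collect every (sender, receiver) pair seen over distances 0..nranks-1."""
    routes = set()
    for distance in range(nranks):      # N iterations, one per distance
        for rank in range(nranks):      # all ranks communicate in parallel
            send_to, _ = peers_at_distance(rank, nranks, distance)
            routes.add((rank, send_to))
    return routes
```

Under this assumption each iteration exercises N routes at once, so N iterations reach all NxN sender/receiver pairs, including the self-route at distance 0.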
Currently the only way to iterate over different roots is one by one, which is not useful.
This patch allows skipping some ranks: a negative number is interpreted as a step size.
For example:
My hosts have 8 GPUs each, so iterating over ranks {0,8,16,...} emulates all possible host orders:
./reduce_perf -r -8
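
A rough illustration of the step-size rule follows. The function name `roots_to_test` is hypothetical, and treating a non-negative value as a single fixed root is an assumption rather than something this commit message states.

```python
def roots_to_test(nranks, root):
    """Expand the -r argument into the list of root ranks to iterate over."""
    if root >= 0:
        # Assumption: a non-negative value selects one fixed root.
        return [root]
    step = -root  # negative value is read as a step size
    return list(range(0, nranks, step))
```

With 32 ranks on hosts of 8 GPUs, `roots_to_test(32, -8)` yields one root per host, and `-1` degenerates to visiting every rank.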
Communication timeouts are vital building blocks of reliable distributed algorithms.
If one of the ranks crashes or deadlocks, the whole test will hang forever; this is
expected behaviour because of the FLP impossibility result [1]. NCCL has no built-in
communication timeout support because it is a general-purpose library.
Timeouts should be implemented at the application level. Set the default communication
timeout to 1800 sec (30 min); the user may change it via the NCCL_TESTS_COMM_TIMEOUT env variable.
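
An application-level timeout of this kind is commonly built as a polling loop with a deadline. The sketch below shows that general pattern in Python, not the patch's implementation. `comm_timeout` and `wait_for` are hypothetical helpers, and it is an assumption that the environment variable is given in seconds.

```python
import os
import time

DEFAULT_TIMEOUT_SEC = 1800  # 30 minutes, the default named in the commit message


def comm_timeout(env=os.environ):
    """Read the timeout; assumes NCCL_TESTS_COMM_TIMEOUT is given in seconds."""
    return float(env.get("NCCL_TESTS_COMM_TIMEOUT", DEFAULT_TIMEOUT_SEC))


def wait_for(is_done, timeout_sec, poll_interval=0.01,
             clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or the deadline passes."""
    deadline = clock() + timeout_sec
    while not is_done():
        if clock() >= deadline:
            raise TimeoutError(f"communication did not finish in {timeout_sec} s")
        sleep(poll_interval)  # avoid a busy spin while waiting
```

A caller would pass a completion check for the pending operation as `is_done` and abort the test on `TimeoutError` instead of hanging.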


Footnotes:
[1] https://en.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus