slow ranks search improvements #99

Open
wants to merge 4 commits into
base: master
from

Commits on Dec 15, 2021

  1. add slow rank simulation options

    This allows us to simulate slow ranks and deadlocks.
    New options:
        -S/--slowrank <rank>
        -D/--slowrank_delay <usec>
    Dmitry Monakhov committed Dec 15, 2021
    a2b7115
  2. sendrecv: Test different distances

    Currently sendrecv allows sending data only to local peers. Let's introduce a distance metric
    for peers, so one can test different cycles.
    
    For example, ./sendrecv -r -1 will iterate over all possible distances,
    so all NxN communication routes will be tested in only N iterations.
    This is a good diagnostic tool for various network issues.
    Dmitry Monakhov committed Dec 15, 2021
    2cf4f1f
  3. Allow to configure iteration steps

    Currently the only way to iterate over different roots is one by one, which is not useful.
    This patch allows skipping some ranks: a negative number is interpreted as a step size.
    For example:
    My hosts have 8 GPUs each, so iterating over ranks {0,8,16,...} emulates all possible host orders:
    ./reduce_perf -r -8
    Dmitry Monakhov committed Dec 15, 2021
    ef73747
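
The distance and step-size semantics described in the two commits above can be illustrated with a small sketch. It is not code from the patch: the function names are hypothetical, and the ring-style mapping (send to `rank + distance`, receive from `rank - distance`, both modulo the rank count) is an assumption about how a "distance" between peers might be defined. Whether the real sweep includes distance 0 (a rank sending to itself) is also an assumption.

```python
def peers_at_distance(rank, nranks, distance):
    """Return (send_to, recv_from) for one rank at a given ring distance.

    Assumed mapping, not taken from the patch.
    """
    send_to = (rank + distance) % nranks
    recv_from = (rank - distance) % nranks
    return send_to, recv_from


def routes_covered(nranks):
    """Collect every (src, dst) pair exercised by sweeping distances 0..nranks-1."""
    routes = set()
    for distance in range(nranks):
        for rank in range(nranks):
            send_to, _ = peers_at_distance(rank, nranks, distance)
            routes.add((rank, send_to))
    return routes


def stepped_roots(nranks, step):
    """Roots visited when a negative root value is read as a step size."""
    return list(range(0, nranks, abs(step)))
```

Under these assumptions, sweeping N distances touches all N x N ordered (source, destination) pairs, and `stepped_roots(32, -8)` visits one root per 8-GPU host.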

Commits on Dec 16, 2021

  1. Add communication timeout support

    Communication timeouts are vital building blocks of reliable distributed algorithms.
    If one of the ranks crashes or deadlocks, the whole test will hang forever; this is
    expected behaviour because of the FLP impossibility result [1]. NCCL has no built-in
    communication timeout support because it is a general-purpose library, so
    timeouts should be implemented at the application level. Set the default communication
    timeout to 1800 sec (30 min); the user may change it via the NCCL_TESTS_COMM_TIMEOUT env variable.
    
    
    Footnotes:
    [1] https://en.wikipedia.org/wiki/Consensus_(computer_science)#The_FLP_impossibility_result_for_asynchronous_deterministic_consensus
    Dmitry Monakhov committed Dec 16, 2021
    fdba13b
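
An application-level timeout of the kind this commit describes is usually a polling loop with a deadline. The sketch below shows that general pattern only and is not the patch's implementation. The helper names are hypothetical, the environment variable is assumed to hold a value in seconds, and `simulated_rank` is a stand-in for a delayed rank in the spirit of the `-D/--slowrank_delay <usec>` option.

```python
import os
import time

DEFAULT_TIMEOUT_SEC = 1800  # 30 min default, as stated in the commit message


def comm_timeout_sec(env=None):
    """Read the timeout from NCCL_TESTS_COMM_TIMEOUT (unit assumed: seconds)."""
    env = os.environ if env is None else env
    return float(env.get("NCCL_TESTS_COMM_TIMEOUT", DEFAULT_TIMEOUT_SEC))


def wait_with_timeout(is_done, timeout_sec, poll_interval_sec=0.001):
    """Poll is_done() until it returns True or the deadline passes.

    Returns True on completion and False on timeout.
    """
    deadline = time.monotonic() + timeout_sec
    while not is_done():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_sec)
    return True


def simulated_rank(delay_usec):
    """Return an is_done() callable that reports completion after delay_usec."""
    ready_at = time.monotonic() + delay_usec / 1e6
    return lambda: time.monotonic() >= ready_at
```

With a delay shorter than the timeout the wait succeeds. With a longer delay it returns False, and the caller can abort the test instead of hanging.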