Wider allowed port range #77

Open
SamuelLarkin opened this issue Apr 7, 2022 · 7 comments

Comments

@SamuelLarkin
Collaborator

SamuelLarkin commented Apr 7, 2022

Hi,
I'm trying to run the new Sockeye-3 multi-node with multiple GPUs and it fails. I opened a ticket with Sockeye, and their hypothesis is that the allowed port range is too small. Sockeye-3 uses PyTorch 1.10 and NCCL, and it tries to create a C10d rendezvous service to synchronize the workers, but there is no way to specify a port; it picks one at random.

My request is to widen the allowed port range on trixie's worker nodes and the head node.
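For reference, the range currently allowed on a node can be checked directly (a quick sketch; run on the head node, and on a compute node through srun):

sysctl net.ipv4.ip_local_port_range
cat /proc/sys/net/ipv4/ip_local_port_range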

Note for myself:
/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes

source /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/tools/activate
sbatch train.slurm
@nrcfieldsa

nrcfieldsa commented Apr 7, 2022

In order to allow a wider port range, both head nodes and compute nodes will need to have the following kernel tunable set:

net.ipv4.ip_local_port_range = 1192 65535

As of April 6, 2022 (5:32 PM), all the trixie head and compute nodes are set to 32768-60999.
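One way to apply the wider range persistently on each node would be roughly the following (a sketch only; assumes root on every head and compute node):

echo 'net.ipv4.ip_local_port_range = 1192 65535' >> /etc/sysctl.conf
sysctl -p   # reload; sysctl -w net.ipv4.ip_local_port_range='1192 65535' would apply it immediately without persisting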

@nrcfieldsa

nrcfieldsa commented Apr 8, 2022

The kernel sysctl setting was applied to the trixie head nodes and compute nodes in /etc/sysctl.conf, and is now set correctly to support jobs that initiate many connections starting at lower ports. A backup of the original configuration is in /root/etc_sysctl.d_99-sysctl.conf.bak.

Further troubleshooting shows this is not the issue currently stopping the training job. There is another HPC Slurm / application-specific issue, tracked in the upstream project's issue tracker, that is still preventing communication or is related to task/GPU resource allocation across nodes.

@nrcfieldsa

Settings have been re-applied:

net.ipv4.ip_local_port_range = 2048 65000

Please confirm this is working post-upgrade.
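As a quick check, something like the following (a sketch; exact Slurm options depend on the allocation) should report 2048 65000 on every node:

srun --nodes=2 --ntasks-per-node=1 sysctl net.ipv4.ip_local_port_range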

@nrcfieldsa

@SamuelLarkin : Can this issue be resolved now?

@SamuelLarkin
Collaborator Author

SamuelLarkin commented Aug 25, 2022

I'm still unable to run sockeye on trixie. I get the following error message.

Sockeye-3-multi-nodes-159008.1.out

Started sockeye.train at Wed Aug 24 14:18:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=32805 --rdzv_id=159008 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:32916 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30193:30193 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30193:30193 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
cn135:30193:30246 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30193:30246 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/IB/0

cn135:30193:30246 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn135:30193:30246 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn135:30193:30246 [0] NCCL INFO include/net.h:23 -> 2
cn135:30193:30246 [0] NCCL INFO transport/net.cc:223 -> 2
cn135:30193:30246 [0] NCCL INFO transport.cc:111 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:778 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:904 -> 2
cn135:30193:30246 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
    resume_training = check_resume(args, output_folder)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
    torch.distributed.barrier()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30193) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:18:45
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30193)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
      resume_training = check_resume(args, output_folder)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
      torch.distributed.barrier()
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
      work = default_pg.barrier(opts=opts)
  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    0m18.540s
user    0m3.384s
sys     0m6.924s

@SamuelLarkin
Collaborator Author

SamuelLarkin commented Aug 25, 2022

Looks like InfiniBand is not properly configured, or at least not set up to be compatible with torch.distributed. If I set NCCL_IB_DISABLE=1, I get a bit further. The master node seems to start and train, but the worker node fails.
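For reference, the workaround boils down to roughly this in the Slurm script, before torchrun is invoked (a sketch; NCCL_SOCKET_IFNAME is only an assumption, in case NCCL picks the wrong interface):

export NCCL_IB_DISABLE=1          # skip InfiniBand verbs, fall back to TCP sockets
export NCCL_DEBUG=INFO            # keep the verbose NCCL logging shown below
# export NCCL_SOCKET_IFNAME=ib0   # optionally pin the interface NCCL uses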

Sockeye-3-multi-nodes-159009.out

Master Log

Key error message: cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>, which seems to indicate that the whole thing failed because the worker node's connection dropped.

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30932:30932 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30932:30932 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn135:30932:30932 [0] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
cn135:30932:30989 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30932:30989 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Connected all rings
cn135:30932:30989 [0] NCCL INFO Connected all trees
cn135:30932:30989 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn135:30932:30989 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn135:30932:30989 [0] NCCL INFO comm 0x7ffef0000fa0 rank 0 nranks 2 cudaDev 0 busId 89000 - Init COMPLETE
cn135:30932:30932 [0] NCCL INFO Launch mode Parallel
[INFO:sockeye.utils] Sockeye: 3.1.9, commit unknown, path /home/larkins/git/sockeye/sockeye/__init__.py
[INFO:sockeye.utils] PyTorch: 1.11.0 (/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/__init__.py)

...

[INFO:sockeye.data_io] Shuffling the shards.
[INFO:sockeye.data_io] Loading shard corpora/prepared.shared_vocab/shard.00008.
[INFO:sockeye.data_io] Replicating bucket of 1 sentence(s) 2 times to cover 2 splits.

cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>
cn135:30932:30993 [0] NCCL INFO transport/net_socket.cc:414 -> 2
cn135:30932:30993 [0] NCCL INFO include/net.h:28 -> 2
cn135:30932:30993 [0] NCCL INFO transport/net.cc:459 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:351 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:452 -> 2 [Proxy Thread]
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
    train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
  File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
    train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
    train_iter = ShardedParallelSampleIter(shard_fnames,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
    self.reset()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
    self._load_shard()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
    dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
    target_num_samples = max(utils.all_gather_object(target_num_samples))
  File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
    torch.distributed.all_gather_object(obj_list, obj)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
    all_gather(output_tensors, input_tensor, group=group)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30932) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:53
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30932)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
      self._load_shard()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
      dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
      target_num_samples = max(utils.all_gather_object(target_num_samples))
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
      torch.distributed.all_gather_object(obj_list, obj)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
      all_gather(output_tensors, input_tensor, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
      work = default_pg.allgather([tensor_list], [tensor])
  RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    30m30.189s
user    31m38.458s
sys     28m35.649s

Worker Log

Key Error messages

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.

Full Log

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn136:970:970 [1] NCCL INFO Bootstrap : Using ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn136:970:970 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn136:970:970 [1] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO Using network Socket
cn136:970:1023 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
cn136:970:1023 [1] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Connected all rings
cn136:970:1023 [1] NCCL INFO Connected all trees
cn136:970:1023 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn136:970:1023 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn136:970:1023 [1] NCCL INFO comm 0x7ffefc000fa0 rank 1 nranks 2 cudaDev 1 busId 8a000 - Init COMPLETE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:47
  host      : cn136
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 970)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1742, in reset
      self.shards_fnames = utils.broadcast_object(self.shards_fnames)
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 609, in broadcast_object
      torch.distributed.broadcast_object_list(obj_list, src=src)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
      broadcast(object_sizes_tensor, src=src, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
      work = default_pg.broadcast([tensor], opts)
  RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
  Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
  frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fff8e9961bd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7fff8e99290c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7fffcd49bfef in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7fffcd49cf71 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7fffcd49cffb in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7fff8fcf7834 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7fff8fcfb8c9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #9: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7fff8fd06c21 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #10: <unknown function> + 0x801f49 (0x7fffd54bef49 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #11: <unknown function> + 0x1e5d37 (0x7fffd4ea2d37 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #12: PyCFunction_Call + 0x6e (0x55555568fe7e in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #13: _PyObject_MakeTpCall + 0x501 (0x555555678631 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #14: <unknown function> + 0x13bbfd (0x55555568fbfd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #15: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #16: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #17: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #18: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #19: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #20: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #21: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #22: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #23: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #24: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #25: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #26: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #27: _PyEval_EvalFrameDefault + 0x67d (0x55555566fd9d in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #28: _PyEval_EvalCodeWithName + 0x7d7 (0x55555566e957 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #29: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #30: <unknown function> + 0x1373b8 (0x55555568b3b8 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #31: _PyObject_MakeTpCall + 0x51c (0x55555567864c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #32: _PyEval_EvalFrameDefault + 0x4ebf (0x5555556745df in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #33: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #34: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #35: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #36: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #37: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #38: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #39: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #40: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #41: PyObject_Call + 0x2d2 (0x555555692172 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #42: _PyEval_EvalFrameDefault + 0x2150 (0x555555671870 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #43: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #44: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #45: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #46: _PyFunction_Vectorcall + 0xf6 (0x5555556802a6 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #47: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #48: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #49: PyEval_EvalCodeEx + 0x39 (0x55555572dde9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #50: PyEval_EvalCode + 0x1b (0x55555572ddab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #51: <unknown function> + 0x1fa903 (0x55555574e903 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #52: <unknown function> + 0x1f98e3 (0x55555574d8e3 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #53: <unknown function> + 0x99f2f (0x5555555edf2f in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #54: PyRun_SimpleFileExFlags + 0x364 (0x5555555eda23 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #55: <unknown function> + 0x8d0ac (0x5555555e10ac in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #56: Py_BytesMain + 0x39 (0x555555722219 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #57: __libc_start_main + 0xf5 (0x7ffff6f02555 in /lib64/libc.so.6)
  frame #58: <unknown function> + 0x1ce125 (0x555555722125 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)


============================================================

real    30m30.101s
user    0m6.278s
sys     0m11.023s
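To separate a plain networking problem from an NCCL/PyTorch one, a raw socket test between the two compute nodes over the same addresses might help (a sketch; assumes iperf3 is available on the nodes):

iperf3 -s                 # on cn135 (10.11.0.135)
iperf3 -c 10.11.0.135     # on cn136; traffic should flow without the connection dropping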

@NRCGavin
Collaborator

I discovered a misconfiguration of the head node's firewall today that probably caused the timeout error you saw in the logs.

Please try again when you have time and report back.
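Before resubmitting the full job, a quick pre-check from the worker node can confirm the rendezvous and master ports are reachable while a master-side torchrun is running (a sketch; the ports come from the torchrun command in the logs above):

nc -zv 10.10.0.135 40017    # c10d rendezvous endpoint
nc -zv 10.10.0.135 39906    # master port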
