Wider allowed port range #77

Open
SamuelLarkin opened this issue Apr 7, 2022 · 7 comments

Comments

@SamuelLarkin
Collaborator

SamuelLarkin commented Apr 7, 2022

Hi,
I'm trying to run the new Sockeye-3 multi-node with multiple GPUs and it fails. I opened a ticket with Sockeye, and their hypothesis is that the allowed port range is too small. Sockeye-3 uses PyTorch 1.10 and NCCL, and it tries to create a C10d rendezvous service to synchronize the workers, but there is no way to specify a port; it picks one at random.

My request is to widen the allowed port range on trixie's worker nodes and the head node.
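For reference, the range currently allowed on a node can be checked directly (a quick sketch; run on the head node, and on a compute node through srun):

sysctl net.ipv4.ip_local_port_range
cat /proc/sys/net/ipv4/ip_local_port_range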

Note for myself:
/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes

source /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/tools/activate
sbatch train.slurm
@nrcfieldsa

nrcfieldsa commented Apr 7, 2022

In order to allow a wider port range, both head nodes and compute nodes will need to have the following kernel tunable set:

net.ipv4.ip_local_port_range = 1192 65535

As of April 6, 2022 (5:32 PM), all the trixie head and compute nodes are set to 32768-60999.
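One way to apply the wider range persistently on each node would be roughly the following (a sketch only; assumes root on every head and compute node):

echo 'net.ipv4.ip_local_port_range = 1192 65535' >> /etc/sysctl.conf
sysctl -p   # reload; sysctl -w net.ipv4.ip_local_port_range='1192 65535' would apply it immediately without persisting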

@nrcfieldsa

nrcfieldsa commented Apr 8, 2022

The kernel sysctl setting was applied to the trixie head nodes and compute nodes in /etc/sysctl.conf, and is now set correctly to support jobs that initiate many connections starting at lower ports. A backup of the original configuration is in /root/etc_sysctl.d_99-sysctl.conf.bak.

Further troubleshooting shows this is not the issue currently stopping the training job. There is another HPC Slurm / application-specific issue, tracked in the upstream project's issue tracker, that is still preventing communication or is related to task/GPU resource allocation across nodes.

@nrcfieldsa

Settings have been re-applied:

net.ipv4.ip_local_port_range = 2048 65000

Please confirm this is working post-upgrade.
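As a quick check, something like the following (a sketch; exact Slurm options depend on the allocation) should report 2048 65000 on every node:

srun --nodes=2 --ntasks-per-node=1 sysctl net.ipv4.ip_local_port_range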

@nrcfieldsa

@SamuelLarkin : Can this issue be resolved now?

@SamuelLarkin
Collaborator Author

SamuelLarkin commented Aug 25, 2022

I'm still unable to run sockeye on trixie. I get the following error message.

Sockeye-3-multi-nodes-159008.1.out

Started sockeye.train at Wed Aug 24 14:18:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=32805 --rdzv_id=159008 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:32916 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30193:30193 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30193:30193 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
cn135:30193:30246 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30193:30246 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30193:30246 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/IB/0

cn135:30193:30246 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn135:30193:30246 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn135:30193:30246 [0] NCCL INFO include/net.h:23 -> 2
cn135:30193:30246 [0] NCCL INFO transport/net.cc:223 -> 2
cn135:30193:30246 [0] NCCL INFO transport.cc:111 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:778 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:904 -> 2
cn135:30193:30246 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
    resume_training = check_resume(args, output_folder)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
    torch.distributed.barrier()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30193) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:18:45
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30193)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
      resume_training = check_resume(args, output_folder)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
      torch.distributed.barrier()
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
      work = default_pg.barrier(opts=opts)
  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    0m18.540s
user    0m3.384s
sys     0m6.924s

@SamuelLarkin
Collaborator Author

SamuelLarkin commented Aug 25, 2022

Looks like InfiniBand is not properly configured, or at least not set up to be compatible with torch.distributed. If I set NCCL_IB_DISABLE=1, I get a bit further. The master node seems to start and train, but the worker node fails.
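For reference, the workaround boils down to roughly this in the Slurm script, before torchrun is invoked (a sketch; NCCL_SOCKET_IFNAME is only an assumption, in case NCCL picks the wrong interface):

export NCCL_IB_DISABLE=1          # skip InfiniBand verbs, fall back to TCP sockets
export NCCL_DEBUG=INFO            # keep the verbose NCCL logging shown below
# export NCCL_SOCKET_IFNAME=ib0   # optionally pin the interface NCCL uses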

Sockeye-3-multi-nodes-159009.out

Master Log

Key error message: cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>, which seems to indicate that the whole thing failed because the worker node's connection dropped.

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30932:30932 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30932:30932 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn135:30932:30932 [0] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
cn135:30932:30989 [0] NCCL INFO Channel 00/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Channel 01/02 :    0   1
cn135:30932:30989 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30932:30989 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Connected all rings
cn135:30932:30989 [0] NCCL INFO Connected all trees
cn135:30932:30989 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn135:30932:30989 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn135:30932:30989 [0] NCCL INFO comm 0x7ffef0000fa0 rank 0 nranks 2 cudaDev 0 busId 89000 - Init COMPLETE
cn135:30932:30932 [0] NCCL INFO Launch mode Parallel
[INFO:sockeye.utils] Sockeye: 3.1.9, commit unknown, path /home/larkins/git/sockeye/sockeye/__init__.py
[INFO:sockeye.utils] PyTorch: 1.11.0 (/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/__init__.py)

...

[INFO:sockeye.data_io] Shuffling the shards.
[INFO:sockeye.data_io] Loading shard corpora/prepared.shared_vocab/shard.00008.
[INFO:sockeye.data_io] Replicating bucket of 1 sentence(s) 2 times to cover 2 splits.

cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>
cn135:30932:30993 [0] NCCL INFO transport/net_socket.cc:414 -> 2
cn135:30932:30993 [0] NCCL INFO include/net.h:28 -> 2
cn135:30932:30993 [0] NCCL INFO transport/net.cc:459 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:351 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:452 -> 2 [Proxy Thread]
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
    sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
  File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
    train(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
    train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
  File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
    train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
    train_iter = ShardedParallelSampleIter(shard_fnames,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
    self.reset()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
    self._load_shard()
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
    dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
  File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
    target_num_samples = max(utils.all_gather_object(target_num_samples))
  File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
    torch.distributed.all_gather_object(obj_list, obj)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
    all_gather(output_tensors, input_tensor, group=group)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30932) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:53
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30932)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
      self._load_shard()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
      dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
      target_num_samples = max(utils.all_gather_object(target_num_samples))
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
      torch.distributed.all_gather_object(obj_list, obj)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
      all_gather(output_tensors, input_tensor, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
      work = default_pg.allgather([tensor_list], [tensor])
  RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
  ncclSystemError: System call (socket, malloc, munmap, etc) failed.

============================================================

real    30m30.189s
user    31m38.458s
sys     28m35.649s

Worker Log

Key Error messages

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.

Full Log

Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn136:970:970 [1] NCCL INFO Bootstrap : Using ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn136:970:970 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn136:970:970 [1] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO Using network Socket
cn136:970:1023 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
cn136:970:1023 [1] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Connected all rings
cn136:970:1023 [1] NCCL INFO Connected all trees
cn136:970:1023 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn136:970:1023 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn136:970:1023 [1] NCCL INFO comm 0x7ffefc000fa0 rank 1 nranks 2 cudaDev 1 busId 8a000 - Init COMPLETE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-08-24_14:53:47
  host      : cn136
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 970)
  error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
      return f(*args, **kwargs)
    File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
      train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
    File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
      train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
      train_iter = ShardedParallelSampleIter(shard_fnames,
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
      self.reset()
    File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1742, in reset
      self.shards_fnames = utils.broadcast_object(self.shards_fnames)
    File "/home/larkins/git/sockeye/sockeye/utils.py", line 609, in broadcast_object
      torch.distributed.broadcast_object_list(obj_list, src=src)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
      broadcast(object_sizes_tensor, src=src, group=group)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
      work = default_pg.broadcast([tensor], opts)
  RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
  Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
  frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fff8e9961bd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7fff8e99290c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
  frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7fffcd49bfef in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7fffcd49cf71 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7fffcd49cffb in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
  frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7fff8fcf7834 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7fff8fcfb8c9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #9: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7fff8fd06c21 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
  frame #10: <unknown function> + 0x801f49 (0x7fffd54bef49 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #11: <unknown function> + 0x1e5d37 (0x7fffd4ea2d37 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
  frame #12: PyCFunction_Call + 0x6e (0x55555568fe7e in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #13: _PyObject_MakeTpCall + 0x501 (0x555555678631 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #14: <unknown function> + 0x13bbfd (0x55555568fbfd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #15: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #16: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #17: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #18: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #19: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #20: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #21: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #22: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #23: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #24: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #25: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #26: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #27: _PyEval_EvalFrameDefault + 0x67d (0x55555566fd9d in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #28: _PyEval_EvalCodeWithName + 0x7d7 (0x55555566e957 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #29: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #30: <unknown function> + 0x1373b8 (0x55555568b3b8 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #31: _PyObject_MakeTpCall + 0x51c (0x55555567864c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #32: _PyEval_EvalFrameDefault + 0x4ebf (0x5555556745df in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #33: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #34: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #35: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #36: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #37: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #38: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #39: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #40: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #41: PyObject_Call + 0x2d2 (0x555555692172 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #42: _PyEval_EvalFrameDefault + 0x2150 (0x555555671870 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #43: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #44: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #45: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #46: _PyFunction_Vectorcall + 0xf6 (0x5555556802a6 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #47: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #48: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #49: PyEval_EvalCodeEx + 0x39 (0x55555572dde9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #50: PyEval_EvalCode + 0x1b (0x55555572ddab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #51: <unknown function> + 0x1fa903 (0x55555574e903 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #52: <unknown function> + 0x1f98e3 (0x55555574d8e3 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #53: <unknown function> + 0x99f2f (0x5555555edf2f in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #54: PyRun_SimpleFileExFlags + 0x364 (0x5555555eda23 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #55: <unknown function> + 0x8d0ac (0x5555555e10ac in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #56: Py_BytesMain + 0x39 (0x555555722219 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
  frame #57: __libc_start_main + 0xf5 (0x7ffff6f02555 in /lib64/libc.so.6)
  frame #58: <unknown function> + 0x1ce125 (0x555555722125 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)


============================================================

real    30m30.101s
user    0m6.278s
sys     0m11.023s
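To separate a plain networking problem from an NCCL/PyTorch one, a raw socket test between the two compute nodes over the same addresses might help (a sketch; assumes iperf3 is available on the nodes):

iperf3 -s                 # on cn135 (10.11.0.135)
iperf3 -c 10.11.0.135     # on cn136; traffic should flow without the connection dropping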

@NRCGavin
Collaborator

I discovered a misconfiguration of the head node's firewall today that probably caused the timeout error you saw in the logs.

Please try again when you have time and report back.
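Before resubmitting the full job, a quick pre-check from the worker node can confirm the rendezvous and master ports are reachable while a master-side torchrun is running (a sketch; the ports come from the torchrun command in the logs above):

nc -zv 10.10.0.135 40017    # c10d rendezvous endpoint
nc -zv 10.10.0.135 39906    # master port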
