Wider allowed port range #77
In order to allow a wider port range, both head nodes and compute nodes will need to have the following kernel tunable set:
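The specific value is not shown above; as an illustrative sketch, assuming the tunable in question is net.ipv4.ip_local_port_range, widening the ephemeral port range would look like this (the range shown is only an example):

```bash
# Assumption: the tunable referred to above is net.ipv4.ip_local_port_range.
# Apply immediately (requires root); the range is illustrative only.
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Persist across reboots.
echo "net.ipv4.ip_local_port_range = 1024 65535" >> /etc/sysctl.d/90-port-range.conf
sysctl --system
```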
The kernel sysctl setting was applied to trixie head nodes and compute nodes. Further troubleshooting shows this is not the issue currently stopping the training job from working; there is another HPC SLURM / application-specific issue, tracked in the upstream project's issue tracker, that is still preventing communication across nodes or is related to task/GPU resource allocation across nodes.
Settings have been re-applied:
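As a quick check that the re-applied setting is active (assuming the net.ipv4.ip_local_port_range tunable sketched above; the node name is taken from the logs below):

```bash
# Read the currently active ephemeral port range on a compute node.
ssh cn135 sysctl net.ipv4.ip_local_port_range
```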
Please confirm this is working post-upgrade.
@SamuelLarkin: Can this issue be resolved now?
I'm still unable to run sockeye-train:
Started sockeye.train at Wed Aug 24 14:18:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=32805 --rdzv_id=159008 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:32916 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30193:30193 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30193:30193 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.135<0>
cn135:30193:30193 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda10.2
cn135:30193:30246 [0] NCCL INFO Channel 00/02 : 0 1
cn135:30193:30246 [0] NCCL INFO Channel 01/02 : 0 1
cn135:30193:30246 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30193:30246 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/IB/0
cn135:30193:30246 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/IB/0
cn135:30193:30246 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn135:30193:30246 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn135:30193:30246 [0] NCCL INFO include/net.h:23 -> 2
cn135:30193:30246 [0] NCCL INFO transport/net.cc:223 -> 2
cn135:30193:30246 [0] NCCL INFO transport.cc:111 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:778 -> 2
cn135:30193:30246 [0] NCCL INFO init.cc:904 -> 2
cn135:30193:30246 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
train(args)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
resume_training = check_resume(args, output_folder)
File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
torch.distributed.barrier()
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30193) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json)
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-24_14:18:45
host : cn135
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30193)
error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_aznz52jc/159008_8iguorrd/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/larkins/git/sockeye/sockeye/train.py", line 872, in train
resume_training = check_resume(args, output_folder)
File "/home/larkins/git/sockeye/sockeye/train.py", line 179, in check_resume
torch.distributed.barrier()
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
============================================================
real 0m18.540s
user 0m3.384s
sys 0m6.924s
Looks like InfiniBand is not properly configured, or at least not set up to be compatible with NCCL.
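In the retry below, the InfiniBand transport was disabled so that NCCL falls back to plain TCP sockets; the log line "NCCL_IB_DISABLE set by environment to 1" corresponds to setting:

```bash
# Make NCCL skip the InfiniBand transport and use the socket transport instead.
export NCCL_IB_DISABLE=1
```

With this, NCCL initialization completes over NET/Socket, but the run still fails later during an all_gather.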
Master Log
Key error message
Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn135:30932:30932 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:30932:30932 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn135:30932:30932 [0] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.135<0>
cn135:30932:30932 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
cn135:30932:30989 [0] NCCL INFO Channel 00/02 : 0 1
cn135:30932:30989 [0] NCCL INFO Channel 01/02 : 0 1
cn135:30932:30989 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:30932:30989 [0] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [receive] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [send] via NET/Socket/0
cn135:30932:30989 [0] NCCL INFO Connected all rings
cn135:30932:30989 [0] NCCL INFO Connected all trees
cn135:30932:30989 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn135:30932:30989 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn135:30932:30989 [0] NCCL INFO comm 0x7ffef0000fa0 rank 0 nranks 2 cudaDev 0 busId 89000 - Init COMPLETE
cn135:30932:30932 [0] NCCL INFO Launch mode Parallel
[INFO:sockeye.utils] Sockeye: 3.1.9, commit unknown, path /home/larkins/git/sockeye/sockeye/__init__.py
[INFO:sockeye.utils] PyTorch: 1.11.0 (/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/__init__.py)
...
[INFO:sockeye.data_io] Shuffling the shards.
[INFO:sockeye.data_io] Loading shard corpora/prepared.shared_vocab/shard.00008.
[INFO:sockeye.data_io] Replicating bucket of 1 sentence(s) 2 times to cover 2 splits.
cn135:30932:30993 [0] include/socket.h:423 NCCL WARN Net : Connection closed by remote peer 10.11.0.136<20368>
cn135:30932:30993 [0] NCCL INFO transport/net_socket.cc:414 -> 2
cn135:30932:30993 [0] NCCL INFO include/net.h:28 -> 2
cn135:30932:30993 [0] NCCL INFO transport/net.cc:459 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:351 -> 2
cn135:30932:30993 [0] NCCL INFO proxy.cc:452 -> 2 [Proxy Thread]
[ERROR:root] Uncaught exception
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/sockeye-train", line 33, in <module>
sys.exit(load_entry_point('sockeye', 'console_scripts', 'sockeye-train')())
File "/home/larkins/git/sockeye/sockeye/train.py", line 845, in main
train(args)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
train_iter = ShardedParallelSampleIter(shard_fnames,
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
self.reset()
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
self._load_shard()
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
target_num_samples = max(utils.all_gather_object(target_num_samples))
File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
torch.distributed.all_gather_object(obj_list, obj)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
all_gather(output_tensors, input_tensor, group=group)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30932) of binary: sockeye-train
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json)
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-24_14:53:53
host : cn135
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30932)
error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_zzyxq__8/159009_f51vopx0/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
train_iter = ShardedParallelSampleIter(shard_fnames,
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
self.reset()
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1745, in reset
self._load_shard()
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1715, in _load_shard
dataset = ParallelDataSet.load(self.shards_fnames[self.shard_index]).fill_up(self.bucket_batch_sizes,
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1448, in fill_up
target_num_samples = max(utils.all_gather_object(target_num_samples))
File "/home/larkins/git/sockeye/sockeye/utils.py", line 616, in all_gather_object
torch.distributed.all_gather_object(obj_list, obj)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1661, in all_gather_object
all_gather(output_tensors, input_tensor, group=group)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2060, in all_gather
work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
============================================================
real 30m30.189s
user 31m38.458s
sys 28m35.649s

Worker Log
Key error messages
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an error of type RendezvousTimeoutError.

Full Log
Started sockeye.train at Wed Aug 24 14:23:30 EDT 2022
time torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=1 --master_addr=10.10.0.135 --master_port=39906 --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
cn136:970:970 [1] NCCL INFO Bootstrap : Using ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn136:970:970 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
cn136:970:970 [1] NCCL INFO NET/Socket : Using [0]ib0:10.11.0.136<0>
cn136:970:970 [1] NCCL INFO Using network Socket
cn136:970:1023 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
cn136:970:1023 [1] NCCL INFO Channel 00 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 0[89000] -> 1[8a000] [receive] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 00 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Channel 01 : 1[8a000] -> 0[89000] [send] via NET/Socket/0
cn136:970:1023 [1] NCCL INFO Connected all rings
cn136:970:1023 [1] NCCL INFO Connected all trees
cn136:970:1023 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cn136:970:1023 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
cn136:970:1023 [1] NCCL INFO comm 0x7ffefc000fa0 rank 1 nranks 2 cudaDev 1 busId 8a000 - Init COMPLETE
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 970) of binary: sockeye-train
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'cn136_961_0' has failed to send a keep-alive heartbeat to the rendezvous '159009' due to an
error of type RendezvousTimeoutError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json)
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sockeye-train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-24_14:53:47
host : cn136
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 970)
error_file: /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes/tmp/torchelastic_64urue0u/159009_mewv0m4k/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/larkins/git/sockeye/sockeye/train.py", line 915, in train
train_iter, eval_iter, config_data, source_vocabs, target_vocabs = create_data_iters_and_vocabs(
File "/home/larkins/git/sockeye/sockeye/train.py", line 283, in create_data_iters_and_vocabs
train_iter, validation_iter, data_config, source_vocabs, target_vocabs = data_io.get_prepared_data_iters(
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 840, in get_prepared_data_iters
train_iter = ShardedParallelSampleIter(shard_fnames,
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1710, in __init__
self.reset()
File "/home/larkins/git/sockeye/sockeye/data_io.py", line 1742, in reset
self.shards_fnames = utils.broadcast_object(self.shards_fnames)
File "/home/larkins/git/sockeye/sockeye/utils.py", line 609, in broadcast_object
torch.distributed.broadcast_object_list(obj_list, src=src)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1869, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '1', but store->get('1') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fff8e9961bd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x6c (0x7fff8e99290c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x11f (0x7fffcd49bfef in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x21 (0x7fffcd49cf71 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x5b (0x7fffcd49cffb in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x32 (0x7fffcd46eb42 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::string const&, int) + 0xe4 (0x7fff8fcf7834 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x1d9 (0x7fff8fcfb8c9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x341 (0x7fff8fd06c21 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x801f49 (0x7fffd54bef49 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1e5d37 (0x7fffd4ea2d37 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: PyCFunction_Call + 0x6e (0x55555568fe7e in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #13: _PyObject_MakeTpCall + 0x501 (0x555555678631 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #14: <unknown function> + 0x13bbfd (0x55555568fbfd in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #16: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #17: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #20: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #22: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #23: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x48dc (0x555555673ffc in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #26: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x67d (0x55555566fd9d in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x7d7 (0x55555566e957 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #29: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #30: <unknown function> + 0x1373b8 (0x55555568b3b8 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #31: _PyObject_MakeTpCall + 0x51c (0x55555567864c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x4ebf (0x5555556745df in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #33: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #34: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #36: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #37: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x10e8 (0x555555670808 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #39: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #40: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #41: PyObject_Call + 0x2d2 (0x555555692172 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x2150 (0x555555671870 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #43: _PyEval_EvalCodeWithName + 0x9f6 (0x55555566eb76 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #44: _PyFunction_Vectorcall + 0x18c (0x55555568033c in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #46: _PyFunction_Vectorcall + 0xf6 (0x5555556802a6 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #47: _PyEval_EvalFrameDefault + 0x38b (0x55555566faab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #48: _PyEval_EvalCodeWithName + 0x2e1 (0x55555566e461 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #49: PyEval_EvalCodeEx + 0x39 (0x55555572dde9 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #50: PyEval_EvalCode + 0x1b (0x55555572ddab in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #51: <unknown function> + 0x1fa903 (0x55555574e903 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #52: <unknown function> + 0x1f98e3 (0x55555574d8e3 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #53: <unknown function> + 0x99f2f (0x5555555edf2f in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #54: PyRun_SimpleFileExFlags + 0x364 (0x5555555eda23 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #55: <unknown function> + 0x8d0ac (0x5555555e10ac in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #56: Py_BytesMain + 0x39 (0x555555722219 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
frame #57: __libc_start_main + 0xf5 (0x7ffff6f02555 in /lib64/libc.so.6)
frame #58: <unknown function> + 0x1ce125 (0x555555722125 in /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.9.debug/bin/python)
============================================================
real 30m30.101s
user 0m6.278s
sys 0m11.023s
I discovered a misconfiguration of the head node's firewall today that probably caused the timeout error you saw in the logs. Please try again when you have time and report back.
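An illustrative way to confirm the firewall change from a compute node, using the head-node address and rendezvous port from the failing run above (availability of nc is an assumption, and the port only listens while a job's rendezvous is active):

```bash
# Probe TCP reachability of the head node's rendezvous endpoint from a compute node.
nc -zv 10.10.0.135 40017
```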
Hi,
I'm trying to run the new Sockeye-3 in multi-node, multi-GPU mode and it fails. I opened a ticket with Sockeye and their hypothesis is that the allowed port range is too small. Sockeye-3 uses PyTorch 1.10 and NCCL and tries to create a C10D rendezvous service to synchronize the workers, but there is no way to specify a port; it randomly chooses one. My request is to widen the allowed port range on trixie's worker nodes and the head node.
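For reference, the launch pattern (taken from the runs logged above: head node 10.10.0.135, 2 nodes, 1 GPU process per node) looks like this. The rendezvous endpoint port is passed explicitly, but NCCL and the c10d store also open additional ephemeral ports on each node, which is why the allowed range matters:

```bash
torchrun --no_python --nnodes=2 --nproc_per_node=1 --node_rank=0 \
    --master_addr=10.10.0.135 --master_port=39906 \
    --rdzv_id=159009 --rdzv_backend=c10d --rdzv_endpoint=10.10.0.135:40017 \
    sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
```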
Note for myself:
/gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/v3-HoC/en2fr/baseline.hf.sockeye-3.1.0_2nodes
source /gpfs/projects/DT/mtp/models/Senate_HOC-Debates-2019-11-15/nmt.sockeye-3/tools/activate
sbatch train.slurm
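The train.slurm script itself is not included here. A minimal sketch of what such a script could look like, assuming standard SLURM directives and the torchrun invocation seen in the logs (node count, GPU count per node, and the port choice are assumptions):

```bash
#!/bin/bash
#SBATCH --job-name=sockeye-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

# Use the first allocated node as the rendezvous host (illustrative only).
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
RDZV_PORT=40017  # illustrative; the actual script picks a port at runtime

# One torchrun per node; --node_rank is omitted because the c10d rendezvous
# assigns ranks (the real runs above passed it explicitly per node).
srun torchrun --no_python \
    --nnodes="$SLURM_NNODES" --nproc_per_node=1 \
    --rdzv_id="$SLURM_JOB_ID" --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:${RDZV_PORT}" \
    sockeye-train --dist --quiet-secondary-workers --config=model_config.yaml
```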