Run torchrun aka pytorch distributed on multiple nodes & GPUs. #87

Open
SamuelLarkin opened this issue Sep 14, 2022 · 2 comments

SamuelLarkin commented Sep 14, 2022

I'm trying to run a script with torchrun to get my job running on multiple nodes and GPUs, but it fails.

Related to #77

The error seems to be:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
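
The first failure in the NCCL logs below is "misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed", i.e. NCCL cannot register memory with the InfiniBand HCA. A common cause is a too-small locked-memory limit on the compute nodes; a quick check (hypothetical command, not part of the runs below) would be:

srun --nodes=2 --ntasks-per-node=1 bash -c 'hostname; ulimit -l'
# every node should report "unlimited" (or a very large value) for max locked memory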

Logs

Master

name: pytorch-1.12.0
channels:
  - pytorch
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - attrs=22.1.0=pyh71513ae_1
  - blas=1.0=mkl
  - brotlipy=0.7.0=py38h27cfd23_1003
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.07.19=h06a4308_0
  - certifi=2022.6.15=py38h06a4308_0
  - cffi=1.15.1=py38h74dc2b5_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - colorama=0.4.5=pyhd8ed1ab_0
  - cryptography=37.0.1=py38h9ce1e76_0
  - cudatoolkit=11.3.1=h2bc3f7f_2
  - cudnn=8.4.1.50=hed8a83a_0
  - cxxfilt=0.3.0=py38hfa26641_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.11.0=h70c0345_0
  - giflib=5.2.1=h7b6447c_0
  - gmp=6.2.1=h295c915_3
  - gnutls=3.6.15=he1e5248_0
  - idna=3.3=pyhd3eb1b0_0
  - iniconfig=1.1.1=pyh9f0ad1d_0
  - jpeg=9e=h7f8727e_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=3.0=h295c915_0
  - libdeflate=1.8=h7f8727e_5
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.1.0=h8d9b700_16
  - libiconv=1.16=h7f8727e_2
  - libidn2=2.3.2=h7f8727e_0
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.37=hbc83047_0
  - libsqlite=3.39.2=h753d276_1
  - libstdcxx-ng=12.1.0=ha89aaad_16
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.4.0=hecacb30_0
  - libunistring=0.9.10=h27cfd23_0
  - libuuid=2.32.1=h7f98852_1000
  - libwebp=1.2.2=h55f646e_0
  - libwebp-base=1.2.2=h7f8727e_0
  - libzlib=1.2.12=h166bdaf_2
  - llvm-openmp=14.0.4=he0ac6c6_0
  - lz4-c=1.9.3=h295c915_1
  - mkl=2021.4.0=h8d4b97c_729
  - mkl-service=2.4.0=py38h95df7f1_0
  - mkl_fft=1.3.1=py38h8666266_1
  - mkl_random=1.2.2=py38h1abd341_0
  - nccl=2.14.3.1=h0800d71_0
  - ncurses=6.3=h27087fc_1
  - nettle=3.7.3=hbbd107a_1
  - numpy=1.23.1=py38h6c91a56_0
  - numpy-base=1.23.1=py38ha15fc14_0
  - nvidia-apex=0.1=py38h0f76e55_4
  - openh264=2.1.1=h4ff587b_0
  - openssl=3.0.5=h166bdaf_1
  - packaging=21.3=pyhd8ed1ab_0
  - pillow=9.2.0=py38hace64e9_1
  - pip=22.2.2=pyhd8ed1ab_0
  - pluggy=1.0.0=py38h578d9bd_3
  - py=1.11.0=pyh6c4a22f_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pysocks=1.7.1=py38h06a4308_0
  - pytest=7.1.2=py38h578d9bd_0
  - python=3.8.13=ha86cf86_0_cpython
  - python_abi=3.8=2_cp38
  - pytorch=1.12.0=py3.8_cuda11.3_cudnn8.3.2_0
  - pytorch-mutex=1.0=cuda
  - pyyaml=6.0=py38h0a891b7_4
  - readline=8.1.2=h0f457ee_0
  - requests=2.28.1=py38h06a4308_0
  - setuptools=65.3.0=py38h578d9bd_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.39.2=h4ff8645_1
  - tbb=2021.5.0=h924138e_1
  - tk=8.6.12=h27826a3_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - torchaudio=0.12.0=py38_cu113
  - torchvision=0.13.0=py38_cu113
  - tqdm=4.64.0=pyhd8ed1ab_0
  - typing_extensions=4.3.0=pyha770c72_0
  - urllib3=1.26.11=py38h06a4308_0
  - wheel=0.37.1=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - yaml=0.2.5=h7f98852_2
  - zlib=1.2.12=h7f8727e_2
  - zstd=1.5.2=ha4553b6_0
prefix: /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0
cn135:14298:14298 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.135<0>
cn135:14298:14298 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn135:14298:14298 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.135<0>
cn135:14298:14298 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
cn135:14298:14333 [0] NCCL INFO Channel 00/02 :    0   1
cn135:14298:14333 [0] NCCL INFO Channel 01/02 :    0   1
cn135:14298:14333 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
cn135:14298:14333 [0] NCCL INFO Channel 00 : 1[89000] -> 0[89000] [receive] via NET/IB/0
cn135:14298:14333 [0] NCCL INFO Channel 01 : 1[89000] -> 0[89000] [receive] via NET/IB/0
cn135:14298:14333 [0] NCCL INFO Channel 00 : 0[89000] -> 1[89000] [send] via NET/IB/0
cn135:14298:14333 [0] NCCL INFO Channel 01 : 0[89000] -> 1[89000] [send] via NET/IB/0

cn135:14298:14333 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn135:14298:14333 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn135:14298:14333 [0] NCCL INFO include/net.h:23 -> 2
cn135:14298:14333 [0] NCCL INFO transport/net.cc:223 -> 2
cn135:14298:14333 [0] NCCL INFO transport.cc:111 -> 2
cn135:14298:14333 [0] NCCL INFO init.cc:778 -> 2
cn135:14298:14333 [0] NCCL INFO init.cc:904 -> 2
cn135:14298:14333 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Namespace(batch_size=128, learning_rate=5e-05, local_rank=None, model_dir='saved_models', model_filename='resnet_distributed.pth', num_epochs=10000, random_seed=0, resume=False)
MASTER_ADDR: cn135
MASTER_PORT: 36363
LOCAL_RANK: 0
Initializing torch.distributed
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 173, in <module>
    main()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line
 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 113, in main
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across
_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL versi
on 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can
 check NCCL warnings for failure reason and see if there is connection closure by a peer.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14298) of binary: /gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distribut
ed/main.py
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_s6rgdlte/none_59r_q
acn/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line
 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-14_12:02:31
  host      : cn135
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14298)
  error_file: /tmp/torchelastic_s6rgdlte/none_59r_qacn/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", li
ne 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 113, in main
      ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
      _verify_param_shape_across_processes(self.process_group, parameters)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_acro
ss_processes
      return dist._verify_params_across_processes(process_group, tensors, logger)
  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL ver
sion 2.10.3
  ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you c
an check NCCL warnings for failure reason and see if there is connection closure by a peer.

============================================================

real    0m46.218s
user    0m3.216s
sys     0m7.519s

Slave

name: pytorch-1.12.0
channels:
  - pytorch
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - attrs=22.1.0=pyh71513ae_1
  - blas=1.0=mkl
  - brotlipy=0.7.0=py38h27cfd23_1003
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2022.07.19=h06a4308_0
  - certifi=2022.6.15=py38h06a4308_0
  - cffi=1.15.1=py38h74dc2b5_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - colorama=0.4.5=pyhd8ed1ab_0
  - cryptography=37.0.1=py38h9ce1e76_0
  - cudatoolkit=11.3.1=h2bc3f7f_2
  - cudnn=8.4.1.50=hed8a83a_0
  - cxxfilt=0.3.0=py38hfa26641_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.11.0=h70c0345_0
  - giflib=5.2.1=h7b6447c_0
  - gmp=6.2.1=h295c915_3
  - gnutls=3.6.15=he1e5248_0
  - idna=3.3=pyhd3eb1b0_0
  - iniconfig=1.1.1=pyh9f0ad1d_0
  - jpeg=9e=h7f8727e_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=3.0=h295c915_0
  - libdeflate=1.8=h7f8727e_5
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=12.1.0=h8d9b700_16
  - libiconv=1.16=h7f8727e_2
  - libidn2=2.3.2=h7f8727e_0
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.37=hbc83047_0
  - libsqlite=3.39.2=h753d276_1
  - libstdcxx-ng=12.1.0=ha89aaad_16
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.4.0=hecacb30_0
  - libunistring=0.9.10=h27cfd23_0
  - libuuid=2.32.1=h7f98852_1000
  - libwebp=1.2.2=h55f646e_0
  - libwebp-base=1.2.2=h7f8727e_0
  - libzlib=1.2.12=h166bdaf_2
  - llvm-openmp=14.0.4=he0ac6c6_0
  - lz4-c=1.9.3=h295c915_1
  - mkl=2021.4.0=h8d4b97c_729
  - mkl-service=2.4.0=py38h95df7f1_0
  - mkl_fft=1.3.1=py38h8666266_1
  - mkl_random=1.2.2=py38h1abd341_0
  - nccl=2.14.3.1=h0800d71_0
  - ncurses=6.3=h27087fc_1
  - nettle=3.7.3=hbbd107a_1
  - numpy=1.23.1=py38h6c91a56_0
  - numpy-base=1.23.1=py38ha15fc14_0
  - nvidia-apex=0.1=py38h0f76e55_4
  - openh264=2.1.1=h4ff587b_0
  - openssl=3.0.5=h166bdaf_1
  - packaging=21.3=pyhd8ed1ab_0
  - pillow=9.2.0=py38hace64e9_1
  - pip=22.2.2=pyhd8ed1ab_0
  - pluggy=1.0.0=py38h578d9bd_3
  - py=1.11.0=pyh6c4a22f_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pysocks=1.7.1=py38h06a4308_0
  - pytest=7.1.2=py38h578d9bd_0
  - python=3.8.13=ha86cf86_0_cpython
  - python_abi=3.8=2_cp38
  - pytorch=1.12.0=py3.8_cuda11.3_cudnn8.3.2_0
  - pytorch-mutex=1.0=cuda
  - pyyaml=6.0=py38h0a891b7_4
  - readline=8.1.2=h0f457ee_0
  - requests=2.28.1=py38h06a4308_0
  - setuptools=65.3.0=py38h578d9bd_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.39.2=h4ff8645_1
  - tbb=2021.5.0=h924138e_1
  - tk=8.6.12=h27826a3_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - torchaudio=0.12.0=py38_cu113
  - torchvision=0.13.0=py38_cu113
  - tqdm=4.64.0=pyhd8ed1ab_0
  - typing_extensions=4.3.0=pyha770c72_0
  - urllib3=1.26.11=py38h06a4308_0
  - wheel=0.37.1=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - yaml=0.2.5=h7f98852_2
  - zlib=1.2.12=h7f8727e_2
  - zstd=1.5.2=ha4553b6_0
prefix: /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0
cn136:20327:20327 [0] NCCL INFO Bootstrap : Using ib0:10.11.0.136<0>
cn136:20327:20327 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cn136:20327:20327 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.11.0.136<0>
cn136:20327:20327 [0] NCCL INFO Using network IB
cn136:20327:20392 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
cn136:20327:20392 [0] NCCL INFO Channel 00 : 0[89000] -> 1[89000] [receive] via NET/IB/0
cn136:20327:20392 [0] NCCL INFO Channel 01 : 0[89000] -> 1[89000] [receive] via NET/IB/0
cn136:20327:20392 [0] NCCL INFO Channel 00 : 1[89000] -> 0[89000] [send] via NET/IB/0
cn136:20327:20392 [0] NCCL INFO Channel 01 : 1[89000] -> 0[89000] [send] via NET/IB/0

cn136:20327:20392 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
cn136:20327:20392 [0] NCCL INFO transport/net_ib.cc:640 -> 2
cn136:20327:20392 [0] NCCL INFO include/net.h:23 -> 2
cn136:20327:20392 [0] NCCL INFO transport/net.cc:223 -> 2
cn136:20327:20392 [0] NCCL INFO transport.cc:111 -> 2
cn136:20327:20392 [0] NCCL INFO init.cc:778 -> 2
cn136:20327:20392 [0] NCCL INFO init.cc:904 -> 2
cn136:20327:20392 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Namespace(batch_size=128, learning_rate=5e-05, local_rank=None, model_dir='saved_models', model_filename='resnet_distributed.pth', num_epochs=10000, random_seed=0, resume=False)
MASTER_ADDR: cn135
MASTER_PORT: 36363
LOCAL_RANK: 0
Initializing torch.distributed
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 173, in <module>
    main()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line
 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 113, in main
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across
_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL versi
on 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can
 check NCCL warnings for failure reason and see if there is connection closure by a peer.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20327) of binary: /gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distribut
ed/main.py
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_dj__qkuc/none_q4s99
mna/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0', 'console_scripts', 'torchrun')())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line
 345, in wrapper
    return f(*args, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-09-14_12:02:31
  host      : cn136
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 20327)
  error_file: /tmp/torchelastic_dj__qkuc/none_q4s99mna/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", li
ne 345, in wrapper
      return f(*args, **kwargs)
    File "/gpfs/projects/DT/mtp/Project_SamuelL/pytorch.distributed/main.py", line 113, in main
      ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
      _verify_param_shape_across_processes(self.process_group, parameters)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/pytorch-1.12.0/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_acro
ss_processes
      return dist._verify_params_across_processes(process_group, tensors, logger)
  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352465323/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL ver
sion 2.10.3
  ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you c
an check NCCL warnings for failure reason and see if there is connection closure by a peer.

============================================================

real    0m46.217s
user    0m3.083s
sys     0m7.397s

Code & Scripts

task.slurm

#!/bin/bash

#SBATCH --job-name=pytorch.distributed
#SBATCH --comment="Pytorch multi nodes & multi GPUs testing"

# On GPSC5
##SBATCH --partition=gpu_v100
##SBATCH --account=nrc_ict__gpu_v100
# On Trixie
#SBATCH --partition=JobTesting
#SBATCH --account=dt-mtp

#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00
##SBATCH --ntasks=2

#SBATCH --wait-all-nodes=1
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=32G
##SBATCH --exclusive
#SBATCH --output=%x-%j.out


# USEFUL Bookmarks
# [Run PyTorch Data Parallel training on ParallelCluster](https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html)
# [slurm SBATCH - Multiple Nodes, Same SLURMD_NODENAME](https://stackoverflow.com/a/51356947)

{
   cat $0
} >&2

#readonly MASTER_ADDR_JOB=$SLURMD_NODENAME
readonly MASTER_ADDR_JOB=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# NOTE Trixie valid port range is 32768-60999
#readonly master_port_job=$(($RANDOM % 10000 + 32769))
readonly MASTER_PORT_JOB=36363

readonly srun='srun --output=%x-%j_%t.out'

( set -o posix ; set | sort )

$srun bash \
   task.sh \
      $MASTER_ADDR_JOB \
      $MASTER_PORT_JOB &

wait
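
For reference, a minimal usage sketch (assuming task.slurm, task.sh and main.py sit in the submission directory):

sbatch task.slurm        # one job, two nodes, one torchrun launcher per node
squeue --user "$USER"    # monitor; per-node logs land in pytorch.distributed-<jobid>_<taskid>.out
# note: with --ntasks-per-node=1, SLURM_TASKS_PER_NODE is 1, so each torchrun only
# starts one worker per node even though --gres=gpu:4 reserves four GPUs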

task.sh

#!/bin/bash

# USEFUL Bookmarks
# [Run PyTorch Data Parallel training on ParallelCluster](https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html)
# [slurm SBATCH - Multiple Nodes, Same SLURMD_NODENAME](https://stackoverflow.com/a/51356947)

#SBATCH --job-name=pytorch.distributed
#SBATCH --comment="Pytorch multi nodes & multi GPUs testing"

# On GPSC5
##SBATCH --partition=gpu_v100
##SBATCH --account=nrc_ict__gpu_v100
# On Trixie
#SBATCH --partition=JobTesting
#SBATCH --account=dt-mtp

#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00
#SBATCH --wait-all-nodes=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
# To reserve a whole node for yourself
##SBATCH --exclusive
#SBATCH --open-mode=append
#SBATCH --requeue
#SBATCH --signal=B:USR1@30
#SBATCH --output=%x-%j.out


#module load miniconda3-4.7.12.1-gcc-9.2.0-j2idqxp
#source activate molecule

{
   cat $0
} >&2

source /gpfs/projects/DT/mtp/WMT20/opt/miniconda3/bin/activate ""
conda activate pytorch-1.12.0

readonly MASTER_ADDR_JOB=$1
readonly MASTER_PORT_JOB=$2

export SLURM_TASKS_PER_NODE=${SLURM_TASKS_PER_NODE%%(*)}   # '4(x2)' => '4'
export LOCAL_RANK=$SLURM_LOCALID
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

#export NCCL_DEBUG=INFO
#export NCCL_DEBUG_SUBSYS=ALL
#export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,COLL
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# IMPORTANT we MUST have the following defined or else we get an error
# misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
# DOC [ibv_reg_mr](https://www.ibm.com/docs/en/aix/7.1?topic=management-ibv-reg-mr)
# Possible Solution: [InfiniBand](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#infiniband)
#export NCCL_IB_DISABLE=1
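# (Hypothetical diagnostics, not part of the original run: ibv_reg_mr typically
# fails when the locked-memory limit is too small for RDMA memory registration.)
ulimit -l                    # expect "unlimited" on InfiniBand-enabled nodes
#ibv_devinfo | head -n 20    # uncomment to confirm the mlx5_0 port is ACTIVE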

# [Choosing the network interface to use](https://pytorch.org/docs/1.10/distributed.html#choosing-the-network-interface-to-use)
#export NCCL_SOCKET_IFNAME=eth0   # => NCCL WARN Bootstrap : no socket interface found
#export NCCL_SOCKET_IFNAME=eno,ib
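# (Untested alternative, based on the NCCL bootstrap logs showing ib0: pin the
# bootstrap/socket traffic explicitly to the IB interface.)
#export NCCL_SOCKET_IFNAME=ib0
#export GLOO_SOCKET_IFNAME=ib0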

# Change the tmpdir so we can see the logs from where we are.
export TMPDIR=$(pwd)/tmp



( set -o posix ; set | sort )
conda env export

torchrun=torchrun
#torchrun="python -m torch.distributed.launch --use_env"   # DEPRECATED

time $torchrun \
   --no_python \
   --nnodes=$SLURM_NNODES \
   --node_rank=$SLURM_NODEID \
   --nproc_per_node=$SLURM_TASKS_PER_NODE \
   --master_addr=$MASTER_ADDR_JOB \
   --master_port=$MASTER_PORT_JOB \
   $(readlink -f main.py) \
      --batch_size 128 \
      --learning_rate 5e-5 &

sleep 10
{
   echo "MASTER_ADDR_JOB: ${MASTER_ADDR_JOB}"
   echo "MASTER_PORT_JOB: ${MASTER_PORT_JOB}"
   netstat -apn
} > ${SLURM_JOB_NAME}-${SLURM_JOBID}_${SLURM_NODEID}.netstat

wait

main.py

#!/usr/bin/env  python3


import torch
import torch.distributed
import torch.distributed.elastic.multiprocessing.errors

from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

import argparse
import os
import random
import numpy as np



def set_random_seeds(random_seed=0):
    torch.manual_seed(random_seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(random_seed)
    random.seed(random_seed)



def evaluate(model, device, test_loader):
    model.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total

    return accuracy



@torch.distributed.elastic.multiprocessing.errors.record
def main():
    num_epochs_default = 10000
    batch_size_default = 256 # 1024
    learning_rate_default = 0.1
    random_seed_default = 0
    model_dir_default = "saved_models"
    model_filename_default = "resnet_distributed.pth"

    # Each process runs on 1 GPU device specified by the local_rank argument.
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--local_rank", type=int, help="Local rank. Necessary for using the torch.distributed.launch utility.")
    parser.add_argument("--num_epochs", type=int, help="Number of training epochs.", default=num_epochs_default)
    parser.add_argument("--batch_size", type=int, help="Training batch size for one process.", default=batch_size_default)
    parser.add_argument("--learning_rate", type=float, help="Learning rate.", default=learning_rate_default)
    parser.add_argument("--random_seed", type=int, help="Random seed.", default=random_seed_default)
    parser.add_argument("--model_dir", type=str, help="Directory for saving models.", default=model_dir_default)
    parser.add_argument("--model_filename", type=str, help="Model filename.", default=model_filename_default)
    parser.add_argument("--resume", action="store_true", help="Resume training from saved checkpoint.")
    argv = parser.parse_args()

    # torchrun exports LOCAL_RANK, so read it from the environment; --local_rank
    # is only kept for the deprecated torch.distributed.launch utility.
    if argv.local_rank is not None:
        local_rank = argv.local_rank
    else:
        local_rank = int(os.environ["LOCAL_RANK"])
    num_epochs = argv.num_epochs
    batch_size = argv.batch_size
    learning_rate = argv.learning_rate
    random_seed = argv.random_seed
    model_dir = argv.model_dir
    model_filename = argv.model_filename
    resume = argv.resume

    print(argv)
    print(os.environ)
    print(f"MASTER_ADDR: {os.environ['MASTER_ADDR']}")
    print(f"MASTER_PORT: {os.environ['MASTER_PORT']}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    # Create directories outside the PyTorch program
    # Do not create directory here because it is not multiprocess safe
    '''
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    '''

    model_filepath = os.path.join(model_dir, model_filename)

    # We need to use seeds to make sure that the models initialized in different processes are the same
    set_random_seeds(random_seed=random_seed)

    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    print("Initializing torch.distributed")
    #torch.distributed.init_process_group(backend="nccl")
    torch.distributed.init_process_group(torch.distributed.Backend.NCCL)
    # torch.distributed.init_process_group(backend="gloo")

    # Encapsulate the model on the GPU assigned to the current process
    model = torchvision.models.resnet18(pretrained=False)

    device = torch.device("cuda:{}".format(local_rank))
    model = model.to(device)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

    # We only save the model who uses device "cuda:0"
    # To resume, the device for the saved model would also be "cuda:0"
    if resume:
        map_location = {"cuda:0": "cuda:{}".format(local_rank)}
        ddp_model.load_state_dict(torch.load(model_filepath, map_location=map_location))

    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    # Data should be prefetched
    # Download should be set to be False, because it is not multiprocess safe
    train_set = torchvision.datasets.CIFAR10(root="data", train=True, download=False, transform=transform)
    test_set  = torchvision.datasets.CIFAR10(root="data", train=False, download=False, transform=transform)

    # Restricts data loading to a subset of the dataset exclusive to the current process
    train_sampler = DistributedSampler(dataset=train_set)

    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=8)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=8)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-5)

    # Loop over the dataset multiple times
    for epoch in range(num_epochs):

        print("Local Rank: {}, Epoch: {}, Training ...".format(local_rank, epoch))

        # Save and evaluate model routinely
        if epoch % 10 == 0:
            if local_rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print("Epoch: {}, Accuracy: {}".format(epoch, accuracy))
                print("-" * 75)

        ddp_model.train()
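        # Note: train_sampler.set_epoch(epoch) is never called, so the
        # DistributedSampler reuses the same shuffling order every epoch.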

        for data in train_loader:
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    main()
@SamuelLarkin

Note
I can run this setup on GPSC5, so it sounds like something is still not properly configured on Trixie. Also, this set of scripts was used about two years ago to understand and successfully run multi-node, multi-GPU jobs.
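
A quick way to narrow this down (sketch only; the .ibcheck file names are made up, and it assumes the same conda environment on both clusters) would be to dump the locked-memory limit and IB device state on one node of each cluster and diff the results:

srun --partition=JobTesting --account=dt-mtp --nodes=1 \
   bash -c 'hostname; ulimit -l; ibv_devinfo | head -n 20' > trixie.ibcheck 2>&1
# repeat on GPSC5 (gpu_v100 / nrc_ict__gpu_v100), then: diff gpsc5.ibcheck trixie.ibcheck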


SamuelLarkin commented Sep 15, 2022

I'm setting the port to 36363, and it looks like the master is listening on 36363 and has connections to the slave. We can also see the slave connected to the master on port 36363.

Netstat

Master

pytorch.distributed-161471_0.netstat:MASTER_PORT_JOB: 36363
pytorch.distributed-161471_0.netstat:tcp6       0      0 :::36363                :::*                    LISTEN      27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:31558       10.10.0.135:36363       ESTABLISHED 27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:36363       10.10.0.135:31558       ESTABLISHED 27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:36363       10.10.0.136:6506        ESTABLISHED 27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:36363       10.10.0.136:6504        ESTABLISHED 27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:31560       10.10.0.135:36363       ESTABLISHED 27255/python  
pytorch.distributed-161471_0.netstat:tcp6       0      0 10.10.0.135:36363       10.10.0.135:31560       ESTABLISHED 27255/python  

Slave

pytorch.distributed-161471_1.netstat:MASTER_PORT_JOB: 36363
pytorch.distributed-161471_1.netstat:tcp6       0      0 10.10.0.136:6506        10.10.0.135:36363       ESTABLISHED 24878/python  
pytorch.distributed-161471_1.netstat:tcp6       0      0 10.10.0.136:6504        10.10.0.135:36363       ESTABLISHED 24878/python  
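
So the TCP rendezvous on port 36363 is healthy; it is the NCCL/InfiniBand transport set up afterwards that dies at ibv_reg_mr. If the nccl-tests binaries are available on Trixie, they would isolate that layer from PyTorch entirely (hypothetical path, not verified):

srun --nodes=2 --ntasks-per-node=1 --gres=gpu:1 \
   ./nccl-tests/build/all_reduce_perf -b 8 -e 64M -f 2 -g 1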
