Torch Scatter gives an illegal memory access #456

Open
IsmaelElsharkawi opened this issue Aug 10, 2024 · 1 comment

IsmaelElsharkawi commented Aug 10, 2024

Hi @rusty1s,

Thanks for the awesome work of putting together and maintaining pytorch_scatter.
I'm hitting an illegal memory access in scatter with reduce='max' when I run the following code:

from torch_scatter import scatter
import torch
x_j = torch.randn((12143200, 192), dtype=torch.float32).to('cuda:0')
edge_index = torch.randint(low=0, high=73727, size=(12143200,)).to('cuda:0')
out = scatter(src=x_j.to(torch.float32), index=edge_index, dim=0, dim_size=73728, reduce='max') 
print(out)

I run export CUDA_LAUNCH_BLOCKING=1 before running this code.

I'm using one V100 GPU with 32 GB of memory to run this code. Here's my nvidia-smi output:

Sat Aug 10 13:21:38 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   34C    P0    46W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here's my conda environment:

name: MyEnv
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - blas=1.0=mkl
  - brotli-python=1.0.9=py37hd23a5d3_7
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2023.08.22=h06a4308_0
  - certifi=2022.12.7=py37h06a4308_0
  - charset-normalizer=3.3.0=pyhd8ed1ab_0
  - cudatoolkit=10.2.89=hfd86e86_1
  - ffmpeg=4.3.2=hca11adc_0
  - flit-core=3.6.0=pyhd3eb1b0_0
  - freetype=2.12.1=h4a9f257_0
  - giflib=5.2.1=h5eee18b_3
  - gmp=6.2.1=h58526e2_0
  - gnutls=3.6.13=h85f3911_1
  - idna=3.4=pyhd8ed1ab_0
  - intel-openmp=2023.1.0=hdb19cb5_46305
  - jpeg=9b=h024ee3a_2
  - lame=3.100=h7f98852_1001
  - lcms2=2.12=h3be6417_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.4=h6a678d5_0
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libpng=1.6.39=h5eee18b_0
  - libstdcxx-ng=11.2.0=h1234567_1
  - libtiff=4.2.0=h85742a9_0
  - libuv=1.44.2=h5eee18b_0
  - libwebp=1.2.0=h89dd481_0
  - libwebp-base=1.2.0=h27cfd23_0
  - lz4-c=1.9.4=h6a678d5_0
  - mkl=2020.2=256
  - mkl-service=2.3.0=py37he8ac12f_0
  - mkl_fft=1.3.0=py37h54f3939_0
  - mkl_random=1.1.1=py37h0573a6f_0
  - ncurses=6.4=h6a678d5_0
  - nettle=3.6=he412f7d_0
  - ninja=1.10.2=h06a4308_5
  - ninja-base=1.10.2=hd09550d_5
  - openh264=2.1.1=h780b84a_0
  - openssl=1.1.1w=h7f8727e_0
  - pillow=9.3.0=py37hace64e9_1
  - pip=22.3.1=py37h06a4308_0
  - pysocks=1.7.1=py37h89c1867_5
  - python=3.7.16=h7a1cb2a_0
  - python_abi=3.7=2_cp37m
  - pytorch-mutex=1.0=cuda
  - pyyaml=6.0=py37h5eee18b_1
  - readline=8.2=h5eee18b_0
  - requests=2.31.0=pyhd8ed1ab_0
  - setuptools=65.6.3=py37h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.41.2=h5eee18b_0
  - tbb=2021.8.0=hdb19cb5_0
  - timm=0.3.2=pyhd8ed1ab_0
  - tk=8.6.12=h1ccaba5_0
  - typing_extensions=4.4.0=py37h06a4308_0
  - urllib3=2.0.6=pyhd8ed1ab_0
  - wheel=0.38.4=py37h06a4308_0
  - x264=1!161.3030=h7f98852_1
  - xz=5.4.2=h5eee18b_0
  - yaml=0.2.5=h7b6447c_0
  - zlib=1.2.13=h5eee18b_0
  - zstd=1.4.9=haebb681_0
  - pip:
      - cffi==1.15.1
      - cryptography==42.0.5
      - cupy-cuda102==11.6.0
      - cycler==0.11.0
      - fastrlock==0.8.2
      - fonttools==4.38.0
      - jinja2==3.1.3
      - joblib==1.3.2
      - kiwisolver==1.4.5
      - markupsafe==2.1.5
      - matplotlib==3.5.3
      - numpy==1.21.6
      - nvidia-cublas-cu11==11.10.3.66
      - nvidia-cuda-nvrtc-cu11==11.7.99
      - nvidia-cuda-runtime-cu11==11.7.99
      - nvidia-cudnn-cu11==8.5.0.96
      - packaging==23.2
      - pandas==1.3.5
      - psutil==5.9.8
      - pycparser==2.21
      - pydeprecate==0.3.2
      - pyopenssl==24.1.0
      - pyparsing==3.1.1
      - python-dateutil==2.8.2
      - pytz==2023.3.post1
      - scikit-learn==1.0.2
      - scipy==1.7.3
      - threadpoolctl==3.1.0
      - torch==1.7.1+cu110
      - torch-geometric==2.3.1
      - torch-scatter==2.0.7
      - torchaudio==0.7.2
      - torcheval==0.0.7
      - torchmetrics==0.7.2
      - torchprofile==0.0.4
      - torchvision==0.8.2+cu110
      - tqdm==4.66.2
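
The listing above mixes several CUDA builds, which is easy to miss in a long file. A minimal sketch that pulls the CUDA version out of a few of those pins is below. The helper name `cuda_tag` is hypothetical, the pins are copied from the listing with build strings dropped, and whether this mix has anything to do with the error is not established here. Note that the torch-scatter pin carries no CUDA tag, so the listing alone does not say which build of it is installed.

```python
import re

# Package pins copied from the environment listing above (build strings dropped).
PINS = [
    "cudatoolkit=10.2.89",
    "cupy-cuda102==11.6.0",
    "nvidia-cuda-runtime-cu11==11.7.99",
    "torch==1.7.1+cu110",
    "torchvision==0.8.2+cu110",
    "torch-scatter==2.0.7",
]


def cuda_tag(pin):
    """Return the CUDA version a pin points at, or None if it names none.

    Hypothetical helper for illustration only; it understands the
    `cudatoolkit=X.Y.Z` pin and the `cuNNN` / `cudaNNN` name suffixes.
    """
    name, _, version = pin.partition("=")
    if name == "cudatoolkit":
        # The package version is the CUDA version itself; keep major.minor.
        return ".".join(version.lstrip("=").split(".")[:2])
    match = re.search(r"(?:cuda|cu)(\d{2,3})\b", pin)
    if match is None:
        return None
    digits = match.group(1)
    # "110" -> "11.0", "102" -> "10.2", "11" -> "11"
    return digits[:2] + ("." + digits[2:] if len(digits) > 2 else "")


for pin in PINS:
    print(f"{pin:40s} CUDA {cuda_tag(pin)}")
```

For comparison, nvidia-smi above reports driver support up to CUDA 11.8.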

This is the error I face:

Traceback (most recent call last):
  File "playground.py", line 5, in <module>
    out = scatter(src=x_j.to(torch.float32), index=edge_index, dim=0, dim_size=73728, reduce='max') 
  File "/raid/ismail2/miniconda3/envs/MyEnv/lib/python3.7/site-packages/torch_scatter/scatter.py", line 161, in scatter
    return scatter_max(src, index, dim, out, dim_size)[0]
  File "/raid/ismail2/miniconda3/envs/MyEnv/lib/python3.7/site-packages/torch_scatter/scatter.py", line 73, in scatter_max
    return torch.ops.torch_scatter.scatter_max(src, index, dim, out, dim_size)
RuntimeError: CUDA error: an illegal memory access was encountered

I've been stuck on this for a while and would really appreciate any help. Thanks.

PS: As far as I understand, an illegal memory access error is different from an out-of-memory error.
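
To put rough numbers on that distinction, here is a small sketch that uses only the shapes from the script above. It assumes nothing about torch_scatter's internals. The comparison against the signed 32-bit limit is included only as an observation; whether it is related to the error is a guess, not something confirmed in this thread.

```python
# Shapes copied from the reproduction script above.
ROWS, COLS = 12_143_200, 192
DIM_SIZE = 73_728
FLOAT32_BYTES = 4
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold

src_elements = ROWS * COLS
src_gib = src_elements * FLOAT32_BYTES / 2**30
out_mib = DIM_SIZE * COLS * FLOAT32_BYTES / 2**20

print(f"src elements: {src_elements:,}")
print(f"src size:     {src_gib:.2f} GiB")
print(f"out size:     {out_mib:.2f} MiB")
print(f"src elements exceed int32 range: {src_elements > INT32_MAX}")
```

The source tensor comes to under 9 GiB, well within the 32 GiB on the card, which fits with this not being an out-of-memory failure. It also holds 2,331,494,400 elements, which is more than a signed 32-bit integer can index.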

@IsmaelElsharkawi
Author

I've recreated the issue in a Kaggle notebook: https://www.kaggle.com/code/ismaelelsharkawi/torch-scatter-gives-an-illegal-memory-access-456
