
cn119 fails to allocate GPU memory #92

Open
SamuelLarkin opened this issue May 30, 2023 · 2 comments
@SamuelLarkin
Collaborator

Hi,
each time I submit a job on cn119, I get a CUDA out-of-memory error when it tries to allocate GPU memory. The same job on a different node works fine.

Log

    ...skipping...
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/jit/_trace.py", line 759, in trace
      return trace_module(
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/jit/_trace.py", line 976, in trace_module
      module._c._create_method_from_trace(
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
      return forward_call(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
      result = self.forward(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/model.py", line 337, in forward
      target = self.decoder.decode_seq(target_embed, states=states)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/decoder.py", line 256, in decode_seq
      outputs, _ = self.forward(inputs, states)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/decoder.py", line 290, in forward
      target, new_layer_autoregr_state = layer(target=target,
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
      return forward_call(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
      result = self.forward(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/transformer.py", line 244, in forward
      target_ff = self.ff(self.pre_ff(target))
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
      return forward_call(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
      result = self.forward(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/transformer.py", line 338, in forward
      h = self.ff1(x)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
      return forward_call(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
      result = self.forward(*input, **kwargs)
    File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
      return F.linear(input, self.weight, self.bias)
  torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 31.75 GiB total capacity; 6.16 GiB already allocated; 5.19 MiB free; 6.18 GiB
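For reference, the log shows only 5.19 MiB free on a 31.75 GiB card while this job had allocated roughly 6 GiB, which suggests the rest of the memory may already be held on that GPU (e.g. by another job or a leftover process on cn119). A minimal sketch to check free vs. total memory on each visible GPU before launching (this is just a diagnostic snippet, not part of the job script):

    # Minimal check of free vs. total memory on every GPU PyTorch can see.
    # Run on cn119 before submitting the job to see whether memory is already held.
    import torch

    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
        print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB total")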
@NRCGavin
Collaborator

NRCGavin commented May 31, 2023 via email

@SamuelLarkin
Collaborator Author

Thanks for looking into this.
