cn119 fails to allocate GPU memory #92
Comments
It appears there was a process stuck on GPU0 that has now been terminated. I have rebooted the node and returned it to the queue. Let me know if you continue to have issues with the node.
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1395 C python 25081MiB |
+-----------------------------------------------------------------------------+
***@***.*** ~]# ps -aux | grep 1395
thomased 1395 1.7 14.4 89506440 28604880 pts/2 Tl May25 147:37 python train.py --model=TransformerSeq2Set --model_type=seq2set --times=1 --dataset=kp20k --device=cuda:0
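For future reference, a leftover compute process like this is easy to spot without reading the full nvidia-smi table by querying the compute apps directly. A minimal sketch (the helper below is hypothetical, not part of any trixie tooling; it assumes nvidia-smi is on PATH):

```python
# Hypothetical helper: list compute processes still holding GPU memory,
# so a stuck job like PID 1395 above stands out immediately.
import subprocess

def gpu_compute_processes():
    """Return (pid, process_name, used_memory) tuples reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return [tuple(field.strip() for field in line.split(","))
            for line in out.splitlines() if line]

if __name__ == "__main__":
    procs = gpu_compute_processes()
    if not procs:
        print("No compute processes are holding GPU memory.")
    for pid, name, mem in procs:
        print(f"PID {pid} ({name}) holds {mem}")
```

Anything that still shows up here after your own jobs have exited is a candidate for the kind of stuck process that was terminated above.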
From: Samuel Larkin
Sent: 30 May 2023 17:09:02
To: ai4d-iasc/trixie
Subject: [ai4d-iasc/trixie] cn119 fails to allocate GPU memory (Issue #92)
Hi,
each time I submit a job on cn119 I get a CUDA error about failing to allocate memory. The same job on a different node works fine.
Log
...skipping...
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/jit/_trace.py", line 759, in trace
return trace_module(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/jit/_trace.py", line 976, in trace_module
module._c._create_method_from_trace(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
result = self.forward(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/model.py", line 337, in forward
target = self.decoder.decode_seq(target_embed, states=states)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/decoder.py", line 256, in decode_seq
outputs, _ = self.forward(inputs, states)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/decoder.py", line 290, in forward
target, new_layer_autoregr_state = layer(target=target,
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
result = self.forward(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/transformer.py", line 244, in forward
target_ff = self.ff(self.pre_ff(target))
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
result = self.forward(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/sockeye/transformer.py", line 338, in forward
h = self.ff1(x)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1182, in _slow_forward
result = self.forward(*input, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/sockeye-3.1.31/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 31.75 GiB total capacity; 6.16 GiB already allocated; 5.19 MiB free; 6.18 GiB reserved in total by PyTorch)
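As a workaround while a node is in this state, a job can fail fast at startup when another process already holds most of the GPU, instead of dying mid-run with the error above. A minimal sketch, assuming a recent PyTorch (which provides torch.cuda.mem_get_info) and an arbitrary example threshold for a ~32 GiB card:

```python
# Hypothetical startup guard (not part of train.py or sockeye): abort early
# if the GPU does not have enough free memory for this job.
import sys
import torch

MIN_FREE_GIB = 24.0  # arbitrary example threshold for a ~32 GiB card

def check_gpu_free(device: int = 0, min_free_gib: float = MIN_FREE_GIB) -> None:
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    free_gib = free_bytes / 2**30
    if free_gib < min_free_gib:
        sys.exit(
            f"GPU {device}: only {free_gib:.2f} GiB of "
            f"{total_bytes / 2**30:.2f} GiB free; another process is "
            f"probably holding memory on this node."
        )

if __name__ == "__main__":
    check_gpu_free()
```

On the failing node above this check would exit immediately (roughly 25 GiB was held by the stuck process), while on a healthy node it does nothing.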
Thanks for looking into this