install latest version of cuda #66
Comments
It would be kind if you could also install the latest version of Armadillo, please. |
@kryczko There is ongoing work to test a new image which will allow an upgrade to CUDA 11. This will take a while still but it is in the plans. @Spationaute Armadillo is available as a module up to 9.900.2 via the Compute Canada stack. However newer versions (up to 10.6.2) can be loaded using a conda-forge environment. |
great thanks for the update! |
Any progress on this issue, i.e. installing cuda-10.2 or cuda-11.3? |
I would also like to know if there has been an updated version of cuda -- I have a code that I am unable to run until the latest version of cuda is installed. |
Is there an ETA on this? |
@SamuelLarkin additional nodes have been reimaged and put into the Cuda11Test queue. There are now 6 nodes there to run CUDA 11 jobs. Please use those nodes as you require. |
Thanks for the new nodes. Can't wait to get the whole cluster updated. |
Unfortunately, I think I ran into an error related to tcp/ip or friend.
|
There was a configuration issue with DNS on the new Cuda11 nodes. This has been corrected so please try again and let me know if you have any further issues.
Thank you,
Gavin
On 07 April 2022 16:12, Samuel Larkin wrote:
Unfortunately, I think I ran into an error related to tcp/ip or friend.
As a trace, this was encountered when trying to run comet-compare.
The problem only manifested itself on cn104 & cn105; the same command ran perfectly fine on cn102 & cn103.
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
conn.connect()
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 358, in connect
self.sock = conn = self._new_conn()
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
resp = conn.urlopen(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
retries = retries.increment(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/xlm-roberta-large (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/bin/comet-compare", line 8, in <module>
sys.exit(compare_command())
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/cli/compare.py", line 190, in compare_command
model = load_from_checkpoint(model_path)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/__init__.py", line 58, in load_from_checkpoint
model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 157, in load_from_checkpoint
model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 199, in _load_model_state
model = cls(**_cls_kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/regression/regression_metric.py", line 75, in __init__
super().__init__(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/base.py", line 109, in __init__
self.encoder = str2encoder[self.hparams.encoder_model].from_pretrained(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/encoders/xlmr.py", line 49, in from_pretrained
return XLMREncoder(pretrained_model)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/encoders/xlmr.py", line 36, in __init__
self.tokenizer = XLMRobertaTokenizer.from_pretrained(pretrained_model)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1649, in from_pretrained
fast_tokenizer_file = get_fast_tokenizer_file(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3425, in get_fast_tokenizer_file
all_files = get_list_of_files(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/file_utils.py", line 1730, in get_list_of_files
model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 867, in model_info
r = requests.get(path, headers=headers, timeout=timeout)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/xlm-roberta-large (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
|
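The innermost exception in the trace above is `socket.gaierror: [Errno -2] Name or service not known`, raised by `socket.getaddrinfo`: the node could not resolve `huggingface.co`, so every layer above it (urllib3, requests, transformers) failed in turn. A minimal sketch for checking name resolution on a node before launching a job, using only the standard library; the helper name `can_resolve` is hypothetical and not part of any tool mentioned in this thread:

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the local resolver can map `host` to an address."""
    try:
        # Same call that fails at the bottom of the urllib3 traceback.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        # Raised for "Name or service not known" and similar resolver errors.
        return False


if __name__ == "__main__":
    host = "huggingface.co"
    status = "resolves" if can_resolve(host) else "does NOT resolve"
    print(f"{socket.gethostname()}: {host} {status}")
```

Running this on each node (for example over `ssh`) would show which nodes have a working resolver without having to start the full `comet-compare` job.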
@NRCGavin I ran the same test that discovered the previous error and everything computed fine. Network on cn104 & cn105 worked. Thanks |
@NRCGavin new problem:
whoami; for i in {102..108}; do ssh -t cn$i whoami; done
larkins
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn102 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn103 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn104 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn105 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn106 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn107 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn108 closed. |
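`whoami` prints "cannot find name for user ID" when the numeric UID of the process has no entry in the user database that the node can see. The same lookup can be reproduced from Python with the standard `pwd` module, which may help when comparing the login node with the compute nodes. This is a sketch only; the helper name `username_for_uid` is hypothetical:

```python
import os
import pwd
from typing import Optional


def username_for_uid(uid: int) -> Optional[str]:
    """Return the account name for `uid`, or None if the node cannot map it."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # pwd raises KeyError when the UID has no passwd entry on this host.
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = username_for_uid(uid)
    if name is None:
        print(f"UID {uid} has no user name on this node")
    else:
        print(f"UID {uid} maps to {name}")
```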
Is whoami working again? Post-upgrade whoami works for my accounts on compute nodes. |
The latest version of PyTorch, which has converted more NumPy functions to PyTorch functions, requires either CUDA 10.2 or 11.1. Could one of these versions be installed?
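To see which CUDA version an installed PyTorch build targets, and whether it can see a GPU on the current node, something like the following could be run inside the job environment. It is a sketch that assumes PyTorch may or may not be installed; the function name `cuda_report` is hypothetical:

```python
from typing import Optional


def cuda_report() -> Optional[dict]:
    """Summarise the PyTorch/CUDA pairing, or return None without PyTorch."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # True only if a usable driver and GPU are visible to this process.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(cuda_report())
```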