install latest version of cuda #66

Open
kryczko opened this issue Sep 29, 2021 · 13 comments

Comments

@kryczko

kryczko commented Sep 29, 2021

The latest version of PyTorch, which has converted more NumPy functions to PyTorch functions, requires either CUDA 10.2 or 11.1. Could one of these versions be installed?
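
For anyone checking a node against this request, here is a minimal sketch that reads the toolkit release from `nvcc --version` and compares it with the two versions named above. The helper names and the `WANTED` set are illustrative assumptions, not part of the cluster's tooling. Note that `nvcc` only reports the toolkit found on `PATH`, which is not necessarily the CUDA runtime a prebuilt PyTorch binary uses.

```python
import re
import shutil
import subprocess

# Versions named in the request above (assumption: adjust for other PyTorch builds).
WANTED = {(10, 2), (11, 1)}


def parse_nvcc_release(text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Ask nvcc for its release; return None when nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def satisfies_request(release):
    """True when the detected release is one of the requested versions."""
    return release in WANTED
```

Calling `satisfies_request(installed_cuda_release())` on a compute node gives a quick yes/no answer; a `None` release means no toolkit was found on `PATH`.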

@Spationaute

It would be kind if you could also install the latest version of Armadillo, please.
http://arma.sourceforge.net

@ddamoursNRC
Collaborator

@kryczko There is ongoing work to test a new image which will allow an upgrade to CUDA 11. This will still take a while, but it is in the plans.

@Spationaute Armadillo is available as a module up to 9.900.2 via the Compute Canada stack. However, newer versions (up to 10.6.2) can be loaded using a conda-forge environment.

@kryczko
Author

kryczko commented Sep 29, 2021

Great, thanks for the update!

@SamuelLarkin
Collaborator

Any progress on this issue, i.e. installing cuda-10.2 or cuda-11.3?

@kryczko
Author

kryczko commented Jan 5, 2022

I would also like to know whether an updated version of CUDA has been installed. I have code that I am unable to run until the latest version of CUDA is available.

@SamuelLarkin
Collaborator

Is there an ETA on this?
We know we have a working image so why isn't it the default?
A lot of tools are now using pytorch-1.10 which requires cuda-10.

@ddamoursNRC
Collaborator

@SamuelLarkin Additional nodes have been reimaged and put into the Cuda11Test queue. There are now 6 nodes there to run CUDA 11 jobs. Please use those nodes as you require.
There is a major filesystem upgrade scheduled for the next few weeks. After that, the remainder of the nodes will be reimaged.

@SamuelLarkin
Collaborator

Thanks for the new nodes. Can't wait to get the whole cluster updated.

@SamuelLarkin
Collaborator

Unfortunately, I think I ran into an error related to TCP/IP or something similar.
The trace below was encountered when trying to run comet-compare.
The problem only manifested itself on cn104 & cn105; the same run worked perfectly fine on cn102 & cn103.

Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1040, in _validate_conn
    conn.connect()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/adapters.py", line 440, in send
    resp = conn.urlopen(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/xlm-roberta-large (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/bin/comet-compare", line 8, in <module>
    sys.exit(compare_command())
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/cli/compare.py", line 190, in compare_command
    model = load_from_checkpoint(model_path)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/__init__.py", line 58, in load_from_checkpoint
    model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 157, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 199, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/regression/regression_metric.py", line 75, in __init__
    super().__init__(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/models/base.py", line 109, in __init__
    self.encoder = str2encoder[self.hparams.encoder_model].from_pretrained(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/encoders/xlmr.py", line 49, in from_pretrained
    return XLMREncoder(pretrained_model)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/comet/encoders/xlmr.py", line 36, in __init__
    self.tokenizer = XLMRobertaTokenizer.from_pretrained(pretrained_model)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1649, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3425, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/transformers/file_utils.py", line 1730, in get_list_of_files
    model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 867, in model_info
    r = requests.get(path, headers=headers, timeout=timeout)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/gpfs/projects/DT/mtp/WMT20/opt/miniconda3/envs/comet-1.0_cu113/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/xlm-roberta-large (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ed32f7c10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
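
The innermost error in the trace above is `socket.gaierror: [Errno -2] Name or service not known`, raised from `socket.getaddrinfo`, which suggests that name resolution for `huggingface.co` failed on the affected nodes. A minimal sketch for reproducing that check on each node without loading any model follows; the function name `unresolved_hosts` and its `resolver` parameter are illustrative assumptions.

```python
import socket


def unresolved_hosts(hosts, port=443, resolver=socket.getaddrinfo):
    """Return {host: error message} for every host whose name lookup fails.

    `resolver` defaults to the same call that raised in the trace above;
    it is a parameter only so the check can be exercised without a network.
    """
    failures = {}
    for host in hosts:
        try:
            resolver(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror as err:
            failures[host] = str(err)
    return failures
```

Running `unresolved_hosts(["huggingface.co"])` on cn102 through cn105 and comparing the results would show whether the difference between nodes is in name resolution: an empty dict means the name resolved on that node.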

@NRCGavin
Collaborator

NRCGavin commented Apr 8, 2022 via email

@SamuelLarkin
Collaborator

@NRCGavin I ran the same test that discovered the previous error and everything computed fine. Network on cn104 & cn105 worked.

Thanks

@SamuelLarkin
Collaborator

@NRCGavin New problem:
whoami is not working, although it used to work on the cudatest11 image last week.

whoami; for i in {102..108}; do ssh -t cn$i whoami; done
larkins
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn102 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn103 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn104 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn105 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn106 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn107 closed.
whoami: cannot find name for user ID 171967808: No such file or directory
Connection to cn108 closed.
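
The message `cannot find name for user ID` means the node could not map the numeric UID to a passwd entry. The same lookup can be done from Python with the standard `pwd` module, which is handy inside a job script. This is a minimal sketch; the function name `username_for` and its `lookup` parameter are illustrative assumptions, and it only detects the missing entry, not why the name service on the node fails to provide it.

```python
import os
import pwd


def username_for(uid, lookup=pwd.getpwuid):
    """Return the login name for `uid`, or None if no passwd entry exists.

    pwd.getpwuid raises KeyError when the entry cannot be found, which is
    the condition whoami reports in the output above.
    """
    try:
        return lookup(uid).pw_name
    except KeyError:
        return None


def current_username():
    """Look up the name for the UID this process is running as."""
    return username_for(os.getuid())
```

A `None` result from `current_username()` on a compute node corresponds to the failing `whoami` calls shown above.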

@nrcfieldsa

Is whoami working again? Post-upgrade, whoami works for my accounts on the compute nodes.
