Run torchrun aka pytorch distributed on multiple nodes & GPUs. #87
Note
I'm setting the port to 36363. The master appears to be listening on 36363 and has connections to the slave. The slave also shows a connection to the master on port 36363.
Netstat
Master
Slave
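For anyone who wants to reproduce the connectivity check from the slave side without netstat, here is a minimal sketch using only Python's standard library. The function name `can_reach` and the hostname `master-node` are placeholders of mine, not taken from the issue; only the port 36363 comes from the note above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call from the slave node (hostname is hypothetical):
# can_reach("master-node", 36363)
```

A `True` result only shows that the port accepts TCP connections. It does not by itself confirm that the torchrun rendezvous completes, so the failure described in this issue may still occur with an open port.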
I'm trying to run a script with torchrun to get my job running on multiple nodes and GPUs, but it fails. Related to #77
The error seems to be
Logs
Master
Slave
Code & Scripts
task.slurm
task.sh
main.py