
ISC 24 - Tutorial on Distributed Training of Deep Neural Networks

Join Slack

All the code for the hands-on exercises can be found in this repository.


Setup

To request an account on Zaratan, please join Slack at the link above and fill out this Google form.

We have pre-built the dependencies required for this tutorial on Zaratan. They are activated automatically when you run the bash scripts below.

The training dataset (MNIST) has already been downloaded to /scratch/zt1/project/isc/shared/data/MNIST.
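
For reference, the session scripts load this copy through torchvision. A minimal sketch (torchvision expects root to be the directory containing the MNIST folder, so the parent of the shared path is used here; that layout is an assumption, and download=False avoids re-fetching on the compute nodes):

from torchvision import datasets, transforms

# Point torchvision at the pre-downloaded shared copy of MNIST.
train_set = datasets.MNIST(
    root="/scratch/zt1/project/isc/shared/data",
    train=True,
    download=False,
    transform=transforms.ToTensor(),
)
print(len(train_set))  # 60000 training images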

Basics of Model Training

Using PyTorch

cd session_1_basics/
sbatch --reservation=isc2024 run.sh
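
run.sh submits a standard single-GPU PyTorch training job. As a refresher, the core of such a script is the usual forward/backward/step loop; a generic sketch with a toy MNIST-sized model, not the exact tutorial code:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy MNIST classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

images = torch.randn(32, 1, 28, 28)      # stand-in for one DataLoader batch
labels = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = criterion(model(images), labels)  # forward pass
loss.backward()                          # backward pass
optimizer.step()                         # parameter update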

Mixed Precision

MIXED_PRECISION=true sbatch --reservation=isc2024 run.sh
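
Mixed precision runs the forward and backward passes in reduced precision while keeping master weights in fp32. A minimal sketch using PyTorch's native torch.cuda.amp (assumes a CUDA device; whether the tutorial script uses exactly this API under the MIXED_PRECISION flag is an assumption):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()     # scales the loss to avoid fp16 underflow

images = torch.randn(32, 1, 28, 28, device="cuda")
labels = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():          # forward pass in reduced precision
    loss = criterion(model(images), labels)
scaler.scale(loss).backward()
scaler.step(optimizer)                   # unscales gradients, then steps
scaler.update()                          # adapts the scale factor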

Activation Checkpointing

CHECKPOINT_ACTIVATIONS=true sbatch --reservation=isc2024 run.sh
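
Activation checkpointing discards intermediate activations during the forward pass and recomputes them during backward, trading extra compute for lower memory use. A minimal sketch with torch.utils.checkpoint (a generic illustration, not necessarily the tutorial's exact code):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(32, 512, requires_grad=True)

# Activations inside block are not stored; they are recomputed in backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()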

Data Parallelism

PyTorch Distributed Data Parallel (DDP)

cd session_2_data_parallelism
sbatch --reservation=isc2024 run_ddp.sh
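
DDP replicates the model on every GPU and all-reduces gradients during the backward pass. A minimal sketch of the wrapping (a generic example meant to be launched with torchrun; the tutorial's run_ddp.sh handles the launch via Slurm):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(512, 512).cuda()
model = DDP(model, device_ids=[local_rank])  # gradient all-reduce happens here
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 512, device="cuda")
loss = model(x).sum()
loss.backward()                              # all-reduce overlaps with backward
optimizer.step()
dist.destroy_process_group()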

Zero Redundancy Optimizer (ZeRO)

sbatch --reservation=isc2024 run_deepspeed.sh
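
ZeRO removes data-parallel redundancy by sharding optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across ranks. A minimal DeepSpeed sketch (the config values are illustrative, not the tutorial's actual settings):

import torch
import torch.nn as nn
import deepspeed

ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},  # shard optimizer states across ranks
}

model = nn.Linear(512, 512)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(32, 512, device=engine.device)
loss = engine(x).sum()
engine.backward(loss)   # DeepSpeed manages scaling and gradient partitioning
engine.step()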

Intra-layer (Tensor) Parallelism

cd session_3_intra_layer_parallelism
sbatch --reservation=isc2024 run.sh
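
Intra-layer parallelism splits the work inside individual layers across GPUs. As a conceptual illustration only (the hands-on code uses its own implementation), each rank can hold a column slice of a linear layer's weight, with the partial outputs gathered afterwards:

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # launched with torchrun
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

out_features, in_features = 512, 512
# Each rank owns out_features // world columns of the full weight matrix.
w_shard = torch.randn(out_features // world, in_features, device="cuda")

x = torch.randn(32, in_features, device="cuda")  # input is replicated
y_shard = x @ w_shard.t()                        # partial output on this rank

shards = [torch.empty_like(y_shard) for _ in range(world)]
dist.all_gather(shards, y_shard)                 # reassemble the full activation
y = torch.cat(shards, dim=1)
dist.destroy_process_group()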

Inter-layer (Pipeline) Parallelism

cd session_4_inter_layer_parallelism
sbatch --reservation=isc2024 run.sh
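
Inter-layer parallelism places consecutive groups of layers on different GPUs and streams activations between them. A forward-only, two-stage sketch with point-to-point sends (real pipeline schedules also split each batch into micro-batches and pass gradients back through the stages):

import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="gloo")   # gloo: CPU send/recv for illustration
rank = dist.get_rank()                    # run with torchrun --nproc_per_node=2

with torch.no_grad():
    if rank == 0:
        stage = nn.Linear(512, 256)       # first half of the model
        act = stage(torch.randn(32, 512))
        dist.send(act, dst=1)             # ship activations to the next stage
    else:
        stage = nn.Linear(256, 10)        # second half of the model
        act = torch.empty(32, 256)
        dist.recv(act, src=0)
        out = stage(act)
dist.destroy_process_group()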

Hybrid Inter-layer (Pipeline) + Data Parallelism

HYBRID_PARR=true sbatch --reservation=isc2024 run.sh
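
Hybrid parallelism arranges the GPUs in a 2D grid: each row forms one pipeline, and the ranks holding the same stage across rows form a data-parallel group for gradient all-reduce. A sketch of building such groups with torch.distributed (the grid shape is an assumed example; the tutorial's HYBRID_PARR path may organize its groups differently):

import torch.distributed as dist

dist.init_process_group(backend="nccl")
world = dist.get_world_size()
pipeline_depth = 2                         # assumed: 2 pipeline stages
data_parallel = world // pipeline_depth

# Every rank must create every group, in the same order.
pipeline_groups = [
    dist.new_group(list(range(r * pipeline_depth, (r + 1) * pipeline_depth)))
    for r in range(data_parallel)          # rows: activations flow within these
]
data_groups = [
    dist.new_group(list(range(c, world, pipeline_depth)))
    for c in range(pipeline_depth)         # columns: gradients all-reduce here
]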
