distributed-training

Here are 156 public repositories matching this topic...

PaddlePaddle / Paddle

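PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of PaddlePaddle 『飞桨』: high-performance single-machine and distributed training, plus cross-platform deployment, for deep learning and machine learning)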

python machine-learning deep-learning neural-network scalability efficiency paddlepaddle distributed-training

Updated Sep 20, 2024
C++

determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

kubernetes data-science machine-learning deep-learning tensorflow keras pytorch hyperparameter-optimization hyperparameter-tuning hyperparameter-search distributed-training ml-infrastructure mlops ml-platform

Updated Sep 20, 2024
Go

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Updated Sep 20, 2024
Python

AshishKumar4 / FlaxDiff

A simple, easy-to-understand library for diffusion models using Flax and JAX. Includes detailed notebooks on DDPM, DDIM, and EDM with simplified mathematical explanations. Made as part of my journey of learning and experimenting with state-of-the-art generative AI.

Updated Sep 19, 2024
Jupyter Notebook

NoteDance / Note

Machine learning library, Distributed training, Deep learning, Reinforcement learning, Models, TensorFlow, PyTorch

Updated Sep 19, 2024
Python

intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

k8s distributed-training llm-training

Updated Sep 19, 2024
Python

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

nlp search-engine compression sentiment-analysis transformers information-extraction question-answering llama pretrained-models embedding bert semantic-analysis distributed-training ernie neural-search uie document-intelligence paddlenlp llm

Updated Sep 19, 2024
Python

PinJhih / ddp-trainer

A simple package for distributed model training using Distributed Data Parallel (DDP) in PyTorch.

pytorch ddp distributed-training parallel-training

Updated Sep 19, 2024
Python

huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Updated Sep 18, 2024
Python

FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

machine-learning deep-learning inference-engine model-deployment model-serving distributed-training federated-learning mlops edge-ai ai-agent on-device-training

Updated Sep 18, 2024
Python

Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

nlp deep-learning transformer large-scale data-parallelism model-parallelism distributed-training self-supervised-learning oneflow pipeline-parallelism vision-transformer

Updated Sep 19, 2024
Python

pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

python kubernetes components machine-learning airflow deep-learning slurm pipelines pytorch ray aws-batch distributed-training

Updated Sep 17, 2024
Python

walln / loadax

Dataloading for JAX

datasets ddp distributed-training dataloading jax xla fsdp

Updated Sep 17, 2024
Python

aws / sagemaker-xgboost-container

This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that lets customers use their own XGBoost scripts in SageMaker.

python training aws machine-learning inference xgboost gbm distributed-training sagemaker

Updated Sep 17, 2024
Python

chairc / Integrated-Design-Diffusion-Model

IDDM (Industrial, landscape, animate...), supports DDPM, DDIM, PLMS, webui and multi-GPU distributed training. PyTorch implementation, generative model, diffusion model, distributed training.

distributed-computing pytorch generative-model webui industrial unet distributed-training diffusion-models ddpm plms ddim aigc

Updated Sep 17, 2024
Python

saforem2 / ezpz

Train across all your devices, ezpz 🍋

python machine-learning launcher rich distributed-training

Updated Sep 19, 2024
Python

tanyuqian / redco

NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference

Updated Sep 15, 2024
Python

zh320 / realtime-semantic-segmentation-pytorch

PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, SwiftNet, with support for knowledge distillation, distributed training, etc.

real-time pytorch enet semantic-segmentation knowledge-distillation cityscapes distributed-training

Updated Sep 12, 2024
Python

pinpoint-apm / pinpoint-node-agent

Pinpoint Node.js agent

agent performance node monitoring apm pinpoint distributed-training

Updated Sep 12, 2024
JavaScript

determined-ai / determined-examples

Example ML projects that use the Determined library.

machine-learning deep-learning tensorflow keras pytorch hyperparameter-tuning distributed-training ml-infrastructure

Updated Sep 11, 2024
Python
