Releases: mosaicml/composer

v0.18.0

25 Jan 20:44

This release has been yanked; please skip directly to Composer v0.18.1.

New Features

1. Improved DTensor Support

Composer now supports elastic saving and loading of DTensors at various mesh sizes.

2. Checkpoint Saving and Loading from Databricks MLFlow

Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.

composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)

Deprecations

  • Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827

Full Changelog: v0.17.2...v0.18.0

v0.17.2

14 Dec 20:02
7e0e40a

New Features

1. Torch 2.1.1 Support

Composer now supports torch 2.1.1! This new torch release primarily fixes several small bugs that we had previously worked around with monkeypatches in Composer.

2. Faster OCI Upload/Download

Composer now supports multi-part upload and download to OCI, which should speed up object store transfer times.
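
No code changes are required to benefit: checkpoints written to or read from an oci:// URI will use multi-part transfers under the hood. A minimal sketch (the bucket and checkpoint names below are illustrative, not part of this release):

composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='oci://my-bucket/checkpoints',                # hypothetical bucket name
    load_path='oci://my-bucket/checkpoints/latest-rank0.pt',  # hypothetical checkpoint name
    ...
)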

3. Memory Profiling

We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization is broken down by various components on your GPUs.
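
A minimal sketch of enabling this, assuming the Profiler's torch_prof_* options (the folder names and schedule values here are illustrative):

from composer.profiler import JSONTraceHandler, Profiler, cyclic_schedule

composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    profiler=Profiler(
        trace_handlers=[JSONTraceHandler(folder='composer_traces')],
        schedule=cyclic_schedule(wait=0, warmup=1, active=4, repeat=1),
        torch_prof_folder='torch_traces',
        torch_prof_profile_memory=True,  # include the memory utilization breakdown in the trace
    ),
    ...
)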

Bug Fixes

1. FSDP Initialization with Meta

Previously, our FSDP integration had a bug when initializing weights with device=meta, which resulted in an additional scaling factor. This has now been fixed, so weight initialization is consistent regardless of device or distributed strategy.

Full Changelog: v0.17.1...v0.17.2

v0.17.1

27 Nov 22:07
2b3e2a6

Bug Fixes

1. MosaicML Logger Robustness (#2728)

We've improved the MosaicML logger to be more robust to faulty serialization.

Full Changelog: v0.17.0...v0.17.1

v0.17.0

16 Nov 00:23
83a40f5

What's New

1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)

Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is sharded within blocks of controllable size and replicated across blocks. By default, Composer shards the model within a node and replicates across nodes, but it will accept a tuple of process groups to specify custom shard/replicate sizes. This is specified in the FSDP config.

  composer_model = MyComposerModel(n_layers=3)

  fsdp_config = {
      'sharding_strategy': 'HYBRID_SHARD',
  }

  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      fsdp_config=fsdp_config,
      ...
  )

HYBRID_SHARD applies FULL_SHARD within each shard block, whereas _HYBRID_SHARD_ZERO2 applies SHARD_GRAD_OP within each shard block.

2. Train Loss NaN Monitor (#2704)

Composer has a new callback which raises a ValueError if your loss becomes NaN. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.

  from composer.callbacks import NaNMonitor

  composer_model = MyComposerModel(n_layers=3)

  trainer = Trainer(
      model=composer_model,
      max_duration='4ba',
      callbacks=NaNMonitor(),
      ...
  )

Full Changelog: v0.16.4...v0.17.0

v0.16.4

11 Oct 19:49
1c9d8d1

What's New

1. Torch 2.1 Support

Composer officially supports PyTorch 2.1! We support several new features from 2.1, including CustomPolicy, which enables more granular FSDP wrapping.
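
As a sketch of what granular wrapping can look like, assuming it is exposed through the model's fsdp_wrap_fn hook (MyBlock and the returned kwargs below are illustrative assumptions, not a documented recipe): returning a dict instead of a bool supplies per-module FSDP arguments.

from composer.models import ComposerModel
from torch.distributed.fsdp import ShardingStrategy

class MyComposerModel(ComposerModel):
    ...  # model definition as elsewhere in these notes

    def fsdp_wrap_fn(self, module):
        # MyBlock is a hypothetical block class; returning a dict of FSDP kwargs
        # for it is an assumed use of the more granular wrapping on torch 2.1.
        if isinstance(module, MyBlock):
            return {'sharding_strategy': ShardingStrategy.SHARD_GRAD_OP}
        return False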

Full Changelog: v0.16.3...v0.16.4

v0.16.3

26 Sep 18:07
c82da77

What's New

1. Add pass@k for HumanEval

HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.

2. log_model with MLFlow

The MLFlow integration now supports log_model at the end of the run.

Full Changelog: v0.16.2...v0.16.3

v0.16.2

14 Sep 16:09
130bde5

What's New

1. PyTorch Nightly Support

Composer now supports PyTorch Nightly and CUDA 12! Along with new docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.

Bug Fixes

1. MosaicML Logger Robustness

The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can be disabled by setting the environment variable MOSAICML_PLATFORM to 'False' when training on the MosaicML platform.
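
For example, to opt out of the platform integration (a minimal sketch based on the environment variable named above):

import os

# Disable the MosaicML platform integration; set this before constructing the Trainer.
os.environ['MOSAICML_PLATFORM'] = 'False'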

2. GCS Integration

GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.

3. Optimizer Monitor Norm Calculation (#2531)

Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.

Full Changelog: v0.16.1...v0.16.2

v0.16.1

05 Sep 21:44
336bf8d

New Features

1. HPU (Habana Gaudi) Support (#2444)

Composer now supports Habana Gaudi chips! To enable HPUs, the device needs to be specified as 'hpu':

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    device='hpu',
    ...
)

2. Generate Callback (#2449)

We've added a new callback which runs generate on a language model at a given frequency to visualize outputs:

from composer.callbacks import Generate

composer_model = MyComposerModel(n_layers=3)
generate_callback = Generate(prompts=['How good is my model?'], interval='5ba')

trainer = Trainer(
    model=composer_model,
    callbacks=generate_callback,
    ...
)

Bug Fixes

1. Checkpoint Fixes

Elastic sharded checkpointing now disables torchmetrics saving to avoid issues with torchmetrics tensors being sharded. Additionally, checkpointing now falls back on the old path, which does not convert torchmetrics tensors to numpy. Checkpointing also no longer materializes optimizer state when saving weights only.

2. MLFlow Performance Improvements

MLFlow integration has significant performance improvements in logging frequency and system metrics collected.

Full Changelog: v0.16.0...v0.16.1

v0.16.0

21 Aug 19:49
9f59487

What's New

1. New Events (#2264)

Composer now has the events EVAL_BEFORE_ALL and EVAL_AFTER_ALL, which let users control logging of bespoke evaluation information across all evaluators.
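
A minimal sketch of hooking the new events from a callback (the callback name and the metric it logs are illustrative):

import time

from composer import Callback, Event, State
from composer.loggers import Logger

class EvalPassTimer(Callback):
    """Hypothetical callback: times the full evaluation pass across all evaluators."""

    def run_event(self, event: Event, state: State, logger: Logger):
        if event == Event.EVAL_BEFORE_ALL:
            self._eval_start = time.monotonic()
        elif event == Event.EVAL_AFTER_ALL:
            logger.log_metrics({'time/total_eval_seconds': time.monotonic() - self._eval_start})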

2. Elastic Sharded Checkpointing

Traditionally, checkpoints are stored as giant monoliths. For large model training, moving the entire model to one node may be infeasible, and writing one large file from one node may be slow. Composer now supports elastic sharded checkpoints with FSDP, where every rank writes a single shard of the checkpoint. This checkpointing strategy is elastic: even if you resume on a different number of GPUs, Composer will handle resumption. To enable sharded checkpointing, set 'state_dict_type': 'sharded' in the FSDP config:

composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'sharded',
    'sharded_ckpt_prefix_dir': 'ba{batch}-shards',  # save each set of shards to a unique folder based on batch
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_interval='2ba',
    ...
)

See the docs for more information on how to integrate this with your project.
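
Because the checkpoint is elastic, resumption just points load_path at the shard folder, even on a different number of GPUs. A minimal sketch (the folder name assumes the save settings above):

trainer = Trainer(
    model=composer_model,
    fsdp_config=fsdp_config,
    load_path='checkpoints/ba2-shards',  # folder of shards written by the run above (assumed name)
    ...
)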

Bug Fixes

  • Fixes runtime estimator when using multiple evaluators in #2331
  • Fix autoresume docs link in #2332
  • Use Enum value when logging hyper-parameters in #2386
  • Fix GCSObjectStore to match function signatures of other object stores in #2445
  • Cast to float32 before numpy() to avoid bf16 errors in #2441


v0.15.1

07 Jul 23:33

Bug Fixes

This is a patch release that mainly fixes a bug related to autoresume and changes the default to offload_to_cpu for sharded checkpoints on PyTorch versions newer than 2.0.

Full Changelog: v0.15.0...v0.15.1