Releases: mosaicml/composer
v0.18.0
This release has been yanked, please skip directly to Composer v0.18.1
New Features
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Bug Fixes
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- [S] Fix how single value tensors are logged by @aspfohl in #2831
Deprecations
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- Fix import for daily test by @snarayan21 in #2826
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- [S] Fix how single value tensors are logged by @aspfohl in #2831
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Fix torch bump by @j316chuck in #2855
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
New Contributors
- @jerrychen109 made their first contribution in #2802
Full Changelog: v0.17.2...v0.18.0
v0.17.2
New Features
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfer times.
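The idea behind multi-part transfer can be sketched independently of any SDK: split the byte range into fixed-size parts, fetch them concurrently, and reassemble them in order. This is a generic illustration, not Composer's OCI implementation; read_range here is a stand-in for a ranged object-store read:

```python
from concurrent.futures import ThreadPoolExecutor

def download_multipart(read_range, total_size, part_size=8 * 1024 * 1024, max_workers=8):
    """Download a blob as fixed-size parts in parallel and reassemble in order."""
    ranges = [(start, min(start + part_size, total_size))
              for start in range(0, total_size, part_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the parts concatenate correctly.
        parts = pool.map(lambda r: read_range(*r), ranges)
    return b''.join(parts)

# Toy "object store": read_range just slices an in-memory blob.
blob = bytes(range(256)) * 1000
data = download_multipart(lambda a, b: blob[a:b], len(blob), part_size=4096)
assert data == blob
```

Because each part is an independent ranged read, throughput scales with the number of workers until the network or the object store becomes the bottleneck.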
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization breaks down across the various components on your GPUs.
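Under the hood this builds on the PyTorch profiler's memory tracking, which you can also invoke directly; a minimal CPU-only sketch (on GPU runs, add ProfilerActivity.CUDA to the activities list to capture device memory as well):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a small workload with memory tracking enabled.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    x = torch.randn(256, 256)
    y = x @ x

# Summarize operators by how much memory they allocated.
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```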
Bug Fixes
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug when initializing weights with device=meta, which resulted in an additional scaling. This has now been fixed, so device and distributed strategies should not affect the parallelization strategy.
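The meta-device pattern at the heart of this fix can be sketched with plain PyTorch: build the module on the meta device (no memory allocated), then materialize it and initialize the weights exactly once. This is an illustration of the general flow, not Composer's internal code:

```python
import torch
import torch.nn as nn

# Construct on the meta device: shapes and dtypes only, no storage.
with torch.device('meta'):
    model = nn.Linear(8, 8)

# Materialize uninitialized storage, then initialize parameters once.
# Re-running an init fn after FSDP has already initialized (the old bug)
# would rescale the weights a second time.
model = model.to_empty(device='cpu')
model.reset_parameters()
```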
What's Changed
- Override NVIDIA environment variable for CUDA 12.1 images by @bandish-shah in #2742
- Add NVIDIA_REQUIRE_CUDA_OVERRIDE env variable to Composer and Torch nightly Docker images by @bandish-shah in #2744
- Remove duplicated for loop in lr_monitor.py by @priba in #2738
- Fix console logger for small datasets. by @mvpatel2000 in #2746
- Add metadata logging for wandb by @jjanezhang in #2747
- Ignore load ignore keys by @mvpatel2000 in #2748
- Bump torch to 2.1.1 version by @j316chuck in #2717
- Add more info when run doesnt complete by @aspfohl in #2751
- Lower sequence generation length on code gen to be dependent on max canonical solution length by @bmosaicml in #2682
- Remove flatten params by @mvpatel2000 in #2761
- Fix GPU tests by @mvpatel2000 in #2767
- Fix GPU v2 by @mvpatel2000 in #2768
- Use time.tokens for speedmonitor instead of dataset length by @mvpatel2000 in #2762
- Remove BreakEpochException by @mvpatel2000 in #2759
- time to clean up time parsing 😉 by @aspfohl in #2770
- Upgrade RunConfig compute specification by @aspfohl in #2772
- Use async logging in MLflowLogger by @chenmoneygithub in #2693
- Fix FSDP _param_init_fn to not reinit parameters multiple times by @dakinggg in #2765
- Gate FSDP param init test on torch 2.1 by @dakinggg in #2774
- Parallelize OCI multipart download by @coryMosaicML in #2750
- [UCVolumes] Add support for list API by @panchalhp-db in #2769
- Add the memory timeline profiling support through the PyTorch profiler. by @cli99 in #2771
- Improve torch memory profiling arguments processing by @cli99 in #2777
- Bump aws of nccl version and enable aws platform support by @willgleich in #2776
- Extend checkpoint loading to accept a validation function by @irenedea in #2726
- Fix checkpoint validation tests for torch 1.13 by @irenedea in #2779
- Bump version to 0.17.2 by @mvpatel2000 in #2780
New Contributors
- @chenmoneygithub made their first contribution in #2693
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Bug Fixes
1. MosaicML Logger Robustness (#2728)
We've improved the MosaicML logger to be more robust to faulty serialization.
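The general defensive pattern, sketched here independently of Composer's actual implementation, is to attempt serialization up front and fall back to a safe representation rather than letting a logging call crash the run:

```python
import json

def to_json_safe(value):
    """Return the value if it serializes cleanly to JSON; fall back to repr()."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Non-serializable payloads (sets, tensors, custom objects) are
        # degraded to a string instead of raising.
        return repr(value)

assert to_json_safe({'loss': 0.5}) == {'loss': 0.5}
assert isinstance(to_json_safe({1, 2, 3}), str)  # sets are not JSON-serializable
```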
What's Changed
- Add train finished run event by @jjanezhang in #2714
- Override nvidia env var for 11.8 by @dakinggg in #2722
- Update file exists checkpointing error messages to be more helpful by @irenedea in #2668
- [S] Add tag support to MLFlowLogger by @aspfohl in #2716
- Use raise ... from e to preserve stack trace by @irenedea in #2725
- add 0.17 to bcompat tests by @eracah in #2723
- Add support for canned ACL environment variable by @nik-mosaic in #2729
- Check serialization for JSON in mosaicml logger by @mvpatel2000 in #2728
- Fix profiler issue by @j316chuck in #2735
- Fix activation cpu offloading by @cli99 in #2724
- Bump version 0.17.1 by @mvpatel2000 in #2741
Full Changelog: v0.17.0...v0.17.1
v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
HYBRID_SHARD will FULL_SHARD a model whereas _HYBRID_SHARD_ZERO2 will SHARD_GRAD_OP within the shard block.
2. Train Loss NaN Monitor (#2704)
Composer has a new callback which will raise a value error if your loss NaNs out. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0
v0.16.4
What's New
1. Torch 2.1 Support
Composer officially supports PyTorch 2.1! We support several new features from 2.1, including CustomPolicy which supports granular wrapping with FSDP.
What's Changed
- Add 0.16 checkpoint to backwards compatibility tests by @eracah in #2567
- Updating FSDP monkeypatch by @mvpatel2000 in #2571
- Add Databricks UC Volume Object Store by @panchalhp-db in #2548
- Fix pytest disk space OOM issue by adding tmp_path_retention_policy=None by @j316chuck in #2583
- Change daily nightly test version by @j316chuck in #2596
- Add save and register wrappers to mlflow logger by @dakinggg in #2579
- Missing () fo or in auto microbatching gate by @mvpatel2000 in #2574
- Simplify FSDP Gradient Clipping by @mvpatel2000 in #2586
- Use FSDP CustomPolicy to support custom kwargs passed to different wrapped modules by @cli99 in #2585
- Free outputs callback by @mvpatel2000 in #2598
- Merge branch 'dev' into spr/dev/458c4e36 by @b-chu in #2595
- Fix a bug when batch type is dict and one of the values is the list by @mvpatel2000 in #2599
- Readme update by @ejyuen in #2581
- Add chain of thought eval by @bmosaicml in #2466
- Add torch 2.1.0 by @mvpatel2000 in #2602
- Change pr cpu and pr gpu test docker images by @j316chuck in #2611
- Change the tokenizer json file to read binary by @dakinggg in #2608
- [Docs] MLflow casing by @aspfohl in #2609
- Call generate callback at end of training by @aspfohl in #2607
- Refactor save interval and eval interval to share code by @dakinggg in #2600
- Deprecate many datasets and models by @mvpatel2000 in #2605
- Clean up gpu tests by @mvpatel2000 in #2612
- Remove apex test by @j316chuck in #2616
- Patch default precision by @mvpatel2000 in #2628
- Add logging for generate callbacks by @aspfohl in #2630
- Expose input_names and output_names when exporting to ONNX by @antoinebrl in #2601
- Bump version to 0.16.4 by @mvpatel2000 in #2627
New Contributors
- @panchalhp-db made their first contribution in #2548
- @cli99 made their first contribution in #2585
Full Changelog: v0.16.3...v0.16.4
v0.16.3
What's New
1. Add pass@k for HumanEval
HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.
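For reference, the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021) fits in a few lines; this is a sketch of the metric itself, and Composer's implementation may differ in details:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of which pass the unit tests."""
    if n - c < k:
        # Fewer failing samples than k: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 2 correct: pass@1 is 0.2, pass@10 is 1.0.
assert abs(pass_at_k(10, 2, 1) - 0.2) < 1e-9
assert pass_at_k(10, 2, 10) == 1.0
```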
2. log_model with MLFlow
The MLFlow integration now supports log_model at the end of the run.
What's Changed
- Update checkpoint.py by @b-chu in #2540
- Add log image to mlflow by @eracah in #2416
- Log runtime estimator units by @mvpatel2000 in #2542
- Bump traitlets from 5.9.0 to 5.10.0 by @dependabot in #2547
- Bump gitpython from 3.1.35 to 3.1.36 by @dependabot in #2546
- Bump ipykernel from 6.25.1 to 6.25.2 by @dependabot in #2544
- Add providers param to ONNX Session in tests by @nik-mosaic in #2553
- Bump flash attn by @mvpatel2000 in #2551
- Remove pin by @mvpatel2000 in #2554
- Change filter to include pull_request_target by @mvpatel2000 in #2557
- Downgrade nightly to previous version by @mvpatel2000 in #2556
- MCLI Code Eval by @rishab-partha in #2479
- Bump cryptography from 41.0.3 to 41.0.4 by @dependabot in #2559
- Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #2560
- Update numpy requirement from <1.26.0,>=1.21.5 to >=1.21.5,<1.27.0 by @dependabot in #2561
- Update support for HumanEval by @mcarbin in #2550
- Add log_model to MLFlowLogger by @dakinggg in #2541
- Bump version to 0.16.3 by @mvpatel2000 in #2566
New Contributors
Full Changelog: v0.16.2...v0.16.3
v0.16.2
What's New
1. PyTorch Nightly Support
Composer now supports PyTorch Nightly and CUDA 12! Along with new Docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.
Bug Fixes
1. MosaicML Logger Robustness
The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can now be disabled by setting the environment variable MOSAICML_PLATFORM to 'False' when training on the MosaicML platform.
2. GCS Integration
GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.
3. Optimizer Monitor Norm Calculation (#2531)
Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.
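The correct reduction squares the per-rank norms, sums across ranks, and takes the square root; summing the norms directly (the old behavior) overestimates the true norm. In scalar form (a sketch of the math, not the actual distributed code):

```python
import math

def global_l2_norm(local_norms):
    """Combine per-GPU L2 norms of disjoint shards into the global L2 norm."""
    return math.sqrt(sum(n * n for n in local_norms))

# Two shards of the vector [3, 0, 0, 4]: local norms 3 and 4, true norm 5.
assert global_l2_norm([3.0, 4.0]) == 5.0
# Naive summation would have reported 7.0.
```

In a real distributed setting the sum-of-squares step corresponds to an all-reduce of the squared local norms before the final square root.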
What's Changed
- fix: when there is no train_metrics, do not checkpoint by @furkanbiten in #2502
- Remove metric saving by @mvpatel2000 in #2514
- Fix daily tests by removing gpu marker by @j316chuck in #2515
- Refactor mosaic_fsdp.py by @b-chu in #2506
- Disable slack notifications for PRs by @mvpatel2000 in #2517
- Add custom sharding to ChunkShardingSpec by @b-chu in #2507
- Update nightly docker image to torch nightly 09-03-23 by @j316chuck in #2518
- Update pre-commit in setup.py by @b-chu in #2522
- Add FSDP custom wrap with torch 2.1 by @mvpatel2000 in #2460
- Fix GCSObjectStore bug where hmac keys auth doesn't work by @eracah in #2519
- Bump gitpython from 3.1.34 to 3.1.35 by @dependabot in #2525
- Bump pytest from 7.4.0 to 7.4.2 by @dependabot in #2523
- Upgrade to MLFlow version 2.5.0 by @ngcgarcia in #2528
- Disable cifar daily test by @mvpatel2000 in #2527
- Mosaicml logger robustness improvements by @mvpatel2000 in #2530
- Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation by @m1kol in #2531
- Fix github actions for GCS integration testing by @mvpatel2000 in #2532
- Fix GCS tests by @mvpatel2000 in #2535
- Change cast for mosaicml logger by @mvpatel2000 in #2538
- Bump Version to 0.16.2 by @mvpatel2000 in #2537
- Bump transformers version by @dakinggg in #2539
New Contributors
- @ngcgarcia made their first contribution in #2528
- @m1kol made their first contribution in #2531
Full Changelog: v0.16.1...v0.16.2
v0.16.1
New Features
1. HPU (Habana Gaudi) Support (#2444)
Composer now supports Habana Gaudi chips! To enable HPUs, device needs to be specified as 'hpu':
composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    device='hpu',
    ...
)
2. Generate Callback (#2449)
We've added a new callback which runs generate on a language model at a given frequency to visualize outputs:
from composer.callbacks import Generate

composer_model = MyComposerModel(n_layers=3)
generate_callback = Generate(prompts=['How good is my model?'], interval='5ba')

trainer = Trainer(
    model=composer_model,
    callbacks=generate_callback,
    ...
)
Bug Fixes
1. Checkpoint Fixes
Elastic sharded checkpointing now disables torchmetric saving to avoid issues with torchmetrics tensors being sharded. Additionally, checkpointing now falls back on the old path which does not convert torchmetrics tensors to numpy. Checkpointing also no longer materializes optimizer state when saving weights only.
2. MLFlow Performance Improvements
MLFlow integration has significant performance improvements in logging frequency and system metrics collected.
What's Changed
- Hpu support by @vivekgoe in #2444
- Change input_ids to a kwarg in HuggingFaceModel.generate by @dakinggg in #2459
- Add log_table by @irenedea in #2437
- Enable composer to work with torch nightly builds, torch 2.1.0, and cuda 12.1. by @j316chuck in #2463
- Materialize only model state_dict in memory for save_weights_only by @eracah in #2450
- Improve performance of MLflow logging by @dbczumar in #2442
- Fail fast if scheduler warmup and max duration are incompatible by @dakinggg in #2458
- Add nightly docker image by @j316chuck in #2452
- Fix local eval by @rishab-partha in #2465
- Add torch 2.1.0 args for github release-docker workflow by @j316chuck in #2470
- Log system metrics on each event by @prithvikannan in #2412
- Fix torch 2.1.0 docker tag by @j316chuck in #2472
- Upstream Generate Callback by @irenedea in #2449
- Bump torch nightly docker image by @j316chuck in #2476
- Test pytorch 2.1.0 docker images on ci/cd by @j316chuck in #2469
- Fix huggingface tokenizer loading for slow tokenizers by @dakinggg in #2483
- Deprecate Fused LayerNorm by @nik-mosaic in #2475
- Transformers upgrade by @dakinggg in #2489
- Update RTD build config with build.os by @bandish-shah in #2490
- Upgrade torch docker version and tests by @j316chuck in #2488
- upgrade node by @j316chuck in #2492
- Gating tying modules w/ FSDP for torch 2.0 by @bcui19 in #2467
- Removing min_params by @bcui19 in #2494
- Fix torchmetrics backwards compatibility issue by @eracah in #2468
- Adding some fixes to FSDP tests by @bcui19 in #2495
- Fail count on mosaicml logger by @mvpatel2000 in #2496
- Remove PR curve metrics from backward compatibility test and skip torch 1.13 by @eracah in #2497
- filter warning by @mvpatel2000 in #2500
- Bump version to 0.16.1 by @mvpatel2000 in #2498
- Skip metrics in state dict by @mvpatel2000 in #2501
- Add peak memory stats by @mvpatel2000 in #2504
- Fix sharded ckpt by @mvpatel2000 in #2505
- Bump gitpython from 3.1.31 to 3.1.34 by @dependabot in #2509
- Annotate torch_prof_remote_file_name as Optional by @srstevenson in #2512
New Contributors
- @vivekgoe made their first contribution in #2444
- @irenedea made their first contribution in #2437
- @j316chuck made their first contribution in #2463
- @dbczumar made their first contribution in #2442
Full Changelog: v0.16.0...v0.16.1
v0.16.0
What's New
1. New Events (#2264)
Composer now has the events EVAL_BEFORE_ALL and EVAL_AFTER_ALL, which let users control logging of certain bespoke evaluation information across all evaluators.
2. Elastic Sharded Checkpointing
Traditionally, checkpoints are stored as giant monoliths. For large model training, moving the entire model to 1 node may be infeasible and writing one large file from 1 node may be slow. Composer now supports elastic sharded checkpoints with FSDP, where every rank writes a single shard of the checkpoint. This checkpointing strategy is elastic, which means even if you resume on a different number of GPUs, Composer will handle resumption. To enable sharded checkpointing, it must be specified in the FSDP config as 'state_dict_type': 'sharded':
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'sharded',
    'sharded_ckpt_prefix_dir': 'ba{batch}-shards'  # will save each set of shards to a unique folder based on batch
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_interval='2ba',
    ...
)
See the docs for more information on how to integrate this with your project.
Bug Fixes
- Fixes runtime estimator when using multiple evaluators in #2331
- Fix autoresume docs link in #2332
- Use Enum value when logging hyper-parameters in #2386
- Fix GCSObjectStore to match function signatures of other object stores in #2445
- Cast to float32 before numpy() to avoid bf16 errors in #2441
What's Changed
- Update numpy requirement from <1.25.0,>=1.21.5 to >=1.21.5,<1.26.0 by @dependabot in #2316
- Bump ipykernel from 6.23.1 to 6.23.2 by @dependabot in #2317
- Bump sphinxcontrib-katex from 0.9.5 to 0.9.6 by @dependabot in #2319
- Pin Apex by @mvpatel2000 in #2322
- CodeQL on PRs by @mvpatel2000 in #2323
- Add secrets check as part of pre-commit by @karan6181 in #2324
- Update local rank 0 to be elastic by @mvpatel2000 in #2321
- Bump pytest from 7.3.1 to 7.4.0 by @dependabot in #2330
- Bump ipykernel from 6.23.2 to 6.23.3 by @dependabot in #2329
- Auto add mosaicml logger by @mvpatel2000 in #2325
- Add precision config arg for FP8 by @julian-q in #2335
- Fixes daily test failures with respect to autoadd mosaicml logger by @mvpatel2000 in #2339
- In-line group to avoid OOM by @mvpatel2000 in #2320
- Set offload_to_cpu True for state_dict_type=sharded by @eracah in #2338
- Update version to 15.1 by @mvpatel2000 in #2341
- Fix mapi mocking by @mvpatel2000 in #2342
- Change gpu timeout by @rishab-partha in #2343
- Fix test_fsdp_load_old_checkpoint test to fix daily tests by @eracah in #2347
- Add spaces between sentences in eval label warning by @srstevenson in #2327
- Avoid overwriting seed==0 by @tbenthompson in #2352
- Small Documentation Typo Fixes by @sarthak-314 in #2349
- Fix wandb errror with autoresume issue by @eracah in #2353
- Bump ipykernel from 6.23.3 to 6.24.0 by @dependabot in #2360
- raise min mcli by @mvpatel2000 in #2362
- Add node rank to signal files by @mvpatel2000 in #2363
- Move pydantic pin to deepspeed by @mvpatel2000 in #2366
- Batch log metrics calls in speed_monitor.py by @prithvikannan in #2367
- Read Composer run name env var by @mvpatel2000 in #2372
- Fix typing for args in streaming by @dakinggg in #2373
- Add distributed sync during wait_for_workers to avoid timeout for large checkpoints by @dakinggg in #2368
- Update torchmetrics requirement from <0.12,>=0.10.0 to >=0.10.0,<1.1 by @dependabot in #2358
- Add code eval dataset and metric by @rishab-partha in #2301
- Isolate env var in unit tests by @mvpatel2000 in #2379
- Add extra steps for space free up by @XiaohanZhangCMU in #2382
- regex changed in time.py by @megha95 in #2378
- Support no param models by making optimizer optional by @mvpatel2000 in #2374
- pin identify version to resolve codequality failures by @XiaohanZhangCMU in #2391
- Add ls to object stores by @dakinggg in #2376
- Change transformers by @rishab-partha in #2383
- Respect MLFLow experiment environment variable by @aspfohl in #2377
- Change code eval apikey by @rishab-partha in #2394
- Moves pytest-cpu slack notifications to issues from helpdesk by @mvpatel2000 in #2398
- Add code eval docs by @rishab-partha in #2397
- fixed pre-commit issues with modifications to pretty-format-json args. by @snarayan21 in #2392
- Fix LOCAL_WORLD_SIZE in pytest by @rishab-partha in #2407
- Add code eval secrets to workflows by @rishab-partha in #2399
- Enable Elastic Sharded Checkpointing by @eracah in #2262
- Remove compute_on_step from MAP by @priba in #2390
- Save metadata and integration when save_weights_only is set by @eracah in #2396
- remove unused Trainer docstring arg load_fsdp_monolith_rank0_only by @eracah in #2408
- torch2.0.1 custom auto wrap by @vchiley in #2400
- Add ruff pre-commit by @Skylion007 in #2414
- Switch google cloud backend from libcloud to google cloud storage API by @XiaohanZhangCMU in #2340
- Updates GPU test timeout to use mcloud flag by @mvpatel2000 in #2420
- Add a EVAL_STANDALONE_START and EVAL_STANDALONE_END events and change RUD to not wait_for_workers every eval by @dakinggg in #2418
- Throttle optimizer monitor by @mvpatel2000 in #2419
- Adding extra condition to avoid running eval_train_metrics by @furkanbiten in #2411
- fp8 on Ada by @dskhudia in #2424
- Bump coverage[toml] from 7.2.7 to 7.3.0 by @dependabot in #2432
- Bump cryptography from 38.0.4 to 41.0.3 by @dependabot in #2436
- Bump ipykernel from 6.24.0 to 6.25.1 by @dependabot in #2434
- Multilingual compatibility and batching for Code Evaluation by @rishab-partha in #2410
- Update max duration on tests by @mvpatel2000 in #2429
- Update timeout by @rishab-partha in #2438
- add dist.barrier to rotate_checkpoints by @eracah in #2440
- Bump version to 0.16 by @mvpatel2000 in #2439
- Fix notebooks by @rishab-partha in #2446
- Fix notebooks v2 by @rishab-partha in #2448
New Contributors
- @eltociear made their first contribution in #2333
- @antoinebrl made their first contribution in #2334
- @julian-q made their first contribution in #2335
- @srstevenson made their first contribution in #2327
- @tbenthompson made their first contribution in #2352
- @sarthak-314 made their first contribution in #2349
- @prithvikannan made their first contribution in #2367
...
v0.15.1
Bug Fixes
This is a patch release that mainly fixes a bug related to autoresume, and changes the default to offload_to_cpu for PyTorch >2 sharded checkpoints.
What's Changed
- Fixes daily test failures with respect to autoadd mosaicml logger by @mvpatel2000 in #2339
- Set offload_to_cpu True for state_dict_type=sharded by @eracah in #2338
- Update version by @mvpatel2000 in #2341
- Fix MAPI mocking by @mvpatel2000 in #2342
- Change GPU timeout by @rishab-partha in #2343
- Add cpu call by @eracah in #2347
- Add spaces between sentences in eval label warning by @srstevenson in #2327
- Avoid overwriting seed=0 by @tbenthompson in #2352
- Small documentation typo fixes by @sarthak-314 in #2349
- Fix wandb errror with autoresume issue by @eracah in #2353
Full Changelog: v0.15.0...v0.15.1