Releases: mosaicml/composer
v0.18.0
This release has been yanked, please skip directly to Composer v0.18.1
New Features
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    logger=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
Bug Fixes
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- [S] Fix how single value tensors are logged by @aspfohl in #2831
Deprecations
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- Fix import for daily test by @snarayan21 in #2826
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- [S] Fix how single value tensors are logged by @aspfohl in #2831
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Fix torch bump by @j316chuck in #2855
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
New Contributors
- @jerrychen109 made their first contribution in #2802
Full Changelog: v0.17.2...v0.18.0
v0.17.2
New Features
1. Torch 2.1.1 Support
Composer now supports torch 2.1.1! This new release primarily fixes several small bugs that we had previously monkeypatched in Composer.
2. Faster OCI Upload/Download
Composer now supports multi-part upload/download to OCI, which should speed up object store transfer times.
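The idea behind multi-part transfer can be sketched independently of any SDK: split the byte range into fixed-size parts, fetch them concurrently, and reassemble them in order. This is a generic illustration, not Composer's OCI implementation; read_range here is a stand-in for a ranged object-store read:

```python
from concurrent.futures import ThreadPoolExecutor

def download_multipart(read_range, total_size, part_size=8 * 1024 * 1024, max_workers=8):
    """Download a blob as fixed-size parts in parallel and reassemble in order."""
    ranges = [(start, min(start + part_size, total_size))
              for start in range(0, total_size, part_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so the parts concatenate correctly.
        parts = pool.map(lambda r: read_range(*r), ranges)
    return b''.join(parts)

# Toy "object store": read_range just slices an in-memory blob.
blob = bytes(range(256)) * 1000
data = download_multipart(lambda a, b: blob[a:b], len(blob), part_size=4096)
assert data == blob
```

Because each part is an independent ranged read, throughput scales with the number of workers until the network or the object store becomes the bottleneck.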
3. Memory Profiling
We've expanded the torch profiler integration to support memory profiling. Now, when the profiler is enabled, you will get a trace showing how memory utilization breaks down across the various components on your GPUs.
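Under the hood this builds on the PyTorch profiler's memory tracking, which you can also invoke directly; a minimal CPU-only sketch (on GPU runs, add ProfilerActivity.CUDA to the activities list to capture device memory as well):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a small workload with memory tracking enabled.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    x = torch.randn(256, 256)
    y = x @ x

# Summarize operators by how much memory they allocated.
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=5))
```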
Bug Fixes
1. FSDP Initialization with Meta
Previously, our FSDP integration had a bug when initializing weights with device=meta, which resulted in an additional scaling. This has now been fixed, so device and distributed strategies should not affect the parallelization strategy.
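The meta-device pattern at the heart of this fix can be sketched with plain PyTorch: build the module on the meta device (no memory allocated), then materialize it and initialize the weights exactly once. This is an illustration of the general flow, not Composer's internal code:

```python
import torch
import torch.nn as nn

# Construct on the meta device: shapes and dtypes only, no storage.
with torch.device('meta'):
    model = nn.Linear(8, 8)

# Materialize uninitialized storage, then initialize parameters once.
# Re-running an init fn after FSDP has already initialized (the old bug)
# would rescale the weights a second time.
model = model.to_empty(device='cpu')
model.reset_parameters()
```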
What's Changed
- Override NVIDIA environment variable for CUDA 12.1 images by @bandish-shah in #2742
- Add NVIDIA_REQUIRE_CUDA_OVERRIDE env variable to Composer and Torch nightly Docker images by @bandish-shah in #2744
- Remove duplicated for loop in lr_monitor.py by @priba in #2738
- Fix console logger for small datasets. by @mvpatel2000 in #2746
- Add metadata logging for wandb by @jjanezhang in #2747
- Ignore load ignore keys by @mvpatel2000 in #2748
- Bump torch to 2.1.1 version by @j316chuck in #2717
- Add more info when run doesnt complete by @aspfohl in #2751
- Lower sequence generation length on code gen to be dependent on max canonical solution length by @bmosaicml in #2682
- Remove flatten params by @mvpatel2000 in #2761
- Fix GPU tests by @mvpatel2000 in #2767
- Fix GPU v2 by @mvpatel2000 in #2768
- Use time.tokens for speedmonitor instead of dataset length by @mvpatel2000 in #2762
- Remove BreakEpochException by @mvpatel2000 in #2759
- time to clean up time parsing 😉 by @aspfohl in #2770
- Upgrade RunConfig compute specification by @aspfohl in #2772
- Use async logging in MLflowLogger by @chenmoneygithub in #2693
- Fix FSDP _param_init_fn to not reinit parameters multiple times by @dakinggg in #2765
- Gate FSDP param init test on torch 2.1 by @dakinggg in #2774
- Parallelize OCI multipart download by @coryMosaicML in #2750
- [UCVolumes] Add support for list API by @panchalhp-db in #2769
- Add the memory timeline profiling support through the PyTorch profiler. by @cli99 in #2771
- Improve torch memory profiling arguments processing by @cli99 in #2777
- Bump aws of nccl version and enable aws platform support by @willgleich in #2776
- Extend checkpoint loading to accept a validation function by @irenedea in #2726
- Fix checkpoint validation tests for torch 1.13 by @irenedea in #2779
- Bump version to 0.17.2 by @mvpatel2000 in #2780
New Contributors
- @chenmoneygithub made their first contribution in #2693
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Bug Fixes
1. MosaicML Logger Robustness (#2728)
We've improved the MosaicML logger to be more robust to faulty serialization.
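The general defensive pattern, sketched here independently of Composer's actual implementation, is to attempt serialization up front and fall back to a safe representation rather than letting a logging call crash the run:

```python
import json

def to_json_safe(value):
    """Return the value if it serializes cleanly to JSON; fall back to repr()."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Non-serializable payloads (sets, tensors, custom objects) are
        # degraded to a string instead of raising.
        return repr(value)

assert to_json_safe({'loss': 0.5}) == {'loss': 0.5}
assert isinstance(to_json_safe({1, 2, 3}), str)  # sets are not JSON-serializable
```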
What's Changed
- Add train finished run event by @jjanezhang in #2714
- Override nvidia env var for 11.8 by @dakinggg in #2722
- Update file exists checkpointing error messages to be more helpful by @irenedea in #2668
- [S] Add tag support to MLFlowLogger by @aspfohl in #2716
- Use raise ... from e to preserve stack trace by @irenedea in #2725
- add 0.17 to bcompat tests by @eracah in #2723
- Add support for canned ACL environment variable by @nik-mosaic in #2729
- Check serialization for JSON in mosaicml logger by @mvpatel2000 in #2728
- Fix profiler issue by @j316chuck in #2735
- Fix activation cpu offloading by @cli99 in #2724
- Bump version 0.17.1 by @mvpatel2000 in #2741
Full Changelog: v0.17.0...v0.17.1
v0.17.0
What's New
1. Hybrid Sharded Data Parallel (HSDP) Integration (#2648)
Composer now supports Hybrid Sharded Data Parallel (HSDP), where a model is both sharded and replicated across blocks of controllable size. By default, this will shard a model within a node and replicate across nodes, but Composer will accept a tuple of process groups to specify custom shard/replicate sizes. This can be specified in the FSDP config.
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'HYBRID_SHARD',
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    ...
)
HYBRID_SHARD will FULL_SHARD a model whereas _HYBRID_SHARD_ZERO2 will SHARD_GRAD_OP within the shard block.
2. Train Loss NaN Monitor (#2704)
Composer has a new callback which will raise a value error if your loss NaNs out. This is very useful to avoid wasting compute if your training run diverges or fails for numerical reasons.
from composer.callbacks import NaNMonitor

composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    callbacks=NaNMonitor(),
    ...
)
Bug Fixes
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
What's Changed
- Add partial state dict functionality for FSDP by @b-chu in #2637
- Update monai requirement from <1.3,>=0.9.1 to >=0.9.1,<1.4 by @dependabot in #2643
- Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #2645
- Remove checkpoint on close by @mvpatel2000 in #2646
- Update latest to 2.1 by @mvpatel2000 in #2650
- HSDP Support by @mvpatel2000 in #2648
- Log profile averages by @j316chuck in #2647
- Daily API key by @mvpatel2000 in #2655
- Add automatic remote uploader downloader for composer profiler by @j316chuck in #2653
- Update the AWS_OFI_NCCL version and add in the MPI HWLOC install by @willgleich in #2651
- Fix GCP tests by @mvpatel2000 in #2658
- Allow no eval_loader when eval is disabled by @b-chu in #2657
- Gate HSDP by torch 2.1.0 by @mvpatel2000 in #2656
- Fix FSDP arg default to match torch by @mvpatel2000 in #2660
- Bump pypandoc from 1.11 to 1.12 by @dependabot in #2664
- Bump vit-pytorch from 0.35.8 to 1.6.1 by @dependabot in #2662
- Upgrade to transformers 4.34.1 by @dakinggg in #2635
- Update docker readme by @mvpatel2000 in #2669
- Add script to validate remote object store paths by @irenedea in #2667
- Torch 2.1 Resumption Support by @mvpatel2000 in #2665
- Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in #2663
- Fix dist by @mvpatel2000 in #2670
- Add torch nightly for torch 2.2.0 10-24 by @j316chuck in #2671
- Adding Model Data Init and Training Progress to MosaicMLLogger by @jjanezhang in #2633
- Bump pytest from 7.4.2 to 7.4.3 by @dependabot in #2678
- Bump sphinxext-opengraph from 0.8.2 to 0.9.0 by @dependabot in #2677
- Bump traitlets from 5.10.0 to 5.12.0 by @dependabot in #2674
- Bump cryptography from 41.0.4 to 41.0.5 by @dependabot in #2675
- Secure Code Eval changes by @mvpatel2000 in #2679
- Lazy validation of code eval metric by @mvpatel2000 in #2681
- Upgrade transformers to 4.35 by @dakinggg in #2684
- Bump traitlets from 5.12.0 to 5.13.0 by @dependabot in #2687
- Bump ipykernel from 6.25.2 to 6.26.0 by @dependabot in #2686
- Add Kwargs to upload_object by @nik-mosaic in #2692
- Add version number to composer metadata logs by @j316chuck in #2565
- Add distributed barrier test fixture to ensure pytest cleans up resources properly by @j316chuck in #2694
- Properly handle empty metric_names passed to Trainer._filter_metrics by @irenedea in #2700
- Train loss NaN checking callback by @coryMosaicML in #2704
- Adding logging and force flushing for run events by @jjanezhang in #2703
- [daily-test fix] Add rank 0 gating to test_elastic_resumption state dict comparison by @eracah in #2705
- Fix MPS with dict loss by @mvpatel2000 in #2706
- Update types to follow PEP 585 by @b-chu in #2697
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #2708
- Update wandb requirement from <0.16,>=0.13.2 to >=0.13.2,<0.17 by @dependabot in #2709
- Squelch Memory Monitor warnings if device=meta by @hanlint in #2529
- Fix NaN monitor for loss dicts. by @coryMosaicML in #2712
- Switch mosaicml logger to use futures to enable better error handling by @j316chuck in #2702
- Fetching arguments for FSDP by @mvpatel2000 in #2710
- Bump version to 0.17 by @mvpatel2000 in #2711
New Contributors
- @willgleich made their first contribution in #2651
- @jjanezhang made their first contribution in #2633
Full Changelog: v0.16.4...v0.17.0
v0.16.4
What's New
1. Torch 2.1 Support
Composer officially supports PyTorch 2.1! We support several new features from 2.1, including CustomPolicy which supports granular wrapping with FSDP.
What's Changed
- Add 0.16 checkpoint to backwards compatibility tests by @eracah in #2567
- Updating FSDP monkeypatch by @mvpatel2000 in #2571
- Add Databricks UC Volume Object Store by @panchalhp-db in #2548
- Fix pytest disk space OOM issue by adding tmp_path_retention_policy=None by @j316chuck in #2583
- Change daily nightly test version by @j316chuck in #2596
- Add save and register wrappers to mlflow logger by @dakinggg in #2579
- Missing () fo or in auto microbatching gate by @mvpatel2000 in #2574
- Simplify FSDP Gradient Clipping by @mvpatel2000 in #2586
- Use FSDP CustomPolicy to support custom kwargs passed to different wrapped modules by @cli99 in #2585
- Free outputs callback by @mvpatel2000 in #2598
- Merge branch 'dev' into spr/dev/458c4e36 by @b-chu in #2595
- Fix a bug when batch type is dict and one of the values is the list by @mvpatel2000 in #2599
- Readme update by @ejyuen in #2581
- Add chain of thought eval by @bmosaicml in #2466
- Add torch 2.1.0 by @mvpatel2000 in #2602
- Change pr cpu and pr gpu test docker images by @j316chuck in #2611
- Change the tokenizer json file to read binary by @dakinggg in #2608
- [Docs] MLflow casing by @aspfohl in #2609
- Call generate callback at end of training by @aspfohl in #2607
- Refactor save interval and eval interval to share code by @dakinggg in #2600
- Deprecate many datasets and models by @mvpatel2000 in #2605
- Clean up gpu tests by @mvpatel2000 in #2612
- Remove apex test by @j316chuck in #2616
- Patch default precision by @mvpatel2000 in #2628
- Add logging for generate callbacks by @aspfohl in #2630
- Expose input_names and output_names when exporting to ONNX by @antoinebrl in #2601
- Bump version to 0.16.4 by @mvpatel2000 in #2627
New Contributors
- @panchalhp-db made their first contribution in #2548
- @cli99 made their first contribution in #2585
Full Changelog: v0.16.3...v0.16.4
v0.16.3
What's New
1. Add pass@k for HumanEval
HumanEval now supports pass@k. We also support first-class integration with the MosaicML platform for secure code evaluation.
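For reference, the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021) fits in a few lines; this is a sketch of the metric itself, and Composer's implementation may differ in details:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of which pass the unit tests."""
    if n - c < k:
        # Fewer failing samples than k: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 2 correct: pass@1 is 0.2, pass@10 is 1.0.
assert abs(pass_at_k(10, 2, 1) - 0.2) < 1e-9
assert pass_at_k(10, 2, 10) == 1.0
```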
2. log_model with MLFlow
The MLFlow integration now supports log_model at the end of the run.
What's Changed
- Update checkpoint.py by @b-chu in #2540
- Add log image to mlflow by @eracah in #2416
- Log runtime estimator units by @mvpatel2000 in #2542
- Bump traitlets from 5.9.0 to 5.10.0 by @dependabot in #2547
- Bump gitpython from 3.1.35 to 3.1.36 by @dependabot in #2546
- Bump ipykernel from 6.25.1 to 6.25.2 by @dependabot in #2544
- Add providers param to ONNX Session in tests by @nik-mosaic in #2553
- Bump flash attn by @mvpatel2000 in #2551
- Remove pin by @mvpatel2000 in #2554
- Change filter to include pull_request_target by @mvpatel2000 in #2557
- Downgrade nightly to previous version by @mvpatel2000 in #2556
- MCLI Code Eval by @rishab-partha in #2479
- Bump cryptography from 41.0.3 to 41.0.4 by @dependabot in #2559
- Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #2560
- Update numpy requirement from <1.26.0,>=1.21.5 to >=1.21.5,<1.27.0 by @dependabot in #2561
- Update support for HumanEval by @mcarbin in #2550
- Add log_model to MLFlowLogger by @dakinggg in #2541
- Bump version to 0.16.3 by @mvpatel2000 in #2566
New Contributors
Full Changelog: v0.16.2...v0.16.3
v0.16.2
What's New
1. PyTorch Nightly Support
Composer now supports PyTorch Nightly and CUDA 12! Along with new Docker images based on nightly PyTorch versions and release candidates, we've updated our PyTorch monkeypatches to support the latest version of PyTorch. These monkeypatches add functionality for finer-grained FSDP wrapping and patch bugs related to sharded checkpoints. We are in the process of upstreaming these changes into PyTorch.
Bug Fixes
1. MosaicML Logger Robustness
The MosaicML logger is now robust to platform timeouts and other errors. Additionally, it can now be disabled by setting the environment variable MOSAICML_PLATFORM to 'False' when training on the MosaicML platform.
2. GCS Integration
GCS authentication is now supported with HMAC keys, patching a bug in the previous implementation.
3. Optimizer Monitor Norm Calculation (#2531)
Previously, the optimizer monitor incorrectly reduced norms across GPUs. It now correctly computes norms in a distributed setting.
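The correct reduction squares the per-rank norms, sums across ranks, and takes the square root; summing the norms directly (the old behavior) overestimates the true norm. In scalar form (a sketch of the math, not the actual distributed code):

```python
import math

def global_l2_norm(local_norms):
    """Combine per-GPU L2 norms of disjoint shards into the global L2 norm."""
    return math.sqrt(sum(n * n for n in local_norms))

# Two shards of the vector [3, 0, 0, 4]: local norms 3 and 4, true norm 5.
assert global_l2_norm([3.0, 4.0]) == 5.0
# Naive summation would have reported 7.0.
```

In a real distributed setting the sum-of-squares step corresponds to an all-reduce of the squared local norms before the final square root.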
What's Changed
- fix: when there is no train_metrics, do not checkpoint by @furkanbiten in #2502
- Remove metric saving by @mvpatel2000 in #2514
- Fix daily tests by removing gpu marker by @j316chuck in #2515
- Refactor mosaic_fsdp.py by @b-chu in #2506
- Disable slack notifications for PRs by @mvpatel2000 in #2517
- Add custom sharding to ChunkShardingSpec by @b-chu in #2507
- Update nightly docker image to torch nightly 09-03-23 by @j316chuck in #2518
- Update pre-commit in setup.py by @b-chu in #2522
- Add FSDP custom wrap with torch 2.1 by @mvpatel2000 in #2460
- Fix GCSObjectStore bug where hmac keys auth doesn't work by @eracah in #2519
- Bump gitpython from 3.1.34 to 3.1.35 by @dependabot in #2525
- Bump pytest from 7.4.0 to 7.4.2 by @dependabot in #2523
- Upgrade to MLFlow version 2.5.0 by @ngcgarcia in #2528
- Disable cifar daily test by @mvpatel2000 in #2527
- Mosaicml logger robustness improvements by @mvpatel2000 in #2530
- Fix metrics keys sort in DecoupledAdamW for OptimizerMonitor FSDP metric agreggation by @m1kol in #2531
- Fix github actions for GCS integration testing by @mvpatel2000 in #2532
- Fix GCS tests by @mvpatel2000 in #2535
- Change cast for mosaicml logger by @mvpatel2000 in #2538
- Bump Version to 0.16.2 by @mvpatel2000 in #2537
- Bump transformers version by @dakinggg in #2539
New Contributors
- @ngcgarcia made their first contribution in #2528
- @m1kol made their first contribution in #2531
Full Changelog: v0.16.1...v0.16.2
v0.16.1
New Features
1. HPU (Habana Gaudi) Support (#2444)
Composer now supports Habana Gaudi chips! To enable HPUs, device needs to be specified as 'hpu':
composer_model = MyComposerModel(n_layers=3)

trainer = Trainer(
    model=composer_model,
    device='hpu',
    ...
)
2. Generate Callback (#2449)
We've added a new callback which runs generate on a language model at a given frequency to visualize outputs:
from composer.callbacks import Generate

composer_model = MyComposerModel(n_layers=3)
generate_callback = Generate(prompts=['How good is my model?'], interval='5ba')

trainer = Trainer(
    model=composer_model,
    callbacks=generate_callback,
    ...
)
Bug Fixes
1. Checkpoint Fixes
Elastic sharded checkpointing now disables torchmetric saving to avoid issues with torchmetrics tensors being sharded. Additionally, checkpointing now falls back on the old path which does not convert torchmetrics tensors to numpy. Checkpointing also no longer materializes optimizer state when saving weights only.
2. MLFlow Performance Improvements
MLFlow integration has significant performance improvements in logging frequency and system metrics collected.
What's Changed
- Hpu support by @vivekgoe in #2444
- Change input_ids to a kwarg in HuggingFaceModel.generate by @dakinggg in #2459
- Add log_table by @irenedea in #2437
- Enable composer to work with torch nightly builds, torch 2.1.0, and cuda 12.1. by @j316chuck in #2463
- Materialize only model state_dict in memory for save_weights_only by @eracah in #2450
- Improve performance of MLflow logging by @dbczumar in #2442
- Fail fast if scheduler warmup and max duration are incompatible by @dakinggg in #2458
- Add nightly docker image by @j316chuck in #2452
- Fix local eval by @rishab-partha in #2465
- Add torch 2.1.0 args for github release-docker workflow by @j316chuck in #2470
- Log system metrics on each event by @prithvikannan in #2412
- Fix torch 2.1.0 docker tag by @j316chuck in #2472
- Upstream Generate Callback by @irenedea in #2449
- Bump torch nightly docker image by @j316chuck in #2476
- Test pytorch 2.1.0 docker images on ci/cd by @j316chuck in #2469
- Fix huggingface tokenizer loading for slow tokenizers by @dakinggg in #2483
- Deprecate Fused LayerNorm by @nik-mosaic in #2475
- Transformers upgrade by @dakinggg in #2489
- Update RTD build config with build.os by @bandish-shah in #2490
- Upgrade torch docker version and tests by @j316chuck in #2488
- upgrade node by @j316chuck in #2492
- Gating tying modules w/ FSDP for torch 2.0 by @bcui19 in #2467
- Removing min_params by @bcui19 in #2494
- Fix torchmetrics backwards compatibility issue by @eracah in #2468
- Adding some fixes to FSDP tests by @bcui19 in #2495
- Fail count on mosaicml logger by @mvpatel2000 in #2496
- Remove PR curve metrics from backward compatibility test and skip torch 1.13 by @eracah in #2497
- filter warning by @mvpatel2000 in #2500
- Bump version to 0.16.1 by @mvpatel2000 in #2498
- Skip metrics in state dict by @mvpatel2000 in #2501
- Add peak memory stats by @mvpatel2000 in #2504
- Fix sharded ckpt by @mvpatel2000 in #2505
- Bump gitpython from 3.1.31 to 3.1.34 by @dependabot in #2509
- Annotate torch_prof_remote_file_name as Optional by @srstevenson in #2512
New Contributors
- @vivekgoe made their first contribution in #2444
- @irenedea made their first contribution in #2437
- @j316chuck made their first contribution in #2463
- @dbczumar made their first contribution in #2442
Full Changelog: v0.16.0...v0.16.1
v0.16.0
What's New
1. New Events (#2264)
Composer now has the events EVAL_BEFORE_ALL and EVAL_AFTER_ALL, which let users control logging of certain bespoke evaluation information across all evaluators.
2. Elastic Sharded Checkpointing
Traditionally, checkpoints are stored as giant monoliths. For large model training, moving the entire model to 1 node may be infeasible and writing one large file from 1 node may be slow. Composer now supports elastic sharded checkpoints with FSDP, where every rank writes a single shard of the checkpoint. This checkpointing strategy is elastic, which means even if you resume on a different number of GPUs, Composer will handle resumption. To enable sharded checkpointing, it must be specified in the FSDP config as 'state_dict_type': 'sharded':
composer_model = MyComposerModel(n_layers=3)

fsdp_config = {
    'sharding_strategy': 'FULL_SHARD',
    'state_dict_type': 'sharded',
    'sharded_ckpt_prefix_dir': 'ba{batch}-shards'  # will save each set of shards to a unique folder based on batch
}

trainer = Trainer(
    model=composer_model,
    max_duration='4ba',
    fsdp_config=fsdp_config,
    save_folder='checkpoints',
    save_interval='2ba',
    ...
)
See the docs for more information on how to integrate this with your project.
Bug Fixes
- Fixes runtime estimator when using multiple evaluators in #2331
- Fix autoresume docs link in #2332
- Use Enum value when logging hyper-parameters in #2386
- Fix GCSObjectStore to match function signatures of other object stores in #2445
- Cast to float32 before numpy() to avoid bf16 errors in #2441
What's Changed
- Update numpy requirement from <1.25.0,>=1.21.5 to >=1.21.5,<1.26.0 by @dependabot in #2316
- Bump ipykernel from 6.23.1 to 6.23.2 by @dependabot in #2317
- Bump sphinxcontrib-katex from 0.9.5 to 0.9.6 by @dependabot in #2319
- Pin Apex by @mvpatel2000 in #2322
- CodeQL on PRs by @mvpatel2000 in #2323
- Add secrets check as part of pre-commit by @karan6181 in #2324
- Update local rank 0 to be elastic by @mvpatel2000 in #2321
- Bump pytest from 7.3.1 to 7.4.0 by @dependabot in #2330
- Bump ipykernel from 6.23.2 to 6.23.3 by @dependabot in #2329
- Auto add mosaicml logger by @mvpatel2000 in #2325
- Add precision config arg for FP8 by @julian-q in #2335
- Fixes daily test failures with respect to autoadd mosaicml logger by @mvpatel2000 in #2339
- In-line group to avoid OOM by @mvpatel2000 in #2320
- Set offload_to_cpu True for state_dict_type=sharded by @eracah in #2338
- Update version to 15.1 by @mvpatel2000 in #2341
- Fix mapi mocking by @mvpatel2000 in #2342
- Change gpu timeout by @rishab-partha in #2343
- Fix test_fsdp_load_old_checkpoint test to fix daily tests by @eracah in #2347
- Add spaces between sentences in eval label warning by @srstevenson in #2327
- Avoid overwriting seed==0 by @tbenthompson in #2352
- Small Documentation Typo Fixes by @sarthak-314 in #2349
- Fix wandb errror with autoresume issue by @eracah in #2353
- Bump ipykernel from 6.23.3 to 6.24.0 by @dependabot in #2360
- raise min mcli by @mvpatel2000 in #2362
- Add node rank to signal files by @mvpatel2000 in #2363
- Move pydantic pin to deepspeed by @mvpatel2000 in #2366
- Batch log metrics calls in speed_monitor.py by @prithvikannan in #2367
- Read Composer run name env var by @mvpatel2000 in #2372
- Fix typing for args in streaming by @dakinggg in #2373
- Add distributed sync during wait_for_workers to avoid timeout for large checkpoints by @dakinggg in #2368
- Update torchmetrics requirement from <0.12,>=0.10.0 to >=0.10.0,<1.1 by @dependabot in #2358
- Add code eval dataset and metric by @rishab-partha in #2301
- Isolate env var in unit tests by @mvpatel2000 in #2379
- Add extra steps for space free up by @XiaohanZhangCMU in #2382
- regex changed in time.py by @megha95 in #2378
- Support no param models by making optimizer optional by @mvpatel2000 in #2374
- pin identify version to resolve codequality failures by @XiaohanZhangCMU in #2391
- Add ls to object stores by @dakinggg in #2376
- Change transformers by @rishab-partha in #2383
- Respect MLFLow experiment environment variable by @aspfohl in #2377
- Change code eval apikey by @rishab-partha in #2394
- Moves pytest-cpu slack notifications to issues from helpdesk by @mvpatel2000 in #2398
- Add code eval docs by @rishab-partha in #2397
- fixed pre-commit issues with modifications to pretty-format-json args. by @snarayan21 in #2392
- Fix LOCAL_WORLD_SIZE in pytest by @rishab-partha in #2407
- Add code eval secrets to workflows by @rishab-partha in #2399
- Enable Elastic Sharded Checkpointing by @eracah in #2262
- Remove compute_on_step from MAP by @priba in #2390
- Save metadata and integration when save_weights_only is set by @eracah in #2396
- remove unused Trainer docstring arg load_fsdp_monolith_rank0_only by @eracah in #2408
- torch2.0.1 custom auto wrap by @vchiley in #2400
- Add ruff pre-commit by @Skylion007 in #2414
- Switch google cloud backend from libcloud to google cloud storage API by @XiaohanZhangCMU in #2340
- Updates GPU test timeout to use mcloud flag by @mvpatel2000 in #2420
- Add a EVAL_STANDALONE_START and EVAL_STANDALONE_END events and change RUD to not wait_for_workers every eval by @dakinggg in #2418
- Throttle optimizer monitor by @mvpatel2000 in #2419
- Adding extra condition to avoid running eval_train_metrics by @furkanbiten in #2411
- fp8 on Ada by @dskhudia in #2424
- Bump coverage[toml] from 7.2.7 to 7.3.0 by @dependabot in #2432
- Bump cryptography from 38.0.4 to 41.0.3 by @dependabot in #2436
- Bump ipykernel from 6.24.0 to 6.25.1 by @dependabot in #2434
- Multilingual compatibility and batching for Code Evaluation by @rishab-partha in #2410
- Update max duration on tests by @mvpatel2000 in #2429
- Update timeout by @rishab-partha in #2438
- add dist.barrier to rotate_checkpoints by @eracah in #2440
- Bump version to 0.16 by @mvpatel2000 in #2439
- Fix notebooks by @rishab-partha in #2446
- Fix notebooks v2 by @rishab-partha in #2448
New Contributors
- @eltociear made their first contribution in #2333
- @antoinebrl made their first contribution in #2334
- @julian-q made their first contribution in #2335
- @srstevenson made their first contribution in #2327
- @tbenthompson made their first contribution in #2352
- @sarthak-314 made their first contribution in #2349
- @prithvikannan made their first contribution in #2367
...
v0.15.1
Bug Fixes
This is a patch release that mainly fixes a bug related to autoresume, and changes the default to offload_to_cpu for PyTorch >2 sharded checkpoints.
What's Changed
- Fixes daily test failures with respect to autoadd mosaicml logger by @mvpatel2000 in #2339
- Set offload_to_cpu True for state_dict_type=sharded by @eracah in #2338
- Update version by @mvpatel2000 in #2341
- Fix MAPI mocking by @mvpatel2000 in #2342
- Change GPU timeout by @rishab-partha in #2343
- Add cpu call by @eracah in #2347
- Add spaces between sentences in eval label warning by @srstevenson in #2327
- Avoid overwriting seed=0 by @tbenthompson in #2352
- Small documentation typo fixes by @sarthak-314 in #2349
- Fix wandb errror with autoresume issue by @eracah in #2353
Full Changelog: v0.15.0...v0.15.1