Releases: mosaicml/composer
v0.21.3
Bug Fixes
1. Increased Robustness to Checkpoint Loading
We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened the retry logic against transient cloud object store failures, improving robustness to network issues.
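Retry hardening for transient failures typically follows an exponential-backoff pattern. Below is a minimal, hypothetical sketch of that pattern (not Composer's actual implementation; the `download` callable, attempt count, and delays are illustrative):

```python
import time

def download_with_retry(download, num_attempts=3, base_delay=1.0):
    """Retry a flaky download callable with exponential backoff."""
    for attempt in range(num_attempts):
        try:
            return download()
        except ConnectionError:
            if attempt == num_attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
```

Each failed attempt doubles the wait, which gives transient network blips time to clear without hammering the object store.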
What's Changed
- Raise daily test timeout by @mvpatel2000 in #3172
- fix remote file naming by @cli99 in #3173
- [fix] DTensor + SHARD_GRAD_OP + use_orig_params by @bigning in #3175
- Bump db sdk by @dakinggg in #3176
- Build latest pytorch nightly images by @dakinggg in #3179
- Add FP8 TransformerEngine activation checkpointing by @cli99 in #3156
- Enabling the computation of validation loss and other metrics when using sequence parallelism by @ShashankMosaicML in #3183
- Update mosaic_fsdp_utils.py by @vchiley in #3185
- Fix the FSDP.optim_state_dict_to_load OOM by @bigning in #3184
- Revert "Update mosaic_fsdp_utils.py" by @vchiley in #3187
- Bump databricks-sdk from 0.24.0 to 0.25.1 by @dependabot in #3190
- Add version tag to local builds by @mvpatel2000 in #3188
- Update NeptuneLogger by @AleksanderWWW in #3165
- Filter neptune warning in doctests by @mvpatel2000 in #3195
- Removal of metrics deepcopy before computing the metrics by @gregjauvion in #3180
- Fix MLFlow Tag Name for Resumption by @KuuCi in #3194
- Fix mistral gating by @dakinggg in #3199
- Bump version to 0.21.3 by @mvpatel2000 in #3198
New Contributors
- @gregjauvion made their first contribution in #3180
Full Changelog: v0.21.2...v0.21.3
v0.21.2
Bug Fixes
1. Enable torch 2.2.2 (#3161)
Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
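Correct gating means patches apply only to nightly/dev builds and never to a stable release. A hypothetical sketch of such a version check (Composer's real gating is more involved; the function name is illustrative):

```python
def should_monkeypatch(torch_version: str) -> bool:
    """Apply patches only to dev/nightly or source builds, e.g.
    '2.3.0.dev20240110+cu121' or '2.2.0a0+git1234abc',
    never to stable releases like '2.2.2'."""
    return 'dev' in torch_version or 'git' in torch_version
```

The bug described above is the inverse mistake: a check that matched stable version strings too, so the patches leaked into torch 2.2.2.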
2. MPS Metric Computation on CPU (#3105)
Due to bugs in computing torchmetrics on Mac (MPS) devices, we now move metric computation onto the CPU. Previously, data was not properly moved to the CPU.
Thank you to @hyenal for this contribution!
3. Batch Sampler Support (#3105)
Composer now supports batch samplers, which previously resulted in an error if one was specified in the dataloader.
Thank you to @Ghelfi for this contribution!
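For context, a batch sampler yields whole batches of dataset indices rather than single indices, which changes how a dataloader iterates. A torch-free sketch of the concept (the class name is illustrative, mirroring the behavior of torch's BatchSampler):

```python
class SimpleBatchSampler:
    """Yields lists of indices, one list per batch."""

    def __init__(self, dataset_len: int, batch_size: int):
        self.dataset_len = dataset_len
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for idx in range(self.dataset_len):
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # final partial batch
            yield batch

    def __len__(self):
        return -(-self.dataset_len // self.batch_size)  # ceiling division
```

With torch, such a sampler is passed as `DataLoader(dataset, batch_sampler=...)`, which is the dataloader configuration this fix enables.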
What's Changed
- Make codequality callable by @mvpatel2000 in #3133
- Explicitly print checkpoint downloading exception by @bigning in #3131
- Change release actions by @mvpatel2000 in #3136
- Passing rank and num_replicas to dist.get_sampler by @ShashankMosaicML in #3137
- Fix broadcast by @mvpatel2000 in #3138
- Compressor fixes by @mbway in #3142
- In case of MPS device also copy batch to CPU by @hyenal in #3105
- Composer object store download retry by @bigning in #3140
- Bump databricks-sdk from 0.22.0 to 0.23.0 by @dependabot in #3144
- Update transformers requirement from !=4.34.0,<4.39,>=4.11 to >=4.11,!=4.34.0,<4.40 by @dependabot in #3148
- Update protobuf requirement from <3.21 to <5.27 by @dependabot in #3147
- Bump traitlets from 5.14.1 to 5.14.2 by @dependabot in #3145
- Bump to 0.21 by @mvpatel2000 in #3150
- Fixing sequence parallel error conditions and adding type float for microbatch_size in typehints by @ShashankMosaicML in #3139
- Fix torch monkeypatch version check by @dakinggg in #3155
- Update torchmetrics requirement from <1.3.2,>=0.10.0 to >=0.10.0,<1.3.3 by @dependabot in #3157
- Bump gitpython from 3.1.42 to 3.1.43 by @dependabot in #3160
- Prevent crash if signal handler cannot be set by @mbway in #3152
- Pin pillow for code quality workflow by @dakinggg in #3162
- Fix torch version check by @dakinggg in #3161
- add more retry to checkpoint downloading by @bigning in #3164
- Append to gpu rank log files instead of throwing error by @jjanezhang in #3166
- Call set_epoch on Dataloader.batch_sampler if defined by @Ghelfi in #3124
- Bump version to 0.21.2 by @mvpatel2000 in #3168
Full Changelog: v0.21.1...v0.21.2
v0.21.1
Bug Fixes
1. Fix to HSDP checkpoint loading
The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.
What's Changed
- Fix broadcast by @mvpatel2000 in #3138
Full Changelog: v0.21.0...v0.21.1
v0.21.0
What's New
1. Aggregate Memory Monitoring (#3042)
The Memory Monitor callback now supports aggregating memory statistics across nodes. Summary stats for a run's memory usage across the cluster can dramatically help debug straggler nodes or non-homogeneous workloads. The memory monitor can now aggregate and log combined values at a user-specified frequency.
Example:
```python
from composer import Trainer
from composer.callbacks import MemoryMonitor

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        MemoryMonitor(
            dist_aggregate_batch_interval=10,  # aggregate every 10 batches
        )
    ],
)
```
2. Advanced Compression Options (#3118)
Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, which offer better compression/speed tradeoffs via CLI tools. To use compression, suffix your checkpoint filename with a compression extension. We now support the following extensions:
- bz2
- gz
- lz4
- lzma
- lzo
- xz
- zst
Example:
```python
from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
)
```
Thank you to @mbway for adding this support!
What's Changed
- Rename composer_run_name tag to run_name when logging to MLflow by @jerrychen109 in #3040
- enable aggregate mem monitoring by @vchiley in #3042
- Bump junitparser from 3.1.1 to 3.1.2 by @dependabot in #3056
- Add SHARD_GRAD_OP to device mesh error check by @mvpatel2000 in #3058
- Add torch 2.2.1 support by @mvpatel2000 in #3059
- Use testing repo actions for linting by @b-chu in #3060
- Link autoresume docs back to watchdog by @aspfohl in #3052
- Deprecate get_state and remove deprecations by @b-chu in #3017
- Bump version to 0.20.1 by @mvpatel2000 in #3061
- Remove s3_bucket pytest cli flag by @b-chu in #3064
- Remove s3_bucket flag from gpu test by @b-chu in #3065
- Clean Up OOM Observer Remote Uploader Download path by @j316chuck in #3070
- Fix daily test for iteration by @b-chu in #3068
- Remove "generation_length" in favor of "generation_kwargs" by @maxisawesome in #3014
- Bump packaging by @mvpatel2000 in #3072
- Use ci-testing repo for CPU and GPU tests by @b-chu in #3062
- Add new torch monkeypatches to Composer by @mvpatel2000 in #3063
- Add initial support for neuron devices by @bfontain in #3049
- Stripping whitespaces as default for QATask ICL eval by @ksreenivasan in #3073
- Add ICL base class to all by @mvpatel2000 in #3079
- pass prelimiter into ALL ICL datasets by @eitanturok in #3069
- Bump sentencepiece from 0.1.99 to 0.2.0 by @dependabot in #3083
- Add Iteration related Events to callbacks by @b-chu in #3077
- Add Iteration related Events by @b-chu in #3076
- Bump CI/CD to v3 by @mvpatel2000 in #3086
- Add docstring to _iteration_length by @b-chu in #3088
- Check FSDP module has _device_mesh before getting it by @eracah in #3091
- Bump minor version in base image by @mvpatel2000 in #3092
- Enforce async logging flush in mlflow logger at post_close call by @chenmoneygithub in #3093
- Warning log to info log by @aspfohl in #3096
- Bump transformers by @dakinggg in #3095
- Change style for splitting on commas by @b-chu in #3078
- Remove slash by @b-chu in #3098
- Allowing for fractional number of samples per rank by @ShashankMosaicML in #3075
- Output eval logging (batch level) by @maxisawesome in #2977
- Replace errors with warnings for eval args by @mvpatel2000 in #3100
- Ability to load sharded checkpoints with remote symlink load_path by @eracah in #3097
- Improvements to NeptuneLogger by @AleksanderWWW in #3085
- Revert "Improvements to NeptuneLogger" by @mvpatel2000 in #3111
- Bump mlflow min pin by @dakinggg in #3110
- Fix rounding issue in interval calculation by @dakinggg in #3109
- Bump coverage[toml] from 7.4.1 to 7.4.3 by @dependabot in #3102
- Uses v0.0.4 of ci-testing by @b-chu in #3112
- Add versioned deprecation warning by @irenedea in #2984
- Update Flash Attention to 2.5.5 by @Skylion007 in #3113
- Setting the max duration to current timestamp in the same units as cu… by @ShashankMosaicML in #3090
- Making default_split_batch public by @ShashankMosaicML in #3116
- Adding log exception to Mosaic Logger by @jjanezhang in #3089
- Add checks to schedulers by @b-chu in #3115
- Removed default attrs from exception class in the attrs dict by @jjanezhang in #3126
- Bump coverage[toml] from 7.4.3 to 7.4.4 by @dependabot in #3121
- Refactor initialization by @Practicinginhell in #3127
- Bump databricks sdk version by @dakinggg in #3128
- Update packaging requirement from <23.3,>=21.3.0 to >=21.3.0,<24.1 by @dependabot in #3122
- Remove rng from save_weights_only ckpt by @eracah in #3129
- More compression options by @mbway in #3118
- Only broadcast distcp files by @mvpatel2000 in #3130
- Bump version to 0.21 by @mvpatel2000 in #3132
New Contributors
- @ksreenivasan made their first contribution in #3073
- @eitanturok made their first contribution in #3069
- @Practicinginhell made their first contribution in #3127
- @mbway made their first contribution in #3118
Full Changelog: v0.20.1...v0.21.0
v0.20.1
What's New
1. Torch 2.2.1 Support
Composer now supports torch 2.2.1! We've raised the pin to allow the latest torch, and we've upstreamed all torch monkeypatches so Composer can run out of the box with the latest and greatest torch features.
What's Changed
- Add torch 2.2.1 support by @mvpatel2000 in #3059
- Bump version to 0.20.1 by @mvpatel2000 in #3061
v0.20.0
What's New
1. New Neptune Logger
Composer now supports logging training data to neptune.ai using the NeptuneLogger. To get started:

```python
from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)
```
We also have an example project demonstrating all the awesome things you can do with this integration!
Additional information on the NeptuneLogger can be found in the docs.
2. OOM observer callback with memory visualizations
Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace which identifies memory allocations, which can be critical to designing strategies to mitigate memory usage.
Example:
```python
from composer import Trainer
from composer.callbacks import OOMObserver

# construct the trainer with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)
```
OOM Visualization:
3. Log all gpu rank stdout/err to MosaicML platform
Composer has expanded its integration with the MosaicML platform. Now, we can view all GPU rank stdout/stderr with MCLI logs to enable more comprehensive analysis of jobs.
Example:
```
mcli logs <run-name> --node x --gpu x
```

Note, this defaults to node rank 0 if --node is not provided.

Also, we can find the logs of any global gpu rank with the command:

```
mcli logs <run-name> --global-gpu-rank x
```
Bug Fixes
- Only save RNG on rank 0 by @mvpatel2000 in #2998
- [Auto-microbatch fix] FSDP reshard and cleanup after OOM to fix the cuda memory leak by @bigning in #3030
- Fix skip_first for profiler during resumption by @bigning in #2986
- Race condition fix in checkpoint loading util by @jessechancy in #3001
What's Changed
- Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in #2957
- Modify UCObjectStore.list_objects to lists all files recursively by @irenedea in #2959
- Refactor MemorySnapshot by @cli99 in #2960
- Log all gpu rank stdout/err to MosaicML platform by @jjanezhang in #2839
- Add Torch 2.2 tests by @mvpatel2000 in #2970
- Memory snapshot dump pickle by @cli99 in #2968
- Neptune logger by @AleksanderWWW in #2447
- Fix torch pins in tests by @mvpatel2000 in #2973
- Add a register_model_with_run_id api to MLflowLogger by @dakinggg in #2967
- Remove bespoke codeowners by @mvpatel2000 in #2971
- Add a BEFORE_LOAD event by @snarayan21 in #2974
- More torch 2.2 fixes by @mvpatel2000 in #2975
- Adding the step argument to logger.log_table by @ShashankMosaicML in #2961
- Fix daily tests for torch 2.2 by @mvpatel2000 in #2980
- Format load_path with name by @mvpatel2000 in #2978
- Bump to 0.19.1 by @mvpatel2000 in #2979
- Fix UC object store bugfix by @nancyhung in #2982
- [Bugfix][UC] Add back the full object path by @nancyhung in #2988
- Minor cleanup of UC get_object_size by @dakinggg in #2989
- Pin UC to earlier version by @dakinggg in #2990
- Revert "fix skip_first for resumption" by @bigning in #2991
- Broadcast files for HSDP by @mvpatel2000 in #2914
- Bump ipykernel from 6.29.0 to 6.29.2 by @dependabot in #2994
- Bump yamllint from 1.33.0 to 1.34.0 by @dependabot in #2995
- Refactor update_metric by @maxisawesome in #2965
- Add azure integration test by @mvpatel2000 in #2996
- Fix Profiler schedule skip_first by @bigning in #2992
- Remove planner validation by @mvpatel2000 in #2985
- Fix load for non-HSDP device mesh by @mvpatel2000 in #2997
- Update NCCL arg since torch deprecated old one by @mvpatel2000 in #3000
- Add bias argument to LPLN by @mvpatel2000 in #2999
- Revert "Add bias argument to LPLN" by @mvpatel2000 in #3003
- Revert "Update NCCL arg since torch deprecated old one" by @mvpatel2000 in #3004
- Add torch 2.3 image for aws cluster by @j316chuck in #3002
- Patch torch 2.3 aws naming by @j316chuck in #3006
- Add debug log before training loop starts by @mvpatel2000 in #3005
- Deprecate ffcv code by @j316chuck in #3007
- Remove log for mosaicml logger by @mvpatel2000 in #3008
- [EASY] Always log 1st batch when resuming training by @bigning in #3009
- Use reusable actions for linting by @b-chu in #2948
- Make CodeEval respect device_eval_batch_size by @josejg in #2969
- Use Mosaic constant for GPU file prefix by @jjanezhang in #3018
- Fall back to normal logging when gpu prefix is not present by @jjanezhang in #3020
- Revert "Use reusable actions for linting" to fix CI/CD by @mvpatel2000 in #3023
- Change to pull_request_target by @b-chu in #3025
- Bump gitpython from 3.1.41 to 3.1.42 by @dependabot in #3031
- Bump yamllint from 1.34.0 to 1.35.1 by @dependabot in #3034
- Update torchmetrics requirement from <1.3.1,>=0.10.0 to >=0.10.0,<1.3.2 by @dependabot in #3035
- Bump pypandoc from 1.12 to 1.13 by @dependabot in #3033
- Add tensorboard images support by @Menduist in #3021
- Add sorted to logs for checkpoint broadcast by @mvpatel2000 in #3036
- Friendlier device mesh error by @mvpatel2000 in #3039
- Upgrade to python3.11 for torch nightly by @j316chuck in #3038
- Download symlink once by @mvpatel2000 in #3043
- Add min size to OCI download by @mvpatel2000 in #3044
- Lint fix by @mvpatel2000 in #3045
- Revert "Change to pull_request_target " by @mvpatel2000 in #3047
- Bump composer version 0.19.2 by @j316chuck in #3048
- Update XLA support by @bfontain in #2964
- Bump composer version 0.20.0 by @j316chuck in #3051
- Update ruff. Fix PLE & LOG lints by @Skylion007 in #3050
New Contributors
- @AleksanderWWW made their first contribution in #2447
- @ShashankMosaicML made their first contribution in #2961
- @nancyhung made their first contribution in #2982
- @bigning made their first contribution in #2986
- @jessechancy made their first contribution in #3001
- @josejg made their first contribution in #2969
- @Menduist made their first contribution in #3021
- @bfontain made their first contribution in #2964
**Full Chang...
v0.19.1
What's New
1. New Event: BEFORE_LOAD (#2974)
Composer now has the event Event.BEFORE_LOAD, which lets users modify state before a model is loaded. This is particularly useful for accessing attributes which may not yet exist at Event.INIT, such as the dataloader state.
2. Registering model in MLFlow with run id (#2967)
The MLFlow logger now has register_model_with_run_id, which allows users to register a model based on the run_id. This alternative way of registering the model preserves the link to the MLflow runs.
What's Changed
Full Changelog: v0.19.0...v0.19.1
v0.19.0
What's New
1. Improved DTensor Support
Composer now supports elastic saving and loading of DTensors at various mesh sizes.
2. Checkpoint Saving and Loading from Databricks MLFlow
Composer now supports saving and loading checkpoints to Databricks-managed MLFlow.
```python
from composer import Trainer
from composer.loggers import MLFlowLogger

composer_model = MyComposerModel(...)

trainer = Trainer(
    model=composer_model,
    save_folder='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    loggers=MLFlowLogger(...),
    load_path='dbfs:/databricks/mlflow-tracking/{mlflow_experiment_id}/{mlflow_run_id}/artifacts',
    ...
)
```
3. Better Communication Computation Overlap in FSDP
Composer now has improved communication/computation overlap in our FSDP code which should improve MFU across several architectures.
4. Python3.11 + Torch2.2 Support
Initial support for Python 3.11 and Torch 2.2 has been added to Composer.
5. PEFT LoRA
PEFT LoRA is now supported in the HuggingFaceModel class.
6. Refactored Evaluation
in_context_learning_evaluation.py has a new design with cleaner abstractions and easier interfaces to work with.
7. Azure Checkpointing
Composer now supports saving your model in Azure.
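For example, checkpoints can be directed to Azure through the save_folder URI. This is a hedged sketch: the azure:// scheme is how Composer's other object-store backends are addressed, and the container and path names here are illustrative assumptions, not values from this release.

```python
trainer = Trainer(
    model=model,
    save_folder='azure://my-container/checkpoints',  # hypothetical container/path
    ...
)
```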
8. MLFlow Checkpointing
Composer now supports saving your model in MLFlow.
Bug Fixes
- Fix MLFlowLogger test by @ngcgarcia in #2912
- Fix bug with CoT early stopping and LLama2 tokenizer by @bmosaicml in #2902
- Fix split_batch bug with empty generation_kwargs by @maxisawesome in #2913
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Fix daily tests by @mvpatel2000 in #2891
- Fix seed for FSDP wrap by @mvpatel2000 in #2833
- Fix load_ignore_keys with rng by @mvpatel2000 in #2803
- Fix mosaicml logger on close by @mvpatel2000 in #2816
- Fix torch profiler error on close by @mvpatel2000 in #2818
- Fix import for daily test by @snarayan21 in #2826
- Fix how single value tensors are logged by @aspfohl in #2831
- Fix torch bump by @j316chuck in #2855
- Fix MPS with sequence loss by @JAEarly in #2834
What's Changed
- Bump transformers version by @dakinggg in #2781
- Bump sphinxext-opengraph from 0.9.0 to 0.9.1 by @dependabot in #2784
- Bump coverage[toml] from 7.3.0 to 7.3.3 by @dependabot in #2783
- Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 by @dependabot in #2785
- [UCVolumes] Rely on databricks-sdk auth for the right requirements by @panchalhp-db in #2789
- Enable system metrics in mosaic mlflow logger by @chenmoneygithub in #2775
- Update parse_uri by @irenedea in #2787
- default to no torch profiler memory timeline by @cli99 in #2790
- Add eot token to ICL generate kwargs by @bmosaicml in #2782
- Add nightly image for torch 2.2.0-12-20-23 by @j316chuck in #2791
- Add torch nightly 12-13 by @j316chuck in #2792
- Add process group as arg to FSDP by @mvpatel2000 in #2794
- Bump coverage[toml] from 7.3.3 to 7.3.4 by @dependabot in #2798
- Bump ipykernel from 6.26.0 to 6.28.0 by @dependabot in #2806
- Bump junitparser from 3.1.0 to 3.1.1 by @dependabot in #2805
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #2807
- Avoid futures on close for MosaicML logger by @mvpatel2000 in #2804
- Require sync module states with HSDP by @mvpatel2000 in #2812
- Better communication computation overlap by @snarayan21 in #2811
- Improve error message for speed monitor by @mvpatel2000 in #2801
- Bump torch version -- DO NOT RELEASE by @mvpatel2000 in #2814
- Bump torchvision for nightly by @mvpatel2000 in #2815
- Correct multi-unshard stream patching for torch 2.2.0dev, and stream waiting correctness. by @snarayan21 in #2817
- Bump traitlets from 5.13.0 to 5.14.1 by @dependabot in #2822
- All unshard streams wait on computation every step by @snarayan21 in #2823
- Add encoding=utf-8 by @dakinggg in #2824
- [MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore by @jerrychen109 in #2802
- Remove fused layernorm (already deprecated for 2 versions) by @mvpatel2000 in #2827
- checkpoint saver tracks all checkpoints/intervals in state by @aspfohl in #2819
- code-quality timeout update by @aspfohl in #2830
- Adds DTensor Support by @mvpatel2000 in #2821
- Remove duplicate checkpoint verifications by @eracah in #2828
- Remove fsdp patch for comm overlap by @mvpatel2000 in #2836
- Allow hsdp by @mvpatel2000 in #2838
- Bump torch 2.1.2 by @mvpatel2000 in #2840
- Upgrade pyright to 1.1.310 by @b-chu in #2841
- [MLFlowObjectStore] [2/2] Support checkpointing with MLFlow by @jerrychen109 in #2810
- update nightly to torch 2.3 by @j316chuck in #2842
- Pin sphinxcontrib applehelp by @mvpatel2000 in #2854
- Torch 2.3 patch by @dakinggg in #2849
- Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 by @dependabot in #2866
- Rewrite to use individual state functions by @mvpatel2000 in #2860
- Add custom stopping criteria to ICL generate tasks by @bmosaicml in #2800
- Add save_ignore_keys by @mvpatel2000 in #2868
- Remome log debug by @mvpatel2000 in #2871
- Update monkeypatch to put barrier in optim load by @mvpatel2000 in #2874
- Remove toml by @b-chu in #2872
- Update license by @b-chu in #2875
- Add ignore_metrics field to the MLflow logger by @ngcgarcia in #2869
- Convert print to log.info by @mvpatel2000 in #2876
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Bump torchmetrics by @mvpatel2000 in #2890
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Bump version to 0.18.1 by @b-chu in #2905
- Refactor in_context_learning_evaluation.py by @maxisawesome in #2713
- Fix FP8 checkpoint resumption with onnx export flag by @j316chuck in #2907
- Add Python 3.11 + FA 2.5.0 + Torch 2.3.0 Image by @KuuCi in #2898
- Add yamllint to pre commit by @b-chu in #2909
- Add ignore_hyperparameters to MLFlowLogger by @ngcgarcia in #2908
- Bump coverage[toml] from 7.3.4 to 7.4.1 by @dependabot in #2915
- Add checkpoint test for 0.18.1 by @b-chu in #2906
- Integrate PEFT LoRA with HuggingFaceModel by @dakinggg in #2829
New Contributors
- @jerrychen109 made their first contribution in #2802
- @JAEarly made their first contribution in https://github.com/mosa...
v0.18.2
Bug Fixes
- Fix lp layernorm weight by @snarayan21 in #2954
What's Changed
- Fix lp layernorm weight by @snarayan21 in #2954
- Bump version to 0.18.2 by @b-chu
Full Changelog: v0.18.1...v0.18.2
v0.18.1
Bug Fixes
- Fix MPS with sequence loss by @JAEarly in #2834
- Fix daily tests by @mvpatel2000 in #2891
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
What's Changed
- Bump version to 0.18.0 by @irenedea in #2877
- Removed commented-out unshard streams patching. by @snarayan21 in #2873
- Make code quality workflow reusable by @b-chu in #2878
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #2885
- Fix MPS with sequence loss by @JAEarly in #2834
- Bump torchmetrics by @mvpatel2000 in #2890
- Fix daily tests by @mvpatel2000 in #2891
- Bump transformers to 4.37 by @dakinggg in #2894
- Azure checkpointing support by @mvpatel2000 in #2893
- Pass PG into checkpoint load and load rng with state_dict by @mvpatel2000 in #2897
- Remove monkeypatch and new state dict APIs for torch 2.2 by @mvpatel2000 in #2899
- Only load RNG keys that exist by @mvpatel2000 in #2901
- Bump version to 0.18.1 by @b-chu in #2905
New Contributors
Full Changelog: v0.18.0...v0.18.1