Releases: mosaicml/streaming

v0.9.0

25 Sep 02:34

🚀 Streaming v0.9.0

Streaming v0.9.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.9.0

What's New

1. Improved compatibility for ndarray and json types (#776, #777)

Columns that contain a map type can now be converted to JSON in an MDS file when the column's type is specified as 'json'. The JSON encoder also now handles ndarray values.
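As a rough illustration of the idea (not the library's actual encoder), the sketch below shows how a JSON encoder can be taught to serialize array-like values nested inside a map. The class name `ArrayFriendlyEncoder` is hypothetical, and the standard library's `array.array` stands in for `numpy.ndarray` so that the example needs no third-party packages; both types expose `tolist()`.

```python
import json
from array import array


class ArrayFriendlyEncoder(json.JSONEncoder):
    """Hypothetical encoder that converts array-like objects to plain lists."""

    def default(self, obj):
        # numpy.ndarray and array.array both expose tolist().
        if hasattr(obj, "tolist"):
            return obj.tolist()
        return super().default(obj)


# A map-typed value whose entries include an array.
sample = {"id": 7, "scores": array("d", [0.5, 1.5])}
encoded = json.dumps(sample, cls=ArrayFriendlyEncoder)
decoded = json.loads(encoded)
```

After the round trip, the array comes back as an ordinary JSON list.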

Full Changelog: v0.8.1...v0.9.0

v0.8.1

23 Aug 20:26
a9a7d04

🚀 Streaming v0.8.1

Streaming v0.8.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.8.1

🔧 Improvements

Dataloader hanging between epochs has now been resolved! We've seen training time improvements of up to 40% for some many-epoch training jobs. If this was impacting your runs and has now been fixed, please let us know!

  • Fix dataloader hang at the end of an epoch by @XiaohanZhangCMU in #741
  • Add default compression, and warning about local paths to dataframe_to_mds by @srowen in #748
  • Throw exception when event.is_set() after write()s by @srowen in #754

🐛 Bug Fixes

  • Ensure deterministic sample order between epochs when shuffle=False by @snarayan21 in #750

Full Changelog: v0.8.0...v0.8.1

v0.8.0

30 Jul 17:00
b14cd7a

✨ What's New ✨

1. HF File System Streaming (#711)

Streaming now supports streaming data from the Hugging Face (HF) file system! This adds another popular backend as an option for hosting your data.

Full Changelog: v0.7.6...v0.8.0

v0.7.6

10 May 22:22
97eae28

🚀 Streaming v0.7.6

Streaming v0.7.6 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.6

💎 New Features

1. device_per_stream batching method

Users can now construct batches such that each device sees only samples from a single stream. This is very useful in cases where different data sources have samples/tensors of different sizes, but the model should still see samples from these different data sources at each optimizer step.
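The batching idea can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation: the function `per_device_batches`, the round-robin ordering over streams, and the choice to drop incomplete chunks are all assumptions made for the example.

```python
from itertools import cycle


def per_device_batches(streams, device_batch_size, num_devices):
    """Yield global batches as lists of (stream_name, device_batch) pairs.

    Each device batch is drawn from exactly one stream.
    """
    # Split every stream into full device-sized chunks; leftovers are dropped.
    chunks = {
        name: [
            samples[i:i + device_batch_size]
            for i in range(0, len(samples) - device_batch_size + 1, device_batch_size)
        ]
        for name, samples in streams.items()
    }
    order = cycle(sorted(chunks))
    pending = []
    while any(chunks.values()):
        name = next(order)
        if chunks[name]:
            pending.append((name, chunks[name].pop(0)))
        if len(pending) == num_devices:
            # One global batch: a single-stream batch for every device.
            yield pending
            pending = []


streams = {
    "long": [f"long-{i}" for i in range(4)],
    "short": [f"short-{i}" for i in range(5)],
}
batches = list(per_device_batches(streams, device_batch_size=2, num_devices=2))
```

Each global batch still mixes streams across devices, so the model sees every data source at each step, while no single device has to collate samples of different shapes.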

2. Add ndarray type for Spark dataframes.

Enable parsing Spark's ArrayType (of ShortType, LongType, IntegerType, FloatType, DoubleType) when converting a Spark dataframe to MDS.
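One way to picture the conversion is as a lookup from the Spark element type to an MDS ndarray encoding string. The mapping below is an assumption for illustration (the exact strings used by the library are not stated in these notes), and `mds_encoding_for_array` is a hypothetical helper; Spark types are referred to by name so the sketch runs without PySpark installed.

```python
# Assumed mapping from Spark ArrayType element type names to MDS encodings.
SPARK_ARRAY_TO_MDS = {
    "ShortType": "ndarray:int16",
    "IntegerType": "ndarray:int32",
    "LongType": "ndarray:int64",
    "FloatType": "ndarray:float32",
    "DoubleType": "ndarray:float64",
}


def mds_encoding_for_array(element_type_name):
    """Return the assumed MDS encoding for an array of the given element type."""
    try:
        return SPARK_ARRAY_TO_MDS[element_type_name]
    except KeyError:
        raise ValueError(
            f"Unsupported ArrayType element type: {element_type_name}"
        ) from None
```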

3. Support for Alipan storage

Adds support for Alipan, Alibaba's cloud storage service.

Full Changelog: v0.7.5...v0.7.6

v0.7.5

09 Apr 00:35
3ba9301

🚀 Streaming v0.7.5

Streaming v0.7.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support

Using the replication argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.

  • Replicating samples across devices (SP / TP enablement) by @knighton in #597
  • Expanded replication testing + documentation by @snarayan21 in #607
  • Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
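The effect of replication can be sketched as a mapping from device ranks to data groups. This is a simplified model rather than the library's code: it assumes that consecutive ranks are grouped together, and the helper names `sample_group` and `unique_data_ranks` are hypothetical.

```python
def sample_group(rank, replication):
    """Return the index of the group of ranks that share the same samples."""
    if replication < 1:
        raise ValueError("replication must be a positive integer")
    # Assumption: every `replication` consecutive ranks see identical samples.
    return rank // replication


def unique_data_ranks(world_size, replication):
    """Return how many distinct sample streams the run consumes."""
    if world_size % replication != 0:
        raise ValueError("world_size must be divisible by replication")
    return world_size // replication


groups = [sample_group(rank, replication=2) for rank in range(8)]
```

With 8 ranks and a replication factor of 2, the run behaves like 4 data-parallel workers, which is why the number of unique samples per step has to be computed from the group count rather than the world size.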

2. Overhauled Streaming Documentation

New and improved Streaming documentation is now available. Please submit issues with any feedback.

3. batch_size is now required for StreamingDataset

We have seen multiple errors and performance degradations caused by users not setting the batch_size argument of StreamingDataset, so batch_size must now be set before iterating over the dataset.

4. Support for Python 3.11, deprecate Python 3.8

  • Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586

🐛 Bug Fixes

  • [easy typo fix] fix f-string by @bigning in #596
  • Change comparison in partitions to include equals by @JAEarly in #587
  • Use type int when initializing SharedMemory size by @bchiang2 in #604
  • COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647

Full Changelog: v0.7.4...v0.7.5

v0.7.4

08 Feb 22:00
a0443bb

🚀 Streaming v0.7.4

Streaming v0.7.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.4

🐛 Bug Fixes

  • Download to temporary path from azure by @philipnrmn in #566
  • fix(merge_index): scheme was not well formatted by @fwertel in #576
  • Update misplaced params of _format_remote_index_files by @lsongx in #584
  • Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593

Full Changelog: v0.7.3...v0.7.4

v0.7.3

12 Jan 18:12
47efc9d

🚀 Streaming v0.7.3

Streaming v0.7.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.3

🐛 Bug Fixes

  • Logging messages for new defaults only show once per rank. (#543)
  • Fixed padding calculation for repeat samples in the partition. (#544)

🔧 Other improvements

  • Update copyright license year from 2023 -> 2022-2024. (#560)

Full Changelog: v0.7.2...v0.7.3

v0.7.2

14 Dec 17:26
fac84b4

🚀 Streaming v0.7.2

Streaming v0.7.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.2

💎 New Features

1. Canned ACL Support (#512)

Adds support for canned ACLs on AWS S3 through the S3_CANNED_ACL environment variable. Check out the Canned ACL documentation for how to use it.
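A minimal sketch of setting the variable from Python before any upload starts. The value shown, `bucket-owner-full-control`, is one of the standard AWS canned ACL names and is chosen here only as an example; use whichever ACL your bucket policy requires.

```python
import os

# Set the variable before the writer or uploader is created.
# The value is an example; any valid AWS canned ACL name can be used.
os.environ["S3_CANNED_ACL"] = "bucket-owner-full-control"
```

The same effect can be had by exporting the variable in the shell that launches the job.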

2. Allow/reject datasets containing unsafe types (#519)

The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag allow_unsafe_types in the StreamingDataset class to allow or reject datasets containing Pickle.
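The kind of check this flag enables can be sketched as follows. This is an illustration, not the library's code: the helper `check_column_encodings` is hypothetical, and it assumes that pickle columns are identified by the encoding name `pkl`.

```python
# Assumption: pickle-encoded columns are labeled "pkl" in the dataset index.
UNSAFE_ENCODINGS = {"pkl"}


def check_column_encodings(column_encodings, allow_unsafe_types=False):
    """Return the unsafe column names, raising if they are not allowed."""
    unsafe = sorted(
        name
        for name, encoding in column_encodings.items()
        if encoding in UNSAFE_ENCODINGS
    )
    if unsafe and not allow_unsafe_types:
        raise ValueError(
            f"Dataset contains unsafe column types: {unsafe}. "
            "Pass allow_unsafe_types=True only if you trust the data source."
        )
    return unsafe
```

Rejecting by default is the safer choice, because unpickling data from an untrusted source can execute arbitrary code.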

🐛 Bug Fixes

  • Retrieve batch size correctly from vision yamls for the streaming simulator (#501)
  • Fix for CVE-2023-47248 (#504)
  • Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)
    • Proportion of None instead of a string 'None' is now handled correctly.
    • Repeat of None instead of a string 'None' is now handled correctly.
    • Added warning for StreamingDataset subclass defaults
  • Fix sample partitioning algorithm bug for tiny datasets (#517)

🔧 Improvements

  • Added warning messages for new streaming dataset defaults to inform users about the old and new values. (#502)

Full Changelog: v0.7.1...v0.7.2

v0.7.1

06 Nov 23:03
4c33ad3

🚀 Streaming v0.7.1

Streaming v0.7.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.1

🐛 Bug Fixes

  • Simulation from command line with simulator is fixed (#499)

What's Changed

  • Fixing simulator command with simulation directories being included in package by @snarayan21 in #499

Full Changelog: v0.7.0...v0.7.1

v0.7.0

06 Nov 01:23
4e8c944

🚀 Streaming v0.7.0

Streaming v0.7.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.0

📈 Better Defaults for StreamingDataset (#479)

  • The default values for StreamingDataset have been updated to be more performant and are applicable for most use cases, detailed below:

| Parameter | Old Value | New Value | Benefit |
|---|---|---|---|
| shuffle_algo | py1s | py1e | Better shuffle and balanced downloading |
| num_canonical_nodes | 64 * physical_nodes | if py1s or py2s, 64 * physical_nodes, otherwise physical_nodes | Consistently good shuffle for all shuffle algos |
| shuffle_block_size | 262,144 | 4,000,000 / num_canonical_nodes | Consistently good shuffle for all num_canonical_nodes values |
| predownload | max(batch_size, 256 * batch_size // num_canonical_nodes) | 8 * batch_size | Better balanced downloading |
| partition_algo | orig | relaxed | More flexible deterministic resumptions on nodes |
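
To make the new defaults concrete, the sketch below computes them for a given configuration. The function `new_defaults` is a hypothetical helper that only transcribes the values listed above; the use of integer division for shuffle_block_size is an assumption.

```python
def new_defaults(batch_size, physical_nodes, shuffle_algo="py1e"):
    """Compute the v0.7.0 default values described in the release notes."""
    if shuffle_algo in ("py1s", "py2s"):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes
    return {
        "shuffle_algo": shuffle_algo,
        "num_canonical_nodes": num_canonical_nodes,
        # Assumption: the division is rounded down to an integer.
        "shuffle_block_size": 4_000_000 // num_canonical_nodes,
        "predownload": 8 * batch_size,
        "partition_algo": "relaxed",
    }


defaults = new_defaults(batch_size=32, physical_nodes=4)
```

For a per-device batch size of 32 on 4 physical nodes, this gives 4 canonical nodes, a shuffle block size of 1,000,000 samples, and a predownload of 256 samples.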

💎 New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (#385)

  • After installing this version of streaming, simply run the command simulator in your terminal to open the simulation interface.
  • Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
  • Easily de-risk runs and find performant parameter settings.
  • Check out the docs for more information!

🔢 More flexible deterministic training and resumption (#476)

  • Deterministic training and resumption are now possible on a wider range of node counts!
  • Previously, the num_canonical_nodes parameter had to divide or be a multiple of the number of physical nodes for determinism.
  • Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.
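
The old and new conditions can be written as two small predicates. This is only a restatement of the bullets above in code; the function names are hypothetical and the library performs its own checks internally.

```python
def deterministic_before(num_canonical_nodes, physical_nodes):
    """Old rule: one node count had to divide the other."""
    return (
        num_canonical_nodes % physical_nodes == 0
        or physical_nodes % num_canonical_nodes == 0
    )


def deterministic_now(global_batch_size, physical_nodes):
    """New rule: the node count must evenly divide the global batch size."""
    return global_batch_size % physical_nodes == 0


# 6 nodes with 8 canonical nodes failed the old rule,
# but works now if the global batch size is a multiple of 6.
old_ok = deterministic_before(num_canonical_nodes=8, physical_nodes=6)
new_ok = deterministic_now(global_batch_size=96, physical_nodes=6)
```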

🐛 Bug Fixes

  • Check for invalid hash algorithm names (#486)

Full Changelog: v0.6.1...v0.7.0