Releases: mosaicml/streaming

v0.2.3

31 Jan 20:36
6a30df6

🚀 Streaming v0.2.3

Streaming v0.2.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.3

New Features

  • Add scalar data types to the MDS encodings (#130)
  • Support for the WebVid-10M dataset (#132)
  • Support for the LAION-400M dataset (#87)
  • Make StreamingDataset[sample_id] block until the given sample's shard has been downloaded if it is not already present, so that the dataset can be used lazily (#118)
  • Add a Streaming benchmarking script that reports the time taken by each individual component (#121)
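
The lazy indexing behavior from #118 can be illustrated with a toy class. This is a minimal sketch of the idea only, not the Streaming implementation: `LazyShardedDataset`, `fetch_shard`, and `fake_fetch` are hypothetical names, and shards are assumed to hold a fixed number of samples.

```python
from typing import Callable, Dict, List


class LazyShardedDataset:
    """Toy stand-in for a sharded dataset (hypothetical, not the Streaming class)."""

    def __init__(self, samples_per_shard: int, num_shards: int,
                 fetch_shard: Callable[[int], List[str]]) -> None:
        self.samples_per_shard = samples_per_shard
        self.num_shards = num_shards
        self.fetch_shard = fetch_shard  # hypothetical download callback
        self.cache: Dict[int, List[str]] = {}  # shards already present locally

    def __len__(self) -> int:
        return self.samples_per_shard * self.num_shards

    def __getitem__(self, sample_id: int) -> str:
        if not 0 <= sample_id < len(self):
            raise IndexError(sample_id)
        shard_id, offset = divmod(sample_id, self.samples_per_shard)
        if shard_id not in self.cache:
            # Block here until the shard holding this sample has been fetched.
            self.cache[shard_id] = self.fetch_shard(shard_id)
        return self.cache[shard_id][offset]


downloads: List[int] = []  # records which shards were fetched


def fake_fetch(shard_id: int) -> List[str]:
    """Pretend to download a shard of four samples."""
    downloads.append(shard_id)
    return [f'shard{shard_id}-sample{i}' for i in range(4)]


dataset = LazyShardedDataset(samples_per_shard=4, num_shards=3, fetch_shard=fake_fetch)
first = dataset[5]   # fetches shard 1 only
second = dataset[6]  # shard 1 is already cached, so nothing new is fetched
```

Only the shard that contains the requested sample is fetched, which is what allows random access without downloading the whole dataset up front.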

Bug Fixes

  • Nuke concat option in C4 dataset (#129)
  • Fixed bug report markdown doc (#140)
  • Fixed ADE20K dataset conversion script (#133)

Full Changelog: v0.2.2...v0.2.3

v0.2.2

09 Jan 22:11
f29bac1

🚀 Streaming v0.2.2

Streaming v0.2.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.2

New Features

  • Add in-browser partitioning visualizer (#108)
  • Add command-line partitioning visualizer (#115)

Bug Fixes

  • Get dataloader worker multiprocessing working with spawn, removing Mac OSX fork requirement (#97)
  • Improve error messaging (#100)
  • Fix CUDA OOM (#103)
  • Fix broken source code links in docs (#104)
  • Reference the shared memory object in a worker process when using spawn multiprocessing method (#106)
  • Release all the StreamingDataset resources during job termination (#107)

Full Changelog: v0.2.1...v0.2.2

v0.2.1

22 Dec 23:09
0dec354

🚀 Streaming v0.2.1

Streaming v0.2.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.1

Bug Fixes

  • Make StreamingDataset smarter about when to init dist itself, fixing env var rendezvous problem (#94).
  • Shorten shared memory names for Mac OSX (#95).
  • Reduce memory usage in StreamingDataset, alleviating inscrutable worker OOMs with large datasets (#96).
  • Better exception handling in downloading (#98).
  • Hard require fork for dataloader multiprocessing in Mac OSX due to unpickleable objects (#101).
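
The fix in #94 only initializes the distributed process group when the rendezvous environment variables are present. A minimal sketch of such a check follows. The four variable names are the ones PyTorch's `env://` initialization reads; which of them Streaming actually inspects is an assumption here, and `dist_env_is_set` is a hypothetical helper.

```python
import os
from typing import Mapping

# Variables read by PyTorch's env:// rendezvous. Whether Streaming checks
# exactly this set is an assumption made for illustration.
DIST_ENV_VARS = ('MASTER_ADDR', 'MASTER_PORT', 'RANK', 'WORLD_SIZE')


def dist_env_is_set(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if every rendezvous variable is defined."""
    return all(name in env for name in DIST_ENV_VARS)


single_process_env = {'PATH': '/usr/bin'}
launched_env = {
    'MASTER_ADDR': '127.0.0.1',
    'MASTER_PORT': '29500',
    'RANK': '0',
    'WORLD_SIZE': '8',
}

skip_init = not dist_env_is_set(single_process_env)  # plain Python run: no dist init
do_init = dist_env_is_set(launched_env)              # launcher set the variables
```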

What's Changed

  • Also check if dist env vars are set. If not set, don't init dist. by @knighton in #94
  • Shorten the names of shared memory objects to make OSX happy. by @knighton in #95
  • Just do the partitioning/shuffling in the local leader worker. by @knighton in #96
  • propagate the actual exception and raise by @karan6181 in #98
  • Set multiprocessing method as fork for Mac OS by @karan6181 in #101
  • Bump version to 0.2.1 by @karan6181 in #102

Full Changelog: v0.2.0...v0.2.1

v0.2.0

09 Dec 06:44
1067f1b

🚀 Streaming v0.2.0

Streaming v0.2.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.0

New Features

  1. Elastic world size deterministic shuffle

    Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.

  2. Instant Mid-Epoch Resumption

    Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.

  3. New StreamingDataLoader

    StreamingDataLoader is a drop-in replacement for the PyTorch DataLoader that adds mid-epoch resumption: it resumes from where you left off without spinning the dataloader.

  4. Support for Oracle Cloud Infrastructure (OCI) blob storage

    Streaming now supports OCI blob storage as a storage backend. Pass the remote path to the StreamingDataset class as either oci://<bucket_name>@<namespace>/<folder_name>/<filename> or oci://<bucket_name>/<folder_name>/<filename>. For example:

    from streaming import StreamingDataset
    
    remote = 'oci://<bucket>@<namespace>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')

    Streaming expects the credentials to be present in the ~/.oci/config file.

  5. Support for public AWS S3 buckets

    Streaming now supports public AWS S3 buckets, which can be accessed without credentials, in addition to the already supported private AWS S3 buckets. Instantiate the StreamingDataset class with an AWS S3 bucket as follows:

    from streaming import StreamingDataset
    
    remote = 's3://<bucket>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')
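
The elastic deterministic shuffle in item 1 above can be illustrated with a toy model. This is a simplified sketch, not the library's algorithm: it ignores the canonical number of nodes and simply derives one global order from the seed and epoch, then deals samples out round-robin. The function names are hypothetical.

```python
import random
from typing import List


def global_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Shuffle sample IDs from the seed and epoch only, never the world size."""
    ids = list(range(num_samples))
    random.Random(seed + epoch).shuffle(ids)
    return ids


def device_slice(order: List[int], rank: int, world_size: int) -> List[int]:
    """Device `rank` takes every `world_size`-th sample of the global order."""
    return order[rank::world_size]


def collective_traversal(order: List[int], world_size: int) -> List[int]:
    """Interleave the per-device slices step by step, as the devices see them."""
    slices = [device_slice(order, rank, world_size) for rank in range(world_size)]
    merged: List[int] = []
    for step in range(len(slices[0])):  # rank 0 always holds the longest slice
        for device_samples in slices:
            if step < len(device_samples):
                merged.append(device_samples[step])
    return merged


order = global_order(num_samples=16, seed=9176, epoch=0)
on_two_devices = collective_traversal(order, world_size=2)
on_four_devices = collective_traversal(order, world_size=4)
```

Because the order depends only on the seed and epoch, the samples seen collectively are the same whether the job runs on two devices or four, which is the property that makes resuming on a different number of nodes possible.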
    

API changes

  • The class Dataset has been renamed as class StreamingDataset (#37).
    • Similarly, the classes for the most popular built-in datasets have also been renamed. For example,
      • C4 renamed as StreamingC4
      • EnWiki renamed as StreamingEnWiki
      • Pile renamed as StreamingPile
      • ADE20K renamed as StreamingADE20K
      • CIFAR10 renamed as StreamingCIFAR10
      • COCO renamed as StreamingCOCO
      • ImageNet renamed as StreamingImageNet
  • The parameter prefetch in class Dataset has been renamed as predownload in class StreamingDataset (#37).
  • The parameter retry in class Dataset has been renamed as download_retry in class StreamingDataset (#37).
  • The parameter timeout in class Dataset has been renamed as download_timeout in class StreamingDataset (#37).
  • The parameter hash in class Dataset has been renamed as validate_hash in class StreamingDataset (#37).
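
When migrating call sites from Dataset to StreamingDataset, the four parameter renames above can be applied mechanically. The helper below is a hypothetical convenience for illustration, not part of the library; the argument values in the example are made up.

```python
from typing import Any, Dict

# Old Dataset parameter name -> new StreamingDataset parameter name (#37).
RENAMED_PARAMS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Rename v0.1.x keyword arguments; unrelated arguments pass through."""
    return {RENAMED_PARAMS.get(name, name): value
            for name, value in old_kwargs.items()}


old_kwargs = {
    'local': '/tmp/dataset/',
    'prefetch': 100000,
    'retry': 2,
    'timeout': 60,
    'hash': 'sha1',
}
new_kwargs = migrate_kwargs(old_kwargs)
```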

Full Changelog: v0.1.2...v0.2.0

v0.1.2

14 Nov 22:14
0c34652

🚀 Streaming v0.1.2

Streaming v0.1.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.1.2

Full Changelog: v0.1.1...v0.1.2

v0.1.1

24 Oct 22:15
2035a72

🚀 Streaming v0.1.1

Streaming v0.1.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.1.1

Full Changelog: https://github.com/mosaicml/streaming/commits/v0.1.1