Skip to content

v0.3.0

Compare
Choose a tag to compare
@karan6181 karan6181 released this 01 Mar 08:30
· 350 commits to main since this release
11d0944

🚀 Streaming v0.3.0

Streaming v0.3.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.3.0

New Features

☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to MDSWriter. Track the progress of individual uploads with progress_bar=True, and tune background upload workers with max_workers=4.

User can choose to upload a output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by providing a out parameter as a cloud provider bucket location as part of Writer class. Below is the example to upload output files to AWS S3 bucket

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass

User can choose to keep a output shard files locally by providing a local directory path as part of Writer. For example,

output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass

User can see the progress of the cloud upload file by setting progress_bar=True as part of Writer. For example,

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        pass

User can control the number of background upload threads via parameter max_workers as part of Writer who is responsible for uploading the shard files to a remote location if provided. One thread is responsible for one file upload. For example, if max_workers=4, maximum 4 threads would be active at a same time uploading one shard file each.

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        pass

🔀 2x faster shuffling

We’ve added a new shuffling algorithm py1s which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding shuffle_algo (old behavior: py2s). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.

📨 2x faster partitioning

We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the partition_algo argument to StreamingDataset. You will experience this as faster start and resumption for large datasets.

🔗 Extensible downloads

We provide examples of modifying StreamingDataset to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.

API changes

  • Class Writer and its derived classes (MDSWriter, XSVWriter, TSVWriter, CSVWriter, and JSONWriter) parameter has been changed from dirname to out with the following advanced functionalities:

    • If out is a local directory, shard files are saved locally. For example, out=/tmp/mds/.
    • If out is a remote directory, a local temporary directory is created to cache the shard files and then the shard files are uploaded to a remote location. At the end, the temp directory is deleted once shards are uploaded. For example, out=s3://bucket/dir/path.
    • If out is a tuple of (local_dir, remote_dir), shard files are saved in the
      local_dir and also uploaded to a remote location. For example, out=('/tmp/mds/', 's3://bucket/dir/path').
  • Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of Writer and its subclasses (like MDSWriter) and StreamingDataset to require kwargs.

Bug Fixes

What's Changed

New Contributors

Full Changelog: v0.2.5...v0.3.0