# 🚀 Streaming v0.3.0

Streaming v0.3.0 is released! Install via pip:

```bash
pip install --upgrade mosaicml-streaming==0.3.0
```
## New Features
### ☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to `MDSWriter`. Track the progress of individual uploads with `progress_bar=True`, and tune the number of background upload workers with `max_workers=4`.
Users can upload output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by providing the `out` parameter of the `Writer` class as a cloud provider bucket location. Below is an example that uploads output files to an AWS S3 bucket:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```
Users can keep output shard files local by providing a local directory path as `out`. For example:

```python
output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```
Users can see the progress of cloud uploads by setting `progress_bar=True` on the `Writer`. For example:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        pass
```
Users can control the number of background upload threads via the `max_workers` parameter of the `Writer`, which is responsible for uploading the shard files to a remote location if one is provided. One thread handles one file upload. For example, with `max_workers=4`, at most 4 threads are active at a time, each uploading one shard file.

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        pass
```
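This one-thread-per-file model maps naturally onto a thread pool. A minimal, self-contained sketch of the pattern (not Streaming's actual implementation; `upload` is a hypothetical stand-in for a real cloud upload call):

```python
from concurrent.futures import ThreadPoolExecutor

def upload(shard: str) -> str:
    # Placeholder for a real cloud upload call; one thread handles one file.
    return f'uploaded {shard}'

shards = [f'shard.{i:05}.mds' for i in range(8)]

# At most 4 uploads run concurrently, one thread per shard file.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload, shards))

print(results[0])  # uploaded shard.00000.mds
```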
### 🔀 2x faster shuffling

We've added a new shuffling algorithm, `py1s`, which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding `shuffle_algo` (old behavior: `py2s`). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.
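A minimal usage sketch (assuming a dataset already written to a local directory; this is not runnable without one, and the exact constructor arguments beyond `shuffle_algo` are illustrative):

```python
from streaming import StreamingDataset

# Select the new, faster algorithm explicitly; pass 'py2s' to restore
# the old shuffling behavior.
dataset = StreamingDataset(local='/tmp/mds', shuffle=True, shuffle_algo='py1s')
```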
### 📨 2x faster partitioning

We've also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers; it runs about twice as fast on typical workloads while giving identical results. This is exposed as the `partition_algo` argument to `StreamingDataset`. You will experience this as faster starts and resumption for large datasets.
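A hedged sketch of overriding the partitioning algorithm (the value `'orig'` is an assumption here, not confirmed by these notes; check this release's `StreamingDataset` docstring for the accepted names):

```python
from streaming import StreamingDataset

# 'orig' is an illustrative algorithm name; consult the StreamingDataset
# docstring for the values this release actually accepts.
dataset = StreamingDataset(local='/tmp/mds', partition_algo='orig')
```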
### 🔗 Extensible downloads

We provide examples of modifying `StreamingDataset` to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.
## API changes

- The `Writer` class and its derived classes (`MDSWriter`, `XSVWriter`, `TSVWriter`, `CSVWriter`, and `JSONWriter`) have had their `dirname` parameter renamed to `out`, with the following advanced functionality:
  - If `out` is a local directory, shard files are saved locally. For example, `out='/tmp/mds/'`.
  - If `out` is a remote directory, a local temporary directory is created to cache the shard files, and the shard files are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, `out='s3://bucket/dir/path'`.
  - If `out` is a tuple of `(local_dir, remote_dir)`, shard files are saved in `local_dir` and also uploaded to the remote location. For example, `out=('/tmp/mds/', 's3://bucket/dir/path')`.
- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of `Writer` and its subclasses (like `MDSWriter`) and `StreamingDataset` to require kwargs.
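As a sketch of what the kwargs requirement means in practice (the `columns` spec here is illustrative, not from these notes):

```python
from streaming import MDSWriter

columns = {'text': 'str'}  # illustrative column spec

# A positional call such as MDSWriter('/tmp/mds', columns) now raises a
# TypeError; arguments must be passed by keyword:
with MDSWriter(out='/tmp/mds', columns=columns) as writer:
    pass
```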
## Bug Fixes
- Fixed a broken blog post link and the community email link in the README (#177).
- Download shard files with a `.tmp` extension until the download finishes, for OCI blob storage (#178).
- Added supported cloud providers documentation (#169). Streaming Dataset supports the Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) doc on how to configure cloud storage credentials.
- Made setup.py deterministic by sorting dependencies (#165).
- Fixed overlong lines for better readability (#163).
## What's Changed
- Bump fastapi from 0.89.1 to 0.91.0 by @dependabot in #154
- Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by @dependabot in #155
- Compare arrow vs mds vs parquet. by @knighton in #160
- Improve serialization format comparison. by @knighton in #161
- WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by @knighton in #143
- Update download badge link to pepy by @karan6181 in #162
- CloudWriter interface: local=, remote=, keep=. by @knighton in #148
- Fix overlong lines. by @knighton in #163
- Make setup.py deterministic by sorting dependencies. by @nharada1 in #165
- Bump pydantic from 1.10.4 to 1.10.5 by @dependabot in #166
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #167
- Bump fastapi from 0.91.0 to 0.92.0 by @dependabot in #168
- Adjust StreamingDataset arguments by @knighton in #170
- add 2x faster shuffle algorithm; add shuffle bench/plot by @knighton in #137
- Docstring fix by @knighton in #173
- Add a supported cloud providers documentation by @karan6181 in #169
- Add callout fence to Configure Cloud Storage Credentials guide by @karan6181 in #174
- Fix broken links in the README by @knighton in #177
- Download the shard files as tmp extension until it finishes for OCI by @karan6181 in #178
- Add a support of uploading shard files to a cloud as part of Writer by @karan6181 in #171
- Refactor partitioning to be much faster. by @knighton in #179
- Bump version to 0.3.0 by @karan6181 in #180
**Full Changelog**: v0.2.5...v0.3.0