# 🚀 Streaming v0.3.0

Streaming v0.3.0 is released! Install via pip:

```bash
pip install --upgrade mosaicml-streaming==0.3.0
```
## New Features
### ☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to `MDSWriter`. Track the progress of individual uploads with `progress_bar=True`, and tune the number of background upload workers with `max_workers=4`.
Users can upload output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by providing the `out` parameter of the `Writer` class as a cloud provider bucket location. Below is an example that uploads output files to an AWS S3 bucket:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```
Users can keep output shard files local by providing a local directory path as `out`. For example:

```python
output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```
Users can see the progress of cloud uploads by setting `progress_bar=True` on the `Writer`. For example:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        pass
```
Users can control the number of background upload threads via the `max_workers` parameter of the `Writer`, which is responsible for uploading the shard files to a remote location if one is provided. One thread handles one file upload. For example, with `max_workers=4`, at most 4 threads are active at a time, each uploading one shard file.

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        pass
```
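This one-thread-per-file model maps naturally onto a thread pool. A minimal, self-contained sketch of the pattern (not Streaming's actual implementation; `upload` is a hypothetical stand-in for a real cloud upload call):

```python
from concurrent.futures import ThreadPoolExecutor

def upload(shard: str) -> str:
    # Placeholder for a real cloud upload call; one thread handles one file.
    return f'uploaded {shard}'

shards = [f'shard.{i:05}.mds' for i in range(8)]

# At most 4 uploads run concurrently, one thread per shard file.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload, shards))

print(results[0])  # uploaded shard.00000.mds
```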
### 🔀 2x faster shuffling

We've added a new shuffling algorithm, `py1s`, which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding `shuffle_algo` (old behavior: `py2s`). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.
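A minimal usage sketch (assuming a dataset already written to a local directory; this is not runnable without one, and the exact constructor arguments beyond `shuffle_algo` are illustrative):

```python
from streaming import StreamingDataset

# Select the new, faster algorithm explicitly; pass 'py2s' to restore
# the old shuffling behavior.
dataset = StreamingDataset(local='/tmp/mds', shuffle=True, shuffle_algo='py1s')
```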
### 📨 2x faster partitioning

We've also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers; it runs about twice as fast on typical workloads while giving identical results. This is exposed as the `partition_algo` argument to `StreamingDataset`. You will experience this as faster starts and resumption for large datasets.
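A hedged sketch of overriding the partitioning algorithm (the value `'orig'` is an assumption here, not confirmed by these notes; check this release's `StreamingDataset` docstring for the accepted names):

```python
from streaming import StreamingDataset

# 'orig' is an illustrative algorithm name; consult the StreamingDataset
# docstring for the values this release actually accepts.
dataset = StreamingDataset(local='/tmp/mds', partition_algo='orig')
```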
### 🔗 Extensible downloads

We provide examples of modifying `StreamingDataset` to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.
## API changes

- The `Writer` class and its derived classes (`MDSWriter`, `XSVWriter`, `TSVWriter`, `CSVWriter`, and `JSONWriter`) have had their `dirname` parameter renamed to `out`, with the following advanced functionality:
  - If `out` is a local directory, shard files are saved locally. For example, `out='/tmp/mds/'`.
  - If `out` is a remote directory, a local temporary directory is created to cache the shard files, and the shard files are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, `out='s3://bucket/dir/path'`.
  - If `out` is a tuple of `(local_dir, remote_dir)`, shard files are saved in `local_dir` and also uploaded to the remote location. For example, `out=('/tmp/mds/', 's3://bucket/dir/path')`.
- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of `Writer` and its subclasses (like `MDSWriter`) and `StreamingDataset` to require kwargs.
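As a sketch of what the kwargs requirement means in practice (the `columns` spec here is illustrative, not from these notes):

```python
from streaming import MDSWriter

columns = {'text': 'str'}  # illustrative column spec

# A positional call such as MDSWriter('/tmp/mds', columns) now raises a
# TypeError; arguments must be passed by keyword:
with MDSWriter(out='/tmp/mds', columns=columns) as writer:
    pass
```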
## Bug Fixes
- Fixed a broken blog post link and the community email link in the README (#177).
- Download shard files with a `.tmp` extension until the download finishes, for OCI blob storage (#178).
- Added supported cloud providers documentation (#169). Streaming Dataset supports the Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) doc on how to configure cloud storage credentials.
- Made setup.py deterministic by sorting dependencies (#165).
- Fixed overlong lines for better readability (#163).
## What's Changed
- Bump fastapi from 0.89.1 to 0.91.0 by @dependabot in #154
- Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by @dependabot in #155
- Compare arrow vs mds vs parquet. by @knighton in #160
- Improve serialization format comparison. by @knighton in #161
- WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by @knighton in #143
- Update download badge link to pepy by @karan6181 in #162
- CloudWriter interface: local=, remote=, keep=. by @knighton in #148
- Fix overlong lines. by @knighton in #163
- Make setup.py deterministic by sorting dependencies. by @nharada1 in #165
- Bump pydantic from 1.10.4 to 1.10.5 by @dependabot in #166
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #167
- Bump fastapi from 0.91.0 to 0.92.0 by @dependabot in #168
- Adjust StreamingDataset arguments by @knighton in #170
- add 2x faster shuffle algorithm; add shuffle bench/plot by @knighton in #137
- Docstring fix by @knighton in #173
- Add a supported cloud providers documentation by @karan6181 in #169
- Add callout fence to Configure Cloud Storage Credentials guide by @karan6181 in #174
- Fix broken links in the README by @knighton in #177
- Download the shard files as tmp extension until it finishes for OCI by @karan6181 in #178
- Add a support of uploading shard files to a cloud as part of Writer by @karan6181 in #171
- Refactor partitioning to be much faster. by @knighton in #179
- Bump version to 0.3.0 by @karan6181 in #180
**Full Changelog**: v0.2.5...v0.3.0