18 Oct 21:28

8827d7a

v0.6.1

🚀 Streaming v0.6.1

Streaming v0.6.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.1

💎 New Features

🚃 Merge meta-data information from sub-directories dataset to form one unified dataset. (#449)

Addition of the merge_index() utility method to merge subdirectories index files from an MDS dataset. The subdirectories can be local or any supported cloud provider URL path.
Checkout dataset conversion and Spark Dataframe to MDS jupyter notebook for an example in action.

🔁 Retry uploading a file to a cloud provider path. (#448)

Added upload retry logic with backoff and jitter during dataset conversion as part of parameter retry in Writer.

from streaming import MDSWriter

with MDSWriter(
               ...,
               retry=3) as out:
    for sample in dataset:
        out.write(sample)

🐛 Bug Fixes

Validate Writer arguments and raise a ValueError exception if argument(s) is/are invalid. (#434)
Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (#448)

🔧 Improvements

More balancing inter-node downloading for the py1e shuffling algorithm by varying shard sample ranges, helping to reduce throughput drops at scale. (#442)

What's Changed

Validate writer arguments by @karan6181 in #434
Bump pytest from 7.4.1 to 7.4.2 by @dependabot in #428
Bump gitpython from 3.1.34 to 3.1.36 by @dependabot in #435
Fix stylistic issues (mostly 100col, docstring conventions) by @knighton in #439
Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in #436
py1e randomized by @snarayan21 in #442
Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in #446
Fix BatchFeature of Transformers not handled by StreamingDataloader by @Hubert-Bonisseur in #450
Add a retry logic with backoff and jitter by @karan6181 in #448
Fix broken bibtext by @Skylion007 in #452
Update integration test to include sample order comparison by @karan6181 in #456
Bump pydantic from 2.3.0 to 2.4.2 by @dependabot in #455
Update MCLI credential page for Databricks by @karan6181 in #466
Add merge index file utility by @XiaohanZhangCMU in #449
Add py1e warning when Shuffle block size is smaller than shard size by @snarayan21 in #463
Fix doc strings by @XiaohanZhangCMU in #469
Bump fastapi from 0.103.1 to 0.103.2 by @dependabot in #454
Maintain order for merge_index_from_list by @XiaohanZhangCMU in #472
Fixed codeql out of disk space issue by @karan6181 in #473
Bump version to 0.6.1 by @karan6181 in #474

New Contributors

@Hubert-Bonisseur made their first contribution in #450

Full Changelog: v0.6.0...v0.6.1

Contributors

Skylion007, knighton, and 5 other contributors

Assets 2

13 Sep 20:11

XiaohanZhangCMU

v0.6.0

65ac4ca

v0.6.0

🚀 Streaming v0.6.0

Streaming v0.6.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.6.0

New Features

🆕 Databricks File System and Databricks Unity Catalog (#362)

Support for reading and writing data from and to the Databricks File System (DBFS) and Unity Catalog (UC) Volumes. This means that you can now use DBFS and UC Volumes as a source or sink for your streaming data pipelines or model training. Below is the path structure:

Databricks File System (DBFS)

DBFS path structure is a hierarchical namespace that is organized into directories and files. The DBFS prefix must starts with dbfs:/.

UC Volumes

The path structure for UC Volumes is similar to the path structure for DBFS, but with a few key differences.

The root of the UC Volumes namespace is dbfs:/Volumes/<catalog>/<schema>/<volume>, where:

<catalog> is the name of the catalog where the volume is created.
<schema> is the name of the schema where the volume is created.
<volume> is the name of the volume.

Hence, use a dbfs://Volumes prefix to specify a UC Volumes path.

💽 Spark Dataframe to MDS convertor (#363)

Introducing the new DataFrameToMDS API, empowering users to effortlessly leverage Spark's capabilities for handling diverse datasets in various formats. This API enables seamless conversion of Spark DataFrames into MDS datasets, with the flexibility to specify output locations to both local and cloud storage. Index files are optionally merged. Additionally, users can add data preprocessing steps by defining custom iterator functions and arguments. All these features are seamlessly bundled into a single Spark job, ensuring an efficient and streamlined workflow for data transformation. An example notebook is provided to help users get started.

🔀 Randomize and offset shuffle blocks algorithm (#373)

The new py1br shuffle algorithm helps mitigate download spikes that occur when using the py1b algorithm. With py1b, shuffle blocks are all the same size, so when progressing through training, nodes will have to download many shards at the same time. In contrast, with py1br, shuffle blocks are offset from each other and are variably sized. This results in more balanced downloads over time. The py1br algorithm is a replacement for the py1b algorithm, which will be deprecated soon.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1br',
    ...
)

🔀 Expanded range shuffle algorithm (#394)

The new py1e shuffle algorithm helps reduce the minimum cache limit needed for training, and results in much smoother downloads than both py1br and py1e. However, its shuffle quality is slightly lower. Rather than shuffling all samples in blocks of size shuffle_block_size, it instead spreads the samples of each shard over a range of maximum size shuffle_block_size, retaining most of the shuffle quality from py1b and py1br while reducing download spikes across the duration of training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1e',
    ...
)

🔥 Per-Stream Batching (#407)

Users are now able to ensure that each batch comes has samples from only a single stream. You can now set the new parameter batching_method to per_stream to access this functionality. Per-stream batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose. To make batches contain only samples from a group of streams, merge streams’ index.json files to create a single one for each group.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='per_stream',
    ...
)

🔥 Stratified Batching (#408)

Users are now able to ensure that each batch has a consistent number of samples from every stream. Previously, stream proportions were satisfied in the aggregate but not at the batch level. You can now set the new parameter batching_method to stratified to access this functionality. Stratified batching will still take into account upsampling and downsampling of streams, set by proportion, repeat, or choose.

from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='stratified',
    ...
)

💪 Download-Efficient Sparse Sampling (#391)

Previous versions of StreamingDataset implement downsampling/upsampling by giving each sample equal probability of being selected (plus or minus one due when sampling is fractional), without regard to what shard a sample is on. This means that no matter how small your desired downsampling is, StreamingDataset will still use each shard at as equal a rate as possible. This is problematic for downloading performance.

In this version of Streaming, we have added a new optional StreamingDataset argument sampling_granularity which can be used to configure how sampling is done. It is an integer, defaulting to 1, that determines how many samples are to be drawn at a time from a single random shard until we have enough samples.

Note that the default setting of 1 is equivalent to the old non-shard-aware behavior. Setting it high, e.g. the number of samples in a full shard or more, means it will draw all the samples in a randomly chosen (without replacement) shard until it has enough samples, which is much more download-effiicient but results in the samples of each shard always being seen close together in training, which may have implications to convergence depending on your workload. Setting sampling granularity to half a shard means, roughly speaking, you'll see half the samples of a shard at a time during training.

from streaming import StreamingDataset

dataset = StreamingDataset(
    sampling_granularity=1,
    ...
)

📑 Reusable local directory (#406)

Users can now instantiate more than one StreamingDataset with same local directory and remote=None. This would be useful if there is a high-speed storage mounted on a node and multiple folks are trying to read the dataset directly from mount storage on the same node without having to copy the data on local disk.

from streaming import StreamingDataset

local = '<local disk directory or a mount point directory>'
dataset_0 = StreamingDataset(local=local, remote=None)
dataset_1 = StreamingDataset(local=local, remote=None)

🐛 Bug Fixes

Terminate the worker threads when process terminates to avoid deadlock. (#425)
Raise an exception if cache_limit is lower than the size of a single shard file to avoid deadlock. (#420)
Fixed predownload value to zero issue where users can now provide predownload=0 in StreamingDataset. (#383)

🔧 Improvements

Add google Application Default Credentials (#376).
- The order of authentication has changed and added a new App Engine or Compute Engine authentication channel if these are available. The order of authentication is as follows:
  1. HMAC
  2. Google service account
  3. App Engine
  4. Compute Engine
  5. Raise an error
Check if index.json exists locally before downloading to avoid duplicate downloads (#372).

What's Changed

Bump fastapi from 0.100.0 to 0.101.0 by @dependabot in #367
Bump uvicorn from 0.23.1 to 0.23.2 by @dependabot in #368
Check if index.json exists locally before downloading by @karan6181 in #372
Bench/plot sample access times across data and across formats by @knighton in #365
Apply ruff pre-commit hook by @Skylion007 in #364
Add a regression test for shuffling sample order by @b-chu in #359
Epoch size default behavior by @snarayan21 in #374
Stream unspecified docstring change by @snarayan21 in #377
fixed comments by @snarayan21 in #378
Add google Application Default Credentials to download by @fgerzer in #376
Fixed fake AWS credentials by @karan6181 in #382
Fixed predownload value to zero issue by @karan6181 in #383
Bump fastapi from 0.101.0 to 0.101.1 by @dependabot in #387
Bump pydantic from 2.1.1 to 2.2.1 by @dependabot in #389
Add a regression test for mixing of different dataset streams by @b-chu in #375
Add support for Databricks File System backend by @maddiedawson in #362
Add support for downloading from Unity Catalog volumes by @maddiedawson in #361
Fix MosaicML platform credential setup links by @karan6181 in #396
Plug hole in MDS type system: add arbitrary-precision decimal by @knighton in #390
Bump fastapi from 0.101.1 to 0.103.0 by @dependabot in #402
Bump pydantic from 2.2.1 to 2.3.0 by @dependabot in #403
Bump databricks-sdk from 0.3.1 to 0.6.0 by @dependabot in #404
Py1br algorithm implementation by @snarayan21 in #373...

Contributors

acutkosky, Skylion007, and 9 other contributors

Assets 2

19 Jun 05:58

karan6181

v0.5.2

a301cd0

v0.5.2

🚀 Streaming v0.5.2

Streaming v0.5.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.2

New features

Allow authentication with GCS for service accounts #315
human-readable suffixes for size_limit and epoch_size #333
static sampling #348

Documentation changes

Update contribution guide and improved unittest logic #343
static sampling #348

Testing

Add a regression test for StreamingDataset instantiation and iteration #318
Fixed accidental shard delete test #341
Add a regression test for StreamingDataset using cloud providers #319
Add iteration time test as part of regression testing #358

Bug fix

Fix init local dir zip-only shard handling #330
added default behavior if no streams and epoch_size specified #348

What's Changed

Bump myst-parser from 1.0.0 to 2.0.0 by @dependabot in #309
Added files to support azure datalake storage by @shivshandilya in #311
Add secrets check as part of pre-commit by @karan6181 in #312
Bump pytest from 7.3.2 to 7.4.0 by @dependabot in #313
Bump fastapi from 0.97.0 to 0.98.0 by @dependabot in #314
Add GCS authentication for service accounts by @b-chu in #315
Bump fastapi from 0.98.0 to 0.100.0 by @dependabot in #322
Bump uvicorn from 0.22.0 to 0.23.0 by @dependabot in #327
Bump gitpython from 3.1.31 to 3.1.32 by @dependabot in #329
Bump pydantic from 1.10.9 to 1.10.11 by @dependabot in #328
Sync tmp directory by @b-chu in #316
Add a regression test for StreamingDataset instantiation and iteration by @b-chu in #318
human-readable suffixes for size_limit and epoch_size by @snarayan21 in #333
Updated pre commit packages by @snarayan21 in #340
Fix init local dir zip-only shard handling by @knighton in #330
Fixed accidental shard delete test by @karan6181 in #341
Bump uvicorn from 0.23.0 to 0.23.1 by @dependabot in #338
Download the index.json file as tmp extension until it finishes by @karan6181 in #346
Update contribution guide and improved unittest logic by @karan6181 in #343
Bump fastapi from 0.100.0 to 0.100.1 by @dependabot in #351
Bump uvicorn from 0.23.1 to 0.23.2 by @dependabot in #352
Bump furo from 2023.5.20 to 2023.7.26 by @dependabot in #354
Bump pydantic from 1.10.11 to 2.1.1 by @dependabot in #353
added default behavior if no streams and epoch_size specified by @snarayan21 in #348
Add a regression test for StreamingDataset using cloud providers by @b-chu in #319
Fixed sampling by @snarayan21 in #356
mds ndarray int conversion by @snarayan21 in #357
Add iteration time test as part of regression testing by @karan6181 in #358
Bump pydantic from 1.10.11 to 2.1.1 by @dependabot in #366
Fixed CI test to perform proper directory cleanup by @karan6181 in #369
version bump to 0.5.2 by @snarayan21 in #370

New Contributors

@shivshandilya made their first contribution in #311
@b-chu made their first contribution in #315
@snarayan21 made their first contribution in #333

Full Changelog: v0.5.1...v0.5.2

Contributors

knighton, karan6181, and 4 other contributors

Assets 2

08 Aug 18:59

snarayan21

v0.5.1

ac53002

v0.5.1

What's Changed

Improved shard eviction test execution time by @karan6181 in #291
Bump fastapi from 0.96.0 to 0.97.0 by @dependabot in #294
Bump pytest from 7.3.1 to 7.3.2 by @dependabot in #295
Bump pydantic from 1.10.8 to 1.10.9 by @dependabot in #296
Terminate the main process if thread died unexpectedly by @karan6181 in #297
Improved existing exception and exception messages by @karan6181 in #298
Round drop_first to be divisible by num_physical_nodes. by @knighton in #301
Added a utility method to clean stale shared memory by @karan6181 in #299
Propagate exception between threads and processes and improved error message by @karan6181 in #304
Fix LocalDataset (property size for fancy getitem). by @knighton in #305
Natively support encoding and decoding ndarrays in MDS by @knighton in #82
Bump version to 0.5.1 by @karan6181 in #308

Full Changelog: v0.5.0...v0.5.1

Contributors

knighton, karan6181, and dependabot

Assets 2

06 Jun 13:31

karan6181

v0.5.0

8e16aa9

v0.5.0

🚀 Streaming v0.5.0

Streaming v0.5.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.0

New Features

🆕 Cold Shard Eviction. ( #219 )

Dynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument cache_limit. See the shuffling guide for more details.

from streaming import StreamingDataset

dataset = StreamingDataset(
    cache_limit='100gb',
    ...
)

🤙 Fetch sample using NumPy style indexing. ( #120 )

Users can now randomly access samples using NumPy-style indexing with StreamingDataset. For example,

import numpy as np
from streaming import StreamingDataset

dataset = StreamingDataset(local=local, remote=remote)

dataset[0]  # Fetch sample 0
dataset[-1]  # Fetch last sample
dataset[[10, 20]]  # Fetch sample 10 and 20
dataset[slice(1, 10, 2)]  # Fetch sample 1, 3, 5, 7, and 9
dataset[5:0:-1]  # Fetch sample 5, 4, 3, 2, 1
dataset[np.array([4, 7])]  # Fetch sample 4 and 7

🦾 Any S3 compatible object store. ( #265 )

Support of any S3 compatible object stores, meaning, an object store which uses the S3 API to communicate with any connected device or system. Some of the S3 compatible object stores are Cloudflare R2, Coreweave, Backblaze b2, etc. User needs to provide an environment variable S3_ENDPOINT_URL based on the object store that you are using. Details on how to configure credentials can be found here.

🦾 Azure cloud blob storage. ( #256 )

Support of Azure cloud blob storage. Details on how to configure credentials can be found here.

Bug Fixes

Wait for download and ready thread to finish before terminating job. ( #286 )
Fixed length calculation to use resampled epoch size, not underlying num samples. ( #278 )
Fixed mypy errors by adding a py.typed marker file. ( #245 )
Create a new boto3 session per thread to avoid sharing resources. ( #241 )

🔧 API changes

The argument samples_per_epoch has been renamed to epoch_size in StreamingDatasetto better distinguish the actual number of underlying samples as serialized and the number of observed samples when iterating (which may be different due to weighting sub-datasets).
The argument samples has been renamed to choose in Stream to better distinguish the underlying sample vs resampled data.
The argument keep_raw has been removed in StreamingDataset in the process of finalizing the design for shard eviction (see the newly-added cache_limit parameter).
The default value of predownload in StreamingDataset was updated; it is now derived using batch size and number of canonical nodes instead of previous constant value of 100_000. This is to prevent predownloaded shards from getting evicted before ever being used.
The default value of num_canonical_nodes in StreamingDataset was updated to 64 times the number of nodes of the initial run instead of number of nodes of the initial run to increase data source diversity and improve convergence.
The default value of shuffle_algo in StreamingDataset was changed from py1b to py1s as it requires less shards to be downloaded during iteration. More details about different shuffling algorithms can be found here.

What's Changed

Redesign shard index by @knighton in #236
Propagate an exception raise by a thread to its caller by @karan6181 in #241
Raise descriptive error message when index.json is corrupted by @karan6181 in #242
Rename "samples" to "choose" (distinguish underlying vs resampled) by @knighton in #243
Added py.typed to indicate that the repository has typing annotations by @karan6181 in #245
Add "Array" base class, which provides numpy-style indexing. by @knighton in #120
Better organize code by @knighton in #246
Update readthedocs python version to 3.9 by @karan6181 in #249
Create a new boto3 session per thread by @karan6181 in #251
Bump uvicorn from 0.21.1 to 0.22.0 by @dependabot in #253
Add support for Cloudflare R2 cloud storage by @hlky in #255
Fix typo in documentation's conversion pile.py link by @ouhenio in #259
Add support for Azure cloud storage by @hlky in #256
Fix slack link in readme by @growlix in #262
Bugfix in user_guide.md sample code by @tginart in #263
Add Stream usage example to README by @hanlint in #266
Update Stream documentation by @karan6181 in #267
Update README.md - slack by @ejyuen in #273
Bump fastapi from 0.95.1 to 0.95.2 by @dependabot in #269
Cold shard eviction by @knighton in #219
Update slack link with a URL shortener by @karan6181 in #274
Bump pydantic from 1.10.7 to 1.10.8 by @dependabot in #276
Bump yamllint from 1.31.0 to 1.32.0 by @dependabot in #277
Fix SD length calculation when resampling by @knighton in #278
Fixed performance degradation when not doing shard eviction by @karan6181 in #279
Derived predownload value using batch size and NCN by @karan6181 in #280
Support any S3-compatible object store (R2, Coreweave, Backblaze, etc.) by @abhi-mosaic in #265
Update docs pypi package and Improved documentation by @karan6181 in #281
Change the default number of canonical nodes by @karan6181 in #282
Set predownload value correctly for all usecase by @karan6181 in #283
Add documentation for MDSWriter, conversion scripts, and supported format by @karan6181 in #232
Ensure int64 by @knighton in #284
Wait for thread job to finish and Fixed filelock directory structure by @karan6181 in #286
Bump fastapi from 0.95.2 to 0.96.0 by @dependabot in #287
Bump version to 0.5.0 by @karan6181 in #289
Remove github action workflow concurrency check by @karan6181 in #290

New Contributors

@hlky made their first contribution in #255
@ouhenio made their first contribution in #259
@growlix made their first contribution in #262
@tginart made their first contribution in #263
@hanlint made their first contribution in #266
@abhi-mosaic made their first contribution in #265

Full Changelog: v0.4.1...v0.5.0

Contributors

knighton, ejyuen, and 8 other contributors

Assets 2

25 Apr 14:01

karan6181

v0.4.1

552994a

v0.4.1

🚀 Streaming v0.4.1

Streaming v0.4.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.4.1

New Feature

Support of Torch 2.0. (#234)
Addition of two new sample shuffling algorithm. (#223)
Support of AWS S3 requester payers bucket permission for streaming. (#231)

Documentation

Added a streaming installation guide and a streaming environment guide. (#221)
Added a instruction guide for converting a multimodal dataset into a MDS format. (#220)
Streaming documentation now support Algolia search. (#224)

What's Changed

Refactor StreamingDataset shared memory prefix setup by @knighton in #218
Bump pytest from 7.2.2 to 7.3.0 by @dependabot in #222
Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). by @knighton in #223
Add installation and environments documentation by @karan6181 in #221
Add a readme for multimodal convert script modal type by @karan6181 in #220
Bump sphinx-copybutton from 0.5.1 to 0.5.2 by @dependabot in #229
Bump pytest from 7.3.0 to 7.3.1 by @dependabot in #230
Bump sphinxext-opengraph from 0.8.1 to 0.8.2 by @dependabot in #228
Bump fastapi from 0.95.0 to 0.95.1 by @dependabot in #227
Virtually split the repeats of repeated shards by @knighton in #226
Switch documentation search to use Algolia by @bandish-shah in #224
Add a requester pays bucket permission args to boto3 for s3 download file by @karan6181 in #231
Bump yamllint from 1.30.0 to 1.31.0 by @dependabot in #233
Support of torch 2.0 by @karan6181 in #234
Removed pushing auto release branch due to GH action permission by @karan6181 in #235
Fixed local directory check by @karan6181 in #238
Skip distributed all_gather test since CI non-deterministically hangs by @karan6181 in #240
Bump version to 0.4.1 by @karan6181 in #239

Full Changelog: v0.4.0...v0.4.1

Contributors

knighton, karan6181, and 2 other contributors

Assets 2

31 Mar 02:22

karan6181

v0.4.0

d428ed8

v0.4.0

🚀 Streaming v0.4.0

Streaming v0.4.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.4.0

New Feature

🔀 Dataset Mixing

Weighted mixing of sub-datasets on the fly during model training (#184). StreamingDataset now support an optional streams parameter which takes one or more sub-datasets and it intelligently fetches samples across sub-datasets. You can mix (upsample or downsample) datasets by defining each either relatively (proportion) or absolutely (repeat or samples or none of them to sample 1:1).

Documentation

Added a README which shows how to convert a raw dataset into an MDS format for Text and Vision dataset. (#183)

Bug Fixes

Raise an exception if the cloud storage bucket does not exist during shard file upload. (#212)
Remove unsupported ThreadPoolExecutor shutdown param for python38. (#199)

What's Changed

Update GCS cloud storage credential document by @karan6181 in #181
Update API reference doc to be compatible with sphinx by @karan6181 in #182
Add a readme for text and vision convert script modal type by @karan6181 in #183
Fix docstrings by @knighton in #185
Synchronize before destroying process group by @coryMosaicML in #186
Bump pytest from 7.2.1 to 7.2.2 by @dependabot in #187
Bump pypandoc from 1.10 to 1.11 by @dependabot in #188
White-box weighted mixing of streaming datasets by @knighton in #184
Organize partitioning code by @knighton in #190
Bump pydantic from 1.10.5 to 1.10.6 by @dependabot in #194
Bump uvicorn from 0.20.0 to 0.21.0 by @dependabot in #196
Bump fastapi from 0.92.0 to 0.94.0 by @dependabot in #198
Remove unsupported ThreadPoolExecutor shutdown param in python38 by @karan6181 in #199
Fix doctstrings (maybe?) by @Landanjs in #200
Demo: crawling, converting, and iterating weighted dataset subsets by @knighton in #191
Update WebVid README.md by @knighton in #202
Fix leftover test dirs and improve dataset method and variable names by @knighton in #201
Bump fastapi from 0.94.0 to 0.95.0 by @dependabot in #205
Bump uvicorn from 0.21.0 to 0.21.1 by @dependabot in #206
Raise an exception if bucket does not exist during upload by @karan6181 in #212
Bump yamllint from 1.29.0 to 1.30.0 by @dependabot in #209
Bump pydantic from 1.10.6 to 1.10.7 by @dependabot in #211
Register atexit handler for resource cleanup by @karan6181 in #215
Bump version to 0.4.0 by @karan6181 in #216

New Contributors

@coryMosaicML made their first contribution in #186
@Landanjs made their first contribution in #200

Full Changelog: v0.3.0...v0.4.0

Contributors

knighton, karan6181, and 3 other contributors

Assets 2

01 Mar 08:30

karan6181

v0.3.0

11d0944

v0.3.0

🚀 Streaming v0.3.0

Streaming v0.3.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.3.0

New Features

☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to MDSWriter. Track the progress of individual uploads with progress_bar=True, and tune background upload workers with max_workers=4.

User can choose to upload a output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by providing a out parameter as a cloud provider bucket location as part of Writer class. Below is the example to upload output files to AWS S3 bucket

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass

User can choose to keep a output shard files locally by providing a local directory path as part of Writer. For example,

output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass

User can see the progress of the cloud upload file by setting progress_bar=True as part of Writer. For example,

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        pass

User can control the number of background upload threads via parameter max_workers as part of Writer who is responsible for uploading the shard files to a remote location if provided. One thread is responsible for one file upload. For example, if max_workers=4, maximum 4 threads would be active at a same time uploading one shard file each.

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        pass

🔀 2x faster shuffling

We’ve added a new shuffling algorithm py1s which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding shuffle_algo (old behavior: py2s). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.

📨 2x faster partitioning

We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the partition_algo argument to StreamingDataset. You will experience this as faster start and resumption for large datasets.

🔗 Extensible downloads

We provide examples of modifying StreamingDataset to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.

API changes

Class Writer and its derived classes (MDSWriter, XSVWriter, TSVWriter, CSVWriter, and JSONWriter) parameter has been changed from dirname to out with the following advanced functionalities:
- If out is a local directory, shard files are saved locally. For example, out=/tmp/mds/.
- If out is a remote directory, a local temporary directory is created to cache the shard files and then the shard files are uploaded to a remote location. At the end, the temp directory is deleted once shards are uploaded. For example, out=s3://bucket/dir/path.
- If out is a tuple of (local_dir, remote_dir), shard files are saved in the
  local_dir and also uploaded to a remote location. For example, out=('/tmp/mds/', 's3://bucket/dir/path').
Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of Writer and its subclasses (like MDSWriter) and StreamingDataset to require kwargs.

Bug Fixes

Fix broken blog post link and community email link in the README (#177).
Download the shard files as tmp extension until it finishes for OCI blob storage (#178).
Supported cloud providers documentation (#169).
- Streaming Dataset support Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html)
  doc on how to configure cloud storage credentials.
Make setup.py deterministic by sorting dependencies (#165).
Fix overlong lines for better readability (#163).

What's Changed

Bump fastapi from 0.89.1 to 0.91.0 by @dependabot in #154
Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by @dependabot in #155
Compare arrow vs mds vs parquet. by @knighton in #160
Improve serialization format comparison. by @knighton in #161
WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by @knighton in #143
Update download badge link to pepy by @karan6181 in #162
CloudWriter interface: local=, remote=, keep=. by @knighton in #148
Fix overlong lines. by @knighton in #163
Make setup.py deterministic by sorting dependencies. by @nharada1 in #165
Bump pydantic from 1.10.4 to 1.10.5 by @dependabot in #166
Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #167
Bump fastapi from 0.91.0 to 0.92.0 by @dependabot in #168
Adjust StreamingDataset arguments by @knighton in #170
add 2x faster shuffle algorithm; add shuffle bench/plot by @knighton in #137
Docstring fix by @knighton in #173
Add a supported cloud providers documentation by @karan6181 in #169
Add callout fence to Configure Cloud Storage Credentials guide by @karan6181 in #174
Fix broken links in the README by @knighton in #177
Download the shard files as tmp extension until it finishes for OCI by @karan6181 in #178
Add a support of uploading shard files to a cloud as part of Writer by @karan6181 in #171
Refactor partitioning to be much faster. by @knighton in #179
Bump version to 0.3.0 by @karan6181 in #180

New Contributors

@nharada1 made their first contribution in #165

Full Changelog: v0.2.5...v0.3.0

Contributors

knighton, nharada1, and 2 other contributors

Assets 2

14 Feb 05:29

karan6181

v0.2.5

4bf9c1c

v0.2.5

🚀 Streaming v0.2.5

Streaming v0.2.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.5

Bug Fixes

Fixed CPU crash (#153)
Update example notebooks (#157)

What's Changed

Update README.md by @knighton in #152
Fix typo by @dakinggg in #156
Fixed CPU crash by @karan6181 in #153
Update example notebooks by @karan6181 in #157
bump version to 0.2.5 by @karan6181 in #158

Full Changelog: v0.2.4...v0.2.5

Contributors

knighton, karan6181, and dakinggg

Assets 2

10 Feb 00:52

knighton

v0.2.4

f392223

v0.2.4

🚀 Streaming v0.2.4

Streaming v0.2.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.4

What's Changed

Fix Lossy JPEG reencoding for MDS format by @JJGO in #142
Add message to size assert & change to KeyError by @samhavens in #146
Synchronize prefix_int across all ranks to resolve hang issue by @karan6181 in #147
Pin setuptools in build requirements by @dakinggg in #136
Graphics. by @knighton in #150
bump version to 0.2.4 by @karan6181 in #151

New Contributors

@JJGO made their first contribution in #142
@samhavens made their first contribution in #146

Full Changelog: v0.2.3...v0.2.4

Contributors

knighton, JJGO, and 3 other contributors

Assets 2

Releases: mosaicml/streaming

v0.6.1

🚀 Streaming v0.6.1

💎 New Features

🚃 Merge meta-data information from sub-directories dataset to form one unified dataset. (#449)

🔁 Retry uploading a file to a cloud provider path. (#448)

🐛 Bug Fixes

🔧 Improvements

What's Changed

New Contributors

Contributors

v0.6.0

🚀 Streaming v0.6.0

New Features

🆕 Databricks File System and Databricks Unity Catalog (#362)

💽 Spark Dataframe to MDS convertor (#363)

🔀 Randomize and offset shuffle blocks algorithm (#373)

🔀 Expanded range shuffle algorithm (#394)

🔥 Per-Stream Batching (#407)

🔥 Stratified Batching (#408)

💪 Download-Efficient Sparse Sampling (#391)

📑 Reusable local directory (#406)

🐛 Bug Fixes

🔧 Improvements

What's Changed

Contributors

v0.5.2

🚀 Streaming v0.5.2

New features

Documentation changes

Testing

Bug fix

What's Changed

New Contributors

Contributors

v0.5.1

What's Changed

Contributors

v0.5.0

🚀 Streaming v0.5.0

New Features

🆕 Cold Shard Eviction. ( #219 )

🤙 Fetch sample using NumPy style indexing. ( #120 )

🦾 Any S3 compatible object store. ( #265 )

🦾 Azure cloud blob storage. ( #256 )

Bug Fixes

🔧 API changes

What's Changed

New Contributors

Contributors

v0.4.1

🚀 Streaming v0.4.1

New Feature

Documentation

What's Changed

Contributors

v0.4.0

🚀 Streaming v0.4.0

New Feature

🔀 Dataset Mixing

Documentation

Bug Fixes

What's Changed

New Contributors

Contributors

v0.3.0

🚀 Streaming v0.3.0

New Features

☁️ Cloud uploading

🔀 2x faster shuffling

📨 2x faster partitioning

🔗 Extensible downloads

API changes

Bug Fixes

What's Changed

New Contributors

Contributors

v0.2.5

🚀 Streaming v0.2.5

Bug Fixes