How to use bucketing for a dataset in a multi-host setting #930

marcvanzee · 2021-01-19T13:52:16Z

Answered by marcvanzee

Some notes from @levskaya (slightly reworded by me):

- If you are running on TPU, use packing instead, because it performs better. Our WMT example uses this by default in `input_pipeline.py`.
- Be very careful when using bucketing with multi-host pmap. Every host must receive the same bucket size at every step. If hosts ever receive buckets of different sizes, you'll get a deadlock (JAX is working on improving this aspect of multi-host pmap).
- If you need multi-host buckets, I'd recommend creating one dataset per bucket (with the right length filters to keep them disjoint, etc.) and using a seed shared across hosts to sample randomly from the bucket sizes, while still shuffling each bucket internally.
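
To make the last point concrete, here is a minimal NumPy-only sketch of the idea, not the Flax WMT pipeline itself. The names `BUCKET_MAX_LENGTHS`, `split_into_buckets`, and `host_batches` are hypothetical, as are the bucket boundaries. It assumes every bucket is non-empty on every host and samples with replacement, so no host can run out of a bucket before the others.

```python
import numpy as np

# Hypothetical bucket boundaries (maximum sequence length per bucket).
BUCKET_MAX_LENGTHS = (8, 16, 32)


def split_into_buckets(lengths, boundaries=BUCKET_MAX_LENGTHS):
    """Assign each example index to exactly one bucket, based on its length."""
    buckets = {b: [] for b in boundaries}
    for idx, n in enumerate(lengths):
        for b in boundaries:
            if n <= b:
                buckets[b].append(idx)
                break  # the smallest boundary that fits keeps buckets disjoint
    # Examples longer than the largest boundary are dropped in this sketch.
    return buckets


def host_batches(lengths, host_id, shared_seed, batch_size, num_steps):
    """Yield (bucket_max_length, example_indices) for one host's data shard."""
    buckets = split_into_buckets(lengths)
    boundaries = sorted(buckets)
    # Same seed on every host: all hosts pick the same bucket at each step.
    bucket_rng = np.random.default_rng(shared_seed)
    # Host-specific seed: examples inside a bucket are drawn independently.
    shuffle_rng = np.random.default_rng([shared_seed, host_id])
    for _ in range(num_steps):
        bucket = int(bucket_rng.choice(boundaries))
        picks = shuffle_rng.choice(buckets[bucket], size=batch_size, replace=True)
        yield bucket, picks.tolist()
```

Each host would call `host_batches` on its own shard with its own `host_id` but the same `shared_seed`, then pad every batch to the yielded bucket length before passing it to the pmapped step. Because the bucket sequence depends only on the shared seed, the batch shapes agree across hosts at every step.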

marcvanzee · 2021-01-22T13:41:54Z
