How to use bucketing for a dataset in a multi-host setting #930

Answered by marcvanzee
marcvanzee asked this question in Q&A

Some notes from @levskaya (slightly reworded by me):

  • If you are running on TPU, use packing instead of bucketing, because it performs better. Our WMT example does this by default in input_pipeline.py.
  • You have to be very careful when using bucketing with multi-host pmap. At every step, all hosts must receive the same bucket size; if they ever receive buckets of different sizes, you'll get a deadlock (JAX is working on improving this aspect of multi-host pmap).
  • If you need multi-host bucketing, I'd recommend creating one dataset per bucket (with the right length filters to keep the buckets disjoint, etc.) and using a seed shared across hosts to randomly pick which bucket size to draw from at each step, while each bucket is still shuffled internally.
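
The last suggestion could be sketched roughly as follows, in plain Python with no JAX or tf.data dependency. This is only an illustration of the idea, not code from the Flax examples: `BUCKET_LENGTHS`, `split_into_buckets`, `host_batches`, and all of their parameters are hypothetical names, examples are assumed to be lists of token ids, and each host is assumed to hold at least one example in every bucket. Sampling with replacement stands in for a real shuffled dataset.

```python
import random

# Hypothetical bucket boundaries: the maximum sequence length of each bucket.
BUCKET_LENGTHS = [8, 16, 32]


def split_into_buckets(examples, bucket_lengths):
    """Assign each example to the smallest bucket that fits it, so buckets are disjoint."""
    buckets = {length: [] for length in bucket_lengths}
    for tokens in examples:
        for length in sorted(bucket_lengths):
            if len(tokens) <= length:
                buckets[length].append(tokens)
                break  # examples longer than the largest bucket are dropped
    return buckets


def host_batches(examples, host_id, shared_seed, batch_size, num_steps,
                 bucket_lengths=BUCKET_LENGTHS, pad_id=0):
    """Yield padded batches for one host; the bucket chosen at each step is the same on all hosts."""
    buckets = split_into_buckets(examples, bucket_lengths)
    if any(not bucket for bucket in buckets.values()):
        raise ValueError("every bucket must be non-empty on every host")

    # Seeded identically on every host, so all hosts draw the same bucket sequence.
    schedule_rng = random.Random(shared_seed)
    # Seeded differently per host, so each host draws its own examples within a bucket.
    shuffle_rng = random.Random(f"{shared_seed}/{host_id}")

    for _ in range(num_steps):
        length = schedule_rng.choice(bucket_lengths)
        # Sampling with replacement keeps the sketch short; a real pipeline
        # would iterate over a shuffled per-bucket dataset instead.
        batch = shuffle_rng.choices(buckets[length], k=batch_size)
        yield [tokens + [pad_id] * (length - len(tokens)) for tokens in batch]
```

Because only `schedule_rng` decides the bucket, two hosts with different local data still produce batches of the same shape at every step, which is the property multi-host pmap needs. The per-host data differs, but the sequence of shapes does not.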

Answer selected by marcvanzee

This discussion was converted from issue #894 on January 22, 2021 13:39.