generate_vocab keeps on running for a few HPLT datasets #13

Open
bhavitvyamalik opened this issue Mar 11, 2024 · 0 comments
Labels
bug Something isn't working need investigation Unknown scope

Comments

@bhavitvyamalik

The generate_vocab step never finishes for some datasets. Interestingly, this happens mostly with HPLT datasets. For larger datasets we sub-sample 10M sentences to generate the vocab. HPLT datasets don't have that many sentences, so we end up using all of them to generate the vocabulary.
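
For reference, the sub-sampling behaviour described above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the helper name `sample_sentences`, the `cap` and `seed` parameters, and the choice of reservoir sampling are all assumptions. The point it shows is that when the input has fewer sentences than the cap, every sentence is kept.

```python
import random
from typing import Iterable, List


def sample_sentences(lines: Iterable[str], cap: int, seed: int = 0) -> List[str]:
    """Return at most `cap` lines chosen uniformly at random (reservoir sampling).

    Hypothetical helper for illustration only; not taken from the pipeline.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < cap:
            # Fill the reservoir until it holds `cap` lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability cap / (i + 1),
            # overwriting a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < cap:
                reservoir[j] = line
    return reservoir
```

With a corpus smaller than `cap` the loop never reaches the replacement branch, so the output is simply the full input in its original order.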

This is a dataset-related issue, but I feel our pipeline should be robust enough to handle such problems.
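
One possible way to make a step fail loudly instead of running forever is a wall-clock timeout. The sketch below is an assumption about how that could look, not something the pipeline does today: `run_step` and `timeout_s` are hypothetical names, and the command passed in stands in for whatever actually launches generate_vocab. It relies only on the standard library's `subprocess.run`, which kills the child and raises `TimeoutExpired` when the timeout elapses.

```python
import subprocess
import sys
from typing import List


def run_step(cmd: List[str], timeout_s: float) -> bool:
    """Run one pipeline step; return False if it fails or exceeds `timeout_s`.

    Hypothetical wrapper for illustration only.
    """
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"step timed out after {timeout_s}s: {cmd}", file=sys.stderr)
        return False
    except subprocess.CalledProcessError as err:
        print(f"step failed with exit code {err.returncode}: {cmd}", file=sys.stderr)
        return False
    return True
```

A caller could then skip or flag the affected dataset when `run_step` returns False, rather than blocking the rest of the run.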

@rggdmonk rggdmonk added bug Something isn't working need investigation Unknown scope labels Jul 10, 2024