Duplicate data_files when named <split>/<split>.parquet #6272

Closed
lhoestq opened this issue Oct 1, 2023 · 7 comments · Fixed by #6704
Labels
bug Something isn't working

Comments

@lhoestq
Member

lhoestq commented Oct 1, 2023

e.g. with u23429/stock_1_minute_ticker:

In [1]: from datasets import *

In [2]: b = load_dataset_builder("u23429/stock_1_minute_ticker")
Downloading readme: 100%|██████████████████████████| 627/627 [00:00<00:00, 246kB/s]

In [3]: b.config.data_files
Out[3]: 
{NamedSplit('train'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/train/train.parquet'],
 NamedSplit('validation'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/validation/validation.parquet'],
 NamedSplit('test'): ['hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet',
  'hf://datasets/u23429/stock_1_minute_ticker@65c973cf4ec061f01a363b40da4c1bb128ba4166/test/test.parquet']}

This bug is present in the current datasets 2.14.5 and also on main, even after #6244. cc @mariosasko

lhoestq added the bug label on Oct 1, 2023
@mariosasko
Collaborator

Also reported in #6259

@mariosasko
Collaborator

I think it's best to drop duplicates with a set (as a temporary fix) and improve the patterns when/if fsspec/filesystem_spec#1382 gets merged. @lhoestq Do you have some other ideas?
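
For illustration, a minimal sketch of that temporary fix applied to a resolved data_files dict (deduplicate_data_files is a hypothetical helper, not the actual datasets code); dict.fromkeys is used rather than a plain set so the original file order is preserved:

def deduplicate_data_files(data_files: dict) -> dict:
    # drop duplicate paths per split while keeping the original resolution order
    # (a plain set() would not guarantee the order of the resolved files)
    return {split: list(dict.fromkeys(files)) for split, files in data_files.items()}

Applied to the data_files dict from the report above, each split would go back to a single train/train.parquet, validation/validation.parquet and test/test.parquet entry.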

@lhoestq
Member Author

lhoestq commented Oct 2, 2023

Alternatively, we could just use this, no?

if config.FSSPEC_VERSION < version.parse("2023.9.0"):
    KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = [
        "{keyword}[{sep}/]**",
        "**[{sep}]{keyword}[{sep}/]**",
        "**/{keyword}[{sep}/]**",
    ]
else:
    KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = [
        "{keyword}[{sep}/]**",
        "**/*[{sep}]{keyword}[{sep}/]**",
        "**/*/{keyword}[{sep}/]**",
    ]

This way there's no need to implement sets, which would require a bit of work since we've always considered a list of patterns to be resolved as the concatenated list of resolved files for each pattern (including duplicates).
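
For illustration, a rough sketch of how these base patterns expand for the "train" split; the separator class "-._ 0-9" is an assumption for the {sep} placeholder, not necessarily the exact value used in datasets:

# the fsspec >= 2023.9.0 patterns proposed above
base_patterns = [
    "{keyword}[{sep}/]**",
    "**/*[{sep}]{keyword}[{sep}/]**",
    "**/*/{keyword}[{sep}/]**",
]
SEP = "-._ 0-9"  # assumed separator characters, for illustration only
print([p.format(keyword="train", sep=SEP) for p in base_patterns])
# ['train[-._ 0-9/]**', '**/*[-._ 0-9]train[-._ 0-9/]**', '**/*/train[-._ 0-9/]**']

Each expanded pattern is globbed on its own and the per-pattern results are concatenated, so any path matched by more than one pattern shows up more than once in data_files.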

@lhoestq
Member Author

lhoestq commented Oct 2, 2023

Arf "**/*/{keyword}[{sep}/]**" does return data/keyword.txt in latest fsspec but not in glob.glob

EDIT: actually I forgot to set recursive=True

@lhoestq
Member Author

lhoestq commented Oct 2, 2023

Actually glob.glob does return it with recursive=True! My bad.
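
For illustration, a minimal reproduction of that check with glob.glob, with {sep} reduced to "." so the pattern stays glob-friendly (the directory layout is made up):

import glob
import os
import tempfile

# create data/keyword.txt in a scratch directory
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "data"), exist_ok=True)
open(os.path.join(tmp, "data", "keyword.txt"), "w").close()
os.chdir(tmp)

# simplified version of "**/*/{keyword}[{sep}/]**"
print(glob.glob("**/*/keyword[.]**", recursive=True))
# ['data/keyword.txt'] (shown with POSIX separators)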

@lhoestq
Member Author

lhoestq commented Oct 4, 2023

Pff, just tested it and my idea sucks: patterns 1 and 3 obviously give duplicates.
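
For illustration, a rough sketch of that overlap with glob.glob, using simplified stand-ins for patterns 1 and 3 (the {sep} class is dropped to keep them glob-friendly) and the train/train.parquet layout from the report:

import glob
import os
import tempfile

# recreate the layout from the report: train/train.parquet
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "train"), exist_ok=True)
open(os.path.join(tmp, "train", "train.parquet"), "w").close()
os.chdir(tmp)

pattern_1 = "train/**"     # simplified stand-in for "{keyword}[{sep}/]**"
pattern_3 = "**/*/train*"  # simplified stand-in for "**/*/{keyword}[{sep}/]**"
resolved = [
    path
    for pattern in (pattern_1, pattern_3)
    for path in glob.glob(pattern, recursive=True)
    if os.path.isfile(path)
]
print(resolved)
# ['train/train.parquet', 'train/train.parquet'] -> both patterns match the same file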

@lhoestq
Member Author

lhoestq commented Oct 5, 2023

> I think it's best to drop duplicates with a set (as a temporary fix)

I started #6278 to use DataFilesSet objects instead of DataFilesList
