#35 successfully deployed production runs for both ClimSim recipes:
Both of these jobs failed with caching-related errors:
```text
# mli pipeline
RuntimeError: aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host huggingface.co:443 ssl:default [Network is unreachable] [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0002-02/E3SM-MMF.mli.0002-02-08-81600.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0008-03/E3SM-MMF.mli.0008-03-05-24000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0006-05/E3SM-MMF.mli.0006-05-30-36000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
```
```text
# mlo pipeline
RuntimeError: aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host huggingface.co:443 ssl:default [Network is unreachable] [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0007-10/E3SM-MMF.mlo.0007-10-10-03600.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0005-05/E3SM-MMF.mlo.0005-05-11-06000.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
FileNotFoundError: https://huggingface.co/datasets/LEAP/ClimSim_high-res/resolve/main/train/0003-06/E3SM-MMF.mlo.0003-06-28-80400.nc [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
```
(There is also this error; I need to double check which pipeline it's associated with...)
```text
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed [while running 'Create|OpenAndPreprocess|StoreToZarr/OpenAndPreprocess/OpenURLWithFSSpec/Open with fsspec-ptransform-81']
```
AFAICT, all of the URLs listed as FileNotFound are in fact available from Hugging Face:
```python
import requests

# This code runs without an error; and also, I've manually downloaded a few of
# these files, which worked fine.
for url in urls:  # here, urls is the list of FileNotFound URLs from the errors above
    r = requests.head(url)
    if not r.status_code == 302:  # HTTP 302 means 'Found'
        raise FileNotFoundError(url)
```
I therefore take these errors to be a symptom of rate limiting by Hugging Face, because Dataflow scaled a cluster of 500-800 workers for each of these jobs.
The best solution for rate limiting I've seen so far would be to implement some version of the PTransform linked by @alxmrs in pangeo-forge/pangeo-forge-recipes#389 (comment). That will take development work, though, further delaying this job.
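Until that transform exists, the core idea can be sketched as a simple client-side throttle. This is a minimal sketch, not the PTransform from that comment; the `Throttle` class, its name, and the chosen rate are all illustrative:

```python
import threading
import time


class Throttle:
    """Allow at most `rate` calls per second across threads.

    Illustrative sketch of client-side rate limiting: each caller invokes
    wait() before issuing an HTTP request, and wait() sleeps just long
    enough to keep requests at least 1/rate seconds apart.
    """

    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate
        self._lock = threading.Lock()
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = self._last + self.min_interval - now
            if delay > 0:
                time.sleep(delay)
            self._last = time.monotonic()


# usage sketch: throttle = Throttle(rate=10)
# then call throttle.wait() immediately before each request to huggingface.co
```

In a Beam pipeline the equivalent logic would live inside a DoFn (or the linked PTransform), so that each worker self-limits instead of relying on the server to push back.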
To unblock this, I'll try setting `max_num_workers` to a more modest number, maybe 50 to start? If this works for caching, we can cancel the job after caching is complete, and then restart it with more workers, since once the cache is populated we should not have networking issues accessing the data. This is a bit awkward, but I believe it's the fastest path to get this data built ASAP. Assuming this works, I'll revisit the RateLimit transform as my next work item.
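For reference, `max_num_workers` is a standard Beam/Dataflow pipeline option that caps autoscaling. A generic invocation would look something like the following; the actual deployment entrypoint for these recipes may differ, and `recipe_pipeline.py` is a placeholder name:

```shell
# Illustrative only: cap Dataflow autoscaling at 50 workers for the caching run.
python recipe_pipeline.py \
  --runner=DataflowRunner \
  --max_num_workers=50
```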
Very interesting! I actually ran into the same issues with a CMIP6 recipe of only 4 files, even when running these with only a single worker (all of these were using the local bakery and pgf-runner).
> I therefore take these errors to be a symptom of rate limiting by Hugging Face, because Dataflow scaled a cluster of 500-800 workers for each of these jobs.
So maybe that is not the root cause after all? It might still be a compounded issue, but to me this smells like a more general issue (maybe a version problem with fsspec?).
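If a version mismatch is suspected, one quick diagnostic is to print the installed versions of the libraries implicated in the traceback in each environment (local bakery vs. Dataflow workers) and compare. This is just an illustrative snippet using the standard library:

```python
import importlib.metadata

# Print the installed versions of the libraries implicated in the traceback;
# running this in each environment and diffing the output can rule out
# (or confirm) a version mismatch.
for pkg in ("fsspec", "aiohttp"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "not installed")
```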
On a different note, I am not sure we ever need 1000 workers (I think our data ingestion does not have to scale out as beastly as our analysis, for instance)!
So maybe we can have a more sensible global config option?