Herbarium encoding issues - dr10574, dr376 #1105

Open
rosemaryjoconnor opened this issue Sep 10, 2024 · 5 comments

rosemaryjoconnor commented Sep 10, 2024

dr10574 - Tasmanian Herbarium (TMAG) - uploads directory
dr376 - Melbourne Herbarium - IPT

Both are failing with encoding errors. For TMAG (dr10574) the cause is likely a non-UTF-8 line-break character; for dr376 it is a single non-UTF-8 character at a specific location.
The option to clean up the data in pre-ingestion prior to load is not implemented yet.

Solution: run load_dataset for the Herbarium/IPT datasets, as was done for the NZ herbarium; don't use pre-ingestion.
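
To pin down where the offending bytes sit before loading, something like the following could help. It is only a sketch using the Python standard library, not part of load_dataset or pre-ingestion; the function name `find_invalid_utf8` and the sample bytes are made up for illustration.

```python
from typing import Iterator, Tuple


def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, byte_offset, offending_bytes) for each run of
    bytes that cannot be decoded as UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            # Re-slicing copies the remainder; acceptable for a diagnostic sketch.
            data[pos:].decode("utf-8")
            return  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            start = pos + err.start  # err.start is relative to the slice
            end = pos + err.end
            line_no = data.count(b"\n", 0, start) + 1
            yield line_no, start, data[start:end]
            pos = end  # resume scanning after the bad bytes


if __name__ == "__main__":
    # Hypothetical sample: a Latin-1 encoded "e acute" (0xE9) on line 2.
    sample = b"id,name\n1,caf\xe9\n2,ok\n"
    for line_no, offset, bad in find_invalid_utf8(sample):
        print(f"line {line_no}, byte offset {offset}: {bad!r}")
```

For a real file the bytes would come from something like `pathlib.Path(path).read_bytes()`, where the path to the dr376 or dr10574 download is whatever applies locally.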

rosemaryjoconnor self-assigned this Sep 10, 2024

10/09/2024

  • Both loaded in databox successfully
  • NK to check before load to Production


rosemaryjoconnor commented Sep 11, 2024

11/09/2024

Databox Load

  • dr10574: Tasmanian Herbarium
  • dr376 - Melbourne Herbarium

Production Load

  • dr10574: Tasmanian Herbarium
  • dr376 - Melbourne Herbarium


rosemaryjoconnor commented Sep 12, 2024

13/09/2024

Check data resource after SOLR Index

  • dr10574: Tasmanian Herbarium
  • dr376 - Melbourne Herbarium

Record counts

  • dr10574: Tasmanian Herbarium - Old: 277,493, New: 277,493
  • dr376 - Melbourne Herbarium - Old: 1,070,469, New: 1,068,409
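
For reference, the old/new comparison can be worked through as below. This is just a sketch of the arithmetic on the counts listed above; the helper name `count_change` is made up, and nothing here queries the index.

```python
from typing import Dict, Tuple

# (old, new) record counts as listed in this comment.
COUNTS: Dict[str, Tuple[int, int]] = {
    "dr10574": (277_493, 277_493),
    "dr376": (1_070_469, 1_068_409),
}


def count_change(old: int, new: int) -> Tuple[int, float]:
    """Return (absolute difference, percentage change relative to old)."""
    diff = new - old
    return diff, 100.0 * diff / old


if __name__ == "__main__":
    for uid, (old, new) in COUNTS.items():
        diff, pct = count_change(old, new)
        print(f"{uid}: old={old:,} new={new:,} diff={diff:+,} ({pct:+.2f}%)")
```

dr10574 is unchanged, while dr376 is down by 2,060 records (about 0.19%).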


rosemaryjoconnor commented Sep 12, 2024

13/09/2024

The encoding issue seems to be due to pandas. Niels has said that DuckDB reads the data without any problem.
This may be something we need to look into.
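
Until that is looked into, one possible workaround is to decode the file leniently and hand the cleaned text to the reader. The sketch below is an assumption about what such a clean-up could look like, not what load_dataset or pre-ingestion currently does; `clean_for_load` and the set of characters treated as odd line breaks are illustrative choices.

```python
import re

# Characters that some parsers treat as line breaks (assumed list, adjust as needed).
ODD_LINE_BREAKS = re.compile("[\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]")


def clean_for_load(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing undecodable bytes with U+FFFD,
    normalise CRLF/CR to LF, and turn unusual line-break characters into spaces."""
    text = raw.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return ODD_LINE_BREAKS.sub(" ", text)


if __name__ == "__main__":
    # Hypothetical sample: a stray 0xE9 byte and a U+2028 line separator in a field.
    raw = b"id,name\r\n1,caf\xe9\n2,a\xe2\x80\xa8b\n"
    print(repr(clean_for_load(raw)))
```

The cleaned string could then be passed to pandas through `io.StringIO`, e.g. `pd.read_csv(io.StringIO(cleaned))`. Replaced characters show up as U+FFFD, so the affected records can still be found and corrected at the source.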

Counts for new records are not correct: I made a mistake and ran ingest_large_dataset. Have rerun with load_dataset.

  • dr376 loaded successfully; just need to wait for the index run on Monday night
  • dr10574 is still having issues


rosemaryjoconnor commented Sep 16, 2024

14/09/2024

  • dr10574 - successfully loaded via load_dataset

  • dr376 - successfully loaded via load_dataset

  • Check index tomorrow
