NVA date update #1111

Open
sadeghim opened this issue Sep 25, 2024 · 2 comments

@sadeghim
Member

The NVA eventDate field has a trailing "Z" on some dates in the dataset. These values need to be corrected so that pipelines can parse them.

@cha801p cha801p self-assigned this Sep 26, 2024
@cha801p
Contributor

cha801p commented Sep 26, 2024

Ticket Update: September 26 2024

Issue: Fix data format for the Tasmanian Natural Values Atlas (NVA) dr710.

Solution: Successfully removed the trailing "Z" from date entries (e.g., changed "01-06-2017Z" to "01-06-2017").
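For anyone repeating this fix, here is a minimal sketch of the date correction, together with the float-to-integer adjustment listed under Actions Taken. It is not the script that was actually run. The function names `strip_trailing_z` and `float_to_int_text` are hypothetical, and it assumes that only date-only values carry the stray "Z" and that full timestamps ending in "Z" should be left as they are.

```python
import re

# Matches a date-only value (DD-MM-YYYY or YYYY-MM-DD) followed by a stray "Z".
# Assumption: full timestamps such as 2017-06-01T10:00:00Z are valid and untouched.
_DATE_WITH_Z = re.compile(r"^(\d{2}-\d{2}-\d{4}|\d{4}-\d{2}-\d{2})Z$")


def strip_trailing_z(value: str) -> str:
    """Remove a trailing 'Z' from a date-only string; return other values unchanged."""
    match = _DATE_WITH_Z.match(value.strip())
    return match.group(1) if match else value


def float_to_int_text(value: str) -> str:
    """Render whole-number floats such as '3.0' as '3'; leave everything else alone."""
    try:
        number = float(value)
    except ValueError:
        return value  # empty or non-numeric cells pass through untouched
    return str(int(number)) if number.is_integer() else value
```

Applied to each row, `strip_trailing_z` would be used on eventDate and `float_to_int_text` on individualCount. Values that do not match the expected pattern are returned as-is, so they can still be reviewed by hand.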

Actions Taken:

  • Downloaded the data from DwCA-exports.
  • Conducted a thorough review of the data.
  • Corrected the date format.
  • Converted multiple columns, notably individualCount, from float to integer data type.
  • Resolved issues with the DwCA format, including correcting headers and performing necessary manipulations.
  • Attempted to load data onto collectory-test; however, the data load failed during the DwCA to Verbatim step.

Error Log:
INFO [2024-09-25 08:00:37,649+0000] [main] au.org.ala.pipelines.util.VersionInfo: git.remote.origin.url=https://github.com/gbif/pipelines
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Adding step 1: Options
INFO [2024-09-25 08:00:38,776+0000] [main] au.org.ala.pipelines.beam.ALADwcaToVerbatimPipeline: Non-HDFS Input path: /data/biocache-load/dr710
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Unexpected error during DWCA-AVRO conversion dr710 step
25-Sep 08:00:38 [LA-PIPELINES] [dr710] [ERROR] Error 1 occurred on 1

Issues Encountered:

  • Identified multiple issues with the DwCA file, including:
  1. Unidentified columns.
  2. Duplicate columns that were empty.
  • Reworked the DwCA to create a new TSV file.
  • Created the DwCA locally and loaded the data onto collectory-test again.
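The column problems above can be checked for with a short script before rebuilding the archive. The following is only a sketch, not the procedure that was used here. The function names are hypothetical, and the set of expected column names is an assumption that the caller must supply.

```python
import csv
import io


def unidentified_columns(header, known_terms):
    """Return header names that are not in the caller-supplied set of expected terms."""
    return [name for name in header if name not in known_terms]


def drop_empty_duplicate_columns(tsv_text: str) -> str:
    """Drop columns whose header repeats an earlier header and whose cells are all empty."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    header, body = rows[0], rows[1:]
    seen = set()
    keep = []
    for index, name in enumerate(header):
        is_empty = all(index >= len(row) or row[index].strip() == "" for row in body)
        if name in seen and is_empty:
            continue  # repeated header with no data: dropped
        seen.add(name)
        keep.append(index)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Pad short rows so every output row has the same number of fields.
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return out.getvalue()
```

A duplicate column that does contain data is kept, so it can be resolved manually instead of being discarded silently.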

Successfully loaded the data onto Databox and production environments.

Loaded Data for Review:
Test: Collections Test - DR710
Production: Collections Production - DR710

@cha801p
Contributor

cha801p commented Sep 26, 2024

Prod UUID count logs:
24/09/26 08:20:48 INFO SparkContext: Successfully stopped SparkContext
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: newUuids: 0.0, preservedUuids: 1121933.0, orphanedUniqueKeys: 0.0
24/09/26 08:20:48 INFO ALAUUIDMintingPipeline: Percentage UUID change: 0, allowed percentage: 50, override percentage check: false

  • Status: Awaiting indexing
