This repository has been archived by the owner on Apr 12, 2023. It is now read-only.

Find or build tool to mine the header for metadata #5

Open
lesserwhirls opened this issue Mar 29, 2016 · 4 comments

Comments

@lesserwhirls
Collaborator

Design a way for users to automatically mine the header of their data file.

Some implementation thoughts...

Suppose the header contains:

Site ID: SNRP, Latitude 35.0N, Longitude 105.2 degrees_west

User could specify:

Station Name: "Site ID: [],"
Latitude: "Latitude []N,"
Latitude Units: "Latitude [-1],"
Longitude: "Longitude [] degrees_west"
Longitude Units: "Longitude [-12:]"

[] means grab everything between the surrounding text (e.g. "Latitude []N" means get the text between "Latitude " and "N")

[ind], or [start:stop] will grab the stuff between the strings, but index into that 'stuff'
[0] means first
[-1] means last index
[0:5] means indices 0, 1, 2, 3, 4
[-12:] means 12th index from the end, to the end
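
A minimal Python sketch of how this bracket syntax could be interpreted. It is illustrative only: the function name `extract`, the error handling, and the choice to use the first occurrence of the surrounding text are assumptions, not part of any existing code.

```python
import re

# Matches [], [i], [i:j], [i:], [:j] (indices may be negative).
_SPEC = re.compile(r"\[(-?\d+)?(:)?(-?\d+)?\]")

def extract(template, header):
    """Return the part of `header` selected by a template such as 'Latitude []N,'."""
    spec = _SPEC.search(template)
    if spec is None:
        raise ValueError("template has no [...] placeholder")
    prefix, suffix = template[:spec.start()], template[spec.end():]

    start = header.find(prefix)
    if start < 0:
        raise ValueError(f"text before the placeholder not found: {prefix!r}")
    start += len(prefix)

    # An empty suffix means "up to the end of the header".
    end = header.find(suffix, start) if suffix else len(header)
    if end < 0:
        raise ValueError(f"text after the placeholder not found: {suffix!r}")
    between = header[start:end]

    first, colon, last = spec.groups()
    if colon is None:
        # [] keeps everything, [i] picks a single character.
        return between if first is None else between[int(first)]
    lo = int(first) if first is not None else None
    hi = int(last) if last is not None else None
    return between[lo:hi]

header = "Site ID: SNRP, Latitude 35.0N, Longitude 105.2 degrees_west"
station = extract("Site ID: [],", header)
lat_units = extract("Latitude [-1],", header)
lon_units = extract("Longitude [-12:]", header)
```

With the example header above, `station` is "SNRP", `lat_units` is "N" and `lon_units` is "degrees_west".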

@aleksandervines

+1 :)

@aleksandervines

aleksandervines commented Mar 13, 2017

I'm looking into an implementation of this which is based on regex instead.
Your specific examples could be matched like this:

Station Name: "Site ID:([^,]+)"
Latitude: "Latitude([^,NS]*)[NS],"
Latitude Units: "Latitude[^,NS]*([NS]),"
Longitude: "Longitude ([0-9]+[.]?[0-9]*) degrees_west"
Longitude Units: "Longitude [0-9]+[.]?[0-9]* (.{12})$"
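
As a quick check, the patterns above can be run against the example header. This is a sketch using Python's `re` module rather than the project's own code; the helper name `first_group` is hypothetical, and these particular patterns use only syntax that behaves the same in most regex engines.

```python
import re

header = "Site ID: SNRP, Latitude 35.0N, Longitude 105.2 degrees_west"

patterns = {
    "Station Name": r"Site ID:([^,]+)",
    "Latitude": r"Latitude([^,NS]*)[NS],",
    "Latitude Units": r"Latitude[^,NS]*([NS]),",
    "Longitude": r"Longitude ([0-9]+[.]?[0-9]*) degrees_west",
    "Longitude Units": r"Longitude [0-9]+[.]?[0-9]* (.{12})$",
}

def first_group(pattern, text):
    """Return the first capturing group of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

extracted = {name: first_group(p, header) for name, p in patterns.items()}

# The raw captures for Station Name and Latitude start with a space,
# because the patterns do not consume the space after "Site ID:" / "Latitude".
stripped = {name: value.strip() for name, value in extracted.items()}
```

Note that the Station Name and Latitude patterns capture the leading space (" SNRP", " 35.0"), so the result either needs trimming, as in `stripped`, or the space has to be written into the pattern before the group.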

Notes:
This needs to be handled on the server side if we want to support non-JavaScript clients, e.g. a simple bulk-processing tool or another front end.

Edit: updated the notes below to reflect the actual implementation, which I have opened a PR for.
Implemented this "simplest" implementation:

  • Used regex capturing groups, since regex support is already implemented and regex is a well-known language.
  • Tries to match one header line at a time, and breaks on the first match.

Details:

  • A checkbox at each element in site-specific/general information(?), which says whether the value is a search pattern
    • It is repopulated properly from sessionStorage
    • It adds the name of the attribute to sessionStorage.parseHeaderForMetadata; from there it is fetched with getAllDataInSession() and passed into the convert request to /parse
    • sessionStorage is updated properly if the name of a custom attribute is changed in general metadata.
  • The string is validated to verify that it is a legal pattern:
    • An "onchange" event triggers validation when the checkbox is changed
    • Validation is also triggered on focusout on the text field and in checkAndExposeNext
    • The validation function is adapted to take this into account and provides regex validation if the value should be a regex
    • It also checks whether there is a '(' present in the regex, as a simple, but not complete, test to see whether it contains a capturing group
  • The server then needs to parse the header with the patterns and save result in the selected attributes.
    • AsciiFile stores parseHeaderForMetadataList with metadata tags that shall be handled as regex
    • The controller uses fileparsemanager to extract the header as a list of strings
    • This list is passed as an argument to NetcdfFileManager
    • NetcdfFileManager initiates parseHeaderForMetadataList before setting the metadataMaps
    • NetcdfFileManager processes the header and replaces the regex with the matched string in the metadataMaps
  • A very simple error message is given to the user if the pattern does not match, or has an "invalid" result.
    • An IllegalArgumentException is thrown if there are no matches
    • It is caught in the controller, and a message about the first pattern that did not match is returned to the user.
  • If there are no errors, the attribute should be written like any other global attribute from metadataMaps.
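
The server-side step described above can be sketched roughly as follows. This is a Python approximation, not the actual Java code: the function name `apply_header_patterns` and its arguments are hypothetical, `ValueError` stands in for IllegalArgumentException, and it assumes each pattern has already been validated to contain a capturing group.

```python
import re

def apply_header_patterns(metadata, pattern_keys, header_lines):
    """Replace each regex-valued attribute with the text its first group captures."""
    resolved = dict(metadata)  # leave the caller's map untouched
    for key in pattern_keys:
        regex = re.compile(metadata[key])
        for line in header_lines:
            match = regex.search(line)
            if match:
                resolved[key] = match.group(1)
                break  # first matching header line wins
        else:
            # No header line matched: report the first failing pattern.
            raise ValueError(f"No header line matches pattern for '{key}': {metadata[key]}")
    return resolved

header_lines = [
    "# Example station file",
    "Site ID: SNRP, Latitude 35.0N, Longitude 105.2 degrees_west",
]
metadata = {"title": "My data", "platform": r"Site ID: ([^,]+)"}
result = apply_header_patterns(metadata, ["platform"], header_lines)
```

Here `result["platform"]` becomes "SNRP" while `title` is left unchanged; a pattern that matches no header line raises on the first failure, mirroring the behaviour described in the list.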

Possible improvements:

  1. It would be more logical to implement this e.g. under the header tab - but that requires more implementation work, e.g. to synchronize with the other tabs to avoid duplicate attributes.
  2. It would be practical to have the header displayed on the same page as you enter the search pattern - more work, but a nice improvement; item 1 would solve it
  3. It would be useful for the user to get feedback on what the result of the matching will be without having to submit the conversion request - yet again, more to implement. It could be implemented on the client side only, or via a call to the server
  4. More important, it would be useful to validate that it would actually match to a valid value - more to implement, would probably be solved by 3.
  5. It would be useful to be able to specify the data type of the value, e.g. string, integer, float. - this goes for all attributes. Now they all just default to string. A separate issue, really
  6. An "optional" value could be useful, e.g. if bulk processing, and you want to add this to an attribute for those files where it exists - and for the others you wouldn't want to give an error, just output the netCDF without this attribute.
  7. A "default" value, if the pattern has no matches?
  8. Multivalue, if the pattern has multiple matches?
  9. Match multiple lines at once?
  10. A static part? So the value will be +regex match
  11. Alternative pattern syntax? Like the one @lesserwhirls suggested, which is very similar to Python's slice syntax?
  12. Add to variable attributes?
  13. If the pattern is not valid, the user will lose the old value as it is removed from sessionStorage. Do we want to handle this differently, for better user experience?
  14. Remove the need for :true in sessionStorage by implementing new sessionFunctions to handle a list like this. This design choice was just for convenience, since sessionFunctions already had functions to solve it this way. It could be used if we decide to allow different pattern languages on each field.
  15. Should the processing of the header happen in AsciiFile instead of NetcdfFileManager? I find the way init() works a bit weird, but it seemed logical that this would fit there.
  16. Create a better exception to use than IllegalArgumentException.
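
To make items 6 to 8 concrete, here is one way "optional", "default" and multi-value behaviour could fit together. This is a hypothetical Python sketch: the function `resolve` and its parameters are assumptions for illustration and do not exist in the code base.

```python
import re

def resolve(pattern, header_lines, default=None, optional=False, multi=False):
    """Resolve a header pattern with default / optional / multi-value handling."""
    regex = re.compile(pattern)
    # Collect the first capturing group of every match, in header order.
    found = [m.group(1) for line in header_lines for m in regex.finditer(line)]
    if found:
        return found if multi else found[0]
    if default is not None:
        return default  # item 7: fall back to a default value
    if optional:
        return None     # item 6: caller skips writing the attribute
    raise ValueError(f"No match for pattern: {pattern}")

lines = ["Site ID: SNRP, Latitude 35.0N", "Sensor: A", "Sensor: B"]
sensors = resolve(r"Sensor: (\w+)", lines, multi=True)
depth = resolve(r"Depth: (\d+)", lines, default="0")
```

In this sketch a default takes precedence over "optional", and without either the call still fails as in the current implementation; those precedence rules are a design choice that would need to be agreed on.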

Other notes:
The header as a whole breaks the option of always being able to fully reverse-engineer a CSV file from the netCDF file.

@aleksandervines

This allows the user to write a regex to extract metadata from the header:
[Screenshot: platform_info]

Simple validation of the pattern entered is implemented in the UI.
[Screenshot: error_messages]
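
The validation described earlier in the thread (the pattern must compile, and must contain a '(') could look roughly like this. It is a Python sketch of the idea, not the UI's JavaScript; the function name `validate_pattern` and the messages are assumptions.

```python
import re

def validate_pattern(text):
    """Return an error message, or None if the text looks like a usable pattern."""
    try:
        re.compile(text)
    except re.error as exc:
        return f"Not a valid regular expression: {exc}"
    # Simple but incomplete check: an escaped \( or a non-capturing (?:...)
    # group also contains '(' and would pass.
    if "(" not in text:
        return "Pattern must contain a capturing group, e.g. 'Site ID: ([^,]+)'"
    return None

ok = validate_pattern("Site ID: ([^,]+)")        # valid, has a group
no_group = validate_pattern("Site ID: [^,]+")    # compiles, but no '('
broken = validate_pattern("Site ID: ([^,]+")     # unbalanced parenthesis
```

In Python, a stricter test would be `re.compile(text).groups > 0`, which counts actual capturing groups; whether an equivalent is practical in the browser is a separate question.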

If the conversion process fails to find a valid match for a regex, the user will get this error message (only the first failed pattern is shown, as the server immediately throws an exception and returns to the user).
[Screenshot: parse_error]

@lesserwhirls
Collaborator Author

@aleksandervines - very nice 👍 I was initially trying to avoid regular expressions, as the users I have been targeting do not necessarily understand what a regex is. However, I think that with appropriate documentation and many examples, regexes could work for most users (and since it's optional, it's no big deal if they opt out of using them). I'll go ahead and merge this as is, but will open another issue with the list of possible improvements you outlined. Thank you!
