encoding factors in fetch* functions #241

Open
dylanbeaudette opened this issue Mar 8, 2022 · 1 comment

dylanbeaudette commented Mar 8, 2022

A couple of thoughts:

  • Encoding factors will only be expected (or possible) in high-level functions such as fetchNASIS, fetchSDA, etc.
  • Split the functionality to allow further / better customization.
  • uncode() decodes coded values in NASIS, using the latest version of the metadata table from the local database (if possible).
  • A new function or suite of functions would convert specific variables to factors, set the desired levels, and upgrade to ordered factors when appropriate. These functions will include an argument for dropping unused levels.
  • The behavior of this "second pass" over the uncoded data can be controlled via an argument to fetchNASIS or a global preference set with options().

Chain of functionality:

  1. read data as text
  2. uncode() all coded columns, optionally converting to factors using metadata (encodeFactors='all')
  3. selective encoding (encodeFactors='some') or none (encodeFactors='none')
  4. ???

# in all functions that get data from NASIS
x <- query()
y <- uncode(x, encodeFactors)
return(y)

# high-level functions like fetchNASIS()
x <- getXXX_from_NASIS(encodeFactors)
if (encodeFactors == 'some') {
  # keep the modified data.frame returned by the factor setup
  x <- .setupNASIS_factors(x, ...)
}
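
The switch between the three modes, and the fallback to a global preference, could be resolved in one small helper. This is only a sketch: the helper name `.resolveEncodeFactors()` and the option name `soilDB.encodeFactors` are made up for illustration and do not exist in soilDB.

```r
# hypothetical helper: resolve the encodeFactors mode from an explicit
# argument, falling back to a global option, then to 'none'
.resolveEncodeFactors <- function(encodeFactors = NULL) {
  if (is.null(encodeFactors)) {
    # 'soilDB.encodeFactors' is an assumed option name, for illustration only
    encodeFactors <- getOption("soilDB.encodeFactors", default = "none")
  }
  # reject anything other than the three modes discussed above
  match.arg(encodeFactors, choices = c("all", "some", "none"))
}

# an explicit argument wins
mode_explicit <- .resolveEncodeFactors("some")

# otherwise the global preference is used
options(soilDB.encodeFactors = "all")
mode_global <- .resolveEncodeFactors()

# reset the illustrative option
options(soilDB.encodeFactors = NULL)
mode_default <- .resolveEncodeFactors()
```

Resolving the mode once, at the top of a high-level function, would keep the lower-level get*_from_NASIS functions free of option handling.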

Apart from the compatibility issue with a pending version of R, there is no reason why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names and encode levels according to either the metadata or a manually specified vector. An invert argument can be added to reverse factor levels, which is sometimes handy. That said, I don't think that we should attempt to convert all character data → factors (e.g. parent material origin) by default, just those that are most commonly used as factors (texture class, hillslope position, drainage class, etc.).

The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.

# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.setupNASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {

  # start from the input; the steps below modify a copy
  res <- x

  # all = TRUE
  # use NASIS metadata

  # all = FALSE
  # use column-specific rules as follows
  # ...

  # invert = TRUE
  # reverse factor levels / ordering

  # drop = TRUE
  # drop unused levels, no matter the encoding strategy above

  # modified data.frame is returned
  return(res)
}
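
To make the skeleton above more concrete, here is a rough sketch of the "column-specific rules" branch using only base R. The function name `setup_factors_sketch()`, the column name `hillslopeprof`, and the level vector are assumptions for illustration; real levels should come from NASIS metadata or a reviewed manual list, and whether a given variable is treated as ordered is a separate decision.

```r
# illustrative, manually specified levels (assumed, not taken from NASIS metadata)
.example_levels <- list(
  hillslopeprof = c("summit", "shoulder", "backslope", "footslope", "toeslope")
)

# hypothetical stand-in for the column-specific branch of .setupNASIS_factors()
setup_factors_sketch <- function(x, levels = .example_levels,
                                 invert = FALSE, drop = TRUE) {
  # only touch columns that have a rule AND are present in x
  for (v in intersect(names(levels), names(x))) {
    lv <- levels[[v]]
    # optionally reverse the level order
    if (invert) lv <- rev(lv)
    # treated as ordered here purely for illustration
    x[[v]] <- factor(x[[v]], levels = lv, ordered = TRUE)
    # optionally drop levels that do not occur in the data
    if (drop) x[[v]] <- droplevels(x[[v]])
  }
  x
}

d <- data.frame(
  peiid = c("1", "2", "3"),
  hillslopeprof = c("backslope", "summit", "backslope"),
  stringsAsFactors = FALSE
)

res <- setup_factors_sketch(d)
res_inverted <- setup_factors_sketch(d, invert = TRUE, drop = FALSE)
```

Columns without a rule, such as the ID column `peiid` in this toy data, are left untouched as character.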

Finally, I suggest that fetchNASIS() should default to:

  • convert most of the commonly used nominal / ordinal data to factors / ordered factors
  • this would exclude such things as IDs, date/time, names, taxonomic information, or cases with >n unique values
  • factor levels should be set manually in the to-be-written function whenever possible
  • unused levels should be dropped
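
The exclusion rules in this list could be approximated with a simple screening step. The sketch below is only a starting point: the helper name `factor_candidates()`, the threshold `n = 20`, the ID pattern, and the toy column names are all assumptions, and names or taxonomic information would still need an explicit exclusion list.

```r
# hypothetical helper: names of character columns that look like
# reasonable factor candidates
factor_candidates <- function(x, n = 20, id_pattern = "iid$|_id$",
                              exclude = character(0)) {
  # only character columns are considered (dates, numbers are skipped)
  is_chr <- vapply(x, is.character, logical(1))
  # skip columns with more than n unique values
  few_values <- vapply(x, function(col) length(unique(col)) <= n, logical(1))
  # skip columns whose names look like IDs (assumed naming pattern)
  is_id <- grepl(id_pattern, names(x))
  # skip anything listed explicitly, e.g. names or taxonomic columns
  is_excluded <- names(x) %in% exclude
  names(x)[is_chr & few_values & !is_id & !is_excluded]
}

d <- data.frame(
  peiid = c("1", "2", "3"),
  drainagecl = c("well", "well", "poorly"),
  obs_date = as.Date(c("2022-03-08", "2022-03-09", "2022-03-10")),
  taxonname = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

candidates <- factor_candidates(d, exclude = "taxonname")
```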

brownag commented Mar 10, 2022

I added two new domain attributes to the query used by uncode() (in .get_NASIS_metadata()) for use in future functions.

MetadataDomainMaster.DomainRanked

  • Of the 439 domains in NASIS metadata, 143 are "ranked", meaning that MetadataDomainDetail.ChoiceSequence denotes the order. Note that the uncode() query orders results by ChoiceValue, not ChoiceSequence, by default.
capability_class, corrosion_concrete, corrosion_uncoated_steel, flooding_duration_class, flooding_ponding_month, potential_frost_action, soil_erodibility_factor, wind_erodibility_index, drainage_class, excavation_difficulty_class, soil_slippage_potential, ponding_duration_class, pore_continuity_vertical, rupture_resist_block_cem, wildlife_rating, mapunit_hel_class, flooding_frequency_class, ponding_frequency_class, date_time_interval_qualifier, erosion_class, fl_soil_leaching_potential, fl_soil_runoff_potential, runoff, taxonomic_family_c_e_act_class, va_soil_management_group, va_soil_productivity_group, bedrock_fracture_interval_class, boundary_distinctness, color_chroma, color_value, concen_redox_boundary, effervescence_class, concen_rmf_mottle_contrast, penetration_resistance, permeability_class, plasticity, pore_root_size, pvsf_distinctness, rupture_resist_block_dry, rupture_resist_block_moist, rupture_resist_plate, stickiness, structure_grade, structure_size, toughness_class, weathering, dmu_investigation_intensity, soil_taxonomy_edition, ia_subsoil_k, ia_subsoil_p, nj_farmland_assessment, Datetime Precision (NASIS 6 Metadata), sat_hyd_conductivity_class, soil_odor_intensity, texture_structure_category, crust_development_class, carbonate_dev_stage_cf, carbonate_dev_stage_fe, pore_quantity_class, abundance_class, canopy_cover_class, cryptogam_cover_class_legacy, cultivation_extent, current_year_precip, damage_degree, daubenmire_canopy_cover_class, decadent_plant_abundance, disturbance_impact, forest_stand_quality, ground_cover_class, ground_cover_extent, growing_season_rating, gully_rill_presence, invading_plants, pci_concentration_areas, pci_desirable_plants, pci_ground_cover_residue, pci_gully_erosion, pci_legume_pct_class, pci_plant_cover, pci_plant_diversity, pci_plant_vigor, pci_sheet_rill_erosion, pci_soil_compaction, pci_standing_dead_forage, pci_stream_shore_erosion, pci_use_uniformity, pci_wind_erosion, plant_density_class, reference_yield_rank, 
reproduction_abundance_class, rhi_annual_production, rhi_bare_ground, rhi_compaction_layer, rhi_erosion_resistance, rhi_functional_struct_groups, rhi_gullies, rhi_infiltration_runoff, rhi_invasive_plants, rhi_litter_amount, rhi_litter_movement, rhi_pedestals_terracettes, rhi_plant_mortality, rhi_reproductive_capability, rhi_rills, rhi_soil_surf_degradation, rhi_summary, rhi_water_flow_patterns, rhi_wind_scour_areas, salinity_class, sampling_intensity, seedling_abundance, sociability_class, soil_compaction, soil_crusting, soil_degradation, soil_surface_erosion, stocking_rate, suppression_degree, tree_condition, vigor_class, ak_ecological_site_status, ak_stratum_cover_class, ak_functional_group, ak_crown_class, ak_grazing_plant_group, rosgen_stream_subclass, ak_grazing_impact, observation_intensity, von_post_humification_scale, osd_text_kind, burn_intensity, crop_arrangement, dominant_vegetation, growth_status, harvest_skidding_method, type_of_burn, years_in, yrs_since_harvest, yrs_since_last_burn, burn_frequemcy, fertility_tests_done, dsp_site_type

See, for example, ponding frequency class, where the ChoiceSequence is not the same as the ChoiceValue. Notably, the ordering includes the obsolete values. In this case the obsolete class "Common" has a value (5) that does not match its sequence position (4) in the set.
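
A small illustration of why the sort column matters when building ordered factors. The domain below is a toy with invented names and values, not real NASIS metadata; only the column names ChoiceValue, ChoiceSequence, and ChoiceName follow the metadata fields mentioned here.

```r
# toy domain, NOT real NASIS metadata: names and values are invented;
# choice "d" has ChoiceValue 5 but ChoiceSequence 4
dom <- data.frame(
  ChoiceValue    = c(1, 2, 3, 5, 4),
  ChoiceSequence = c(1, 2, 3, 4, 5),
  ChoiceName     = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# level order implied by each sort column
by_value    <- dom$ChoiceName[order(dom$ChoiceValue)]
by_sequence <- dom$ChoiceName[order(dom$ChoiceSequence)]

# ordered factor using the ranked (ChoiceSequence) order
x <- c("d", "b", "e")
f <- factor(x, levels = by_sequence, ordered = TRUE)

# under ChoiceSequence, "d" ranks below "e"; under ChoiceValue it would not
d_below_e <- f[1] < f[3]
```

For ranked domains, sorting the metadata by ChoiceSequence before setting levels would presumably give the intended order, while sorting by ChoiceValue can silently swap classes.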

image


MetadataDomainMaster.DisplayLabel

  • 30 of the 439 domains have a DisplayLabel value of 1, which means that the ChoiceLabel could/should be used rather than the ChoiceName. The difference is generally one of capitalization.
hydric_condition, nasis_site_office_type, farmland_classification, state_fips_code_alpha, texture_class, texture_modifier, unified_soil_classification, terms_used_in_lieu_of_texture, mapunit_hel_class, erosion_class, nh_important_forest_soil_group, logical_data_type_nasis, sort_type, site_index_curves, legend_suitability_for_use, mou_agency_responsible, ecological_site_mlra, mapunit_text_kind, legend_certification_status, dmu_certification_status, export_certification_status, hydric_soil_indicator, farmland_class_secondary, mapunit_type, cardinality_nasis, column_alignment, default_type, saf_cover_type, sort_direction, soil_type_conversion
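
One way a future function might honor DisplayLabel is to pick the label column per row. This is a sketch over toy metadata with invented values; only the column names DisplayLabel, ChoiceName, and ChoiceLabel come from the discussion above, and the `display` column is hypothetical.

```r
# toy metadata, NOT real NASIS content: all values are invented
md <- data.frame(
  DisplayLabel = c(1, 1, 0, 0),
  ChoiceName   = c("abc", "def", "low", "high"),
  ChoiceLabel  = c("ABC", "DEF", "Low class", "High class"),
  stringsAsFactors = FALSE
)

# use ChoiceLabel where the domain is flagged, ChoiceName otherwise
md$display <- ifelse(md$DisplayLabel == 1, md$ChoiceLabel, md$ChoiceName)
```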
