
Commit

Merge pull request #72 from program--/docs-catalog
Add Catalogs Article
mikejohnson51 committed Jul 14, 2023
2 parents 74317f8 + a00acaa commit c6aaab7
Showing 5 changed files with 292 additions and 1 deletion.
3 changes: 2 additions & 1 deletion _pkgdown.yml
@@ -18,6 +18,7 @@ articles:
contents:
- 00-geopiper
- 01-intro
- 02-catalogs

reference:
- title:
@@ -36,4 +37,4 @@ reference:
- subtitle: Exported climateR-catalog
desc: Parquet file exported to package dataset
contents:
- has_concept("catalog")
- has_concept("catalog")
202 changes: 202 additions & 0 deletions docs/articles/02-catalogs.html

Some generated files are not rendered by default.

Binary file added docs/reference/figures/catalogs-overview.png
Binary file added man/figures/catalogs-overview.png
88 changes: 88 additions & 0 deletions vignettes/02-catalogs.Rmd
@@ -0,0 +1,88 @@
---
title: "climateR Catalogs"
author:
- name: "Justin Singh-Mohudpur"
url: https://github.com/program--
affiliation: Lynker
affiliation_url: https://lynker.com

- name: "Mike Johnson"
url: https://github.com/mikejohnson51
affiliation: Lynker
affiliation_url: https://lynker.com

output: distill::distill_article
---

# Catalogs

In order to provide an evolving, federated collection of datasets, `climateR` makes use of a
preprocessed catalog, updated on a monthly cycle. This catalog is hosted and generated from
the [climateR-catalogs repository](https://github.com/mikejohnson51/climateR-catalogs).
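
As a quick, hedged illustration (not the package's documented interface), a released catalog artifact could be read directly from that repository's releases; the release URL and file name below are assumptions for the sake of example.

```{r, eval = FALSE}
# Hypothetical example: fetch and read a released catalog artifact.
# The release URL and file name are assumptions, not a documented API.
url  <- "https://github.com/mikejohnson51/climateR-catalogs/releases/latest/download/catalog.parquet"
dest <- tempfile(fileext = ".parquet")
download.file(url, dest, mode = "wb")

catalog <- arrow::read_parquet(dest)
head(catalog)
```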

This catalog contains over 100,000 datasets from over 2,000 data providers/archives.
The following section describes the design of the catalog and its data pipeline.

## Design

```{r, echo = FALSE}
knitr::include_graphics("../man/figures/catalogs-overview.png")
```

The catalog data pipeline uses the [targets](https://docs.ropensci.org/targets/) package to establish
a declarative workflow in which *data sources* act as target creators. In particular, data sources are
treated as *dynamic plugins*: they are composable within the pipeline
through a framework built on [R6](https://r6.r-lib.org/index.html) classes.

The data source R6 classes expose a simple interface to plugin creators, where adding a new data source
amounts to supplying three things:

1. an `id`
2. a `pull` function
3. a `tidy` function

The `id` is a unique identifier for the data source and is carried through to the final catalog.
The `pull` function takes any number of arguments, gathers catalog items
from an endpoint, and collects them into a `data.frame`. The `tidy` function accepts
*at least* one argument, the output of the `pull` function, and performs any
actions needed to conform that output as closely as possible to the catalog schema.
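
To make this concrete, here is a minimal, hypothetical sketch of such a plugin. The `catalog_source` generator, its fields, and the example endpoint are illustrative assumptions; the actual framework in climateR-catalogs defines its own classes.

```{r, eval = FALSE}
# Minimal, hypothetical sketch of a data source plugin.
# `catalog_source`, its fields, and the example endpoint are illustrative
# only; the climateR-catalogs framework defines its own R6 classes.
library(R6)

catalog_source <- R6Class("catalog_source",
  public = list(
    id   = NULL,
    pull = NULL,
    tidy = NULL,
    initialize = function(id, pull, tidy) {
      self$id   <- id
      self$pull <- pull
      self$tidy <- tidy
    }
  )
)

example_source <- catalog_source$new(
  id = "example-provider",
  pull = function(url = "https://example.com/catalog.csv") {
    # Gather raw catalog items from an endpoint into a data.frame
    read.csv(url, stringsAsFactors = FALSE)
  },
  tidy = function(raw) {
    # Conform the pulled items as closely as possible to the catalog schema
    data.frame(
      id          = "example-provider",
      asset       = raw$name,
      URL         = raw$link,
      description = raw$description,
      stringsAsFactors = FALSE
    )
  }
)
```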

Using the data sources built on top of this R6-based framework, the pipeline is then given targets that
correspond to (1) loading the R6 class, (2) calling the `pull` function, and (3) calling the `tidy` function.
These three steps are mapped across all data sources loaded into the pipeline environment,
and the results are joined into a single table representing the catalog. Finally, the schema of the table is
checked to ensure it conforms to the catalog specification, and JSON and Parquet outputs are released.
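
A hedged sketch of how this mapping might be expressed in a `_targets.R` file follows; `load_data_sources()` is a hypothetical helper, and the real pipeline is considerably more involved.

```{r, eval = FALSE}
# Illustrative sketch of the pipeline structure in _targets.R.
# `load_data_sources()` is a hypothetical helper returning a list of
# data source R6 objects; the actual climateR-catalogs pipeline differs.
library(targets)

list(
  # (1) load the available data source plugins
  tar_target(sources, load_data_sources(), iteration = "list"),
  # (2) call each source's pull function
  tar_target(
    pulled, sources$pull(),
    pattern = map(sources), iteration = "list"
  ),
  # (3) call each source's tidy function on the pulled output
  tar_target(
    tidied, sources$tidy(pulled),
    pattern = map(sources, pulled), iteration = "list"
  ),
  # join the per-source tables into the catalog
  tar_target(catalog, dplyr::bind_rows(tidied))
)
```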

### Technical Details


#### Targets Serialization

A key point to highlight is that with the targets R package, each target is serialized to a storage
format when it completes, and dependent targets read from that format back into R as needed.
The default storage format is R's RDS format. However, since this pipeline already requires an
[Apache Arrow](https://arrow.apache.org/) dependency for its Parquet output, we take advantage of the
[Arrow IPC file/stream formats](https://arrow.apache.org/docs/python/ipc.html) to serialize these targets.
Specifically, the `pull` and `tidy` targets always return the data source R6 class, while the succeeding targets for
catalog generation return a data frame. For the targets returning R6 classes, a custom serializer is implemented that
writes the R6 class and its metadata to, and reads them back from, the Arrow IPC stream format. For the targets returning data frames,
we use the Arrow IPC file format.
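
As a hedged illustration of how such choices can be expressed, targets ships a built-in `"feather"` format (the Arrow IPC file format) for tabular targets, and `targets::tar_format()` lets a pipeline define a custom read/write pair. The stream-format serializer below is a simplified stand-in, and `build_catalog()`, `load_source()`, `source_to_meta()`, and `source_from_meta()` are hypothetical placeholders.

```{r, eval = FALSE}
# Simplified stand-ins for the serialization choices described above.
# `build_catalog()`, `load_source()`, `source_to_meta()`, and
# `source_from_meta()` are hypothetical placeholders; a real tar_format()
# serializer would inline the conversion rather than rely on helpers.
library(targets)

# Data-frame targets: the built-in "feather" format is the Arrow IPC file format
tar_target(catalog_table, build_catalog(), format = "feather")

# R6-class targets: a custom format that round-trips a source through its
# tabular metadata using the Arrow IPC stream format
ipc_stream_format <- tar_format(
  read  = function(path) source_from_meta(arrow::read_ipc_stream(path)),
  write = function(object, path) arrow::write_ipc_stream(source_to_meta(object), path)
)
tar_target(source_state, load_source(), format = ipc_stream_format)
```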

The Arrow IPC formats were chosen in this fashion due to their smaller memory footprint and the performance
gained from zero-copy passing between targets. This also enables data sources to be built in various programming
languages and access the same data if needed, again thanks to the zero-copy property of Arrow's IPC formats.

#### Pipeline Infrastructure

With the catalog data pipeline built on top of R and the targets package, we use
[GitHub Actions](https://github.com/features/actions) to automate catalog generation. Although GitHub Actions is aimed primarily at
[CI/CD](https://en.wikipedia.org/wiki/CI/CD) workflows, the concept of CI/CD generalizes to data as well.
For example, in data engineering, [Apache Airflow](https://airflow.apache.org/) is a predominant application for
constructing data workflows. The primary difference between the two is that GitHub Actions is more general-purpose
and offers fewer direct integrations for data engineering.

With that context in mind, the GitHub Actions workflow for the catalog data pipeline is, in essence, a runner that
calls `targets::tar_make()` to run the pipeline. When all of the targets are complete, the workflow takes the resulting
catalog files and uploads them to the GitHub repository as a release. Furthermore, the workflow is scheduled to run on
a monthly basis, ensuring that the catalog stays up to date with the latest datasets offered by the data providers
described in the data source plugins.
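
A hedged sketch of what such a scheduled workflow could look like is shown below; the actions, tag, and file names are assumptions for illustration, not the repository's actual workflow.

```yaml
# Hypothetical sketch of a scheduled catalog workflow; the actual workflow
# in climateR-catalogs differs in its setup, file names, and release steps.
name: Build catalog

on:
  schedule:
    - cron: "0 0 1 * *"   # first day of every month
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
      - name: Run the targets pipeline
        run: Rscript -e 'targets::tar_make()'
      - name: Upload catalog outputs as a release
        uses: softprops/action-gh-release@v1
        with:
          tag_name: catalog-latest
          files: |
            catalog.json
            catalog.parquet
```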
