Merge branch 'main' into rmdrone
rkingsbury committed Jun 6, 2022
2 parents 10480c3 + 84fba26 commit 88a47aa
Showing 12 changed files with 267 additions and 43 deletions.
8 changes: 8 additions & 0 deletions docs/CHANGELOG.md
@@ -1,5 +1,13 @@
# Changelog

## [v0.47.2](https://github.com/materialsproject/maggma/tree/v0.47.2) (2022-05-27)

[Full Changelog](https://github.com/materialsproject/maggma/compare/v0.47.1...v0.47.2)

**Merged pull requests:**

- Docs updates: add FileStore and misc edits [\#668](https://github.com/materialsproject/maggma/pull/668) ([rkingsbury](https://github.com/rkingsbury))

## [v0.47.1](https://github.com/materialsproject/maggma/tree/v0.47.1) (2022-05-24)

[Full Changelog](https://github.com/materialsproject/maggma/compare/v0.47.0...v0.47.1)
20 changes: 13 additions & 7 deletions docs/concepts.md
@@ -1,23 +1,29 @@
# Concepts

## MSONable

One challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.

## Store

Another challenge is dealing with all the different types of databases out there. Maggma was originally built off MongoDB, so its interface looks a lot like `PyMongo`. Still, there are a number of useful new object databases that can be used to store large quantities of data you don't need to search in, such as Amazon S3 and Google Cloud. It would be nice to have a single interface to all of these so you could write your data pipeline only once.
A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However,
Maggma makes it possible to use that same syntax to query other types of databases, such as Amazon S3, GridFS, or even files on disk.

Stores are databases containing organized document-based data. They represent either a data source or a data sink. They are modeled around the MongoDB collection, although they can represent more complex data sources that auto-alias keys without the user knowing, or even provide concatenation or joining of Stores. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma that help in efficient document processing: the `key` and the `last_updated_field`. `key` is the field that is used to uniquely index the underlying data source. `last_updated_field` is the timestamp of when that document was last modified. A minimal sketch of how these pieces fit together is shown below.
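
For illustration, here is a minimal sketch of how `key` and `last_updated_field` come into play. The store name and documents are hypothetical, and an in-memory `MemoryStore` stands in for a real database:

```python
from datetime import datetime

from maggma.stores import MemoryStore

# A Store indexed on "task_id"; "last_updated" orders documents in time
store = MemoryStore(
    collection_name="tasks", key="task_id", last_updated_field="last_updated"
)
store.connect()

# update() inserts documents, replacing any existing document with the same key
store.update(
    [
        {"task_id": "mp-1", "energy": -1.2, "last_updated": datetime.utcnow()},
        {"task_id": "mp-2", "energy": -3.4, "last_updated": datetime.utcnow()},
    ]
)

print(store.query_one({"task_id": "mp-2"}))  # the single matching document
print(store.distinct("task_id"))             # ["mp-1", "mp-2"]

store.close()
```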

## Builder

Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`:
Builders represent a data processing step, analogous to an extract-transform-load (ETL) operation in a data
warehouse model. Much like `Store`, the `Builder` class provides a consistent interface for writing data
transformations, which are each broken into 3 phases: `get_items`, `process_item`, and `update_targets`:

1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage.
3. `update_targets`: Add the processed item to the target Store(s).

Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected not to perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
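
As a rough sketch of what this looks like in practice (the builder below is hypothetical, and the exact base-class signature should be checked against the `Builder` API documentation), a subclass wires source and target `Store`s together and implements the three phases:

```python
from maggma.core import Builder


class MultiplyBuilder(Builder):
    """Hypothetical builder that multiplies an "energy" field by a constant."""

    def __init__(self, source, target, factor=2, **kwargs):
        self.source = source
        self.target = target
        self.factor = factor
        super().__init__(sources=[source], targets=[target], **kwargs)

    def get_items(self):
        # Phase 1: read items from the source Store (IO allowed here)
        return self.source.query()

    def process_item(self, item):
        # Phase 2: pure transformation, no IO, so Maggma can parallelize it
        return {
            self.target.key: item[self.source.key],
            "energy": item["energy"] * self.factor,
        }

    def update_targets(self, items):
        # Phase 3: write the processed documents to the target Store (IO allowed)
        self.target.update(items)
```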

## MSONable

Another challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
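
For example, here is a hedged sketch using `monty` (the library that provides `MSONable`). The store configuration is arbitrary, and the printed `@module` value is shown only as an illustration:

```python
import json

from monty.json import MontyDecoder, MontyEncoder
from maggma.stores import MemoryStore

store = MemoryStore(collection_name="tasks")

# as_dict() captures the configuration plus "@module"/"@class" bookkeeping keys
d = store.as_dict()
print(d["@module"], d["@class"])  # e.g. "maggma.stores.mongolike" "MemoryStore"

# The dictionary round-trips through JSON and back to the original object
serialized = json.dumps(store, cls=MontyEncoder)
rebuilt = json.loads(serialized, cls=MontyDecoder)
assert isinstance(rebuilt, MemoryStore)
```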

64 changes: 34 additions & 30 deletions docs/getting_started/stores.md
@@ -1,48 +1,52 @@
# Using `Store`

A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS, which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that, it adds in something that looks like GridFS but actually uses AWS S3 as the storage space. Finally, `Store` can actually perform logic, for instance concatenating two or more `Store`s together to make them look like one data source. This means you only have to write a `Builder` for one scenario of how to transform data, and the choice of `Store` lets you control where the data comes from and goes.
A `Store` is just a wrapper to access data from a data source. That data source is typically a MongoDB collection, but it could also be an Amazon S3 bucket, a GridFS collection, or a folder of files on disk. `maggma` makes interacting with all of these data sources feel the same (see the `Store` interface, below). `Store` can also perform logic, for instance concatenating two or more `Store`s together to make them look like one data source.

The benefit of the `Store` interface is that you only have to write a `Builder` once. As your data moves or evolves, you simply point the `Builder` to a different `Store` without having to change your processing code.

## List of Stores

Current working and tested Stores include:

- MongoStore: interfaces to a MongoDB Collection
- MongoURIStore: MongoDB introduced advanced URIs, including the special "mongodb+srv://" scheme, which uses a combination of SRV and TXT DNS records to fully set up the client. This store safely handles these kinds of URIs.
- MemoryStore: just a Store that exists temporarily in memory
- JSONStore: builds a MemoryStore and then populates it with the contents of the given JSON files
- GridFSStore: interfaces to GridFS collection in MongoDB
- MongograntStore: uses Mongogrant to get credentials for a MongoDB database
- VaultStore: uses Vault to get credentials for a MongoDB database
- AliasingStore: aliases keys from the underlying store to new names
- SandboxStore: provides permission control to documents via a `_sbxn` sandbox key
- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- JointStore: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection
- ConcatStore: concatenates several MongoDB collections in series so they look like one collection
Current working and tested `Store` classes include:

- `MongoStore`: interfaces to a MongoDB Collection
- `MemoryStore`: just a Store that exists temporarily in memory
- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files
- `FileStore`: query and add metadata to files stored on disk as if they were in a database
- `GridFSStore`: interfaces to GridFS collection in MongoDB
- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- `ConcatStore`: concatenates several Stores together so they look like one Store
- `MongoURIStore`: MongoDB introduced advanced URIs, including the special "mongodb+srv://" scheme, which uses a combination of SRV and TXT DNS records to fully set up the client. This store safely handles these kinds of URIs.
- `MongograntStore`: uses Mongogrant to get credentials for a MongoDB database
- `VaultStore`: uses Vault to get credentials for a MongoDB database
- `AliasingStore`: aliases keys from the underlying store to new names
- `SandboxStore`: provides permission control to documents via a `_sbxn` sandbox key
- `JointStore`: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection

## The `Store` interface

All `Store` classes provide a number of basic methods that facilitate querying, updating, and removing data:

- `query`: Standard mongo style `find` method that lets you search the store.
- `query_one`: Same as above but limits returned results to just the first document that matches your query. Very useful for understanding the structure of the returned data.
- `update`: Insert or update documents in the collection. This will overwrite documents if the key field matches.
- `ensure_index`: This creates an index for the underlying data-source for fast querying.
- `distinct`: Gets distinct values of a field.
- `groupby`: Similar to query but performs a grouping operation and returns sets of documents.
- `remove_docs`: Removes documents from the underlying data source.
- `last_updated`: Finds the most recently updated `last_updated_field` value and returns that. Useful for knowing how old a data-source is.
- `newer_in`: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing.

### Initializing a Store

All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema.
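
As an illustrative sketch (the database, collection, and field names below are placeholders, not defaults), a `MongoStore` might be initialized like this:

```python
from maggma.stores import MongoStore

# Hypothetical connection details; replace with your own database settings
store = MongoStore(
    database="my_database",
    collection_name="tasks",
    host="localhost",
    port=27017,
    key="task_id",  # field used to uniquely identify documents
    last_updated_field="last_updated",  # field used to order documents in time
)
```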

### Using a Store

You must connect to a store by running `store.connect()` before querying and updating the store.
If you are operating on the stores from within other code, it is recommended to use the built-in context manager:
You must connect to a store by running `store.connect()` before querying or updating the store.
If you are operating on the stores from within other code, it is recommended to use the built-in context manager,
which will take care of the `connect()` call automatically, e.g.:

```python
with MongoStore(...) as store:
    store.query()
```
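
Building on that pattern, here is a hedged sketch of a few common operations inside the context manager (the connection details, field names, and query criteria are all hypothetical, and a running MongoDB instance is assumed):

```python
from maggma.stores import MongoStore

with MongoStore(database="my_database", collection_name="tasks") as store:
    # Fetch a single matching document to inspect its structure
    doc = store.query_one({"formula": "SiO2"})

    # Iterate over matching documents, returning only selected fields
    for d in store.query(criteria={"nelements": 2}, properties=["task_id", "formula"]):
        print(d["task_id"], d["formula"])

    # Distinct values of a field
    formulas = store.distinct("formula")
```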

Stores provide a number of basic methods that make them easy to use:

- query: Standard mongo style `find` method that lets you search the store.
- query_one: Same as above but limits returned results to just the first document that matches your query.
- update: Insert or update documents in the collection. This will overwrite documents if the key field matches.
- ensure_index: This creates an index for the underlying data-source for fast querying.
- distinct: Gets distinct values of a field.
- groupby: Similar to query but performs a grouping operation and returns sets of documents.
- remove_docs: Removes documents from the underlying data source.
- last_updated: Finds the most recently updated `last_updated_field` value and returns that. Useful for knowing how old a data-source is.
- newer_in: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing.