diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md index 1dbc9b9a6..7377fcec2 100644 --- a/docs/CHANGELOG.md +++ b/docs/CHANGELOG.md @@ -1,5 +1,13 @@ # Changelog +## [v0.47.2](https://github.com/materialsproject/maggma/tree/v0.47.2) (2022-05-27) + +[Full Changelog](https://github.com/materialsproject/maggma/compare/v0.47.1...v0.47.2) + +**Merged pull requests:** + +- Docs updates: add FileStore and misc edits [\#668](https://github.com/materialsproject/maggma/pull/668) ([rkingsbury](https://github.com/rkingsbury)) + ## [v0.47.1](https://github.com/materialsproject/maggma/tree/v0.47.1) (2022-05-24) [Full Changelog](https://github.com/materialsproject/maggma/compare/v0.47.0...v0.47.1) diff --git a/docs/concepts.md b/docs/concepts.md index 663a525db..be6758448 100644 --- a/docs/concepts.md +++ b/docs/concepts.md @@ -1,23 +1,29 @@ # Concepts -## MSONable - -One challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of. - -Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with it's configuration parameters in a process called serialization. These dictionaries can then be converted back to the origianl Maggma object without having to know what class it belonged. `MSONable` does this by injecting in `@class` and `@module` keys that tell it where to find the original python code for that Maggma object. ## Store -Another challenge is dealing with all the different types of databases out there. Maggma was originally built off MongoDB, so it's interface looks a lot like `PyMongo`. Still, there are a number of usefull new `object` databases that can be used to store large quantities of data you don't need to search in such as Amazon S3 and Google Cloud. It would be nice to have a single interface to all of these so you could write your datapipeline only once. +A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data +sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However, +Maggma makes it possible to use that same syntax to query other types of databases, such as Amazon S3, GridFS, or even files on disk. Stores are databases containing organized document-based data. They represent either a data source or a data sink. They are modeled around the MongoDB collection although they can represent more complex data sources that auto-alias keys without the user knowing, or even providing concatenation or joining of Stores. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma that help in efficient document processing: the `key` and the `last_updated_field`. `key` is the field that is used to uniquely index the underlying data source. `last_updated_field` is the timestamp of when that document was last modified. ## Builder -Builders represent a data processing step.
Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`: +Builders represent a data processing step, analogous to an extract-transform-load (ETL) operation in a data +warehouse model. Much like `Store`, the `Builder` class provides a consistent interface for writing data +transformations, which are each broken into 3 phases: `get_items`, `process_item`, and `update_targets`: 1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase 2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage. 3. `update_target`: Add the processed item to the target Store(s). Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system. + +## MSONable + +Another challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of. + +Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object. + diff --git a/docs/getting_started/file_store_dir_structure.png b/docs/getting_started/file_store_dir_structure.png new file mode 100644 index 000000000..4916deeca Binary files /dev/null and b/docs/getting_started/file_store_dir_structure.png differ diff --git a/docs/getting_started/stores.md b/docs/getting_started/stores.md index e21b65311..3ad59e9c9 100644 --- a/docs/getting_started/stores.md +++ b/docs/getting_started/stores.md @@ -1,48 +1,52 @@ # Using `Store` -A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that it adds in something that looks like GridFS but is actually using AWS S3 as the storage space. Finally, `Store` can actually perform logic, concatenating two or more `Stores` together to make them look like one data source for instance. This means you only have to write a `Builder` for one scenario of how to transform data and the choice of `Store` lets you control where the data comes from and goes. +A `Store` is just a wrapper to access data from a data source. That data source is typically a MongoDB collection, but it could also be an Amazon S3 bucket, a GridFS collection, or a folder of files on disk. `maggma` makes interacting with all of these data sources feel the same (see the `Store` interface, below). `Store` can also perform logic, such as concatenating two or more `Store` together to make them look like one data source. + +The benefit of the `Store` interface is that you only have to write a `Builder` once.
As your data moves or evolves, you simply point it to different `Store` without having to change your processing code. ## List of Stores -Current working and tested Stores include: - -- MongoStore: interfaces to a MongoDB Collection -- MongoURIStore: MongoDB Introduced advanced URIs including their special "mongodb+srv://" which uses a combination of SRV and TXT DNS records to fully setup the client. This store is to safely handle these kinds of URIs. -- MemoryStore: just a Store that exists temporarily in memory -- JSONStore: builds a MemoryStore and then populates it with the contents of the given JSON files -- GridFSStore: interfaces to GridFS collection in MongoDB -- MongograntStore: uses Mongogrant to get credentials for MongoDB database -- VaulStore: uses Vault to get credentials for a MongoDB database -- AliasingStore: aliases keys from the underlying store to new names -- SandboxStore: provides permission control to documents via a `_sbxn` sandbox key -- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md)) -- JointStore: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection -- ConcatStore: concatenates several MongoDB collections in series so they look like one collection +Currently working and tested `Store`s include: + +- `MongoStore`: interfaces to a MongoDB Collection +- `MemoryStore`: just a Store that exists temporarily in memory +- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files +- `FileStore`: query and add metadata to files stored on disk as if they were in a database +- `GridFSStore`: interfaces to a GridFS collection in MongoDB +- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md)) +- `ConcatStore`: concatenates several Stores together so they look like one Store +- `MongoURIStore`: MongoDB introduced advanced URIs, including the special "mongodb+srv://" scheme, which uses a combination of SRV and TXT DNS records to fully set up the client. This store safely handles these kinds of URIs. +- `MongograntStore`: uses Mongogrant to get credentials for a MongoDB database +- `VaultStore`: uses Vault to get credentials for a MongoDB database +- `AliasingStore`: aliases keys from the underlying store to new names +- `SandboxStore`: provides permission control to documents via a `_sbxn` sandbox key +- `JointStore`: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection ## The `Store` interface +All `Store`s provide a number of basic methods that facilitate querying, updating, and removing data: + +- `query`: Standard mongo style `find` method that lets you search the store. +- `query_one`: Same as above but limits returned results to just the first document that matches your query. Very useful for understanding the structure of the returned data. +- `update`: Update the documents in the collection. This will overwrite documents if the key field matches. +- `ensure_index`: This creates an index for the underlying data-source for fast querying. +- `distinct`: Gets distinct values of a field. +- `groupby`: Similar to query but performs a grouping operation and returns sets of documents. +- `remove_docs`: Removes documents from the underlying data source. +- `last_updated`: Finds the most recently updated `last_updated_field` value and returns that.
Useful for knowing how old a data-source is. +- `newer_in`: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing. + ### Initializing a Store -All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into obeys some schema. +All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema. ### Using a Store -You must connect to a store by running `store.connect()` before querying and updating the store. -If you are operating on the stores inside of another code it is recommended to use the built-in context manager: +You must connect to a store by running `store.connect()` before querying or updating the store. +If you are operating on the stores inside another program, it is recommended to use the built-in context manager, +which will take care of the `connect()` automatically, e.g.: ```python with MongoStore(...) as store: store.query() ``` - -Stores provide a number of basic methods that make easy to use: - -- query: Standard mongo style `find` method that lets you search the store. -- query_one: Same as above but limits returned results to just the first document that matches your query. -- update: Update the documents into the collection. This will override documents if the key field matches. -- ensure_index: This creates an index for the underlying data-source for fast querying. -- distinct: Gets distinct values of a field. -- groupby: Similar to query but performs a grouping operation and returns sets of documents. -- remove_docs: Removes documents from the underlying data source. -- last_updated: Finds the most recently updated `last_updated_field` value and returns that. Useful for knowing how old a data-source is. -- newer_in: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing. diff --git a/docs/getting_started/using_file_store.md b/docs/getting_started/using_file_store.md new file mode 100644 index 000000000..07fe56ad8 --- /dev/null +++ b/docs/getting_started/using_file_store.md @@ -0,0 +1,203 @@ +# Using `FileStore` for files on disk + +The first step in any `maggma` pipeline is creating a `Store` so that data can be queried +and transformed.
Oftentimes your data will originate as files on disk (e.g., calculation output +files, files generated by instruments, etc.). `FileStore` provides a convenient way +to access this type of data as if it were in a database, making it possible to `query`, add metadata, +and run `Builder` on it. + +Suppose you have some data files organized in the following directory structure: + +![Example directory structure](file_store_dir_structure.png){ width="300" } + +## Creating the `FileStore` + +To create a `FileStore`, simply pass the path to the top-level directory that contains the files. + +```python +>>> fs = FileStore('/path/to/file_store_test/') +>>> fs.connect() +``` + +On `connect()`, `FileStore` iterates through all files in the base directory and +all subdirectories. For each file, it creates a dict-like record based on the file's metadata, such as name, size, last modification date, etc. These records are kept in +memory using an internal `MemoryStore`. An example record is shown below. + +```python +{'_id': ObjectId('625e581113cef6275a992abe'), + 'name': 'input.in', + 'path': '/test_files/file_store_test/calculation1/input.in', + 'parent': 'calculation1', + 'size': 90, + 'file_id': '2d12e9803fa0c6eaffb065c8dc3cf4fe', + 'last_updated': datetime.datetime(2022, 4, 19, 5, 23, 54, 109000), + 'hash': 'd42c9ff24dc2fde99ed831ec767bd3fb', + 'orphan': False, + 'contents': 'This is the file named input.in\nIn directory calculation1\nin the FileStore test directory.'} +``` + +### Choosing files to index + +To restrict which files are indexed by the Store (which can improve performance), the optional keyword arguments `max_depth` and `file_filters` can be used. For example, to index only files ending in ".in", use + +```python +>>> fs = FileStore('/path/to/my/data', file_filters=["*.in"]) +``` + +You can pass multiple `file_filters` and use glob-style [fnmatch](https://docs.python.org/3/library/fnmatch.html) patterns as well. For example, to index all files ending in ".in" or named "test-X.txt" where X is any single letter between a and d, use + +```python +>>> fs = FileStore('/path/to/my/data', file_filters=["*.in","test-[abcd].txt"]) +``` + +If you only want to index the root directory and exclude all subdirectories, use `max_depth=0`, e.g. + +```python +>>> fs = FileStore('/path/to/my/data', max_depth=0) +``` + +### Write access + +By default, the `FileStore` is read-only. However, you can set `read_only=False` if +you want to add additional metadata to the data (See ["Adding Metadata"](#adding-metadata) below). This +metadata is stored in a .json file placed in the root directory of the `FileStore` +(the name of the file can be customized with the `json_name` keyword argument). + +```python +>>> fs = FileStore('/path/to/my/data', read_only=False, json_name='my_store.json') +``` + +Several methods that modify the contents of the `FileStore` such as `add_metadata`, `update`, and `remove_docs` will not work unless the store is writable (i.e., `read_only=False`). + +### File identifiers (`file_id`) + +Each file is uniquely identified by a `file_id` key, which is computed +from the hash of the file's path relative to the base `FileStore` directory. Unique +identifiers for every file are necessary to enable `Builder` to work correctly +and for associating custom metadata (See ["Adding Metadata"](#adding-metadata) below).
Using the relative path instead of the absolute path makes it possible to move the entire `FileStore` to a new location on disk without changing `file_id` (as long as the relative paths don't change). + + +## Connecting and querying + +As with any `Store`, you have to `connect()` before you can query any data from a `FileStore`. After that, you can use `query_one()` to examine a single document or +`query()` to return an iterator of matching documents. For example, let's print the +parent directory of each of the files named "input.in" in our example `FileStore`: + +```python +>>> fs.connect() +>>> [d["parent"] for d in fs.query({"name":"input.in"})] +['calculation2', 'calculation1'] +``` + +### Performance + +**_NOTE_** `FileStore` can take a long time to `connect()` when there are more than a few hundred files in the directory. This is due to limitations of the `mongomock` package that powers the internal `MemoryStore`. We hope to identify a more performant +alternative in the near future. In the meantime, use `file_filters` and `max_depth` to limit the total number of files in the `FileStore`. + +### File Contents + +When you `query()` data, `FileStore` attempts to read the contents of each matching file and include them in the `contents` key of the returned dictionary, as you can +see in the example above. There is an optional keyword argument `contents_size_limit` which specifies the maximum file size that `FileStore` will attempt to read. + +At present, this only works with text files and the entire file contents are returned as a single string. If a file is too large to read, or if `FileStore` was unable to open the file (because it is a binary file, etc.), then you will see `contents` populated with a message that begins with `"Unable to read:`. **This behavior may change in the future.** + +## Adding metadata + +As long as a store is not read-only (see [Write access](#write-access)), you can `update()` documents in it just like any +other `Store`. This is a great way to associate additional information with raw data +files. For example, if you have a store of files generated by an instrument, you can add metadata related to the environmental conditions, the sample that was tested, etc. + +### `update` method + +You can use `update()` to add keys to the `FileStore` records. For example, to add +some tags to the files named "input.in", use: + +```python +docs = [d for d in fs.query({"name":"input.in"})] +for d in docs: + d["tags"] = ["preliminary"] +fs.update(docs) +``` + +The above steps will result in the following contents being added to the .json file. This metadata will be automatically read back in the next time you connect to the Store. + +```json +[{"path":".../file_store_test/calculation2/input.in", +"file_id":"3c3012f84c162e9ff9bb834c53dd1f58", +"tags":["preliminary"]}, +{"path":".../file_store_test/calculation1/input.in", +"file_id":"fde43ea119034eb8732d6f3f0d9802ce", +"tags":["preliminary"]}] +``` + +Notice that only the items modified with extra keys are written to the JSON (i.e., if you have 10 items in the store but add metadata to just two, only the two items will be written to the JSON). The purpose of this behavior is to prevent any duplication of data. The `file_id` and `path` are retained in the JSON file to make each metadata record manually identifiable. + +### `add_metadata` convenience method + +A more convenient way to add metadata is via the `add_metadata` method.
To use it, just pass a query to identify the documents you want to update, and a dict to add to the document. Here is what the [example above](#update-method) would look like using `add_metadata`: + +```python +fs.add_metadata({"name":"input.in"}, {"tags":["preliminary"]}) +``` + +### Automatic metadata + +You can even define a function to automatically create metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to +extract information from any key in a `FileStore` record and pass the function as an argument to `add_metadata`. + +For example, to extract the date from files named like '2022-05-07_experiment.csv' +and add it to the 'date' field: + +```python +>>> def get_date_from_filename(d): + """ + Args: + d: An item returned from the `FileStore` + """ + return {"date": d["name"].split("_")[0], + "test_name": d["name"].split("_")[1] + } + +>>> fs.add_metadata({}, auto_data=get_date_from_filename) +``` + +### Protected Keys + +Note that when using any of the above methods, you cannot modify any keys that are populated by default (e.g. `name`, `parent`, `file_id`), because they are derived directly from the files on disk. + +### Orphaned Metadata + +In the course of working with `FileStore` you may encounter a situation where there are metadata records stored in the JSON file that no longer match files on disk. This can happen if, for example, you init a `FileStore` and later delete a file, or if you init the store with the default arguments but later restrict the file selection with `max_depth` or `file_filters`. + +These orphaned metadata records will appear in the `FileStore` with the field `{"orphan": True}`. The goal of this behavior is to preserve all metadata the user may have added and prevent data loss. + +By default, **orphaned metadata is excluded from query results**. There is an `include_orphans` keyword argument you can set on init if you want orphaned metadata +to be returned in queries. + +## Deleting files + +For consistency with the `Store` interface, `FileStore` provides the `remove_docs` method whenever `read_only=False`. **This method will delete files on disk**, because +`FileStore` documents are simply representations of those files. It has an additional guard argument `confirm` which must be set to the non-default value `True` for the method to actually do anything. + +```python +>>> fs.remove_docs({"name":"input.in"}) +Traceback (most recent call last): + File "<stdin>", line 1, in <module> + File ".../maggma/src/maggma/stores/file_store.py", line 496, in remove_docs + raise StoreError( +maggma.core.store.StoreError: (StoreError(...), 'Warning! This command is about ' + 'to delete 2 items from disk! If this is what you want, reissue ' + 'this command with confirm=True.') +``` + +## Processing files with a `Builder` + +Now that you can access your files on disk via a `FileStore`, it's time to write a `Builder` to read and process the data (see [Writing a Builder](simple_builder.md)). +Keep in mind that `get_items` will return documents like the one shown in [Creating the `FileStore`](#creating-the-filestore). You can then use `process_item` to + +- Create structured data from the `contents` +- Open the file for reading using a custom piece of code +- etc. + +Once you can process data on your disk with a `Builder`, you can send that data +to any kind of `Store` you like - another `FileStore`, a database, etc.
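+
+For illustration only, here is a minimal sketch of what such a `Builder` might look like. The class name `FileContentsBuilder` and the line-counting logic are hypothetical placeholders rather than part of maggma; only the `get_items` / `process_item` / `update_targets` structure follows the standard `Builder` interface described in [Writing a Builder](simple_builder.md).
+
+```python
+from maggma.core import Builder
+
+
+class FileContentsBuilder(Builder):
+    """Hypothetical example: record the number of lines in each file of a FileStore."""
+
+    def __init__(self, source, target, **kwargs):
+        self.source = source  # e.g., a FileStore
+        self.target = target  # any writable Store
+        super().__init__(sources=[source], targets=[target], **kwargs)
+
+    def get_items(self):
+        # each item is a FileStore record like the one shown in "Creating the FileStore"
+        return self.source.query()
+
+    def process_item(self, item):
+        # turn the raw `contents` string into a small structured document (no I/O here)
+        contents = item.get("contents", "")
+        return {
+            self.target.key: item[self.source.key],
+            "n_lines": len(contents.splitlines()),
+        }
+
+    def update_targets(self, items):
+        # write each chunk of processed documents to the target Store
+        self.target.update(items)
+```
+
+Running a builder like this (for example, with a `MemoryStore` as the target) via its `run()` method would then populate the target with one small document per file.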
\ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 0768337eb..8c160e06d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,9 +4,9 @@ ## What is Maggma -Maggma is a framework to build data pipelines from files on disk all the way to a REST API in scientific environments. Maggma has been developed by the Materials Project (MP) team at Lawrence Berkeley Labs. +Maggma is a framework to build data pipelines from files on disk all the way to a REST API in scientific environments. Maggma has been developed by the Materials Project (MP) team at Lawrence Berkeley National Laboratory. -Maggma is written in [Python](http://docs.python-guide.org/en/latest/) and supports Python 3.6+. +Maggma is written in [Python](http://docs.python-guide.org/en/latest/) and supports Python 3.7+. ## Installation from PyPI diff --git a/docs/reference/stores.md b/docs/reference/stores.md index 870980182..78209268f 100644 --- a/docs/reference/stores.md +++ b/docs/reference/stores.md @@ -1,5 +1,7 @@ ::: maggma.stores.mongolike +::: maggma.stores.file_store + ::: maggma.stores.gridfs ::: maggma.stores.aws diff --git a/mkdocs.yml b/mkdocs.yml index 35bbab209..cef503417 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,6 +10,7 @@ nav: - Core Concepts: concepts.md - Getting Started: - Using Stores: getting_started/stores.md + - Working with FileStore: getting_started/using_file_store.md - Writing a Builder: getting_started/simple_builder.md - Running a Builder Pipeline: getting_started/running_builders.md - Advanced Builders: getting_started/advanced_builder.md diff --git a/requirements-docs.txt b/requirements-docs.txt index ca94c288e..6565f0a67 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -1,5 +1,5 @@ mkdocs==1.3.0 -mkdocs-material==8.2.14 +mkdocs-material==8.2.16 mkdocs-minify-plugin==0.5.0 mkdocstrings==0.18.1 jinja2<3.2.0 diff --git a/requirements-testing.txt b/requirements-testing.txt index 92c5409a9..57edfb3cd 100644 --- a/requirements-testing.txt +++ b/requirements-testing.txt @@ -11,4 +11,4 @@ mypy==0.950 mypy-extensions==0.4.3 responses<0.21.0 types-PyYAML==6.0.7 -types-setuptools==57.4.14 \ No newline at end of file +types-setuptools==57.4.17 \ No newline at end of file diff --git a/setup.py b/setup.py index 61f068bbb..082cdff64 100644 --- a/setup.py +++ b/setup.py @@ -52,7 +52,7 @@ }, classifiers=[ "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.6", + "Programming Language :: Python :: 3.7", "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Science/Research", "Intended Audience :: System Administrators", diff --git a/src/maggma/stores/file_store.py b/src/maggma/stores/file_store.py index 3cc1e9a76..f78d5e876 100644 --- a/src/maggma/stores/file_store.py +++ b/src/maggma/stores/file_store.py @@ -494,7 +494,7 @@ def remove_docs(self, criteria: Dict, confirm: bool = False): if len(docs) > 0 and not confirm: raise StoreError( - f"Warning! This command is about to delete {len(docs)} items from disk!" + f"Warning! This command is about to delete {len(docs)} items from disk! " "If this is what you want, reissue this command with confirm=True." )