Commit

run pre-commit, fix errs and spelling

rkingsbury committed Jul 31, 2023
1 parent 081c500 commit a3b90a7

Showing 76 changed files with 685 additions and 1,560 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/testing.yml
@@ -29,7 +29,7 @@ jobs:
run: |
pip install pre-commit
pre-commit run
test:
needs: lint
services:
4 changes: 2 additions & 2 deletions docs/concepts.md
@@ -12,7 +12,7 @@ s2 -- Builder 3-->s4(Store 4)

## Store

A major challenge in building scalable data piplines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However,
Maggma makes it possible to use that same syntax to query other types of databases, such as Amazon S3, GridFS, or even files on disk.
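
As a rough sketch of what that unified interface looks like in practice (the store choice and data below are illustrative, not taken from this commit):

```python
from maggma.stores import MemoryStore

store = MemoryStore(key="task_id")
store.connect()

# documents are plain dicts, identified by the store's key field
store.update([{"task_id": 1, "energy": -1.2}, {"task_id": 2, "energy": 0.8}])

# the same MongoDB-style query syntax works regardless of the backend
for doc in store.query(criteria={"energy": {"$lt": 0}}):
    print(doc["task_id"])
```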

@@ -34,4 +34,4 @@ Both `get_items` and `update_targets` can perform IO (input/output) to the data

Another challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with it's configuration parameters in a process called serialization. These dictionaries can then be converted back to the origianl Maggma object without having to know what class it belonged. `MSONable` does this by injecting in `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
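
A minimal sketch of that round trip, using an invented class name (any `MSONable` Maggma object behaves this way):

```python
from monty.json import MSONable, MontyDecoder

class MultiplyBuilder(MSONable):  # hypothetical stand-in for a real builder
    def __init__(self, multiplier: int = 3):
        self.multiplier = multiplier

d = MultiplyBuilder(multiplier=5).as_dict()
# d carries "@module" and "@class" alongside {"multiplier": 5}, so the
# object can be rebuilt without knowing its class ahead of time
rebuilt = MontyDecoder().process_decoded(d)
assert rebuilt.multiplier == 5
```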
2 changes: 1 addition & 1 deletion docs/getting_started/advanced_builder.md
@@ -42,4 +42,4 @@ Since `maggma` is designed around Mongo style data sources and sinks, building i
`maggma` implements templates for builders that have many of these advanced features listed above:

- [MapBuilder](map_builder.md) Creates one-to-one document mapping of items in the source Store to the transformed documents in the target Store.
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the traget Store
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the target Store
6 changes: 3 additions & 3 deletions docs/getting_started/group_builder.md
@@ -56,7 +56,7 @@ class ResupplyBuilder(GroupBuilder):
super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs)
```

Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more usefull names is a good idea in writing builders to make it clearer what the underlying data should look like.
Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more useful names is a good idea in writing builders to make it clearer what the underlying data should look like.

`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters.

@@ -65,7 +65,7 @@ Note that unlike the previous `MapBuilder` example, we didn't call the source an
- store_process_timeout: adds the process time into the target document for profiling
- retry_failed: retries running the process function on previously failed documents

One parameter that doens't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationshop makes determining orphaned documents very difficult.
One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationship makes determining orphaned documents very difficult.

Finally, let's get to the hard part, which is running our function. We do this by defining `unary_function`.

@@ -81,4 +81,4 @@ Finally let's get to the hard part which is running our function. We do this by
return {"resupply": resupply}
```

Just as in `MapBuilder`, we're not returning all the extra information typically kept in the originally item. Normally, we would have to write code that copies over the source `key` and convert it to the target `key`. Same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version.`GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values wil be put together and kept in `names`
Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`. The same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`.
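
For context, the full `unary_function` for this example might look roughly like the sketch below; the `quantity` and `minimum` fields are assumptions about the inventory schema, not part of this commit:

```python
from typing import Dict, List

def unary_function(self, items: List[Dict]) -> Dict:
    """Compute a resupply order for one group of inventory items (sketch)."""
    resupply = {}
    for item in items:
        # assumed schema: each item records how many are on hand and the floor
        if item["quantity"] <= item["minimum"]:
            resupply[item["name"]] = item["minimum"] * 2
    return {"resupply": resupply}
```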
4 changes: 2 additions & 2 deletions docs/getting_started/running_builders.md
@@ -15,7 +15,7 @@ my_builder = MultiplyBuilder(source_store,target_store,multiplier=3)
my_builder.run()
```

A better way to run this builder would be to use the `mrun` command line tool. Since evrything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:
A better way to run this builder would be to use the `mrun` command line tool. Since everything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:

``` python
from monty.serialization import dumpfn
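# assumed completion of the collapsed lines: the builder and stores come
# from the MultiplyBuilder example shown earlier in this file
my_builder = MultiplyBuilder(source_store, target_store, multiplier=3)
dumpfn(my_builder, "my_builder.json")
```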
@@ -29,7 +29,7 @@ Then we can run the builder using `mrun`:
``` shell
mrun my_builder.json
```

`mrun` has a number of usefull options:
`mrun` has a number of useful options:

``` shell
mrun --help
```
4 changes: 2 additions & 2 deletions docs/getting_started/simple_builder.md
@@ -52,7 +52,7 @@ The `__init__` for a builder can have any set of parameters. Generally, you want

Python type annotations provide a really nice way of documenting the types we expect and enabling later type checking using `mypy`. We defined the type for `source` and `target` as `Store` since we only care that it implements that pattern. How exactly these `Store`s operate doesn't concern us here.

Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributess:
Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributes:

``` python
self.source = source
self.target = target
self.multiplier = multiplier
self.kwargs = kwargs
```

@@ -243,4 +243,4 @@ Then we can define a prechunk method that modifies the `Builder` dict in place t
}
```

When distributed processing runs, it will modify the `Builder` dictionary in place by the prechunk dictionary. In this case, each builder distribute to a worker will get a modified `query` parameter that only runs on a subset of all posible keys.
When distributed processing runs, it will modify the `Builder` dictionary in place using the prechunk dictionary. In this case, each builder distributed to a worker will get a modified `query` parameter that only runs on a subset of all possible keys.
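
A sketch of such a `prechunk`, assuming the `maggma.utils.grouper` helper and the `Store.newer_in` method, and a builder whose only per-worker state is its `query`:

```python
from math import ceil
from typing import Dict, Iterator

from maggma.utils import grouper

def prechunk(self, number_splits: int) -> Iterator[Dict]:
    """Partition source keys so each worker queries only its own subset (sketch)."""
    keys = self.source.newer_in(self.target)
    chunk_size = ceil(len(keys) / number_splits)
    for split in grouper(keys, chunk_size):
        yield {"query": {self.source.key: {"$in": list(split)}}}
```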
2 changes: 1 addition & 1 deletion docs/getting_started/stores.md
@@ -11,7 +11,7 @@ Current working and tested `Store` include:
- `MongoStore`: interfaces to a MongoDB Collection
- `MemoryStore`: just a Store that exists temporarily in memory
- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files
- `FileStore`: query and add metadata to files stored on disk as if they were in a databsae
- `FileStore`: query and add metadata to files stored on disk as if they were in a database
- `GridFSStore`: interfaces to GridFS collection in MongoDB
- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- `ConcatStore`: concatenates several Stores together so they look like one Store
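
As a quick sketch, instantiating a couple of the stores above (the connection details are placeholders):

```python
from maggma.stores import JSONStore, MongoStore

# placeholder database and file names
tasks = MongoStore(database="my_db", collection_name="tasks", host="localhost", port=27017)
backup = JSONStore("backup.json")

tasks.connect()
backup.connect()
```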
6 changes: 3 additions & 3 deletions docs/getting_started/using_file_store.md
@@ -80,7 +80,7 @@ and for associating custom metadata (See ["Adding Metadata"](#adding-metadata) b
## Connecting and querying

As with any `Store`, you have to `connect()` before you can query any data from a `FileStore`. After that, you can use `query_one()` to examine a single document or
`query()` to return an interator of matching documents. For example, let's print the
`query()` to return an iterator of matching documents. For example, let's print the
parent directory of each of the files named "input.in" in our example `FileStore`:
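
A sketch of that query, assuming a `FileStore` rooted at a placeholder path and records that carry `name` and `parent` fields:

```python
from maggma.stores import FileStore

fs = FileStore("/path/to/my/data")  # placeholder path
fs.connect()

for doc in fs.query({"name": "input.in"}):
    print(doc["parent"])
```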

@@ -142,7 +142,7 @@ fs.add_metadata({"name":"input.in"}, {"tags":["preliminary"]})

### Automatic metadata

You can even define a function to automatically crate metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
You can even define a function to automatically create metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
extract information from any key in a `FileStore` record and pass the function as an argument to `add_metadata`.

For example, to extract the date from files named like '2022-05-07_experiment.csv':
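
Here is a hedged sketch of such a parser; the `auto_data` keyword argument and the record fields used are assumptions, not confirmed by this commit:

```python
from datetime import datetime

def get_date_from_name(doc):
    # assumed schema: the record's "name" field starts with YYYY-MM-DD
    date_str = doc["name"].split("_")[0]
    return {"date": datetime.strptime(date_str, "%Y-%m-%d")}

# assumed keyword: apply the parser to every record in the FileStore
fs.add_metadata(auto_data=get_date_from_name)
```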
@@ -195,7 +195,7 @@ maggma.core.store.StoreError: (StoreError(...), 'Warning! This command is about
Now that you can access your files on disk via a `FileStore`, it's time to write a `Builder` to read and process the data (see [Writing a Builder](simple_builder.md)).
Keep in mind that `get_items` will return documents like the one shown in (#creating-the-filestore). You can then use `process_items` to

- Create strucured data from the `contents`
- Create structured data from the `contents`
- Open the file for reading using a custom piece of code
- etc.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -97,4 +97,5 @@ explicit_package_bases = true
no_implicit_optional = false

[tool.codespell]
ignore-words-list = "ot"
ignore-words-list = "ot,nin"
skip = 'docs/CHANGELOG.md,tests/test_files/*'
1 change: 0 additions & 1 deletion src/maggma/__init__.py
@@ -1,4 +1,3 @@
# coding: utf-8
""" Primary Maggma module """
from pkg_resources import DistributionNotFound, get_distribution

6 changes: 3 additions & 3 deletions src/maggma/api/API.py
@@ -23,8 +23,8 @@ def __init__(
version: str = "v0.0.0",
debug: bool = False,
heartbeat_meta: Optional[Dict] = None,
description: str = None,
tags_meta: List[Dict] = None,
description: Optional[str] = None,
tags_meta: Optional[List[Dict]] = None,
):
"""
Args:
@@ -33,7 +33,7 @@ def __init__(
version: the version for this API
debug: turns debug on in FastAPI
heartbeat_meta: dictionary of additional metadata to include in the heartbeat response
description: decription of the API to be used in the generated docs
description: description of the API to be used in the generated docs
tags_meta: descriptions of tags to be used in the generated docs
"""
self.title = title
15 changes: 4 additions & 11 deletions src/maggma/api/models.py
@@ -20,18 +20,15 @@ class Meta(BaseModel):

api_version: str = Field(
__version__,
description="a string containing the version of the Materials API "
"implementation, e.g. v0.9.5",
description="a string containing the version of the Materials API implementation, e.g. v0.9.5",
)

time_stamp: datetime = Field(
description="a string containing the date and time at which the query was executed",
default_factory=datetime.utcnow,
)

total_doc: Optional[int] = Field(
None, description="the total number of documents available for this query", ge=0
)
total_doc: Optional[int] = Field(None, description="the total number of documents available for this query", ge=0)

class Config:
extra = "allow"
@@ -56,9 +53,7 @@ class Response(GenericModel, Generic[DataT]):
"""

data: Optional[List[DataT]] = Field(None, description="List of returned data")
errors: Optional[List[Error]] = Field(
None, description="Any errors on processing this query"
)
errors: Optional[List[Error]] = Field(None, description="Any errors on processing this query")
meta: Optional[Meta] = Field(None, description="Extra information for the query")

@validator("errors", always=True)
@@ -92,8 +87,6 @@ class S3URLDoc(BaseModel):
description="Pre-signed download URL",
)

requested_datetime: datetime = Field(
..., description="Datetime for when URL was requested"
)
requested_datetime: datetime = Field(..., description="Datetime for when URL was requested")

expiry_datetime: datetime = Field(..., description="Expiry datetime of the URL")
2 changes: 1 addition & 1 deletion src/maggma/api/query_operator/core.py
@@ -8,7 +8,7 @@

class QueryOperator(MSONable, metaclass=ABCMeta):
"""
Base Query Operator class for defining powerfull query language
Base Query Operator class for defining powerful query language
in the Materials API
"""

48 changes: 10 additions & 38 deletions src/maggma/api/query_operator/dynamic.py
@@ -26,9 +26,7 @@ def __init__(
self.excluded_fields = excluded_fields

all_fields: Dict[str, ModelField] = model.__fields__
param_fields = fields or list(
set(all_fields.keys()) - set(excluded_fields or [])
)
param_fields = fields or list(set(all_fields.keys()) - set(excluded_fields or []))

# Convert the fields into operator tuples
ops = [
@@ -49,9 +47,7 @@ def query(**kwargs) -> STORE_PARAMS:
try:
criteria.append(self.mapping[k](v))
except KeyError:
raise KeyError(
f"Cannot find key {k} in current query to database mapping"
)
raise KeyError(f"Cannot find key {k} in current query to database mapping")

final_crit = {}
for entry in criteria:
@@ -74,26 +70,22 @@ def query(**kwargs) -> STORE_PARAMS:
for op in ops
]

setattr(query, "__signature__", inspect.Signature(signatures))
query.__signature__ = inspect.Signature(signatures)

self.query = query # type: ignore

def query(self):
"Stub query function for abstract class"
pass

@abstractmethod
def field_to_operator(
self, name: str, field: ModelField
) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
def field_to_operator(self, name: str, field: ModelField) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
"""
Converts a PyDantic ModelField into a Tuple with the
- query param name,
- query param type
- FastAPI Query object,
- and callable to convert the value into a query dict
"""
pass

@classmethod
def from_dict(cls, d):
Expand All @@ -115,9 +107,7 @@ def as_dict(self) -> Dict:
class NumericQuery(DynamicQueryOperator):
"Query Operator to enable searching on numeric fields"

def field_to_operator(
self, name: str, field: ModelField
) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
def field_to_operator(self, name: str, field: ModelField) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
"""
Converts a PyDantic ModelField into a Tuple with the
query_param name,
@@ -181,11 +171,7 @@ def field_to_operator(
default=None,
description=f"Query for {title} being any of these values. Provide a comma separated list.",
),
lambda val: {
f"{field.name}": {
"$in": [int(entry.strip()) for entry in val.split(",")]
}
},
lambda val: {f"{field.name}": {"$in": [int(entry.strip()) for entry in val.split(",")]}},
),
(
f"{field.name}_neq_any",
Expand All @@ -195,11 +181,7 @@ def field_to_operator(
description=f"Query for {title} being not any of these values. \
Provide a comma separated list.",
),
lambda val: {
f"{field.name}": {
"$nin": [int(entry.strip()) for entry in val.split(",")]
}
},
lambda val: {f"{field.name}": {"$nin": [int(entry.strip()) for entry in val.split(",")]}},
),
]
)
@@ -210,9 +192,7 @@ def field_to_operator(
class StringQueryOperator(DynamicQueryOperator):
"Query Operator to enable searching on numeric fields"

def field_to_operator(
self, name: str, field: ModelField
) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
def field_to_operator(self, name: str, field: ModelField) -> List[Tuple[str, Any, Query, Callable[..., Dict]]]:
"""
Converts a PyDantic ModelField into a Tuple with the
query_param name,
@@ -253,11 +233,7 @@ def field_to_operator(
default=None,
description=f"Query for {title} being any of these values. Provide a comma separated list.",
),
lambda val: {
f"{field.name}": {
"$in": [entry.strip() for entry in val.split(",")]
}
},
lambda val: {f"{field.name}": {"$in": [entry.strip() for entry in val.split(",")]}},
),
(
f"{field.name}_neq_any",
Expand All @@ -266,11 +242,7 @@ def field_to_operator(
default=None,
description=f"Query for {title} being not any of these values. Provide a comma separated list",
),
lambda val: {
f"{field.name}": {
"$nin": [entry.strip() for entry in val.split(",")]
}
},
lambda val: {f"{field.name}": {"$nin": [entry.strip() for entry in val.split(",")]}},
),
]

4 changes: 1 addition & 3 deletions src/maggma/api/query_operator/pagination.py
@@ -35,8 +35,7 @@ def query(
),
_limit: int = Query(
default_limit,
description="Max number of entries to return in a single query."
f" Limited to {max_limit}.",
description=f"Max number of entries to return in a single query. Limited to {max_limit}.",
),
) -> STORE_PARAMS:
"""
@@ -82,7 +81,6 @@ def query(

def query(self):
"Stub query function for abstract class"
pass

def meta(self) -> Dict:
"""