Linting, CI, dependencies, and infrastructure updates #810

Merged: 11 commits, Jul 31, 2023
5 changes: 0 additions & 5 deletions .flake8

This file was deleted.

32 changes: 26 additions & 6 deletions .github/pull_request_template.md
@@ -1,9 +1,29 @@
*Start with a description of this PR. Then edit the list below to the items that make sense for your PR scope, and check off the boxes as you go!*
## Summary

## Contributor Checklist
Major changes:

- [ ] I have broken down my PR scope into the following TODO tasks
- [ ] task 1
- [ ] task 2
- feature 1: ...
- fix 1: ...

## Todos

If this is work in progress, what else needs to be done?

- feature 2: ...
- fix 2:

## Checklist

- [ ] Google format doc strings added.
- [ ] Code linted with `ruff`. (For guidance in fixing rule violations, see the [rule list](https://beta.ruff.rs/docs/rules/))
- [ ] Type annotations included. Check with `mypy`.
- [ ] Tests added for new features/fixes.
- [ ] I have run the tests locally and they passed.
- [ ] I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR
<!-- - [ ] If applicable, new classes/functions/modules have [`duecredit`](https://github.com/duecredit/duecredit) `@due.dcite` decorators to reference relevant papers by DOI ([example](https://github.com/materialsproject/pymatgen/blob/91dbe6ee9ed01d781a9388bf147648e20c6d58e0/pymatgen/core/lattice.py#L1168-L1172)) -->

Tip: Install `pre-commit` hooks to auto-check types and linting before every commit:

```sh
pip install -U pre-commit
pre-commit install
```
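If you prefer to run the individual checks outside of `pre-commit`, a rough local equivalent looks like the following; the exact flags depend on the installed `ruff` version, and the package path is illustrative:

```sh
pip install -U ruff mypy
ruff check --fix .   # lint and auto-fix, mirroring the checklist item above
mypy src/            # type-check; point this at the actual package directory
```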
29 changes: 14 additions & 15 deletions .github/workflows/testing.yml
@@ -14,22 +14,21 @@ on:
jobs:
  lint:
    runs-on: ubuntu-latest

    strategy:
      max-parallel: 1
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python 3.8
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install pre-commit

      - name: Run pre-commit
        run: |
          pre-commit run --all-files --show-diff-on-failure
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.11
          cache: pip
      - name: Run pre-commit
        run: |
          pip install pre-commit
          pre-commit run

  test:
    needs: lint
89 changes: 89 additions & 0 deletions .github/workflows/upgrade-dependencies.yml
@@ -0,0 +1,89 @@
# https://www.oddbird.net/2022/06/01/dependabot-single-pull-request/
# https://github.com/materialsproject/MPContribs/blob/master/.github/workflows/upgrade-dependencies.yml
name: upgrade dependencies

on:
  workflow_dispatch: # Allow running on-demand
  schedule:
    # Runs every Monday at 8:00 UTC (4:00 Eastern)
    - cron: '0 8 * * 1'

jobs:
  upgrade:
    name: ${{ matrix.package }} (${{ matrix.os }}/py${{ matrix.python-version }})
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
        package: ["maggma"]
        python-version: ["3.8", "3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Upgrade Python dependencies
        shell: bash
        run: |
          python${{ matrix.python-version }} -m pip install --upgrade pip pip-tools
          cd ${{ matrix.package }}
          python${{ matrix.python-version }} -m piptools compile -q --upgrade --resolver=backtracking -o requirements/${{ matrix.os }}_py${{ matrix.python-version }}.txt
          python${{ matrix.python-version }} -m piptools compile -q --upgrade --resolver=backtracking --all-extras -o requirements/${{ matrix.os }}_py${{ matrix.python-version }}_extras.txt
      - name: Detect changes
        id: changes
        shell: bash
        run: |
          #git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | awk '{print $4}' | sort -u
          #sha1=$(git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | awk '{print $4}' | sort -u | head -n1)
          #[[ $sha1 == "0000000000000000000000000000000000000000" ]] && git update-index --really-refresh ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt
          echo "count=$(git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | wc -l | xargs)" >> $GITHUB_OUTPUT
          echo "files=$(git ls-files --exclude-standard --others ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | wc -l | xargs)" >> $GITHUB_OUTPUT
      - name: commit & push changes
        if: steps.changes.outputs.count > 0 || steps.changes.outputs.files > 0
        shell: bash
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add ${{ matrix.package }}/requirements
          git commit -m "update dependencies for ${{ matrix.package }} (${{ matrix.os }}/py${{ matrix.python-version }})"
          git push -f origin ${{ github.ref_name }}:auto-dependency-upgrades-${{ matrix.package }}-${{ matrix.os }}-py${{ matrix.python-version }}

  pull_request:
    name: Merge all branches and open PR
    runs-on: ubuntu-latest
    needs: upgrade
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: detect auto-upgrade-dependency branches
        id: changes
        run: echo "count=$(git branch -r | grep auto-dependency-upgrades- | wc -l | xargs)" >> $GITHUB_OUTPUT
      - name: merge all auto-dependency-upgrades branches
        if: steps.changes.outputs.count > 0
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git checkout -b auto-dependency-upgrades
          git branch -r | grep auto-dependency-upgrades- | xargs -I {} git merge {}
          git rebase ${GITHUB_REF##*/}
          git push -f origin auto-dependency-upgrades
          git branch -r | grep auto-dependency-upgrades- | cut -d/ -f2 | xargs -I {} git push origin :{}
      - name: Open pull request if needed
        if: steps.changes.outputs.count > 0
        env:
          GITHUB_TOKEN: ${{ secrets.PAT }}
        # Only open a PR if the branch is not attached to an existing one
        run: |
          PR=$(gh pr list --head auto-dependency-upgrades --json number -q '.[0].number')
          if [ -z $PR ]; then
            gh pr create \
              --head auto-dependency-upgrades \
              --title "Automated dependency upgrades" \
              --body "Full log: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          else
            echo "Pull request already exists, won't create a new one."
          fi
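For reference, a rough local equivalent of a single matrix entry of this workflow would look something like the following (the output file names are illustrative):

```sh
pip install --upgrade pip pip-tools
pip-compile -q --upgrade --resolver=backtracking -o requirements/ubuntu-latest_py3.11.txt
pip-compile -q --upgrade --resolver=backtracking --all-extras -o requirements/ubuntu-latest_py3.11_extras.txt
```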
24 changes: 19 additions & 5 deletions .pre-commit-config.yaml
@@ -2,17 +2,31 @@ default_stages: [commit]

default_install_hook_types: [pre-commit, commit-msg]

ci:
  autoupdate_schedule: monthly
  # skip: [mypy]
  autofix_commit_msg: pre-commit auto-fixes
  autoupdate_commit_msg: pre-commit autoupdate

repos:
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.0.261
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.0.280
    hooks:
      - id: ruff
        args: [--fix, --ignore, "D,E501"]
        args: [--fix, --ignore, D]

  - repo: https://github.com/psf/black
    rev: 23.3.0
    rev: 23.7.0
    hooks:
      - id: black
      - id: black-jupyter

  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.5
    hooks:
      - id: codespell
        stages: [commit, commit-msg]
        exclude_types: [html]
        additional_dependencies: [tomli] # needed to read pyproject.toml below py3.11

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
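Typical local usage of this configuration, using standard `pre-commit` commands:

```sh
pre-commit install            # install the git hooks defined above
pre-commit run --all-files    # run every hook against the whole repository once
pre-commit autoupdate         # bump the pinned hook revs, as pre-commit.ci does monthly
```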
4 changes: 2 additions & 2 deletions docs/concepts.md
@@ -12,7 +12,7 @@ s2 -- Builder 3-->s4(Store 4)

## Store

A major challenge in building scalable data piplines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However,
Maggma makes it possible to use that same syntax to query other types of databases, such as Amazon S3, GridFS, or even files on disk.

@@ -34,4 +34,4 @@ Both `get_items` and `update_targets` can perform IO (input/output) to the data

Another challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with it's configuration parameters in a process called serialization. These dictionaries can then be converted back to the origianl Maggma object without having to know what class it belonged. `MSONable` does this by injecting in `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know which class they belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
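As a minimal sketch of that round trip (the choice of `MemoryStore` and its arguments are illustrative, not part of this PR):

```python
from maggma.stores import MemoryStore
from monty.json import MontyDecoder

store = MemoryStore(collection_name="temp")

# Serialization: as_dict() embeds "@module" and "@class" alongside the
# configuration parameters.
d = store.as_dict()

# Deserialization: MontyDecoder uses those keys to locate the original class
# and rebuild an equivalent object without us naming the class explicitly.
rebuilt = MontyDecoder().process_decoded(d)
```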
2 changes: 1 addition & 1 deletion docs/getting_started/advanced_builder.md
@@ -42,4 +42,4 @@ Since `maggma` is designed around Mongo style data sources and sinks, building i
`maggma` implements templates for builders that have many of these advanced features listed above:

- [MapBuilder](map_builder.md) Creates one-to-one document mapping of items in the source Store to the transformed documents in the target Store.
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the traget Store
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the target Store
6 changes: 3 additions & 3 deletions docs/getting_started/group_builder.md
@@ -56,7 +56,7 @@ class ResupplyBuilder(GroupBuilder):
super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs)
```

Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more usefull names is a good idea in writing builders to make it clearer what the underlying data should look like.
Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more useful names is a good idea in writing builders to make it clearer what the underlying data should look like.

`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters.

@@ -65,7 +65,7 @@ Note that unlike the previous `MapBuilder` example, we didn't call the source an
- store_process_timeout: adds the process time into the target document for profiling
- retry_failed: retries running the process function on previously failed documents

One parameter that doens't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationshop makes determining orphaned documents very difficult.
One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationship makes determining orphaned documents very difficult.

Finally let's get to the hard part which is running our function. We do this by defining `unary_function`

@@ -81,4 +81,4 @@ Finally let's get to the hard part which is running our function. We do this by
return {"resupply": resupply}
```

Just as in `MapBuilder`, we're not returning all the extra information typically kept in the originally item. Normally, we would have to write code that copies over the source `key` and convert it to the target `key`. Same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version.`GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values wil be put together and kept in `names`
Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`. Same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`.
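A hypothetical target document for the `ResupplyBuilder` above might therefore look like this (field values are made up for illustration):

```python
target_doc = {
    "type": "widgets",                  # the grouping property from grouping_properties
    "resupply": 40,                     # output of unary_function
    "names": ["bolt-001", "bolt-002"],  # plural of source.key, collecting every grouped "name"
}
```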
4 changes: 2 additions & 2 deletions docs/getting_started/running_builders.md
@@ -15,7 +15,7 @@ my_builder = MultiplyBuilder(source_store,target_store,multiplier=3)
my_builder.run()
```

A better way to run this builder would be to use the `mrun` command line tool. Since evrything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:
A better way to run this builder would be to use the `mrun` command line tool. Since everything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:

``` python
from monty.serialization import dumpfn
@@ -29,7 +29,7 @@ Then we can run the builder using `mrun`:
mrun my_builder.json
```

`mrun` has a number of usefull options:
`mrun` has a number of useful options:

``` shell
mrun --help
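Put together, the dump-and-run workflow sketched above looks roughly like this; `MultiplyBuilder` and the two stores are the objects from the earlier documentation example, shown here as placeholders:

```python
from monty.serialization import dumpfn, loadfn

my_builder = MultiplyBuilder(source_store, target_store, multiplier=3)
dumpfn(my_builder, "my_builder.json")   # writes the MSONable dict to disk

# `mrun my_builder.json` (or loadfn) can later rebuild the identical builder
rebuilt = loadfn("my_builder.json")
```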
4 changes: 2 additions & 2 deletions docs/getting_started/simple_builder.md
@@ -52,7 +52,7 @@ The `__init__` for a builder can have any set of parameters. Generally, you want

Python type annotations provide a really nice way of documenting the types we expect and later type checking with `mypy`. We defined the type for `source` and `target` as `Store` since we only care that it implements that pattern. How exactly these `Store`s operate doesn't concern us here.

Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributess:
Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributes:

``` python
self.source = source
@@ -243,4 +243,4 @@ Then we can define a prechunk method that modifies the `Builder` dict in place t
}
```

When distributed processing runs, it will modify the `Builder` dictionary in place by the prechunk dictionary. In this case, each builder distribute to a worker will get a modified `query` parameter that only runs on a subset of all posible keys.
When distributed processing runs, it will modify the `Builder` dictionary in place using the prechunk dictionary. In this case, each builder distributed to a worker will get a modified `query` parameter that only runs on a subset of all possible keys.
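As a hedged sketch of what such a `prechunk` implementation can look like for the `MultiplyBuilder` example (the exact signature and helpers may differ slightly between maggma versions):

```python
from math import ceil

from maggma.utils import grouper


def prechunk(self, number_splits: int):
    """Split the source keys into roughly equal chunks, one per worker."""
    keys = self.source.distinct(self.source.key)
    chunk_size = ceil(len(keys) / number_splits)
    for split in grouper(keys, chunk_size):
        # Each yielded dict is merged into the builder definition so that
        # one worker only processes its subset of keys.
        yield {"query": {self.source.key: {"$in": list(split)}}}
```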
2 changes: 1 addition & 1 deletion docs/getting_started/stores.md
@@ -11,7 +11,7 @@ Current working and tested `Store` include:
- `MongoStore`: interfaces to a MongoDB Collection
- `MemoryStore`: just a Store that exists temporarily in memory
- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files
- `FileStore`: query and add metadata to files stored on disk as if they were in a databsae
- `FileStore`: query and add metadata to files stored on disk as if they were in a database
- `GridFSStore`: interfaces to GridFS collection in MongoDB
- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- `ConcatStore`: concatenates several Stores together so they look like one Store
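Whichever backend you pick, the workflow is the same; for example, with a `MongoStore` (the connection details below are placeholders):

```python
from maggma.stores import MongoStore

store = MongoStore(
    database="my_database",
    collection_name="tasks",
    host="localhost",
    port=27017,
)
store.connect()
doc = store.query_one({"task_id": "mp-149"})   # PyMongo-style criteria
store.close()
```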
6 changes: 3 additions & 3 deletions docs/getting_started/using_file_store.md
@@ -80,7 +80,7 @@ and for associating custom metadata (See ["Adding Metadata"](#adding-metadata) b
## Connecting and querying

As with any `Store`, you have to `connect()` before you can query any data from a `FileStore`. After that, you can use `query_one()` to examine a single document or
`query()` to return an interator of matching documents. For example, let's print the
`query()` to return an iterator of matching documents. For example, let's print the
parent directory of each of the files named "input.in" in our example `FileStore`:

```python
Expand Down Expand Up @@ -142,7 +142,7 @@ fs.add_metadata({"name":"input.in"}, {"tags":["preliminary"]})

### Automatic metadata

You can even define a function to automatically crate metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
You can even define a function to automatically create metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
extract information from any key in a `FileStore` record and pass the function as an argument to `add_metadata`.

For example, to extract the date from files named like '2022-05-07_experiment.csv'
Expand Down Expand Up @@ -195,7 +195,7 @@ maggma.core.store.StoreError: (StoreError(...), 'Warning! This command is about
Now that you can access your files on disk via a `FileStore`, it's time to write a `Builder` to read and process the data (see [Writing a Builder](simple_builder.md)).
Keep in mind that `get_items` will return documents like the one shown in (#creating-the-filestore). You can then use `process_items` to

- Create strucured data from the `contents`
- Create structured data from the `contents`
- Open the file for reading using a custom piece of code
- etc.

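A minimal end-to-end sketch of the `FileStore` usage covered in this page; the directory path is a placeholder, and `read_only=False` is assumed to be needed for `add_metadata`:

```python
from maggma.stores import FileStore

fs = FileStore("/path/to/my/data", read_only=False)
fs.connect()

doc = fs.query_one({"name": "input.in"})                          # query file records like documents
fs.add_metadata({"name": "input.in"}, {"tags": ["preliminary"]})  # attach custom metadata
fs.close()
```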