Linting, CI, dependencies, and infrastructure updates #810

Merged: 11 commits, Jul 31, 2023
5 changes: 0 additions & 5 deletions .flake8

This file was deleted.

32 changes: 26 additions & 6 deletions .github/pull_request_template.md
@@ -1,9 +1,29 @@
*Start with a description of this PR. Then edit the list below to the items that make sense for your PR scope, and check off the boxes as you go!*
## Summary

## Contributor Checklist
Major changes:

- [ ] I have broken down my PR scope into the following TODO tasks
- [ ] task 1
- [ ] task 2
- feature 1: ...
- fix 1: ...

## Todos

If this is work in progress, what else needs to be done?

- feature 2: ...
- fix 2:

## Checklist

- [ ] Google format doc strings added.
- [ ] Code linted with `ruff`. (For guidance in fixing rule violations, see the [rule list](https://beta.ruff.rs/docs/rules/))
- [ ] Type annotations included. Check with `mypy`.
- [ ] Tests added for new features/fixes.
- [ ] I have run the tests locally and they passed.
- [ ] I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR
<!-- - [ ] If applicable, new classes/functions/modules have [`duecredit`](https://github.com/duecredit/duecredit) `@due.dcite` decorators to reference relevant papers by DOI ([example](https://github.com/materialsproject/pymatgen/blob/91dbe6ee9ed01d781a9388bf147648e20c6d58e0/pymatgen/core/lattice.py#L1168-L1172)) -->

Tip: Install `pre-commit` hooks to auto-check types and linting before every commit:

```sh
pip install -U pre-commit
pre-commit install
```
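If you prefer to run the individual checks outside of `pre-commit`, a rough local equivalent looks like the following; the exact flags depend on the installed `ruff` version, and the package path is illustrative:

```sh
pip install -U ruff mypy
ruff check --fix .   # lint and auto-fix, mirroring the checklist item above
mypy src/            # type-check; point this at the actual package directory
```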
29 changes: 14 additions & 15 deletions .github/workflows/testing.yml
@@ -14,22 +14,21 @@ on:
jobs:
  lint:
    runs-on: ubuntu-latest

    strategy:
      max-parallel: 1
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python 3.8
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install dependencies
        run: |
          pip install pre-commit

      - name: Run pre-commit
        run: |
          pre-commit run --all-files --show-diff-on-failure
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.11
          cache: pip
      - name: Run pre-commit
        run: |
          pip install pre-commit
          pre-commit run

  test:
    needs: lint
89 changes: 89 additions & 0 deletions .github/workflows/upgrade-dependencies.yml
@@ -0,0 +1,89 @@
# https://www.oddbird.net/2022/06/01/dependabot-single-pull-request/
# https://github.com/materialsproject/MPContribs/blob/master/.github/workflows/upgrade-dependencies.yml
name: upgrade dependencies

on:
  workflow_dispatch: # Allow running on-demand
  schedule:
    # Runs every Monday at 8:00 UTC (4:00 Eastern)
    - cron: '0 8 * * 1'

jobs:
  upgrade:
    name: ${{ matrix.package }} (${{ matrix.os }}/py${{ matrix.python-version }})
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: ['ubuntu-latest', 'macos-latest', 'windows-latest']
        package: ["maggma"]
        python-version: ["3.8", "3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'
      - name: Upgrade Python dependencies
        shell: bash
        run: |
          python${{ matrix.python-version }} -m pip install --upgrade pip pip-tools
          cd ${{ matrix.package }}
          python${{ matrix.python-version }} -m piptools compile -q --upgrade --resolver=backtracking -o requirements/${{ matrix.os }}_py${{ matrix.python-version }}.txt
          python${{ matrix.python-version }} -m piptools compile -q --upgrade --resolver=backtracking --all-extras -o requirements/${{ matrix.os }}_py${{ matrix.python-version }}_extras.txt
      - name: Detect changes
        id: changes
        shell: bash
        run: |
          #git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | awk '{print $4}' | sort -u
          #sha1=$(git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | awk '{print $4}' | sort -u | head -n1)
          #[[ $sha1 == "0000000000000000000000000000000000000000" ]] && git update-index --really-refresh ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt
          echo "count=$(git diff-index HEAD ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | wc -l | xargs)" >> $GITHUB_OUTPUT
          echo "files=$(git ls-files --exclude-standard --others ${{ matrix.package }}/requirements/${{ matrix.os }}_py${{ matrix.python-version }}*.txt | wc -l | xargs)" >> $GITHUB_OUTPUT
      - name: commit & push changes
        if: steps.changes.outputs.count > 0 || steps.changes.outputs.files > 0
        shell: bash
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add ${{ matrix.package }}/requirements
          git commit -m "update dependencies for ${{ matrix.package }} (${{ matrix.os }}/py${{ matrix.python-version }})"
          git push -f origin ${{ github.ref_name }}:auto-dependency-upgrades-${{ matrix.package }}-${{ matrix.os }}-py${{ matrix.python-version }}

  pull_request:
    name: Merge all branches and open PR
    runs-on: ubuntu-latest
    needs: upgrade
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: detect auto-upgrade-dependency branches
        id: changes
        run: echo "count=$(git branch -r | grep auto-dependency-upgrades- | wc -l | xargs)" >> $GITHUB_OUTPUT
      - name: merge all auto-dependency-upgrades branches
        if: steps.changes.outputs.count > 0
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git checkout -b auto-dependency-upgrades
          git branch -r | grep auto-dependency-upgrades- | xargs -I {} git merge {}
          git rebase ${GITHUB_REF##*/}
          git push -f origin auto-dependency-upgrades
          git branch -r | grep auto-dependency-upgrades- | cut -d/ -f2 | xargs -I {} git push origin :{}
      - name: Open pull request if needed
        if: steps.changes.outputs.count > 0
        env:
          GITHUB_TOKEN: ${{ secrets.PAT }}
        # Only open a PR if the branch is not attached to an existing one
        run: |
          PR=$(gh pr list --head auto-dependency-upgrades --json number -q '.[0].number')
          if [ -z $PR ]; then
            gh pr create \
              --head auto-dependency-upgrades \
              --title "Automated dependency upgrades" \
              --body "Full log: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
          else
            echo "Pull request already exists, won't create a new one."
          fi
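For reference, a rough local equivalent of a single matrix entry of this workflow would look something like the following (the output file names are illustrative):

```sh
pip install --upgrade pip pip-tools
pip-compile -q --upgrade --resolver=backtracking -o requirements/ubuntu-latest_py3.11.txt
pip-compile -q --upgrade --resolver=backtracking --all-extras -o requirements/ubuntu-latest_py3.11_extras.txt
```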
24 changes: 19 additions & 5 deletions .pre-commit-config.yaml
@@ -2,17 +2,31 @@ default_stages: [commit]

default_install_hook_types: [pre-commit, commit-msg]

ci:
  autoupdate_schedule: monthly
  # skip: [mypy]
  autofix_commit_msg: pre-commit auto-fixes
  autoupdate_commit_msg: pre-commit autoupdate

repos:
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.0.261
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.0.280
    hooks:
      - id: ruff
        args: [--fix, --ignore, "D,E501"]
        args: [--fix, --ignore, D]

  - repo: https://github.com/psf/black
    rev: 23.3.0
    rev: 23.7.0
    hooks:
      - id: black
      - id: black-jupyter

  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.5
    hooks:
      - id: codespell
        stages: [commit, commit-msg]
        exclude_types: [html]
        additional_dependencies: [tomli] # needed to read pyproject.toml below py3.11

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
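Typical local usage of this configuration, using standard `pre-commit` commands:

```sh
pre-commit install            # install the git hooks defined above
pre-commit run --all-files    # run every hook against the whole repository once
pre-commit autoupdate         # bump the pinned hook revs, as pre-commit.ci does monthly
```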
4 changes: 2 additions & 2 deletions docs/concepts.md
@@ -12,7 +12,7 @@ s2 -- Builder 3-->s4(Store 4)

## Store

A major challenge in building scalable data piplines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data
sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However,
Maggma makes it possible to use that same syntax to query other types of databases, such as Amazon S3, GridFS, or even files on disk.

@@ -34,4 +34,4 @@ Both `get_items` and `update_targets` can perform IO (input/output) to the data

Another challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with it's configuration parameters in a process called serialization. These dictionaries can then be converted back to the origianl Maggma object without having to know what class it belonged. `MSONable` does this by injecting in `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know which class they belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
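As a minimal sketch of that round trip (the choice of `MemoryStore` and its arguments are illustrative, not part of this PR):

```python
from maggma.stores import MemoryStore
from monty.json import MontyDecoder

store = MemoryStore(collection_name="temp")

# Serialization: as_dict() embeds "@module" and "@class" alongside the
# configuration parameters.
d = store.as_dict()

# Deserialization: MontyDecoder uses those keys to locate the original class
# and rebuild an equivalent object without us naming the class explicitly.
rebuilt = MontyDecoder().process_decoded(d)
```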
2 changes: 1 addition & 1 deletion docs/getting_started/advanced_builder.md
@@ -42,4 +42,4 @@ Since `maggma` is designed around Mongo style data sources and sinks, building i
`maggma` implements templates for builders that have many of these advanced features listed above:

- [MapBuilder](map_builder.md) Creates one-to-one document mapping of items in the source Store to the transformed documents in the target Store.
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the traget Store
- [GroupBuilder](group_builder.md) Creates many-to-one document mapping of items in the source Store to transformed documents in the target Store
6 changes: 3 additions & 3 deletions docs/getting_started/group_builder.md
@@ -56,7 +56,7 @@ class ResupplyBuilder(GroupBuilder):
super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs)
```

Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more usefull names is a good idea in writing builders to make it clearer what the underlying data should look like.
Note that unlike the previous `MapBuilder` example, we didn't call the source and target stores as such. Providing more useful names is a good idea in writing builders to make it clearer what the underlying data should look like.

`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters.

@@ -65,7 +65,7 @@ Note that unlike the previous `MapBuilder` example, we didn't call the source an
- store_process_timeout: adds the process time into the target document for profiling
- retry_failed: retries running the process function on previously failed documents

One parameter that doens't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationshop makes determining orphaned documents very difficult.
One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationship makes determining orphaned documents very difficult.

Finally let's get to the hard part which is running our function. We do this by defining `unary_function`

@@ -81,4 +81,4 @@ Finally let's get to the hard part which is running our function. We do this by
return {"resupply": resupply}
```

Just as in `MapBuilder`, we're not returning all the extra information typically kept in the originally item. Normally, we would have to write code that copies over the source `key` and convert it to the target `key`. Same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version.`GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values wil be put together and kept in `names`
Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`. Same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`.
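A hypothetical target document for the `ResupplyBuilder` above might therefore look like this (field values are made up for illustration):

```python
target_doc = {
    "type": "widgets",                  # the grouping property from grouping_properties
    "resupply": 40,                     # output of unary_function
    "names": ["bolt-001", "bolt-002"],  # plural of source.key, collecting every grouped "name"
}
```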
4 changes: 2 additions & 2 deletions docs/getting_started/running_builders.md
@@ -15,7 +15,7 @@ my_builder = MultiplyBuilder(source_store,target_store,multiplier=3)
my_builder.run()
```

A better way to run this builder would be to use the `mrun` command line tool. Since evrything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:
A better way to run this builder would be to use the `mrun` command line tool. Since everything in `maggma` is MSONable, we can use `monty` to dump the builders into a JSON file:

``` python
from monty.serialization import dumpfn
@@ -29,7 +29,7 @@ Then we can run the builder using `mrun`:
mrun my_builder.json
```

`mrun` has a number of usefull options:
`mrun` has a number of useful options:

``` shell
mrun --help
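Put together, the dump-and-run workflow sketched above looks roughly like this; `MultiplyBuilder` and the two stores are the objects from the earlier documentation example, shown here as placeholders:

```python
from monty.serialization import dumpfn, loadfn

my_builder = MultiplyBuilder(source_store, target_store, multiplier=3)
dumpfn(my_builder, "my_builder.json")   # writes the MSONable dict to disk

# `mrun my_builder.json` (or loadfn) can later rebuild the identical builder
rebuilt = loadfn("my_builder.json")
```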
4 changes: 2 additions & 2 deletions docs/getting_started/simple_builder.md
@@ -52,7 +52,7 @@ The `__init__` for a builder can have any set of parameters. Generally, you want

Python type annotations provide a really nice way of documenting the types we expect and later type checking with `mypy`. We defined the type for `source` and `target` as `Store` since we only care that it implements that pattern. How exactly these `Store`s operate doesn't concern us here.

Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributess:
Note that the `__init__` arguments: `source`, `target`, `multiplier`, and `kwargs` get saved as attributes:

``` python
self.source = source
@@ -243,4 +243,4 @@ Then we can define a prechunk method that modifies the `Builder` dict in place t
}
```

When distributed processing runs, it will modify the `Builder` dictionary in place by the prechunk dictionary. In this case, each builder distribute to a worker will get a modified `query` parameter that only runs on a subset of all posible keys.
When distributed processing runs, it will modify the `Builder` dictionary in place using the prechunk dictionary. In this case, each builder distributed to a worker will get a modified `query` parameter that only runs on a subset of all possible keys.
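As a hedged sketch of what such a `prechunk` implementation can look like for the `MultiplyBuilder` example (the exact signature and helpers may differ slightly between maggma versions):

```python
from math import ceil

from maggma.utils import grouper


def prechunk(self, number_splits: int):
    """Split the source keys into roughly equal chunks, one per worker."""
    keys = self.source.distinct(self.source.key)
    chunk_size = ceil(len(keys) / number_splits)
    for split in grouper(keys, chunk_size):
        # Each yielded dict is merged into the builder definition so that
        # one worker only processes its subset of keys.
        yield {"query": {self.source.key: {"$in": list(split)}}}
```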
2 changes: 1 addition & 1 deletion docs/getting_started/stores.md
@@ -11,7 +11,7 @@ Current working and tested `Store` include:
- `MongoStore`: interfaces to a MongoDB Collection
- `MemoryStore`: just a Store that exists temporarily in memory
- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files
- `FileStore`: query and add metadata to files stored on disk as if they were in a databsae
- `FileStore`: query and add metadata to files stored on disk as if they were in a database
- `GridFSStore`: interfaces to GridFS collection in MongoDB
- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- `ConcatStore`: concatenates several Stores together so they look like one Store
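Whichever backend you pick, the workflow is the same; for example, with a `MongoStore` (the connection details below are placeholders):

```python
from maggma.stores import MongoStore

store = MongoStore(
    database="my_database",
    collection_name="tasks",
    host="localhost",
    port=27017,
)
store.connect()
doc = store.query_one({"task_id": "mp-149"})   # PyMongo-style criteria
store.close()
```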
6 changes: 3 additions & 3 deletions docs/getting_started/using_file_store.md
@@ -80,7 +80,7 @@ and for associating custom metadata (See ["Adding Metadata"](#adding-metadata) b
## Connecting and querying

As with any `Store`, you have to `connect()` before you can query any data from a `FileStore`. After that, you can use `query_one()` to examine a single document or
`query()` to return an interator of matching documents. For example, let's print the
`query()` to return an iterator of matching documents. For example, let's print the
parent directory of each of the files named "input.in" in our example `FileStore`:

```python
Expand Down Expand Up @@ -142,7 +142,7 @@ fs.add_metadata({"name":"input.in"}, {"tags":["preliminary"]})

### Automatic metadata

You can even define a function to automatically crate metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
You can even define a function to automatically create metadata from file or directory names. For example, if you prefix all your files with datestamps (e.g., '2022-05-07_experiment.csv'), you can write a simple string parsing function to
extract information from any key in a `FileStore` record and pass the function as an argument to `add_metadata`.

For example, to extract the date from files named like '2022-05-07_experiment.csv'
Expand Down Expand Up @@ -195,7 +195,7 @@ maggma.core.store.StoreError: (StoreError(...), 'Warning! This command is about
Now that you can access your files on disk via a `FileStore`, it's time to write a `Builder` to read and process the data (see [Writing a Builder](simple_builder.md)).
Keep in mind that `get_items` will return documents like the one shown in (#creating-the-filestore). You can then use `process_items` to

- Create strucured data from the `contents`
- Create structured data from the `contents`
- Open the file for reading using a custom piece of code
- etc.

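A minimal end-to-end sketch of the `FileStore` usage covered in this page; the directory path is a placeholder, and `read_only=False` is assumed to be needed for `add_metadata`:

```python
from maggma.stores import FileStore

fs = FileStore("/path/to/my/data", read_only=False)
fs.connect()

doc = fs.query_one({"name": "input.in"})                          # query file records like documents
fs.add_metadata({"name": "input.in"}, {"tags": ["preliminary"]})  # attach custom metadata
fs.close()
```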