Merge remote-tracking branch 'origin/develop' into dynamic-graphs

ecmwf · Sep 17, 2024 · d998e51
2 parents a157104 + dbee83b
commit d998e51

Showing 17 changed files with 665 additions and 66 deletions.
diff --git a/.github/workflows/changelog-release-update.yml b/.github/workflows/changelog-release-update.yml
@@ -0,0 +1,34 @@
+# .github/workflows/changelog-release-update.yml
+name: "Update Changelog"
+
+on:
+  release:
+    types: [released]
+
+permissions:
+  pull-requests: write
+  contents: write
+
+jobs:
+  update:
+    runs-on: ubuntu-latest
+
+    steps:
+    - name: Checkout code
+      uses: actions/checkout@v4
+      with:
+        ref: ${{ github.event.release.target_commitish }}
+
+    - name: Update Changelog
+      uses: stefanzweifel/changelog-updater-action@v1
+      with:
+        latest-version: ${{ github.event.release.tag_name }}
+        heading-text: ${{ github.event.release.name }}
+
+    - name: Create Pull Request
+      uses: peter-evans/create-pull-request@v6
+      with:
+        branch: docs/changelog-update-${{ github.event.release.tag_name }}
+        title: '[Changelog] Update to ${{ github.event.release.tag_name }}'
+        add-paths: |
+          CHANGELOG.md
diff --git a/.gitignore b/.gitignore
@@ -121,6 +121,7 @@ celerybeat.pid
 
 # Environments
 .env
+.envrc
 .venv
 env/
 venv/

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -21,7 +21,7 @@ repos:
   - id: check-added-large-files # Check for large files added to git
   - id: check-merge-conflict # Check for files that contain merge conflict
 - repo: https://github.com/psf/black-pre-commit-mirror
-  rev: 24.4.2
+  rev: 24.8.0
   hooks:
   - id: black
     args: [--line-length=120]
@@ -34,7 +34,7 @@ repos:
     - --force-single-line-imports
     - --profile black
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.4.6
+  rev: v0.6.3
   hooks:
   - id: ruff
     # Next line is for documentation code snippets
@@ -65,6 +65,6 @@ repos:
   - id: optional-dependencies-all
     args: ["--inplace", "--exclude-keys=dev,docs,tests", "--group=dev=all,docs,tests"]
 - repo: https://github.com/tox-dev/pyproject-fmt
-  rev: "2.1.3"
+  rev: "2.2.1"
   hooks:
   - id: pyproject-fmt
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,19 +8,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Please add your functional changes to the appropriate section in the PR.
 Keep it human-readable, your future self will thank you!
 
-## [Unreleased]
+## [Unreleased](https://github.com/ecmwf/anemoi-models/compare/0.3.0...HEAD)
+
+## [0.3.0](https://github.com/ecmwf/anemoi-models/compare/0.2.1...0.3.0) - Remapping of (meteorological) Variables
 
 ### Added
 
+- CI workflow to update the changelog on release
+- Remapper: Preprocessor for remapping one variable to multiple ones. Includes changes to the data indices since the remapper changes the number of variables. With optional config keywords.
+
 ### Changed
 
- - Update CI to inherit from common infrastructue reusable workflows
- - run downstream-ci only when src and tests folders have changed
- - New error messages for wrongs graphs.
+- Update CI to inherit from common infrastructure reusable workflows
+- Run downstream-ci only when src and tests folders have changed
+- New error messages for wrong graphs.
 
 ### Removed
 
-## [0.2.1] - Dependency update
+## [0.2.1](https://github.com/ecmwf/anemoi-models/compare/0.2.0...0.2.1) - Dependency update
 
 ### Added
 
@@ -31,7 +36,7 @@ Keep it human-readable, your future self will thank you!
 
 - anemoi-datasets dependency
 
-## [0.2.0] - Support Heterodata
+## [0.2.0](https://github.com/ecmwf/anemoi-models/compare/0.1.0...0.2.0) - Support Heterodata
 
 ### Added
 
@@ -41,15 +46,12 @@ Keep it human-readable, your future self will thank you!
 
 - Updated to support new PyTorch Geometric HeteroData structure (defined by `anemoi-graphs` package).
 
-## [0.1.0] - Initial Release
+## [0.1.0](https://github.com/ecmwf/anemoi-models/releases/tag/0.1.0) - Initial Release
 
 ### Added
+
 - Documentation
 - Initial code release with models, layers, distributed, preprocessing, and data_indices
 - Added Changelog
 
 <!-- Add Git Diffs for Links above -->
-[unreleased]: https://github.com/ecmwf/anemoi-models/compare/0.2.1...HEAD
-[0.2.1]: https://github.com/ecmwf/anemoi-models/compare/0.2.0...0.2.1
-[0.2.0]: https://github.com/ecmwf/anemoi-models/compare/0.1.0...0.2.0
-[0.1.0]: https://github.com/ecmwf/anemoi-models/releases/tag/0.1.0
diff --git a/docs/modules/data_indices.rst b/docs/modules/data_indices.rst
@@ -45,12 +45,33 @@ config entry:
    :alt: Schematic of IndexCollection with Data Indexing on Data and Model levels.
    :align: center
 
-The are two Index-levels:
+Additionally, prognostic and forcing variables can be remapped and
+converted to multiple variables. The conversion is then done by the
+remapper-preprocessor.
+
+.. code:: yaml
+
+   data:
+     remapped:
+       d:
+         - "d_1"
+         - "d_2"
+
+There are two main Index-levels:
 
 -  Data: The data at "Zarr"-level provided by Anemoi-Datasets
 -  Model: The "squeezed" tensors with irrelevant parts missing.
 
-These are both split into two versions:
+Additionally, there are two internal model levels (after the preprocessor
+and before the postprocessor) that are necessary because of the possibility
+to remap variables to multiple variables.
+
+-  Internal Data: Variables from Data-level that are used internally in
+   the model, but not exposed to the user.
+-  Internal Model: Variables from Model-level that are used internally
+   in the model, but not exposed to the user.
+
+All indices at the different levels are split into two versions:
 
 -  Input: The data going into training / model
 -  Output: The data produced by training / model
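
Review note: the documentation above describes a remapper that turns one variable into several (``d`` into ``d_1`` and ``d_2``). As a rough illustration of what such a one-to-many conversion could look like, here is a minimal sketch. The cosine/sine split and the function name `remap_angle` are assumptions chosen for the example; the diff does not show which transform the real remapper applies.

```python
import math


def remap_angle(sample: dict, source: str, targets: tuple) -> dict:
    """Return a copy of `sample` where `source` is replaced by two target variables.

    Hypothetical helper: the cos/sin transform is an assumption for illustration,
    not the transform used by anemoi-models.
    """
    # Drop the source variable; every other variable keeps its relative order.
    out = {name: value for name, value in sample.items() if name != source}
    radians = math.radians(sample[source])
    # The new variables are appended at the end, mirroring the documented
    # behaviour that remapped variables are added after the existing ones.
    out[targets[0]] = math.cos(radians)
    out[targets[1]] = math.sin(radians)
    return out


sample = {"t": 280.0, "d": 90.0}
remapped = remap_angle(sample, "d", ("d_1", "d_2"))
print(list(remapped))  # ['t', 'd_1', 'd_2']
```

The point of the sketch is only that the number of variables changes (two in, three out), which is why the internal index levels are needed.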

diff --git a/docs/modules/preprocessing.rst b/docs/modules/preprocessing.rst
@@ -33,3 +33,16 @@ following classes:
    :members:
    :no-undoc-members:
    :show-inheritance:
+
+**********
+ Remapper
+**********
+
+The remapper module is used to remap one variable to multiple other
+variables that have been listed under ``data.remapped``. The module contains
+the following classes:
+
+.. automodule:: anemoi.models.preprocessing.remapper
+   :members:
+   :no-undoc-members:
+   :show-inheritance:
diff --git a/src/anemoi/models/data_indices/collection.py b/src/anemoi/models/data_indices/collection.py
@@ -25,26 +25,76 @@ class IndexCollection:
 
     def __init__(self, config, name_to_index) -> None:
         self.config = OmegaConf.to_container(config, resolve=True)
-
+        self.name_to_index = dict(sorted(name_to_index.items(), key=operator.itemgetter(1)))
         self.forcing = [] if config.data.forcing is None else OmegaConf.to_container(config.data.forcing, resolve=True)
         self.diagnostic = (
             [] if config.data.diagnostic is None else OmegaConf.to_container(config.data.diagnostic, resolve=True)
         )
+        # config.data.remapped is an optional dictionary with every remapper as one entry
+        self.remapped = (
+            dict()
+            if config.data.get("remapped") is None
+            else OmegaConf.to_container(config.data.remapped, resolve=True)
+        )
+        self.forcing_remapped = self.forcing.copy()
 
         assert set(self.diagnostic).isdisjoint(self.forcing), (
             f"Diagnostic and forcing variables overlap: {set(self.diagnostic).intersection(self.forcing)}. ",
             "Please drop them at a dataset-level to exclude them from the training data.",
         )
-        self.name_to_index = dict(sorted(name_to_index.items(), key=operator.itemgetter(1)))
+        assert set(self.remapped).isdisjoint(self.diagnostic), (
+            "Remapped variables overlap with diagnostic variables. Not implemented."
+        )
+        assert set(self.remapped).issubset(self.name_to_index), (
+            "Remapping a variable that does not exist in the dataset. Check for typos: "
+            f"{set(self.remapped).difference(self.name_to_index)}"
+        )
         name_to_index_model_input = {
             name: i for i, name in enumerate(key for key in self.name_to_index if key not in self.diagnostic)
         }
         name_to_index_model_output = {
             name: i for i, name in enumerate(key for key in self.name_to_index if key not in self.forcing)
         }
+        # remove remapped variables from internal data and model indices
+        name_to_index_internal_data_input = {
+            name: i for i, name in enumerate(key for key in self.name_to_index if key not in self.remapped)
+        }
+        name_to_index_internal_model_input = {
+            name: i for i, name in enumerate(key for key in name_to_index_model_input if key not in self.remapped)
+        }
+        name_to_index_internal_model_output = {
+            name: i for i, name in enumerate(key for key in name_to_index_model_output if key not in self.remapped)
+        }
+        # for all variables to be remapped we add the resulting remapped variables to the end of the tensors
+        # keep track of that in the index collections
+        for key in self.remapped:
+            for mapped in self.remapped[key]:
+                # add index of remapped variables to dictionary
+                name_to_index_internal_model_input[mapped] = len(name_to_index_internal_model_input)
+                name_to_index_internal_data_input[mapped] = len(name_to_index_internal_data_input)
+                if key not in self.forcing:
+                    # do not include forcing variables in the remapped model output
+                    name_to_index_internal_model_output[mapped] = len(name_to_index_internal_model_output)
+                else:
+                    # add remapped forcing variables to forcing_remapped
+                    self.forcing_remapped += [mapped]
+            if key in self.forcing:
+                # if key is in forcing we need to remove it from forcing_remapped after remapped variables have been added
+                self.forcing_remapped.remove(key)
 
         self.data = DataIndex(self.diagnostic, self.forcing, self.name_to_index)
+        self.internal_data = DataIndex(
+            self.diagnostic,
+            self.forcing_remapped,
+            name_to_index_internal_data_input,
+        )  # internal after the remapping applied to data (training)
         self.model = ModelIndex(self.diagnostic, self.forcing, name_to_index_model_input, name_to_index_model_output)
+        self.internal_model = ModelIndex(
+            self.diagnostic,
+            self.forcing_remapped,
+            name_to_index_internal_model_input,
+            name_to_index_internal_model_output,
+        )  # internal after the remapping applied to model (inference)
 
     def __repr__(self) -> str:
         return f"IndexCollection(config={self.config}, name_to_index={self.name_to_index})"
@@ -54,7 +104,12 @@ def __eq__(self, other):
             # don't attempt to compare against unrelated types
             return NotImplemented
 
-        return self.model == other.model and self.data == other.data
+        return (
+            self.model == other.model
+            and self.data == other.data
+            and self.internal_model == other.internal_model
+            and self.internal_data == other.internal_data
+        )
 
     def __getitem__(self, key):
         return getattr(self, key)
@@ -63,6 +118,8 @@ def todict(self):
         return {
             "data": self.data.todict(),
             "model": self.model.todict(),
+            "internal_model": self.internal_model.todict(),
+            "internal_data": self.internal_data.todict(),
         }
 
     @staticmethod
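
Review note: the index bookkeeping added to `IndexCollection.__init__` can be followed more easily on plain dictionaries. The sketch below re-implements only the internal model indices with standard-library types; the function name `build_internal_indices` and the variable names in the example are hypothetical, and the `DataIndex`/`ModelIndex` wrappers are left out.

```python
def build_internal_indices(name_to_index, forcing, diagnostic, remapped):
    """Standalone sketch of the internal model index bookkeeping (hypothetical helper)."""
    # Sort variable names by their dataset index, as the constructor does.
    ordered = sorted(name_to_index, key=name_to_index.get)
    model_input = [name for name in ordered if name not in diagnostic]
    model_output = [name for name in ordered if name not in forcing]

    # Remove the remapped source variables from the internal indices.
    internal_input = {n: i for i, n in enumerate(k for k in model_input if k not in remapped)}
    internal_output = {n: i for i, n in enumerate(k for k in model_output if k not in remapped)}

    forcing_remapped = list(forcing)
    for key, targets in remapped.items():
        for mapped in targets:
            # Remapped variables are appended to the end of the tensors.
            internal_input[mapped] = len(internal_input)
            if key not in forcing:
                internal_output[mapped] = len(internal_output)
            else:
                forcing_remapped.append(mapped)
        if key in forcing:
            # The source forcing variable is replaced by its remapped targets.
            forcing_remapped.remove(key)
    return internal_input, internal_output, forcing_remapped


internal_input, internal_output, forcing_remapped = build_internal_indices(
    name_to_index={"t": 0, "d": 1, "z": 2, "tp": 3},
    forcing=["z"],
    diagnostic=["tp"],
    remapped={"d": ["d_1", "d_2"]},
)
print(internal_input)   # {'t': 0, 'z': 1, 'd_1': 2, 'd_2': 3}
print(internal_output)  # {'t': 0, 'tp': 1, 'd_1': 2, 'd_2': 3}
```

In this example `d` is prognostic, so its two targets appear in both the input and the output; had `d` been a forcing variable, they would appear only in the input and in `forcing_remapped`.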

diff --git a/src/anemoi/models/interface/__init__.py b/src/anemoi/models/interface/__init__.py
@@ -65,7 +65,7 @@ def _build_model(self) -> None:
         """Builds the model and pre- and post-processors."""
         # Instantiate processors
         processors = [
-            [name, instantiate(processor, statistics=self.statistics, data_indices=self.data_indices)]
+            [name, instantiate(processor, data_indices=self.data_indices, statistics=self.statistics)]
             for name, processor in self.config.data.processors.items()
         ]
 

diff --git a/src/anemoi/models/models/encoder_processor_decoder.py b/src/anemoi/models/models/encoder_processor_decoder.py
@@ -104,22 +104,23 @@ def __init__(
         )
 
     def _calculate_shapes_and_indices(self, data_indices: dict) -> None:
-        self.num_input_channels = len(data_indices.model.input)
-        self.num_output_channels = len(data_indices.model.output)
-        self._internal_input_idx = data_indices.model.input.prognostic
-        self._internal_output_idx = data_indices.model.output.prognostic
+        self.num_input_channels = len(data_indices.internal_model.input)
+        self.num_output_channels = len(data_indices.internal_model.output)
+        self._internal_input_idx = data_indices.internal_model.input.prognostic
+        self._internal_output_idx = data_indices.internal_model.output.prognostic
 
     def _assert_matching_indices(self, data_indices: dict) -> None:
 
-        assert len(self._internal_output_idx) == len(data_indices.model.output.full) - len(
-            data_indices.model.output.diagnostic
+        assert len(self._internal_output_idx) == len(data_indices.internal_model.output.full) - len(
+            data_indices.internal_model.output.diagnostic
         ), (
-            f"Mismatch between the internal data indices ({len(self._internal_output_idx)}) and the output indices excluding "
-            f"diagnostic variables ({len(data_indices.model.output.full) - len(data_indices.model.output.diagnostic)})",
+            f"Mismatch between the internal data indices ({len(self._internal_output_idx)}) and "
+            f"the internal output indices excluding diagnostic variables "
+            f"({len(data_indices.internal_model.output.full) - len(data_indices.internal_model.output.diagnostic)})",
         )
         assert len(self._internal_input_idx) == len(
             self._internal_output_idx,
-        ), f"Model indices must match {self._internal_input_idx} != {self._internal_output_idx}"
+        ), f"Internal model indices must match {self._internal_input_idx} != {self._internal_output_idx}"
 
     def _define_tensor_sizes(self, config: DotDict) -> None:
         self._data_grid_size = self._graph_data[self._graph_name_data].num_nodes

diff --git a/src/anemoi/models/preprocessing/__init__.py b/src/anemoi/models/preprocessing/__init__.py
@@ -14,6 +14,8 @@
 from torch import Tensor
 from torch import nn
 
+from anemoi.models.data_indices.collection import IndexCollection
+
 LOGGER = logging.getLogger(__name__)
 
 
@@ -23,19 +25,19 @@ class BasePreprocessor(nn.Module):
     def __init__(
         self,
         config=None,
+        data_indices: Optional[IndexCollection] = None,
         statistics: Optional[dict] = None,
-        data_indices: Optional[dict] = None,
     ) -> None:
         """Initialize the preprocessor.
 
         Parameters
         ----------
         config : DotDict
-            configuration object
+            configuration object of the processor
+        data_indices : IndexCollection
+            Data indices for input and output variables
         statistics : dict
             Data statistics dictionary
-        data_indices : dict
-            Data indices for input and output variables
         """
         super().__init__()
 

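
Review note: this commit swaps the positional order of `statistics` and `data_indices`, so any subclass that forwards them positionally has to be updated in step (as the imputer changes below do). One possible way to make such reorderings harmless is to declare the arguments keyword-only. The sketch below is an alternative for discussion, not what the commit implements; `BaseProcessor` and `Imputer` are hypothetical stand-ins, not the anemoi classes.

```python
from typing import Optional


class BaseProcessor:
    """Hypothetical stand-in for a preprocessor base class."""

    def __init__(
        self,
        config=None,
        *,  # everything after this marker must be passed by keyword
        data_indices: Optional[dict] = None,
        statistics: Optional[dict] = None,
    ) -> None:
        self.config = config
        self.data_indices = data_indices
        self.statistics = statistics


class Imputer(BaseProcessor):
    """Hypothetical subclass that forwards its arguments by keyword."""

    def __init__(self, config=None, *, data_indices=None, statistics=None) -> None:
        super().__init__(config, data_indices=data_indices, statistics=statistics)


# The order of keyword arguments at the call site no longer matters.
imputer = Imputer({"default": "none"}, statistics={"mean": [0.0]}, data_indices={"t": 0})
```

With this layout, a positional call such as `Imputer(config, data_indices, statistics)` raises a `TypeError` instead of silently binding the wrong object to the wrong parameter.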
diff --git a/src/anemoi/models/preprocessing/imputer.py b/src/anemoi/models/preprocessing/imputer.py
@@ -33,16 +33,15 @@ def __init__(
         Parameters
         ----------
         config : DotDict
-            configuration object
+            configuration object of the processor
+        data_indices : IndexCollection
+            Data indices for input and output variables
         statistics : dict
             Data statistics dictionary
-        data_indices : dict
-            Data indices for input and output variables
         """
-        super().__init__(config, statistics, data_indices)
+        super().__init__(config, data_indices, statistics)
 
         self.nan_locations = None
-        self.data_indices = data_indices
 
     def _validate_indices(self):
         assert len(self.index_training_input) == len(self.index_inference_input) <= len(self.replacement), (
@@ -174,8 +173,8 @@ class InputImputer(BaseImputer):
     def __init__(
         self,
         config=None,
+        data_indices: Optional[IndexCollection] = None,
         statistics: Optional[dict] = None,
-        data_indices: Optional[dict] = None,
     ) -> None:
         super().__init__(config, data_indices, statistics)
 
@@ -201,7 +200,10 @@ class ConstantImputer(BaseImputer):
     """
 
     def __init__(
-        self, config=None, statistics: Optional[dict] = None, data_indices: Optional[IndexCollection] = None
+        self,
+        config=None,
+        data_indices: Optional[IndexCollection] = None,
+        statistics: Optional[dict] = None,
     ) -> None:
         super().__init__(config, data_indices, statistics)