Merge pull request #63 from earth-chris/checkerboard-split

add checkerboard split, geographic k-fold
earth-chris · Sep 19, 2022 · f8c6770
2 parents a215dd2 + b17188e
commit f8c6770

Showing 8 changed files with 200 additions and 10 deletions.
diff --git a/docs/examples/geo.md b/docs/examples/geo.md
@@ -117,13 +117,13 @@ Return a 1-column geodataframe with pseudoabsences concatenated to presence reco
 
 ```python
 presence_points = gpd.read_file('/path/to/occurrence-records.gpkg')
-ela.stack_geometries(presence_points, pseudoabsence_points)
+point_stack = ela.stack_geometries(presence_points, pseudoabsence_points)
 ```
 
 Return 2 columns, with class labels assigned (1 for presences, 0 for pseudoabsences):
 
 ```python
-ela.stack_geometries(
+point_stack = ela.stack_geometries(
   presence_points,
   pseudoabsence_points,
   add_class_label=True,
@@ -133,9 +133,10 @@ ela.stack_geometries(
 If the geometries are in different crs, default is to reproject to the presence crs. Override this with target_crs="background":
 
 ```python
-ela.stack_geometries(
+point_stack = ela.stack_geometries(
   presence_points,
   pseudoabsence_points,
+  add_class_label=True,
   target_crs="background",
 )
 ```
@@ -149,9 +150,9 @@ Annotation refers to reading and storing raster values at the locations of a ser
 Once you have your species presence and pseudo-absence records, you can annotate these records with the covariate data from each location.
 
 ```python
-pseudoabsence_covariates = ela.annotate(
-    pseudoabsence_points,
-    list_of_raster_paths,
+covariates = ela.annotate(
+    point_stack,
+    list_of_rasters,
     drop_na = True,
 )
 ```
@@ -176,8 +177,8 @@ labels = [
     "TMP-mean",
 ]
 
-pseudoabsence_covariates = ela.annotate(
-    pseudoabsence_points,
+covariates = ela.annotate(
+    point_stack,
     raster_paths,
     labels = labels,
     drop_na = True,
@@ -199,7 +200,13 @@ One way to add spatial information to a model is to compute geographically-expli
 `elapid` does this by calculating sample weights based on the distance to the nearest neighbor. Points nearby other points receive lower weight scores; far-away points receive higher weight scores.
 
 ```python
-sample_weight = ela.distance_weights(pseudoabsence_points)
+sample_weight = ela.distance_weights(point_stack)
+```
+
+The default is to compute weights based on the distance to the nearest point. You can instead use the average distance to the `n` nearest points, which weights samples by local point density rather than by the distance to the single nearest point. This may be useful if you have small clusters of a few points far away from large, densely populated regions.
+
+```python
+sample_weight = ela.distance_weights(point_stack, n_neighbors=10)
 ```
 
 These weights can be passed to many model fitting routines, typically via `model.fit(x, y, sample_weight=sample_weight)`. This is supported for `ela.MaxentModel()`, as well as many `sklearn` methods.
@@ -208,6 +215,34 @@ This function uses `ela.nearest_point_distance()`, a handy function for computin
 
 ---
 
+## Train/test splits
+
+Uniformly random train/test splits are generally discouraged in spatial modeling because of the strong spatial structure inherent in many datasets. The non-independence of these data is referred to as spatial autocorrelation. Using distance- or density-based sample weights is one way to mitigate these effects. Another is to split the data into geographically distinct train/test regions to try and prioritize model generalization.
+
+One method is to use a "checkerboard" system for creating train/test splits. Points are intersected with a regular grid, and alternating grid cells are assigned to the train and test sets.
+
+```python
+train, test = ela.checkerboard_split(point_stack, grid_size=1000)
+```
+
+The height and width of the grid used to split the data is controlled by the `grid_size` parameter. This should specify distance in the units of the point data's CRS. The above call would split data along a 1x1 km grid if the CRS units were in meters.
+
+The black and white structure of the checkerboard means this method can only generate one train/test split.
+
+Alternatively, you can create `k` geographically-clustered folds using the `GeographicKFold` cross validation strategy:
+
+```python
+gfolds = ela.GeographicKFold(n_splits=4)
+for train_idx, test_idx in gfolds.split(point_stack):
+    train_points = point_stack.iloc[train_idx]
+    test_points = point_stack.iloc[test_idx]
+    # split x/y data, fit models, evaluate, etc.
+```
+
+This method uses KMeans clustering, fit with the x/y locations of the point data, to group points into spatially distinct clusters. This cross-validation strategy is a good way to test how well models generalize outside of their training extents into novel geographic regions.
+
+---
+
 ## Zonal statistics
 
 In addition to the tools for working with Point data, `elapid` contains a routine for calculating zonal statistics from Polygon or MultiPolygon geometry types.
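
For readers of this diff, the nearest-neighbor weighting described in the `geo.md` changes can be pictured with a minimal NumPy sketch. It is not `elapid`'s implementation: the function name `distance_weights_sketch`, the brute-force pairwise distance matrix, and the rescaling by the maximum are all assumptions made for illustration, and the diff does not show how `ela.distance_weights()` normalizes its output.

```python
import numpy as np


def distance_weights_sketch(xy: np.ndarray, n_neighbors: int = 1) -> np.ndarray:
    """Weight each point by its mean distance to its n nearest neighbors.

    Hypothetical helper for illustration only; not part of elapid.
    """
    # pairwise euclidean distances, shape (n_points, n_points)
    diffs = xy[:, None, :] - xy[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))

    # sort each row; column 0 is the zero distance from a point to itself
    nearest = np.sort(dists, axis=1)[:, 1 : n_neighbors + 1]
    mean_dist = nearest.mean(axis=1)

    # assumption: rescale so the most isolated point gets weight 1.0
    return mean_dist / mean_dist.max()


# three clustered points and one isolated point
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
weights = distance_weights_sketch(xy, n_neighbors=1)
```

In this toy example the isolated point receives the largest weight and the three clustered points share the same, much smaller weight, which matches the behavior the documentation describes. The brute-force distance matrix is only practical for small point sets.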

diff --git a/docs/index.md b/docs/index.md
@@ -71,6 +71,10 @@ Transform covariate data into derivative `features` to expand data dimensionalit
 
 Train and apply species distribution models based on annotated point data, configured with sensible defaults (like `elapid.MaxentModel()` and `elapid.NicheEnvelopeModel()`).
 
+:satellite: **Training spatially-aware models**
+
+Compute spatially-explicit sample weights, checkerboard train/test splits, or geographically-clustered cross-validation splits to reduce spatial autocorrelation effects (with `elapid.distance_weights()`, `elapid.checkerboard_split()` and `elapid.GeographicKFold()`).
+
 :earth_asia: **Applying models to rasters**
 
 Apply any pixel-based model with a `.predict()` method to raster data to easily create prediction probability maps (like training a `RandomForestClassifier()` and applying with `elapid.apply_model_to_rasters()`).

diff --git a/docs/module/train_test_split.md b/docs/module/train_test_split.md
@@ -0,0 +1,3 @@
+# elapid.train_test_split
+
+::: elapid.train_test_split
diff --git a/elapid/__init__.py b/elapid/__init__.py
@@ -24,4 +24,5 @@
 )
 from elapid.models import MaxentModel, NicheEnvelopeModel
 from elapid.stats import normalize_sample_probabilities
+from elapid.train_test_split import GeographicKFold, checkerboard_split
 from elapid.utils import load_object, load_sample_data, save_object
diff --git a/elapid/__version__.py b/elapid/__version__.py
@@ -1 +1 @@
-"0.3.13"
+"0.3.14"
diff --git a/elapid/train_test_split.py b/elapid/train_test_split.py
@@ -0,0 +1,112 @@
+"""Methods for geographically splitting data into train/test splits"""
+
+from typing import List, Tuple
+
+import geopandas as gpd
+import numpy as np
+from shapely.geometry import box
+from sklearn.cluster import KMeans
+from sklearn.model_selection import BaseCrossValidator
+
+from elapid.types import Vector
+
+
+def checkerboard_split(
+    points: Vector, grid_size: float, buffer: float = 0, bounds: Tuple[float, float, float, float] = None
+) -> Tuple[gpd.GeoDataFrame, gpd.GeoDataFrame]:
+    """Create train/test splits with a spatially-gridded checkerboard.
+
+    Args:
+        points: point-format GeoSeries or GeoDataFrame
+        grid_size: the height and width of each checkerboard side to split
+            data using. Should match the units of the points CRS
+            (e.g. grid_size=1000 is a 1km grid for UTM data)
+        buffer: add an x/y buffer around the initial checkerboard bounds
+        bounds: instead of deriving the checkerboard bounds from `points`,
+            use this tuple of [xmin, ymin, xmax, ymax] values.
+
+    Returns:
+        (train_points, test_points) split using a checkerboard grid.
+    """
+    if isinstance(points, gpd.GeoSeries):
+        points = points.to_frame("geometry")
+
+    bounds = points.total_bounds if bounds is None else bounds
+    xmin, ymin, xmax, ymax = bounds
+
+    x0s = np.arange(xmin - buffer, xmax + buffer + grid_size, grid_size)
+    y0s = np.arange(ymin - buffer, ymax + buffer + grid_size, grid_size)
+
+    train_cells = []
+    test_cells = []
+    for idy, y0 in enumerate(y0s):
+        offset = 0 if idy % 2 == 0 else 1
+        for idx, x0 in enumerate(x0s):
+            cell = box(x0, y0, x0 + grid_size, y0 + grid_size)
+            cell_type = 0 if (idx + offset) % 2 == 0 else 1
+            if cell_type == 0:
+                train_cells.append(cell)
+            else:
+                test_cells.append(cell)
+
+    grid_crs = points.crs
+    train_grid = gpd.GeoDataFrame(geometry=train_cells, crs=grid_crs)
+    test_grid = gpd.GeoDataFrame(geometry=test_cells, crs=grid_crs)
+    train_points = (
+        gpd.sjoin(points, train_grid, how="left", predicate="within")
+        .dropna()
+        .drop(columns="index_right")
+        .reset_index(drop=True)
+    )
+    test_points = (
+        gpd.sjoin(points, test_grid, how="left", predicate="within")
+        .dropna()
+        .drop(columns="index_right")
+        .reset_index(drop=True)
+    )
+
+    return train_points, test_points
+
+
+class GeographicKFold(BaseCrossValidator):
+    """Compute geographically-clustered train/test folds using KMeans clustering"""
+
+    def __init__(self, n_splits: int = 4):
+        self.n_splits = n_splits
+
+    def split(self, points: Vector) -> Tuple[gpd.GeoDataFrame, gpd.GeoDataFrame]:
+        """Split point data into geographically-clustered train/test folds and
+            return their array indices.
+
+        Args:
+            points: point-format GeoSeries or GeoDataFrame.
+
+        Yields:
+            (train_idxs, test_idxs) the train/test splits for each geo fold.
+        """
+        for train, test in super().split(points):
+            yield train, test
+
+    def _iter_test_indices(self, X, y=None, groups=None):
+        """The method used by the base class to split train/test data"""
+        kmeans = KMeans(n_clusters=self.n_splits)
+        xy = np.array(list(zip(X.geometry.x, X.geometry.y)))
+        kmeans.fit(xy)
+        clusters = kmeans.predict(xy)
+        indices = np.arange(len(xy))
+        for cluster in range(self.n_splits):
+            test = clusters == cluster
+            yield indices[test]
+
+    def get_n_splits(self, X=None, y=None, groups=None) -> int:
+        """Returns the number of splitting iterations in the cross-validator
+
+        Args:
+            X: ignored, exists for compatibility.
+            y: ignored, exists for compatibility.
+            groups: ignored, exists for compatibility.
+
+        Returns:
+            The number of splitting iterations in the cross-validator.
+        """
+        return self.n_splits
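
The cell assignment in `checkerboard_split` above alternates train and test cells by the parity of the column index plus the row index. A minimal sketch of the same idea using floor division on raw coordinates follows. The helper name `checkerboard_mask` and its `xmin`/`ymin` arguments are hypothetical, and this is an alternative formulation rather than the code in this PR, which builds polygons and uses a spatial join.

```python
import numpy as np


def checkerboard_mask(x, y, grid_size, xmin=None, ymin=None):
    """Return a boolean mask that is True for points in 'train' cells.

    Hypothetical helper for illustration only; not part of elapid.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # default the grid origin to the lower-left corner of the point extent
    xmin = x.min() if xmin is None else xmin
    ymin = y.min() if ymin is None else ymin

    # integer column/row index of the grid cell containing each point
    col = np.floor((x - xmin) / grid_size).astype(int)
    row = np.floor((y - ymin) / grid_size).astype(int)

    # cells where (col + row) is even are 'train', odd are 'test'
    return (col + row) % 2 == 0


# one point in each of four adjacent 1000-unit cells
x = np.array([100.0, 1500.0, 100.0, 1500.0])
y = np.array([100.0, 100.0, 1500.0, 1500.0])
is_train = checkerboard_mask(x, y, grid_size=1000, xmin=0.0, ymin=0.0)
```

One difference worth noting: floor division always assigns a point on a cell edge to exactly one cell, whereas a `within` spatial join, as I understand the predicate, does not match a point lying exactly on a polygon boundary. With the default `buffer=0`, points on the minimum bounds of the data could therefore be absent from both outputs of the PR's function.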
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -36,6 +36,7 @@ nav:
         - elapid.geo: 'module/geo.md'
         - elapid.models: 'module/models.md'
         - elapid.stats: 'module/stats.md'
+        - elapid.train_test_split: 'module/train_test_split.md'
         - elapid.types: 'module/types.md'
         - elapid.utils: 'module/utils.md'
     - Contributing to elapid: 'contributing.md'

diff --git a/tests/test_train_test_split.py b/tests/test_train_test_split.py
@@ -0,0 +1,34 @@
+import os
+
+import geopandas as gpd
+import numpy as np
+
+from elapid import train_test_split
+
+# set the test point data paths
+directory_path, script_path = os.path.split(os.path.abspath(__file__))
+data_path = os.path.join(directory_path, "data")
+points = gpd.read_file(os.path.join(data_path, "test-point-samples.gpkg"))
+
+
+def test_checkerboard_split():
+    train, test = train_test_split.checkerboard_split(points, grid_size=1000)
+    assert isinstance(train, gpd.GeoDataFrame)
+
+    buffer = 500
+    xmin, ymin, xmax, ymax = points.total_bounds
+    buffered_bounds = [xmin - buffer, ymin - buffer, xmax + buffer, ymax + buffer]
+    train_buffered, test_buffered = train_test_split.checkerboard_split(points, grid_size=1000, bounds=buffered_bounds)
+    assert len(train_buffered) > len(train)
+
+
+def test_GeographicKFold():
+    n_folds = 4
+    gfolds = train_test_split.GeographicKFold(n_splits=n_folds)
+    counted_folds = 0
+    for train_idx, test_idx in gfolds.split(points):
+        train = points.iloc[train_idx]
+        test = points.iloc[test_idx]
+        assert len(train) > len(test)
+        counted_folds += 1
+    assert gfolds.get_n_splits() == n_folds == counted_folds
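
The fold logic checked by `test_GeographicKFold` can be sketched without `sklearn`: given one cluster label per point, each cluster in turn becomes the test set and all remaining points form the training set. The function name `folds_from_labels` is hypothetical, and the hard-coded labels below are a stand-in for the KMeans output used in the PR.

```python
import numpy as np


def folds_from_labels(labels):
    """Yield (train_idx, test_idx) pairs, holding out one cluster per fold.

    Hypothetical helper for illustration only; not part of elapid.
    """
    labels = np.asarray(labels)
    indices = np.arange(len(labels))
    for cluster in np.unique(labels):
        test_mask = labels == cluster
        # training indices are the complement of the held-out cluster
        yield indices[~test_mask], indices[test_mask]


# stand-in cluster labels for eight points (assumed, not KMeans output)
labels = np.array([0, 0, 1, 2, 1, 2, 0, 1])
folds = list(folds_from_labels(labels))
```

Every point appears in exactly one test fold. Note that the `len(train) > len(test)` assertion in the test above holds only when no single cluster contains half or more of the points, so it depends on the clustering of the sample data.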