From a2bf2109740d40457f8378bc426d024e1a247805 Mon Sep 17 00:00:00 2001
From: jerrylin96
Date: Sun, 13 Aug 2023 20:52:02 +0000
Subject: [PATCH] deploy: 390dad671ac7762ed023af9dc31cf807dd5be082

---
 intro.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/intro.html b/intro.html
index b41a759..78b3e8b 100644
--- a/intro.html
+++ b/intro.html
@@ -419,7 +419,7 @@

Download the Data

Preprocess the Data

-

The default preprocessing workflow takes folders of monthly data from the climate model simulations, and creates normalized numpy arrays for input and target data for training, validation, and scoring. These numpy arrays are called train_input.npy, train_target.npy, val_input.npy, val_target.npy, scoring_input.npy, and scoring_target.npy. An option to strictly use a data loader and avoid converting into numpy arrays is available in the data_utils.py; however, this was found to significantly slow down training because of I/O induced slowdown.

+

The default preprocessing workflow takes folders of monthly data from the climate model simulations, and creates normalized numpy arrays for input and target data for training, validation, and scoring. These numpy arrays are called train_input.npy, train_target.npy, val_input.npy, val_target.npy, scoring_input.npy, and scoring_target.npy. An option to strictly use a data loader and avoid converting into numpy arrays is available in data_utils.py; however, this can significantly slow down training because of increased I/O.

The data comes as folders labeled YYYY-MM, where YYYY is the year and MM is the month. Each folder contains NetCDF (.nc) files that hold the inputs and outputs for individual timesteps. Input NetCDF files are named E3SM-MMF.mli.YYYY-MM-DD-SSSSS.nc, where DD is the day of the month and SSSSS is the number of seconds into the day (timesteps are spaced 1200 seconds, or 20 minutes, apart). Output NetCDF files are named identically except that mli is replaced by mlo. For vertically resolved variables (i.e., variables that have values for an entire atmospheric column), lower indices correspond to higher levels in the atmosphere. This is because levels are indexed in order of increasing pressure, and pressure decreases monotonically with altitude.
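
The naming convention above can be parsed mechanically. The following is a minimal sketch, not code from the repository: the helper names parse_timestep_name, timestep_index, and matching_output_name are hypothetical, the example file name is made up for illustration, and the sketch assumes file names follow the pattern exactly as described.

```python
import re

# Matches names such as E3SM-MMF.mli.0001-02-01-01200.nc
# (the example name is illustrative, not taken from the dataset listing).
_NAME_RE = re.compile(
    r"^E3SM-MMF\.(mli|mlo)\.(\d{4})-(\d{2})-(\d{2})-(\d{5})\.nc$"
)

TIMESTEP_SECONDS = 1200  # 20-minute spacing, as stated in the text


def parse_timestep_name(filename):
    """Return (kind, year, month, day, seconds) for one timestep file name."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    kind, year, month, day, seconds = match.groups()
    return kind, int(year), int(month), int(day), int(seconds)


def timestep_index(filename):
    """Return the zero-based index of the timestep within its day."""
    seconds = parse_timestep_name(filename)[4]
    return seconds // TIMESTEP_SECONDS


def matching_output_name(input_name):
    """Map an input (mli) file name to its output (mlo) counterpart."""
    kind = parse_timestep_name(input_name)[0]
    if kind != "mli":
        raise ValueError("expected an input (mli) file name")
    return input_name.replace(".mli.", ".mlo.", 1)
```

With this sketch, matching_output_name("E3SM-MMF.mli.0001-02-01-01200.nc") yields the same name with mlo in place of mli, which is one way to pair each input file with its target file.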

The files containing the default normalization factors for the input and output data are in the norm_factors/ folder, precomputed for convenience. You can supply your own normalization factors instead if desired. The file containing the E3SM-MMF grid information is in the grid_info/ folder. It corresponds to the NetCDF file ending in grid-info.nc on Hugging Face.
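
The text does not spell out the normalization formula, so the sketch below only illustrates the general idea of applying precomputed factors. It assumes an affine form, (x - offset) / scale, with one offset and one scale per variable; the function names and the toy values are hypothetical and do not come from the norm_factors/ files.

```python
import numpy as np


def normalize(x, offset, scale):
    """Apply an assumed affine normalization: (x - offset) / scale."""
    return (np.asarray(x, dtype=np.float64) - offset) / scale


def denormalize(z, offset, scale):
    """Invert the assumed normalization to recover physical values."""
    return np.asarray(z, dtype=np.float64) * scale + offset


# Toy column with four levels (made-up values, for illustration only).
temps = np.array([210.0, 230.0, 260.0, 290.0])
offset = temps.mean()              # stand-in for a precomputed offset
scale = temps.max() - temps.min()  # stand-in for a precomputed scale

normed = normalize(temps, offset, scale)
restored = denormalize(normed, offset, scale)
```

Keeping the inverse alongside the forward transform is useful when model predictions need to be converted back to physical units for evaluation.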

The environment needed for preprocessing can be found in the /preprocessing/env/requirements.txt file. A class designed for preprocessing and metrics can be imported from the data_utils.py script. This script is used in the preprocessing/create_npy_data_splits.ipynb notebook, which creates training, validation, and scoring datasets.
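
Once the notebook has written the .npy files, they can be read back with NumPy. The sketch below is an illustration under stated assumptions rather than the repository's own loader: load_split is a hypothetical helper, the array shapes are placeholders, and the demo writes small dummy files to a temporary directory so that it runs without the real data.

```python
import os
import tempfile

import numpy as np


def load_split(data_dir, split):
    """Load <split>_input.npy and <split>_target.npy from data_dir.

    mmap_mode="r" memory-maps the files read-only, so large arrays are
    not read fully into memory up front.
    """
    x = np.load(os.path.join(data_dir, f"{split}_input.npy"), mmap_mode="r")
    y = np.load(os.path.join(data_dir, f"{split}_target.npy"), mmap_mode="r")
    if x.shape[0] != y.shape[0]:
        raise ValueError("input and target have different sample counts")
    return x, y


# Demo with placeholder arrays; the shapes here are arbitrary assumptions.
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "train_input.npy"),
        np.zeros((8, 5), dtype=np.float32))
np.save(os.path.join(demo_dir, "train_target.npy"),
        np.ones((8, 3), dtype=np.float32))

train_x, train_y = load_split(demo_dir, "train")
```

The same helper would apply to the val and scoring splits, since all six files share the <split>_input.npy and <split>_target.npy naming.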