Commit

deploy: 390dad6
jerrylin96 committed Aug 13, 2023
1 parent b23085d commit a2bf210
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion intro.html
@@ -419,7 +419,7 @@ <h3>Download the Data<a class="headerlink" href="#download-the-data" title="Perm
</section>
<section id="preprocess-the-data">
<h3>Preprocess the Data<a class="headerlink" href="#preprocess-the-data" title="Permalink to this heading">#</a></h3>
- <p>The default preprocessing workflow takes folders of monthly data from the climate model simulations, and creates normalized numpy arrays for input and target data for training, validation, and scoring. These numpy arrays are called <code class="docutils literal notranslate"><span class="pre">train_input.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">train_target.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">val_input.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">val_target.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">scoring_input.npy</span></code>, and <code class="docutils literal notranslate"><span class="pre">scoring_target.npy</span></code>. An option to strictly use a data loader and avoid converting into numpy arrays is available in the <code class="docutils literal notranslate"><span class="pre">data_utils.py</span></code>; however, this was found to significantly slow down training because of I/O induced slowdown.</p>
+ <p>The default preprocessing workflow takes folders of monthly data from the climate model simulations, and creates normalized numpy arrays for input and target data for training, validation, and scoring. These numpy arrays are called <code class="docutils literal notranslate"><span class="pre">train_input.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">train_target.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">val_input.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">val_target.npy</span></code>, <code class="docutils literal notranslate"><span class="pre">scoring_input.npy</span></code>, and <code class="docutils literal notranslate"><span class="pre">scoring_target.npy</span></code>. An option to strictly use a data loader and avoid converting into numpy arrays is available in the <code class="docutils literal notranslate"><span class="pre">data_utils.py</span></code>; however, this can significantly slow down training because of increased I/O.</p>
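Once the six <code>.npy</code> files above exist, they can be read back with standard NumPy calls. The sketch below is illustrative only: the array shapes are made up (the real dimensions depend on the chosen input/target variables and the number of samples), and it uses <code>mmap_mode="r"</code> so a large array is not pulled into RAM all at once.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical shapes for illustration; the real arrays depend on the
# selected variables and the number of timesteps in each split.
data_dir = Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
np.save(data_dir / "train_input.npy", rng.standard_normal((8, 124)).astype(np.float32))
np.save(data_dir / "train_target.npy", rng.standard_normal((8, 128)).astype(np.float32))

# mmap_mode="r" memory-maps the file instead of loading it fully into RAM,
# which matters for the large training splits.
train_input = np.load(data_dir / "train_input.npy", mmap_mode="r")
train_target = np.load(data_dir / "train_target.npy", mmap_mode="r")
print(train_input.shape, train_target.shape)
```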
<p>The data comes in the form of folders labeled <code class="docutils literal notranslate"><span class="pre">YYYY-MM</span></code>, where <code class="docutils literal notranslate"><span class="pre">YYYY</span></code> corresponds to the year and <code class="docutils literal notranslate"><span class="pre">MM</span></code> corresponds to the month. Within each of these folders are NetCDF (.nc) files that represent inputs and outputs for individual timesteps. Input NetCDF files are labeled <code class="docutils literal notranslate"><span class="pre">E3SM-MMF.mli.YYYY-MM-DD-SSSSS.nc</span></code>, where <code class="docutils literal notranslate"><span class="pre">DD</span></code> corresponds to the day of the month and <code class="docutils literal notranslate"><span class="pre">SSSSS</span></code> corresponds to the second of the day (timesteps are spaced 1200 seconds, or 20 minutes, apart). Output NetCDF files are labeled identically, except that <code class="docutils literal notranslate"><span class="pre">mli</span></code> is replaced by <code class="docutils literal notranslate"><span class="pre">mlo</span></code>. For vertically-resolved variables (i.e. variables that have values for an entire atmospheric column), lower indices correspond to higher levels in the atmosphere, because pressure decreases monotonically with altitude.</p>
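The naming convention above is regular enough to parse mechanically. The following is a minimal sketch; the helper name and the example filename are hypothetical, but the pattern and the 1200-second spacing come from the description above.

```python
import re
from datetime import timedelta

# Matches the E3SM-MMF.mli.YYYY-MM-DD-SSSSS.nc / E3SM-MMF.mlo.... convention.
PATTERN = re.compile(r"E3SM-MMF\.(mli|mlo)\.(\d{4})-(\d{2})-(\d{2})-(\d{5})\.nc")

def parse_filename(name):
    """Hypothetical helper: split a timestep filename into its parts."""
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    kind, year, month, day, seconds = m.groups()
    seconds = int(seconds)
    return {
        "kind": kind,                        # "mli" = input, "mlo" = output
        "year": int(year),
        "month": int(month),
        "day": int(day),
        "seconds": seconds,
        "timestep_of_day": seconds // 1200,  # 1200 s apart -> 72 steps/day
        "time_of_day": str(timedelta(seconds=seconds)),
    }

# Illustrative filename (not taken from the dataset).
info = parse_filename("E3SM-MMF.mli.0001-02-01-21600.nc")
print(info["kind"], info["timestep_of_day"], info["time_of_day"])
```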
<p>The files containing the default normalization factors for the input and output data are found in the <code class="docutils literal notranslate"><span class="pre">norm_factors/</span></code> folder, precomputed for convenience. However, users can supply their own normalization factors if desired. The file containing the E3SM-MMF grid information is found in the <code class="docutils literal notranslate"><span class="pre">grid_info/</span></code> folder. This corresponds to the netCDF file ending in <code class="docutils literal notranslate"><span class="pre">grid-info.nc</span></code> on Hugging Face.</p>
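For readers computing their own factors, the sketch below shows one common normalization scheme (per-feature mean/standard deviation). This is an assumption for illustration: the precomputed files in <code>norm_factors/</code> may use a different scheme, such as per-level min-max scaling.

```python
import numpy as np

def normalize(x, mean, std, eps=1e-12):
    # Guard against zero-variance features so we never divide by zero.
    return (x - mean) / np.maximum(std, eps)

# Toy data standing in for a (samples, features) input array.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))

# Factors computed from the training split would normally be saved
# and reused for validation and scoring.
mean, std = raw.mean(axis=0), raw.std(axis=0)

normed = normalize(raw, mean, std)
print(normed.shape)
```

Whatever scheme is used, the key design point is that the same factors must be applied to the training, validation, and scoring splits.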
<p>The environment needed for preprocessing can be found in the <code class="docutils literal notranslate"><span class="pre">/preprocessing/env/requirements.txt</span></code> file. A class designed for preprocessing and metrics can be imported from the <code class="docutils literal notranslate"><span class="pre">data_utils.py</span></code> script. This script is used in the <code class="docutils literal notranslate"><span class="pre">preprocessing/create_npy_data_splits.ipynb</span></code> notebook, which creates training, validation, and scoring datasets.</p>
