csbCrawler2Cloud

A Python utility that uploads crowdsourced bathymetry (CSB) data to S3 buckets and handles pre-processing tasks such as applying geographic masks.

Environment variable

The scripts assume:

  • CSBCRAWLER is set to the root of this project
    • If you are at the project root, run
       $ export CSBCRAWLER=`pwd`
    • To confirm the current value
       $ printenv CSBCRAWLER
       /Users/username/src/github/cedardevs/csbCrawler2Cloud
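The scripts can fail fast when CSBCRAWLER is missing. A minimal sketch (the `project_root` helper is hypothetical, not part of this repo):

```python
import os

def project_root():
    """Return the project root from CSBCRAWLER, raising a helpful error if unset."""
    root = os.environ.get("CSBCRAWLER")
    if not root:
        raise RuntimeError(
            "CSBCRAWLER is not set; run `export CSBCRAWLER=$(pwd)` from the project root"
        )
    return root
```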

Libraries

Pipenv is used to manage external libraries. Pipfile contains the list of dependencies under the [packages] section.

Verify the Python version is 3.7+:

pipenv run python --version

Install spatialindex (on macOS) to support the Rtree dependency (use Homebrew; pipenv could not find it):

brew install spatialindex

Install all dependencies listed under the [packages] section of the Pipfile:

pipenv install

If the above install does not work, try installing the dependencies individually:

pipenv install boto3
pipenv install rtree
pipenv install shapely
pipenv install pyyaml
pipenv install geopandas

Run app

pipenv run python launch_app.py

Troubleshooting

If you see locale errors, make sure these environment variables are set by your ~/.profile (or ~/.bash_profile):

export LC_ALL='en_US.UTF-8'
export LANG='en_US.UTF-8'

Upload to AWS

  • Requires a credentials.yaml file containing:

    ACCESS_KEY: xxx
    SECRET_KEY: xxx
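Since credentials.yaml holds flat key: value pairs, loading it might look like the sketch below. The `load_credentials` helper is hypothetical; the real code presumably uses pyyaml (`yaml.safe_load`), but a stdlib parse is shown to keep the example self-contained:

```python
from pathlib import Path

def load_credentials(path="credentials.yaml"):
    """Parse flat KEY: value pairs (a stdlib stand-in for yaml.safe_load)."""
    creds = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            creds[key.strip()] = value.strip()
    return creds
```

The resulting ACCESS_KEY / SECRET_KEY values would then be passed to boto3, e.g. `boto3.client('s3', aws_access_key_id=..., aws_secret_access_key=...)`.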

File convention assumed

The data lands on NCEI disk as a tarball containing three files:

  • YYYYMMDD_uuid_geojson.json
  • YYYYMMDD_uuid_metadata.json
  • YYYYMMDD_uuid_pointData.xyz
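A filename following this convention can be split apart with a small helper. `parse_csb_filename` is a hypothetical sketch, assuming the YYYYMMDD_uuid_kind pattern above:

```python
import re
from datetime import datetime

# Matches e.g. "20190102_<uuid>_metadata.json" per the convention above.
PATTERN = re.compile(
    r"^(?P<date>\d{8})_(?P<uuid>[^_]+)_(?P<kind>geojson|metadata|pointData)\.(?:json|xyz)$"
)

def parse_csb_filename(name):
    """Split a CSB filename into its date, uuid, and kind components."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected CSB filename: {name}")
    return {
        "date": datetime.strptime(m.group("date"), "%Y%m%d").date(),
        "uuid": m.group("uuid"),
        "kind": m.group("kind"),
    }
```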

Athena Notes

-- Generate timestamp column
 SELECT *, from_iso8601_timestamp("xyz"."time") ts FROM csbathenadb.xyz 

-- Create a Parquet table joining xyz and metadata
CREATE TABLE csbathenadb.csb_mv
WITH (
  format = 'PARQUET',
  external_location = 's3://csbxyzfiles/optimized/'
) AS SELECT
  xyz.*,
  "metadata"."name",
  "metadata"."provider"
FROM xyz
JOIN metadata ON "xyz"."uuid" = "metadata"."uuid"

-- Query using dates
SELECT
  *
FROM
  xyz
WHERE
  from_iso8601_timestamp("xyz"."time")
BETWEEN 
  from_iso8601_timestamp('2015-01-01T00:00:00') 
AND 
  from_iso8601_timestamp('2019-01-01T23:59:00')   

Example Request

{
  "uuid": "",
  "email": "david.neufeld@colorado.edu",
  "platform.name": "Tenacity",
  "bbox": "-140.0,24.0,-111.0,32.0",
  "sdate": "2015-01-01T00:00:00",
  "edate": "2019-01-01T23:59:00"
}
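The bbox and date fields of such a request can be parsed as shown below. `parse_request` is a hypothetical sketch, assuming bbox is "minLon,minLat,maxLon,maxLat" as in the example:

```python
from datetime import datetime

def parse_request(req):
    """Turn the bbox/sdate/edate strings of a request into numbers and datetimes."""
    min_lon, min_lat, max_lon, max_lat = (float(v) for v in req["bbox"].split(","))
    return {
        "bbox": (min_lon, min_lat, max_lon, max_lat),
        "sdate": datetime.fromisoformat(req["sdate"]),
        "edate": datetime.fromisoformat(req["edate"]),
    }
```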

Testing

To use the test runner built into PyCharm (the JetBrains IDE), you will need to set the CSBCRAWLER environment variable for the test suite to run. For example, to execute CsbCrawlerTest.py:

  1. Open the file in the editor,
  2. Right click on the "CsbCrawlerTest.py" tab at the top of the editor window,
  3. Select Edit 'Unittests in CsbCraw...'...
  4. In the dialog that opens, add CSBCRAWLER=/Users/ktanaka/src/github/cedardevs/csbCrawler2Cloud in the "Environment variables:" box (adjust the path to your project location)
  5. Click "OK" to close the dialog

Now if you right-click on the "CsbCrawlerTest.py" tab, you can select Run 'Unittests in CsbCraw...'

To run unit tests from the command line:

python -m unittest tests/CsbCrawlerTest.py

Dev Notes

Generating a Distribution

Create a source distribution with:

python setup.py sdist

A .tar.gz file should appear in the dist/ subdirectory.

Create a built distribution with:

python setup.py bdist_wheel

A .whl file should appear in the dist/ subdirectory. Install it with:

pipenv run pip install csbCrawler2Cloud-1.0.0-py3-none-any.whl

Using the Distribution

In production on-prem, conda was used to set up the agiletc user environment. Some system processes required that the system-wide Python remain at version 2; Anaconda/Miniconda allow per-user configurations.

Dependencies for geopandas

See: https://geopandas.org/install.html

Use conda-forge as the channel source:

conda install --channel conda-forge rtree
conda install --channel conda-forge libspatialindex

or, specifying version 1.9.3:

conda install --channel conda-forge libspatialindex=1.9.3

Anaconda installs packages in ~/anaconda3/pkgs.

Running on-prem

You should be able to run on acc-engines (fortuna.ngdc.noaa.gov) in the agiletc account.

fortuna$ sudo /usr/local/bin/become_agiletc
(base) [agiletc@fortuna ~]$ cd src/csbCrawler2Cloud-1.0.1
(base) [agiletc@fortuna csbCrawler2Cloud-1.0.1]$ source setCsbcrawlerEnv.sh
Use 'source setCsbcrawlerEnv.sh' to make the value persist
CSBCRAWLER value:
/home/agiletc/src/csbCrawler2Cloud-1.0.1
(base) [agiletc@fortuna csbCrawler2Cloud-1.0.1]$ python launch_app.py

tar.gz Manifest

Refer to this article for generating an md5sum of a file: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file
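The streaming approach from that article translates to Python's hashlib as follows (a sketch; chunked reads keep memory flat for large tarballs):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```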

Bucket policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::odp-noaa-nesdis-ncei-csb/*"
        }
    ]
}
