
Weather data scraper - OpenWeatherMap API

This is a repository for automatically downloading the current weather conditions of three Italian cities: Milan, Bologna and Cagliari. The data comes as responses to scheduled requests sent to the OpenWeatherMap API every hour. To obtain the longitude and latitude coordinates of each city, which are needed for the city-specific requests, a separate request is sent to the same API at the beginning of the process. After download, the data is stored in JSON format in the path indicated by the user. Every time a new JSON file lands in that folder, its content is automatically ingested into a Snowflake (data warehouse) internal stage, using Snowflake's Python connector (snowflake-connector-python).
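
The repository's scripts are the authoritative implementation; purely as an illustration of the request flow described above, the following sketch geocodes each city once and then fetches its current weather every hour with the requests library. The endpoint paths and query parameters follow OpenWeatherMap's public documentation, while the file naming and the hourly loop are assumptions made for the sake of the example.

# Minimal sketch of the hourly download flow described above (not the repository's exact code).
# Endpoints follow OpenWeatherMap's public docs; file naming and scheduling are illustrative assumptions.
import json
import time
from pathlib import Path

import requests

API_KEY = "<your-api-key>"            # api_key from params.yaml
DATA_FOLDER = Path("./weather_data")  # data_folder_path from params.yaml
CITIES = ["Milan", "Bologna", "Cagliari"]

def geocode(city):
    """One-off request to resolve a city name into lat/lon coordinates."""
    resp = requests.get(
        "https://api.openweathermap.org/geo/1.0/direct",
        params={"q": f"{city},IT", "limit": 1, "appid": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    hit = resp.json()[0]
    return hit["lat"], hit["lon"]

def fetch_current_weather(lat, lon):
    """City-specific request for the current weather conditions."""
    resp = requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"lat": lat, "lon": lon, "appid": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    DATA_FOLDER.mkdir(parents=True, exist_ok=True)
    coords = {city: geocode(city) for city in CITIES}   # sent once at start-up
    while True:
        for city, (lat, lon) in coords.items():
            payload = fetch_current_weather(lat, lon)
            out_file = DATA_FOLDER / f"{city}_{int(time.time())}.json"
            out_file.write_text(json.dumps(payload))
        time.sleep(3600)                                 # one request per city per hour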

The user may set the following parameters in params.yaml (a short loading sketch follows the list):

  • api_key - subscription key
  • data_folder_path - folder where the data will be saved
  • Snowflake parameters
    • password
    • account
    • warehouse
    • database
    • scheme
    • stage

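As a sketch of how these parameters might be consumed, the snippet below loads params.yaml with PyYAML. The key names mirror the list above; the exact nesting of the Snowflake parameters under a snowflake block is an assumption, not necessarily the repository's actual layout.

# Illustrative loading of params.yaml with PyYAML (pip install pyyaml).
# Key names mirror the parameter list above; the "snowflake" nesting is an assumption.
import yaml

with open("params.yaml", "r") as fh:
    params = yaml.safe_load(fh)

api_key = params["api_key"]                    # OpenWeatherMap subscription key
data_folder_path = params["data_folder_path"]  # folder where the JSON files will land
snowflake_cfg = params.get("snowflake", {})    # password, account, warehouse, database, scheme, stage
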
Execute the code

To start scraping data, all you have to do is clone this repository and set the parameters data_folder_path (where the data will land) and api_key (for communication with the weather API). Finally, execute the script scraper.py. Python must be installed on your machine, along with the libraries listed in requirements.txt. Data in JSON format will start landing in the defined directory every hour (one JSON file per city). After arrival, the data is ingested into the data warehouse automatically (provided the configuration parameters have been set in params.yaml).
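
In practice this boils down to a few commands along the following lines (the clone URL is inferred from the repository path on GitHub, and installation of the requirements via pip is assumed):

git clone https://github.com/Gkontopodis/weather-data-scraper.git
cd weather-data-scraper
pip install -r requirements.txt
python scraper.py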

To test only the downloader, execute the script download_only.py.

Docker deployment instructions

The Docker image has the scraper.py script as its entrypoint, which means that whenever the user runs the image of the service, the script is executed automatically.

To build the image on a new machine, use the following command:

docker build -t weather-scraper -f scraper.Dockerfile .

To run the built Docker image, use the following command:

docker run --rm --name weather-scraper-instance -t weather-scraper

JSON format API response example

{
    "coord": {
        "lon": 11.3426,
        "lat": 44.4938
    },
    "weather": [
        {
            "id": 800,
            "main": "Clear",
            "description": "clear sky",
            "icon": "01d"
        }
    ],
    "base": "stations",
    "main": {
        "temp": 296.14,
        "feels_like": 295.93,
        "temp_min": 295.51,
        "temp_max": 297.08,
        "pressure": 1015,
        "humidity": 55
    },
    "visibility": 10000,
    "wind": {
        "speed": 5.66,
        "deg": 90
    },
    "clouds": {
        "all": 0
    },
    "dt": 1712590807,
    "sys": {
        "type": 2,
        "id": 2000038,
        "country": "IT",
        "sunrise": 1712551338,
        "sunset": 1712598604
    },
    "timezone": 7200,
    "id": 3181928,
    "name": "Bologna",
    "cod": 200
}

Snowflake configuration

For the whole ETL workflow implementation, a free trial/personal Snowflake account is required. Once the account is created, the user should fill in the account details in params.yaml before executing the data download service.
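
For reference, the automatic ingestion that the scraper performs with snowflake-connector-python essentially amounts to a PUT of each new JSON file into the warehouse internal stage. The sketch below is a simplified illustration: the connection fields map onto the params.yaml entries listed earlier, the user field (not part of that list) and the stage and file names are placeholders, and the repository's own code may structure this differently.

# Simplified illustration of staging a freshly downloaded JSON file with snowflake-connector-python.
# Connection fields correspond to the params.yaml entries; all values below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",              # not in params.yaml; required by the connector
    password="<password>",      # params.yaml: password
    account="<account>",        # params.yaml: account
    warehouse="<warehouse>",    # params.yaml: warehouse
    database="<database>",      # params.yaml: database
    schema="<scheme>",          # params.yaml: scheme
)
try:
    cur = conn.cursor()
    # PUT uploads the local JSON file into the internal stage created in DATABASE_CONFIG.sql.
    cur.execute("PUT file:///path/to/data_folder/weather.json @<stage> AUTO_COMPRESS=TRUE")
finally:
    conn.close()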

To configure the database architecture, the user may run the SQL code blocks documented in the DATABASE_CONFIG.sql worksheet. These queries create the database, the different schemas (see the data model) and tables, as well as the file format and the warehouse internal stage where the JSON data will land.

The LOAD&QUERY.sql worksheet manually ingests into the internal stage all the historical data contained in the hist_data folder and copies it into the landing_json table of the landing_zone schema. The data is then further split into two tables in the curated_zone, where non-important features are dropped and the data is corrected and curated. Finally, the worksheet ends with queries that answer various questions about the weather conditions in all three cities (questions posed in the assessment test). For more information on the data curation process, see ETL_workflow_explanation.pdf.

The code blocks in the scripts SNOWPIPE.sql, STREAMS.sql and TASKS.sql implement the main components of the ETL pipeline: monitoring of INSERT and UPDATE statements on the staging tables, rules for passing the data from one table to another, and auto-ingestion of new data arriving at the warehouse internal stage.
