This program produces the responses to the challenges asked in the final exercise of Module 1 of IronHack's Data Analytics Bootcamp [PT2020]
This is an exercise on constructing a DATA PIPELINE, showcasing the programming skills and tools acquired in the first module of the program:
"A data pipeline views all data as streaming data and it allows for flexible schemas. Regardless of whether it comes from static sources or from real-time sources, the data pipeline divides each data stream into smaller chunks that it processes in parallel, conferring extra computing power."
- Tables (.db). In the following link you can find the .db file with the main dataset:
- API. The project uses the Open Skills Project API.
- Web Scraping. The project retrieves the ISO 3166 alpha-2 codes from Wikipedia and the list of countries inside Europe from the World Health Organization.
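As an illustration of the scraping step, here is a minimal sketch of extracting ISO 3166-1 alpha-2 codes with Beautiful Soup. The HTML snippet and the function name are hypothetical stand-ins that mimic a Wikipedia-style `wikitable`; the real pipeline fetches the live page with `requests`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the Wikipedia ISO 3166-1 alpha-2 table layout.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Code</th><th>Country name</th></tr>
  <tr><td>ES</td><td>Spain</td></tr>
  <tr><td>GB</td><td>United Kingdom</td></tr>
</table>
"""

def extract_alpha2_codes(html):
    """Map ISO 3166-1 alpha-2 codes to country names from a wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    codes = {}
    # Skip the header row, then read the two cells of each data row.
    for row in soup.find("table", class_="wikitable").find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            codes[cells[0]] = cells[1]
    return codes

print(extract_alpha2_codes(SAMPLE_HTML))
# {'ES': 'Spain', 'GB': 'United Kingdom'}
```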
You need to create a Data Pipeline that retrieves the following table:
Country | Job Title | Gender | Quantity | Percentage |
---|---|---|---|---|
Spain | Data Scientist | Male | 25 | 5% |
Spain | Data Scientist | Female | 25 | 5% |
... | ... | ... | ... | ... |
** Percentages are in proportion to each gender in each job category for each country
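One plausible way to build such a table with pandas is a group-and-count followed by a transform. This is a sketch with toy data: the column names match the table above, and the percentage rule (each count relative to the gender's total within the country) is an assumption based on the note.

```python
import pandas as pd

# Toy respondent data; the real data comes from the .db dataset.
df = pd.DataFrame({
    "Country":   ["Spain"] * 4,
    "Job Title": ["Data Scientist"] * 4,
    "Gender":    ["Male", "Male", "Female", "Female"],
})

# Count respondents per (country, job title, gender).
counts = (df.groupby(["Country", "Job Title", "Gender"])
            .size()
            .reset_index(name="Quantity"))

# Assumed rule: percentage of each gender's total within the country.
gender_totals = counts.groupby(["Country", "Gender"])["Quantity"].transform("sum")
counts["Percentage"] = (
    (100 * counts["Quantity"] / gender_totals).round(0).astype(int).astype(str) + "%"
)
print(counts)
```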
The main purpose of this challenge was to work with in-favor/against polls, which were complex to clean from the raw data.
My interpretation of the second challenge was to visually represent how the different genders responded to the basic income polls and whether there was a significant difference.
Position | Pro Arguments for Male | Pro Arguments for Female |
---|---|---|
Responses | | |
Position | Against for Male | Against for Female |
---|---|---|
Responses | | |
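A minimal sketch of how the stance-by-gender summary could be produced with pandas; the column names and toy responses are hypothetical, standing in for the cleaned poll data.

```python
import pandas as pd

# Hypothetical cleaned poll data: one stance on basic income per respondent.
polls = pd.DataFrame({
    "Gender":   ["Male", "Female", "Male", "Female", "Male"],
    "Position": ["in favor", "in favor", "against", "against", "in favor"],
})

# Cross-tabulate stance by gender, matching the tables above.
summary = pd.crosstab(polls["Position"], polls["Gender"])
print(summary)
```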
The main purpose of this challenge was to work with the education level table, which uses a discrete qualitative classification of levels.
I framed the challenge to continue working with the collected data; the final table shows the top 5 skills for each education level, using matplotlib to visualize quantities.
Education Level | Top 5 Skills |
---|---|
high | |
medium | |
low | |
no education | |
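The top-skills table can be sketched with a count-sort-head pattern in pandas. This is a toy example: the column names and data are assumptions, standing in for the cleaned ddbb tables.

```python
import pandas as pd

# Toy skills data; the real input comes from the cleaned database tables.
skills = pd.DataFrame({
    "education_level": ["high", "high", "high", "low", "low"],
    "skill":           ["Python", "SQL", "Python", "Excel", "Excel"],
})

# Count skill mentions per education level, then keep the 5 most frequent
# skills within each level.
top5 = (skills.groupby(["education_level", "skill"])
              .size()
              .reset_index(name="count")
              .sort_values(["education_level", "count"], ascending=[True, False])
              .groupby("education_level")
              .head(5))
print(top5)
```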
As a prerequisite, the programming language of this repository is Python 3.7.3, so you must have Python 3 installed. The native packages in use are:
Furthermore, the following libraries need to be installed in the working environment:
- SQL Alchemy (v.1.3.17)
- Pandas (v.0.24.2)
- Numpy (v.1.18.1)
- Requests (v.2.23.0)
- Beautiful Soup (v.4.9.1)
- Scikit-learn (v.0.23.1)
- Matplotlib (v.3.1.2)
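Assuming dependencies are managed with pip, the pinned versions above can be installed in one command (note that the PyPI package names differ slightly from the display names):

```shell
pip install SQLAlchemy==1.3.17 pandas==0.24.2 numpy==1.18.1 \
            requests==2.23.0 beautifulsoup4==4.9.1 \
            scikit-learn==0.23.1 matplotlib==3.1.2
```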
Version 1.0 [04.07.2020] > First version done for class presentation
Version 1.0.1 [08.07.2020] > Post presentation corrections
- Download the repo (make sure you have fulfilled the prerequisites).
- Run the script `main_script.py` with a valid argument. The only option is a country argument. If the country is not in the information retrieved from the web sources, the program will exit at the beginning.
- Possible inputs:
- 3.1. View all countries contained in the database:
$ python main_script.py -c All
You will get:
· Parsing argument: ['All']
··· Fetching European countries from web scrapping
··· Validating country argparse
>> getting all countries
- 3.2. View a specific country in the ddbb:
$ python main_script.py -c United Kingdom
You will get:
· Parsing argument: ['United', 'Kingdom']
··· Fetching European countries from web scrapping
··· Validating country argparse
·· country_argument found in ddbb
- 3.3. Wrong entries:
$ python main_script.py -c
You will get:
· Parsing argument: []
··· Fetching European countries from web scrapping
··· Validating country argparse
>> country_argument not found.
>> proceeding to exit
**Help from argparse can always be called when in doubt:
$ python main_script.py -h
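The argument handling shown above could be sketched with argparse as follows. The flag name `-c` comes from the usage examples; `nargs="*"` (to accept multi-word names like "United Kingdom" and an empty argument list) and the validation set are assumptions.

```python
import argparse

# Minimal sketch of the country-argument parsing described above.
parser = argparse.ArgumentParser(description="Data pipeline entry point")
parser.add_argument("-c", "--country", nargs="*", default=[],
                    help="country to filter, or 'All' for every country")

# Simulate: $ python main_script.py -c United Kingdom
args = parser.parse_args(["-c", "United", "Kingdom"])
country = " ".join(args.country)          # -> "United Kingdom"

# Hypothetical validation against the scraped list of European countries.
european_countries = {"Spain", "United Kingdom", "France"}
if country != "All" and country not in european_countries:
    raise SystemExit(">> country_argument not found.\n>> proceeding to exit")
print(country)
```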
linkedin.com/in/lauramielgo for inquiries.
Big thanks to TAs and teachers for the help and support in the development of this project:
@github/potacho
@github/TheGurus
The folder structure follows the template given in class, generating as many files as necessary inside each package.
└── project
├── __trash__
├── .gitignore
├── .env
├── requeriments.txt
├── README.md
├── main_script.py
├── notebooks
│ ├── acquisition.ipynb
│ └── wrangling.ipynb
├── package_acquisition
│ ├── module_acquisition.py
│ └── module_cleaning.py
├── package_wrangling
│ └── module_awrangling.py
├── package_analysis
│ └── module_analysis.py
├── package_reporting
│ └── module_reporting.py
└── data
    ├── raw
    │   └── ddbb
    ├── processed
    │   └── (here you will find each ddbb table cleaned)
└── results
├── df_percentage_by_job_and_gender.csv
├── df_top_skills.csv
├── viz_distribution_top_skills.png
└── viz_distribution_basic_income.png