
feat: make scraper script OS-agnostic for consistent execution #224

Merged · 4 commits merged on Sep 8, 2024
10 changes: 5 additions & 5 deletions scraper/README.md
@@ -35,7 +35,7 @@ sudo pacman -S geckodriver firefox # Arch

| package | usage |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| requests | To download previous commits files from our GitHub page and scrape subjects short names |
| requests | To download previously committed files from our GitHub page and scrape subjects' short names |
| unidecode | To create short names for subjects (that weren't scraped), removing accents from chars. Ex.: Álgebra Linear para a Engenharia -> ÁLE -> ALE |
| selenium | Used to scrape the webpage. In this case it is impossible to use libraries like `beautifulsoup` due to the web stack used by UMinho |
| geckodriver | A selenium dependency to interact with browsers |
@@ -51,17 +51,17 @@ $ python scraper/main.py

##### Subjects Short Names

[Calendarium](https://calendario.cesium.di.uminho.pt/) use some short names to easily identify some subjects. This names were chosen on previous versions of `filters.json`. The scrap can be done combining the files `data/filter.json` and `data/shifts.json` from a specific commit (when this files were a manual scrap) from [Calendarium Github Page](https://github.com/cesium/calendarium).
[Calendarium](https://calendario.cesium.di.uminho.pt/) uses some short names to easily identify some subjects. These names were chosen in previous versions of `filters.json`. The scrape can be done by combining the files `data/filter.json` and `data/shifts.json` from a specific commit (when these files were a manual scrape) from the [Calendarium GitHub page](https://github.com/cesium/calendarium).

If not founded, `scraper/subjects_short_names.json` will be generated by the schedule scraper. Read more at [subjects short names](./modules/README.md#subjects_short_names).
If not found, `scraper/subjects_short_names.json` will be generated by the schedule scraper. Read more at [subjects short names](./modules/README.md#subjects_short_names).

###### You can manually add names to this list

##### Subject IDs and Filter IDs

[Calendarium](https://calendario.cesium.di.uminho.pt/) use a subject ID and a filterID. On UMinho Courses pages, a list of all subjects, ordered first by year/semesters and next by alphabetic order, and the subject IDs are given. This is everything we need to complete `shifts.json` and generate a basic `filters.json` to Calendarium.
[Calendarium](https://calendario.cesium.di.uminho.pt/) uses a subject ID and a filterID. The UMinho course pages provide a list of all subjects, ordered first by year/semester and then alphabetically, together with each subject ID. This is everything we need to complete `shifts.json` and generate a basic `filters.json` for Calendarium.

If not founded, `scraper/subjects.json` will be generated by the schedule scraper. Read more at [subjects scraper documentation](./modules/README.md#subject-id-and-a-filter-id-scraper).
If not found, `scraper/subjects.json` will be generated by the schedule scraper. Read more at [subjects scraper documentation](./modules/README.md#subject-id-and-a-filter-id-scraper).

###### You can manually add subjects to this list

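As an aside on the `unidecode` row above (ÁLE -> ALE): a minimal sketch of that kind of transformation, assuming short names are built from the initials of capitalised words and then stripped of accents; the actual rule lives in the scraper modules and may differ.

```python
# Illustrative sketch only: initials of capitalised words, accents removed.
# Álgebra Linear para a Engenharia -> ÁLE -> ALE
from unidecode import unidecode

def short_name(subject: str) -> str:
    initials = "".join(word[0] for word in subject.split() if word[0].isupper())
    return unidecode(initials)

print(short_name("Álgebra Linear para a Engenharia"))  # -> ALE
```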
8 changes: 4 additions & 4 deletions scraper/main.py
@@ -2,7 +2,7 @@

from selenium import webdriver

from os import chdir
from os import chdir, path
import json

from modules.subjects_scraper import subjects_scraper
@@ -11,7 +11,7 @@


# To prevent path problems, the code needs to be executed from the project root
chdir(__file__.replace("scraper/main.py", ""))
chdir(path.abspath(path.join(path.dirname(path.abspath(__file__)), "..")))

print("Welcome to UMinho Schedule Scraper!")

@@ -33,14 +33,14 @@
shifts += course_scraper(driver,
"Mestrado em Engenharia Informática", subject_codes)

with open("data/shifts.json", "w") as outfile:
with open(path.join("data", "shifts.json"), "w", encoding="utf-8") as outfile:
json.dump(shifts, outfile, indent=2, ensure_ascii=False)

print(f"\nDone. Scraped {len(shifts)} shifts from the schedules!")
print(f"Check them at data/shifts.json\n")

filters = create_filters(shifts, subjects)
with open("data/filters.json", "w") as outfile:
with open(path.join("data", "filters.json"), "w", encoding="utf-8") as outfile:
json.dump(filters, outfile, indent=2, ensure_ascii=False)

print(f"\nDone. Stored {len(filters)} filters!")
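For context on the `chdir` change above: the old `__file__.replace("scraper/main.py", "")` only works when `__file__` uses forward slashes, so on Windows the substring is never found and `chdir` receives the script's own path. A minimal sketch of the same root resolution with `pathlib` (equivalent in spirit, not part of this PR):

```python
# Sketch only: resolve the project root one level above scraper/main.py,
# independently of the OS path separator, then make it the working directory.
from pathlib import Path
from os import chdir

project_root = Path(__file__).resolve().parent.parent  # e.g. .../calendarium
chdir(project_root)
```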
10 changes: 5 additions & 5 deletions scraper/modules/README.md
@@ -2,15 +2,15 @@

##### (subjects_short_names_scraper.py)

[Calendarium](https://calendario.cesium.di.uminho.pt/) use some short names to easily identify some subjects. This names were chosen on previous versions of `filters.json`.
[Calendarium](https://calendario.cesium.di.uminho.pt/) uses some short names to easily identify some subjects. These names were chosen on previous versions of `filters.json`.

### Scraping these values

The scrap can be done combining the files `data/filter.json` and `data/shifts.json` from a specific commit (when this files were a manual scrap) from [Calendarium Github Page](https://github.com/cesium/calendarium).
The scrape can be done by combining the files `data/filter.json` and `data/shifts.json` from a specific commit (when these files were a manual scrape) from [Calendarium Github Page](https://github.com/cesium/calendarium).

#### Adding manual values

If for some reason you want add some subjects (a new one) to this scrap, you can edit the dictionary `manual_subject_names` at `scraper/modules/subjects_short_names_scraper.py` file. Follow the next schema:
If for some reason you want to add a new subject to this scrape, you can edit the dictionary `manual_subject_names` in the `scraper/modules/subjects_short_names_scraper.py` file. Follow this schema:

```python
manual_subject_names = {
@@ -23,7 +23,7 @@ manual_subject_names = {

#### Output

If not founded, `scraper/subjects_short_names.json` will be generated by the schedule scraper.
If not found, `scraper/subjects_short_names.json` will be generated by the schedule scraper.

## Subject ID and a Filter ID Scraper

@@ -35,7 +35,7 @@ If not founded, `scraper/subjects_short_names.json` will be generated by the sch
filterId = f"{university_year}{university_semester}{subject_code}"
```

Where the `subject code` is the position of the subject in an alphabetic ordered list. For example:
Where the `subject code` is the position of the subject in an alphabetically ordered list. For example:

```python
# 1st year & 1st semester subjects:
```
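The example block above is collapsed in this diff view; a self-contained sketch of the same idea, using placeholder subject names already in alphabetical order:

```python
# Placeholder data, for illustration only: 1st year, 1st semester subjects.
first_semester_subjects = ["Algebra Linear", "Calculo", "Programacao Funcional"]

university_year, university_semester = 1, 1

for subject_code, name in enumerate(first_semester_subjects, start=1):
    filter_id = f"{university_year}{university_semester}{subject_code}"
    print(filter_id, name)
# 111 Algebra Linear
# 112 Calculo
# 113 Programacao Funcional
```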
2 changes: 1 addition & 1 deletion scraper/modules/schedule_scraper.py
@@ -12,7 +12,7 @@ def schedule_scraper(driver: WebDriver, subject_codes: list[dict[str, int]]):
Parameters
----------
driver : WebDriver
The selenium driver. Need have the schedule ready
The selenium driver. Needs to have the schedule ready

subject_codes : list[dict[str, int]]
Every subject has its subject ID and filter ID. These IDs are stored in a list of dicts with the format:
11 changes: 7 additions & 4 deletions scraper/modules/subjects_scraper.py
@@ -12,6 +12,7 @@
from time import sleep
from unidecode import unidecode
from collections import Counter
from os import path


def subjects_scraper(driver: WebDriver):
@@ -35,13 +36,15 @@ def subjects_scraper(driver: WebDriver):
}]
"""

subjects_short_names_path = path.join("scraper", "subjects_short_names.json")

# For compatibility with older versions of Calendarium, we use the subjects' short names available on GitHub
try:
subjects_short_names = json.load(
open('scraper/subjects_short_names.json'))
open(subjects_short_names_path, encoding="utf-8"))
except FileNotFoundError:
get_subjects_short_names_scraper()
subjects_short_names = json.load(open('scraper/subjects_short_names.json'))
subjects_short_names = json.load(open(subjects_short_names_path, encoding="utf-8"))

# This function will store the result in a file. If the file already exists, we can skip this function
try:
Expand Down Expand Up @@ -87,7 +90,7 @@ def subjects_scraper(driver: WebDriver):
# =====================

# Store the subjects
with open("scraper/subjects.json", "w") as outfile:
with open(path.join("scraper", "subjects.json"), "w", encoding="utf-8") as outfile:
json.dump(subjects, outfile, indent=2, ensure_ascii=False)

print(f"\nDone. Scraped {len(subjects)} subjects from the UMinho page!")
@@ -269,7 +272,7 @@ def scraper(driver: WebDriver, course_name: str, short_names, master: bool = Fal


def get_subject_codes_from_file():
subjects_file = open("scraper/subjects.json", "r")
subjects_file = open(path.join("scraper", "subjects.json"), "r", encoding="utf-8")

subjects = json.load(subjects_file)
subject_codes = {}
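The load-or-regenerate pattern above (try the JSON file, fall back to running the scraper, then load again) could be factored into a small helper; a hedged sketch, with a hypothetical helper name that is not part of this PR:

```python
import json
from os import path

def load_json_or_rebuild(file_path: str, rebuild) -> dict:
    """Parse a JSON file, running `rebuild()` first if the file is missing."""
    try:
        with open(file_path, encoding="utf-8") as fp:
            return json.load(fp)
    except FileNotFoundError:
        rebuild()  # e.g. get_subjects_short_names_scraper()
        with open(file_path, encoding="utf-8") as fp:
            return json.load(fp)

# Hypothetical usage:
# short_names = load_json_or_rebuild(
#     path.join("scraper", "subjects_short_names.json"),
#     get_subjects_short_names_scraper)
```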
5 changes: 3 additions & 2 deletions scraper/modules/subjects_short_names_scraper.py
@@ -2,6 +2,7 @@

import json
from requests import get
from os import path

manual_subject_names = {

@@ -94,7 +95,7 @@ def get_subjects_short_names_scraper():

names = {}

print("Not founded info on `shifts.json` about:")
print("Couldn't find info on `shifts.json` about:")

for subject in filters:
filter_id = subject["id"]
@@ -121,7 +122,7 @@
for subject in manual_subject_names.values():
print("\t" + subject['name'])

with open("scraper/subjects_short_names.json", "w") as outfile:
with open(path.join("scraper", "subjects_short_names.json"), "w", encoding="utf-8") as outfile:
json.dump(names, outfile, indent=2, ensure_ascii=False)

print(f"\nDone. Stored {len(names)} names!")
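A note on the explicit `encoding="utf-8"` added throughout: without it, Windows opens files in the locale code page (often cp1252), so JSON written with `ensure_ascii=False` can fail on characters outside that code page or be misread later by tools that assume UTF-8. A minimal illustration (the output file name is just an example):

```python
import json
from os import path

names = {"Álgebra Linear para a Engenharia": "ALE"}  # non-ASCII content
out_path = path.join("scraper", "example_names.json")  # hypothetical path

# The explicit encoding keeps the file UTF-8 on every OS; json.dump with
# ensure_ascii=False then writes the accented characters verbatim.
with open(out_path, "w", encoding="utf-8") as outfile:
    json.dump(names, outfile, indent=2, ensure_ascii=False)
```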