Scraping improvements (#4)
* wip/curious script
Update the scraping strategy to query the API directly instead of scrolling the DOM. Add functions that verify the org exists and determine the number of pages to scrape for citizen dossiers. Collect all parsed citizen URLs in an array.

* wip/add sqlite database for citizen data storage and export
Add Sequelize package. Create Citizen model with required fields. Add database initialization on startup.

* update scraping method and csv structure

* wip/updated csv structure

* various improvements

* cleanup

* feat/updated scraping method and csv structure

* chore/update version

* cleanup and formatting

* fix/reduce org name to only SID; update csv to remove mainOrg from affiliation list
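
The page-count and CSV changes described in the messages above can be sketched roughly as follows. This is a minimal illustration, not the commit's actual code: the helper names (`pageCount`, `affiliationsWithoutMain`, `toCsvRow`), the `citizen` object shape, and the `;` separator are all assumptions made for the example, and the row builder assumes values contain no commas or quotes.

```javascript
// Hypothetical helpers: names and data shapes are assumptions, not taken from the commit.

// Number of API pages needed to cover `totalMembers` at `pageSize` results per page.
function pageCount(totalMembers, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  return Math.ceil(Math.max(0, totalMembers) / pageSize);
}

// Remove the main org's SID from an affiliation list (case-insensitive match).
function affiliationsWithoutMain(mainOrgSid, affiliations) {
  const main = mainOrgSid.trim().toUpperCase();
  return affiliations.filter((sid) => sid.trim().toUpperCase() !== main);
}

// Build one CSV row: handle, main org SID, then the remaining affiliations joined with ';'.
// Assumes no field contains a comma or quote, so no escaping is done here.
function toCsvRow(citizen) {
  const others = affiliationsWithoutMain(citizen.mainOrg, citizen.affiliations);
  return [citizen.handle, citizen.mainOrg, others.join(";")].join(",");
}
```

With these assumed helpers, an org of 65 members at 32 results per page would need `pageCount(65, 32)` pages, and a citizen whose affiliation list repeats the main org would have that entry dropped before the row is written.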
S4M8 authored Jul 2, 2024
1 parent 59d95cc commit ee7a54c
Showing 12 changed files with 1,913 additions and 295 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -2,6 +2,12 @@
intel-extractor-win32-x64
intel-extractor-win32-x64.zip

# Existing Roster
existing_roster.csv

# Generated Database
../src/puppeteer/database/*

# Logs
logs
*.log