Scraping improvements (#4) · S4M8/intel-extractor@ee7a54c

Scraping improvements (#4)

* wip/curious script
Update the scraping strategy to query the API directly instead of scrolling the DOM. Add functions that verify the org exists, determine the number of pages to scrape for citizen dossiers, and collect all parsed citizen URLs in an array.

* wip/add sqlite database for citizen data storage and export
Add Sequelize package. Create Citizen model with required fields. Add database initialization on startup.

* update scraping method and csv structure

* wip/updated csv structure

* various improvements

* cleanup

* feat/updated scraping method and csv structure

* chore/update version

* cleanup and formatting

* fix/reduce org name to only SID; update csv to remove mainOrg from affiliation list
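
The page-count and URL-collection steps described in the first bullet might look roughly like the sketch below. It is not the code from this commit: the function names `pageCount` and `collectCitizenUrls`, the page size, the base URL, and the shape of the per-page data (an array of citizen handles per page) are all assumptions made for illustration. The org-existence check is left out because it depends on the actual API response.

```javascript
// Hypothetical helpers; names and data shapes are assumptions, not taken from the commit.

// Number of API pages needed to cover `totalMembers` at `pageSize` members per page.
function pageCount(totalMembers, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError('pageSize must be a positive integer');
  }
  return Math.ceil(Math.max(0, totalMembers) / pageSize);
}

// Flatten per-page lists of citizen handles into one de-duplicated array of dossier URLs.
// `pages` is assumed to be an array of arrays of handle strings.
function collectCitizenUrls(pages, baseUrl) {
  const urls = new Set();
  for (const handles of pages) {
    for (const handle of handles) {
      urls.add(`${baseUrl}/${encodeURIComponent(handle)}`);
    }
  }
  return [...urls];
}
```

Computing the page count up front lets the scraper request each page from the API in a plain loop rather than scrolling until no new DOM nodes appear, and the `Set` keeps a citizen who shows up on two pages from being scraped twice.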

S4M8 authored Jul 2, 2024

1 parent 59d95cc commit ee7a54c

.gitignore

@@ -2,6 +2,12 @@
     intel-extractor-win32-x64
     intel-extractor-win32-x64.zip
+    # Existing Roster
+    existing_roster.csv
+    # Generated Database
+    ../src/puppeteer/database/*
     # Logs
     logs
     *.log