Multiprocess email address scraper for the De La Salle University website staff directory. Our approach models the scraping task as a multiple producer – multiple consumer problem to achieve a 7.22× superlinear speedup compared to serial execution

memgonzales/parallel-email-scraper

Multiprocess Email Address Scraper

This project is a multiprocess email address scraper for the De La Salle University website staff directory.

This is the major course output in an advanced operating systems class for master's students under Mr. Gregory G. Cu of the Department of Software Technology, De La Salle University. The task is to create an email address scraper that employs parallel programming techniques. The complete project specifications can be found in the document Project Specifications.pdf.

Approach

Combining both functional and data decomposition, our proposed approach models the scraping task as a multiple producer – multiple consumer problem:

  • The set of personnel IDs in the staff directory is divided by department, and multiple producers are mapped to different department directories. Each producer retrieves the personnel IDs from its assigned department directory and stores them in a synchronized queue.
  • Concurrently, the IDs are dequeued by consumer subprocesses, which use them to visit the staff members' individual web pages, scrape pertinent information (names, email addresses, and departments) from there, and store these details in another queue.
  • A dedicated subprocess gets the details from this queue and writes them to the output file.
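
The three roles above can be sketched with the multiprocessing module. This is a minimal illustration rather than the project's actual scraper.py: the function names, the None sentinel, the directory dictionary, and the fabricated names and example.com addresses are all assumptions, and the Selenium page visits are replaced by placeholder string formatting so that the sketch runs without a browser.

```python
import csv
import multiprocessing as mp

DONE = None  # hypothetical sentinel: no more items will arrive on this queue


def producer(department, personnel_ids, id_queue):
    """Enqueue every personnel ID of one department (stand-in for reading its directory page)."""
    for personnel_id in personnel_ids:
        id_queue.put((department, personnel_id))


def consumer(id_queue, detail_queue):
    """Dequeue IDs and enqueue (name, email, department) records until the sentinel arrives."""
    while True:
        item = id_queue.get()
        if item is DONE:
            break
        department, personnel_id = item
        # Placeholder for visiting the staff member's page; these values are made up.
        name = f"Person {personnel_id}"
        email = f"person{personnel_id}@example.com"
        detail_queue.put((name, email, department))


def writer(detail_queue, output_path):
    """Single dedicated process that owns the output file."""
    with open(output_path, "w", newline="") as f:
        out = csv.writer(f)
        while True:
            row = detail_queue.get()
            if row is DONE:
                break
            out.writerow(row)


def run_pipeline(directory, output_path, n_consumers=2, ctx=None):
    """Run one producer per department, n_consumers consumers, and one writer."""
    ctx = ctx or mp.get_context()  # optional context lets the caller pick a start method
    id_queue, detail_queue = ctx.Queue(), ctx.Queue()
    writer_proc = ctx.Process(target=writer, args=(detail_queue, output_path))
    consumers = [ctx.Process(target=consumer, args=(id_queue, detail_queue))
                 for _ in range(n_consumers)]
    producers = [ctx.Process(target=producer, args=(dept, ids, id_queue))
                 for dept, ids in directory.items()]
    for proc in [writer_proc, *consumers, *producers]:
        proc.start()
    for proc in producers:
        proc.join()            # all IDs are enqueued once every producer has exited
    for _ in consumers:
        id_queue.put(DONE)     # one sentinel per consumer
    for proc in consumers:
        proc.join()
    detail_queue.put(DONE)     # consumers are finished, so the writer can stop
    writer_proc.join()
```

To try it, call something like run_pipeline({"dept-a": [101, 102], "dept-b": [201]}, "demo.csv") from under an if __name__ == "__main__": guard, which multiprocessing requires on platforms that start subprocesses by re-importing the main module. Shutdown relies on ordering: sentinels go onto the ID queue only after every producer has exited, and onto the details queue only after every consumer has exited, so no record is dropped.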

Running our proposed approach with five threads achieves a 7.22× superlinear speedup compared to serial execution. Further experiments show that it achieves better scalability and performance than baseline parallel programming approaches that scrape from the root directory.
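
For reference, a speedup counts as superlinear when it exceeds the number of parallel workers, which is the case for 7.22× on five workers. The short sketch below only restates that arithmetic; the function names are illustrative, and no timings beyond the reported 7.22× figure are assumed.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup is serial running time divided by parallel running time."""
    return serial_seconds / parallel_seconds


def efficiency(speedup_value, n_workers):
    """Parallel efficiency: speedup per worker (1.0 corresponds to linear speedup)."""
    return speedup_value / n_workers


def is_superlinear(speedup_value, n_workers):
    """A speedup is superlinear when it exceeds the number of workers."""
    return speedup_value > n_workers


# Reported figures: 7.22x speedup with five workers.
reported_efficiency = efficiency(7.22, 5)   # about 1.44, i.e. above 1.0
reported_superlinear = is_superlinear(7.22, 5)
```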

App Screenshots

Running the Scraper

  1. Create a copy of this repository:

    • If git is installed, type the following command on the terminal:

      git clone https://github.com/memgonzales/parallel-email-scraper
      
    • If git is not installed, click the green Code button near the top right of the repository and choose Download ZIP. Once the zipped folder has been downloaded, extract its contents.

  2. Install Google Chrome. It is recommended to retain the default installation directory.

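  3. Install the necessary dependencies, Selenium and Webdriver Manager (versions are listed under Built Using), via pip. The multiprocessing and time modules are part of the Python standard library and need no installation.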

  4. Run the following command on the terminal:

    python scraper.py
    
  5. The following output files will be produced once the program is finished running:

    • Scraped_Emails.csv - A CSV file containing the scraped details (names, email addresses, and departments)
    • Website_Statistics.txt - A text file containing the number of pages scraped, the number of email addresses found, and the URLs scraped

    Sample screenshots of the running program and output files are also provided in this repository.

Built Using

This project was built using Python 3.8, with the following libraries and modules used:

| Libraries/Modules | Description | License |
| --- | --- | --- |
| Selenium 4.7.2 | Provides functions for enabling web browser automation | Apache License 2.0 |
| Webdriver Manager 3.8.5 | Simplifies management of binary drivers for different browsers | Apache License 2.0 |
| multiprocessing | Offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock | Python Software Foundation License |
| time | Provides various time-related functions | Python Software Foundation License |

The descriptions are taken from their respective websites.

Authors
