scrapysub

ScrapySub is a Python library that recursively scrapes website content, including subpages. It fetches the visible text from web pages and stores it in a structured format for easy access and analysis. The library is particularly useful for NLP and AI developers who need to gather large amounts of web content for their projects.

Features

  • Recursive Scraping: Automatically follows and scrapes links within the same domain.
  • Custom User-Agent: Mimics browser requests to avoid being blocked by websites.
  • Error Handling: Retries failed requests and handles common HTTP errors.
  • Metadata Storage: Stores additional metadata about the scraped content.
  • Politeness: Adds a delay between requests to avoid overwhelming servers.
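
Taken together, these features describe a polite crawler: one shared session, a browser-like User-Agent header, and a pause between requests. ScrapySub handles all of this internally; the sketch below only illustrates the pattern, and the header value and one-second delay are assumptions made for the example, not the library's actual settings.

import time
import requests

# Illustrative only: ScrapySub manages its own session internally.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; ExampleBot/1.0)"})

for url in ["https://example.com/", "https://example.com/about"]:
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # politeness delay between requests (assumed value)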

Installation

Install ScrapySub using pip:

pip install scrapysub

Usage

Here's a quick example to get you started with ScrapySub:

from scrapysub import ScrapWeb

# Initialize the scraper
scraper = ScrapWeb()

# Start scraping from the given URL
url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

# Get all the scraped documents
documents = scraper.get_all_documents()

# Print the content of each document
for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()

Detailed Example

Importing Required Libraries

from scrapysub import ScrapWeb, Document

Initializing the Scraper

scraper = ScrapWeb()

Starting the Scraping Process

url = "https://myportfolio-five-tau.vercel.app/"
scraper.scrape(url)

Accessing Scraped Documents

documents = scraper.get_all_documents()

for doc in documents:
    print(f"URL: {doc.metadata['url']}")
    print(f"Content: {doc.page_content[:200]}...")  # Print the first 200 characters
    print()

Class and Method Details

ScrapWeb Class

  • __init__(self): Initializes the scraper with a session and custom headers.
  • fetch_page(self, url): Fetches the HTML content of the given URL with retries and error handling.
  • scrape_text(self, html_content): Extracts visible text from the HTML content.
  • tag_visible(self, element): Helper method to filter out non-visible elements.
  • get_links(self, url, html_content): Finds all valid links on the page within the same domain.
  • is_valid_url(self, url, base_url): Checks if a URL is valid and belongs to the same domain.
  • scrape(self, url): Recursively scrapes the given URL and its subpages.
  • get_all_documents(self): Returns all scraped documents.
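
If you need finer-grained control than scrape(), the lower-level methods above can be combined by hand. A minimal sketch, assuming fetch_page() returns the page's HTML (or None on failure) and scrape_text() returns plain text, as the descriptions above suggest:

from scrapysub import ScrapWeb

scraper = ScrapWeb()

url = "https://example.com/"
html = scraper.fetch_page(url)  # assumed to return the HTML string, or None on failure
if html:
    text = scraper.scrape_text(html)     # visible text only
    links = scraper.get_links(url, html) # same-domain links found on the page
    print(text[:200])
    print(links)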

Document Class

  • __init__(self, page_content, **kwargs): Stores the text content and metadata of a web page.
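
The usage example above reads doc.metadata['url'], which suggests that keyword arguments passed to the constructor end up in a metadata dict. A minimal sketch under that assumption:

from scrapysub import Document

# Assumes **kwargs are stored in doc.metadata, consistent with the usage example.
doc = Document("Visible text of the page...", url="https://example.com/")
print(doc.page_content)
print(doc.metadata["url"])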

Error Handling

ScrapySub handles common HTTP errors by retrying failed requests with a delay. If a request fails multiple times, it logs the error and continues with the next URL.
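
The library's internals aren't reproduced here, but the retry-with-delay pattern described above looks roughly like the following sketch; the retry count, delay, and timeout are assumed values, not ScrapySub's actual settings.

import time
import requests

def fetch_with_retries(session, url, retries=3, delay=2):
    # Illustrative retry loop; retries and delay are assumptions.
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
            time.sleep(delay)
    return None  # give up; the caller logs this and moves on to the next URL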

Contributing

Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or suggestions, feel free to reach out to the maintainer.
