This repository has been archived by the owner on Oct 26, 2021. It is now read-only.

An implementation of the Goose HTML Content / Article Extractor algorithm in golang

thatguystone/swan

Swan

An implementation of the Goose HTML Content / Article Extractor algorithm in golang.

Swan extracts cleaned-up text and HTML content from any webpage by stripping out the extra junk that so many pages carry these days.

Check out the Go documentation page for full usage and examples.


Features

  • Main content extraction from almost any source
  • Extract HTML content with images
  • Get article metadata, publish dates, and a lot more
  • Recognize different content types and apply special extractions (currently only recognizes comic sites and normal sites)

Planned

  • Inline videos into HTML content when found in an article
  • Recognize news sources and extract corresponding video / audio content
  • Recognize and extract more types of content
  • An interesting idea: buriy/python-readability#57 (comment)
