Scraper post-process script for RSSGuard ( https://github.com/martinrotter/rssguard )

Arguments - each is a CSS selector ( https://www.w3schools.com/cssref/css_selectors.asp ):
1) item
2) item title (optional - otherwise the link's text is used as the title)
3) item description (optional - otherwise all the text inside the item is used as the description)
4) item link (optional - otherwise the 1st link found inside the item is used (or the item itself, if it is a link))
5) item title 2nd part (optional (or auto-enabled if the static main title / multilink option is used) - otherwise just the title is used; e.g. the title is "Batman" and the 2nd part is "chapter 94")
6) item date (optional - otherwise every item's date would be "just now") - aim this selector either at text nodes (e.g. `span`) or at elements (`a`, `img`) with a `title` or `alt` attribute containing the date (e.g. "New!" flashing image badges that show the date when hovered over)
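Date strings like "05/04/2023" are ambiguous; the `?` prefix (listed with the other prefixes below) selects the American reading. A sketch of the two interpretations, using stdlib `datetime` purely as an illustration - the script itself parses dates with the Maya library:

```python
# Illustration only (stdlib datetime, not the Maya library the script
# actually uses): the `?` prefix tells the parser to read ambiguous
# dates in the American Month/Day/Year order.
from datetime import datetime

raw = "05/04/2023"
american = datetime.strptime(raw, "%m/%d/%Y")   # with the ? prefix: May 4th
day_first = datetime.strptime(raw, "%d/%m/%Y")  # day-first reading: April 5th

print(american.date())   # -> 2023-05-04
print(day_first.date())  # -> 2023-04-05
```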
Special argument prefixes:
- for 1) item:
  - `@` at start - enables searching for multiple links inside the found item, e.g. one `div` item with multiple `a` links inside it that you want as separate feed items
- for everything after 1) item:
  - `~` as the whole argument - lets the script decide what to do (the default action), e.g. use the 1st link found inside the item, use the whole text inside the item as the description etc. (not actually an option, but rather a placeholder format for the argument line), e.g. `python css2rss.py div.itemclass ~ span.description` (here the link's inner text (2nd argument) will be used as the title by the default action, while the description is still looked for (3rd argument))
- for 2) title, 5) item title 2nd part and 3) item description:
  - `!` at start - makes it a static specified value (whatever follows the `!`), e.g. `"!my title"`; if you make the 1st part of the title static, the 2nd-part title addon gets auto-enabled and uses the text inside the found link as the 2nd part (unless you specify what to use manually as the 5th argument)
- for 2) title and 5) item title 2nd part:
  - `$` at start - evaluates a Python code expression instead of a CSS selector; it uses the found item link as a starting point and takes the text from `eval("tLink." + your_inputted_argument).text`, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for things you can do with it - e.g. go one level up (to the parent element) or to the next element - or select elements CSS selectors can't select, see the example below
- for 6) date:
  - `?` at start - tells the parser to expect the American date format "Month/Day/Year"
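A minimal sketch of what the `$` prefix makes possible via BeautifulSoup navigation (illustration only, with made-up HTML; `tLink` mirrors the variable name quoted above):

```python
# Hypothetical sketch of the `$` prefix: the text after `$` is evaluated
# as a Python expression on the found link tag (here `tLink`), mirroring
# eval("tLink." + your_inputted_argument).
from bs4 import BeautifulSoup

html = '<div class="title">Chapter 94 <a href="/c/94">Read</a></div>'
soup = BeautifulSoup(html, "html.parser")
tLink = soup.find("a")

# "$contents[0]" selects the link's first child node:
print(tLink.contents[0])                  # -> Read
# "$parent.contents[0]" reaches the bare text node before the link -
# a text node no CSS selector could address:
print(str(tLink.parent.contents[0]).strip())  # -> Chapter 94
```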
Notes:
- 1) item is searched in the whole document, while the rest is searched inside the found item node (but you can make the item selector point right at the `a` hyperlink - it will then be used by default)
- use quotes `"` around selectors that contain spaces, e.g. `python css2rss.py div.class "div.subclass > h1.title" span.description` (btw, you can also enclose arguments without any spaces into quotes if you'd like)
- if no item is found, a feed item is generated with the HTML dump of the whole page, so you can see what went wrong (e.g. a Cloudflare block page)
- content you need to log in first to see is available - the scraper uses RSSGuard's cookies, so if you log into a website using RSSGuard's built-in browser, the scraper can access that content as well and scrape it into a feed
- no JavaScript runs on scraped pages, so sites which populate their content with JavaScript can't be scraped; instead their initial version (what you'd see in right click -> view page source) gets scraped. You could try to get the needed content from other pages of the site, e.g. the main page, the releases page or even the search page - one of these could be static and not constructed with JavaScript
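The scoped search and default actions described above can be sketched with BeautifulSoup (an illustrative approximation, not the actual css2rss.py code; the HTML and selectors are made up):

```python
# Illustrative approximation (not the actual css2rss.py code) of the
# default actions: items are matched in the whole document, everything
# else only inside each found item, and the 1st link inside an item
# supplies both the link and (by default) the title.
from bs4 import BeautifulSoup

html = """
<div class="item"><h4>Batman</h4> <a href="/b/94">chapter 94</a></div>
<div class="item"><h4>Robin</h4> <a href="/r/7">chapter 7</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

feed = []
for item in soup.select("div.item"):                     # 1) item - whole document
    link = item if item.name == "a" else item.find("a")  # 4) default link
    feed.append({
        "title": link.get_text(strip=True),              # 2) default: link's text
        "link": link.get("href"),
        "description": item.get_text(" ", strip=True),   # 3) default: item's text
    })

print(feed[0]["title"], feed[0]["link"])  # -> chapter 94 /b/94
```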
Installation:
1.1. Have Python 3 or newer ( https://www.python.org/downloads/ ) installed (and added to PATH during install)
1.2. Have BeautifulSoup ( https://www.crummy.com/software/BeautifulSoup/ ) installed (Win+R -> cmd -> Enter -> `pip install beautifulsoup4`)
1.3. (optional) If you'd like to parse dates for articles, have Maya ( https://github.com/timofurrer/maya/ ) installed (right-click the Start menu -> run PowerShell as administrator -> `pip install maya`)
2. Put css2rss.py into your `data4` folder (so you can call the script with just `python css2rss.py`, otherwise you'd need to specify the full path to the `.py` file)
Examples:
- a simple link makeover into an RSS feed (right-click a link -> inspect element -> use its CSS selector):
url: https://www.foxnews.com/media
script: python css2rss.py ".title > a"
(an `a` link right inside an element with the `title` class)
- the reason for implementing static titles:
url: https://kumascans.com/manga/sokushi-cheat-ga-saikyou-sugite-isekai-no-yatsura-ga-marude-aite-ni-naranai-n-desu-ga/
script: python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"
- the reason for implementing searching for multiple links inside one item:
url: https://www.asurascans.com/
script: python css2rss.py "@.uta" "h4" img "li > a" "li > a"
- the reason for implementing eval expressions for titles (since CSS selectors can't select text nodes outside any tags):
url: https://reaperscans.com/
script: python css2rss.py "@div.space-y-4:first-of-type div.relative.bg-white" "p.font-medium" "img" "a.border" "$contents[0]"
url: https://reader.kireicake.com/
script: python css2rss.py @.group a[href*='/series/'] .meta_r ".element > .title a" ".element > .title a"
- an example of parsing dates for articles; here the CSS selector uses OR and looks either for an `a` element (the "New!" badge) with the date inside its tooltip (`title` or `alt`) OR for a `span` element without any child nodes (both these elements are of class `.post-on`):
url: https://drakescans.com/
script: python css2rss.py "@.page-item-detail" ".post-title a" "img" "span.chapter > a" ~ ".post-on > a,.post-on:not(:has(*))"
- the workaround to scrape sites which serve their contents via JavaScript (find a static page - right-click -> view page source - and check whether your text is already there; if so, it's static and not filled in later via JS):
url: https://manhuaus.com/?s=Wo+Wei+Xie+Di&post_type=wp-manga&post_type=wp-manga
script: python css2rss.py ".latest-chap a" "!I'm an Evil God"
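The drakescans-style OR selector can be tried in isolation with BeautifulSoup (an illustrative sketch with made-up HTML, not output from the real site):

```python
# Illustrative sketch of the ".post-on > a,.post-on:not(:has(*))" trick:
# match either an `a` badge inside `.post-on` (date in its `title`
# tooltip) or a bare `.post-on` element with no child elements (date as
# plain text).
from bs4 import BeautifulSoup

html = """
<span class="post-on"><a title="05/14/2023">New!</a></span>
<span class="post-on">05/12/2023</span>
"""
soup = BeautifulSoup(html, "html.parser")

dates = []
for tag in soup.select(".post-on > a,.post-on:not(:has(*))"):
    # Take the title/alt attribute if present, else the element's text.
    dates.append(tag.get("title") or tag.get("alt") or tag.get_text(strip=True))

print(dates)  # -> ['05/14/2023', '05/12/2023']
```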