Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DresdenCodak: Fix and improve scraper #230

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

toonn
Copy link
Contributor

@toonn toonn commented Sep 21, 2022

The scraper was broken due to the site layout changing. The structure is lacking so there are many unique cases to deal with.

As the comic is separated into two storylines, one finished and one ongoing, and a series of one-offs that don't fit in either storyline I've put each of the series into its own subdirectory.

To make this possible I've changed how the comicdir is guaranteed to exist, taking into account path components in the filename and I've added path separators in the allowed characters for filenames.

I've also added space as an allowed character in filenames because I want them and there's not really a reason to disallow them. This together with allowing both / and \ could be a breaking change, since scrapers that generate file names containing them will no longer have them filtered out.

If that's not viable then maybe we need a new hierarchy of objects that do allow filenames to specify subdirectories and then scrapers can opt into it one at a time.

@codecov
Copy link

codecov bot commented Sep 21, 2022

Codecov Report

Attention: 31 lines in your changes are missing coverage. Please review.

Comparison is base (a23e070) 82.10% compared to head (fb0ea00) 81.75%.
Report is 23 commits behind head on master.

Files Patch % Lines
dosagelib/plugins/d.py 27.90% 31 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #230      +/-   ##
==========================================
- Coverage   82.10%   81.75%   -0.35%     
==========================================
  Files          79       79              
  Lines        6649     6687      +38     
  Branches      525      536      +11     
==========================================
+ Hits         5459     5467       +8     
- Misses       1071     1101      +30     
  Partials      119      119              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@toonn
Copy link
Contributor Author

toonn commented Sep 22, 2022

Is code coverage of scrapers a requirement or just a helpful hint for maintainers to keep an eye on things?

@TobiX
Copy link
Member

TobiX commented Sep 22, 2022

The coverage is just advisory. It would be nice to add some tests for more complicated scrapers (I broke some scrapers by refactoring and only noticed month later), but not required. If you want to add tests, you can find some inspiration in https://github.com/webcomics/dosage/blob/master/tests/test_modules.py

@TobiX
Copy link
Member

TobiX commented Oct 1, 2022

I'm really not sure if I want to allow modules to mess with sub-directories... (I'm not totally opposed, but also not really convinced that the feature is that useful...) I'm pretty sure some modules actually rely on Dosage stripping the path from the "target" file name, so that's one reason I'm a bit uncomfortable with this change...

@toonn
Copy link
Contributor Author

toonn commented Oct 3, 2022

I understand, but I do really want subdirectories (and spaces in names), hence the suggestion of having a bit of a double hierarchy of objects. So we can leave everything as is but things that want to make use of the feature can use _ParserScraperNG to inherit from instead. (Or this could also be a configuration set by scrapers if you would rather have a boolean flag enabling the behavior somewhere.)

The path separators are necessary if we want to be able to organize
comics into books or chapters or categories.

The spaces are nice to have. I assume they haven't been allowed because
they require escaping in a terminal but any modern terminal provides
assistance for that during tab completion. Disallowing spaces here means
they can never be used. Instead a method could be added to an
appropriate class to sanitize filenames and replace spaces by default if
that is actually desired.
If the filename is a path we need to ensure all its components exist to
avoid errors later when we try to create the file at that path.
The scraper was broken due to the site layout changing. The structure is
lacking so there are many unique cases to deal with.

As the comic is separated into two storylines, one finished and one
ongoing, and a series of one-offs that don't fit in either storyline
I've put each of the series into its own subdirectory.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants