DresdenCodak: Fix and improve scraper #230

Open: wants to merge 4 commits into base: master
7 changes: 5 additions & 2 deletions dosagelib/comic.py
@@ -144,7 +144,10 @@
     def _fnbase(self, basepath):
         '''Determine the target base name of this comic file and make sure the
         directory exists.'''
-        comicdir = self.scraper.get_download_dir(basepath)
+        comicpath = os.path.join(
+            self.scraper.get_download_dir(basepath), self.filename
+        )
+        comicdir = os.path.dirname(comicpath)
         if not os.path.isdir(comicdir):
             os.makedirs(comicdir)
-        return os.path.join(comicdir, self.filename)
+        return comicpath

CI annotations on dosagelib/comic.py:
  Jenkins - Dosage / Flake8:
    line 151: C812 (LOW) missing trailing comma
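
The effect of this hunk is that a filename containing a subdirectory now gets that subdirectory created, because the directory to create is derived from the full target path rather than from the download directory alone. A minimal sketch of that behaviour follows. `fnbase` is a hypothetical stand-alone stand-in for the method and takes the download directory directly instead of calling `self.scraper.get_download_dir`.

```python
import os
import tempfile


def fnbase(download_dir, filename):
    """Hypothetical stand-in for the patched _fnbase: build the target path
    and create its parent directory, including any subdirectory that is
    part of filename."""
    comicpath = os.path.join(download_dir, filename)
    comicdir = os.path.dirname(comicpath)
    if not os.path.isdir(comicdir):
        os.makedirs(comicdir)
    return comicpath


with tempfile.TemporaryDirectory() as tmp:
    # A name with a subdirectory, as the DresdenCodak namer produces.
    path = fnbase(tmp, os.path.join("Dark Science", "001"))
    created = os.path.isdir(os.path.join(tmp, "Dark Science"))
```

With the pre-patch logic only `download_dir` itself would have been created, so writing to the nested path would fail.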
77 changes: 70 additions & 7 deletions dosagelib/plugins/d.py
@@ -329,17 +329,80 @@


 class DresdenCodak(_ParserScraper):
-    url = 'http://dresdencodak.com/'
-    startUrl = url + 'cat/comic/'
-    firstStripUrl = url + '2007/02/08/pom/'
-    imageSearch = '//section[d:class("entry-content")]//img[d:class("aligncenter")]'
+    from datetime import datetime
+
+    url = "https://dresdencodak.com/"
+    firstStripUrl = url + "2005/06/08/the-tomorrow-man/"
+    imageSearch = '(//section[d:class("entry-content")]//img[d:class("size-full") and not (contains(@alt, "revious") or contains(@alt,"irst") or contains(@alt,"ext"))])[1]'
+    textSearch = '//section[d:class("entry-content")]//p[(4 < position()) and (position() < (last() - 1))]'
+    textOptional = True
     prevSearch = '//a[img[contains(@src, "prev")]]'
     latestSearch = '//a[d:class("tc-grid-bg-link")]'
     starter = indirectStarter

-    # Blog and comic are mixed...
-    def shouldSkipUrl(self, url, data):
-        return not data.xpath(self.imageSearch)
+    # Haven't found a better way to distinguish whether or not a page is part
+    # of Hob than by the date prefix.
+    date_format = "%Y-%m-%d"
+    hob_start = datetime.strptime("2007-02-08", date_format)
+    hob_end = datetime.strptime("2008-10-22", date_format)
+
+    pagenumber_re = compile(
+        "(?:[0-9]+-)*[^0-9]+_([0-9]+)(?:a|b|-1|_001|-[0-9]+x[0-9]+)?\.jpg$"
+    )
+
+    def getPrevUrl(self, url, data):
+        # Fix skipping newest One-Off
+        if url == self.url + "2010/06/03/dark-science-01/":
+            newurl = self.url + "category/oneoffs/"
+            return self.fetchUrl(
+                newurl, self.getPage(newurl), self.latestSearch
+            )
+        return super(DresdenCodak, self).getPrevUrl(url, data)
+
+    def namer(self, image_url, page_url):
+        import os.path
+
+        filename = image_url.rsplit("/", 1)[-1]
+        # The archives are divided into three parts:
+        # Dark Science, Hob and One-Offs
+        if filename.startswith("ds"):
+            filename = filename[:2] + "_" + filename[2:]
+        elif filename == "84_new.jpg":
+            # Single anomalous page
+            filename = "ds_84.jpg"
+        elif filename == "cyborg_time.jpg":
+            filename = os.path.join("Dark Science", "84b.jpg")
+        elif "act_4" in filename:
+            filename = os.path.join("Dark Science", "80b.jpg")
+        elif "act_3" in filename:
+            filename = os.path.join("Dark Science", "38b.jpg")
+        elif "act_2" in filename:
+            filename = os.path.join("Dark Science", "18b.jpg")
+
+        if filename.startswith("ds_") or "-dark_science_" in filename:
+            # Dark Science
+            import re
+
+            pagenumber = re.match(self.pagenumber_re, filename).group(1)
+            filename = os.path.join(
+                "Dark Science", "{0:0>3}".format(pagenumber)
+            )
+        elif "/" not in filename:
+            # Hob
+            from datetime import datetime
+
+            date_prefix = page_url.rsplit("/", 5)[-5:-2]
+            date = datetime(*(int(i) for i in date_prefix))
+            if self.hob_start <= date <= self.hob_end:
+                filename = os.path.join("Hob", filename)
+            else:
+                # One-Offs
+                year_day_prefix = date.strftime("%Y-%m-%d")
+                filename = os.path.join(
+                    "One-Offs", "{0}-{1}".format(year_day_prefix, filename)
+                )
+
+        return filename


 class DrFun(_ParserScraper):

CI annotations on dosagelib/plugins/d.py:
  Jenkins - Dosage / Flake8:
    line 336: E501 (HIGH) line too long (172 > 100 characters)
    line 337: E501 (HIGH) line too long (107 > 100 characters)
    line 350: W605 (NORMAL) invalid escape sequence '\.'
    lines 350, 358, 388, 402: C812 (LOW) missing trailing comma
  Codecov / codecov/patch:
    added lines #L356-L357, #L360, #L363, #L365, #L369, #L372, #L374, #L376,
    #L378, #L380, #L384, #L386-L387, #L392, #L394, #L397, #L400-L401 and
    #L405 were not covered by tests
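
The Dark Science branch of `namer` extracts a page number with `pagenumber_re` and left-pads it to three digits. The sketch below isolates that step so it can be tried without the scraper. It writes the same pattern as a raw string, which is one way to address the W605 annotation, and adds a guard for names that do not match. The helper name `dark_science_page` and the sample filenames are made up for illustration and are not taken from the site.

```python
import re

# Same pattern as in the diff, written as a raw string so that "\." reaches
# the regex engine unchanged instead of being an invalid string escape.
PAGENUMBER_RE = re.compile(
    r"(?:[0-9]+-)*[^0-9]+_([0-9]+)(?:a|b|-1|_001|-[0-9]+x[0-9]+)?\.jpg$",
)


def dark_science_page(filename):
    """Hypothetical helper: return the zero-padded page number of a
    Dark Science image filename."""
    match = PAGENUMBER_RE.match(filename)
    if match is None:
        # The code in the diff would raise AttributeError here instead.
        raise ValueError("unrecognised Dark Science filename: " + filename)
    # Pad with "0" on the left to a width of three characters.
    return "{0:0>3}".format(match.group(1))


# Illustrative (made-up) filenames covering a plain name, a dated prefix
# and a letter suffix.
samples = ["ds_84.jpg", "2013-05-dark_science_25.jpg", "ds_7b.jpg"]
pages = [dark_science_page(name) for name in samples]
```

Padding to a fixed width keeps the downloaded pages in reading order when the directory is sorted by name.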
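
For images that are not Dark Science pages, `namer` decides between Hob and One-Offs from the date embedded in the page URL. The following sketch reproduces only that decision, using the Hob date range from the diff. The function names `page_date` and `section_for` are hypothetical. Like the code in the diff, it assumes page URLs of the form `.../YYYY/MM/DD/slug/` with a trailing slash. Without the trailing slash the slice would pick up the host name and the conversion to `int` would fail.

```python
from datetime import datetime

# Date range of Hob, as given in the diff.
HOB_START = datetime(2007, 2, 8)
HOB_END = datetime(2008, 10, 22)


def page_date(page_url):
    """Hypothetical helper: parse the /YYYY/MM/DD/ prefix of a page URL.
    Assumes the URL ends with "slug/" (trailing slash included)."""
    year, month, day = page_url.rsplit("/", 5)[-5:-2]
    return datetime(int(year), int(month), int(day))


def section_for(page_url):
    """Hypothetical helper: name the archive section by publication date."""
    date = page_date(page_url)
    return "Hob" if HOB_START <= date <= HOB_END else "One-Offs"


hob_page = section_for("https://dresdencodak.com/2007/02/08/pom/")
oneoff_page = section_for("https://dresdencodak.com/2005/06/08/the-tomorrow-man/")
```

Both sample URLs appear in the diff as the old and new `firstStripUrl`.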
2 changes: 1 addition & 1 deletion dosagelib/util.py
@@ -415,7 +415,7 @@ def getFilename(name):
     """Get a filename from given name without dangerous or incompatible
     characters."""
     # first replace all illegal chars
-    name = re.sub(r"[^0-9a-zA-Z_\-\.]", "_", name)
+    name = re.sub(r"[^0-9a-zA-Z_ /\-\.\\]", "_", name)
     # then remove double dots and underscores
     while ".." in name:
         name = name.replace('..', '.')
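
The only change in this hunk is the character class: spaces, forward slashes and backslashes are no longer replaced with underscores, which is what lets names such as `Dark Science/001.jpg` keep their subdirectory. The sketch below compares just this first substitution step under the old and new patterns. It does not reproduce the rest of `getFilename`, and the constant names are illustrative.

```python
import re

# First substitution step of getFilename, before and after the change.
OLD_ILLEGAL = re.compile(r"[^0-9a-zA-Z_\-\.]")
NEW_ILLEGAL = re.compile(r"[^0-9a-zA-Z_ /\-\.\\]")

name = "Dark Science/001.jpg"
old_result = OLD_ILLEGAL.sub("_", name)  # space and "/" become "_"
new_result = NEW_ILLEGAL.sub("_", name)  # space and "/" are kept

# Characters outside both classes are still replaced, e.g. ":".
other = NEW_ILLEGAL.sub("_", "a\\b:c")
```

One consequence worth checking in review is that any caller relying on `getFilename` to flatten a name into a single path component will now receive path separators unchanged.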