Ideas & history.md


Name recognition

It all started when I noticed that animage.tumblr.com hosts most of its images on third-party servers instead of tumblr itself, and that those servers keep the original filenames intact.
I found out that animage often follows a certain pattern when naming the photos, like (digit)?(shortened person name).jpg
(Pictured: just a few of the 22k images I dumped from his blog.)

This gave me the idea to create a script that could match all the different ways a single name is shortened (sometimes a nickname was used too) to the name itself.
At first the database had multiple records of partial names pointing to a single full name. However, probably only a third or a quarter of all images followed this structure, so its use was rather limited.
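
As a rough illustration of that first approach, here is a minimal Python sketch (not the original userscript). The alias table and the names in it are made up for the example, and the regular expression is only one possible reading of the (digit)?(shortened person name).jpg pattern.

```python
import re

# Hypothetical alias table: several shortened forms point to one full name.
ALIASES = {
    "haru": "Haruka Example",
    "haruka": "Haruka Example",
    "mio": "Mio Sample",
}

# One optional leading digit, then the shortened name, then the .jpg extension.
FILENAME_RE = re.compile(r"^(\d)?([a-z]+)\.jpg$", re.IGNORECASE)


def full_name_from_filename(filename):
    """Return the full name for a file like '2haru.jpg', or None if unknown."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None  # the filename does not follow the naming pattern
    return ALIASES.get(match.group(2).lower())
```

Files that do not follow the pattern simply yield `None`, which is the case for the majority of images mentioned above.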

It was at that time that I realized I could use the tags themselves to unambiguously define the names of the people in the pictures, and more. Fortunately animage was quite responsible in his approach to posting images, and every picture he shared had all the required tags attached. Many bloggers have a lot to learn from his ways, I believe. At that time I didn't know about the tumblr API yet, so I was trying to collect the tags from the page itself. That was trivial for a single blog, but when I decided to make the script universally applicable, detecting tags in all kinds of different tumblr themes became nearly impossible.

Fortunately, the discovery of the publicly available tumblr API allowed me to solve this problem. Still, requesting post info from it requires the post ID, which currently has to be collected from the page anyway. That means searching for the post container regardless of different and often faulty designs (just how many times have I seen id attributes being used as classes?). Sometimes the post contents and post meta are in two different containers with separate parent nodes, which further complicates the search. Perhaps I should come up with something even more independent of local designs: something that would figure out which series of posts the user is currently viewing and request info from the API with the appropriate offset and post count. I don't really know how to do that yet. The address bar contains the current page number relative to the start, but every tumblr might have its own number of posts per page, and in the case of the dashboard it's not applicable at all.
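
The offset idea could look roughly like the Python sketch below. It assumes the v2 posts endpoint with its `api_key`, `offset` and `limit` parameters, and it assumes the number of posts per page is already known, which is exactly the part that remains unsolved above. The function name and the API key value are placeholders. As far as I know the API also caps `limit` at 20 per request, so a larger page would need several requests.

```python
from urllib.parse import urlencode


def posts_url(blog, api_key, page, posts_per_page):
    """Build an API URL for the posts shown on a given 1-based page number."""
    # Page 1 starts at offset 0, page 2 at posts_per_page, and so on.
    offset = (page - 1) * posts_per_page
    query = urlencode({
        "api_key": api_key,
        "offset": offset,
        "limit": posts_per_page,
    })
    return f"https://api.tumblr.com/v2/blog/{blog}/posts?{query}"


print(posts_url("animage.tumblr.com", "YOUR_API_KEY", page=3, posts_per_page=10))
```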

At first I only created an Opera version for personal use, but eventually decided to go public, which meant I had to make at least a Chrome port. It was not a pleasant experience. Where everything worked in an intuitively simple and logical way in Opera, it had to be twisted in the most unexpected ways to work in Chrome, especially debugging. Maybe it's just a (Tamper)monkey thing, but Opera supports userscripts natively and it works great.

API and DB usage

The discovery of the API, and of a way to easily store tags and other info across domains thanks to flash cookies, allowed me to come up with a few other helper tools that I have found very useful since. They include a proper two-tag search for tumblr, which implements the intersection of the two sets of images found by each tag, and the tumblr indexer, which is designed to assist in mass downloading of images from tumblr along with their tags.
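
The two-tag search boils down to a set intersection. A minimal Python sketch, assuming the results for each tag have already been fetched into a mapping from tag to post IDs (the mapping and the IDs are made up):

```python
def two_tag_search(posts_by_tag, first_tag, second_tag):
    """Return the IDs of posts that carry both tags, in sorted order."""
    first = set(posts_by_tag.get(first_tag, ()))
    second = set(posts_by_tag.get(second_tag, ()))
    return sorted(first & second)  # set intersection


# Hypothetical per-tag search results.
found = {
    "name a": [101, 102, 105],
    "name b": [102, 103, 105],
}
print(two_tag_search(found, "name a", "name b"))
```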
Compared to existing methods of dumping all images with a certain tag from a tumblr of choice (I couldn't really find a single decent solution, tbh), my script lets you both get a complete list of direct links to the found images (which you can feed to any download manager afterwards), optionally filtered by a blacklist of tags, and automatically populate the database with the tags found for every processed image. This is how I already dumped the entire animage picture database along with the attached tags: 22 thousand images in just half an hour of script work.
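
In outline, one indexing pass might look like this Python sketch. The post structure (`url` and `tags` keys) and the plain dict standing in for the database are assumptions for the example, as is the choice to record tags even for images that the blacklist filters out of the link list.

```python
def index_posts(posts, blacklist, database):
    """Store tags for every image and return links that pass the blacklist."""
    links = []
    for post in posts:
        # Every processed image gets its tags recorded.
        database[post["url"]] = list(post["tags"])
        # Only images without any blacklisted tag go into the link list.
        if blacklist.isdisjoint(post["tags"]):
            links.append(post["url"])
    return links


# Hypothetical posts as they might come back from the API.
posts = [
    {"url": "http://example.com/1.jpg", "tags": ["name a", "magazine"]},
    {"url": "http://example.com/2.jpg", "tags": ["name b", "gif"]},
]
db = {}
links = index_posts(posts, blacklist={"gif"}, database=db)
print("\n".join(links))  # ready to paste into a download manager
```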

Speaking of filtering, I'm planning to capitalize on the newfound possibilities and create a more powerful and customizable alternative to the existing Tumblr Savior addon. Currently the latter only has a blacklist and a whitelist, which limits the decision to checking whether any single tag is in the blacklist, unless at least one tag is in the whitelist. This is not very effective, because a single negative tag might make you miss an entire post you would otherwise want to see, while a single whitelisted tag makes the script powerless. Using the separation into primary tags (names, in my use case) and secondary tags (non-name meta) that I have already implemented, it should become possible to refine filtering to the point where an image is only skipped if all the relevant (primary) tags it has are blacklisted.
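
The difference between the two rules can be sketched in Python as follows. The first function only mirrors the black/white list behaviour as described above, not the actual addon code. The second is a guess at the planned rule, and treating a post with no primary tags as "keep" is an assumption.

```python
def list_based_skip(tags, blacklist, whitelist):
    """Black/white list rule: any blacklisted tag hides the post,
    unless at least one tag is whitelisted."""
    if any(tag in whitelist for tag in tags):
        return False
    return any(tag in blacklist for tag in tags)


def primary_tag_skip(tags, primary_tags, blacklist):
    """Planned rule: skip only if every primary tag on the post is blacklisted."""
    relevant = [tag for tag in tags if tag in primary_tags]
    # Assumption: a post without any primary tags is kept.
    return bool(relevant) and all(tag in blacklist for tag in relevant)


# Hypothetical tags: two names (primary) and one meta tag (secondary).
primary = {"name a", "name b"}
blacklist = {"name b"}
post = ["name a", "name b", "magazine"]
print(list_based_skip(post, blacklist, whitelist=set()))  # hidden by one bad tag
print(primary_tag_skip(post, primary, blacklist))         # kept: "name a" is fine
```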
Separating tags into categories will also help to count how many images there are for every relevant tag, so that folders on disk can be created automatically for the most popular tags and images sorted among them, without secondary tags getting in the way.
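
Counting images per primary tag could be done along these lines. This is only a sketch: the `min_images` threshold and the shape of the tag database (image mapped to its tags) are assumptions.

```python
from collections import Counter


def popular_folders(image_tags, primary_tags, min_images):
    """Return primary tags that have at least min_images images, most popular first."""
    counts = Counter(
        tag
        for tags in image_tags.values()
        for tag in set(tags)       # count each tag once per image
        if tag in primary_tags     # secondary tags are ignored
    )
    return [tag for tag, count in counts.most_common() if count >= min_images]


# Hypothetical database contents.
image_tags = {
    "1.jpg": ["name a", "magazine"],
    "2.jpg": ["name a", "name b", "magazine"],
    "3.jpg": ["name a", "magazine"],
}
print(popular_folders(image_tags, {"name a", "name b"}, min_images=2))
```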

Filename formatting

The idea to translate and format all found tags in a certain way while appending them to the filename dates back to when I discovered a mass booru uploader script. It required the image file names to already contain tags in danbooru-esque style, with spaces replaced by underscores and of course without any unicode characters. Making my image collection follow this pattern would have required manual or semi-automatic renaming of all images. While I had some ideas about creating a set of software tools to simplify this task (I even made a POC version of a quick image sorter back then), it largely remained a prospect, not a reality.
Back to now, I realized I can simply use the existing tagging system of sites like tumblr and automatically convert their tag structure into the one I need. This can be achieved by coupling existing tags with their version in another notation, such as names in kanji with danbooru-styled names in ANSI. Even better, since many images in my collection come from there anyway, I might as well assign a folder to every tag and have the collection sort itself.
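
A minimal Python sketch of such a conversion is below. The translation table, the placeholder name in it and the exact filename layout (original name followed by space-separated tags) are assumptions for illustration. Tags that have no translation and still contain non-ASCII characters are simply dropped here.

```python
import os

# Hypothetical table coupling tags with their danbooru-styled notation.
TRANSLATIONS = {"山田花子": "yamada_hanako"}


def danbooru_tag(tag, translations):
    """Translate a tag, then lowercase it and replace spaces with underscores."""
    tag = translations.get(tag, tag)
    tag = tag.strip().lower().replace(" ", "_")
    return tag if tag.isascii() else None  # no unicode in the result


def tagged_filename(original, tags, translations):
    """Append the converted tags to the file name, keeping the extension."""
    stem, ext = os.path.splitext(original)
    converted = [danbooru_tag(tag, translations) for tag in tags]
    parts = [stem] + [tag for tag in converted if tag]
    return " ".join(parts) + ext


print(tagged_filename("tumblr_abc.jpg", ["山田花子", "Live Photo"], TRANSLATIONS))
```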
Ain't that cool?