
[bunkr] fixed extractor #4529

Closed
wants to merge 6 commits into from

Conversation


@Yakabuff Yakabuff commented Sep 14, 2023

#4514

  • The cdn link can now be either /v/ or /i/, depending on whether it's a video or an image
  • The location of the link has been adjusted
  • Image URLs now need to be unescaped (see the sketch after this list)
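
As a rough Python sketch of the behaviour described in these bullets (an illustration only, not the actual PR code; extract_source and its arguments are made-up names), the extractor picks which markup to search based on whether the page is a /v/ video page or an /i/ image page, then HTML-unescapes the extracted source URL:

from html import unescape

def extract_source(page_url, page_html):
    # /v/ pages embed the file in a <source> tag, /i/ pages in an <img> tag
    if "/v/" in page_url:
        marker = '<source src="'
    else:
        marker = '<img src="'
    start = page_html.index(marker) + len(marker)
    end = page_html.index('"', start)
    # the extracted URL can be HTML-escaped (e.g. &amp;), so unescape it
    return unescape(page_html[start:end])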

@Yakabuff Yakabuff marked this pull request as draft September 14, 2023 02:07
@Yakabuff Yakabuff marked this pull request as ready for review September 14, 2023 02:50
@Yakabuff
Author

@mikf

@HeavenlyVice

I was testing this fix out and ran into an issue. It's possible I did something incorrectly, but as I read the code in this PR, it pulls the CDN URL from the download page as the string found between <source src=" and the next " for videos, or between <img src=" and the next " for images, which gives the full source URL for that album/gallery item. However, it then truncates that URL to only the CDN root (i.e., 'https://media-files12.bunkr.la/') and appends the end of the {self} URL starting from 'v/' or 'i/'. This works for the test album because the image's page link, https://bunkrr.su/i/test-%E3%83%86%E3%82%B9%E3%83%88-%22&%3E-QjgneIQv.png, is the same as the file name.
However, this isn't usually the case for files on bunkrr in my experience. For example, I just grabbed a random bunkrr album here (NSFW content, as most of the public albums seem to be):
https://bunkrr.su/a/aZM5f6WS

Opening the first file goes to the link:
https://bunkrr.su/v/kLG2yrlpg7DSk

However, the src URL for the download is:
https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4

Currently, this code grabs 'https://media-files12.bunkr.la/', appends the 'v/kLG2yrlpg7DSk' from the page link for that item, and tries to download 'https://media-files12.bunkr.la/v/kLG2yrlpg7DSk', which 404s as an invalid link. I tried commenting out lines 106 and 107 and adding 'url = cdn' to just use the https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4 link pulled by @Yakabuff's if/else statement, which worked... for the first item in the gallery. It doesn't iterate through all of the files in the gallery, because the headers variable iterates over the 'v/kLG2yrlpg7DSk' URLs instead of the actual download links, so it just attempts to download the same https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4 file over and over, 844 times I believe (once for each file in the gallery). I can look into this later, but I don't have the time to dive into the code right this second. Figured I'd explain all this here in case @Yakabuff is able to correct it quickly, or in case there's something I'm missing in how I was testing it such that this PR will actually resolve the bunkrr.su issue.
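
To make the failure mode concrete, here is a small illustrative snippet (values copied from the example above; the variable names are made up) showing why truncating to the CDN root and re-appending the /v/<id> page path produces a 404 when the page path is an opaque id rather than the file name:

cdn = "https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4"
page_url = "https://bunkrr.su/v/kLG2yrlpg7DSk"

cdn_root = cdn[:cdn.index("/", 8)]         # "https://media-files12.bunkr.la"
page_path = page_url.split("/", 3)[3]      # "v/kLG2yrlpg7DSk"
guessed = cdn_root + "/" + page_path       # ".../v/kLG2yrlpg7DSk" -> 404
actual = cdn                               # the real download URL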

@HeavenlyVice


It seems that removing Line 96 if not cdn: works along with the changes I mentioned (rough sketch of the result after this list). So:

  • Remove Line 96 if not cdn:
  • Remove Line 106 cdn = cdn[:cdn.index("/", 8)]
  • Remove Line 107 url = cdn + url[2:]
  • Add url = cdn as a new Line 105 after the end of the else statement.
  • Adjust indentation for Lines 97-105 accordingly.
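
A rough sketch of what the per-file logic looks like with those edits applied (fetch_page and the other names are hypothetical stand-ins; only the "use the extracted src URL as-is" part reflects the suggestion above):

def album_file_urls(file_page_paths, fetch_page):
    # fetch_page is a stand-in for whatever request helper the extractor uses
    urls = []
    for path in file_page_paths:                       # e.g. "/v/kLG2yrlpg7DSk"
        page = fetch_page("https://bunkrr.su" + path)
        marker = '<source src="' if path.startswith("/v/") else '<img src="'
        start = page.index(marker) + len(marker)
        # use the full src URL directly instead of truncating it to the CDN root
        urls.append(page[start:page.index('"', start)])
    return urls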

@Yakabuff
Author

Yakabuff commented Sep 14, 2023

@HeavenlyVice Thanks, I will try implementing that and do further testing

@Yakabuff
Author

Yakabuff commented Sep 15, 2023

Yeah, there seem to be two different formats: https://bunkrr.su/v/<id> and https://bunkrr.su/v/<filename>
Maybe it has something to do with their ongoing migration?

This means we will have to either:

  1. Make additional requests for every image to fetch the actual CDN URL, since we can no longer reliably just concatenate the URL to the CDN root. This is much slower and uses twice as many requests, but it is more future-proof.
  2. Assume that if the first page URL's filename appears in the first CDN URL, all subsequent files in the album are in the https://bunkrr.su/v/<filename> format and we can safely use the concatenation method. If not, assume they are in the https://bunkrr.su/v/<id> format and make a request to fetch the CDN URL for every item in the album (see the sketch after this list).
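
A hedged sketch of how option 2's heuristic might look (this is one interpretation of the idea above, not code from the PR; the helper name is made up):

def choose_strategy(first_page_path, first_cdn_url):
    # first_page_path e.g. "/v/Woods-KbuqDmbn.mp4" or "/v/kLG2yrlpg7DSk"
    name = first_page_path.rsplit("/", 1)[-1]
    if name in first_cdn_url:
        # /v/<filename> style: cdn root + filename concatenation is safe
        return "concatenate"
    # /v/<id> style: fetch every file page to get its CDN URL
    return "fetch-each-page"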

@Yakabuff
Author

@mikf @HeavenlyVice Seems to work now. I am able to download links in both the https://bunkrr.su/v/<id> and https://bunkrr.su/v/<filename> formats. From the looks of it, they are transitioning to the https://bunkrr.su/v/<id> format, as it is used in new albums.

@HeavenlyVice


Yep, looks good to me. It may run slower than it did before, but this should at least address the issue and get it working again; maybe we can figure out a faster way of doing it later. The only other note I'd have is to remove or update the comment on Line 96, since we're no longer grabbing just the cdn root but the entire download link, to keep things clear for anyone working on the Bunkr extractor in the future.

I'm also not sure if it's worthwhile, but it might help to print a progress notification in the CLI so users know it's actually working while it first grabs all of the URLs for larger albums. When I was initially testing, I wasn't sure it was working because it just hung until I escaped and re-ran it with the --verbose flag. It doesn't take a massive amount of time, but I made the mistake of testing on a larger album first (844 items, I believe), so it did take a few minutes to run through them all. It could just print something along the lines of "Fetching download URLs..." or "Processing album information...". Just a thought.

@sixinchfootlong

The reason it's slower now is that it has to fetch a separate page for each and every file it's going to download.
There is a faster way, but it's going to require redoing how the extractor parses out URLs, because it needs information from three separate locations:

  1. The CDN hostname from one of the download pages. This can be cached for the whole album.
  2. Within the gallery, the filename portion of the thumbnail URL. The thumbnail will have a different CDN and the wrong file extension but other than that the filename is correct.
  3. The displayed file name so that we can replace the thumbnail's .png extension.

Example:

  • CDN hostname: media-files12.bunkr.la
  • Thumbnail filename: Woods-KbuqDmbn-rftcXF0I2H1v.png (bonus: if the filename contains non-ASCII characters, they're already stripped out of the thumbnail name)
  • Displayed filename: Woods-KbuqDmbn.mp4

Resulting download URL: https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4

Unfortunately, this sort of parsing isn't well suited for text.extr and would probably be easier with an actual HTML parser.
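
For reference, the filename reconstruction itself (setting aside the HTML parsing) is simple string work. A minimal sketch using the example values above, with a hypothetical helper name:

import os

def build_download_url(cdn_host, thumb_name, display_name):
    stem = os.path.splitext(thumb_name)[0]    # "Woods-KbuqDmbn-rftcXF0I2H1v"
    ext = os.path.splitext(display_name)[1]   # ".mp4"
    return "https://{}/{}{}".format(cdn_host, stem, ext)

# build_download_url("media-files12.bunkr.la",
#                    "Woods-KbuqDmbn-rftcXF0I2H1v.png",
#                    "Woods-KbuqDmbn.mp4")
# -> "https://media-files12.bunkr.la/Woods-KbuqDmbn-rftcXF0I2H1v.mp4"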

@bhaskoro-muthohar

bhaskoro-muthohar commented Sep 16, 2023

I tried to install it from your branch, but I got

PS E:\> python -m gallery_dl "https://bunkrr.su/a/XJKNZPzj"
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-sexy-secretary-mpC90sg3-lirGsroJ.mp4'
[download][error] Failed to download mvngokitty-sexy-secretary-mpC90sg3-lirGsroJ.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-2-FxbYOFhM-VMuh66dW.mp4'
[download][error] Failed to download mvngokitty-2-FxbYOFhM-VMuh66dW.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/Mvngokitty-Best-friends-mom-oRVWkYpw-U6Wpexp2.mp4'
[download][error] Failed to download Mvngokitty-Best-friends-mom-oRVWkYpw-U6Wpexp2.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/Mvngokitty-Red-Lingerie-Masturbation-Oz8vNwoD-hI3thlMw.mp4'
[download][error] Failed to download Mvngokitty-Red-Lingerie-Masturbation-Oz8vNwoD-hI3thlMw.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/mvngokitty-gym-buddy-creampie-b0QifI19-VEOuHMdp.mp4'
[download][error] Failed to download mvngokitty-gym-buddy-creampie-b0QifI19-VEOuHMdp.mp4
[downloader.http][warning] '403 Forbidden' for 'https://big-taco-1.bunkr.ru/MvngoKitty-OnlyFans-2019_09_25_5d8bfa7ea895855e90713-Video-l9sWxSjB-wPbhOAfO.mp4'

---edit---

I tried to download it manually but got DDoS-Guard T_T
[screenshot: DDoS-Guard block page]

@sixinchfootlong

@bhaskoro-muthohar that's not a problem with the code. You need to set a browser User-Agent string in your config or you'll get blocked.
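
For reference, one way to do this is through gallery-dl's JSON configuration file (the file location depends on your setup, and the User-Agent string below is only an example of a current browser UA, not a specific recommendation):

{
    "extractor": {
        "bunkr": {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0"
        }
    }
}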

@bhaskoro-muthohar


What is the User-Agent value for bunkr you recommend?

mikf added a commit that referenced this pull request Oct 1, 2023
@mikf mikf closed this Oct 1, 2023