Set to remove old media directory before exporting it again to clean unused files #212

Merged · 1 commit merged into Stvad:master on Sep 15, 2024

Conversation

evandrocoan (Contributor)

No description provided.

aplaice (Collaborator) commented Aug 2, 2024

Thanks very much for the contribution (as always!)!

Clearing out all unused media files is indeed what we should be doing!

I'll need to check whether there's any risk of undesired removal of files elsewhere due to undefined variables (I think not, but best to be sure). Maybe it's best to just remove all files *in* the media directory, rather than do (the equivalent of) `rm -r media/`, to limit the potential damage scope, just in case we modify something elsewhere? I'll also check whether there's a performance slow-down (almost certainly not, since we're simply copying everything anyway in the next step). (In a couple of days, since I won't have time in the immediate future.)
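For illustration, a minimal sketch of the "remove only the files inside the media directory" variant (the function and variable names here are hypothetical, not CrowdAnki's actual API): unlike a recursive delete, a wrongly-set path can at worst cost one directory level of files.

```python
import os

def clear_media_directory(media_directory_path):
    """Delete regular files directly inside the media directory.

    Deliberately non-recursive: even if media_directory_path were ever
    set to something unexpected, only one level of files is at risk,
    not an entire directory tree (as with shutil.rmtree / rm -r).
    """
    for entry in os.scandir(media_directory_path):
        if entry.is_file(follow_symlinks=False):
            os.remove(entry.path)
```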

aplaice merged commit a441f96 into Stvad:master on Sep 15, 2024
2 checks passed
aplaice (Collaborator) commented Sep 15, 2024

Thanks again! Sorry for the delay!

Testing with a synthetic deck with 10000 randomly generated PNGs, it seems that, with your PR applied:

  1. for a fresh export, there's no performance regression
  2. for an export over an existing collection, there's a performance gain
  3. for snapshotting, there doesn't seem to be much of a difference because everything is dwarfed by the git/dulwich operations...

AFAICT there's currently no issue with the code from a "safety" perspective: the deck_directory is ultimately set by the user (both for export and snapshotting), so it could technically be anything, but we only ever affect /media. However, I'm still uncomfortable, given the various famous `rm -rf $ACCIDENTALLY_UNSET_VAR/*` bugs. (I'm not quite sure what would trigger the issue in our case, and Python doesn't have footguns as huge as Bash's, but it's not impossible that some otherwise innocuous change elsewhere could cause the media_directory variable to be set to something undesirable in some edge case.) I'll try to reduce the potential fallout in these cases, if I can do it without significant performance regressions!
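A sketch of one way to limit that fallout, assuming a deck_directory value as described above (the function name and the exact checks are hypothetical): refuse to delete anything unless the resolved path actually looks like a media directory.

```python
from pathlib import Path

def safe_clear_media(deck_directory, media_dirname="media"):
    """Clear the media directory's files, with basic sanity checks."""
    media_dir = (Path(deck_directory) / media_dirname).resolve()
    # Guard against the rm -rf $UNSET_VAR/* failure mode: bail out if
    # the path resolved to the filesystem root or isn't named "media".
    if media_dir == Path(media_dir.anchor) or media_dir.name != media_dirname:
        raise ValueError(f"refusing to clear suspicious path: {media_dir}")
    for child in media_dir.iterdir():
        if child.is_file():
            child.unlink()
```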

evandrocoan (Contributor, Author)

Thanks for looking into this. The next step would be not deleting the complete media directory, but selectively removing missing files and copying only new files. This optimization matters because my collection has several images and audio files for each note, and always copying all the media takes a minute of I/O.
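A rough sketch of that selective approach, assuming a flat source media directory as the source of truth (sync_media and its parameter names are hypothetical). Note that it deliberately skips files present on both sides, which is exactly the "changed file" edge case raised below:

```python
import shutil
from pathlib import Path

def sync_media(source_dir, target_dir):
    """One-way sync: drop stale files from the target, copy only new ones."""
    source_path, target_path = Path(source_dir), Path(target_dir)
    target_path.mkdir(parents=True, exist_ok=True)

    source_names = {p.name for p in source_path.iterdir() if p.is_file()}
    target_names = {p.name for p in target_path.iterdir() if p.is_file()}

    for name in target_names - source_names:   # no longer referenced
        (target_path / name).unlink()
    for name in source_names - target_names:   # genuinely new
        shutil.copy2(source_path / name, target_path / name)
```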

aplaice (Collaborator) commented Sep 15, 2024

> This optimization matters because my collection has several images and audio files for each note, and always copying all the media takes a minute of I/O.

Sounds pretty annoying! :(

> The next step would be not deleting the complete media directory, but selectively removing missing files and copying only new files.

One (relatively rare but still important) concern is that existing files may have changed. Checking the mtime of each file is probably good enough, but there might be some edge cases where it isn't, so one might need to compare hashes instead (though even that might still be faster than just blindly copying everything).
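For instance, a sketch of that comparison (files_match is a hypothetical helper, not part of the PR): size and mtime first, with a hash fallback for the edge cases.

```python
import hashlib
from pathlib import Path

def files_match(src: Path, dst: Path, check_hash: bool = False) -> bool:
    """Cheap metadata comparison first, optional content hash for edge cases."""
    s, d = src.stat(), dst.stat()
    if s.st_size != d.st_size:
        return False
    if not check_hash:
        # mtime is a heuristic: it can be fooled e.g. by a file restored
        # from backup with its old timestamp but different contents.
        return int(s.st_mtime) == int(d.st_mtime)
    # Hashing reads both files, but is still cheaper than always copying:
    # a copy does the same reads *plus* all the writes.
    src_digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dst_digest = hashlib.sha256(dst.read_bytes()).hexdigest()
    return src_digest == dst_digest
```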
