MBS-12977: Correctly specify the licenses of the JSON dumps #2897

Draft
wants to merge 1 commit into
base: master

reosarevok
Member

Fix MBS-12977

Problem

We were claiming our JSON dumps are under a CC0 license, but that is not correct. The JSON dumps contain annotation, rating and tag data, all of which is supplementary / CC-BY-NC-SA data.

Solution

I talked with @mayhem and @mwiencek about this, and we agreed that the right thing to do is to update the dump docs to specify the right licenses rather than drop the supplementary data, since it can be useful and there's no great way to provide it separately as we do with the MB dumps.

As such, this moves from having one COPYING file to two, one per license, with the README file telling people which content is covered by which license. (I also updated the docs with the proper licensing info.)

Testing

Added tests to make sure the license files are in the dumps and the README tells people to check them.
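
For readers who want to run a similar check against a downloaded dump, here is a minimal Python sketch of the same idea. It is not the test code from this PR. The file names `COPYING-PublicDomain` and `COPYING-CreativeCommons`, the top-level `README` name, and the flat archive layout are all assumptions made for illustration.

```python
import io
import tarfile

# Hypothetical file names: the PR only states that there is one COPYING
# file per license, not what the files are called.
LICENSE_FILES = ("COPYING-PublicDomain", "COPYING-CreativeCommons")


def check_dump_licensing(archive_bytes: bytes) -> list[str]:
    """Return a list of licensing problems found in a dump archive."""
    problems = []
    # "r:*" lets tarfile detect the compression (if any) on its own.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        names = set(tar.getnames())
        for name in LICENSE_FILES:
            if name not in names:
                problems.append(f"missing license file: {name}")
        if "README" not in names:
            problems.append("missing README")
        else:
            readme = tar.extractfile("README").read().decode("utf-8")
            # The README should point readers at each license file.
            for name in LICENSE_FILES:
                if name not in readme:
                    problems.append(f"README does not mention {name}")
    return problems
```

An empty list means the archive carries both license files and a README that refers to each of them.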

@reosarevok reosarevok added the Bug Bugs that should be checked/fixed soonish label Mar 13, 2023
@reosarevok reosarevok force-pushed the MBS-12977 branch 3 times, most recently from 245a9f1 to cec5839 Compare March 13, 2023 19:03
@mayhem
Member

mayhem commented Mar 14, 2023

Hi!

I really think this is not a good approach at all -- unless we wish to be a bit evil about it, in which case it's fine. 😈

The problem I see is that to consume this dump, which is intended to be ingested into a document store, the user now has to process some of the dump files and write a filter to remove the non-CC0 bits before they can import the data. That is far from ideal. But if we want people to not notice this (which will be easy) and to import the non-CC0 data so that we can sue them later for doing so, then this is a great approach.

While we really don't need more data dump files, it would be best to put out a sanitized version of the files that contain non-CC0 data and label them as such, so that people have to make the correct decision right on the spot when they download the files.
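
To give a sense of the filter each consumer would otherwise have to write, here is a rough Python sketch. It assumes that the dump holds one JSON document per line and that the supplementary data sits under keys named `annotation`, `rating` and `tags`. Both are assumptions for illustration and would need checking against the actual dump format.

```python
import json
from typing import Any, Iterable, Iterator

# Assumed key names for the supplementary (CC BY-NC-SA) fields; verify them
# against the real dump format before relying on this.
SUPPLEMENTARY_KEYS = frozenset({"annotation", "rating", "tags"})


def strip_supplementary(value: Any) -> Any:
    """Recursively drop supplementary keys from a decoded JSON value."""
    if isinstance(value, dict):
        return {
            key: strip_supplementary(item)
            for key, item in value.items()
            if key not in SUPPLEMENTARY_KEYS
        }
    if isinstance(value, list):
        return [strip_supplementary(item) for item in value]
    return value


def filter_dump_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one filtered JSON document per non-empty input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(strip_supplementary(json.loads(line)))
```

Even a filter this small has to walk nested entities, because the supplementary fields can appear below the top level, which is part of why pushing this work onto every consumer is unattractive.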

@reosarevok
Member Author

I mean, enforcing this would be complicated anyway, given that we have offered the whole data set as CC0 for years: if someone stopped importing the supplementary data tomorrow and kept everything they had imported before this change, they would be in the right.

Having two dumps might be doable, as long as we can do it without doubling the time needed to generate the dump (which is already very problematic, since it takes over a day).

@reosarevok reosarevok marked this pull request as draft December 14, 2023 13:40
@reosarevok
Member Author

Converting to draft since we should look into dumping two files, @mwiencek
