
Generalise table merge with a monoid on values #142

Open
wants to merge 3 commits into master
Conversation

@tranma (Contributor) commented Jan 8, 2018

This is a simple change that allows substituting another monoid for the standard value union, so that merge can be used to calculate the number of bytes used by each key. With that information we can filter out abnormally large entities. Example:

→ ./dist/build/Zebra/zebra import input1.ztxt -s input.zschema -o input1.zbin && ./dist/build/Zebra/zebra merge input1.zbin --measure -o output1.zbin && ./dist/build/Zebra/zebra export output1.zbin
{"key":{"entity_hash":30,"entity_id":"lisa"},"value":19}
{"key":{"entity_hash":40,"entity_id":"homer"},"value":16}
{"key":{"entity_hash":50,"entity_id":"bart"},"value":111}
{"key":{"entity_hash":50,"entity_id":"millhouse"},"value":24}

→ cat input1.ztxt
{ "key": { "entity_hash": 50, "entity_id": "millhouse" }, "value": { "cash": 19, "item": { "none": {} } } }
{ "key": { "entity_hash": 40, "entity_id": "homer" }, "value": { "cash": 5, "item": { "some": "" } } }
{ "key": { "entity_hash": 30, "entity_id": "lisa" }, "value": { "cash": 5, "item": { "some": "sax" } } }
{ "key": { "entity_hash": 50, "entity_id": "bart" }, "value": { "cash": 27, "item": { "some": "averylongstringaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" } } }
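The idea of a merge parameterised by a monoid on values can be sketched outside zebra like this. This is purely illustrative: `mergeTables`, `measure`, and the `sizeOf` argument are hypothetical names, not zebra's actual API; `Sum Int` stands in for zebra's real byte accounting.

```haskell
module Main where

import           Data.Monoid (Sum (..))
import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)

-- Merge any number of tables, combining values for equal keys with
-- whatever Monoid the value type carries (instead of a hard-coded union).
mergeTables :: (Ord k, Monoid v) => [Map k v] -> Map k v
mergeTables = Map.unionsWith (<>)

-- A "measure" pass: map every value to its size, so that the monoidal
-- merge above sums sizes per key rather than unioning values.
measure :: (v -> Int) -> Map k v -> Map k (Sum Int)
measure sizeOf = Map.map (Sum . sizeOf)

main :: IO ()
main = do
  let t1 = Map.fromList [("bart", "averylongstring"), ("lisa", "sax")]
      t2 = Map.fromList [("bart", "moredata")]
      sizes = mergeTables [measure length t1, measure length t2]
  -- bart's sizes from both tables are summed; lisa appears once.
  print (Map.toList sizes)
```

The point of the design is that the same merge machinery serves both the normal union and the byte-count pass; only the monoid changes.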

! @nhibberd @charleso

@nhibberd (Contributor) commented Jan 8, 2018

Will take a look when I get back

@tranma (Contributor, Author) commented Jan 10, 2018

I've added an extraction step, so you can get a zebra file containing only the large entities. What to do from here is up to you. I think the idea we had was to run this measure-greater-than-megabytes merge before the actual merge, acquire the blacklist, and tell merge to ignore those keys. I haven't implemented that last step, but it should be straightforward.

→ ./dist/build/Zebra/zebra import input1.ztxt -s input.zschema -o input1.zbin && ./dist/build/Zebra/zebra merge input1.zbin --measure-greater-than-megabytes 0 -o output1.zbin && ./dist/build/Zebra/zebra export output1.zbin   [292bd7e]
{"key":{"entity_hash":30,"entity_id":"lisa"},"value":19}
{"key":{"entity_hash":40,"entity_id":"homer"},"value":16}
{"key":{"entity_hash":50,"entity_id":"bart"},"value":111}
{"key":{"entity_hash":50,"entity_id":"millhouse"},"value":24}

→ ./dist/build/Zebra/zebra import input1.ztxt -s input.zschema -o input1.zbin && ./dist/build/Zebra/zebra merge input1.zbin --measure-greater-than-megabytes 1 -o output1.zbin && ./dist/build/Zebra/zebra export output1.zbin   [292bd7e]

(No output: none of the entities exceeds 1 megabyte.)
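The measure-then-blacklist workflow described above could be sketched as follows. These are hypothetical helpers for illustration, not part of zebra: `blacklist` keeps the keys whose measured size exceeds a threshold, and `dropBlacklisted` filters them out of a table before the real merge.

```haskell
module Main where

import qualified Data.Map.Strict as Map
import           Data.Map.Strict (Map)
import qualified Data.Set as Set
import           Data.Set (Set)

-- Keys whose measured byte count exceeds the limit.
blacklist :: Ord k => Int -> Map k Int -> Set k
blacklist limitBytes = Map.keysSet . Map.filter (> limitBytes)

-- Remove blacklisted keys from a table prior to the actual merge.
dropBlacklisted :: Ord k => Set k -> Map k v -> Map k v
dropBlacklisted bad = Map.filterWithKey (\k _ -> Set.notMember k bad)

main :: IO ()
main = do
  -- Per-key sizes taken from the measure pass in the example above.
  let sizes = Map.fromList
        [("bart", 111), ("homer", 16), ("lisa", 19), ("millhouse", 24)]
      bad   = blacklist 100 sizes
      table = Map.fromList [("bart", "x"), ("lisa", "y")]
  print (Set.toList bad)                        -- ["bart"]
  print (Map.keys (dropBlacklisted bad table))  -- ["lisa"]
```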
@erikd-ambiata (Contributor) commented
I'm not familiar with the zebra code base, but this looks more than reasonable.
