-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dvc][doc] Create MVP for DaVinciRecordTransformer #1087
base: main
Are you sure you want to change the base?
Conversation
The thing about doing a hash of the bytecode sounds pretty interesting, but I wonder how reliable it is... I suspect that it probably has extremely low chances of false negative (the code changed but we fail to detect it) but that it might have a somewhat large probability of false positives (the code did not change but we think it did). The latter could be due to differences in JDK versions or flags used for compilation... Maybe it's not a big deal, but IDK... Also, coming back to false negatives, those may still exist if the scope of the checksumming is not large enough, e.g. the class we checksum did not change but it calls into another class which did change. Overall, I am not sure how I feel about this mechanism, and I think we may want to have a config to opt out of this check. Possibly even: this should be opt in to begin with (i.e. off by default) until we can gain a better understanding of its limitations. |
From my testing, the hash would change under these circumstances:
Circumstances where it didn't change:
I think having it off by default and having the user enable it makes sense. We could have a global variable that's initially set to false, and add another constructor where a user can pass in a boolean of whether or not to enable it. We could call the variable |
Dvrt mvp
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/DaVinciRecordTransformer.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/StoreBackend.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/AvroGenericDaVinciClient.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/AvroGenericDaVinciClient.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/AvroGenericDaVinciClient.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/DaVinciRecordTransformer.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/client/DaVinciRecordTransformer.java
Outdated
Show resolved
Hide resolved
...ts/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/store/AbstractStoragePartition.java
Outdated
Show resolved
Hide resolved
clients/da-vinci-client/src/main/java/com/linkedin/davinci/store/AbstractStorageEngine.java
Outdated
Show resolved
Hide resolved
I have finished my first two passes over this. Good progress, but there are some significant gotchas that have to be addressed. The first and most primary one being that I do not believe this implementation will work with version pushes. |
…store classHash in rocksDB
newBuffer.put(transformedBytes); | ||
newBuffer.flip(); | ||
return newBuffer; | ||
public final O transformAndProcessPut(Lazy<K> key, Lazy<V> value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading about the behavior of the key and value in SIT, I don't think value needs to be lazy since the operation of assembling has already occurred. Right now, it doesn't really do anything.
Another potential issue here is that key is being deserialized twice in transform
and processPut
, unless we expect most of our users to only utilize the key in processPut
.
Thoughts @ZacAttack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that's exactly right is it? The point of the lazy is to handle not having to do value deserialization or chunking unless the user needs it. The code today acts like a follower node should which is just storing key/values to bytes, and the point of the lazy is to give users the ability to read the value as a serialized/assembled record if they need to process it for their logic.
Deserializing twice is a problem.... but should be solved via the lazy right? The lazy function should deserialize once and then on successive calls to get it will just return a cached value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deserializing twice is a problem.... but should be solved via the lazy right? The lazy function should deserialize once and then on successive calls to get it will just return a cached value.
Yes, you're right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a todo to make chunking lazy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to document the discussion made during the meeting. We need to think about what to do with the lazy value, since the parameter for transform is lazy, and processPut is also lazy. Should transform return a lazy value, or should we create a lazy/non-lazy API? TBD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To document this a bit further, the 4 scenarios we need to make sure we have correctness for our:
- User deserialized the value and altered it
- User deserialized the value and didn't alter it
- User did not deserialize the value and just passed it long to the storage engine
- User did not deserialize the value and passed along something else entirely
…the return type of delete void. Proceed with deletion based off of storeRecordsInDaVinci
…iConfig to remove the need of instantiating a dummy record transformer, updated docs, and simplified function names
Summary
storeRecordsInDaVinci
boolean to the constructor to detect if a user wants to store their transformed records in DaVinci or another storage of their choice.DaVinciRecordTransformerConfig
to be used byDaVinciConfig
, so we have access to Output Value Class and Schema without needing to create a dummy record transformer.AbstractStorageIterator
and a RocksDB implementation to be used by the recovery method.put
intotransform
andprocessPut
, so that when we're trying to bootstrap from local state we don't try to transform the data again as it already has been transformed. Also created a wrapper methodtransformAndProcessPut
when both need to be called.StoreIngestionTask
.How was this PR tested?
Added more unit tests.
Does this PR introduce any user-facing changes?