
Few queries #54

Open
cloudcompute opened this issue Feb 19, 2023 · 3 comments

@cloudcompute

cloudcompute commented Feb 19, 2023

Hi @larousso and @ptitFicus

This project looks valuable. I have a few queries relating to it.

a. Why have you decided not to use CDC (Debezium), and how are you able to achieve this without CDC? Pretty interesting. The documentation says: "It keeps trying writing to Kafka until succeeded". Won't that impact performance?

b. What is the role of Akka in this project? Can we do away with it, since it is under the BSL now?

c. While writing to Kafka, the Avro/Protobuf formats, which are really fast, are not used. Is there a reason for that? Is it possible to integrate Thoth with this gRPC/REST proxy for Kafka?

d. Does it support the following: optimistic concurrency (multiple users writing events to the same table at the same time), event replay (right from the beginning, for audits etc.), snapshots, topic compaction (deleting/overwriting a message record for a given key), and event versioning?

e. What is the difference between the non-blocking JOOQ and the standard JOOQ/Kafka implementations? For which use cases should we use which implementation?

f. Is there no need for a Kafka proxy?

Thanks

@ptitFicus
Member

Hello,

Thanks for your interest.

a. I'm not a CDC/Debezium expert; however, here are some elements:

  • we want to be able to publish events to Kafka even if the application crashes right after the Postgres write. From my understanding, CDC works with triggers that only fire once for each event; for instance, in the following scenario the event may never be published:

    1. The Postgres write succeeds
    2. The CDC trigger tries to publish to Kafka
    3. The Kafka publication fails (network issue / Kafka is down / ...)
    4. The application crashes
    5. The application restarts, but the CDC trigger won't fire again for this event
  • since our use case is pretty simple, we didn't feel the need to bring in additional dependencies such as Debezium

  • Regarding the "how", here is what Thoth does under the hood (simplified version; a rough sketch follows below):

    1. You send a command to Thoth
    2. Thoth derives the event(s) to be written from your command and the current state (reading events from the database)
    3. Thoth opens a transaction:
      1. It writes the events to the event store, with a flag indicating that these events are not yet published to Kafka
      2. It computes "in transaction" / "real time" projections and stores them in the DB (if needed)
      3. It attempts a first Kafka publication and, on success, updates the publication flag to the successful state.
    4. If steps 1 and 2 succeed, Thoth commits the transaction and the CompletionStage you get in return succeeds. If the first Kafka publication fails, a background process retries publishing any unpublished events. This process is independent from the CompletionStage you get as a result, so it does not impact response time.
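
A minimal sketch of steps 3.1 and 3.3 above, assuming a plain JDBC connection to Postgres, a KafkaProducer, and an events table with a published flag. Table, column and method names are illustrative, not Thoth's actual schema or API:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class OutboxWriteSketch {
    // Writes an event inside a transaction, then attempts a first publication.
    void writeThenPublish(Connection db, KafkaProducer<String, String> kafka,
                          String entityId, String payload) throws SQLException {
        db.setAutoCommit(false);
        try {
            // Step 3.1: store the event, flagged as not yet published
            long eventId;
            try (PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO events(entity_id, payload, published) "
                    + "VALUES (?, ?, false) RETURNING id")) {
                insert.setString(1, entityId);
                insert.setString(2, payload);
                try (ResultSet rs = insert.executeQuery()) {
                    rs.next();
                    eventId = rs.getLong(1);
                }
            }
            // Step 3.2 would compute and store projections here, same transaction.
            // Step 3.3: best-effort first publication; flip the flag only on success
            try {
                kafka.send(new ProducerRecord<>("events", entityId, payload)).get();
                try (PreparedStatement flag = db.prepareStatement(
                        "UPDATE events SET published = true WHERE id = ?")) {
                    flag.setLong(1, eventId);
                    flag.executeUpdate();
                }
            } catch (Exception kafkaUnavailable) {
                // Leave published = false: the background retry process picks it up.
            }
            db.commit(); // the event is durable even if the first publication failed
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }
}
```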

b. @larousso is working on an Akka-free version using Reactor instead

c. The current Kafka publication speed is enough for our needs, so we chose to keep Kafka publication simple. I think it may be possible to add an opt-in option for Avro/Protobuf support, but it's definitely not on our roadmap. Perhaps it could be a good contribution, what do you think @larousso?

d.

e. The main difference is the driver used to communicate with Postgres: the non-blocking JOOQ Akka implementation is based on the reactive Vert.x driver for Postgres, while the standard JOOQ/Kafka implementation uses a non-reactive driver

f. Could you elaborate on this point?

@larousso
Contributor

Hi,

a. With the current implementation, events are published to Kafka in near real time; I think that with Debezium there would be some latency.
The current algorithm enqueues the messages in memory in order to send them to Kafka. If Kafka is down, the process crashes and restarts, reading the events from the database in order to recover the unpublished messages, while the hot messages are kept in memory. The idea is to try to keep the order of the events.
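
A rough sketch of that restart behaviour, with illustrative names only (not the actual Thoth code): unpublished events are re-read from the database in insertion order and republished before live publication resumes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class RepublishSketch {
    // On restart: replay unpublished events in id order to preserve ordering.
    void republishPending(Connection db, KafkaProducer<String, String> kafka)
            throws Exception {
        try (PreparedStatement select = db.prepareStatement(
                 "SELECT id, entity_id, payload FROM events "
                 + "WHERE published = false ORDER BY id");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                // Wait for the broker ack before flagging, so a crash here causes
                // at worst a duplicate publication, never a lost event.
                kafka.send(new ProducerRecord<>("events",
                        rs.getString("entity_id"), rs.getString("payload"))).get();
                try (PreparedStatement flag = db.prepareStatement(
                        "UPDATE events SET published = true WHERE id = ?")) {
                    flag.setLong(1, rs.getLong("id"));
                    flag.executeUpdate();
                }
            }
        }
    }
}
```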

b. The first version of Thoth used Akka as its reactive streams implementation, but with the change of licence, the Akka implementation was kept and a Reactor implementation was added.
I think we won't keep supporting the Akka implementation because we won't use it anymore.
At the moment the only dependency that remains is for tests, but I will remove it as soon as possible.

c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that.

d.

  1. For optimistic concurrency, a sequence number can be used to detect updates from a stale version (see the sketch after this list).
  2. For event replay, there is a stream method in the API that can be used to read all the events from the database.
  3. Topic compaction: at the moment the shard id is the entity id, so if you use compaction in Kafka I think it will keep the last event for a given entity.
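
To illustrate point 1, a minimal sketch of sequence-number-based optimistic concurrency, assuming a UNIQUE(entity_id, sequence_num) constraint (schema and names are illustrative, not Thoth's actual API): two writers that both read version N and try to append N + 1 cannot both succeed.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class OptimisticAppendSketch {
    // Appends an event at expectedSeq + 1; fails if another writer got there first.
    boolean tryAppend(Connection db, String entityId, long expectedSeq, String payload) {
        try (PreparedStatement st = db.prepareStatement(
                "INSERT INTO events(entity_id, sequence_num, payload) VALUES (?, ?, ?)")) {
            st.setString(1, entityId);
            st.setLong(2, expectedSeq + 1);
            st.setString(3, payload);
            st.executeUpdate();
            return true; // we won the race
        } catch (SQLException conflict) {
            // Unique-constraint violation: stale read, so reload state and retry.
            return false;
        }
    }
}
```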

e. I don't know if @ptitFicus uses JDBC in production, but I think the non-blocking one is the more stable solution.

@cloudcompute
Author

Thanks to both @larousso and @ptitFicus for the detailed responses.

Implementing CDC using the Outbox Pattern

It preserves the order of events. There is some latency, but it is very low. An asynchronous process runs in the background and copies the event records from the outbox table to Kafka. If it is restarted, it reads again, which may cause duplicates in Kafka. This ensures at-least-once processing semantics. If there are duplicates, the consuming service has to make sure to ignore them (idempotency).
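
For illustration, one common way to get that idempotency on the consumer side, assuming each event carries a unique id in a Kafka header (the topic, the header name and the in-memory set are placeholders; a real service would use a durable store):

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class IdempotentConsumerSketch {
    private final Set<String> processedIds = new HashSet<>(); // durable store in practice

    void consume(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("events"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Assumes the producer put a unique "event-id" header on every record.
                String eventId = new String(record.headers().lastHeader("event-id").value());
                if (!processedIds.add(eventId)) {
                    continue; // duplicate redelivery: already handled, ignore it
                }
                handle(record.value());
            }
            consumer.commitSync();
        }
    }

    void handle(String payload) { /* business logic goes here */ }
}
```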

hot messages are kept in memory

I don't know how you have implemented it, but what if the process that is keeping track of these messages itself dies?

f. Is there no need for a Kafka proxy?

It was left in by mistake; I merged this question into point c. I wanted to ask whether it is possible (or makes sense while using ES) to put a Kafka proxy in front of Kafka so that all events written to or read from it by microservices go through this proxy. There are several benefits: for instance, if there are hundreds of microservices, they don't need to know Kafka cluster details like its location; they just need to know about this proxy. It has other benefits too, which are stated here: [gRPC/REST proxy for Kafka]
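
To make the idea concrete, a minimal sketch of producing through such a proxy over plain HTTP, in the style of Confluent's REST Proxy v2 API; the proxy URL and topic name are placeholders, and the point is that the service never needs broker addresses:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RestProxyProduceSketch {
    public static void main(String[] args) throws Exception {
        // One JSON record, keyed by entity id, posted to the proxy instead of a broker.
        String body = "{\"records\":[{\"key\":\"entity-1\",\"value\":{\"type\":\"Created\"}}]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://kafka-proxy.internal/topics/events"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```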

c. There are tools for JSON serialization because that is what we're using at the moment, but you could write your own Avro serializer if you need it; there is no blocker for that.

Correct

With regards to both of you
