Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize cluster election when nodes initiate elections at the same epoch #1009

Open
wants to merge 5 commits into
base: unstable
Choose a base branch
from

Conversation

enjoy-binbin
Copy link
Member

If multiple primary nodes go down at the same time, their replica nodes will
initiate the elections at the same time. There is a certain probability that
the replicas will initate the elections in the same epoch.

And obviously, in our current election mechanism, only one replica node can
eventually get the enough votes, and the other replica node will fail to win
due the the insufficient majority, and then its election will time out and
we will wait for the retry, which result in a long failure time.

If another node has been won the election in the failover epoch, we can assume
that my election has failed and we can retry as soom as possible.

…poch

If multiple primary nodes go down at the same time, their replica nodes will
initiate the elections at the same time. There is a certain probability that
the replicas will initate the elections in the same epoch.

And obviously, in our current election mechanism, only one replica node can
eventually get the enough votes, and the other replica node will fail to win
due the the insufficient majority, and then its election will time out and
we will wait for the retry, which result in a long failure time.

If another node has been won the election in the failover epoch, we can assume
that my election has failed and we can retry as soom as possible.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@@ -3113,6 +3113,17 @@ int clusterProcessPacket(clusterLink *link) {
if (sender_claims_to_be_primary && sender_claimed_config_epoch > sender->configEpoch) {
sender->configEpoch = sender_claimed_config_epoch;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);

if (server.cluster->failover_auth_time && sender->configEpoch == server.cluster->failover_auth_epoch) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PingXie Not sure if i am doing it (the time or the conditions) right here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this fix makes sense. I wonder if we should check "greater than or equal to" instead? There is no way "myself" can win an election in the past.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried that when multiple shards failover, they each elect in their own epoch. In this case, the sender's epoch may be > my own failover epoch, but it is correct (different epoch). something like #1018

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no way "myself" can win an election in the past.

this make sense to me, ok, i am changing it and then i will re-test it.

Copy link

codecov bot commented Sep 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.60%. Comparing base (09def3c) to head (0d13f5d).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1009      +/-   ##
============================================
- Coverage     70.61%   70.60%   -0.01%     
============================================
  Files           114      114              
  Lines         61664    61668       +4     
============================================
- Hits          43541    43540       -1     
- Misses        18123    18128       +5     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.05% <100.00%> (-0.21%) ⬇️

... and 12 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 10, 2024
Signed-off-by: Binbin <binloveplay1314@qq.com>
Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great bug, @enjoy-binbin! The fix LGTM overall.

src/cluster_legacy.c Outdated Show resolved Hide resolved
@@ -3113,6 +3113,17 @@ int clusterProcessPacket(clusterLink *link) {
if (sender_claims_to_be_primary && sender_claimed_config_epoch > sender->configEpoch) {
sender->configEpoch = sender_claimed_config_epoch;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG);

if (server.cluster->failover_auth_time && sender->configEpoch == server.cluster->failover_auth_epoch) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this fix makes sense. I wonder if we should check "greater than or equal to" instead? There is no way "myself" can win an election in the past.

src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_legacy.c Outdated Show resolved Hide resolved
enjoy-binbin and others added 3 commits September 13, 2024 15:08
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants