PBM-815: physical restore + logical PITR #844
Merged
Conversation
defbin reviewed Jul 5, 2023
defbin approved these changes Jul 5, 2023
olexandr-havryliak approved these changes Jul 5, 2023
minottic pushed a commit to paulscherrerinstitute/percona-backup-mongodb that referenced this pull request Jul 12, 2023
minottic pushed a commit to paulscherrerinstitute/percona-backup-mongodb that referenced this pull request Oct 26, 2023
Add an oplog replay stage to the restore data post-processing. All these steps are done while the PSMDB cluster is down and isn't accessible from the outside world.
We cannot start a full replica set at this stage without exposing it to external users. So the oplog replay is done only on the "primary" node while it is in a single-node replica set state. The data is propagated to the rest of the nodes during the cluster start. This leads to:
1. PITR for a physical backup is applied only to the primary node and is later copied to the other nodes during the cluster start (cluster initial sync).
2. PITR for sharded collections works only for writes, not for the creation of sharded collections. Therefore, whenever a sharded collection is created, a new full physical backup is needed. A logical backup covers this case, as a full replica set is restored rather than a single node as with a physical one.
The `--base-snapshot` flag must always be used with a physical base snapshot. For example: `pbm restore --time=2023-07-03T16:23:56 -w --base-snapshot=2023-07-03T16:18:09Z`. Without `--base-snapshot`, PBM will always look for a logical backup, even if there is no logical backup or a physical one is more recent.

Distributed transactions
The old way of syncing all distributed transactions between the shards won't work, as no DB is available during the physical restore, and doing such a sync over the remote storage would be far too slow.
The new algorithm (also applied to logical restores for the sake of code consistency):
By looking at a transaction in the oplog alone, we can't tell which shards
were participating in it. But we can assume that if there is a
commitTransaction on at least one shard, then the transaction is committed
everywhere. Otherwise, the transaction won't be in the oplog at all, or every
shard would have an abortTransaction. So we just treat distributed
transactions as non-distributed: we apply the ops once a commit message for
the txn is encountered.
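The replay rule above can be sketched roughly as follows. This is a minimal illustration, not PBM's actual code: the oplog entry shapes and field names (`op`, `txn`, `ops`) are hypothetical stand-ins for the real oplog format.

```python
# Sketch: buffer each txn's prepared ops and apply them only when a
# commitTransaction entry for that txn is seen; abort drops the buffer.
from collections import defaultdict

def replay(oplog):
    prepared = defaultdict(list)   # txn id -> buffered prepared ops
    applied = []                   # ops applied to this node, in order
    for entry in oplog:
        if entry["op"] == "prepare":
            prepared[entry["txn"]].extend(entry["ops"])
        elif entry["op"] == "commitTransaction":
            # distributed or not: commit as soon as a commit message appears
            applied.extend(prepared.pop(entry["txn"], []))
        elif entry["op"] == "abortTransaction":
            prepared.pop(entry["txn"], None)
        else:
            applied.append(entry)  # ordinary (non-txn) oplog event
    # whatever is still buffered at oplog end is an uncommitted leftover
    return applied, dict(prepared)

oplog = [
    {"op": "prepare", "txn": "t1", "ops": ["w1", "w2"]},
    {"op": "i", "ns": "db.c"},                       # ordinary write
    {"op": "commitTransaction", "txn": "t1"},
    {"op": "prepare", "txn": "t2", "ops": ["w3"]},   # no commit by oplog end
]
applied, leftovers = replay(oplog)
# applied  -> [{'op': 'i', 'ns': 'db.c'}, 'w1', 'w2']
# leftovers -> {'t2': ['w3']}
```

The leftovers returned here are exactly the "txns without commit messages" handled by the cross-shard sync described next.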
It might happen that by the end of the oplog some distributed txns have no
commit messages. We should commit such a transaction only if its data is
full (all prepared statements were observed) and it was committed by at
least one other shard. For that, each shard saves the last 100 distributed
transactions it committed, so other shards can check whether they should
commit their leftovers. We store the last 100 because prepared statements and
commits may be separated by other oplog events, so several commit messages
can be cut off on some shards while still present on other(s). Given that
oplog events of distributed txns are more or less aligned in [cluster] time,
checking the last 100 should be more than enough.
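A minimal sketch of that leftover-commit decision, with hypothetical names (`record_commit`, `should_commit_leftover`) that are not PBM's API:

```python
# Each shard keeps a bounded list of its last 100 committed distributed
# txn ids; a leftover is committed only if it is full (all prepares seen)
# and appears in another shard's committed list.
from collections import deque

LAST_N = 100

def record_commit(committed: deque, txn_id: str) -> None:
    committed.append(txn_id)  # deque(maxlen=LAST_N) evicts the oldest entry

def should_commit_leftover(txn_full: bool, txn_id: str,
                           other_shards_committed) -> bool:
    return txn_full and any(txn_id in lst for lst in other_shards_committed)

shard_a = deque(maxlen=LAST_N)
record_commit(shard_a, "txn-42")

# shard B observed all prepares for txn-42 but no commit in its own oplog:
print(should_commit_leftover(True, "txn-42", [shard_a]))   # True
# txn-99 was committed nowhere, so it stays uncommitted:
print(should_commit_leftover(True, "txn-99", [shard_a]))   # False
```

The bounded `deque` mirrors the "last 100" window: because commits of a distributed txn land at roughly the same cluster time on every shard, a fixed-size recent-history window is enough for the lookup.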
If a transaction is larger than 16MB, it is split into several prepare
messages. So it may happen that one shard committed the txn while another has
not observed all of its prepare messages by the end of the oplog. In such a
case we report it in the logs and in `describe-restore`.
It also adds some restore stats for distributed transactions (into the restore metadata):
`partial` - the number of transactions that were applied on other shards but can't be applied on this one, since not all prepare messages got into the oplog (shouldn't happen).
`shard_uncommitted` - the number of uncommitted transactions before the sync. Basically, the transaction is full but there is no commit message in the oplog of this shard.
`left_uncommitted` - the number of transactions that remain uncommitted after the sync. The transaction is full but there is no commit message in the oplog of any shard.
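How these three counters relate to each other can be shown with a small sketch. The function name and inputs are hypothetical, not PBM's metadata schema: `leftovers` maps a leftover txn id to whether all of its prepares were observed on this shard, and `committed_elsewhere` is the set of txn ids committed on other shards.

```python
# Classify one shard's leftover transactions into the three restore stats.
def txn_stats(leftovers: dict, committed_elsewhere: set) -> dict:
    stats = {"partial": 0, "shard_uncommitted": 0, "left_uncommitted": 0}
    for txn, is_full in leftovers.items():
        if not is_full and txn in committed_elsewhere:
            stats["partial"] += 1            # committed elsewhere, incomplete here
        elif is_full:
            stats["shard_uncommitted"] += 1  # full, but no commit in this oplog
            if txn not in committed_elsewhere:
                stats["left_uncommitted"] += 1  # no commit on any shard
    return stats

print(txn_stats({"t1": True, "t2": True, "t3": False},
                committed_elsewhere={"t2", "t3"}))
# {'partial': 1, 'shard_uncommitted': 2, 'left_uncommitted': 1}
```

Note that `left_uncommitted` is a subset of `shard_uncommitted`: both count full transactions missing a local commit, but only those also missing a commit on every other shard remain uncommitted after the sync.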