PBM-815: physical restore + logical PITR #844

Merged · 23 commits merged into main on Jul 5, 2023
Conversation

@dAdAbird (Member) commented on Jul 4, 2023

Add an oplog replay stage during the restore data post-processing. All these steps are done while the PSMDB cluster is down and isn't accessible from the outside world.

We cannot start a full replica set at this stage without exposing it to external users. So the oplog replay is done only on the "primary" node while each shard runs as a single-node replica set. The data will be propagated to the rest of the nodes during the cluster start. This leads to:
1. PITR for a physical backup is applied only to the primary node; the data is later copied to the other nodes during cluster start (initial sync).
2. PITR for sharded collections works only for writes, not for the creation of sharded collections. Therefore, **whenever a sharded collection is created, a new full physical backup is needed**. A logical backup covers this case, since the full replica set is restored rather than a single node as with a physical one.

The `--base-snapshot` flag must always be used with a physical base snapshot. For example: `pbm restore --time=2023-07-03T16:23:56 -w --base-snapshot=2023-07-03T16:18:09Z`. Without `--base-snapshot`, PBM will always look for a logical backup, even if there is no logical one or a physical one is more recent.

Distributed transactions

The old way of syncing all distributed transactions between the shards won't work, since no DB is available during a physical restore, and doing such a sync over the remote storage would be way too slow.
The new algorithm (applied to logical restores as well, for code consistency):

By looking at just the transactions in the oplog, we can't tell which shards
were participating in them. But we can assume that if there is a
commitTransaction on at least one shard, then the transaction is committed
everywhere. Otherwise, the transaction either won't be in the oplog at all,
or every shard would have an abortTransaction. So we just treat distributed
transactions as non-distributed: apply the ops once a commit message for the
txn is encountered.
It might happen that by the end of the oplog there are some distributed txns
without commit messages. We should commit such transactions only if the data is
full (all prepare statements observed) and the txn was committed by at least
one other shard. For that, each shard saves the last 100 distributed
transactions it committed, so other shards can check whether they should commit
their leftovers. We store the last 100 because prepare statements and commits
might be separated by other oplog events, so several commit messages might be
cut away on some shards while present on the other(s). Given that oplog events
of distributed txns are more or less aligned in [cluster]time, checking the
last 100 should be more than enough.
If a transaction is larger than 16MB, it will be split into several prepare
messages. So it might happen that one shard committed the txn while another
hasn't observed all the prepare messages by the end of the oplog. In such a
case, we should report it in the logs and in describe-restore.
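
A minimal Go sketch of the end-of-oplog decision described above; all names (`Txn`, `resolveLeftover`, etc.) are hypothetical and do not correspond to the actual PBM code:

```go
package txnsketch

// Txn is a hypothetical view of a distributed transaction as
// reconstructed from one shard's oplog slice (not an actual PBM type).
type Txn struct {
	ID          string
	AllPrepares bool // all prepare messages observed: the txn data is full
	CommitSeen  bool // commitTransaction encountered in this shard's oplog
}

// committedElsewhere reports whether any other shard lists the txn
// among the last 100 distributed transactions it committed.
func committedElsewhere(id string, othersLast100 [][]string) bool {
	for _, last100 := range othersLast100 {
		for _, committedID := range last100 {
			if committedID == id {
				return true
			}
		}
	}
	return false
}

// resolveLeftover decides the fate of a txn left without a commit
// message by the end of this shard's oplog.
func resolveLeftover(t Txn, othersLast100 [][]string) string {
	switch {
	case t.CommitSeen:
		// treated as non-distributed: ops were applied on commit
		return "committed"
	case !committedElsewhere(t.ID, othersLast100):
		// no shard committed it: leave as is (left_uncommitted)
		return "left_uncommitted"
	case !t.AllPrepares:
		// committed on another shard, but a >16MB txn was split and some
		// prepare messages were cut away: report in logs/describe-restore
		return "partial"
	default:
		// full txn, committed elsewhere: commit it (shard_uncommitted)
		return "commit"
	}
}
```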

It also adds some restore stats for distributed transactions (into the restore metadata):
`partial` - the number of transactions that were applied on other shards
but can't be applied on this one, since not all prepare messages got into
the oplog (shouldn't happen).
`shard_uncommitted` - the number of transactions uncommitted before the sync.
The transaction is full, but there is no commit message in the oplog of this
shard.
`left_uncommitted` - the number of transactions that remain uncommitted after the sync.
The transaction is full, but there is no commit message in the oplog of any shard.
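
For illustration, the counters could be stored roughly like this; the struct is a hypothetical sketch, only the field names come from the list above:

```go
// TxnStat sketches the distributed-transaction counters written into
// the restore metadata (hypothetical struct, not the actual PBM type).
type TxnStat struct {
	// Partial: applied on other shards but not applicable here, since
	// not all prepare messages got into the oplog (shouldn't happen).
	Partial int `bson:"partial" json:"partial"`

	// ShardUncommitted: full txns with no commit message in this
	// shard's oplog before the sync.
	ShardUncommitted int `bson:"shard_uncommitted" json:"shard_uncommitted"`

	// LeftUncommitted: full txns with no commit message in any shard's
	// oplog; they remain uncommitted after the sync.
	LeftUncommitted int `bson:"left_uncommitted" json:"left_uncommitted"`
}
```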

@defbin (Member) commented on Jul 5, 2023

In cli/cli.go:170, remove the "pitrestore" event for the log filter.

dAdAbird and others added 5 commits on July 5, 2023
Co-authored-by: Dmytro Zghoba <dmytro.zghoba@percona.com>
@dAdAbird merged commit 3f5ccc4 into main on Jul 5, 2023
24 of 28 checks passed
@dAdAbird deleted the PBM-815_phys_n_pitr branch on July 5, 2023
minottic pushed a commit to paulscherrerinstitute/percona-backup-mongodb that referenced this pull request Jul 12, 2023
minottic pushed a commit to paulscherrerinstitute/percona-backup-mongodb that referenced this pull request Oct 26, 2023