Transient system call errors during recovery cause inconsistent re-initialization #52

Open
palmskog opened this issue May 21, 2017 · 0 comments

From @pfons on April 13, 2016 2:29

During recovery, when opening the snapshot file, the server assumes that any error it encounters means that no snapshot was ever created (function get_initial_state in file Shim.ml). However, opening a file can also fail because of transient OS conditions, such as insufficient kernel memory (ENOMEM) or reaching the system-wide limit on open files (ENFILE). If such an error occurs during recovery, the server silently discards part of its persistent state (the disk snapshot) while still reading the rest of it (the disk log), which will lead to safety problems.
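One possible way to separate "no snapshot exists" from "the snapshot could not be opened" is sketched below. This is not the actual code of get_initial_state; the type `snapshot_status` and the function `open_snapshot` are hypothetical names introduced for illustration, and the sketch assumes the OCaml `Unix` library is linked. Only ENOENT is treated as a missing snapshot; any other error aborts recovery instead of silently continuing with an empty state.

```ocaml
(* Hypothetical sketch, not the real get_initial_state from Shim.ml. *)
type snapshot_status =
  | No_snapshot             (* the file does not exist: starting empty is safe *)
  | Snapshot of in_channel  (* the file was opened successfully *)

let open_snapshot (path : string) : snapshot_status =
  try
    let fd = Unix.openfile path [Unix.O_RDONLY] 0 in
    Snapshot (Unix.in_channel_of_descr fd)
  with
  (* Only a missing file means that no snapshot was ever written. *)
  | Unix.Unix_error (Unix.ENOENT, _, _) -> No_snapshot
  (* ENOMEM, ENFILE, EACCES, ...: refuse to recover rather than drop state. *)
  | Unix.Unix_error (err, fn, arg) ->
    failwith
      (Printf.sprintf "cannot open snapshot %s: %s (%s)"
         arg (Unix.error_message err) fn)
```

Whether the server should abort, retry, or report the error in some other way is a design choice this sketch does not settle; the point is only that ENOENT is handled separately from every other error.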

The following sequence of steps should reproduce the bug:
a) issue client PUT requests so that snapshots and log entries are written to disk (~1000 requests)
b) stop all servers
c) remove all permissions from the servers' snapshot files (chmod 000 verdi-snapshot-900*)
d) restart all servers
e) issue one client GET request

In our tests, after this sequence of events, the client GET request issued after recovery (step e) returns a result as if the key-value store had never been populated (step a).

Besides making replicas forget all of their state, it may be possible to construct test cases in which replicas forget only part of it, since after recovery they discard the snapshot but still replay the disk log.
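The partial-forgetting scenario can be illustrated with a toy model. This is a sketch under assumptions, not Verdi's actual recovery code: it assumes the log holds only the entries written after the last snapshot, and the names `entry`, `replay`, `correct` and `buggy` are hypothetical.

```ocaml
(* Toy model: recovered state = snapshot contents + replayed log entries. *)
module StringMap = Map.Make (String)

type entry = Put of string * string

(* Apply every logged PUT on top of a base state. *)
let replay (base : string StringMap.t) (log : entry list) : string StringMap.t =
  List.fold_left (fun st (Put (k, v)) -> StringMap.add k v st) base log

(* Keys "a" and "b" were compacted into the snapshot; "c" is only in the log. *)
let snapshot = StringMap.(empty |> add "a" "1" |> add "b" "2")
let log = [Put ("c", "3")]

(* Correct recovery replays the log on top of the snapshot. *)
let correct = replay snapshot log

(* Buggy recovery treats the failed open as "no snapshot" and starts empty. *)
let buggy = replay StringMap.empty log
```

In this model `buggy` still contains "c" but has lost "a" and "b", so the replica would come back with a state that is neither empty nor complete.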

Copied from original issue: uwplse/verdi#40
