Make subsystems more robust on runtime-apis calls taking long time/hanging #5818

Open
alexggh opened this issue Sep 24, 2024 · 0 comments
alexggh commented Sep 24, 2024

Postmortem #5738 showed that a node can crash and restart if a runtime API hangs. The danger is that when an API hangs or takes a long time, the behaviour is the same on all nodes; in this case, all nodes crashed and restarted at the same time.

That's not good for the network, so we should explore ideas for reducing the blast radius. One possible method is to time out runtime API calls and make sure the subsystems gracefully handle this type of error.
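
To make the timeout idea concrete, here is a minimal sketch using only the Rust standard library. It is not how the subsystems are wired today: `call_with_timeout` and `RuntimeCallError` are hypothetical names, and the closure stands in for whatever runtime API request a subsystem makes. The point is only that the caller gets an error it can handle instead of blocking forever.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch; not an existing polkadot-sdk type.
#[derive(Debug, PartialEq)]
pub enum RuntimeCallError {
    /// The call did not finish within the deadline.
    TimedOut,
    /// The worker exited (for example, panicked) without producing a result.
    WorkerDied,
}

/// Runs `call` on a separate thread and waits at most `timeout` for its result.
///
/// Note: on timeout the worker thread is NOT stopped; it keeps running
/// in the background until `call` returns on its own.
pub fn call_with_timeout<T, F>(call: F, timeout: Duration) -> Result<T, RuntimeCallError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a failed send is ignored.
        let _ = tx.send(call());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RuntimeCallError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RuntimeCallError::WorkerDied),
    }
}
```

A caller would match on `Err(RuntimeCallError::TimedOut)` and degrade (skip the work, retry later, log) rather than stall. The actual subsystems are async, so a real change would more likely wrap the request future in a timeout than spawn a thread per call.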

One thing to take into consideration is that even if the subsystem call timed out, the runtime could still have that API running in the background and burning CPU time, so we need to make sure we gracefully cancel or kill tasks that are no longer needed.
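
As a rough illustration of the cancellation concern, the sketch below pairs a timeout with a shared cancel flag so the background work actually stops once the caller gives up. Everything here is hypothetical: `cancellable_work` is a stand-in for a long-running call, and it assumes the work can check a flag between steps. Whether a hung runtime API execution can be interrupted this way is an open question, not something this sketch answers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for long-running work. It proceeds in small steps
/// and checks the cancel flag between steps, returning `None` if cancelled.
pub fn cancellable_work(cancel: &AtomicBool, steps: u32) -> Option<u32> {
    for _ in 0..steps {
        if cancel.load(Ordering::Relaxed) {
            return None;
        }
        thread::sleep(Duration::from_millis(5));
    }
    Some(steps)
}

/// Runs the work with a deadline. On timeout, sets the cancel flag so the
/// worker stops instead of burning CPU in the background.
pub fn run_or_cancel(steps: u32, timeout: Duration) -> Option<u32> {
    let cancel = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&cancel);
    let (tx, rx) = mpsc::channel();

    let handle = thread::spawn(move || {
        // The receiver may have given up already; ignore a failed send.
        let _ = tx.send(cancellable_work(&worker_flag, steps));
    });

    let outcome = match rx.recv_timeout(timeout) {
        Ok(result) => result,
        Err(_) => {
            // Deadline passed (or worker died): ask the worker to stop.
            cancel.store(true, Ordering::Relaxed);
            None
        }
    };

    // Joining here only demonstrates that the worker really exits after
    // cancellation; production code would not block on it.
    let _ = handle.join();
    outcome
}
```

The limitation is that this is cooperative: work that never reaches a check point cannot be stopped this way, and would need a stronger mechanism such as running it in a separate process that can be killed.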
