Make subsystems more robust on runtime-apis calls taking long time/hanging #5818

alexggh · 2024-09-24T11:18:16Z

The postmortem in #5738 showed that a node can crash and restart if a runtime API call hangs. The danger is that when a runtime API hangs or takes a long time, the behaviour is the same on all nodes; in this case, all nodes crashed and restarted at the same time.

That's not good for the network, so we should explore ideas for reducing the blast radius. One possible method is to time out runtime API calls and make sure the subsystems gracefully handle this type of error.
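To make the timeout idea concrete, here is a minimal sketch using only the Rust standard library. `call_with_timeout` and `RuntimeCallError` are hypothetical names made up for illustration; they are not existing polkadot-sdk APIs, and a real subsystem would more likely use its async runtime's timeout facilities. The point is only that the caller gets an error it can handle instead of blocking forever.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch; not a polkadot-sdk type.
#[derive(Debug, PartialEq)]
pub enum RuntimeCallError {
    /// The call did not finish within the allowed time.
    TimedOut,
    /// The worker ended (for example by panicking) without sending a result.
    WorkerDied,
}

/// Runs `call` on a worker thread and waits at most `timeout` for its result.
///
/// Note: on timeout the worker thread is NOT stopped; it keeps running
/// in the background until `call` returns on its own.
pub fn call_with_timeout<T, F>(call: F, timeout: Duration) -> Result<T, RuntimeCallError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a failed send is ignored.
        let _ = tx.send(call());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RuntimeCallError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RuntimeCallError::WorkerDied),
    }
}
```

A caller would match on the result, e.g. treat `Err(RuntimeCallError::TimedOut)` as "skip this piece of work and log a warning" rather than stalling the whole subsystem. How each subsystem should react to such an error is exactly the open design question here.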

One thing to take into consideration here is that even if the subsystem call timed out, the runtime could still have that API running in the background and burning CPU time, so we need to make sure we gracefully cancel or kill tasks that are not needed anymore.
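The background-work concern can be illustrated with a cooperative cancellation flag. This is a sketch under a strong assumption: that the long-running work can check a flag between steps. Whether and how an in-flight runtime API call can actually be interrupted is not stated here, so `cancellable_work` and `run_or_cancel` are hypothetical stand-ins, not real APIs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a long-running call. It performs `steps` units
/// of work and checks the cancellation flag before each one.
pub fn cancellable_work(cancel: &AtomicBool, steps: u32) -> Option<u32> {
    for _ in 0..steps {
        if cancel.load(Ordering::Relaxed) {
            // Stop early instead of burning CPU on a result nobody wants.
            return None;
        }
        thread::sleep(Duration::from_millis(5)); // simulated unit of work
    }
    Some(steps)
}

/// Waits at most `timeout` for the work, then signals cancellation and joins
/// the worker so that nothing is left running in the background.
pub fn run_or_cancel(steps: u32, timeout: Duration) -> Option<u32> {
    let cancel = Arc::new(AtomicBool::new(false));
    let worker_cancel = Arc::clone(&cancel);
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // The receiver may have timed out already; ignore a failed send.
        let _ = tx.send(cancellable_work(&worker_cancel, steps));
    });

    // `None` here means either a timeout or a worker that sent no result.
    let outcome = rx.recv_timeout(timeout).ok().flatten();

    // Request cancellation; this is harmless if the worker already finished.
    cancel.store(true, Ordering::Relaxed);
    // The worker exits at its next flag check, so this join is short.
    let _ = handle.join();
    outcome
}
```

The design choice worth noting is the join after the timeout: it guarantees the abandoned task has really stopped, at the cost of waiting for one more step of work. Work that never reaches a check point cannot be stopped this way and would need a harder kill mechanism.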

alexggh mentioned this issue Sep 24, 2024

[Root cause] Finality lag and slow parachain block production immediately after runtime upgrade #5738

Closed

6 tasks
