[5 / 5] Introduce approval-voting-parallel #4849
Conversation
This is part of the work to further optimize the approval subsystems. If you want to understand the full context, start with reading #4849 (comment); however, that is not necessary, as this change is self-contained and nodes benefit from it regardless of whether the subsequent changes land or not.

While testing with 1000 validators I found that the logic for determining the validators an assignment should be gossiped to takes a lot of time, because it always iterated through all the peers to determine which are X and Y neighbours and to which we should randomly gossip (4 samples). This can be optimised so that we don't have to iterate through all peers for each new assignment, by fetching the list of X and Y peer ids from the topology first and then stopping the loop once we have taken the 4 random samples.

With these improvements we reduce the total CPU time spent in approval-distribution by 15% on networks with 500 validators and by 20% on networks with 1000 validators.

## Test coverage

`propagates_assignments_along_unshared_dimension` and `propagates_locally_generated_assignment_to_both_dimensions` already cover this logic, and the fact that they still pass confirms there is no breaking change. Additionally, the approval voting benchmark measures the traffic sent to other peers, so I confirmed that for various network sizes there is no difference in the size of the traffic sent to other peers.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
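A minimal sketch of the idea, with invented names rather than the actual `approval-distribution` types: the X/Y neighbour sets come precomputed from the stored topology, and the loop for the extra random samples stops as soon as the samples are taken instead of scanning every connected peer (the real logic picks the extra peers at random; the sketch just takes the first candidates).

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the network `PeerId` type.
type PeerId = u64;

/// Simplified view of the grid topology: the X/Y neighbour sets are
/// stored up front instead of being rediscovered by scanning every
/// connected peer for each new assignment.
struct GridTopology {
    x_neighbours: HashSet<PeerId>,
    y_neighbours: HashSet<PeerId>,
}

/// Peers an assignment is gossiped to: all X/Y neighbours plus up to
/// `samples` additional peers.
fn peers_to_gossip(
    topology: &GridTopology,
    all_peers: &[PeerId],
    samples: usize,
) -> HashSet<PeerId> {
    // Start from the precomputed neighbour sets; no full per-peer scan.
    let mut out: HashSet<PeerId> = topology
        .x_neighbours
        .union(&topology.y_neighbours)
        .copied()
        .collect();

    // Stop as soon as the extra samples are taken instead of iterating
    // over the whole peer list for every assignment.
    let mut taken = 0;
    for peer in all_peers {
        if taken == samples {
            break;
        }
        if out.insert(*peer) {
            taken += 1;
        }
    }
    out
}

fn main() {
    let topology = GridTopology {
        x_neighbours: [1, 2].into_iter().collect(),
        y_neighbours: [3, 4].into_iter().collect(),
    };
    let all_peers: Vec<PeerId> = (1..=100).collect();
    let gossip_to = peers_to_gossip(&topology, &all_peers, 4);
    println!("gossiping to {} peers", gossip_to.len());
}
```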
Looking good overall from a light review
This is part of the work to further optimize the approval subsystems; if you want to understand the full context, start with reading #4849 (comment).

# Description

This PR contains the changes needed to make it possible to run a single approval-voting instance on a worker thread, so that it can be instantiated by the approval-voting-parallel subsystem. It does not contain any functional changes: it just decouples the subsystem from the subsystem Context and introduces more specific trait dependencies for each function instead of all of them requiring a context. This change can be merged independently of the follow-up PRs.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
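A rough illustration of this kind of decoupling, using invented trait and type names rather than the real subsystem traits: a function that used to take the full subsystem `Context` instead asks only for the narrow capabilities it actually uses, so the same logic can run inside the subsystem or on a plain worker thread.

```rust
// Hypothetical traits standing in for the narrower dependencies each
// function asks for instead of a full subsystem `Context`.
trait SendApprovalMessages {
    fn send(&mut self, msg: String);
}

trait FetchRuntimeInfo {
    fn session_index(&self) -> u32;
}

// Before: fn process_assignment(ctx: &mut Context, ...) needed the whole context.
// After: only the capabilities actually used are required, so the body
// can run on any worker that provides them.
fn process_assignment<S, R>(sender: &mut S, runtime: &R, assignment: &str)
where
    S: SendApprovalMessages,
    R: FetchRuntimeInfo,
{
    let session = runtime.session_index();
    sender.send(format!("processed {assignment} in session {session}"));
}

// Trivial in-memory implementations, e.g. for a worker thread or a test.
struct Buffer(Vec<String>);
impl SendApprovalMessages for Buffer {
    fn send(&mut self, msg: String) {
        self.0.push(msg);
    }
}

struct FixedSession(u32);
impl FetchRuntimeInfo for FixedSession {
    fn session_index(&self) -> u32 {
        self.0
    }
}

fn main() {
    let mut out = Buffer(Vec::new());
    process_assignment(&mut out, &FixedSession(7), "assignment-1");
    println!("{:?}", out.0);
}
```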
Great work! 🚀
I have no better idea on how to improve this, so the best is that we stick with the current approach.
Excellent work @alexggh !
Looking forward to seeing this live. We should try to increase Kusama to 750 para validators after the para inherent runtime weight fix is applied.
Final testing on Versi and on a single Kusama validator from https://github.com/paritytech/devops/issues/3564 was successful, so merging this now. The current state is that the approval-voting-parallel subsystem is enabled on all test networks, but not yet on production networks.
This is the implementation of the approach described here: #1617 (comment) & #1617 (comment) & #1617 (comment).
Description of changes

The end goal is to have an architecture with a single subsystem (approval-voting-parallel) and multiple worker types that fulfill the work currently done by the approval-distribution and approval-voting subsystems. The main loop of the new subsystem does just the distribution of work to the workers. The new subsystem will have multiple workers for the distribution work and one worker for the approval-voting work.

Messages are routed to the distribution workers based on `worker_index = msg.validator % WORKER_COUNT`; this guarantees that all assignments and approvals from the same validator reach the same worker. On the hot path of processing messages, no synchronisation or waiting is needed between the approval-distribution and approval-voting workers.
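A minimal sketch of that distribution rule; the channel setup and message shape here are hypothetical, not the actual subsystem types. The main loop routes each message to a worker by validator index, so per-validator ordering is preserved without any locking between workers.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

const WORKER_COUNT: usize = 4;

// Hypothetical stand-in for an assignment/approval network message.
#[derive(Debug)]
struct Msg {
    validator: usize,
    payload: String,
}

fn main() {
    // One channel per distribution worker.
    let mut senders: Vec<Sender<Msg>> = Vec::new();
    let mut handles = Vec::new();
    for i in 0..WORKER_COUNT {
        let (tx, rx) = channel::<Msg>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            for msg in rx {
                // Real workers would do the crypto checks and the
                // distribution logic here.
                println!("worker {i} handling {msg:?}");
            }
        }));
    }

    // Main loop: messages from the same validator always land on the
    // same worker, so no cross-worker synchronisation is needed.
    for msg in [
        Msg { validator: 10, payload: "assignment".into() },
        Msg { validator: 10, payload: "approval".into() },
        Msg { validator: 3, payload: "assignment".into() },
    ] {
        let worker_index = msg.validator % WORKER_COUNT;
        senders[worker_index].send(msg).unwrap();
    }

    drop(senders); // close the channels so the workers exit
    for h in handles {
        h.join().unwrap();
    }
}
```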
Guidelines for reading
The full implementation is broken into 5 PRs. All of them are self-contained and improve things incrementally, even without the parallelisation being implemented/enabled. The reason this approach was taken instead of a big-bang PR is to make things easier to review and to reduce the risk of breaking these critical subsystems.
After reading the full description of this PR, the changes should be read in the following order:
- Decouple the approval-voting logic from the subsystem Context so that multiple instances of the subsystem can run on different threads. No functional changes.
- Introduce the approval-voting-parallel subsystem, which runs on different workers the logic currently in approval-distribution and approval-voting.

Results
Running subsystem-benchmarks with 1000 validators, 100 fully occupied cores, and triggering all assignments and approvals for all tranches.
Approval does not lag behind
(charts: Master vs. With this PoC)
Gathering enough assignments
Enough assignments are gathered in less than 500 ms, which gives us a guarantee that unnecessary work does not get triggered. On master, on the same benchmark, the subsystems fall behind on work and that number goes above 32 seconds.
CPU usage
(charts: Master vs. With this PoC)
Enablement strategy
Because only some trivial plumbing is needed in approval-distribution and approval-voting to be able to run things in parallel, and because these subsystems play a critical part in the system, this PR proposes that we keep both ways of running the approval work: as separate subsystems, and as a single subsystem (approval-voting-parallel) with multiple workers for the distribution work and one worker for the approval-voting work, switching between them with a command-line flag. The benefits of this are twofold.
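As a hedged sketch only (the flag name and the wiring below are invented for illustration, not the actual polkadot CLI option or overseer setup): the node would decide which set of approval subsystems to spawn based on a single command-line switch.

```rust
// Hypothetical enablement switch; the real node wires this decision
// into the overseer builder rather than a simple function like this.
enum ApprovalMode {
    // Separate approval-distribution and approval-voting subsystems.
    Legacy,
    // Single approval-voting-parallel subsystem with worker threads.
    Parallel,
}

fn approval_mode_from_args(args: &[String]) -> ApprovalMode {
    // Invented flag name, for illustration only.
    if args.iter().any(|a| a == "--approval-voting-parallel") {
        ApprovalMode::Parallel
    } else {
        ApprovalMode::Legacy
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    match approval_mode_from_args(&args) {
        ApprovalMode::Legacy => {
            println!("spawning approval-distribution + approval-voting")
        }
        ApprovalMode::Parallel => {
            println!("spawning approval-voting-parallel with workers")
        }
    }
}
```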
Next steps
@ordian @eskimor @sandreim @AndreiEres, let me know what you think.