
Performance enhancements #94

Merged: 15 commits merged into main on Aug 24, 2023
Conversation

@AlexanderWells-diamond (Contributor) commented on Aug 9, 2023

This branch contains a number of performance enhancements for Coniql. Together they give a 20-25% decrease in CPU usage, which effectively increases Coniql's requests-per-second from ~1,000 to ~2,000, with larger gains when 5+ clients subscribe to the same set of PVs.

The largest performance gains came from simply removing the async keyword wherever possible. Additional gains came from creating a connection pool in the CA Plugin, so that there is only one camonitor per PV regardless of the number of client subscriptions for that PV.

There is also a fair amount of work enhancing the Python performance tests, although those changes do not affect Coniql's performance.

There is a new document titled "Performance", which goes into some detail about the work done and where time is spent. Please let me know if it should have more detail added.

Also of note is dropping Python 3.7 support: Strawberry, as of 0.197.0, requires Python 3.8+, and since we want to keep up to date with Strawberry this change seems inevitable.

Closes #86 (by deleting that code)

NOTE: I haven't written any tests for this yet, but I wanted to put the PR up so the actual code can get some more eyes on it before I write any.

Includes adding the duration of the tests to the result, using ujson to spend as little time as possible parsing data (so updates can be received as quickly as possible), using uuid identifiers for subscriptions, and making the output look better.

The uuid identifiers remove a potential issue in Strawberry where multiple clients use the same ID for each subscription - previously the ID was simply the PV's name. Real clients, e.g. Apollo, do something similar with uuids.
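
A minimal sketch of what these test-client changes might look like; the websocket protocol details and names here are illustrative, not Coniql's actual test code:

```python
import time
import uuid

import ujson  # drop-in replacement for json, with faster loads/dumps

async def subscribe(ws, pv_name: str) -> str:
    # A fresh uuid per subscription instead of the PV name, so two clients
    # subscribing to the same PV never share an operation id.
    sub_id = str(uuid.uuid4())
    query = f'subscription {{ subscribeChannel(id: "{pv_name}") {{ value {{ float }} }} }}'
    await ws.send_str(ujson.dumps({"id": sub_id, "type": "subscribe", "payload": {"query": query}}))
    return sub_id

async def time_updates(ws, n_expected: int) -> float:
    # Record the duration in the result; ujson keeps parsing cheap so the
    # client can drain updates as quickly as the server sends them.
    start = time.monotonic()
    for _ in range(n_expected):
        ujson.loads(await ws.receive_str())
    return time.monotonic() - start
```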
Unbuffered output ensures all data is pushed to files as soon as possible, which is useful when you have to terminate the tests early (or they don't finish properly).
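
For example (a sketch; `run_tests` is a hypothetical stand-in for the real test loop):

```python
def run_tests():
    # Hypothetical stand-in for the real performance-test loop.
    yield "iteration=1 duration=0.52s"

# buffering=1 selects line buffering for text files, and flush=True pushes
# each record out immediately, so partial results survive an early kill.
with open("results.txt", "w", buffering=1) as f:
    for result in run_tests():
        print(result, file=f, flush=True)
```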

The subscription script path should always be found in SUB_DIR, which is not the same as ".".

Also removes trailing whitespace.

The contents are just copied from the .gitignore.

This provides about a 1% CPU saving, and there aren't even any fields that require conversion anyway.
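
Assuming the conversion referred to is Strawberry's automatic camel-casing, the change presumably looks something like this sketch (the placeholder `Query` is illustrative, not Coniql's real schema):

```python
import strawberry
from strawberry.schema.config import StrawberryConfig

@strawberry.type
class Query:
    @strawberry.field
    def ping(self) -> str:  # placeholder field; Coniql's real schema differs
        return "pong"

# auto_camel_case=False skips the snake_case -> camelCase conversion pass
schema = strawberry.Schema(query=Query, config=StrawberryConfig(auto_camel_case=False))
```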
This is a minor performance enhancement: when processing metadata, and therefore the Channel's quality, the creation of this list on every iteration took up about 1% of total CPU time.
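
The pattern, with illustrative names (the real list and fields in Coniql differ):

```python
# Built once at import time, instead of on every metadata update.
CHANNEL_QUALITY_MAP = ["VALID", "WARNING", "ALARM", "INVALID", "UNDEFINED"]

def quality_from_severity(severity: int) -> str:
    # The per-update path now only indexes the list; it no longer
    # allocates a new one each iteration.
    return CHANNEL_QUALITY_MAP[severity]
```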
There is no reason for these to be async, and doing this gives about a
10% CPU improvement.
Now we get and store the channels during the initial Query/Mutation, which must be async functions anyway as they must call caget/caput. These retrieved channels are then passed to the Resolver step, and since they are already present the resolvers no longer need to be async to re-fetch them.

This gives approximately 10% CPU gains.
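
A sketch of the pattern, with hypothetical names (Coniql's actual types and resolvers differ):

```python
import strawberry
from aioca import caget

@strawberry.type
class Channel:
    def __init__(self, value: float) -> None:
        self._value = value

    @strawberry.field
    def value(self) -> float:
        # Sync resolver: the data was fetched up front, so there is no
        # await here and no per-field coroutine overhead.
        return self._value

@strawberry.type
class Query:
    @strawberry.field
    async def get_channel(self, id: str) -> Channel:
        # Only the top-level resolver stays async, since it must caget.
        return Channel(await caget(id))
```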
This means Coniql only has a single camonitor for each PV, with the new CASubscriptionManager handling fanning the updates out to each subscription.

With only one client this is ~1-2% CPU improvement. This saving
increases with more clients, up to ~6-8% with 20 clients.

Note that Channel Access itself does do some connection pooling - only
one socket connection will be made to an IOC regardless of how many
subscription requests are made, but the amount of data being sent over
the wire will increase for each subsequent monitor. By pooling it
ourselves we save some network bandwidth.
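
A minimal sketch of the pooling idea, with hypothetical names; the real CASubscriptionManager differs in detail:

```python
import asyncio
from typing import Any, Dict, List

from aioca import camonitor

class SubscriptionManager:
    def __init__(self) -> None:
        self._queues: Dict[str, List[asyncio.Queue]] = {}
        self._monitors: Dict[str, Any] = {}

    def subscribe(self, pv: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        if pv not in self._monitors:
            # First subscriber to this PV: start the single camonitor.
            self._queues[pv] = []
            self._monitors[pv] = camonitor(pv, lambda value, pv=pv: self._fan_out(pv, value))
        self._queues[pv].append(queue)
        return queue

    def _fan_out(self, pv: str, value: Any) -> None:
        # Each Channel Access update is delivered to every subscriber's queue.
        for queue in self._queues[pv]:
            queue.put_nowait(value)
```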
This is to support asyncio.Queue type hints.
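
This likely refers to postponed evaluation of annotations, which makes parameterised asyncio.Queue hints legal even on Pythons where the class is not subscriptable at runtime:

```python
from __future__ import annotations

import asyncio

def make_value_queue() -> asyncio.Queue[float]:
    # The annotation above is never evaluated at runtime, so this parses
    # and runs even where asyncio.Queue[...] would raise a TypeError.
    return asyncio.Queue()
```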
As they are only scraped once every 15 seconds by Prometheus, it is unnecessary to update them significantly more frequently than that.
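
For instance (a sketch with an illustrative metric name, not necessarily Coniql's):

```python
import asyncio
from prometheus_client import Gauge

ACTIVE_CHANNELS = Gauge("coniql_active_channels", "Number of active channels")

async def update_metrics(get_channel_count, interval: float = 10.0) -> None:
    # Prometheus scrapes every ~15 s; updating much faster than that spends
    # CPU on values that are never observed.
    while True:
        ACTIVE_CHANNELS.set(get_channel_count())
        await asyncio.sleep(interval)
```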
See issue #95 for when to remove these
@AlexanderWells-diamond (Contributor, Author) commented:

I've made commit f4d794c to suppress the MyPy errors, and raised #95 to cover the work to resolve that.

@codecov bot commented on Aug 9, 2023

Codecov Report

Merging #94 (3a8c46e) into main (86054d9) will increase coverage by 0.85%.
Report is 11 commits behind head on main.
The diff coverage is 95.91%.

@@            Coverage Diff             @@
##             main      #94      +/-   ##
==========================================
+ Coverage   93.30%   94.15%   +0.85%     
==========================================
  Files          10       10              
  Lines         807      856      +49     
==========================================
+ Hits          753      806      +53     
+ Misses         54       50       -4     
Files Changed                     Coverage            Δ
src/coniql/simplugin.py           94.54% <66.66%>     (-0.52%) ⬇️
src/coniql/types.py               94.11% <92.30%>     (-0.62%) ⬇️
src/coniql/caplugin.py            97.31% <95.32%>     (+4.71%) ⬆️
src/coniql/strawberry_schema.py   98.68% <98.59%>     (+0.18%) ⬆️
src/coniql/app.py                 69.84% <100.00%>    (+0.98%) ⬆️

... and 1 file with indirect coverage changes


@rjwills28 (Collaborator) commented:

I've tested on my local system and also see the performance improvements.

I like the idea of only maintaining one camonitor per PV when there are multiple clients; however, I have come across a problem with this implementation. If I load a screen or send a subscription from the GraphiQL interface, I see the value of that PV update. However, if I then open a second instance of the screen, or send the same subscription from another GraphiQL instance, the value does not get displayed. The only way to get a value to display in this second instance is for the PV itself to be updated (i.e. caput xxx). This is not a problem when the PV is updating at a given rate, but otherwise I think it could be a problem. Let me know if you need any more details on how to reproduce.

@AlexanderWells-diamond (Contributor, Author) commented:

Thanks for investigating that. I'm a little surprised, though - line 262 of caplugin.py exists to cover exactly this situation. I wonder if it's something to do with the GraphiQL interface. Would you mind repeating the test, but instead of opening a new tab, opening an entirely new (private) window for the second connection?

@rjwills28 (Collaborator) commented:

I retested as described, and also tried displaying two instances of a screen in cs-web-proto subscribing to a PV that is not on a scan update. I got the same result: the second screen/GraphiQL instance did not update. Note that if both screens are open when Coniql starts up then all is OK; the key is starting Coniql and then opening one screen followed by the other. I added a debug statement before line 262 and saw that it was not printed when the second instance was loaded, so I don't think the code is even reaching that point.

@AlexanderWells-diamond (Contributor, Author) commented:

Thanks again, I've clearly missed something. I'll investigate further.

Previously, if the PV never updated its value, additional subscriptions beyond the first would never receive an update, as the subscription manager callback would never be invoked.

Now we ensure that the initial subscribe function always returns a completed channel, which is immediately consumed by `await value.get()`.
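
A sketch of the shape of the fix, with hypothetical names (see commit 3013983 for the real change):

```python
import asyncio
from typing import AsyncIterator, Dict, List

class SubscriptionManager:
    def __init__(self) -> None:
        self._queues: Dict[str, List[asyncio.Queue]] = {}
        self._last_value: Dict[str, object] = {}

    async def subscribe(self, pv: str) -> AsyncIterator[object]:
        queue: asyncio.Queue = asyncio.Queue()
        self._queues.setdefault(pv, []).append(queue)
        if pv in self._last_value:
            # Prime the new subscriber with the last-known value rather than
            # waiting for a monitor callback, which may never come.
            queue.put_nowait(self._last_value[pv])
        while True:
            yield await queue.get()
```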
@AlexanderWells-diamond (Contributor, Author) commented:

I believe I've fixed this in 3013983. It worked in local tests with non-updating and slowly updating PVs.

@rjwills28 (Collaborator) left a comment:


Thanks, I can confirm that that issue has now been fixed with the latest commit. All looks good on this.

@AlexanderWells-diamond (Contributor, Author) commented:

Thank you for the review. I plan to add tests to this PR before merging it.

These cover multiple subscribers to the same PV, unsubscribing, and automatic camel casing.

Also removes the unnecessary run_ioc function.

Originally added on the basis that we might care about tracking the number of events lost from the queue, but that isn't necessary.
@AlexanderWells-diamond merged commit 3a8c46e into main on Aug 24, 2023 - 17 checks passed. The performance_enhancements branch was deleted on August 24, 2023.
Development: merging this pull request may close "Poor handling of failure to get loop in subscribe_channel".