Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take hz into account in activerehashing to avoid CPU spikes #977

Open
wants to merge 5 commits into
base: unstable
Choose a base branch
from

Conversation

enjoy-binbin
Copy link
Member

@enjoy-binbin enjoy-binbin commented Sep 2, 2024

Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.

If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.

This PR takes hz into account and fixed the CPU usage at 1% (this may
not be accurate in some cases because we do 100 step rehashing in
dictRehashMicroseconds but it can avoid CPU spikes in this case).

Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.

If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.

This PR takes hz into account and fixed the CPU usage a 1% (this may
not be accurate in some cases because we do 100 step rehashing in
dictRehashMicroseconds but it can avoid CPU spikes in this case).

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin
Copy link
Member Author

Another approach is that we only update conf to mention the effect of hz.

In addition, i will mention these things we've discussed in the old times:

  1. add active_rehashing_cpu_milliseconds in INFO fields, i got rejected in Redis, back then we don't want to introduce this field in INFO (too many fields) see Add active_rehashing_cpu_milliseconds to INFO STATS redis/redis#13110
  2. Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
  3. Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

Copy link

codecov bot commented Sep 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.39%. Comparing base (c642cf0) to head (7ffffaa).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #977      +/-   ##
============================================
- Coverage     70.54%   70.39%   -0.16%     
============================================
  Files           114      114              
  Lines         61644    61646       +2     
============================================
- Hits          43489    43395      -94     
- Misses        18155    18251      +96     
Files with missing lines Coverage Δ
src/server.c 88.60% <100.00%> (-0.03%) ⬇️
src/server.h 100.00% <ø> (ø)

... and 12 files with indirect coverage changes

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like it.

src/server.c Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 4, 2024
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only remaining comment before we merge this is about the documentation.

I have a suggestion for rewritten documentation for activerehashing in the conf file:

# Active rehashing uses 1% of the CPU time to help perform incremental rehashing
# of the main server hash tables, the ones mapping top-level keys to values.
#
# If active rehashing is disabled and rehashing is needed, a hash table is
# rehashed one "step" on every operation performed on the hash table (add, find,
# etc.), so if the server is idle, the rehashing may never complete and some
# more memory is used by the hash tables. Active rehashing helps prevent this.
#
# Active rehashing runs as a background task. Depending on the value of 'hz',
# the frequency at which the server performs background tasks, active rehashing
# can cause the server to freeze for a short time. For example, if 'hz' is set
# to 10, active rehashing runs for up to one millisecond every 100 milliseconds.
# If a freeze of one millisecond is not acceptable, you can increase 'hz' to let
# active rehashing run more often. If instead 'hz' is set to 100, active
# rehashing runs up to only 100 microseconds every 10 milliseconds. The total is
# still 1% of the time.
activerehashing yes

What do you think?

@zuiderkwast
Copy link
Contributor

If we set the active rehashing to hard-coded 1% of CPU time now, then in the future we can add active-rehashing-effort to control this with 1% as the default value.

I like the idea of active-rehashing-effort, since it is kind of similar to expire-effort. So maybe we should already now try to make them similar to each other. Is the expire-effort the effort per cron cycle (more effort with higher hz) or is it the effort in percent of CPU time, independent of 'hz'?

@enjoy-binbin
Copy link
Member Author

enjoy-binbin commented Sep 9, 2024

The doc is great, i will take it, thanks.

I like the idea of active-rehashing-effort, since it is kind of similar to expire-effort. So maybe we should already now try to make them similar to each other. Is the expire-effort the effort per cron cycle (more effort with higher hz) or is it the effort in percent of CPU time, independent of 'hz'?

there are two items, the active-rehash-cycle is the effort in percent of CPU time, independent of hz. The active-rehash-effort is the effort per cron cycle (more effort with higher hz). Which one do you like more?

Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@PingXie
Copy link
Member

PingXie commented Sep 13, 2024

I only took a quick look at the PR. It makes sense to me.

@zuiderkwast
Copy link
Contributor

there are two items, the active-rehash-cycle is the effort in percent of CPU time, independent of hz. The active-rehash-effort is the effort per cron cycle (more effort with higher hz). Which one do you like more?

Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

I'm not sure, is active-rehash-cycle something users are not required to think about?

To me, active-rehash-effort seems like less to think about for users. Easier.

Effort per CPU usage is a better, because the rehashing needed is relative to the traffic, which is relative to the CPU usage.

Before we add the config, I think we need to investigate if we can find out some way to choose the best behaviour for the users, maybe automatic. Some possibilities are:

  • Count write commands and adjust automatically to that.
  • Measure if rehashing time. If it is finished very fast, then we can do less rehashing. If it is running almost all the time in cron, then we should do more effort in rehashing.
  • Keep track of remaining rehashing work. Kvstore already has a list of dicts that need rehashing, but for each of them it could be possible to check rehashIdx internally and calculate how much is remaining.

The active-exire-effort is very abstract. The explanation is like this:

# The default effort of the expire cycle will try to avoid having more than
# ten percent of expired keys still in memory, and will try to avoid consuming
# more than 25% of total memory and to add latency to the system. However
# it is possible to increase the expire "effort" that is normally set to
# "1", to a greater value, up to the value "10". At its maximum value the
# system will use more CPU, longer cycles (and technically may introduce
# more latency), and will tolerate less already expired keys still present
# in the system. It's a tradeoff between memory, CPU and latency.
#
# active-expire-effort 1

@enjoy-binbin
Copy link
Member Author

enjoy-binbin commented Sep 13, 2024

active-rehash-cycle just a CPU cycle [1, 49], default is 1. Users can control how many cycles to use in activerehash, but i can see it is hard to decide the value, we always set it to 1 in the internal fork. It is a option in case this one got merged, and we are not able to adjust the cycle and the rehashing is too slow in some cases (no read/write traffic).

uint64_t threshold_us = 1 * 1000000 / server.hz / 100;
become
uint64_t threshold_us = cycle * 1000000 / server.hz / 100;

active-rehash-effort will like active-exire-effort and the range is [1, 10], it can use the same method we used in active-exire-effort, and the problem is currently active-exire-effort is hard to understand for user (or even for ours develper, oops). The explanation is very abstract, if you don't look in the code, you will never find out what it cost. That is why Oran reject it in the past, it is hard for user to understand and adjust it.

@zuiderkwast
Copy link
Contributor

active-rehash-cycle just a CPU cycle [1, 49], default is 1. Users can control how many cycles to use in activerehash, but i can see it is hard to decide the value, we always set it to 1 in the internal fork.

uint64_t threshold_us = 1 * 1000000 / server.hz / 100;
become
uint64_t threshold_us = cycle * 1000000 / server.hz / 100;

With this caluclation, we will use cycle % of CPU time.

This is good. I like it. But the name cycle is confusing me with "cron cycle".

active-rehash-percent?

But we don't need to add the config now. With this PR, it is probably enough.

@enjoy-binbin
Copy link
Member Author

yean, active-rehash-percent seems better, sorry for the confusing (bad English aha).

yes, we don't need it now i guess. It is a option in case this one got merged (fixed 1%), and we are not able to adjust the speed and someone think rehashing is too slow and eating the memory.

@zuiderkwast
Copy link
Contributor

Merge today? Or major decision?

Not bad english, just two different cycle: "cron cycle" vs "CPU cycle".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants