Take hz into account in activerehashing to avoid CPU spikes #977

enjoy-binbin · 2024-09-02T04:33:00Z

Currently in conf we describe activerehashing as: Active rehashing
uses 1 millisecond every 100 milliseconds of CPU time. This is the
case for hz = 10.

If we change hz, the description in conf will be inaccurate. Users
may notice that the server spends some CPU (used in activerehashing)
at high hz but don't know why, since our cron calls are fixed to 1ms.

This PR takes hz into account and fixed the CPU usage at 1% (this may
not be accurate in some cases because we do 100 step rehashing in
dictRehashMicroseconds but it can avoid CPU spikes in this case).

Currently in conf we describe activerehashing as: Active rehashing uses 1 millisecond every 100 milliseconds of CPU time. This is the case for hz = 10. If we change hz, the description in conf will be inaccurate. Users may notice that the server spends some CPU (used in activerehashing) at high hz but don't know why, since our cron calls are fixed to 1ms. This PR takes hz into account and fixed the CPU usage a 1% (this may not be accurate in some cases because we do 100 step rehashing in dictRehashMicroseconds but it can avoid CPU spikes in this case). Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2024-09-02T04:40:01Z

Another approach is that we only update conf to mention the effect of hz.

In addition, i will mention these things we've discussed in the old times:

add active_rehashing_cpu_milliseconds in INFO fields, i got rejected in Redis, back then we don't want to introduce this field in INFO (too many fields) see Add active_rehashing_cpu_milliseconds to INFO STATS redis/redis#13110
Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

codecov · 2024-09-02T04:47:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.39%. Comparing base (c642cf0) to head (7ffffaa).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #977      +/-   ##
============================================
- Coverage     70.54%   70.39%   -0.16%     
============================================
  Files           114      114              
  Lines         61644    61646       +2     
============================================
- Hits          43489    43395      -94     
- Misses        18155    18251      +96

Files with missing lines	Coverage Δ
src/server.c	`88.60% <100.00%> (-0.03%)`	⬇️
src/server.h	`100.00% <ø> (ø)`

... and 12 files with indirect coverage changes

zuiderkwast

Nice! I like it.

src/server.c

valkey.conf

zuiderkwast

My only remaining comment before we merge this is about the documentation.

I have a suggestion for rewritten documentation for activerehashing in the conf file:

# Active rehashing uses 1% of the CPU time to help perform incremental rehashing
# of the main server hash tables, the ones mapping top-level keys to values.
#
# If active rehashing is disabled and rehashing is needed, a hash table is
# rehashed one "step" on every operation performed on the hash table (add, find,
# etc.), so if the server is idle, the rehashing may never complete and some
# more memory is used by the hash tables. Active rehashing helps prevent this.
#
# Active rehashing runs as a background task. Depending on the value of 'hz',
# the frequency at which the server performs background tasks, active rehashing
# can cause the server to freeze for a short time. For example, if 'hz' is set
# to 10, active rehashing runs for up to one millisecond every 100 milliseconds.
# If a freeze of one millisecond is not acceptable, you can increase 'hz' to let
# active rehashing run more often. If instead 'hz' is set to 100, active
# rehashing runs up to only 100 microseconds every 10 milliseconds. The total is
# still 1% of the time.
activerehashing yes

What do you think?

zuiderkwast · 2024-09-08T17:57:28Z

If we set the active rehashing to hard-coded 1% of CPU time now, then in the future we can add active-rehashing-effort to control this with 1% as the default value.

I like the idea of active-rehashing-effort, since it is kind of similar to expire-effort. So maybe we should already now try to make them similar to each other. Is the expire-effort the effort per cron cycle (more effort with higher hz) or is it the effort in percent of CPU time, independent of 'hz'?

enjoy-binbin · 2024-09-09T02:57:33Z

The doc is great, i will take it, thanks.

I like the idea of active-rehashing-effort, since it is kind of similar to expire-effort. So maybe we should already now try to make them similar to each other. Is the expire-effort the effort per cron cycle (more effort with higher hz) or is it the effort in percent of CPU time, independent of 'hz'?

there are two items, the active-rehash-cycle is the effort in percent of CPU time, independent of hz. The active-rehash-effort is the effort per cron cycle (more effort with higher hz). Which one do you like more?

Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

Signed-off-by: Binbin <binloveplay1314@qq.com>

PingXie · 2024-09-13T07:18:35Z

I only took a quick look at the PR. It makes sense to me.

zuiderkwast · 2024-09-13T08:17:53Z

there are two items, the active-rehash-cycle is the effort in percent of CPU time, independent of hz. The active-rehash-effort is the effort per cron cycle (more effort with higher hz). Which one do you like more?

Add active-rehash-cycle config to control the CPU usage, default is 1, up to 50 (is the same as before, when hz = 500, and rehashing 1ms)
Or active-rehash-effort just like active-expire-effort to control the CPU usage, this was got rejected in Redis by Oran since he do not want to introduce too many configuration items that require users to think about them (we need to decide for them)

I'm not sure, is active-rehash-cycle something users are not required to think about?

To me, active-rehash-effort seems like less to think about for users. Easier.

Effort per CPU usage is a better, because the rehashing needed is relative to the traffic, which is relative to the CPU usage.

Before we add the config, I think we need to investigate if we can find out some way to choose the best behaviour for the users, maybe automatic. Some possibilities are:

Count write commands and adjust automatically to that.
Measure if rehashing time. If it is finished very fast, then we can do less rehashing. If it is running almost all the time in cron, then we should do more effort in rehashing.
Keep track of remaining rehashing work. Kvstore already has a list of dicts that need rehashing, but for each of them it could be possible to check rehashIdx internally and calculate how much is remaining.

The active-exire-effort is very abstract. The explanation is like this:

# The default effort of the expire cycle will try to avoid having more than
# ten percent of expired keys still in memory, and will try to avoid consuming
# more than 25% of total memory and to add latency to the system. However
# it is possible to increase the expire "effort" that is normally set to
# "1", to a greater value, up to the value "10". At its maximum value the
# system will use more CPU, longer cycles (and technically may introduce
# more latency), and will tolerate less already expired keys still present
# in the system. It's a tradeoff between memory, CPU and latency.
#
# active-expire-effort 1

enjoy-binbin · 2024-09-13T08:40:26Z

active-rehash-cycle just a CPU cycle [1, 49], default is 1. Users can control how many cycles to use in activerehash, but i can see it is hard to decide the value, we always set it to 1 in the internal fork. It is a option in case this one got merged, and we are not able to adjust the cycle and the rehashing is too slow in some cases (no read/write traffic).

uint64_t threshold_us = 1 * 1000000 / server.hz / 100;
become
uint64_t threshold_us = cycle * 1000000 / server.hz / 100;

active-rehash-effort will like active-exire-effort and the range is [1, 10], it can use the same method we used in active-exire-effort, and the problem is currently active-exire-effort is hard to understand for user (or even for ours develper, oops). The explanation is very abstract, if you don't look in the code, you will never find out what it cost. That is why Oran reject it in the past, it is hard for user to understand and adjust it.

zuiderkwast · 2024-09-13T08:46:43Z

active-rehash-cycle just a CPU cycle [1, 49], default is 1. Users can control how many cycles to use in activerehash, but i can see it is hard to decide the value, we always set it to 1 in the internal fork.
uint64_t threshold_us = 1 * 1000000 / server.hz / 100;
become
uint64_t threshold_us = cycle * 1000000 / server.hz / 100;

With this caluclation, we will use cycle % of CPU time.

This is good. I like it. But the name cycle is confusing me with "cron cycle".

active-rehash-percent?

But we don't need to add the config now. With this PR, it is probably enough.

enjoy-binbin · 2024-09-13T08:51:30Z

yean, active-rehash-percent seems better, sorry for the confusing (bad English aha).

yes, we don't need it now i guess. It is a option in case this one got merged (fixed 1%), and we are not able to adjust the speed and someone think rehashing is too slow and eating the memory.

zuiderkwast · 2024-09-13T09:02:45Z

Merge today? Or major decision?

Not bad english, just two different cycle: "cron cycle" vs "CPU cycle".

zuiderkwast reviewed Sep 3, 2024

View reviewed changes

src/server.c Show resolved Hide resolved

valkey.conf Outdated Show resolved Hide resolved

valkey.conf Outdated Show resolved Hide resolved

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 4, 2024

zuiderkwast reviewed Sep 8, 2024

View reviewed changes

enjoy-binbin added 4 commits September 9, 2024 11:06

Merge remote-tracking branch 'upstream/unstable' into activerehashing_1

3249207

Signed-off-by: Binbin <binloveplay1314@qq.com>

fix format

5fc8611

Signed-off-by: Binbin <binloveplay1314@qq.com>

doc review from zuiderkwast, thanks

8c88944

Signed-off-by: Binbin <binloveplay1314@qq.com>

fix format

7ffffaa

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast approved these changes Sep 13, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Take hz into account in activerehashing to avoid CPU spikes #977

Take hz into account in activerehashing to avoid CPU spikes #977

enjoy-binbin commented Sep 2, 2024 •

edited

Loading

enjoy-binbin commented Sep 2, 2024

codecov bot commented Sep 2, 2024 •

edited

Loading

zuiderkwast left a comment

zuiderkwast left a comment •

edited

Loading

zuiderkwast commented Sep 8, 2024

enjoy-binbin commented Sep 9, 2024 •

edited

Loading

PingXie commented Sep 13, 2024

zuiderkwast commented Sep 13, 2024

enjoy-binbin commented Sep 13, 2024 •

edited

Loading

zuiderkwast commented Sep 13, 2024

enjoy-binbin commented Sep 13, 2024

zuiderkwast commented Sep 13, 2024

Take hz into account in activerehashing to avoid CPU spikes #977

Are you sure you want to change the base?

Take hz into account in activerehashing to avoid CPU spikes #977

Conversation

enjoy-binbin commented Sep 2, 2024 • edited Loading

enjoy-binbin commented Sep 2, 2024

codecov bot commented Sep 2, 2024 • edited Loading

Codecov Report

zuiderkwast left a comment

Choose a reason for hiding this comment

zuiderkwast left a comment • edited Loading

Choose a reason for hiding this comment

zuiderkwast commented Sep 8, 2024

enjoy-binbin commented Sep 9, 2024 • edited Loading

PingXie commented Sep 13, 2024

zuiderkwast commented Sep 13, 2024

enjoy-binbin commented Sep 13, 2024 • edited Loading

zuiderkwast commented Sep 13, 2024

enjoy-binbin commented Sep 13, 2024

zuiderkwast commented Sep 13, 2024

enjoy-binbin commented Sep 2, 2024 •

edited

Loading

codecov bot commented Sep 2, 2024 •

edited

Loading

zuiderkwast left a comment •

edited

Loading

enjoy-binbin commented Sep 9, 2024 •

edited

Loading

enjoy-binbin commented Sep 13, 2024 •

edited

Loading