Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(PUP-12061) Allow splay and splaylimit to be modified in puppet.conf #9484

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

joshcooper
Copy link
Contributor

@joshcooper joshcooper commented Sep 26, 2024

If an agent was already running periodically and splay or splaylimit was modified in puppet.conf, then the agent didn't recalculate its splay mode (enabled/disabled) or its splay offset. A full service restart was required for those changes to take effect.

Now the agent will behave as follows when the agent is running periodically:

  • if splay is changed from disabled to enabled, recalculate splay offset based on the current splaylimit
  • If splay is changed from enabled to disabled, set splay offset to 0.
  • If splaylimit is modified, then recalculate splay based on the new splaylimit
  • If runinterval is modified and splaylimit inherits from runinterval, then recalculate splay offset based on the new runinterval
  • If runinterval is modified, but splaylimit does not inherit, then keep splay offset as-is.

The agent also now logs an info message about splay at information level, see PUP-10732

$ bundle exec puppet agent --no-daemonize --verbose 
Notice: Starting Puppet client version 8.10.0
Info: Running agent every 10 seconds with splay 7 of 10 seconds

This behavior change affects the puppet agent service on all platforms except windows because of the way our windows service runs puppet agent --onetime, see PUP-5823

Confusingly, the changes to the Puppet::Daemon class affect all periodic runs, even when using --no-daemonize.

Running puppet agent --onetime (such as is implied by --test) is unaffected, because we don't start the scheduler.

Memoize the jobs so it's clearer which jobs 0 and 1 are.
selinux.pp Outdated Show resolved Hide resolved
joshcooper

This comment was marked as outdated.

If puppet.conf was modified, but splaylimit wasn't, then we recalculated
the splay offset. This goes against the desired behavior of only splaying when
the agent is initially started and running every runinterval seconds from that
point forward or until runinterval/splaylimit are changed in the future.

Now the SplayJob remembers its last splaylimit so that it only recalculates
splay if the limit changes. It's important to recalculate splay so that the
agent chooses a slot over the new range, either smaller or bigger than before.
Provide more context when the agent runs so it's clearer whether splay is
enabled and what the limits are. We don't want to do this for the reparse and
signal jobs, because they run every 15 and 5 seconds, respectively.

The info message is of the form:

    Running agent every 20 seconds with splay 5 of 10 seconds

This message is logged whenever the agent is running periodically, not for
onetime runs. Whether we're daemonized doesn't affect the behavior. This is
because we always call Puppet::Daemon#start when running periodically to
run the event loop.
Previously, enabling or disabling splay in puppet.conf did not work if the
agent was already running periodically.

If the agent was started with splay disabled, then enabling it would cause a
NoMethodError, when trying to call Puppet::Scheduler::Job#splay_limit

If the agent was started with splay enabled, then disabling it would have no
effect, since we never recalculated the splay limit.

To handle these cases, always create a SplayJob for the agent_run job and set
its splay_limit to either the limit or 0, depending on whether splay is enabled
or not. Setting a splay_limit to 0 causes the splay to also be set to 0 because
rand(1) always returns 0.
Since splaylimit defaults to runinterval, changes to runinterval affect splay.
Add tests covering cases where runinterval is increased and decreased. It's
important to recalculate splay to reduce thundering herds. For example, if
runinterval is changed from 30 mins to 60 mins, we want the compute a new slot
in the 60 min interval to decrease server load.
@joshcooper
Copy link
Contributor Author

Jenkins, please test this

@joshcooper
Copy link
Contributor Author

jenkins please test this

@joshcooper joshcooper marked this pull request as ready for review September 27, 2024 19:37
@joshcooper joshcooper requested a review from a team as a code owner September 27, 2024 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant