Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SAMZA-2800] A new Control Group Metric for Samza #1699

Closed
wants to merge 6 commits into from

Conversation

li-afaris
Copy link
Contributor

@li-afaris li-afaris commented Jun 3, 2024

Introduction

Hadoop clusters have the ability to restrict CPU usage for Samza applications by utilizing Control Groups, (Cgroups).
Before enabling CPU enforcement on Hadoop clusters, application owners must have a way of knowing when their application is being throttled by Cgroups. This PR will add a new Cgroup metric that makes application owners aware if containers CPU usage is being throttled by control groups & whether the application needs to request additional resources.

Implementation

The Linux kernel reports when applications within a Cgroup has been throttled by writing values to a file named cpu.stat. cpu.stat contains two fields named nr_periods & nr_throttled. nr_periods represents the number of enforcement periods that elapsed. nr_thorttled represents the number of times the group has been throttled. We can treat these fields as a ratio that shows the number of times applications has been throttled over a number of enforcement periods. The proposal is to have the running container locate the cpu.stat file by reading property values from Hadoop's YARN config.

Implementation details

  • To limit high cardinality in the metrics storage layer, instead of using the Hadoop YARN container id, the metric will emit the Samza container ID as the hostname, (ie: Container 3). This is already supported by the existing metrics framework within Samza.
  • The container will emit a float value between zero and 1 as a gauge metric. A zero value means the Cgroup was not throttled for that period of time. A value of 1 means the Cgroup was unable to complete any work as it was persistently throttled.
  • To stay consistent with existing metrics, a negative value (-1) will be emitted if an exception is thrown when reading the cpu.stat file. Exceptions when reading cpu.stat will be logged to the container logs.
  • This implementation will be specific to Samza on Hadoop. The reasoning is the application itself should emit this metric, not the embedded library.

Considered Alternatives

I’m unaware of alternatives but reading values from cpu.stat is a pattern which appears in the Runc project. Runc is the underlying library for ContainerD which is used by both Docker & Kubernetes.

The metric needs to be emitted from the Samza container itself. Using a system daemon or sidecar application complicates deployments & creates data consistency issues when the sidecar process isn’t running.

External references

@li-afaris li-afaris marked this pull request as ready for review June 7, 2024 00:12
@li-afaris li-afaris changed the title A new Control Group Metric for Samza [SAMZA-2800] A new Control Group Metric for Samza Jun 8, 2024
@@ -47,7 +47,8 @@
scalaTestVersion = "3.0.1"
snappyVersion = "1.1.8.4"
slf4jVersion = "1.7.7"
yarnVersion = "2.10.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to split up the Hadoop version bump (and the related hadoop zookeeper gradle exclude) to a separate PR? It feels like it may be significant enough to stand alone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the next push

if (o == null || getClass() != o.getClass()) {
return false;
}
org.apache.samza.container.host.LinuxCgroupStatistics that = (org.apache.samza.container.host.LinuxCgroupStatistics) o;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be fully qualified here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the next push

* under the License.
*/

package org.apache.samza.container.host;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line break after package please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the next push

}

private double getCPUStat() {
if (this.containerID.equals("NOT_DETECTED")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is NOT_DETECTED defined? Should this be a new const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in the next push

// return a sentinel value to signal this is not running on Hadoop
return -2.0;
}
String[] controllers = {"cpu", "cpuacct", "cpu,cpuacct" };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order isn't guaranteed here. Please add "cpuacct,cpu".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I didn't think about directory name ordering. Updated in the next push

package org.apache.samza.container.host;

import org.junit.Test;
import static org.junit.Assert.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't wildcard import here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to figure out the IDE setting that does this. Updated in the next push.

package org.apache.samza.container.host;


import static org.junit.Assert.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't wildcard import here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to figure out the IDE setting that does this. Updated in the next push.

@li-afaris
Copy link
Contributor Author

I'm not happy with this approach after reviewing the changes. As these are YARN metrics, they should be part of the samza-yarn subproject and not part of samza-core. I'm going to close this PR and open a new one to simplify the change list.

@li-afaris li-afaris closed this Jul 12, 2024
@li-afaris li-afaris deleted the cgroup_metrics branch July 12, 2024 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants