Crier failing to report OOMed pods #210

Open
BenTheElder opened this issue Jul 25, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@BenTheElder
Member

In prow.k8s.io we're seeing some pods fail to report status beyond "running" even though they have really been OOMed (confirmed by manually inspecting the build cluster).

See: https://kubernetes.slack.com/archives/C7J9RP96G/p1721926824700569 and other #testing-ops threads

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2024
@jihoon-seo
Contributor

I guess that you mean crier.. 😊
/retitle Crier failing to report OOMed pods

@dims
Member

dims commented Jul 26, 2024

😢

@petr-muller
Contributor

I don't think this is crier - that only comes later. I think this is podutils interacting with kubernetes/kubernetes#117793.

NAME                                   READY   STATUS      RESTARTS   AGE
3b953040-572a-400f-8024-02310b8e54f8   1/2     OOMKilled   0          93m

The above (from the linked Slack thread) indicates that one of the two containers in the Pod is still alive - sidecar. That container waits for the entrypoint wrapper to write how the actual test process (entrypoint's subprocess) ended, but with kubernetes/kubernetes#117793 all processes in the container get OOMKilled, including entrypoint. In the past only the test process would be killed, so entrypoint would be able to collect its exit code and exit normally.
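
Roughly, the handoff looks like this (a simplified Go sketch, not the actual podutils code; names, paths and intervals are made up):

```go
// entrypoint runs the test process and writes its exit code to a marker
// file; sidecar polls for that file before it uploads artifacts. If
// entrypoint itself is OOMKilled, the marker never appears.
package wrapper

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// waitForMarker polls for the marker file until it shows up or ctx expires.
func waitForMarker(ctx context.Context, path string) (int, error) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		if b, err := os.ReadFile(path); err == nil {
			return strconv.Atoi(strings.TrimSpace(string(b)))
		}
		select {
		case <-ctx.Done():
			// With whole-container OOM kills this is where we end up:
			// entrypoint died before it could write the marker.
			return -1, fmt.Errorf("marker %s never written: %w", path, ctx.Err())
		case <-ticker.C:
		}
	}
}
```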

Not sure what the solution would be yet.

@BenTheElder
Member Author

IMHO crier should still eventually report what happened here, because it is observing the pod success / fail from the outside, but we see pods being permanently reported as running (and if they really are running we should be enforcing a timeout via plank?)
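
Something like this is what I mean by observing from the outside (sketch only, using the standard k8s.io/api types; the crier/plank wiring is omitted):

```go
// Check container statuses for an OOMKilled termination, regardless of what
// sidecar managed to report before it got stuck.
package reporter

import corev1 "k8s.io/api/core/v1"

// podWasOOMKilled reports whether any container in the pod terminated (or
// last terminated) with the OOMKilled reason.
func podWasOOMKilled(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.Reason == "OOMKilled" {
			return true
		}
		if t := cs.LastTerminationState.Terminated; t != nil && t.Reason == "OOMKilled" {
			return true
		}
	}
	return false
}
```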

We could either hand off more of this to crier / plank to observe the test container failure, or we could introduce additional heartbeating / signaling between entrypoint and sidecar?
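
A rough sketch of the heartbeating idea (the file path, interval and staleness threshold are all hypothetical, not existing podutils behavior):

```go
// entrypoint touches a heartbeat file while the test runs; sidecar treats a
// stale heartbeat as "entrypoint is dead" instead of waiting indefinitely
// for the exit-code marker.
package wrapper

import (
	"os"
	"time"
)

const heartbeatPath = "/logs/heartbeat" // hypothetical file on the shared emptyDir volume

// beat is run by entrypoint alongside the test process.
func beat(stop <-chan struct{}) {
	for {
		_ = os.WriteFile(heartbeatPath, []byte(time.Now().Format(time.RFC3339)), 0o644)
		select {
		case <-stop:
			return
		case <-time.After(30 * time.Second):
		}
	}
}

// entrypointLooksDead is checked by sidecar while it waits for the marker.
func entrypointLooksDead(maxStale time.Duration) bool {
	info, err := os.Stat(heartbeatPath)
	if err != nil {
		return false // heartbeat may simply not have started yet
	}
	return time.Since(info.ModTime()) > maxStale
}
```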

@petr-muller
Contributor

True; you are right that Plank should eventually reap a pod stuck like this. If that does not happen, it's perhaps a separate bug - I'm not sure whether plank just uses signals that are too soft (sidecar IIRC ignores some signals in certain states when it thinks it still needs to finish uploading artifacts) or whether it somehow fails to enforce a timeout entirely.
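
For illustration, the kind of hard timeout check I mean (the limit and the force-delete policy are assumptions here, not Plank's actual behavior):

```go
// Enforce a wall-clock limit from the outside, independent of whether
// sidecar honours whatever signal it receives.
package reaper

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// exceededHardTimeout returns true when a pod has been around longer than
// the allowed duration and should be force-deleted rather than signalled again.
func exceededHardTimeout(pod *corev1.Pod, limit time.Duration) bool {
	if pod.Status.StartTime == nil {
		return false
	}
	return time.Since(pod.Status.StartTime.Time) > limit
}
```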

@BenTheElder
Member Author

Left a breadcrumb on kubernetes/kubernetes#117793 because I think it's worth noting when the project breaks itself with breaking changes :+)

This problem is probably going to get worse as cgroup v2 rolls out to the rest of our clusters and more widely in the ecosystem.

@BenTheElder
Member Author

BenTheElder commented Aug 30, 2024

I'm seeing a lack of metadata (like the pod JSON) on pods that received SIGTERM and probably(?) weren't OOMed - at least there's no indication they were, and monitoring suggests they're actually not using much of the memory requested/limited.

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-arm64/1829281539263827968
