
[BUG] Investigate memory leak in single-binary deployment #3991

Closed
eapolinario opened this issue Aug 26, 2023 · 6 comments · Fixed by #5704
Assignees: eapolinario
Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working), stale

Comments

@eapolinario
Contributor

eapolinario commented Aug 26, 2023

Describe the bug

A user in the OSS slack reported that single-binary pods keep getting killed due to OOM errors. Here's a graph of memory consumption in this case (different colors refer to different pods):

[graph: memory consumption per pod]

They are running the stock Flyte v1.8.1 release.

Expected behavior

Memory consumption should remain relatively constant.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@eapolinario added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Aug 26, 2023
@wild-endeavor
Contributor

Look at the Prometheus metrics. Educated guess from Dan: every time a task runs, the Prometheus metrics in propeller might be contributing to memory usage.
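One way to follow up on that guess is to check how many series each metric family is holding on the metrics endpoint. A minimal sketch, assuming the single binary serves Prometheus text-format metrics at localhost:10254 (the address is an assumption, adjust it to your deployment):

```go
// Fetch a /metrics endpoint and print the metric families with the most
// series, to spot high-cardinality metrics.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sort"

	"github.com/prometheus/common/expfmt"
)

func main() {
	resp, err := http.Get("http://localhost:10254/metrics") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Sort metric family names by series count, highest first.
	names := make([]string, 0, len(families))
	for name := range families {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		return len(families[names[i]].GetMetric()) > len(families[names[j]].GetMetric())
	})

	// Print the top 20 offenders.
	for i, name := range names {
		if i == 20 {
			break
		}
		fmt.Printf("%6d series  %s\n", len(families[name].GetMetric()), name)
	}
}
```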

@eapolinario removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Sep 8, 2023
@eapolinario self-assigned this on Sep 8, 2023
@eapolinario added the backlogged (For internal use. Reserved for contributor team workflow.) label on Sep 25, 2023

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

@github-actions bot added the stale label on Jun 22, 2024
@cpaulik
Contributor

cpaulik commented Aug 28, 2024

This looks a lot like what I'm seeing as well. See https://flyte-org.slack.com/archives/CP2HDHKE1/p1723819039330009 for more discussion.

Is there a way to disable prometheus metrics to test this? I haven't found a configuration option yet...

Would deploying via flyte-core help here, since this issue is specifically about the single-binary deployment?

@eapolinario
Contributor Author

After some investigation, it turns out that accumulating metrics data is the expected behavior of the Prometheus Go client, as per prometheus/client_golang#920. We emit high-cardinality metrics in flytepropeller; as an example, you can see here the metrics emitted after running a few hundred workflow executions. Note how we maintain a large number of metrics per execution ID. This is the symptom described in the Prometheus client GitHub issue.

One could argue that emitting metrics per-execution id is the wrong granularity, but that's exactly what we do in the case of single-binary.
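As a rough illustration of that pattern (the metric and label names below are invented, not the actual flytepropeller metrics), a client_golang collector keyed on an unbounded label such as an execution ID grows one child series per distinct value and never releases them on its own:

```go
// Illustrative only: hypothetical metric/label names. The point is the
// client_golang behavior described above: every distinct label value creates
// a child series that the registry retains for the lifetime of the process.
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

var workflowRounds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "propeller_workflow_rounds_total", // hypothetical name
		Help: "Evaluation rounds, labeled per execution.",
	},
	[]string{"exec_id"}, // unbounded label: one new value per execution
)

func main() {
	prometheus.MustRegister(workflowRounds)

	// Each new exec_id allocates a new child counter the first time it is
	// seen, and that child is kept even after the execution finishes.
	for i := 0; i < 100000; i++ {
		workflowRounds.WithLabelValues(fmt.Sprintf("exec-%d", i)).Inc()
	}
}
```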

would deploying via flyte-core help here?

Knowing what we know now, the set of default metrics emitted by flytepropeller does not cause the issue. A similar argument can be made for flyteadmin. In other words, this was a single-binary-only issue.

Here's the PR to remove that label from single binary metrics: #5704
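The actual change is in #5704 (removing that label from the single-binary metrics); the sketch below only illustrates the two generic mitigations for unbounded label cardinality in client_golang, with placeholder metric and label names:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Option 1: keep only bounded labels, so the number of child series stays
// fixed no matter how many executions run (the spirit of the #5704 fix).
var workflowRounds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "propeller_workflow_rounds_total", // hypothetical name
		Help: "Evaluation rounds.",
	},
	[]string{"project", "domain"}, // bounded label values only
)

// Option 2: if per-execution series are genuinely needed, delete them once an
// execution reaches a terminal state so the client releases the child series.
var perExecRounds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "propeller_execution_rounds_total", // hypothetical name
		Help: "Evaluation rounds per execution.",
	},
	[]string{"exec_id"},
)

func onExecutionFinished(execID string) {
	perExecRounds.DeleteLabelValues(execID)
}

func main() {
	prometheus.MustRegister(workflowRounds, perExecRounds)

	perExecRounds.WithLabelValues("exec-1").Inc()
	onExecutionFinished("exec-1") // the "exec-1" series is removed from the registry output
}
```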

@wild-endeavor
Contributor

Related to #5606

@divyank000

This has solved the issue on a shorter time horizon, but I still see it on a longer time horizon.
Initially, for us, it showed up after a few hours under production workload; now it takes around 2 days.
