Out-of-memory error during program teardown deallocations #844

Open
nselliott opened this issue Aug 19, 2023 · 4 comments

@nselliott

Describe the bug

We have a SAMRAI test problem, running on CPUs only, that allocates most of its numerical data arrays with QuickPool host allocators. While deallocating those arrays during program teardown, we hit an out-of-memory error: QuickPool enters do_coalesce() and tries to malloc a large chunk of memory.

To Reproduce

I have provided a reproducer and build/run instructions to @mcfadden8.

Expected behavior

We did not expect a call to umpire::Allocator::deallocate() to cause an allocation call that hits an OOM error.

Compilers & Libraries

  • Umpire version: 2023.06.0
  • Compiler & version: The reproducer was built with gcc 10.3.1 on TOSS4. I don't believe this is unique to a particular compiler or platform.

Additional context

We have a workaround that makes CPU-only runs use a default host allocator instead of a QuickPool-based allocator. The workaround is successful, but we would like our CPU-only tests to use QuickPool, since we use QuickPool on GPUs and want to keep the code base for CPU and GPU unified wherever possible. We also don't know whether this bug could occur on allocation/deallocation of GPU data, though we have not seen this kind of error on a GPU run.

@mcfadden8
Collaborator

mcfadden8 commented Aug 20, 2023

Thank you for writing this up, @nselliott. We are tracking this issue here: https://rzlc.llnl.gov/gitlab/umpire/umpire/-/issues/12

I've been able to reproduce the issue and am investigating the cause. It is normal behavior for Umpire to coalesce blocks of pool memory as they become available during deallocation. The size of the allocation that triggers the OOM appears to be bogus (extremely large). I'm instrumenting the library to determine where the internal accounting is going wrong.
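To illustrate why a deallocation can end up calling the system allocator at all, here is a minimal, self-contained sketch of a pool that coalesces on deallocate. It is a toy model and not Umpire's QuickPool code: `ToyPool`, its member functions, and the policy of merging whenever two or more blocks are free are all hypothetical names and assumptions made for this example. The point is only that "release the free blocks, then request one block of their combined size" is an allocation, so it can fail if the requested size is wrong or memory is tight.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Toy model only: this is NOT Umpire's QuickPool. All names and the
// coalescing policy are hypothetical, chosen to keep the example small.
class ToyPool {
 public:
  ~ToyPool() {
    for (auto& b : blocks_) std::free(b.ptr);
  }

  void* allocate(std::size_t bytes) {
    if (bytes == 0) bytes = 1;  // avoid malloc(0) ambiguity
    // Reuse the first free block that is large enough (no splitting).
    for (auto& b : blocks_) {
      if (!b.in_use && b.size >= bytes) {
        b.in_use = true;
        return b.ptr;
      }
    }
    Block b{system_allocate(bytes), bytes, true};
    blocks_.push_back(b);
    return b.ptr;
  }

  void deallocate(void* ptr) {
    bool found = false;
    for (auto& b : blocks_) {
      if (b.ptr == ptr && b.in_use) {
        b.in_use = false;
        found = true;
        break;
      }
    }
    assert(found && "pointer was not allocated by this pool");
    (void)found;
    coalesce();  // the step that may call the system allocator
  }

  // Number of successful requests made to the system allocator so far.
  std::size_t system_allocations() const { return system_allocations_; }

 private:
  struct Block {
    void* ptr;
    std::size_t size;
    bool in_use;
  };

  void* system_allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p == nullptr) throw std::bad_alloc{};  // the "OOM" path
    ++system_allocations_;
    return p;
  }

  void coalesce() {
    std::size_t free_bytes = 0;
    std::size_t free_blocks = 0;
    for (const auto& b : blocks_) {
      if (!b.in_use) {
        free_bytes += b.size;
        ++free_blocks;
      }
    }
    if (free_blocks < 2) return;  // nothing to merge

    // Release every free block back to the system...
    std::vector<Block> kept;
    for (const auto& b : blocks_) {
      if (b.in_use) {
        kept.push_back(b);
      } else {
        std::free(b.ptr);
      }
    }
    blocks_.swap(kept);
    // ...then request a single block of the combined size.
    blocks_.push_back(Block{system_allocate(free_bytes), free_bytes, false});
  }

  std::vector<Block> blocks_;
  std::size_t system_allocations_ = 0;
};

// Returns how many system allocations happen while the pool is only
// being asked to deallocate, mimicking a teardown sequence.
std::size_t system_allocations_during_teardown() {
  ToyPool pool;
  void* a = pool.allocate(64);
  void* b = pool.allocate(128);
  const std::size_t before = pool.system_allocations();
  pool.deallocate(a);  // one free block: nothing to merge
  pool.deallocate(b);  // two free blocks: merged via a new allocation
  return pool.system_allocations() - before;
}
```

In this sketch, `system_allocations_during_teardown()` reports one system allocation even though the caller only deallocated. If the combined size computed inside `coalesce()` were wrong, that single request is where an out-of-memory failure would surface.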

I am glad to hear that you are able to temporarily work around this issue while we work on a fix.

@mcfadden8 mcfadden8 self-assigned this Aug 20, 2023
@mcfadden8
Collaborator

#845

@nselliott
Author

@mcfadden8 Did that pull request sufficiently fix this?

@mcfadden8
Collaborator

@nselliott - Yes. There is more information provided here: https://rzlc.llnl.gov/gitlab/umpire/umpire/-/issues/12
