
Pathological reallocation in memory pools #906

Open
msimberg opened this issue Aug 29, 2024 · 1 comment · May be fixed by #908

Comments

@msimberg (Contributor)

Describe the bug

With certain allocation patterns and sizes, QuickPool never coalesces into a backing allocation large enough to hold the working set, leading to unnecessary reallocations on every iteration.

To Reproduce

Compile the following file with anything from 2024.02.01 to current develop (older versions are likely also affected, but I have not tested them):

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

#include <iostream>
#include <cstddef>

int main() {
    auto alloc = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
    auto pooled_alloc =
        umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
            "DEVICE_pool", alloc,
            16,  // initial pool size: only matters in that allocations must be larger
                 // than it, otherwise no additional backing allocations are required
            16,  // next size: not important to reproduce
            16); // alignment: important; if all allocations are multiples of the
                 // alignment, everything works as expected

    const std::size_t niter = 5;
    for (std::size_t iter = 0; iter < niter; ++iter) {
        std::cerr << "iteration " << iter << '\n';
        auto* p1 = pooled_alloc.allocate(1024 + 8);
        auto* p2 = pooled_alloc.allocate(1024 + 8); // need at least two allocations
                                                    // that are not multiples of the alignment
        pooled_alloc.deallocate(p2);
        pooled_alloc.deallocate(p1);
    }
}

Just to print the allocations, I ran this through gdb with the following script:

start

break cudaMalloc
commands
continue
end

break cudaFree
commands
continue
end

continue

and then

gdb -batch -x gdbscript umpire_test

produces:

...
iteration 0
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 1
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iterations 2-4: same pattern as iteration 1 (cudaMalloc, cudaFree, cudaFree, cudaMalloc)
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
...

Sizes are not visible above, but in the first iteration the pool allocates twice: once for the initial buffer, and once more to accommodate the second user allocation. It then coalesces, freeing the two backing allocations above and allocating one larger block. However, this block does not take the alignment overallocation into account and is too small for the next iteration. The pool therefore allocates again for the second user-triggered allocation, frees both backing allocations, attempts to coalesce, and the cycle repeats.

If one of the user-requested allocation sizes is a multiple of the 16-byte alignment, the reproducer correctly allocates only in the first iteration and then never again.

I suppose using Umpire's logging would show the same results; I'm just not familiar enough with it.

I used DEVICE only to make it easier to hook gdb into cudaMalloc and to avoid printing unrelated allocations. The same behaviour should occur with any backing memory.

I expect this affects DynamicPoolList as well.

Expected behavior

I expect that repeating the same allocation and deallocation pattern multiple times, letting the pool empty between iterations so it can coalesce (with the default coalescing strategy), should lead to backing allocations only during the first iteration and when coalescing after it, regardless of the initial size, next size, alignment, and user-requested allocation sizes. After that, all allocations should fit in the pool without any further backing allocations.

Compilers & Libraries (please complete the following information):

  • Compiler & version: GCC 12.3.0
  • CUDA version (if applicable): CUDA 12.2.1 (not relevant to reproduce though)

Additional context

If I understand the pools correctly, the reason for this behaviour is that the actual size and the high water mark do not include the overallocation: they sum only the sizes requested by the user, not the sizes actually allocated after alignment, so in cases like the above the pool never grows large enough to fit all required allocations.

@msimberg (Contributor, Author)

msimberg commented Sep 4, 2024

Just wanted to add that I tried to work around this in a real application by rounding all allocation sizes up to the alignment before passing them to Umpire, and I still get reallocations every iteration, so the alignment rounding on its own may not fully explain the behaviour. I may of course have made a mistake, but I thought I'd mention it with a possible fix in Umpire in mind.

Currently I'm working around this with a custom coalescing heuristic that allocates x percent more than the previous high water mark. In our case this covers the mismatch between the actual and apparent high water marks, but it is obviously not a general fix.
