
[BUG] Concurrent execution of many instances of the same CUDA executable causes a few of them to fail randomly. #204

Open
Abhishekghosh1998 opened this issue Oct 29, 2023 · 1 comment
Labels
bug Something isn't working

Comments

Abhishekghosh1998 commented Oct 29, 2023

A note: I don't know whether this issue should be categorized as a bug; my setup steps might be wrong as well. If it is the latter, please guide me accordingly.

Describe the bug
Concurrent execution of many instances of the same CUDA executable causes a few of them to fail randomly.
In the case of:

  1. onnx_dump, it fails with terminate called without an active exception
    • Sometimes it shows ERROR sending to socket: Bad file descriptor before printing terminate called without an active exception
  2. cudart, it fails with a simple Segmentation fault (core dumped)

Suppose I write a CUDA program, say toy.cu, as follows:

#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#define BLOCK_SIZE 128

__global__
void do_something(float* d_array)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_array[idx]*=100;
}
int main()
{
    long N= 1<<7;
    float *arr = (float*) malloc(N*sizeof(float));
    long i;
    for (i=1;i<=N;i++)
        arr[i-1]=i;
    
    float *d_array;
    cudaError_t ret;
    
    ret = cudaMalloc(&d_array, N*sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    
    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    ret = cudaMemcpy(d_array, arr, N*sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);

    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    int num_blocks= (N+BLOCK_SIZE-1)/BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for(i=0;i<N;)
    {
        for(j=0;j<8;j++)
                printf("%.0f\t", arr[i++]);
        printf("\n");
    }
    cudaFree(d_array);
    return 0;
}

And compile it with:

nvcc -o toy toy.cu --cudart shared

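Since the runtime is linked dynamically (--cudart shared), the interposition can be confirmed from inside the container before running the experiment; a quick check with ldd:

$ ldd ./toy | grep libcudart

With the symlinks shown later, libcudart.so.10.1 resolves to libguestlib.so, so the guestlib is what the executable actually loads.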
Then, in the Docker container set up to use the appropriate libguestlib.so, I invoke the following script.sh:

#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 <executable> <num_instances>"
    exit 1
fi

executable=$1
num_instances=$2

for ((i=1; i<=num_instances; i++)); do
    "$executable" &
done
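
To make the failure count explicit, a variant of script.sh that waits for each instance and keeps per-instance logs can help. This is only a diagnostic sketch; the pid bookkeeping and the out.$i log files are my additions, not part of the original script:

#!/bin/bash
# Launch N instances of the executable, keep each instance's output,
# and report which instances exited with a non-zero status.
executable=$1
num_instances=$2

pids=()
for ((i=1; i<=num_instances; i++)); do
    "$executable" > "out.$i" 2>&1 &   # per-instance log for later inspection
    pids+=($!)
done

failures=0
for i in "${!pids[@]}"; do
    if ! wait "${pids[$i]}"; then
        echo "instance $((i+1)) failed (see out.$((i+1)))"
        failures=$((failures+1))
    fi
done
echo "$failures of $num_instances instances failed"

A segmentation fault shows up here as exit status 139, so both failure modes described above are caught by the wait check.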

And run the following command:

$ ./script.sh ./toy 20

Many (though not all) of the instances fail, whether I use cudart or onnx_dump.

To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using the NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using the cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files:

libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb      
libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb

Next, I forked the AvA repository and modified ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address.

And then did the following:

$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install

Then I used a CUDA-10.1 Docker image (the one provided in this repository under tools/docker, with a bit of modification to work around the CUDA apt key issue during apt update).
I bind-mounted my build directory into the Docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu and /usr/local/cuda-10.1/targets/x86_64-linux/lib/ in the container, and modified the library symlinks accordingly (a sketch of the commands follows the listings):

/usr/lib/x86_64-linux-gnu$ ls -lh libcu*
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root   14 Sep 10 04:41 libcublasLt.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcublasLt.so.10.1.0.105
-rw-r--r-- 1 root root  23M Feb 25  2019 libcublasLt_static.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root   14 Sep 10 04:41 libcublas.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcublas.so.10.1.0.105
-rw-r--r-- 1 root root  87M Feb 25  2019 libcublas_static.a
lrwxrwxrwx 1 root root   29 Sep  9 16:09 libcudadebugger.so.1 -> libcudadebugger.so.535.104.05
-rwxr-xr-x 1 root root 9.8M Sep  9 15:43 libcudadebugger.so.535.104.05
lrwxrwxrwx 1 root root   12 Sep  9 16:09 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root   14 Sep 18 15:49 libcuda.so.1 -> libguestlib.so
-rw-r--r-- 1 root root  16M Feb 25  2019 libcuda.so.418.39
-rwxr-xr-x 1 root root  28M Sep  9 15:43 libcuda.so.535.104.05
lrwxrwxrwx 1 root root   29 Mar  7  2019 libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root   14 Sep 10 04:42 libcudnn.so.7 -> libguestlib.so
-rw-r--r-- 1 root root 7.0M Sep  9 16:14 libcudnn.so.7.5.0
lrwxrwxrwx 1 root root   32 Mar  7  2019 libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 351M Feb 15  2019 libcudnn_static_v7.a
lrwxrwxrwx 1 root root   23 Apr  6  2018 libcupsfilters.so.1 -> libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 211K Apr  6  2018 libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root  34K Dec 12  2018 libcupsimage.so.2
-rw-r--r-- 1 root root 558K Dec 12  2018 libcups.so.2
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcurand.so.10
lrwxrwxrwx 1 root root   19 Jan 29  2019 libcurl-gnutls.so.3 -> libcurl-gnutls.so.4
lrwxrwxrwx 1 root root   23 Jan 29  2019 libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.5.0
-rw-r--r-- 1 root root 499K Jan 29  2019 libcurl-gnutls.so.4.5.0
lrwxrwxrwx 1 root root   16 Jan 29  2019 libcurl.so.4 -> libcurl.so.4.5.0
-rw-r--r-- 1 root root 507K Jan 29  2019 libcurl.so.4.5.0
lrwxrwxrwx 1 root root   12 May 23  2018 libcurses.a -> libncurses.a
lrwxrwxrwx 1 root root   13 May 23  2018 libcurses.so -> libncurses.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib$ ls -lh libcu*
-rw-r--r-- 1 root root 701K Feb 25  2019 libcudadevrt.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcudart.so -> libcudart.so.10.1
lrwxrwxrwx 1 root root   14 Sep 18 15:45 libcudart.so.10.1 -> libguestlib.so
-rw-r--r-- 1 root root 493K Feb 25  2019 libcudart.so.10.1.105
-rw-r--r-- 1 root root 868K Feb 25  2019 libcudart_static.a
lrwxrwxrwx 1 root root   14 Feb 25  2019 libcufft.so -> libcufft.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:39 libcufft.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 112M Feb 25  2019 libcufft.so.10.1.105
-rw-r--r-- 1 root root 132M Feb 25  2019 libcufft_static.a
-rw-r--r-- 1 root root 119M Feb 25  2019 libcufft_static_nocallback.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcufftw.so -> libcufftw.so.10
lrwxrwxrwx 1 root root   21 Feb 25  2019 libcufftw.so.10 -> libcufftw.so.10.1.105
-rw-r--r-- 1 root root 489K Feb 25  2019 libcufftw.so.10.1.105
-rw-r--r-- 1 root root  33K Feb 25  2019 libcufftw_static.a
lrwxrwxrwx 1 root root   18 Feb 25  2019 libcuinj64.so -> libcuinj64.so.10.1
lrwxrwxrwx 1 root root   22 Feb 25  2019 libcuinj64.so.10.1 -> libcuinj64.so.10.1.105
-rw-r--r-- 1 root root 7.5M Feb 25  2019 libcuinj64.so.10.1.105
-rw-r--r-- 1 root root  32K Feb 25  2019 libculibos.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcurand.so -> libcurand.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:39 libcurand.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  58M Feb 25  2019 libcurand.so.10.1.105
-rw-r--r-- 1 root root  58M Feb 25  2019 libcurand_static.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcusolver.so -> libcusolver.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:40 libcusolver.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 175M Feb 25  2019 libcusolver.so.10.1.105
-rw-r--r-- 1 root root  88M Feb 25  2019 libcusolver_static.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcusparse.so -> libcusparse.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:40 libcusparse.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  87M Feb 25  2019 libcusparse.so.10.1.105
-rw-r--r-- 1 root root  97M Feb 25  2019 libcusparse_static.a
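
For reference, the redirection shown in these listings amounts to replacing each library's version symlink with a link to libguestlib.so. A minimal sketch of the commands involved, assuming libguestlib.so has already been copied into both directories:

$ cd /usr/lib/x86_64-linux-gnu
$ ln -sf libguestlib.so libcuda.so.1
$ ln -sf libguestlib.so libcublas.so.10
$ ln -sf libguestlib.so libcublasLt.so.10
$ ln -sf libguestlib.so libcudnn.so.7
$ cd /usr/local/cuda-10.1/targets/x86_64-linux/lib
$ ln -sf libguestlib.so libcudart.so.10.1
$ ln -sf libguestlib.so libcufft.so.10
$ ln -sf libguestlib.so libcurand.so.10
$ ln -sf libguestlib.so libcusolver.so.10
$ ln -sf libguestlib.so libcusparse.so.10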

I added the guest config in the Docker container as:

$ cat /etc/ava/guest.conf 
channel = "TCP";
manager_address = "10.192.34.20:3333";
gpu_memory = [1024L];
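
Before launching anything, it can be worth verifying from inside the container that the manager_address configured above is reachable; a minimal check, assuming netcat is available in the container:

$ nc -vz 10.192.34.20 3333

If this check fails, basic connectivity rather than concurrency would be the first thing to rule out.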

Then I tried to launch the manager on the host as follows:

build$ ./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333

On the guest, I then run the toy CUDA program, but it fails as described earlier.

I described the setup for onnx_dump, but the setup for cudart is similar, and it produces the error described earlier as well.

Expected behavior
I expect all the instances of the toy executable launched concurrently to run successfully.

Environment:

  • OS: Ubuntu 18.04.6 LTS x86_64
  • Python version: 3.6.9
  • GCC version: 7.5.0
  • Kernel: 5.4.0-150-generic
  • Host: SYS-7049GP-TRT 0123456789
  • CPU: Intel Xeon Gold 6140 (72) @ 3.700GHz
  • GPU: NVIDIA Tesla P40
  • NVIDIA Driver Version: 418.226.00
  • CUDA Version: 10.1
Abhishekghosh1998 added the bug label Oct 29, 2023
Abhishekghosh1998 (Author) commented:

@yuhc any insight?
