
[BUG] Concurrent execution of many instances of the same CUDA executable causes a few of them to fail randomly. #204

Open
Abhishekghosh1998 opened this issue Oct 29, 2023 · 1 comment
Labels
bug Something isn't working

Comments

Abhishekghosh1998 commented Oct 29, 2023

A note: I don't know whether this issue should be categorized as a bug; my setup steps might be wrong as well. If it is the latter, please guide me accordingly.

Describe the bug
Concurrent execution of many instances of the same CUDA executable causes a few of them to fail randomly.
In the case of:

  1. onnx_dump, it fails with terminate called without an active exception
    • Sometimes it shows ERROR sending to socket: Bad file descriptor before printing terminate called without an active exception
  2. cudart, it fails with a simple Segmentation fault (core dumped)

Suppose I write a CUDA program, say toy.cu, as follows:

#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#define BLOCK_SIZE 128

__global__
void do_something(float* d_array)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_array[idx]*=100;
}
int main()
{
    long N= 1<<7;
    float *arr = (float*) malloc(N*sizeof(float));
    long i;
    for (i=1;i<=N;i++)
        arr[i-1]=i;
    
    float *d_array;
    cudaError_t ret;
    
    ret = cudaMalloc(&d_array, N*sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    
    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    ret = cudaMemcpy(d_array, arr, N*sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);

    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    int num_blocks= (N+BLOCK_SIZE-1)/BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for(i=0;i<N;)
    {
        for(j=0;j<8;j++)
                printf("%.0f\t", arr[i++]);
        printf("\n");
    }
    cudaFree(d_array);
    return 0;
}

And compile it with:

nvcc -o toy toy.cu --cudart shared

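Since the runtime is linked dynamically (--cudart shared), the interposition can be confirmed from inside the container before running the experiment; a quick check with ldd:

$ ldd ./toy | grep libcudart

With the symlinks shown later, libcudart.so.10.1 resolves to libguestlib.so, so the guestlib is what the executable actually loads.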
Then, in the Docker container set up to use the appropriate libguestlib.so, I invoke the following script.sh:

#!/bin/bash

if [ $# -ne 2 ]; then
    echo "Usage: $0 <executable> <num_instances>"
    exit 1
fi

executable=$1
num_instances=$2

for ((i=1; i<=num_instances; i++)); do
    "$executable" &
done
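
To make the failure count explicit, a variant of script.sh that waits for each instance and keeps per-instance logs can help. This is only a diagnostic sketch; the pid bookkeeping and the out.$i log files are my additions, not part of the original script:

#!/bin/bash
# Launch N instances of the executable, keep each instance's output,
# and report which instances exited with a non-zero status.
executable=$1
num_instances=$2

pids=()
for ((i=1; i<=num_instances; i++)); do
    "$executable" > "out.$i" 2>&1 &   # per-instance log for later inspection
    pids+=($!)
done

failures=0
for i in "${!pids[@]}"; do
    if ! wait "${pids[$i]}"; then
        echo "instance $((i+1)) failed (see out.$((i+1)))"
        failures=$((failures+1))
    fi
done
echo "$failures of $num_instances instances failed"

A segmentation fault shows up here as exit status 139, so both failure modes described above are caught by the wait check.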

And run the following command:

$ ./script.sh ./toy 20

Many (though not all) of the instances fail, whether I use cudart or onnx_dump.

To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using the NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using the cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files:

libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb      
libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb

Next, I forked the AvA repository and modified ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address.

And then did the following:

$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install

Then I used a CUDA-10.1 Docker image (the one provided in this repository under tools/docker, with a bit of modification to work around the CUDA apt key issue during apt update).
I bind-mounted my build directory into the Docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu and /usr/local/cuda-10.1/targets/x86_64-linux/lib/ in the container, and modified the library symlinks accordingly (a sketch of the commands follows the listings):

/usr/lib/x86_64-linux-gnu$ ls -lh libcu*
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root   14 Sep 10 04:41 libcublasLt.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcublasLt.so.10.1.0.105
-rw-r--r-- 1 root root  23M Feb 25  2019 libcublasLt_static.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root   14 Sep 10 04:41 libcublas.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcublas.so.10.1.0.105
-rw-r--r-- 1 root root  87M Feb 25  2019 libcublas_static.a
lrwxrwxrwx 1 root root   29 Sep  9 16:09 libcudadebugger.so.1 -> libcudadebugger.so.535.104.05
-rwxr-xr-x 1 root root 9.8M Sep  9 15:43 libcudadebugger.so.535.104.05
lrwxrwxrwx 1 root root   12 Sep  9 16:09 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root   14 Sep 18 15:49 libcuda.so.1 -> libguestlib.so
-rw-r--r-- 1 root root  16M Feb 25  2019 libcuda.so.418.39
-rwxr-xr-x 1 root root  28M Sep  9 15:43 libcuda.so.535.104.05
lrwxrwxrwx 1 root root   29 Mar  7  2019 libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root   14 Sep 10 04:42 libcudnn.so.7 -> libguestlib.so
-rw-r--r-- 1 root root 7.0M Sep  9 16:14 libcudnn.so.7.5.0
lrwxrwxrwx 1 root root   32 Mar  7  2019 libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 351M Feb 15  2019 libcudnn_static_v7.a
lrwxrwxrwx 1 root root   23 Apr  6  2018 libcupsfilters.so.1 -> libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 211K Apr  6  2018 libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root  34K Dec 12  2018 libcupsimage.so.2
-rw-r--r-- 1 root root 558K Dec 12  2018 libcups.so.2
-rw-r--r-- 1 root root  12M Sep 10 04:40 libcurand.so.10
lrwxrwxrwx 1 root root   19 Jan 29  2019 libcurl-gnutls.so.3 -> libcurl-gnutls.so.4
lrwxrwxrwx 1 root root   23 Jan 29  2019 libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.5.0
-rw-r--r-- 1 root root 499K Jan 29  2019 libcurl-gnutls.so.4.5.0
lrwxrwxrwx 1 root root   16 Jan 29  2019 libcurl.so.4 -> libcurl.so.4.5.0
-rw-r--r-- 1 root root 507K Jan 29  2019 libcurl.so.4.5.0
lrwxrwxrwx 1 root root   12 May 23  2018 libcurses.a -> libncurses.a
lrwxrwxrwx 1 root root   13 May 23  2018 libcurses.so -> libncurses.so
/usr/local/cuda-10.1/targets/x86_64-linux/lib$ ls -lh libcu*
-rw-r--r-- 1 root root 701K Feb 25  2019 libcudadevrt.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcudart.so -> libcudart.so.10.1
lrwxrwxrwx 1 root root   14 Sep 18 15:45 libcudart.so.10.1 -> libguestlib.so
-rw-r--r-- 1 root root 493K Feb 25  2019 libcudart.so.10.1.105
-rw-r--r-- 1 root root 868K Feb 25  2019 libcudart_static.a
lrwxrwxrwx 1 root root   14 Feb 25  2019 libcufft.so -> libcufft.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:39 libcufft.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 112M Feb 25  2019 libcufft.so.10.1.105
-rw-r--r-- 1 root root 132M Feb 25  2019 libcufft_static.a
-rw-r--r-- 1 root root 119M Feb 25  2019 libcufft_static_nocallback.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcufftw.so -> libcufftw.so.10
lrwxrwxrwx 1 root root   21 Feb 25  2019 libcufftw.so.10 -> libcufftw.so.10.1.105
-rw-r--r-- 1 root root 489K Feb 25  2019 libcufftw.so.10.1.105
-rw-r--r-- 1 root root  33K Feb 25  2019 libcufftw_static.a
lrwxrwxrwx 1 root root   18 Feb 25  2019 libcuinj64.so -> libcuinj64.so.10.1
lrwxrwxrwx 1 root root   22 Feb 25  2019 libcuinj64.so.10.1 -> libcuinj64.so.10.1.105
-rw-r--r-- 1 root root 7.5M Feb 25  2019 libcuinj64.so.10.1.105
-rw-r--r-- 1 root root  32K Feb 25  2019 libculibos.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcurand.so -> libcurand.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:39 libcurand.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  58M Feb 25  2019 libcurand.so.10.1.105
-rw-r--r-- 1 root root  58M Feb 25  2019 libcurand_static.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcusolver.so -> libcusolver.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:40 libcusolver.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 175M Feb 25  2019 libcusolver.so.10.1.105
-rw-r--r-- 1 root root  88M Feb 25  2019 libcusolver_static.a
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcusparse.so -> libcusparse.so.10
lrwxrwxrwx 1 root root   14 Oct 29 21:40 libcusparse.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  87M Feb 25  2019 libcusparse.so.10.1.105
-rw-r--r-- 1 root root  97M Feb 25  2019 libcusparse_static.a
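
For reference, the redirection shown in these listings amounts to replacing each library's version symlink with a link to libguestlib.so. A minimal sketch of the commands involved, assuming libguestlib.so has already been copied into both directories:

$ cd /usr/lib/x86_64-linux-gnu
$ ln -sf libguestlib.so libcuda.so.1
$ ln -sf libguestlib.so libcublas.so.10
$ ln -sf libguestlib.so libcublasLt.so.10
$ ln -sf libguestlib.so libcudnn.so.7
$ cd /usr/local/cuda-10.1/targets/x86_64-linux/lib
$ ln -sf libguestlib.so libcudart.so.10.1
$ ln -sf libguestlib.so libcufft.so.10
$ ln -sf libguestlib.so libcurand.so.10
$ ln -sf libguestlib.so libcusolver.so.10
$ ln -sf libguestlib.so libcusparse.so.10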

I added the guest config in the Docker container as:

$ cat /etc/ava/guest.conf 
channel = "TCP";
manager_address = "10.192.34.20:3333";
gpu_memory = [1024L];
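
Before launching anything, it can be worth verifying from inside the container that the manager_address configured above is reachable; a minimal check, assuming netcat is available in the container:

$ nc -vz 10.192.34.20 3333

If this check fails, basic connectivity rather than concurrency would be the first thing to rule out.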

Then I tried to launch the manager on the host as follows:

build$ ./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333

On the guest, I then run the toy CUDA program, but it fails as described earlier.

I described the setup for onnx_dump, but the setup for cudart is similar, and it produces the error described earlier as well.

Expected behavior
I expect all the instances of the toy executable launched concurrently to run successfully.

Environment:

  • OS: Ubuntu 18.04.6 LTS x86_64
  • Python version: 3.6.9
  • GCC version: 7.5.0
  • Kernel: 5.4.0-150-generic
  • Host: SYS-7049GP-TRT 0123456789
  • CPU: Intel Xeon Gold 6140 (72) @ 3.700GHz
  • GPU: NVIDIA Tesla P40
  • NVIDIA Driver Version: 418.226.00
  • CUDA Version: 10.1
Abhishekghosh1998 added the bug label Oct 29, 2023
Abhishekghosh1998 (Author) commented:

@yuhc any insight?
