Hello,
I am trying to sort out an odd crash on exit while using NVCC (via Kokkos-CUDA) and the Boost Unit Test Framework. Here are my data points:
Kokkos-CUDA with the Boost test tool infrastructure - Crash
Kokkos-HIP with the Boost test tool infrastructure - works
Kokkos-Serial with the Boost test tool infrastructure - works
Pure CUDA with the Boost test tool infrastructure - works
Serial with the Boost test tool infrastructure - works
Kokkos-CUDA alone (e.g. normal application) - works
For the first case I have three scenarios (used only to highlight the Kokkos usage; the code does not crash in Kokkos::initialize), two of which crash:
Scenario 1:
if(true) Kokkos::initialize(argc,argv);
Result: crash in CUDA - SEGFAULT
Scenario 2:
if(false) Kokkos::initialize(argc,argv);
Result: crash elsewhere - Subprocess aborted
Scenario 3:
#if 0
Kokkos::initialize()
#endif
Result: no crash
For the first scenario the call stack is the following:
#0 0x00001555521b5d9a in ?? () from ./cuda/12.2.0/lib64/libcudart.so.12
#1 0x00001555521b8c14 in ?? () from ./cuda/12.2.0/lib64/libcudart.so.12
#2 0x000015555219d882 in ?? () from ./cuda/12.2.0/lib64/libcudart.so.12
#3 0x00001555521a072b in ?? () from ./cuda/12.2.0/lib64/libcudart.so.12
#4 0x000015554b3b5797 in __cxa_finalize () from /lib64/libc.so.6
#5 0x0000155551301a87 in __do_global_dtors_aux () from build_kokkos_cuda/library/Operators/libOperators-g.so.5.5.0
#6 0x00007fffffffad10 in ?? ()
#7 0x000015555532dcee in _dl_fini () at dl-fini.c:141
For the second scenario, the call stack goes much deeper after the call to __cxa_finalize(); otherwise the four frames #4-#7 are the same.
At this point the only culprit I can perhaps point to is nvcc, as I know that previous versions of Boost and nvcc did not get along. That said, I am wondering whether our usage is a corner case, because I know of another project that is successfully using all three: Boost, Kokkos, and CUDA.
I am using Boost 1.85, Kokkos 4.3, and CUDA 12.2. Any insights or thoughts? Our unit tests work with Kokkos-HIP, so we are able to validate results.
Allen Sanderson
SCI Institute
University of Utah