[BGL] infinite loop in astar_search (there are no negative edge costs)

Hi all I am using astar_search from the BGL and have sometimes hangs due to an infinite loop (no, I do not use negative costs or negative heuristic values). Now here is the strange problem: when compiling the attached (see below) test program with compiler optimization enabled it runs into an infinite loop. Without optimization a solution is found. I need help, because I do not know what I should do next. Is this a compiler bug? Is there something boost can do (workaround)? Is there a known issue (could not find anything) with the relax function? -- Detlef Here are the facts: I condensed one of the malicious (undirected) graphs to <SNIP> V ar[181]; for( unsigned int i = 0; i < 181; ++i ) { ar[i] = add_vertex( VProp(i, 1.0), g ); } add_edge( ar[1], ar[18 ], EProp( 2.20 ), g ); add_edge( ar[0], ar[30 ], EProp( 1.58 ), g ); add_edge( ar[0], ar[66 ], EProp( 0.93 ), g ); add_edge( ar[66], ar[100 ], EProp( 2.72 ), g ); add_edge( ar[66], ar[101 ], EProp( 2.87 ), g ); add_edge( ar[30], ar[142 ], EProp( 4.88 ), g ); add_edge( ar[18], ar[142 ], EProp( 3.29 ), g ); </SNAP>. There are no other edges. A complete listing of the program is attached. My compiler version: # g++ -v Using built-in specs. Target: i586-suse-linux Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib --libexecdir=/usr/lib --enable-languages=c,c++,objc,f95,java,ada --disable-checking --with-gxx-include-dir=/usr/include/c++/4.0.2 --enable-java-awt=gtk --disable-libjava-multilib --with-slibdir=/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --without-system-libunwind --host=i586-suse-linux Thread model: posix gcc version 4.0.2 20050901 (prerelease) (SUSE Linux) # g++ -O2 searchtest4.cpp && a.out 2>/dev/null discover: 0 examine: 0 edge_relaxed: 0 -- 30 discover: 30 edge_relaxed: 0 -- 66 discover: 66 examine: 66 edge_not_relaxed: 66 -- 0 edge_relaxed: 66 -- 100 discover: 100 edge_relaxed: 66 -- 101 discover: 101 examine: 30 edge_not_relaxed: 30 -- 0 edge_relaxed: 30 -- 142 discover: 142 CYCLE: examine: 100 CYCLE: edge_relaxed: 100 -- 66 CYCLE: black_target: 100 -- 66 CYCLE: examine: 66 CYCLE: edge_not_relaxed: 66 -- 0 CYCLE: edge_relaxed: 66 -- 100 CYCLE: black_target: 66 -- 100 CYCLE: edge_relaxed: 66 -- 101
From now on the lines marked CYCLE are repeated. --> infinite loop
# g++ -O0 searchtest4.cpp && a.out 2>/dev/null
discover: 0
examine: 0
edge_relaxed: 0 -- 30
discover: 30
edge_relaxed: 0 -- 66
discover: 66
examine: 66
edge_not_relaxed: 66 -- 0
edge_relaxed: 66 -- 100
discover: 100
edge_relaxed: 66 -- 101
discover: 101
examine: 30
edge_not_relaxed: 30 -- 0
edge_relaxed: 30 -- 142
discover: 142
examine: 100
edge_not_relaxed: 100 -- 66 // from here on the two outputs differ
examine: 101
edge_not_relaxed: 101 -- 66
examine: 142
edge_not_relaxed: 142 -- 30
edge_relaxed: 142 -- 18
discover: 18
examine: 18
edge_relaxed: 18 -- 1
discover: 1
edge_not_relaxed: 18 -- 142
examine: 1
found
1->18->142->30->0
--> success
Here is the listing "searchtest4.cpp": ----------------------------------
#include

On Feb 8, 2006, at 5:13 AM, Detlef Mages wrote:
I am using astar_search from the BGL and have sometimes hangs due to an infinite loop (no, I do not use negative costs or negative heuristic values).
Now here is the strange problem: when compiling the attached (see below) test program with compiler optimization enabled it runs into an infinite loop. Without optimization a solution is found.
I need help, because I do not know what I should do next. Is this a compiler bug? Is there something boost can do (workaround)? Is there a known issue (could not find anything) with the relax function?
We ran into a similar problem with Dijkstra's algorithm in the Parallel BGL, with the same strange symptoms. Once we saw that the problem occur on Intel x86 hardware but not our PowerPC-based Macs, we figured out what was going wrong. The floating point unit in x86 machines has 80 bits of precision, whereas values of type "double" have 64 bits of precision. When relaxing occurs, it's possible that the A* search finds a new path that is just barely better than the old path, so the edge is relaxed. However, the "just barely" better requires more than 64 bits of precision to describe, so when the new distance is written back into memory the new distance is effectively the same as the old distance... so we end up looping, finding a better distance each time but having that better distance truncated back to the old value. This only occurs when optimizations are enabled because the optimizations are needed to turn all of the code in "relax" into efficient FPU operations, and it only happens on x86 because nobody else (that I know of) has more precision in the FPU than in memory. One fix is to force the compiler to write the relaxed distance value back into memory (to round it), then compare against the old distance (already rounded); if the values aren't equal we can continue. Thanks for the test case. I was able to duplicate the problem on x86 Linux with GCC and I've committed a fix to Boost CVS. The updated astar_search.hpp is also attached. May we have your permission to put this test case under the Boost Software License (all versions) and include it in the BGL test suite? Doug

On Wednesday 08 February 2006 15:56, Doug Gregor wrote:
On Feb 8, 2006, at 5:13 AM, Detlef Mages wrote:
I am using astar_search from the BGL and have sometimes hangs due to an infinite loop (no, I do not use negative costs or negative heuristic values).
Now here is the strange problem: when compiling the attached (see below) test program with compiler optimization enabled it runs into an infinite loop. Without optimization a solution is found.
I need help, because I do not know what I should do next. Is this a compiler bug? Is there something boost can do (workaround)? Is there a known issue (could not find anything) with the relax function?
We ran into a similar problem with Dijkstra's algorithm in the Parallel BGL, with the same strange symptoms. Once we saw that the problem occur on Intel x86 hardware but not our PowerPC-based Macs, we figured out what was going wrong. The floating point unit in x86 machines has 80 bits of precision, whereas values of type "double" have 64 bits of precision. When relaxing occurs, it's possible that the A* search finds a new path that is just barely better than the old path, so the edge is relaxed. However, the "just barely" better requires more than 64 bits of precision to describe, so when the new distance is written back into memory the new distance is effectively the same as the old distance... so we end up looping, finding a better distance each time but having that better distance truncated back to the old value. This only occurs when optimizations are enabled because the optimizations are needed to turn all of the code in "relax" into efficient FPU operations, and it only happens on x86 because nobody else (that I know of) has more precision in the FPU than in memory. One fix is to force the compiler to write the relaxed distance value back into memory (to round it), then compare against the old distance (already rounded); if the values aren't equal we can continue.
Thanks for the test case. I was able to duplicate the problem on x86 Linux with GCC and I've committed a fix to Boost CVS. The updated astar_search.hpp is also attached. May we have your permission to put this test case under the Boost Software License (all versions) and include it in the BGL test suite?
Doug
Thanks for the immediate fix. I tested the fixed astar_search.hpp and everything seems to work now. I already had faint intuition about decimal point precission, but discarded it due to my strong believe in the correctness without side-effects of how compilers address hardware. Attached you will find a cleaned up test case. I managed to reduce the graph even more without loosing its sensibility to the issue. Please feel free to put it in the BGL test suite. Thanks again for your help. Detlef
participants (2)
-
Detlef Mages
-
Doug Gregor