Experts cannot help you unless you provide a reproducible example. The fewer steps they have to follow, the more time they can devote to helping you. If your reproducer requires local code changes, provide them as a patch, e.g.
git diff > ~/public/patch
which can then be applied via git apply ~/public/patch.
Always run with and without the fix to verify that the change has worked. A convenient way to do this is simply to rename the build directory (e.g. mv build build.bak) so that it is no longer picked up by your environment (i.e. $LD_LIBRARY_PATH).
All crashes are initially reported as a core dump, but this is a general term. The report below shows “signal 11(Segmentation fault)”; if you are making a bug report you should describe these as a “crash” or a “segmentation fault”.
Core dump from CoreDumpSvc on aibuild40-006.cern.ch at Thu Dec 12 04:13:30 2019
Caught signal 11(Segmentation fault). Details:
errno = 0, code = 1 (address not mapped to object)
pid = 8, uid = 0
value = (784913664, 0x2ec8d500)
vmem = 3880.77 MB
rss = 2621.29 MB
total-ram = 117869 MB
free-ram = 2444.74 MB
buffer-ram= 46.8945 MB
total-swap= 40960 MB
free-swap = 39928.3 MB
addr = 0x8
Event counter: 2
Slot 0 : Current algorithm = TrigL2MufastHypoAlg
: Last Incident = BeginIncFiringAlg:BeginEvent
: Event ID = [327265,186540986,t:1497896168.422137145,l:100,b:212]
-------------------------------------------------------------------------------------
| AtlasBaseDir : /build/atnight/localbuilds/nightlies/main/build/install |
| AtlasVersion : 2019-12-11T2131 |
| CMTCONFIG : x86_64-centos7-gcc8-opt |
-------------------------------------------------------------------------------------
Note: to see line numbers in below stacktrace you might consider running following :
atlasAddress2Line --file <logfile>
-------------------------------------------------------------------------------------
*** Break *** segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
Thread 4 (Thread 0x7f2ef119f700 (LWP 39026)):
#0 0x00007f2f3e08d469 in waitpid () from /lib64/libc.so.6
#1 0x00007f2f3e00af12 in do_system () from /lib64/libc.so.6
#2 0x00007f2f3e00b2c1 in system () from /lib64/libc.so.6
#3 0x00007f2f3640dd03 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#4 0x00007f2f364105d4 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#5 0x00007f2f13c7eaf8 in CoreDumpSvcHandler::action(int, siginfo_t*, void*) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/CoreDumpSvc.cxx:104
#6 <signal handler called>
#7 (anonymous namespace)::combMaker(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::function<void (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_iterator.h:783
#8 0x00007f2f09409c1d in HLT::elementsInUniqueCombinations(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#9 0x00007f2f032f0384 in TrigMufastHypoTool::multiplicitySelection(std::vector<TrigMufastHypoTool::MuonClusterInfo, std::allocator<TrigMufastHypoTool::MuonClusterInfo> >&) const () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#10 0x00007f2f032f0a9b in TrigMufastHypoTool::decide (this=0x188aa000, toolInput=...) at /build/atnight/localbuilds/nightlies/main/athena/Trigger/TrigHypothesis/TrigMuonHypoMT/src/TrigMufastHypoTool.cxx:269
#11 0x00007f2f032e45eb in TrigMufastHypoAlg::execute(EventContext const&) const () at /build/atnight/localbuilds/nightlies/main/build/install/GAUDI/22.0.9/InstallArea/x86_64-centos7-gcc8-opt/include/GaudiKernel/GaudiHandle.h:272
#12 0x00007f2f28580bb7 in Gaudi::Algorithm::sysExecute(EventContext const&) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/src/Lib/Algorithm.cpp:464
#13 0x00007f2f167e6fa9 in AthReentrantAlgorithm::sysExecute (this=<optimized out>, ctx=...) at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaBaseComps/src/AthReentrantAlgorithm.cxx:70
#14 0x00007f2f0b23a421 in AlgoExecutionTask::execute() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AlgoExecutionTask.cpp:55
#15 0x00007f2f35eb87f5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f2f15703e00, parent=..., child=<optimized out>) at ../../include/tbb/machine/gcc_ia32_common.h:96
#16 0x00007f2f35eb2428 in tbb::internal::arena::process (this=0x7f2f15773a00, s=...) at ../../src/tbb/arena.cpp:156
#17 0x00007f2f35eb0e93 in tbb::internal::market::process (this=0x7f2f1577b580, j=...) at ../../src/tbb/market.cpp:702
#18 0x00007f2f35ead47c in tbb::internal::rml::private_worker::run (this=0x7f2f154e7100) at ../../src/tbb/private_server.cpp:266
#19 0x00007f2f35ead689 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#20 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f2ef25fb700 (LWP 38787)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7f2ef25fad80) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7f2ef25fad80) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f0b242a44 in tbb::concurrent_bounded_queue<std::function<StatusCode ()>, tbb::cache_aligned_allocator<std::function<StatusCode ()> > >::pop(std::function<StatusCode ()>&) (destination=..., this=0x34ba8ae0) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 AvalancheSchedulerSvc::activate() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AvalancheSchedulerSvc.cpp:391
#7 0x00007f2f3dd0224f in execute_native_thread_routine () at /tmp/dkonst/build/contrib/gcc-8.3.0/src/gcc/8.3.0/libstdc++-v3/src/c++11/thread.cc:80
#8 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f2f1466a700 (LWP 38498)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7f2f14669de0) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7f2f14669de0) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f14906ea2 in tbb::concurrent_bounded_queue<std::function<void ()>, tbb::cache_aligned_allocator<std::function<void ()> > >::pop(std::function<void ()>&) (destination=..., this=0x16435e00) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 InertMessageSvc::m_activate() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiCoreSvc/src/MessageSvc/InertMessageSvc.cpp:38
#7 0x00007f2f3dd0224f in execute_native_thread_routine () at /tmp/dkonst/build/contrib/gcc-8.3.0/src/gcc/8.3.0/libstdc++-v3/src/c++11/thread.cc:80
#8 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2f3fe917c0 (LWP 28456)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7ffcdfc87c70) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7ffcdfc87c70) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f0b243346 in tbb::concurrent_bounded_queue<EventContext*, tbb::cache_aligned_allocator<EventContext*> >::pop (destination=
0x7ffcdfc87d88: 0x0, this=0x34ba8a90) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 AvalancheSchedulerSvc::popFinishedEvent (this=0x34ba8000, eventContext=
0x7ffcdfc87d88: 0x0) at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AvalancheSchedulerSvc.cpp:546
#7 0x00007f2f13c1756a in AthenaHiveEventLoopMgr::drainScheduler(int&) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:1242
#8 0x00007f2f13c1849d in AthenaHiveEventLoopMgr::nextEvent(int) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:841
#9 0x00007f2f13c1725c in AthenaHiveEventLoopMgr::executeRun(int) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:728
#10 0x00007f2f148434d8 in ApplicationMgr::executeRun(int) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/GaudiKernel/StatusCode.h:119
#11 0x00007f2ef1df6047 in ?? ()
#12 0x000000000e197de8 in ?? ()
#13 0x00007f2f3675cee0 in gThreadTsd () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#14 0x0000001434f0d410 in ?? ()
#15 0x00007f2f3811f6dc in ROOT::TVirtualRWMutex::UnLock() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libPyROOT.so
#16 0x00007f2f3203e882 in TClingCallFunc::IFacePtr() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCling.so
#17 0x00007f2f3812e5e9 in FastCall(long, void*, void*, void*) () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libPyROOT.so
===========================================================
Since this is a multi-threaded job, the stack traces for each thread are printed. Generally only one or two will be relevant - usually the threads currently processing the event loop, i.e. with Gaudi::Algorithm::sysExecute(EventContext const&)
in their stack trace. The culprit will likely be towards the top of the traces:
#7 (anonymous namespace)::combMaker(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::function<void (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_iterator.h:783
#8 0x00007f2f09409c1d in HLT::elementsInUniqueCombinations(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#9 0x00007f2f032f0384 in TrigMufastHypoTool::multiplicitySelection(std::vector<TrigMufastHypoTool::MuonClusterInfo, std::allocator<TrigMufastHypoTool::MuonClusterInfo> >&) const () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#10 0x00007f2f032f0a9b in TrigMufastHypoTool::decide (this=0x188aa000, toolInput=...) at /build/atnight/localbuilds/nightlies/main/athena/Trigger/TrigHypothesis/TrigMuonHypoMT/src/TrigMufastHypoTool.cxx:269
#11 0x00007f2f032e45eb in TrigMufastHypoAlg::execute(EventContext const&) const () at /build/atnight/localbuilds/nightlies/main/build/install/GAUDI/22.0.9/InstallArea/x86_64-centos7-gcc8-opt/include/GaudiKernel/GaudiHandle.h:272
#12 0x00007f2f28580bb7 in Gaudi::Algorithm::sysExecute(EventContext const&) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/src/Lib/Algorithm.cpp:464
By looking at the function in question, you might already be able to see the problem. Alternatively, you may need to recompile the relevant packages locally to get debug symbols, and possibly add extra debug printouts. You can also launch the job under GDB:
athena.py --debug exec ...
and inspect the relevant frames (i.e. lines of the stack trace):
(gdb) c # continue (start running)
# ... wait for crash ...
(gdb) thread apply all bt # show stacktraces for all threads
(gdb) thread 4 # switch to thread 4
(gdb) bt # stacktrace for thread 4
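Once you have selected the relevant thread you can, for example, jump to a specific frame and inspect its variables using standard gdb commands; the frame number and variable name below are placeholders based on the trace above:
(gdb) frame 10 # select frame #10 (here TrigMufastHypoTool::decide)
(gdb) info locals # print local variables (needs debug symbols)
(gdb) print toolInput # print an argument or member of interest
(gdb) up # move one frame towards the caller
(gdb) down # move one frame back towards the crash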
For more information on how to use gdb, see the relevant talk in the Athena Development Tutorial.
Segmentation faults are often caused by reading memory you shouldn’t access, e.g.
std::vector<int> myVec {1,2,3};
// undefined behaviour: operator[] does no bounds checking, so reading
// beyond the end of the vector typically segfaults
std::cout << myVec[4] << std::endl;
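While debugging, it can help to make the failure explicit rather than undefined; a minimal sketch of the usual alternatives (generic C++ advice, not a prescription from any Athena coding guideline):
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<int> myVec {1, 2, 3};

  // at() performs bounds checking and throws std::out_of_range
  // instead of silently reading invalid memory.
  try {
    std::cout << myVec.at(4) << std::endl;
  } catch (const std::out_of_range& e) {
    std::cerr << "bad index: " << e.what() << std::endl;
  }

  // Or check the index explicitly before using operator[].
  const std::size_t i = 4;
  if (i < myVec.size()) {
    std::cout << myVec[i] << std::endl;
  }
  return 0;
}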
Signs of accessing invalid memory addresses can be:
- small addresses (e.g. addr = 0x8 in the report above), which are reserved for the operating system
- null pointers (0x0)
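A small address like 0x8 typically comes from dereferencing a member of a null object, where the fault address equals the member's offset within the class. A hypothetical sketch (the struct and names are invented, not taken from the Athena code base):
struct Track {          // hypothetical type, for illustration only
  double pt;            // at offset 0x0
  double eta;           // at offset 0x8
};

double etaOf(const Track* t) {
  // If t is nullptr this reads address 0x8 (the offset of 'eta'),
  // matching the "addr = 0x8" seen in the core dump above.
  return t->eta;
}

int main() {
  const Track* t = nullptr;           // e.g. an unchecked failed lookup
  return static_cast<int>(etaOf(t));  // segfault: address not mapped to object
}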
By default athena uses tcmalloc for memory management: this is faster than the default OS allocator (glibc), but will crash further from the site of the problem. Running with athena.py --stdcmalloc can help pinpoint the problem more easily.
Other forms of memory corruption will cause glibc to abort, dumping a backtrace and the memory map. Sample error messages:
*** Error in `/cvmfs/atlas-nightlies.cern.ch/repo/sw/main/sw/lcg/releases/Python/2.7.16-e553a/x86_64-centos7-gcc8-opt/bin/python': double free or corruption (fasttop): 0x000000003cb73c20 ***
or
glibc detected *** /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc48-opt/20.2.0/sw/lcg/releases/LCG_71/Python/2.7.6/x86_64-slc6-gcc48-opt/bin/python: free(): invalid pointer: 0x00007f368c353638 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x7f368c708e66]
/lib64/libc.so.6(+0x7897a)[0x7f368c70b97a]
/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc48-opt/20.2.0/AtlasSite/gcc-links/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6(_ZNSs6assignERKSs+0x79)[0x7f368b27ff09]
These are less helpful, so you should re-run the job in gdb to get a standard stack trace. Various kinds of memory corruption can cause these errors; a double free is the most common cause.
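For illustration, a deliberately buggy snippet showing the most common pattern (hypothetical code, not taken from the repository):
int main() {
  int* p = new int[10];
  delete[] p;
  delete[] p;   // double free: glibc reports "double free or corruption" and aborts
  return 0;
}
In real code the two frees are usually far apart (e.g. two objects that both believe they own the same raw pointer), which is why the gdb stack trace is needed to find them.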
Problems with excessive memory usage often show up as tcmalloc “large alloc” reports:
ESDtoAOD 05:56:17 tcmalloc: large alloc 1073741824 bytes == 0x123348000 @
ESDtoAOD 05:56:45 tcmalloc: large alloc 2147483648 bytes == 0x17aa0c000 @
ESDtoAOD 05:57:31 tcmalloc: large alloc 4294967296 bytes == 0x1fb20c000 @
ESDtoAOD 05:59:06 tcmalloc: large alloc 8589934592 bytes == 0x366b7a000 @
ESDtoAOD 05:59:33 tcmalloc: large alloc 4769398784 bytes == 0x567c0c000 @
ESDtoAOD 06:06:04 tcmalloc: large alloc 2147483648 bytes == 0x17aa0c000 @
ESDtoAOD 06:06:43 tcmalloc: large alloc 4294967296 bytes == 0x567c0c000 @
ESDtoAOD 06:08:56 tcmalloc: large alloc 8589934592 bytes == 0x351ae4000 @
ESDtoAOD 06:09:10 tcmalloc: large alloc 4769398784 bytes == 0x749a80000 @
ESDtoAOD 06:16:04 tcmalloc: large alloc 4294967296 bytes == 0x567c0c000 @
ESDtoAOD 06:17:23 tcmalloc: large alloc 8589934592 bytes == 0x351ae4000 @
ESDtoAOD 06:17:37 tcmalloc: large alloc 4769398784 bytes == 0x749a50000 @
followed by a crash. In some cases the job may instead be killed by the OS, the batch system, etc. You can set a breakpoint on the tcmalloc large-alloc reporting function:
(gdb) break (anonymous namespace)::ReportLargeAlloc(unsigned long, void*)
to find out where the allocations are happening. The reporting threshold is 1 GB by default, but it can be changed to a value of your choice, e.g.:
export TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=67108864
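The underlying cause is often a container that grows without bound, e.g. a long-lived collection that is filled every event but never cleared. A hypothetical sketch of the pattern (not actual Athena code):
#include <vector>

// Hypothetical per-job object; the names are invented for illustration.
struct LeakyCache {
  std::vector<double> m_values;   // lives for the whole job

  void addEvent(const std::vector<double>& hits) {
    // Bug: appended every event, never cleared. With libstdc++ the vector's
    // capacity roughly doubles each time it grows, producing the
    // 1 GB / 2 GB / 4 GB / 8 GB pattern of "large alloc" reports shown above.
    m_values.insert(m_values.end(), hits.begin(), hits.end());
  }
};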
A scheduler stall indicates that an algorithm requires an input that was not produced in the job so far. These are quite rare, because the default for trigger jobs (i.e. anything configured with runHLT.py) is to require dependencies to be fixed at initialisation: every algorithm specifies inputs that are either present in the input file or produced by another algorithm. If all this is resolved automatically, how can things go wrong? In the HLT there are two main mechanisms which might cause input data to go missing.
For example:
AvalancheSchedulerSvc 1 0 ERROR *** Stall detected, event context: s: 0 e: 1
AvalancheSchedulerSvc 1 0 ERROR Event 1 on slot 0 failed
AthenaHiveEventLoopMgr FATAL Failed event detected on s: 0 e: 1 w/ fail mode: AlgStall
Further down, the log should list which algorithm required an input that was not provided. For example:
L2MuCombRecoInViews (344), w/ decision: UNDEFINED(-1)
SGInputLoader (8), w/ decision: TRUE(1), in state: INITIAL
VDVInDetFTF_muon (345), w/ decision: TRUE(1), in state: EVTACCEPTED
ActsPixelClusterizationAlg_muon (346), w/ decision: UNDEFINED(-1), in state: CONTROLREADY
========
missing data: ( 'InDet::SiDetectorElementStatus' , 'StoreGateSvc+ITkPixelDetectorElementStatus' )
required by tool: ActsPixelClusterizationAlg_muon.ActsPixelClusteringTool
can be produced by alg(s): ( PrecisionTrackViewDataVerifier_forGSFRefit_GSFVDV in state: INITIAL ) ( PrecisionTrackViewDataVerifier_forGSFRefit_LRTGSFVDV in state: INITIAL )
This indicates that the tool ActsPixelClusterizationAlg_muon.ActsPixelClusteringTool, whose parent is the algorithm ActsPixelClusterizationAlg_muon, requires a ReadHandle for the “missing data” object, and it suggests algorithms which could produce it. In this case the failing algorithm has the suffix _muon, while the suggested producers have _forGSFRefit_GSF (i.e. electron reconstruction).
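For context, these data dependencies come from the ReadHandleKey/WriteHandleKey declarations in algorithms and tools; the scheduler matches consumers to producers by the StoreGate key. A minimal sketch of the consumer side, with invented class, property and key names:
#include "AthenaBaseComps/AthReentrantAlgorithm.h"
#include "StoreGate/ReadHandle.h"
#include "StoreGate/ReadHandleKey.h"
#include "xAODTracking/TrackParticleContainer.h"

// Hypothetical algorithm: illustrates the dependency declaration only.
class MyConsumerAlg : public AthReentrantAlgorithm {
public:
  using AthReentrantAlgorithm::AthReentrantAlgorithm;

  StatusCode initialize() override {
    // Registers the dependency: the scheduler now expects "MyInputTracks"
    // to be in the input file or produced by another algorithm.
    if (m_inputKey.initialize().isFailure()) return StatusCode::FAILURE;
    return StatusCode::SUCCESS;
  }

  StatusCode execute(const EventContext& ctx) const override {
    SG::ReadHandle<xAOD::TrackParticleContainer> tracks(m_inputKey, ctx);
    if (!tracks.isValid()) return StatusCode::FAILURE;  // would surface as missing data
    return StatusCode::SUCCESS;
  }

private:
  SG::ReadHandleKey<xAOD::TrackParticleContainer> m_inputKey{
      this, "InputTracksKey", "MyInputTracks", "input track collection"};
};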
Scheduler stalls are a complex topic, so it is hard to give general recommendations. Some further information can be found in these (ATLAS internal) slides.
Most crashes should be seen in athena as well as athenaHLT. For instance, an out-of-bounds memory access in a reconstruction algorithm will manifest in both. In some rare cases a problem will manifest only in athena or only in athenaHLT. The basic differences between the two are as follows:
- athena is the basic offline workflow, reading from a file and using the AthenaHiveEventLoopMgr (the same as all other offline workflows).
- athenaHLT is closer to (but not identical with) what runs at P1. A parent process runs athena initialization, after which child processes are forked (retaining the information/memory from the parent) and run over events using the HltEventLoopMgr.
- gdb can be attached to either the parent process with --debug parent or the child with --debug child. In most cases the child process (i.e. during the event loop) is where the problem happens.
To reproduce a problem seen during data-taking at Point-1, use the -b/--use-database option in order to run the HLT with the same configuration (menu and prescales) as was used online when processing the events in FILE:
asetup Athena,24.0.40
athenaHLT -b --filesInput=FILE
You can find the correct release number for a given run using ATLAS Run Query.