Trigger Debugging

Last update: 05 Jul 2023 [History] [Edit]

Introduction

Debugging trigger algorithms is mainly a function of

  • Reading and understanding stack traces
  • Reproducing problems
  • Fixing code

Providing instructions to reproduce

Experts cannot help you unless you provide a reproducible example. The fewer steps they have to follow, the more time they can devote to helping you.

  • Nightly builds only last for a month, so provide the most recent nightly that shows the problem
  • If possible, make files available on afs/eos. Otherwise, provide details for downloading them through rucio
  • Put the exact asetup required
  • Specify the atlas/athena git commit
  • If you make local modifications to any packages, point to either
  • your git commit - this can then be applied as e.g. git cherry-pick f00dbeef
  • a diff - git diff > ~/public/patch - this can then be applied via `git apply ~/public/patch’

Example:

mkdir build run
asetup 22.0.11,Athena
git atlas init-workdir https://:@gitlab.cern.ch:8443/atlas/athena.git
cd athena
git checkout release/22.0.11
git atlas addpkg TriggerMenuMT
git apply ~user/public/TriggerMenu.patch
cd ../build
cmake ../athena/Projects/WorkDir && make -j8 && source x86_64-centos7-gcc8-opt/setup.sh
cd ../run
athena.py --stdcmalloc --threads 1 --concurrent-events 1 --filesInput /cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/TrigP1Test/data18_13TeV.00360026.physics_EnhancedBias.merge.RAW._lb0151._SFO-1._0001.1 --skipEvents 0 -c 'doWriteBS=False;doWriteRDOTrigger=False' TriggerJobOpts/runHLT_standalone.py

Types of crash

All crashes are initially reported as core dump, but this is a general term.

Segmentation fault

In this case it reports “signal 11(Segmentation fault)”. You should report these either as “crash” or “segmentation fault” if you’re making a bug report.

Core dump from CoreDumpSvc on aibuild40-006.cern.ch at Thu Dec 12 04:13:30 2019

Caught signal 11(Segmentation fault). Details: 
errno = 0, code = 1 (address not mapped to object)
pid = 8, uid = 0
value = (784913664, 0x2ec8d500)
vmem = 3880.77 MB
rss = 2621.29 MB
total-ram = 117869 MB
free-ram = 2444.74 MB
buffer-ram= 46.8945 MB
total-swap= 40960 MB
free-swap = 39928.3 MB
addr = 0x8

Event counter: 2
Slot 0 : Current algorithm = TrigL2MufastHypoAlg 
: Last Incident = BeginIncFiringAlg:BeginEvent
: Event ID = [327265,186540986,t:1497896168.422137145,l:100,b:212]
-------------------------------------------------------------------------------------
| AtlasBaseDir : /build/atnight/localbuilds/nightlies/main/build/install |
| AtlasVersion : 2019-12-11T2131 |
| CMTCONFIG : x86_64-centos7-gcc8-opt |
-------------------------------------------------------------------------------------
Note: to see line numbers in below stacktrace you might consider running following :
atlasAddress2Line --file <logfile>
-------------------------------------------------------------------------------------

*** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Thread 4 (Thread 0x7f2ef119f700 (LWP 39026)):
#0 0x00007f2f3e08d469 in waitpid () from /lib64/libc.so.6
#1 0x00007f2f3e00af12 in do_system () from /lib64/libc.so.6
#2 0x00007f2f3e00b2c1 in system () from /lib64/libc.so.6
#3 0x00007f2f3640dd03 in TUnixSystem::StackTrace() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#4 0x00007f2f364105d4 in TUnixSystem::DispatchSignals(ESignals) () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#5 0x00007f2f13c7eaf8 in CoreDumpSvcHandler::action(int, siginfo_t*, void*) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/CoreDumpSvc.cxx:104
#6 <signal handler called>
#7 (anonymous namespace)::combMaker(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::function<void (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_iterator.h:783
#8 0x00007f2f09409c1d in HLT::elementsInUniqueCombinations(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#9 0x00007f2f032f0384 in TrigMufastHypoTool::multiplicitySelection(std::vector<TrigMufastHypoTool::MuonClusterInfo, std::allocator<TrigMufastHypoTool::MuonClusterInfo> >&) const () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#10 0x00007f2f032f0a9b in TrigMufastHypoTool::decide (this=0x188aa000, toolInput=...) at /build/atnight/localbuilds/nightlies/main/athena/Trigger/TrigHypothesis/TrigMuonHypoMT/src/TrigMufastHypoTool.cxx:269
#11 0x00007f2f032e45eb in TrigMufastHypoAlg::execute(EventContext const&) const () at /build/atnight/localbuilds/nightlies/main/build/install/GAUDI/22.0.9/InstallArea/x86_64-centos7-gcc8-opt/include/GaudiKernel/GaudiHandle.h:272
#12 0x00007f2f28580bb7 in Gaudi::Algorithm::sysExecute(EventContext const&) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/src/Lib/Algorithm.cpp:464
#13 0x00007f2f167e6fa9 in AthReentrantAlgorithm::sysExecute (this=<optimized out>, ctx=...) at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaBaseComps/src/AthReentrantAlgorithm.cxx:70
#14 0x00007f2f0b23a421 in AlgoExecutionTask::execute() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AlgoExecutionTask.cpp:55
#15 0x00007f2f35eb87f5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f2f15703e00, parent=..., child=<optimized out>) at ../../include/tbb/machine/gcc_ia32_common.h:96
#16 0x00007f2f35eb2428 in tbb::internal::arena::process (this=0x7f2f15773a00, s=...) at ../../src/tbb/arena.cpp:156
#17 0x00007f2f35eb0e93 in tbb::internal::market::process (this=0x7f2f1577b580, j=...) at ../../src/tbb/market.cpp:702
#18 0x00007f2f35ead47c in tbb::internal::rml::private_worker::run (this=0x7f2f154e7100) at ../../src/tbb/private_server.cpp:266
#19 0x00007f2f35ead689 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#20 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f2ef25fb700 (LWP 38787)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7f2ef25fad80) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7f2ef25fad80) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f0b242a44 in tbb::concurrent_bounded_queue<std::function<StatusCode ()>, tbb::cache_aligned_allocator<std::function<StatusCode ()> > >::pop(std::function<StatusCode ()>&) (destination=..., this=0x34ba8ae0) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 AvalancheSchedulerSvc::activate() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AvalancheSchedulerSvc.cpp:391
#7 0x00007f2f3dd0224f in execute_native_thread_routine () at /tmp/dkonst/build/contrib/gcc-8.3.0/src/gcc/8.3.0/libstdc++-v3/src/c++11/thread.cc:80
#8 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f2f1466a700 (LWP 38498)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7f2f14669de0) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7f2f14669de0) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f14906ea2 in tbb::concurrent_bounded_queue<std::function<void ()>, tbb::cache_aligned_allocator<std::function<void ()> > >::pop(std::function<void ()>&) (destination=..., this=0x16435e00) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 InertMessageSvc::m_activate() () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiCoreSvc/src/MessageSvc/InertMessageSvc.cpp:38
#7 0x00007f2f3dd0224f in execute_native_thread_routine () at /tmp/dkonst/build/contrib/gcc-8.3.0/src/gcc/8.3.0/libstdc++-v3/src/c++11/thread.cc:80
#8 0x00007f2f3eaa6e65 in start_thread () from /lib64/libpthread.so.0
#9 0x00007f2f3e0c688d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2f3fe917c0 (LWP 28456)):
#0 0x00007f2f3e0c0ba9 in syscall () from /lib64/libc.so.6
#1 0x00007f2f35ea2a3d in tbb::internal::futex_wait (comparand=2, futex=0x7ffcdfc87c70) at ../../include/tbb/machine/linux_common.h:81
#2 tbb::internal::binary_semaphore::P (this=0x7ffcdfc87c70) at ../../src/tbb/semaphore.h:205
#3 tbb::internal::concurrent_monitor::commit_wait (thr=..., this=<optimized out>) at ../../src/tbb/concurrent_monitor.h:140
#4 tbb::internal::concurrent_queue_base_v3::internal_pop(void*) () at ../../src/tbb/concurrent_queue.cpp:440
#5 0x00007f2f0b243346 in tbb::concurrent_bounded_queue<EventContext*, tbb::cache_aligned_allocator<EventContext*> >::pop (destination=
0x7ffcdfc87d88: 0x0, this=0x34ba8a90) at /cvmfs/sft.cern.ch/lcg/releases/tbb/2019_U7-ba3eb/x86_64-centos7-gcc8-opt/include/tbb/concurrent_queue.h:368
#6 AvalancheSchedulerSvc::popFinishedEvent (this=0x34ba8000, eventContext=
0x7ffcdfc87d88: 0x0) at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiHive/src/AvalancheSchedulerSvc.cpp:546
#7 0x00007f2f13c1756a in AthenaHiveEventLoopMgr::drainScheduler(int&) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:1242
#8 0x00007f2f13c1849d in AthenaHiveEventLoopMgr::nextEvent(int) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:841
#9 0x00007f2f13c1725c in AthenaHiveEventLoopMgr::executeRun(int) () at /build/atnight/localbuilds/nightlies/main/athena/Control/AthenaServices/src/AthenaHiveEventLoopMgr.cxx:728
#10 0x00007f2f148434d8 in ApplicationMgr::executeRun(int) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/GaudiKernel/StatusCode.h:119
#11 0x00007f2ef1df6047 in ?? ()
#12 0x000000000e197de8 in ?? ()
#13 0x00007f2f3675cee0 in gThreadTsd () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCore.so
#14 0x0000001434f0d410 in ?? ()
#15 0x00007f2f3811f6dc in ROOT::TVirtualRWMutex::UnLock() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libPyROOT.so
#16 0x00007f2f3203e882 in TClingCallFunc::IFacePtr() () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libCling.so
#17 0x00007f2f3812e5e9 in FastCall(long, void*, void*, void*) () from /cvmfs/sft.cern.ch/lcg/releases/LCG_96/ROOT/6.18.00/x86_64-centos7-gcc8-opt/lib/libPyROOT.so
===========================================================

Since this is a multi-threaded job, the stack traces for each thread are printed out. Generally only one or two will be relevant - usually the threads currently processing the event loop, i.e. with Gaudi::Algorithm::sysExecute(EventContext const&). Nothing below Gaudi calls is likely to be relevant to the bug - probably it will be somewhere near the top:

#7 (anonymous namespace)::combMaker(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::function<void (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>, unsigned long, std::vector<unsigned long, std::allocator<unsigned long> >) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/bits/stl_iterator.h:783
#8 0x00007f2f09409c1d in HLT::elementsInUniqueCombinations(std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&, std::function<bool (std::vector<unsigned long, std::allocator<unsigned long> > const&)>) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#9 0x00007f2f032f0384 in TrigMufastHypoTool::multiplicitySelection(std::vector<TrigMufastHypoTool::MuonClusterInfo, std::allocator<TrigMufastHypoTool::MuonClusterInfo> >&) const () at /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0-cebb0/x86_64-centos7/include/c++/8.3.0/new:169
#10 0x00007f2f032f0a9b in TrigMufastHypoTool::decide (this=0x188aa000, toolInput=...) at /build/atnight/localbuilds/nightlies/main/athena/Trigger/TrigHypothesis/TrigMuonHypoMT/src/TrigMufastHypoTool.cxx:269
#11 0x00007f2f032e45eb in TrigMufastHypoAlg::execute(EventContext const&) const () at /build/atnight/localbuilds/nightlies/main/build/install/GAUDI/22.0.9/InstallArea/x86_64-centos7-gcc8-opt/include/GaudiKernel/GaudiHandle.h:272
#12 0x00007f2f28580bb7 in Gaudi::Algorithm::sysExecute(EventContext const&) () at /build/atnight/localbuilds/nightlies/main/build/src/GAUDI/GaudiKernel/src/Lib/Algorithm.cpp:464

By looking at the function in question, you might already be able to see the problem. Or you may need to recompile the relevant packages to get the debug symbols locally, and possibly add extra debug printouts. You can also launch GDB:

athena.py --debug exec rest-of-the-command

and inspect relevant frames (i.e. lines of the stack trace).

#To get the stack traces for all threads
(gdb) thread apply all bt
#To move to a specific thread
(gdb)

Fixing segfaults

Once you have identified the source of the problem, it’s time to reproduce and fix it. Instructions should be in the JIRA ticket associated to the crash. Segmentation faults are caused by reading memory you shouldn’t access, e.g.

vector<int> myVec {1,2,3};
std::cout << myVec[4] << std::endl;//segfault - reading element 4 while there are only 0, 1, 2 elements

In the case of the stack trace above, the address accessed is addr = 0x8 - very low numbers are reserved for the operating system. The other important address number is 0x0 - a null pointer. High numbers (e.g. 0x0000001434f0d410 are valid addresses that should not be accessed by either the entire program, or this function in the program (similar to the example above).

By default athena uses tcmalloc for memory management: this is faster than the default OS allocator glibc, but will crash further from the site of the problem. Running with athena.py --stdcmalloc is more likely to be helpful.

Memory corruption

Other forms of memory corruption will lead glibc to dump everything in memory. Sample error messages:

*** Error in `/cvmfs/atlas-nightlies.cern.ch/repo/sw/main/sw/lcg/releases/Python/2.7.16-e553a/x86_64-centos7-gcc8-opt/bin/python': double free or corruption (fasttop): 0x000000003cb73c20 ***

or

glibc detected *** /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc48-opt/20.2.0/sw/lcg/releases/LCG_71/Python/2.7.6/x86_64-slc6-gcc48-opt/bin/python: free(): invalid pointer: 0x00007f368c353638 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x7f368c708e66]
/lib64/libc.so.6(+0x7897a)[0x7f368c70b97a]
/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc48-opt/20.2.0/AtlasSite/gcc-links/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6(_ZNSs6assignERKSs+0x79)[0x7f368b27ff09]

These are less helpful, so you should run these in gdb to get a standard stack trace. Various kinds of memory corruption can cause these - double free is the most common cause.

Out of memory

These often appear as tcmalloc reports

ESDtoAOD 05:56:17 tcmalloc: large alloc 1073741824 bytes == 0x123348000 @
ESDtoAOD 05:56:45 tcmalloc: large alloc 2147483648 bytes == 0x17aa0c000 @
ESDtoAOD 05:57:31 tcmalloc: large alloc 4294967296 bytes == 0x1fb20c000 @
ESDtoAOD 05:59:06 tcmalloc: large alloc 8589934592 bytes == 0x366b7a000 @
ESDtoAOD 05:59:33 tcmalloc: large alloc 4769398784 bytes == 0x567c0c000 @
ESDtoAOD 06:06:04 tcmalloc: large alloc 2147483648 bytes == 0x17aa0c000 @
ESDtoAOD 06:06:43 tcmalloc: large alloc 4294967296 bytes == 0x567c0c000 @
ESDtoAOD 06:08:56 tcmalloc: large alloc 8589934592 bytes == 0x351ae4000 @
ESDtoAOD 06:09:10 tcmalloc: large alloc 4769398784 bytes == 0x749a80000 @
ESDtoAOD 06:16:04 tcmalloc: large alloc 4294967296 bytes == 0x567c0c000 @
ESDtoAOD 06:17:23 tcmalloc: large alloc 8589934592 bytes == 0x351ae4000 @
ESDtoAOD 06:17:37 tcmalloc: large alloc 4769398784 bytes == 0x749a50000 @

followed by a crash. In some cases the job may be killed by the OS, batch system etc. You can set a breakpoint on the tcmalloc large alloc reporting function:

(gdb)breakpoint (anonymous namespace)::ReportLargeAlloc(unsigned long, void*)

to find out where it’s happening. The alloc threshold is 1GB by default but can also be changed to the value of your choice: export TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=67108864

Fixing and verifying crashes

Always run with and without the fix to verify that the change has worked. A convenient way to do this is simply to rename the build directory mv build notbuild so that it’s no longer in your environment (e.g. $LD_LIBRARY_PATH).

Debugging scheduler stalls

TODO