Introduction

Callgrind is a tool that uses Valgrind's runtime code instrumentation framework to generate call graphs. Valgrind is a kind of emulator or virtual machine. It uses JIT (just-in-time) compilation techniques to translate x86 instructions into a simpler intermediate code on which the various tools operate. The intermediate code processed by the tools is then translated back to x86 instructions and executed on the host CPU. This way even shared libraries and dynamically loaded plugins can be analyzed, but the approach results in a huge slowdown of the analyzed application (about a factor of 50 for the callgrind tool) and high memory consumption.

Simple use case

Prepare your development area as usual. You need to run Athena in debug mode only if you want to get a detailed line-by-line profile. Call-graphs can be produced in optimized mode as well.
Run Athena with Valgrind: valgrind --tool=callgrind --trace-children=yes $(which athena.py) your_Job.py
the --tool option determines which tool should be executed, callgrind in our case
the --trace-children option tells Valgrind to analyze child processes of the main application; otherwise you will only get profiles of the bash sessions, which is probably not what you want
Note that, depending on the system, you might want to add --enable-debuginfod=no to avoid querying debuginfod, a service that provides debug information over an HTTP API.

Profiling selected algorithms

Profiling an entire Athena job is not only very slow, the resulting profiles can also be huge (easily more than 100MB for a few events). This can make it difficult to analyze the results using KCacheGrind. Moreover, developers are often only interested in their own algorithm. This is where the ValgrindAuditor (part of Control/Valkyrie) can help. Once configured with the name of an algorithm to profile, it turns the callgrind instrumentation on before the algorithm’s execute method and turns it off again afterwards. While the instrumentation is off, the remaining Valgrind overhead should only be about a factor of 4, which makes it much easier to run an entire Athena job in a reasonable amount of time.
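
To get a rough feel for the gain, the two slowdown factors quoted in this document (about 50 with the callgrind instrumentation on, about 4 with it off) can be combined into a back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the native run time and the fraction of time spent in the profiled algorithm are hypothetical inputs, and the factors are the approximate values from the text, not measurements.

```python
def estimated_wall_time(native_seconds, profiled_fraction,
                        instrumented_factor=50.0, idle_factor=4.0):
    """Rough wall-time estimate for a job run under callgrind.

    native_seconds    -- run time of the job outside Valgrind (hypothetical input)
    profiled_fraction -- share of the native time spent in the profiled algorithm(s)
    The default factors are the approximate slowdowns quoted in the text.
    """
    if not 0.0 <= profiled_fraction <= 1.0:
        raise ValueError("profiled_fraction must be between 0 and 1")
    instrumented = native_seconds * profiled_fraction * instrumented_factor
    uninstrumented = native_seconds * (1.0 - profiled_fraction) * idle_factor
    return instrumented + uninstrumented


# Hypothetical job: 10 minutes natively, 5% of it in the profiled algorithm.
full = estimated_wall_time(600, 1.0)        # everything instrumented
selective = estimated_wall_time(600, 0.05)  # only the algorithm instrumented
print(f"full: {full / 3600:.1f} h, selective: {selective / 3600:.1f} h")
```

The real numbers depend strongly on the job and the machine, so treat the result as an order-of-magnitude guide only.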

To enable the ValgrindAuditor add the following lines to your component accumulator job:

flags.PerfMon.Valgrind.ProfiledAlgs=["EMBremCollectionBuilder"] #EMBremCollectionBuilder is an example. Replace as appropriate
from Valkyrie.ValkyrieConfig import ValgrindServiceCfg
acc.merge(ValgrindServiceCfg(flags))

Set flags.PerfMon.Valgrind.ProfiledAlgs to the name of the algorithm you want to profile (the list can contain several algorithms). Sometimes you might want to skip a few events before collecting profiling data, e.g. to exclude first-event initializations, ld symbol lookups, etc. This can be done by setting IgnoreFirstNEvents. For complete documentation of all the ValgrindSvc properties, see the Valkyrie doxygen page.

Before running the job in Valgrind, it is worth simply running Athena with your modified job options. If everything works (and you haven’t made a mistake in the algorithm name), you should see the following output:

ValgrindAuditor     VERBOSE Starting callgrind: EMBremCollectionBuilder [event 1]
ValgrindAuditor     VERBOSE Stopping callgrind: EMBremCollectionBuilder [event 1]

This tells you that ValgrindAuditor found your algorithm and is enabling/disabling the callgrind instrumentation before/after its execution (this has no effect yet, since we are not running in Valgrind). Once you have established that the auditor is configured correctly, you can run your job in Valgrind:

valgrind --tool=callgrind --trace-children=yes --collect-jumps=yes --instr-atstart=no --enable-debuginfod=no $(which athena.py) --imf your_Job.py
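
The check of the auditor output described above can also be scripted, which is handy when the log is long. The following is a minimal sketch, assuming the log lines look exactly like the two ValgrindAuditor lines shown above; the function name `auditor_toggles` and the regular expression are illustrative and not part of Valkyrie.

```python
import re

# Matches the auditor messages shown above, e.g.
# "ValgrindAuditor     VERBOSE Starting callgrind: EMBremCollectionBuilder [event 1]"
AUDITOR_LINE = re.compile(
    r"^ValgrindAuditor\s+VERBOSE\s+(Starting|Stopping) callgrind: (\S+) \[event (\d+)\]"
)


def auditor_toggles(log_lines):
    """Count the Starting/Stopping messages per algorithm name."""
    counts = {}
    for line in log_lines:
        match = AUDITOR_LINE.match(line.strip())
        if match:
            action, algorithm, _event = match.groups()
            per_alg = counts.setdefault(algorithm, {"Starting": 0, "Stopping": 0})
            per_alg[action] += 1
    return counts


# The two lines from the expected output above, used as sample input.
sample = [
    "ValgrindAuditor     VERBOSE Starting callgrind: EMBremCollectionBuilder [event 1]",
    "ValgrindAuditor     VERBOSE Stopping callgrind: EMBremCollectionBuilder [event 1]",
]
print(auditor_toggles(sample))
```

To process a saved log, pass an open file object instead of the sample list. An algorithm that is missing from the result, or whose two counts differ, suggests a misspelled name or a truncated log.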

An example using Reco_tf and some more callgrind options:

InputRDOFile="/cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CampaignInputs/mc20/RDO/mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.recon.AOD.e6337_s3681_r13145/100events.RDO.pool.root"

valgrind --tool=callgrind --trace-children=yes --collect-jumps=yes --instr-atstart=no --cacheuse=yes --cache-sim=yes \
  --simulate-wb=yes --simulate-hwpref=yes --branch-sim=yes --dump-instr=yes --enable-debuginfod=no $(which Reco_tf.py) \
  --inputRDOFile $InputRDOFile \
  --outputAODFile myAOD.pool.root \
  --preInclude 'egammaConfig.egammaOnlyFromRawFlags.egammaOnlyFromRaw' \
  --preExec 'ConfigFlags.PerfMon.Valgrind.ProfiledAlgs=["EMBremCollectionBuilder"]' \
  --postInclude 'Valkyrie.ValkyrieConfig.ValgrindServiceCfg' \
  --autoConfiguration 'everything' \
  --maxEvents '20' \
  --fileValidation FALSE \
  --perfmon 'none'

The most important option is --instr-atstart=no. This turns the instrumentation off at the beginning so that it can be turned on by ValgrindAuditor on the first event of the algorithm(s) being profiled. See the Valgrind manual for other callgrind options. After the job is done you will find several callgrind.out files, of which the largest is the one you are interested in.
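
Picking out the largest output file can be automated. The sketch below is one possible way to do it, assuming the files are in the current working directory and follow callgrind's default callgrind.out.<pid> naming; the helper name `largest_callgrind_output` is made up for this example.

```python
from pathlib import Path


def largest_callgrind_output(directory="."):
    """Return the largest callgrind.out* file in directory, or None if there is none."""
    candidates = [p for p in Path(directory).glob("callgrind.out*") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_size)


profile = largest_callgrind_output()
if profile is None:
    print("no callgrind.out files found")
else:
    print(f"{profile} ({profile.stat().st_size / 1e6:.1f} MB)")
```

The printed path is the file to open in KCacheGrind.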

Profiling selected pieces of code

The above only works for algorithms. To profile something else, for example a single method, you need to do a bit more:

in the CMakeLists.txt file add: find_package( valgrind )
in the file to be profiled, add #include "valgrind/callgrind.h"
at the start of the piece of code to be profiled, add CALLGRIND_START_INSTRUMENTATION; and at the end add CALLGRIND_STOP_INSTRUMENTATION;

After compiling and setting up your code, you can run callgrind as before:

valgrind --tool=callgrind --trace-children=yes --collect-jumps=yes --instr-atstart=no --enable-debuginfod=no $(which athena.py) --imf your_Job.py

There’s more in the relevant section of the Valgrind manual.

KCacheGrind

Data produced by callgrind can be loaded into the KCacheGrind tool for browsing the performance results. The actual profile of the Athena run is stored in the largest file produced by callgrind. On lxplus KCacheGrind is already installed, so you don’t need to do anything special:

$(which kcachegrind) --version

Qt: version
KDE Development Platform: version
KCachegrind: version kde