Valgrind is an extremely useful code-checking tool. It works by tracking every single bit of memory and checking that each one is properly initialized, among other things. This takes a lot of CPU power and memory, so expect programs running under valgrind to be a lot slower.
But, to quote the valgrind website:
“With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, and making your programs more stable. You can also perform detailed profiling, to speed up and reduce memory use of your programs.”
As mentioned, Valgrind comes with several possible ‘tools’ or ‘skins’ which can be used. The default is Memcheck, which checks all reads and writes of memory, and intercepts all calls to malloc/new/free/delete.
As a result, it can catch the following errors:
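- Use of uninitialised memory
- Reading/writing memory after it has been freed
- Reading/writing off the end of malloc'd blocks
- Reading/writing inappropriate areas on the stack
- Memory leaks, where pointers to malloc'd blocks are lost forever
- Mismatched use of malloc/new/new [] vs free/delete/delete []
- Overlapping src and dst pointers in memcpy() and related functions
- Some misuses of the POSIX pthreads API
There are other possibilities, such as:
- Addrcheck - a lightweight (faster) version of Memcheck
- Cachegrind - a cache profiler
- Callgrind - a call graph profiler (extended version of Cachegrind)
- Massif - a heap memory profiler
- Helgrind - detects data races in multithreaded programs
For more information, read the official valgrind documentation.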
Valgrind is shipped with the LCG releases and is available in the Athena environment.
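Once a release is set up, it can be worth confirming that valgrind is actually on the PATH before starting a long job. A minimal sketch, using only standard shell built-ins and valgrind's --version flag (the variable name valgrind_status is just for illustration):

```shell
#!/bin/sh
# Check whether valgrind is available in the current environment.
if command -v valgrind >/dev/null 2>&1; then
    valgrind_status="found: $(valgrind --version)"
else
    valgrind_status="not found (is the release set up?)"
fi
echo "valgrind $valgrind_status"
```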
To use valgrind with Athena and job options, you first need to generate a pickle file from your Python job options and then call valgrind as shown here:
athena.py --config-only=rec.pkl --stdcmalloc jobOptions.py
valgrind $valgrindOpts $(which python) $(which athena.py) --stdcmalloc rec.pkl
where valgrindOpts holds the configuration for a particular tool. For example, to check for memory leaks and violations:
valgrindOpts="--show-possibly-lost=no,--smc-check=all,--tool=memcheck,--leak-check=full,--num-callers=30,--log-file=valgrind.%p.%n.out,--track-origins=yes,"
--leak-check=yes: enables detailed leak checking (the default is a summary only; yes gives a full printout of each leak)
--trace-children=yes: you likely need this, as athena.py by default runs the job in a subprocess, so without it your log file would be empty
--num-callers=25: gives the depth of the stack trace; depending on the problem you might need to increase this number even further
--show-reachable=yes: will also show leaks that are still ‘reachable’, see the manual for an explanation
--track-origins=yes: will tell you where variables that are later used uninitialized were allocated
--smc-check=all: tells valgrind to allow code to be modified during run time (self-modifying code). This is needed for JIT with ROOT 6. It is set as a default in $VALGRIND_OPTS in all recent releases.
Or you can check the heap memory utilization with Massif:
valgrindOpts="--tool=massif,--pages-as-heap=yes,--threshold=0.01,--detailed-freq=1,--log-file=valgrind.log"
--pages-as-heap=yes: tells Massif to profile memory at the page level
--threshold=0.01: is the significance threshold for heap allocations, as a percentage of total memory size
--detailed-freq=1: is the frequency of detailed snapshots; 1 means every snapshot is detailed
Other Massif options can be found in the Massif section of the valgrind manual. This will send valgrind's log output to (appropriately) valgrind.log; the Massif profile itself is written to massif.out.<pid> by default.
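The valgrindOpts strings above are comma-separated, which looks like the form intended for the transform's --valgrindExtraOpts argument; valgrind itself takes each option as a separate command-line argument. When calling valgrind directly, the string can be split first. A minimal POSIX shell sketch, written as a dry run that only prints the arguments and using a shortened option string chosen for illustration:

```shell
#!/bin/sh
# Comma-separated option string, shortened from the Memcheck example
# (illustrative subset; a trailing comma is tolerated by the split).
valgrindOpts="--tool=memcheck,--leak-check=full,--num-callers=30,--track-origins=yes,"

# Split on commas into the positional parameters "$@".
# Note: this overwrites any arguments the script itself was given.
# set -f disables globbing so characters such as * are not expanded.
old_ifs=$IFS
IFS=,
set -f
set -- $valgrindOpts
set +f
IFS=$old_ifs

# Dry run: print each argument that would be handed to valgrind.
printf 'valgrind argument: %s\n' "$@"

# With Athena set up, the real call would then look like:
# valgrind "$@" $(which python) $(which athena.py) --stdcmalloc rec.pkl
```

Passing "$@" quoted keeps each option as exactly one argument, which matters if an option value ever contains spaces.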
Performance monitoring, i.e. PerfMon, needs to be turned off when running valgrind to avoid a crash. The following fragment should be included in your job options file or in the preExec argument for a transform (e.g. Reco_tf.py) command.
from RecExConfig.RecFlags import rec
rec.doPerfMon.set_Value_and_Lock(False)
rec.doDetailedPerfMon.set_Value_and_Lock(False)
rec.doSemiDetailedPerfMon.set_Value_and_Lock(False)
Running valgrind on dbg builds is pretty slow, so you may want to run on opt first. This will at least give you a rough idea of where the problems lie (for instance it can tell you the methods/functions of classes with problems). For more details, you can run in dbg. Again, running the entire ATLAS reconstruction in dbg is slow (and may take too much memory). It’s probably best to rebuild just the particular packages you’re interested in with dbg (see UsingDebugBuiltPackagesWithOptBuild for details of how to do this).
To produce a pickled configuration with ComponentAccumulator, run something like:
athena --config-only=myConfig.pkl <myJobConfigFile.py>
To execute it using Valgrind:
valgrind --leak-check=yes --trace-children=yes --num-callers=25 --show-reachable=yes --track-origins=yes --smc-check=all $(which python) $(which CARunner.py) myConfig.pkl
Note that you can use the --evtMax parameter to limit the number of events.
It is also possible to run valgrind directly on the Python configuration script, in which case the configuration stage is also processed by valgrind:
valgrind --tool=memcheck --leak-check=full --smc-check=all --num-callers=30 $(which python) myConfig.py
Most of the transform jobs have been migrated to CA, which means each job basically boils down to running an auto-generated runargs file, e.g. runargs.JOBNAME.py, and then executing the corresponding runwrapper.JOBNAME.sh. You can simply enable valgrind inside the transform job:
Reco_tf.py \
--perfmon 'none' \
[...]
--athenaopts="--stdcmalloc" \
--valgrind "True" \
--valgrindDefaultOpts "False" \
--valgrindExtraOpts="${valgrindOpts}";
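When --log-file=valgrind.%p.%n.out is used, as in the Memcheck valgrindOpts example earlier, each process writes its own log (%p expands to the process ID), so a job can leave several files behind. A quick way to get an overview is to pull the "definitely lost" line out of each log's LEAK SUMMARY. The sketch below runs on a small synthetic log so it can be tried anywhere; the file name, PID, and byte counts are made up for illustration:

```shell
#!/bin/sh
# Create a synthetic log that mimics Memcheck's LEAK SUMMARY layout.
# The PID and numbers are invented for this example.
logdir=$(mktemp -d)
cat > "$logdir/valgrind.12345.0.out" <<'EOF'
==12345== LEAK SUMMARY:
==12345==    definitely lost: 40 bytes in 1 blocks
==12345==    indirectly lost: 0 bytes in 0 blocks
==12345==      possibly lost: 0 bytes in 0 blocks
EOF

# Print the "definitely lost" line of every log file.
# The extra /dev/null argument makes grep prefix each match with its file name
# even when only one log exists.
summary=$(grep 'definitely lost:' "$logdir"/valgrind.*.out /dev/null)
printf '%s\n' "$summary"

# Remove the temporary directory again.
rm -rf "$logdir"
```

On real output, replace "$logdir" with the directory in which the job was run.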
There will normally be a lot of errors reported from external packages over which we have no control. These can be suppressed by using “suppression files” as follows:
source $(which valgrind-atlas-opts.sh)
This script will set up all relevant suppression files in $VALGRIND_OPTS, so there is no need to specify them directly on the command line and you can run the valgrind command as listed above. If you want to use additional suppression files, specify them directly on the command line via --suppressions.
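As an illustration of an additional suppression file, the sketch below writes one Memcheck leak suppression and prints the command that would use it. The entry name and library name are hypothetical placeholders, not real ATLAS or LCG suppressions, and the command is only echoed rather than executed:

```shell
#!/bin/sh
# Write a suppression file with a single hypothetical entry.
# Layout: entry name, tool:error-kind, then the call-stack frames to match
# ("..." matches any number of intermediate frames).
suppdir=$(mktemp -d)
suppfile="$suppdir/extra.supp"
cat > "$suppfile" <<'EOF'
{
   hypothetical_external_leak
   Memcheck:Leak
   fun:malloc
   ...
   obj:*/libHypotheticalExternal.so
}
EOF

# Dry run: show the command line that would pick up the extra file.
# The remaining options would come from $VALGRIND_OPTS.
echo "valgrind --suppressions=$suppfile \$(which python) \$(which CARunner.py) myConfig.pkl"
```

Running valgrind with --gen-suppressions=all prints ready-made entries for the errors it reports, which is usually easier than writing them by hand.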