Valgrind

Last update: 17 May 2024

Introduction

Valgrind is an extremely useful code-checking tool. Its default tool tracks every bit of memory the program uses and checks, among other things, that each one is properly initialized before it is used. This costs a lot of CPU time and memory, so expect programs running under valgrind to be much slower.

But, to quote the valgrind website:

“With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, and making your programs more stable. You can also perform detailed profiling, to speed up and reduce memory use of your programs.”

As mentioned, Valgrind comes with several possible ‘tools’ or ‘skins’ which can be used. The default is Memcheck, which checks all reads and writes of memory, and intercepts all calls to malloc/new/free/delete.

As a result, it can catch the following errors:

  • Use of uninitialized memory
  • Reading/writing memory after it has been freed
  • Reading/writing off the end of malloc'd blocks
  • Reading/writing inappropriate areas on the stack
  • Memory leaks – where pointers to malloc'd blocks are lost forever
  • Mismatched use of malloc/new/new [] vs free/delete/delete []
  • Overlapping src and dst pointers in memcpy() and related functions
  • Some misuses of the POSIX pthreads API

Other tools are available, such as:

  • Addrcheck - a lightweight (faster) version of Memcheck (no longer shipped with current Valgrind releases)
  • Cachegrind - a cache profiler
  • Callgrind - a call graph profiler (extended version of Cachegrind)
    • You can use this to profile only your algorithm, using Valkyrie
  • Massif - a heap memory profiler
  • Helgrind - a detector of data races in multithreaded programs

For more information, read the official valgrind documentation.

Starting Valgrind

Valgrind is shipped with the LCG releases and is available in the Athena environment.

To use valgrind with Athena and job options, first generate a pickle file from your Python job options:

athena.py --config-only=rec.pkl --stdcmalloc jobOptions.py

Then call valgrind on the pickle file:

valgrind $valgrindOpts $(which python) $(which athena.py) --stdcmalloc rec.pkl

where valgrindOpts holds the options for the tool you want to run. For example, to check for memory leaks and invalid memory accesses:

 valgrindOpts="--show-possibly-lost=no,--smc-check=all,--tool=memcheck,--leak-check=full,--num-callers=30,--log-file=valgrind.%p.%n.out,--track-origins=yes,"
  • --leak-check=yes enables detailed leak checking with a full printout of each leak (yes is equivalent to full; the default is summary)
  • --trace-children=yes is likely needed, as athena.py by default runs the job in a subprocess; without this option your log file would be empty
  • --num-callers=25 sets the depth of the stack trace; depending on the problem you might need to increase this number even further
  • --show-reachable=yes will also show leaks that are still ‘reachable’, see the manual for an explanation
  • --track-origins=yes will tell you where values that are later used uninitialized came from
    • the last two options will increase memory usage
  • --smc-check=all makes valgrind check for self-modifying code everywhere, so that code modified at run time is handled correctly. This is needed for the JIT in ROOT 6. It is set by default in VALGRIND_OPTS in all recent releases.
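
The comma-separated valgrindOpts string above is the form handed to --valgrindExtraOpts in the transform example on this page. When the same options are given to valgrind directly on the command line, the shell does not split on commas, so they need to be separated by spaces instead. A minimal sketch of one way to convert between the two, assuming no option value itself contains a comma or a space (valgrindArgs is a name introduced here for illustration, and the option list is shortened):

```shell
# Comma-separated form, as passed to --valgrindExtraOpts in a transform.
valgrindOpts="--tool=memcheck,--leak-check=full,--num-callers=30,--track-origins=yes,"

# Drop a trailing comma, if any, then turn the remaining commas into spaces.
valgrindArgs=$(printf '%s' "${valgrindOpts%,}" | tr ',' ' ')

# Show the result before using it.
echo "${valgrindArgs}"

# Left unquoted on purpose so that the shell splits it into separate options:
#   valgrind ${valgrindArgs} $(which python) $(which athena.py) --stdcmalloc rec.pkl
```

Keeping the comma-separated string as the single source and deriving the space-separated one from it means the same option set can be reused for both the direct call and the transform.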

Alternatively, you can profile heap memory usage with Massif:

valgrindOpts="--tool=massif,--pages-as-heap=yes,--threshold=0.01,--detailed-freq=1,--log-file=valgrind.log"
  • --pages-as-heap=yes tells Massif to profile memory at the page level
  • --threshold=0.01 is the significance threshold for heap allocations, as a percentage of total memory size.
  • --detailed-freq=1 is the frequency of detailed snapshots; 1 means every snapshot is detailed. Other Massif options are described in the Massif chapter of the Valgrind manual.

With --log-file=valgrind.log as above, all of valgrind's log output is written to valgrind.log.
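
Massif's profile data is written separately from the log, by default to a file named massif.out.<pid>, and the ms_print tool shipped with Valgrind renders it as text. A minimal sketch of picking up the most recent profile, assuming the default output name was not changed and the job was run in the current directory (the helper latest_massif_out and the report name massif_report.txt are hypothetical):

```shell
# Hypothetical helper: print the most recently modified massif.out.* file
# in the given directory (default: current directory), or nothing if none exists.
latest_massif_out() {
    newest=""
    for f in "${1:-.}"/massif.out.*; do
        [ -e "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    printf '%s' "$newest"
}

profile=$(latest_massif_out .)
if [ -n "${profile}" ] && command -v ms_print >/dev/null 2>&1; then
    # ms_print renders the snapshots as a text graph plus allocation trees.
    ms_print "${profile}" > massif_report.txt
    echo "Wrote massif_report.txt from ${profile}"
else
    echo "No massif.out.* file found, or ms_print is not on PATH"
fi
```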

Performance monitoring (PerfMon) needs to be turned off when running valgrind to avoid a crash. Include the following fragment in your job options file or in the preExec argument of a transform (e.g. Reco_tf.py) command:

from RecExConfig.RecFlags import rec
rec.doPerfMon.set_Value_and_Lock(False)
rec.doDetailedPerfMon.set_Value_and_Lock(False)
rec.doSemiDetailedPerfMon.set_Value_and_Lock(False)

Running valgrind on a dbg build is pretty slow, so you may want to run on an opt build first. This will at least give you a rough idea of where the problems lie (for instance, it can tell you which methods or functions of which classes have problems). For more detail, you can run in dbg. Again, running the entire ATLAS reconstruction in dbg is slow (and may take too much memory), so it is probably best to rebuild only the packages you are interested in with dbg (see UsingDebugBuiltPackagesWithOptBuild for details of how to do this).

CA-based configuration

To produce a pickled configuration with ComponentAccumulator, run something like:

athena --config-only=myConfig.pkl <myJobConfigFile.py>

To execute it using Valgrind:

valgrind --leak-check=yes --trace-children=yes --num-callers=25 --show-reachable=yes --track-origins=yes --smc-check=all $(which python) $(which CARunner.py) myConfig.pkl

Note that you can use the --evtMax parameter to limit the number of events.

It is also possible to run valgrind directly on the Python configuration script, in which case the Python configuration stage is also processed by valgrind:

valgrind --tool=memcheck --leak-check=full --smc-check=all --num-callers=30 $(which python) myConfig.py

Running through job transforms

Most transform jobs have been migrated to CA, which means each job essentially boils down to running an auto-generated runargs file (e.g. runargs.JOBNAME.py) and then executing the corresponding runwrapper.JOBNAME.sh. You can simply enable valgrind within the transform job:

    Reco_tf.py \
      --perfmon 'none' \
      [...]
      --athenaopts="--stdcmalloc" \
      --valgrind "True" \
      --valgrindDefaultOpts "False" \
      --valgrindExtraOpts="${valgrindOpts}";

Suppression Files

There will normally be a lot of errors reported from external packages over which we have no control. These can be suppressed by using “suppression files” as follows:

source $(which valgrind-atlas-opts.sh)

This script will set up all relevant suppression files in $VALGRIND_OPTS, so there is no need to specify them directly on the command line and you can run the valgrind command as listed above. If you want to use additional suppression files, specify them directly on the command line via --suppressions.
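
As an illustration of adding your own suppression file, the sketch below writes a small one and appends it to the options valgrind reads from the environment. The file name my_extra.supp, the suppression name and the library pattern libSomeExternal are hypothetical; in practice the frames should be copied from what valgrind actually reports, for example by running once with --gen-suppressions=all, which prints ready-made entries.

```shell
# Write a hypothetical suppression for leaks allocated via malloc from inside
# an external library; adjust the frames to match what valgrind reports.
cat > my_extra.supp <<'EOF'
{
   hypothetical_external_leak
   Memcheck:Leak
   fun:malloc
   obj:*/libSomeExternal.so*
}
EOF

# Either pass the file on the command line:
#   valgrind --suppressions=my_extra.supp [...]
# or append it to the options that valgrind reads from the environment.
VALGRIND_OPTS="${VALGRIND_OPTS:-} --suppressions=${PWD}/my_extra.supp"
export VALGRIND_OPTS
echo "${VALGRIND_OPTS}"
```

Several --suppressions options can be given at once, so an extra file like this adds to the ATLAS defaults rather than replacing them.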