Valgrind examples and issues

Last update: 25 Apr 2024 [History] [Edit]

Valgrind examples:

In addition to what was shown in the previous section you can use also as a reference SPOT bi-weekly jobs.

Valgrind Output Explained

The output from valgrind is (to say the least!) a bit cryptic. There are two types of output, “runtime messages” which appear as and when valgrind detects a problem, and “Final reports” … summary information at the end of the job (for example, on memory leaks, etc). In this section, they are both described.

“Run-Time” Messages

As valgrind finds memory problems in running code, it outputs cryptic-looking messages. In this section, some examples are explained.

The message will be of the form:

==31929== Conditional jump or move depends on the uninitialized value(s)
==31929==    at 0x807B0E4: PyObject_Free (in 
/afs/cern.ch/sw/lcg/external/Python/2.3.4/rh73_gcc32/bin/python)
==31929==    by 0x8074D4F: dictresize (in 
/afs/cern.ch/sw/lcg/external/Python/2.3.4/rh73_gcc32/bin/python)
etc. 

where the value inside the == is the pid of your program. The rest of the message is the interesting part, and common examples are explained below.

Message Explanation Example of broken code
Conditional jump or move depends on uninitialised value(s) You have something like if ( isTrue ) where _isTrue_ is undefined. This is very bad - it causes completely unpredictable behavior, and therefore nasty bugs. Please note that if valgrind says this, it really is true. It is not the case that valgrind hasn’t understood your code … valgrind does not parse code, it just watches memory int i; //i undefined if (i>2) doSomething();
Invalid read of size 8 You have tried to read an invalid memory location i.e. which is not on the stack. See here for more details  

Final Reports

If you’re using memcheck, then you will get lots of messages at the end, relating to memory leaks. The following information (more info at FAQ) might be useful: The errors reported can be: | Message | Explanation | | ————— | ————————————————————————————————————————————————————————————————— | | definitely lost | your program is leaking memory – fix it! | | Possibly lost | your program is probably leaking memory unless you’re doing funny things with pointers | | still reachable | means your program is probably ok – it didn’t free some memory it could have. This is quite common and often reasonable. Don’t use --show-reachable=yes if you don’t want to see these reports. | | suppressed | a leak error has been suppressed. There are some suppressions in the default suppression files. You can ignore suppressed errors. |

(The details are in section 3.6 of the manual)

Examples

Invalid read

Running on release 10 I see the following, just before a crash (note that I have removed some of the backtrace output for clarity):

==29848== Invalid read of size 4
==29848==    at 0x3C2C23B4: Trk::Surface::associatedDetectorElement() const (Surface.h:159)
==29848==    by 0x3C288BD4: Trk::TrackStateOnSurface::~TrackStateOnSurface() (TrackStateOnSurface.cxx:84)
==29848==    by 0x3C0BBD59: DataVector<Trk::TrackStateOnSurface const>::~DataVector() (in 
/afs/cern.ch/user/e/emoyse/public/TestNightly/Tracking/TrkEvent/TrkTrack/
TrkTrack-04-18-03/i686-slc3-gcc323-opt/libTrkTrackDict.so)
==29848==    by 0x3C2848C8: Trk::Track::~Track() (in /afs/cern.ch/user/e/emoyse/public/TestNightly/Tracking/TrkEvent/TrkTrack/
TrkTrack-04-18-03/i686-slc3-gcc323-opt/libTrkTrack.so)
==29848==    by 0x3C0BB9A9: DataVector<Trk::Track>::~DataVector() (in /afs/cern.ch/user/e/emoyse/public/TestNightly/Tracking/TrkEvent/TrkTrack/
TrkTrack-04-18-03/i686-slc3-gcc323-opt/libTrkTrackDict.so)
==29848==    by 0x3C5428B8: SG::DataBucket<DataVector<Trk::Track> >::~DataBucket() (in /afs/cern.ch/user/e/emoyse/public/TestNightly/Tracking/TrkEventCnv/
TrkEventAthenaPool/TrkEventAthenaPool-01-06-01/i686-slc3-gcc323-opt/
libTrkEventAthenaPoolPoolCnv.so)

Followed by:

==29848==  Address 0x4BDF9960 is 16 bytes inside a block of size 28 freed
==29848==    at 0x3414BC27: operator delete(void\*) (vg_replace_malloc.c:156)
==29848==    by 0x379C6E06: Trk::PlaneSurface::~PlaneSurface() (in /afs/cern.ch/atlas/software/dist/10.0.0/Tracking/TrkDetDescr/TrkSurfaces/
TrkSurfaces-00-04-06/i686-slc3-gcc323-opt/libTrkSurfaces.so)
==29848==    by 0x3C2C73C7: Trk::AtaPlane::~AtaPlane() (AtaPlane.cxx:117)
==29848==    by 0x3C288BD4: Trk::TrackStateOnSurface::~TrackStateOnSurface() (TrackStateOnSurface.cxx:84)
==29848==    by 0x3C0BBD59: DataVector<Trk::TrackStateOnSurface const>::~DataVector() (in /afs/cern.ch/user/e/emoyse/public/TestNightly/Tracking/TrkEvent/TrkTrack/
TrkTrack-04-18-03/i686-slc3-gcc323-opt/libTrkTrackDict.so)

What this means is the following:

  • the TrackStateOnSurface destructor is trying to use a pointer to an object that no longer exists because the object was already deleted in the PlaneSurface destructor.

Finding a Memory Leak (using and addrcheck)

In this example, we will step through finding and solving a memory leak. I have run on opt, and got the following message in the job output:

==29199== 300 bytes in 15 blocks are definitely lost in loss record 1006 of 1301 
==29199== at 0x3414B6D6: operator new(unsigned) (vg_replace_malloc.c:133) 
==29199== by 0x3F79D4F0: Trk::TrackSummaryTool::createSummary(Trk::Track const&) (in /afs/cern.ch/atlas/software/
dist/nightlies/rel/atlrel_1/Tracking/TrkTools/TrkTrackSummaryTool/TrkTrackSummaryTool-00-11-01/i686-slc3-gcc323-
opt/libTrkTrackSummaryToolLib.so) 
==29199== by 0x3F7BBDA0: InDet::PriVxTopAlg::m_preselect(Trk::Track const\*) (in /afs/cern.ch/atlas/software/dist
/nightlies/rel/atlrel_1/InnerDetector/InDetRecAlgs/InDetPriVxFinder/InDetPriVxFinder-01-00-01/i686-slc3-gcc323-
opt/libInDetPriVxFinder.so) 
==29199== by 0x3F7C0780: InDet::VxPrimary::execute() (in /afs/cern.ch/atlas/software/dist/nightlies/rel/atlrel_1
/InnerDetector/InDetRecAlgs/InDetPriVxFinder/InDetPriVxFinder-01-00-01/i686-slc3-gcc323-opt/libInDetPriVxFinder.so) 
==29199== by 0x342850DF: Algorithm::sysExecute() (in /afs/cern.ch/atlas/offline/external/Gaudi/0.14.6.14-root4/
GaudiKernel/v15r7p4/i686-slc3-gcc323-opt/libGaudiKernel.so) 
==29199== by 0x341760B1: AthenaEventLoopMgr::executeAlgorithms() (in /afs/cern.ch/atlas/software/dist/nightlies
/rel/atlrel_1/Control/AthenaServices/AthenaServices-01-05-16/i686-slc3-gcc323-opt/libAthenaServices.so) 
==29199== by 0x341764D5: AthenaEventLoopMgr::executeEvent(void\*) (in /afs/cern.ch/atlas/software/dist/nightlies
/rel/atlrel_1/Control/AthenaServices/AthenaServices-01-05-16/i686-slc3-gcc323-opt/libAthenaServices.so) 
==29199== by 0x34176CC8: AthenaEventLoopMgr::nextEvent(int) (in /afs/cern.ch/atlas/software/dist/nightlies/rel
/atlrel_1/Control/AthenaServices/AthenaServices-01-05-16/i686-slc3-gcc323-opt/libAthenaServices.so) 

This tells us that there is a 300 byte leak coming from Trk::TrackSummaryTool::createSummary(Trk::Track const&), which was called by InDet::PriVxTopAlg::m_preselect(Trk::Track const\*) etc. (i.e. the last/latest call is at the top, and first at the bottom).

It might be useful to have some more detailed output - such as line numbers, for that last method. So we can check out and build just that package with debug info (see here for details on how to do this).

Now re-run … the new output is:

==8807== 20 bytes in 1 blocks are definitely lost in loss record 242 of 2341
==8807==    at 0x3414C6D6: operator new(unsigned) (vg_replace_malloc.c:133)
==8807==    by 0x3EB1894B: Trk::TrackSummaryTool::createSummary(Trk::Track const&) (TrackSummaryTool.cxx:171)
==8807==    by 0x3F91A769: Trk::TrackScoringTool::score(Trk::Track const&) (in /afs/cern.ch/user/e/emoyse/
public/Development/Tracking/TrkTools/TrkAmbiguityProcessor/TrkAmbiguityProcessor-00-00-01/i686-slc3-gcc323-opt/
libTrkAmbiguityProcessorLib.so)
==8807==    by 0x3F918E3F: Trk::TrackAmbiguityProcessorTool::process(DataVector<Trk::Track> const\*) (in 
/afs/cern.ch/user/e/emoyse/public/Development/Tracking/TrkTools/TrkAmbiguityProcessor/
TrkAmbiguityProcessor-00-00-01/i686-slc3-gcc323-opt/libTrkAmbiguityProcessorLib.so)
==8807==    by 0x3F92CF0E: InDetAmbiguitySolver::resolveTracks() (in /afs/cern.ch/user/e/emoyse/
public/Development/InnerDetector/InDetRecAlgs/InDetAmbiguitySolver/InDetAmbiguitySolver-00-08-00/
i686-slc3-gcc323-opt/libInDetAmbiguitySolver.so)
==8807==    by 0x3F92CA48: InDetAmbiguitySolver::execute() (in /afs/cern.ch/user/e/emoyse/public/
Development/InnerDetector/InDetRecAlgs/InDetAmbiguitySolver/InDetAmbiguitySolver-00-08-00/
i686-slc3-gcc323-opt/libInDetAmbiguitySolver.so)
==8807==    by 0x342860DF: Algorithm::sysExecute() (in /afs/cern.ch/atlas/offline/external/Gaudi/
0.14.6.14-root4/GaudiKernel/v15r7p4/i686-slc3-gcc323-opt/libGaudiKernel.so)
==8807==    by 0x341770B1: AthenaEventLoopMgr::executeAlgorithms() (in /afs/cern.ch/atlas/
software/dist/nightlies/rel/atlrel_1/Control/AthenaServices/AthenaServices-01-05-16/i686-slc3-gcc323-opt/
libAthenaServices.so)

So now we see that there is a problem in TrackSummaryTool.cxx, line 171… which narrows it down to here:

const std::vector< const Trk::TrackParameters* >* extrapParameters = 0;
if (m_doHoles) 
 {
 extrapParameters = extrapolate(track);
 [...]
}

The answer is now clear -extrapParameters was never deleted. Fixing this and checking again with valgrind confirms that the bug is gone.

Special issue with object factory design in Tracking

Especially in the tracking realm, many tools employ a cascaded object factory design pattern. That means they create an object and instead of deleting it before the method reaches the end, they give it back to the calling client, hereby passing the ownership (and responsibility to delete it or add it to StoreGate) onto the calling client, and even further on (cascade). The TrackSummaryTool in the example above is such a factory, and if it shows up in a valgrind leak report it needs some investigation to find out if the tool is ‘leaking’ objects internally or if the tool is working correctly but the calling client failed to take over the ownership. This investigation is complicated by the fact that often the factory tools have an internal structure with such ownership transfer, and the clients can themselves again employ a factory design to add the (potentially leaking) object onto the next bigger object. For instance, a track summary is first made (by factory, the TrackSummaryTool) and then added to a track, but the track is leaked. In such a case the developer of the TrackSummaryTool will not enjoy getting a bug report filed, for which the tool is not responsible.

To investigate such leaks, a few recommendations should be considered:

  • increase the stack trace report length via the number in --num-callers=N so that a common or suspicious caller does not get away unnoticed.
  • get an overview on how widely the basic class or tool (at the top of stack trace) is used and how often the leak report associated with the same class appears in your valgrind report. A leak report for the same class by a variety of callers makes the leak probable to come from the class. A single leak report for an otherwise widely used class makes the leak probable to come from the caller.
  • maybe this is obvious, but just to state clearly: get an overview of which code in the (long) stack trace is newly introduced, changed, or called under rare conditions, and search there first. Leak checking has been active for several years now guaranteeing stable memory for large productions so it is unlikely that the bread-and-butter tools are suddenly leaking objects.

Problems

64bit Platforms and 32bit athena builds

Running Athena in 64bit platforms but with 32bit-built libraries you may hit errors like this:

  • Valgrind: failed to start tool memcheck for platform ‘amd64-linux’: No such file or directory This is solved by selecting the 64bit valgrind build: i.e.
  • export PATH=/afs/cern.ch/sw/lcg/external/valgrind/3.5.0/x86_64-slc5-gcc43-opt/bin:$PATH
  • export LD_LIBRARY_PATH=/afs/cern.ch/sw/lcg/external/valgrind/3.5.0/x86_64-slc5-gcc43-opt/lib:$LD_LIBRARY_PATH

N.B. This should be fixed for all releases newer than 16.5.0.

Memory limit on lxplus

valgrind: mmap(0x9065000, -1493585920) failed during startup.
valgrind: is there a hard virtual memory limit set?

This should not happen. Lxplus machines still have a hard memory limit, but it is so large (768 Mb at 06.12.2006, check with ulimit -a) that it should be possible to run valgrind without a problem.