Intel VTune Profiler is a commercial application for software performance analysis of 32- and 64-bit x86-based machines. It is among a range of Intel tools that are installed and available at CERN through CVMFS. Since the compilers and performance tools are installed on CVMFS, they are available from any CVMFS-enabled Linux machine at CERN.
VTune
The basic usage requires sourcing the necessary setup script and passing your executable to vtune (historically amplxe-cl). In practice, this means:
# Setup Intel Tools
# Do this before running asetup to avoid python clashes
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
# Setup the latest 24.0 Athena nightly for Reco_tf.py
lsetup "asetup Athena,24.0,latest"
# Run a simple q445 job with 1 event to generate runargs.HITtoRDO.py
Reco_tf.py --AMI q445 --perfmon none --outputRDOFile myRDO.pool.root --maxEvents 1
# Change the number of events in the generated runargs.HITtoRDO.py as needed
# Run profiling
vtune -mrte-mode=native -collect hotspots $(which athena.py) -- preloadlib=$ATLASMKLLIBDIR_PRELOAD/libintlc.so.5:$ATLASMKLLIBDIR_PRELOAD/libimf.so runargs.HITtoRDO.py
May 20, 2021: With VTune 2021.2+ and Python 3.7+, trying to profile a Python job that uses cppyy (i.e., all athena.py jobs) makes the process hang. (See ATLINFR-4105 for a more detailed description of the problem.) To circumvent this issue, the -mrte-mode=native flag, as shown above, is needed.
This should produce a folder called r000hs containing a file called r000hs.vtune, which holds the profiling output. The results can be visualized using the GUI via:
# Invoke the GUI for visualization
vtune-gui r000hs/r000hs.vtune
The “Summary” tab contains a table called “Top Hotspots” that shows the five functions that used the most CPU time. The tabs named “Bottom-up”, “Caller/Callee”, and “Top-down Tree” contain more detailed information. Most users will be interested in the “Bottom-up” tab, where the functions/call stacks are ordered by CPU time used. Clicking on a given function shows its call stack in the right-hand panel, i.e.:
It is also possible to create call graphs from VTune output. First, obtain the Python script gprof2dot from this GitHub project and then:
vtune -report gprof-cc -result-dir output -format text -report-output output.txt
gprof2dot.py -f axe output.txt | dot -Tpng -o output.png
Some useful arguments for the latter command are --strip and --color-nodes-by-selftime. The latter colors nodes by self time, similar to what GPerfTools does by default, and the former strips function and template parameters from the demangled function names.
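Putting this together, the full call-graph pipeline with those flags might look like the following sketch (assuming the result directory is r000hs from above, gprof2dot.py is in your PATH, and Graphviz dot is installed):

```shell
# Export the VTune result to gprof-like text, then render a call graph.
# Nodes are colored by self time; function/template parameters are stripped.
vtune -report gprof-cc -result-dir r000hs -format text -report-output output.txt
gprof2dot.py -f axe --strip --color-nodes-by-selftime output.txt | dot -Tpng -o output.png
```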
More useful information can be obtained at:
TriggerProfiling twiki
If you’re running a Hotspots analysis, vtune uses software-based sampling by default. If you have an Intel chip and install the sep drivers (as root) as described on this webpage, you can enable hardware-based sampling via:
$ vtune -mrte-mode=native -collect hotspots -knob sampling-mode=hw ...;
This mode will provide additional information as described on this webpage.
We have a service called PerfMonVTune that allows users to profile either a list of specific algorithms or a range of events in Athena jobs. The current implementation doesn’t allow mixing these two (i.e. profiling a specific algorithm in a range of events) but that can be easily provided if there is demand.
PerfMonVTune isn’t built as part of the nightlies since it relies on VTune, which is not provided by default. However, the user can clone the package and easily compile it on top of VTune and a main Athena nightly as:
$ source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
$ lsetup "asetup Athena,main,latest" "git";
$ git atlas init-workdir https://:@gitlab.cern.ch:8443/atlas/athena.git -p PerfMonVTune;
$ mkdir build; cd build;
$ cmake ../athena/Projects/WorkDir; cmake --build .;
$ source x86_64-*/setup.sh; cd ..;
Then, to profile a range of events, the user can add the following snippet to their jobOptions (jO):
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
cfg.merge(VTuneProfilerServiceCfg(configFlags, ResumeEvent = 5, PauseEvent = 15))
which will profile the entire job between the 5th (inclusive) and 15th (exclusive) events. Of course, this only makes sense in a serial or single-threaded Athena job. For multi-threaded jobs, depending on the configuration, you might get contributions from other parallel events in flight (VTune profiles the entire process).
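For orientation, a minimal ComponentAccumulator-style jobOptions using this snippet might look like the sketch below; everything around VTuneProfilerServiceCfg is illustrative and depends on your actual job configuration:

```python
# Hypothetical minimal jobOptions sketch; the flag and service setup
# around VTuneProfilerServiceCfg is illustrative, not prescriptive.
from AthenaConfiguration.AllConfigFlags import initConfigFlags
from AthenaConfiguration.MainServicesConfig import MainServicesCfg
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg

flags = initConfigFlags()
flags.Exec.MaxEvents = 20  # illustrative event count
flags.lock()

cfg = MainServicesCfg(flags)
# Resume sampling at event 5 (inclusive), pause again at event 15 (exclusive)
cfg.merge(VTuneProfilerServiceCfg(flags, ResumeEvent=5, PauseEvent=15))
cfg.run()
```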
If the user wants to profile a specific algorithm, they can instead add
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
configFlags.PerfMon.VTune.ProfiledAlgs = ["foo", "bar"]
cfg.merge(VTuneProfilerServiceCfg(configFlags))
where foo and bar (exact match) are the algorithms to be profiled (profiling starts before and stops after each call to execute()). Again, this makes the most sense in a serial or single-threaded Athena job. For multi-threaded jobs, depending on the configuration, you might get contributions from algorithms in other parallel events in flight (VTune profiles the entire process).
Then the sampling should be started in “paused state” as:
$ vtune -mrte-mode=native -collect hotspots -start-paused -- athena --threads 1 my_job_options.py
VTune through the job transform
It is also possible to run VTune through the job transform. Three main flags control the behavior:
vtune: A boolean flag that toggles on/off the job execution under VTune
vtuneDefaultOpts: A boolean flag that toggles on/off the default (hardcoded) VTune options
vtuneExtraOpts: A comma-separated list of additional VTune arguments
By default, running your favorite transform job with the --vtune="True" flag will give you a hotspots analysis result. It is possible to collect a different analysis type with the extra-options flag, e.g., --vtuneExtraOpts="-collect=threading". By default, we use the following VTune options: -run-pass-thr=--no-altstack and -mrte-mode=native. If you do not want them, you can simply pass --vtuneDefaultOpts="False".
Note that you still have to set up VTune (before Athena) and might need to preload your favorite libraries, e.g., tcmalloc, by hand when you run VTune with this method. This is typically as simple as setting the right environment variable before running the job, e.g., export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so".
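As a sketch, an end-to-end transform run under VTune might then look like the following (the AMI tag, event count, and analysis type are purely illustrative):

```shell
# Set up VTune first (before Athena) and preload tcmalloc + Intel math libs
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh
export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so"
# Run the transform under VTune, collecting a threading analysis
Reco_tf.py --AMI q445 --maxEvents 10 --outputRDOFile myRDO.pool.root \
    --vtune="True" --vtuneExtraOpts="-collect=threading"
```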