Intel VTune Profiler is a commercial application for software performance analysis of 32- and 64-bit x86-based machines. It is among a range of Intel tools that are installed and available at CERN through CVMFS. Since the compilers and performance tools are installed on CVMFS, they are available from any CVMFS-enabled Linux machine at CERN.
VTune
The basic usage requires sourcing the necessary setup script and passing your executable to vtune (historically amplxe-cl). In practice, this means:
# Setup Intel Tools
# Do this before running asetup to avoid python clashes
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
# Setup the latest 24.0 Athena nightly for Reco_tf.py
lsetup "asetup Athena,24.0,latest"
# Run a simple q445 job with 1 event to generate runargs.HITtoRDO.py
Reco_tf.py --AMI q445 --perfmon none --outputRDOFile myRDO.pool.root --maxEvents 1
# Change the number of events in the generated runargs.HITtoRDO.py as needed
# Run profiling
vtune -mrte-mode=native -collect hotspots $(which athena.py) -- preloadlib=$ATLASMKLLIBDIR_PRELOAD/libintlc.so.5:$ATLASMKLLIBDIR_PRELOAD/libimf.so runargs.HITtoRDO.py
May 20, 2021: With VTune 2021.2+ and Python 3.7+, trying to profile a Python job that uses cppyy (i.e., all athena.py jobs) makes the process hang. (See ATLINFR-4105 for a more detailed description of the problem.) To circumvent this issue, the -mrte-mode=native flag, as shown above, is needed.
This should produce a folder called r000hs containing a file called r000hs.vtune, which holds the profiling output. The results can be visualized using the GUI via:
# Invoke the GUI for visualization
vtune-gui r000hs/r000hs.vtune
The “Summary” tab contains a table called “Top Hotspots” that shows the five functions that used the most CPU time. The tabs named “Bottom-up”, “Caller/Callee”, and “Top-down Tree” contain more detailed information. Most users will be interested in the “Bottom-up” tab, where the functions/call stacks are ordered by CPU time used. Clicking on a given function shows its call stack in the right-hand panel, i.e.:
It is also possible to create call graphs from VTune output. First, obtain the Python script gprof2dot from this GitHub project and then:
vtune -report gprof-cc -result-dir output -format text -report-output output.txt
gprof2dot.py -f axe output.txt | dot -Tpng -o output.png
Some useful arguments for the latter command are --strip and --color-nodes-by-selftime. The latter colors nodes by self time, similar to what GPerfTools does by default, and the former strips function and template parameters from the demangled function names.
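Putting this together, the full call-graph pipeline with those flags might look like the following sketch (assuming the result directory is r000hs from above, gprof2dot.py is in your PATH, and Graphviz dot is installed):

```shell
# Export the VTune result to gprof-like text, then render a call graph.
# Nodes are colored by self time; function/template parameters are stripped.
vtune -report gprof-cc -result-dir r000hs -format text -report-output output.txt
gprof2dot.py -f axe --strip --color-nodes-by-selftime output.txt | dot -Tpng -o output.png
```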
More useful information can be obtained at:
TriggerProfiling twiki
If you’re running a Hotspots analysis, vtune uses software-based sampling by default. If you have an Intel chip and install the sep drivers (as root) as described on this webpage, you can enable hardware-based sampling via:
$ vtune -mrte-mode=native -collect hotspots -knob sampling-mode=hw ...;
This mode will provide additional information as described on this webpage.
We have a service called PerfMonVTune that allows users to profile either a list of specific algorithms or a range of events in Athena jobs. The current implementation doesn’t allow mixing these two (i.e. profiling a specific algorithm in a range of events) but that can be easily provided if there is demand.
PerfMonVTune isn’t built as part of the nightlies since it relies on VTune, which is not provided by default. However, the user can clone the package and easily compile it on top of VTune and a main Athena nightly as:
$ source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh;
$ lsetup "asetup Athena,main,latest" "git";
$ git atlas init-workdir https://:@gitlab.cern.ch:8443/atlas/athena.git -p PerfMonVTune;
$ mkdir build; cd build;
$ cmake ../athena/Projects/WorkDir; cmake --build .;
$ source x86_64-*/setup.sh; cd ..;
Then, to profile a range of events, the user can add the following snippet to their jobOptions (jO):
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
cfg.merge(VTuneProfilerServiceCfg(configFlags, ResumeEvent = 5, PauseEvent = 15))
which will profile the entire job between the 5th (inclusive) and 15th (exclusive) events. Of course, this only makes sense in a serial or single-threaded Athena job. For multi-threaded jobs, depending on the configuration, you might get contributions from other parallel events in flight (VTune profiles the entire process).
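For orientation, a minimal ComponentAccumulator-style jobOptions using this snippet might look like the sketch below; everything around VTuneProfilerServiceCfg is illustrative and depends on your actual job configuration:

```python
# Hypothetical minimal jobOptions sketch; the flag and service setup
# around VTuneProfilerServiceCfg is illustrative, not prescriptive.
from AthenaConfiguration.AllConfigFlags import initConfigFlags
from AthenaConfiguration.MainServicesConfig import MainServicesCfg
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg

flags = initConfigFlags()
flags.Exec.MaxEvents = 20  # illustrative event count
flags.lock()

cfg = MainServicesCfg(flags)
# Resume sampling at event 5 (inclusive), pause again at event 15 (exclusive)
cfg.merge(VTuneProfilerServiceCfg(flags, ResumeEvent=5, PauseEvent=15))
cfg.run()
```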
If the user wants to profile a specific algorithm, they can instead add
from PerfMonVTune.PerfMonVTuneConfig import VTuneProfilerServiceCfg
configFlags.PerfMon.VTune.ProfiledAlgs = ["foo", "bar"]
cfg.merge(VTuneProfilerServiceCfg(configFlags))
where foo and bar (exact match) are the algorithms to be profiled (profiling starts before and stops after each call to execute()). Again, this makes the most sense in a serial or single-threaded Athena job. For multi-threaded jobs, depending on the configuration, you might get contributions from algorithms in other parallel events in flight (VTune profiles the entire process).
Then the sampling should be started in “paused state” as:
$ vtune -mrte-mode=native -collect hotspots -start-paused -- athena --threads 1 my_job_options.py
VTune through the job transform
It is also possible to run VTune through the job transform. Three main flags control the behavior:
vtune: A boolean flag that toggles on/off the job execution under VTune
vtuneDefaultOpts: A boolean flag that toggles on/off the default (hardcoded) VTune options
vtuneExtraOpts: A comma-separated list of additional VTune arguments
By default, running your favorite transform job with the --vtune="True" flag will give you a hotspots analysis result. It is possible to collect a different analysis type with the extra-options flag, e.g., --vtuneExtraOpts="-collect=threading". By default, we use the following VTune options: -run-pass-thr=--no-altstack and -mrte-mode=native. If you do not want them, you can simply pass --vtuneDefaultOpts="False".
Note that you still have to set up VTune (before Athena) and might need to preload your favorite libraries, e.g., tcmalloc, by hand when you run VTune with this method. This is typically as simple as setting the right environment variable before running the job, e.g., export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so".
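As a sketch, an end-to-end transform run under VTune might then look like the following (the AMI tag, event count, and analysis type are purely illustrative):

```shell
# Set up VTune first (before Athena) and preload tcmalloc + Intel math libs
source /cvmfs/projects.cern.ch/intelsw/oneAPI/linux/all-setup.sh
export LD_PRELOAD="${TCMALLOCDIR}/libtcmalloc_minimal.so:${ATLASMKLLIBDIR_PRELOAD}/libimf.so"
# Run the transform under VTune, collecting a threading analysis
Reco_tf.py --AMI q445 --maxEvents 10 --outputRDOFile myRDO.pool.root \
    --vtune="True" --vtuneExtraOpts="-collect=threading"
```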