Creating and running our steering macro

Last update: 22 Jun 2022 [History] [Edit]

To actually run this EventLoop algorithm we need some steering code. This can be a root macro in either C++ or python or some compiled C++ code. For this tutorial we will focus on writing a python macro, as that will be required to include the common CP algorithms, but the other options are equally valid (if you don’t intend to use the common CP algorithms).

tip This is only needed when working in EventLoop, inside Athena this will not be used. As such if you know that you will never work in EventLoop you can leave this out (or add it later). However, if you expect never to run in EventLoop you may be better off using the native athena algorithms instead of the AnaAlgorithm class.

We will use another ASG tool called SampleHandler which is a nice tool that allows for easy sample management. In this example we will create and configure a SampleHandler object. We will specify the path to the main directory, under which there could be several subdirectories (typically representing datasets) and within those the individual input files. Here we will tell SampleHandler we are only interested in one input xAOD file (specified by the exact name, but wildcards are accepted to using several specific inputs). More information and options for using SampleHandler to ‘find’ your data is found on the dedicated SampleHandler wiki.

You can really put your steering code anywhere, but it is probably a good idea to keep it in your source area (which is in version control), probably even in your package (typically in the share directory). However, for simplicity within this tutorial we will just place it directly into the source directory.

Writing a Python macro

Create a file called source/MyAnalysis/share/ATestRun_eljob.py, and make it executable (chmod +x source/MyAnalysis/share/ATestRun_eljob.py). Fill it with the following:

#!/usr/bin/env python

# Read the submission directory as a command line argument. You can
# extend the list of arguments with your private ones later on.
import optparse
parser = optparse.OptionParser()
parser.add_option( '-s', '--submission-dir', dest = 'submission_dir',
                   action = 'store', type = 'string', default = 'submitDir',
                   help = 'Submission directory for EventLoop' )
( options, args ) = parser.parse_args()

# Set up (Py)ROOT.
import ROOT
ROOT.xAOD.Init().ignore()

# Set up the sample handler object. See comments from the C++ macro
# for the details about these lines.
import os
sh = ROOT.SH.SampleHandler()
sh.setMetaString( 'nc_tree', 'CollectionTree' )
inputFilePath = os.getenv( 'ALRB_TutorialData' ) + '/mc21_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_PHYS.e8357_s3802_r13508_p5057/'
ROOT.SH.ScanDir().filePattern( 'DAOD_PHYS.28625583._000007.pool.root.1' ).scan( sh, inputFilePath )
sh.printContent()

# Create an EventLoop job.
job = ROOT.EL.Job()
job.sampleHandler( sh )
job.options().setDouble( ROOT.EL.Job.optMaxEvents, 500 )
job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'unique-link')

# Create the algorithm's configuration.
from AnaAlgorithm.DualUseConfig import createAlgorithm
alg = createAlgorithm ( 'MyxAODAnalysis', 'AnalysisAlg' )

# later on we'll add some configuration options for our algorithm that go here

# Add our algorithm to the job
job.algsAdd( alg )

# Run the job using the direct driver.
driver = ROOT.EL.DirectDriver()
driver.submit( job, options.submission_dir )

Read over the comments carefully to understand what is happening. Notice that we will only run over the first 500 events (for testing purposes). Obviously if you were doing a real analysis you would want to remove that statement to run over all events in a sample.

tip The way of creating an algorithm we are showing you above is the dual-use way, i.e. it is the same in EventLoop and Athena. An alternative EventLoop-only way of creating the algorithm is to use:

from AnaAlgorithm.AnaAlgorithmConfig import AnaAlgorithmConfig
alg = AnaAlgorithmConfig( 'MyxAODAnalysis/AnalysisAlg' )

Add the following lines to MyAnalysis/CMakeLists.txt to enable the use of your macro:

# Install files from the package:
atlas_install_scripts( share/*_eljob.py )

To make sure that the newly used file gets installed and can be found we need to recompile (we need to call cmake explicitly as we did create a new file):

cd ../build/
cmake ../source/
make

tip Don’t forget to run source x86_64-*/setup.sh

To execute the job using this script, go to your run directory, and simply execute your macro:

cd ../run
ATestRun_eljob.py --submission-dir=submitDir

tip

If your algorithm does not run, make sure that you have defined the environment variable ALRB_TutorialData, as explained here.

If it still doesn’t run, there is sometimes an issue with the “Shebang” line. You can just override this and directly run with python using python ../build/x86_64-centos7-gcc8-opt/bin/ATestRun_eljob.py --submission-dir=submitDir

tip

Note that submitDir is the directory/location where the output of your job is stored. We set the mode for the directory to “unique-link”, which means that EventLoop will attach the date and time to that name to make it unique, and then create a link that points to the latest directory created. That way it is guaranteed that outputs from your job don’t get overwritten when you re-run your job, while at the same time making it easy for you to find the latest result.

For test runs this is generally a good setup, for actual production runs you may want to put some more thought in how you organize your output directories. If you want to avoid appending a unique suffix to your directory name you can switch “unique-link” above with “no-clobber”, which will just take submitDir as the actual directory name, and fail if the directory already exists.

tip While you are in principle free where you put your submitDir, avoid putting them into the source directory, as that is usually version controlled and you risk your data files being added to the repository (which is bad). Also avoid putting them into the build directory, as you often want to keep the contents of submitDir around, while the build directory should only contain files you don’t mind losing. Putting it inside the run directory is a reasonable choice if you have enough space there, but if it ends up containing large files you may need to put it onto a separate data disk. If you run in batch you may also need to put it inside a directory that is accessible from the worker nodes.


⭐️ Bonus Exercise

  • Create a second instance of your algorithm (with a different name) and add it to the job. Can you see if the two algorithms are running in series or in parallel?
  • Add a command line option that allows you to change your input filepath