Configuring your algorithm

Last update: 20 Nov 2024

To actually run this algorithm we need some way to configure and steer the job. When running in EventLoop, this is done in the form of a steering macro. When running in Athena, this is done with a jobOptions file, similar to what was used in the MC Generation section. A steering macro and a jobOptions file are both included in the MyAnalysis repository. This section will walk you through understanding and running both, but it is recommended to use EventLoop for your first time through this tutorial.

tip Your steering macro and/or jobOptions can be stored anywhere, but it is good practice to keep them in your package, typically under share.

dir tutorial/AnalysisTutorial/source/MyAnalysis

EventLoop steering macro

The steering macro can be a ROOT macro in C++ or Python, or compiled C++ code. The latest recommendation, followed in this tutorial, is to use a Python macro. The macro for our algorithm can be seen here or in your local version of MyAnalysis as share/ATestRun_eljob.py. This section highlights some important parts of the macro; further exploration is left as an exercise for the reader.

Looking at the macro

The macro is called share/ATestRun_eljob.py. In order for it to be called directly, it needs to be executable. The permissions are set correctly for ATestRun_eljob.py, but if you make another similar macro, you need to call chmod +x share/<macro_name>.py.

tip The following line in CMakeLists.txt adds the macro to $PATH so it can be called from the command line with just ATestRun_eljob.py:

atlas_install_scripts( share/*_eljob.py )

The first part of the macro we will look at is getting the input file(s). This is done using SampleHandler, a tool that provides numerous methods of defining and finding input files. The implementation used in our example creates a local sample object using the filename directly:

# Set up the SampleHandler object to handle the input files
sh = ROOT.SH.SampleHandler()

# Set the name of the tree in our files; in the xAOD, the TTree
# containing the EDM containers is called "CollectionTree"
sh.setMetaString( 'nc_tree', 'CollectionTree' )

# Select the sample associated with the data type used
if dataType not in ["data", "mc"]:
    raise Exception (f"invalid data type: {dataType}")
if dataType == 'mc':
    testFile = os.getenv ('ALRB_TutorialData')+'/mc20_13TeV.312276.aMcAtNloPy8EG_A14N30NLO_LQd_mu_ld_0p3_beta_0p5_2ndG_M1000.deriv.DAOD_PHYS.e7587_a907_r14861_p6117/DAOD_PHYS.37791038._000001.pool.root.1'
else:
    testFile = os.getenv('ASG_TEST_FILE_DATA')

# Use SampleHandler to get the sample from the defined location
sample = ROOT.SH.SampleLocal("dataset")
sample.add (testFile)
sh.add (sample)

tip You can add multiple input files to your job by repeating the add command.

tip SampleHandler offers several methods to add files for local running, including an option (ScanDir) to scan a directory and find files matching a pattern. More details are available here.
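
A minimal sketch of both approaches (the extra file name anotherTestFile and the directory path are illustrative placeholders):

# Add a second input file to the same sample by repeating the add call
sample.add (anotherTestFile)

# Or let SampleHandler scan a directory and pick up all files matching a pattern
ROOT.SH.ScanDir().filePattern( '*.pool.root*' ).scan( sh, '/path/to/my/data' )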

The macro allows you to set parameters for your job, such as whether you are running over Monte Carlo or detector data. This is done in the macro with the lines:

# Set data type
dataType = "mc" # "mc" or "data"

Next, the job is created, the SampleHandler object is added to it, and some options are specified:

# Create an EventLoop job.
job = ROOT.EL.Job()
job.sampleHandler( sh )

# Add some options for the job
job.options().setDouble( ROOT.EL.Job.optMaxEvents, 500 )
job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'unique-link')

The first option tells the job to run over only 500 events, which is useful for testing but not for actual analysis jobs. When running over full datasets, set this option to -1 or remove it entirely; either way, no event limit is applied.
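
For example, when running over a full dataset the option line becomes:

# Process all events in the input sample
job.options().setDouble( ROOT.EL.Job.optMaxEvents, -1 )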

The second option modifies the naming convention for the output directory from your job. The unique-link option causes a unique timestamp to be appended to the output directory name and a link is created to point at the latest directory. This prevents your outputs from being overwritten the next time the job is re-run.

tip The unique-link option is useful for local testing, but not for full production runs. Turn off this behavior by setting the option to no-clobber; then, if the specified output directory already exists, the job fails instead of overwriting it.
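
For instance, changing the option shown above to:

# Fail instead of overwriting if the submit directory already exists
job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'no-clobber' )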

Finally, the driver is specified and the job is submitted:

# Run the job using the direct driver.
driver = ROOT.EL.DirectDriver()
driver.submit( job, options.submission_dir )

warning Make sure that anything you want to configure for your job is added before these lines submitting the job, otherwise it won’t be picked up.

In this case, we are using the direct driver to run locally. Other drivers (such as for running on batch systems or the grid) are also available. More details about the available drivers can be found in the Analysis Tools guide.
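
As a rough sketch (not part of the tutorial macro), submitting the same job to the grid with the PrunDriver from the EventLoopGrid package might look like the following; the output sample name pattern and username are illustrative placeholders:

# Submit to the grid instead of running locally
driver = ROOT.EL.PrunDriver()
# Naming pattern for the output datasets; '<username>' is a placeholder
driver.options().setString( 'nc_outputSampleName', 'user.<username>.tutorial.%in:name[2]%' )
driver.submitOnly( job, options.submission_dir )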

Read through the macro carefully to understand what it is doing.

Finally, run a quick test to prove to yourself that this job works!

cd ../run
ATestRun_eljob.py

One of the last messages you see in the terminal should say “worker finished successfully”.

Athena jobOptions (optional)

The jobOptions used to run your algorithm follow the same principles as the jobOptions you used for MC event generation, but are different because this is a fundamentally different use-case. You can find more details about using a jobOptions file for running analyses in Athena in the main Athena tutorial.

The jobOptions for this tutorial are available here.

Looking at the jobOptions

The jobOptions file is called share/ATestRun_jobOptions.py. It can be kept anywhere, but it is a good idea to keep it in your source area, probably in your package.

tip The following line in CMakeLists.txt enables the use of the jobOptions by adding them to the $JOBOPTSEARCHPATH:

atlas_install_joboptions( share/*_jobOptions.py )

The first part we will look at is getting the input file(s):

# Select the sample associated with the data type used
if dataType not in ["data", "mc"]:
    raise Exception (f"invalid data type: {dataType}")
if dataType == 'mc':
    testFile = os.getenv ('ALRB_TutorialData')+'/mc20_13TeV.312276.aMcAtNloPy8EG_A14N30NLO_LQd_mu_ld_0p3_beta_0p5_2ndG_M1000.deriv.DAOD_PHYS.e7587_e7400_a907_r14861_r14919_p6026/DAOD_PHYS.37773721._000001.pool.root.1'
else:
    testFile = os.getenv('ASG_TEST_FILE_DATA')

# Override next line on command line with: --filesInput=XXX
jps.AthenaCommonFlags.FilesInput = [testFile]
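
For example, to run over a different file without editing the jobOptions (the path here is a placeholder):

athena MyAnalysis/ATestRun_jobOptions.py --filesInput=/path/to/my/file.root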

Next, the algorithm is added to the Athena sequence (this is analogous to submitting an EventLoop job to a driver):

# Add our algorithm to the main alg sequence
athAlgSeq += alg

Finally, some options are specified:

# Limit the number of events (for testing purposes)
theApp.EvtMax = 500

# Optional include for reducing printout from athena
include("AthAnalysisBaseComps/SuppressLogging.py")
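
As with the EventLoop optMaxEvents option, the limit can be lifted when running over a full dataset:

# Process all events in the input file(s)
theApp.EvtMax = -1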

Read through the jobOptions file carefully to understand what it does.

Finally, run a quick test to prove to yourself that this job works!

cd ../run
athena MyAnalysis/ATestRun_jobOptions.py - -c "../source/MyAnalysis/data/config.yaml"

This uses a default configuration provided in the MyAnalysis package. The last message you see in the terminal should say “successful run”.


⭐️ Bonus Exercise

  • Add a command line option that allows you to change your input filepath
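
One possible starting point is Python's argparse module. This is only a hedged sketch: the option name --input-file is illustrative, and the macro already parses some options (e.g. options.submission_dir), so in practice you would extend its existing parser.

import argparse

# Sketch: accept an optional input file path on the command line
parser = argparse.ArgumentParser()
parser.add_argument( '--input-file', default=None,
                     help='override the default test input file' )
args = parser.parse_args()

# Later, when choosing the input file:
if args.input_file is not None:
    testFile = args.input_file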