Configuring and running your algorithm

Last update: 14 Nov 2022

To actually run this algorithm we need some way to configure and steer the job. When running in EventLoop, this is done in the form of a steering macro. When running in Athena, this is done with a jobOptions file, similar to what was used in the MC Generation section. A steering macro and a jobOptions file are both included in the AnalysisTutorial repository. This section will walk you through understanding and running both, but it is recommended to use EventLoop for your first time through this tutorial.

tip Your steering macro and/or jobOptions can be stored anywhere, but it is good practice to keep them in your package, typically under share.

EventLoop steering macro

The steering macro can be a ROOT macro in C++ or Python, or compiled C++ code. The latest recommendations, which are followed in this tutorial, are to use a Python macro. The macro for our algorithm can be seen here or in your local version of MyAnalysis as share/ATestRun_eljob.py. This section highlights some important parts of the macro; further exploration is left as an exercise for the reader.

Looking at the macro

The macro is called source/MyAnalysis/share/ATestRun_eljob.py. In order for it to be called directly, it needs to be executable. The permissions are set correctly for ATestRun_eljob.py, but if you make another similar macro, you need to call chmod +x source/MyAnalysis/share/<macro_name>.py.

tip The following line in MyAnalysis/CMakeLists.txt enables the use of the macro:

atlas_install_scripts( share/*_eljob.py )

The first part of the macro we will look at is getting the input file(s). This is done using SampleHandler, a tool that provides numerous methods of defining and finding input files. The implementation used in our example creates a local sample object using the filename directly (currently stored as the environment variable ASG_TEST_FILE_MC):

# Set up the SampleHandler object to handle the input files
sh = ROOT.SH.SampleHandler()

# Set the name of the tree in our files;
# in the xAOD the TTree containing the EDM containers is "CollectionTree"
sh.setMetaString( 'nc_tree', 'CollectionTree' )

# Use SampleHandler to get the sample from the defined location
sample = ROOT.SH.SampleLocal("dataset")
sample.add (os.getenv ('ASG_TEST_FILE_MC'))  
sh.add (sample)

tip The astute observer may note that ASG_TEST_FILE_MC points to a ttbar sample. This is due to a small issue with the existing LQ signal samples. We are working on fixing the issue and will integrate them into this step in the tutorial as soon as possible. In the meantime, please use the available ttbar sample. The methods taught in this part of the tutorial are independent of the input sample.

tip You can add multiple input files to your job by repeating the add command.
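As a minimal sketch, adding a second file only takes another call to add on the same sample before it is handed to the SampleHandler object (the second filename below is purely illustrative):

sample = ROOT.SH.SampleLocal("dataset")
sample.add (os.getenv ('ASG_TEST_FILE_MC'))
sample.add ('/path/to/another.file.root')  # hypothetical second input file
sh.add (sample)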

tip SampleHandler offers several methods to add files for local running, including an option (ScanDir) to scan a directory and find files matching a pattern. More details are available here.
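As an illustration, a ScanDir-based setup might look like the sketch below; the directory path and file pattern are assumptions made for the example:

# Scan a local directory and add all matching files to the SampleHandler
ROOT.SH.ScanDir().filePattern( '*.pool.root*' ).scan( sh, '/path/to/my/datasets' )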

The macro allows you to set parameters for your job, such as whether you are running over Monte Carlo or detector data. This is done in the macro with the lines:

# Set data type
dataType = "mc" # "mc" or "data"

Next, the job is created, the SampleHandler object is added to it, and some options are specified:

# Create an EventLoop job.
job = ROOT.EL.Job()
job.sampleHandler( sh )

# Add some options for the job
job.options().setDouble( ROOT.EL.Job.optMaxEvents, 500 )
job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'unique-link')

The first option tells the job to run over only 500 events, which is useful for testing but not for actual analysis jobs. When running over full datasets, set this option to -1, or remove it entirely; in both cases no limit is placed on the number of events.
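For example, for a full run the line would become:

# Process all events in the input sample
job.options().setDouble( ROOT.EL.Job.optMaxEvents, -1 )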

The second option modifies the naming convention for the output directory of your job. The unique-link option appends a unique timestamp to the output directory name and creates a symbolic link pointing at the latest directory. This prevents your outputs from being overwritten when the job is re-run.

tip The unique-link option is useful for local testing, but not for full production runs. Turn off this behavior by setting the option to no-clobber instead; in that mode, the job will fail if the specified output directory already exists.
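That is, for a production run you would use:

# Fail instead of appending a timestamp if the output directory exists
job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'no-clobber' )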

Finally, the driver is specified and the job is submitted:

# Run the job using the direct driver.
driver = ROOT.EL.DirectDriver()
driver.submit( job, options.submission_dir )

tip Make sure that anything you want to do to configure your job is added before these lines submitting the job, otherwise it won't be picked up.

In this case, we are using the direct driver to run locally. Other drivers (such as for running on batch systems or the grid) are also available. More details about the available drivers can be found in the Analysis Tools guide.
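As an illustration, submitting the same job to the grid only requires swapping the driver. The sketch below assumes a working grid setup (grid credentials and panda tools), and the output sample name is a placeholder:

# Run on the grid instead of locally
driver = ROOT.EL.PrunDriver()
driver.options().setString( 'nc_outputSampleName', 'user.<username>.MyAnalysis.v1' )
driver.submit( job, options.submission_dir )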

Read through the macro carefully to understand what it is doing.

Running your job in EventLoop

To execute the job using this script, go to your run directory, and execute your macro with the command:

ATestRun_eljob.py --submission-dir=submitDir

tip If your algorithm does not run, there could be an issue with the "shebang" line. You can override this and run directly with Python: python ../build/x86_64-centos7-gcc8-opt/bin/ATestRun_eljob.py --submission-dir=submitDir.

tip While you are in principle free to choose where to put your submitDir, avoid putting it in the source directory: that is usually version controlled, and you risk your data files being added to the repository (which is bad). Also avoid putting it in the build directory, as you often want to keep the contents of submitDir around, while the build directory should only contain files you don't mind losing. Putting it inside the run directory is a reasonable choice if you have enough space there; if it ends up containing large files, you may need to put it on a separate data disk. If you run on a batch system, you may also need to put it inside a directory that is accessible from the worker nodes.

Athena jobOptions (optional)

The jobOptions used to run your algorithm follow the same principles as the jobOptions you used for MC event generation, but differ because this is a fundamentally different use case. You can find more details about using jobOptions for running analyses in Athena in the main Athena tutorial.

The jobOptions for this tutorial are available here.

Looking at the jobOptions

The jobOptions file is called MyAnalysis/share/ATestRun_jobOptions.py. It can be kept anywhere, but it is a good idea to keep it in your source area, probably in your package.

tip The following line in MyAnalysis/CMakeLists.txt enables the use of the jobOptions:

atlas_install_joboptions( share/*_jobOptions.py )

The first part we will look at is getting the input file(s):

# Specify local input file name
testFile = os.getenv('ASG_TEST_FILE_MC')

# Override next line on command line with: --filesInput=XXX
jps.AthenaCommonFlags.FilesInput = [testFile]

Next, the algorithm is added to the Athena sequence (this is analogous to submitting an EventLoop job to a driver):

# Add our algorithm to the main alg sequence
athAlgSeq += alg
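For context, the alg object is created earlier in the jobOptions. A minimal sketch of that step, assuming the dual-use configuration helper used elsewhere in this tutorial, looks like:

# Create the algorithm instance before adding it to the sequence
from AnaAlgorithm.DualUseConfig import createAlgorithm
alg = createAlgorithm( 'MyxAODAnalysis', 'AnalysisAlg' )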

Finally, some options are specified:

# Limit the number of events (for testing purposes)
theApp.EvtMax = 500

# Optional include for reducing printout from athena
include("AthAnalysisBaseComps/SuppressLogging.py")

Read through the jobOptions carefully to understand what it is doing.

Running your job in Athena

Go to your run directory and execute your jobOptions with Athena using the following command:

athena MyAnalysis/ATestRun_jobOptions.py

tip You can override many of the options specified in the jobOptions when calling the athena command. For example, you can set the number of events to process with the --evtMax option (-1 is the default value and causes all events to be processed):

athena MyAnalysis/ATestRun_jobOptions.py --evtMax=-1 

Or you can override the input files used with the --filesInput option:

athena MyAnalysis/ATestRun_jobOptions.py --filesInput=another.file.root 

⭐️ Bonus Exercise

  • Create a second instance of your algorithm (with a different name) and add it to the job. Can you tell whether the two algorithms run in series or in parallel?
  • Add a command line option that allows you to change your input file path (a starting sketch follows below).
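For the second exercise, the sketch below extends a steering macro with an argparse option; the option names and defaults are assumptions, and you would merge this with the macro's existing option handling:

# Illustrative command line options for the steering macro
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument( '--submission-dir', dest='submission_dir', default='submitDir' )
parser.add_argument( '--input-file', dest='input_file',
                     default=os.getenv( 'ASG_TEST_FILE_MC' ) )
options = parser.parse_args()

# ... later, instead of the hard-coded environment variable:
# sample.add (options.input_file)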