Further Job Configuration

Last update: 16 Aug 2024

Apart from the basic job configuration, each driver can support a variety of custom options, as can the algorithms. The job and the driver object both have an options field that can be used for that purpose. While the job is running, those options get merged into the sample meta-data, with the job options taking precedence over the driver options, and the actual sample meta-data taking precedence over both. For options that have to be set for the entire job, the sample meta-data is ignored.

The practical motivation for this is that it allows you to set options for your entire job, but then override them for specific samples. Here are some examples of how to set options:

sample->setMetaDouble ("option", value);
sh.get("sampleName")->setMetaDouble ("option", value);
sh.setMetaDouble ("option", value);
job.options()->setDouble ("option", value);
driver.options()->setDouble ("option", value);
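
To make the precedence rule concrete, here is a minimal sketch of the merging logic using plain standard-library containers. The names `Options` and `mergeOptions` are hypothetical and not part of EventLoop, which stores options in its own meta-data objects. The sketch also does not model the job-wide options for which sample meta-data is ignored.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an option set; EventLoop itself does not
// store its options in a std::map.
using Options = std::map<std::string, double>;

// Merge the three option layers. std::map::insert never overwrites an
// existing key, so inserting from highest to lowest precedence keeps
// the highest-precedence value for every option name.
Options mergeOptions (const Options& sampleMeta,
                      const Options& jobOptions,
                      const Options& driverOptions)
{
  Options merged = sampleMeta;                                 // highest precedence
  merged.insert (jobOptions.begin(), jobOptions.end());        // fills gaps only
  merged.insert (driverOptions.begin(), driverOptions.end());  // lowest precedence
  return merged;
}
```

With this sketch, an option set on all three layers keeps its sample value, while an option set only on the job and the driver keeps the job value.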

For most of the standard options, the option names are defined in the class EL::Job. So if you have a slow day, you may want to browse through the header file EventLoop/Job.hh. Most of the driver-specific options are described below with the individual drivers. Collected here are some more generic options.

Omitting Events for Test Runs

During debugging runs you often do not want to process all events in a sample, in order to save time. Most commonly you just want to limit the number of events processed per sample (supported for DirectDriver, ProofDriver):

job.options()->setDouble (EL::Job::optMaxEvents, 1000);

Or, if you are debugging a particular event that gives you trouble, you may want to skip the events leading up to it (supported for DirectDriver, ProofDriver):

job.options()->setDouble (EL::Job::optSkipEvents, 10000);
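
If both options are set, it helps to know which entries actually get processed. The following is only a sketch of one plausible reading, in which the limit is applied after the skipped events and a limit of 0 means "no limit". Both of these are assumptions made for illustration, and `eventRange` is a hypothetical helper, not an EventLoop function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Returns the half-open entry range [first, last) to process for one sample.
// Assumptions of this sketch: maxEvents counts from the first event that is
// not skipped, and maxEvents == 0 means "no limit".
std::pair<std::int64_t, std::int64_t>
eventRange (std::int64_t totalEvents, std::int64_t skipEvents,
            std::int64_t maxEvents)
{
  assert (totalEvents >= 0 && skipEvents >= 0 && maxEvents >= 0);
  // Skipping more events than the sample contains yields an empty range.
  const std::int64_t first = std::min (skipEvents, totalEvents);
  const std::int64_t last = (maxEvents == 0)
    ? totalEvents
    : std::min (first + maxEvents, totalEvents);
  return {first, last};
}
```

Under these assumptions, skipping 10000 events and limiting to 1000 in a sample of 50000 events would process entries 10000 up to, but not including, 11000.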

Using TTreeCache

TTreeCache is a mechanism for speeding up the reading of files, particularly over the network. To turn TTreeCache on, you first need to select a cache size (in this case 10 MB):

job.options()->setDouble (EL::Job::optCacheSize, 10*1024*1024);

By default, TTreeCache looks at the first 10 events of a file to determine your access pattern and then uses that to optimize access for the rest. If you want to, you can change that number to suit your needs:

job.options()->setDouble (EL::Job::optCacheLearnEntries, 20);

Automatically Removing Submission Directories

Since EventLoop stores the job configuration, most of the output files and various temporary files in the submission directory, this directory has to be unique to each job. You can configure how EventLoop handles multiple jobs using the same name for the submission directory:

  • no-clobber: abort if the directory already exists
  • overwrite: remove the existing directory and recreate it (dangerous and not recommended)
  • unique: append a unique suffix to the directory name, ensuring that it is unique
  • unique-link: like unique, but it creates a link under the originally requested name that points to the latest submission directory created, meaning you can still access any files created under the name you originally requested.

At this time we recommend using the unique-link option, which is also what we used in the tutorial (shown here in Python):

job.options().setString( ROOT.EL.Job.optSubmitDirMode, 'unique-link')
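
The four modes can be summarized as a small decision function. This is a sketch, not EventLoop's implementation: `SubmitDirMode`, `SubmitDirPlan` and `planSubmitDir` are hypothetical names, the set of existing directories is passed in rather than read from disk, and the "-N" counter suffix is an assumption chosen only to keep the example deterministic.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

enum class SubmitDirMode { NoClobber, Overwrite, Unique, UniqueLink };

// Outcome of resolving a requested directory name against existing ones.
struct SubmitDirPlan {
  std::string directory;        // directory the job would actually use
  bool removeExisting = false;  // true if an old directory must be deleted first
  std::string linkName;         // if non-empty, a link of this name points at `directory`
};

SubmitDirPlan planSubmitDir (const std::string& requested,
                             const std::set<std::string>& existing,
                             SubmitDirMode mode)
{
  SubmitDirPlan plan;
  switch (mode) {
  case SubmitDirMode::NoClobber:
    // Abort if the directory already exists.
    if (existing.count (requested))
      throw std::runtime_error ("submission directory already exists: " + requested);
    plan.directory = requested;
    break;
  case SubmitDirMode::Overwrite:
    // Reuse the name; an existing directory has to be removed first.
    plan.directory = requested;
    plan.removeExisting = existing.count (requested) > 0;
    break;
  case SubmitDirMode::Unique:
  case SubmitDirMode::UniqueLink: {
    // Hypothetical suffix scheme: the first "-N" counter that is still free.
    int counter = 1;
    do {
      plan.directory = requested + "-" + std::to_string (counter++);
    } while (existing.count (plan.directory));
    // unique-link additionally keeps the requested name as a link.
    if (mode == SubmitDirMode::UniqueLink)
      plan.linkName = requested;
    break;
  }
  }
  return plan;
}
```

Separating the decision from the file-system operations keeps the sketch easy to check by hand. A real implementation would also have to create the directory and update the link.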

Collecting Performance Statistics With xAODs

If you are using xAODs, you can monitor your job's data-access patterns:

job.options()->setDouble (EL::Job::optXAODPerfStats, 1);

This should put the access statistics into your histogram file, as well as print out the statistics at the end of the job.

Warning: This feature hasn’t been tested yet, beyond checking that it compiles. I rushed it in to allow the distributed-i/o team to do some more precise measurements.

Printing Cache Statistics For Each File

If you are interested in performance measurements, you can print out the TTreeCache statistics at the end of each file:

job.options()->setDouble (EL::Job::optPrintPerFileStats, 1);

Warning: This feature hasn’t been tested yet, beyond checking that it compiles and produces some output. I rushed it in to allow the distributed-i/o team to do some more precise measurements.

Retrying File Opening

Some users find that opening a file occasionally fails due to transient errors. For that case there is the option to retry opening the file after a brief pause. You can set the number of retries like this:

job.options()->setDouble (EL::Job::optRetries, 3);

If you want to customize the wait period, you can do that too:

job.options()->setDouble (EL::Job::optRetriesWait, 900);

Please note that the wait period is randomized and the value you set is its upper limit (in seconds). The reason for the randomization is that if the transient failure is caused by too many jobs trying to open a file at once, you do not want all of them to retry at the same time, as they would probably overload the server again.
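
The idea behind the randomized wait can be sketched as follows. This is an illustration, not EventLoop's actual code: `randomizedWait` and `openWithRetries` are hypothetical helpers, the uniform distribution is an assumption, and the open and sleep operations are passed in as callables so that the sketch does not depend on any I/O library.

```cpp
#include <cassert>
#include <random>

// Draw a wait time in seconds from [0, maxWaitSeconds).
// The uniform distribution is an assumption of this sketch.
double randomizedWait (double maxWaitSeconds, std::mt19937& rng)
{
  assert (maxWaitSeconds >= 0);
  std::uniform_real_distribution<double> dist (0.0, maxWaitSeconds);
  return dist (rng);
}

// Try to open a file, retrying up to `retries` times after the first
// failure. `tryOpen` returns true on success; `sleepFor` receives the
// wait time in seconds. Returns true if any attempt succeeded.
template <typename OpenFn, typename SleepFn>
bool openWithRetries (OpenFn tryOpen, unsigned retries,
                      double maxWaitSeconds, std::mt19937& rng,
                      SleepFn sleepFor)
{
  for (unsigned attempt = 0; ; ++attempt) {
    if (tryOpen())
      return true;
    if (attempt == retries)
      return false;  // all retries used up
    // Each job draws its own wait, which spreads the retries out in time.
    sleepFor (randomizedWait (maxWaitSeconds, rng));
  }
}
```

Because every job draws its own wait time, jobs that failed together are unlikely to retry together, which is the effect the randomization is meant to achieve.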

Please note that this only works for DirectDriver and BatchDriver.