EventLoop Grid Driver

Last update: 06 Nov 2019

The basic solution for running on the grid is the PrunDriver. To use it you must first set up panda and rucio clients:

    lsetup rucio
    lsetup panda

Note: At least the rucio setup should normally be done before setting up ROOT or an ASG release, otherwise there is a risk of configuration clashes.

To submit jobs to the grid, create an instance of the PrunDriver.

    EL::PrunDriver driver;

Optionally, you can specify how to name the grid output datasets. The naming is based on a simple rule, which you specify like so:

    driver.options()->setString("nc_outputSampleName", "user.amadsen.test.%in:name[2]%");

This string should always begin with user.yourgridnickname. to be consistent with rucio naming rules. The rest of the string is arbitrary, and some substitutions can be used to derive the name from each input sample:

- %nickname% will be replaced with your grid nickname.
- %in:name% will be replaced with the name of the input sample.
- %in:name[n]% will be replaced with the n-th field of the input name, split on '.'.
- %in:metastring% will be replaced with the value of the (string) meta data field metastring of the input sample.

For example, using the string above user.amadsen.test.%in:name[2]%, the output sample created from the input sample mc11_7TeV.105200.T1_McAtNlo_Jimmy.merge.NTUP_TOP.e835_s1272_s1274_r3043_r2993_p834 will be called user.amadsen.test.105200.
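
Equivalently, the nickname part of the name can be filled in via the %nickname% substitution described above:

    // same naming rule, with the grid nickname filled in automatically
    driver.options()->setString("nc_outputSampleName", "user.%nickname%.test.%in:name[2]%");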

Job configuration is done using the meta data system, so options can be set on a per sample basis:

    driver.options()->setString(EL::Job::optGridNFilesPerJob, "MAX"); // by default, split into as few jobs as possible
    sh.get("data12_8TeV.00202668.physics_Muons.merge.NTUP_COMMON.r4065_p1278_p1562/")->setMetaDouble(EL::Job::optGridNFilesPerJob, 1); // for this particular sample, split into one job per input file
    driver.options()->setDouble(EL::Job::optGridMergeOutput, 1); // run merging jobs for all samples before downloading (recommended)

The full list of supported options can be found in the EL::Job documentation (look for variables starting with optGrid). For a full explanation of each option, see the prun documentation (prun --help).
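
As an illustration, a few more of these options could be set as shown below. The option names are assumed to be available as EL::Job::optGrid* constants in your release, and the site name is only a placeholder; verify both against the EL::Job documentation.

    // sketch: a few additional grid options (names assumed to exist in EL::Job)
    driver.options()->setString(EL::Job::optGridExcludedSite, "SOME_SITE"); // placeholder site name, corresponds to prun --excludedSite
    driver.options()->setDouble(EL::Job::optGridNGBPerJob, 10);             // split jobs by input size (GB) instead of file count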

The grid drivers work with SampleGrid samples. A scanDQ2() function is available to create these:

    SH::SampleHandler sh;
    SH::scanDQ2 (sh, "user.krumnack.pat_tutorial_*.v1");
    sh.setMetaString ("nc_tree", "CollectionTree");

Please see the SampleHandler documentation for more information. Note that you can specify a subset of files in a dataset or container by setting the meta data string nc_grid_filter, for example to "*.root*" to process only the root files in a dataset that also contains log files. (The trailing wildcard is significant, as files are often named e.g. something.root.1.)
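
For example, such a filter could be applied to all samples at once via the sample handler, or to a single sample (the dataset name below is just a placeholder):

    // apply the filter to every sample in the sample handler
    sh.setMetaString ("nc_grid_filter", "*.root*");
    // or to one particular sample only (placeholder dataset name)
    sh.get ("user.krumnack.pat_tutorial_test.v1")->setMetaString ("nc_grid_filter", "*.root*");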

Create your Job object as usual and then submit it:

    driver.submit(job, "uniqueJobDirectory");
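
If you prefer not to keep the session open while the jobs run, one possible pattern is to submit now and collect the output later. This assumes that your EventLoop version provides submitOnly() and retrieve() taking the submission directory, as referenced further down this page; check the EL::Driver documentation for the exact signatures.

    // sketch: submit without waiting for the jobs to finish
    driver.submitOnly (job, "uniqueJobDirectory");
    // ... later, possibly from a different ROOT session, check the jobs and
    // download/merge whatever output is ready:
    driver.retrieve ("uniqueJobDirectory");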

Passing Non-Standard Options to the Grid Driver

In case the option you need to use is not available as an explicit option, you can pass it as a generic option:

    job.options()->setString (EL::Job::optSubmitFlags, "-x -y -z");

For options that EventLoop supports explicitly, it is preferable to use the explicit option rather than the generic mechanism: this makes EventLoop aware of which options you chose and gives it the opportunity to take extra actions where required.

Processing multiple datasets in one JEDI task

Note that Panda accepts a comma-separated list of datasets as input. This can speed up job submission when multiple datasets should all be processed with the same meta data. To set up such a task, you can do:

    std::unique_ptr<SH::SampleGrid> sample(new SH::SampleGrid("AllMyData"));
    sample->meta()->setString(SH::MetaFields::gridName, "data15_13TeV.periodA-J.physics_Main.PhysCont.DAOD_EXOT14.grp15_v01_p9999,data16_13TeV.periodA-L.physics_Main.PhysCont.DAOD_EXOT14.grp16_v01_p9999");
    sample->meta()->setString(SH::MetaFields::gridFilter, SH::MetaFields::gridFilter_default);
    sh.add(sample.release());

where sh is your sample handler. This should be sufficient for data, but for MC samples we usually want a way to keep track of which output came from which input sample. With the PrunDriver we can use the option described above to pass in some extra flags that will help with that:

    job.options()->setString (EL::Job::optSubmitFlags, "--addNthFieldOfInDSToLFN=1,2,3 --useContElementBoundary");

Here, --useContElementBoundary ensures that only files coming from the same input dataset are processed together, and the numbers passed to --addNthFieldOfInDSToLFN add (in this case) the first, second and third field of that input dataset's name to the names of the produced output files. Note that this only really makes sense if you use submitOnly(), as the retrieve() command would just add all the histograms back together again.
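
A possible end-to-end sketch of this workflow is shown below. The dataset names and the submission directory are only placeholders, and the submitOnly() signature is assumed as described above.

    // sketch: one JEDI task over several MC datasets, keeping per-input-dataset
    // output names (all names here are placeholders)
    std::unique_ptr<SH::SampleGrid> mcSample(new SH::SampleGrid("AllMyMC"));
    mcSample->meta()->setString(SH::MetaFields::gridName,
        "mc16_13TeV.000001.sampleA.deriv.DAOD_EXOT14.p9999,"
        "mc16_13TeV.000002.sampleB.deriv.DAOD_EXOT14.p9999");
    mcSample->meta()->setString(SH::MetaFields::gridFilter, SH::MetaFields::gridFilter_default);
    sh.add(mcSample.release());

    job.options()->setString(EL::Job::optSubmitFlags,
        "--addNthFieldOfInDSToLFN=1,2,3 --useContElementBoundary");

    // submit without retrieving, so the per-dataset output names are preserved
    driver.submitOnly(job, "multiDatasetSubmitDir");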

Using Ganga

A more advanced solution for running on the grid is the Ganga-based GridDriver. This driver works a bit differently: it uses the Ganga service to keep monitoring jobs and downloading output in the background, even if you log out. This can be handy for very long-running jobs. To use it, replace EL::PrunDriver driver with:

    EL::GridDriver driver;

Before running for the first time: the GridDriver uses Ganga, which needs to create a few configuration defaults the first time it is run. If you have never used Ganga before, start it once in interactive mode before using the GridDriver:

    export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
    source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
    localSetupGanga
    ganga

Answer the questions, then press Ctrl+d to exit. You only need to do this once on each computer where you want to use the GridDriver.

Instead of metadata options, the GridDriver is configured using member variables (this might change in the future):

    driver.outputSampleName = "user.amadsen.test.%in:name[2]%";

Note that the submit command can take a minute or two to complete, and even then not all jobs will start immediately. The driver employs a GangaService process which continues to run in the background, submitting the jobs over a period of time, automatically resubmitting failed jobs (up to 4 times each), and downloading the output of completed jobs. This continues unattended for 24 hours, even if you quit ROOT or log out of your session. If your jobs have not completed by then, the service is restarted the next time you call retrieve() or wait().

Note however that while the output files are cached locally, they are not actually merged and made available in the job directory until you call retrieve() or wait(). Note also that, as with all batch drivers, only the histogram files are downloaded by the grid driver; output streams are left on the grid. You can download them manually using SampleHandler.
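
Putting this together, a minimal sketch of a GridDriver submission might look like the following. It assumes that wait() (or retrieve()) accepts the submission directory in the same way as for the other drivers; check the EL::Driver documentation for the exact signatures in your release.

    // sketch: submit with the Ganga-based GridDriver, collect output later
    EL::GridDriver driver;
    driver.outputSampleName = "user.amadsen.test.%in:name[2]%";
    driver.submit (job, "gridJobDirectory"); // hands the jobs over to GangaService and returns
    // ... later, possibly in a new ROOT session:
    driver.wait ("gridJobDirectory");        // assumed to block until the jobs are done and the output is merged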

WARNING: Since there is probably little value in maintaining two EventLoop grid drivers, support for the Ganga-based GridDriver may go away at some point.