The basic solution for running on the grid is the PrunDriver. To use it,
you must first set up the panda and rucio clients:
lsetup rucio
lsetup panda
Note: At least the rucio setup (called dq2 in older instructions) should normally be done before setting up ROOT or an ASG release; otherwise there is a risk of configuration clashes.
To submit jobs to the grid, create an instance of the PrunDriver:
EL::PrunDriver driver;
Optionally, you can specify how to name the grid output datasets. The naming is based on a simple rule, which you specify like so:
driver.options()->setString("nc_outputSampleName", "user.amadsen.test.%in:name[2]%");
This string should always begin with user.yourgridnickname. to be
consistent with rucio naming rules. The rest of the string is arbitrary,
and some substitutions can be used to derive the name from each input
sample. %nickname% will be replaced with your grid nickname. %in:name%
will be replaced with the name of the input sample. %in:name[n]% will be
replaced with the n-th field of the input name, split by '.'.
%in:metastring% will be replaced with the value of the (string) meta
data field metastring of the input sample. For example, using the string
above, the output sample created from an input sample whose
%in:name[2]% field is 105200 will be called user.amadsen.test.105200.
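As a rough illustration of the substitution rule, the following standalone C++ sketch mimics how %nickname%, %in:name% and %in:name[n]% might be expanded. It is not EventLoop's implementation: the function names expandOutputName and splitFields are made up for this example, fields are assumed to be counted from 1 (consistent with the 105200 example), %in:metastring% is not handled, and the input sample name in the usage note is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a sample name into its '.'-separated fields.
std::vector<std::string> splitFields(const std::string& name) {
  std::vector<std::string> fields;
  std::string field;
  std::istringstream stream(name);
  while (std::getline(stream, field, '.')) fields.push_back(field);
  return fields;
}

// Illustrative re-implementation of the naming rule (NOT EventLoop's own code).
// Handles %nickname%, %in:name% and %in:name[n]%, with n counted from 1 (assumption).
std::string expandOutputName(const std::string& pattern,
                             const std::string& inputName,
                             const std::string& nickname) {
  const std::vector<std::string> fields = splitFields(inputName);
  const std::string indexedPrefix = "in:name[";
  std::string result;
  std::size_t pos = 0;
  while (pos < pattern.size()) {
    const std::size_t open = pattern.find('%', pos);
    const std::size_t close =
        (open == std::string::npos) ? std::string::npos : pattern.find('%', open + 1);
    if (close == std::string::npos) {  // no complete placeholder left
      result += pattern.substr(pos);
      break;
    }
    result += pattern.substr(pos, open - pos);
    const std::string key = pattern.substr(open + 1, close - open - 1);
    std::string value = "%" + key + "%";  // default: keep unknown placeholders as-is
    if (key == "nickname") {
      value = nickname;
    } else if (key == "in:name") {
      value = inputName;
    } else if (key.size() > indexedPrefix.size() + 1 &&
               key.compare(0, indexedPrefix.size(), indexedPrefix) == 0 &&
               key.back() == ']') {
      const std::string index =
          key.substr(indexedPrefix.size(), key.size() - indexedPrefix.size() - 1);
      const bool numeric = index.size() <= 3 &&
                           index.find_first_not_of("0123456789") == std::string::npos;
      if (numeric) {
        const std::size_t n = std::stoul(index);
        if (n >= 1 && n <= fields.size()) value = fields[n - 1];
      }
    }
    result += value;
    pos = close + 1;
  }
  return result;
}
```

For instance, expandOutputName("user.amadsen.test.%in:name[2]%", "mcXX.105200.someProcess", "amadsen") yields user.amadsen.test.105200 under these assumptions.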
Job configuration is done using the meta data system, so options can be set on a per sample basis:
driver.options()->setString(EL::Job::optGridNFilesPerJob, "MAX"); //By default, split in as few jobs as possible
sh.get("data12_8TeV.00202668.physics_Muons.merge.NTUP_COMMON.r4065_p1278_p1562/")->SetMetaDouble(EL::Job::optGridNFilesPerJob, 1); //For this particular sample, split into one job per input file
driver.options()->setDouble(EL::Job::optGridMergeOutput, 1); //run merging jobs for all samples before downloading (recommended)
The full list of supported options can be found in the EL::Job class
documentation (look for variables starting with optGrid). For a full
explanation of each option, see the prun documentation (prun --help).
The grid drivers work with SampleGrid samples. A scanDQ2() function is
available to create these:
SH::SampleHandler sh;
SH::scanDQ2 (sh, "user.krumnack.pat_tutorial_*.v1");
sh.setMetaString ("nc_tree", "CollectionTree");
Please see the SampleHandler documentation for more information. Note
that you can specify a subset of files in a dataset or container by
setting the meta data string nc_grid_filter to, for example, "*.root*"
to process only root files in a dataset that also contains log files.
(The last wildcard is significant, as files may often be named e.g.
something.root.1.)
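To see why the trailing wildcard matters, here is a small standalone sketch of glob matching that supports '*' only. The function matchesFilter is a hypothetical helper written for this example; the matching actually applied to nc_grid_filter on the grid side may differ in detail, and the file names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal glob matcher: '*' matches any run of characters (possibly empty),
// every other character must match literally. Illustrative only.
bool matchesFilter(const std::string& pattern, const std::string& fileName) {
  std::size_t p = 0, f = 0;
  std::size_t starPos = std::string::npos;  // position of the last '*' seen
  std::size_t resumeAt = 0;                 // file position to retry from
  while (f < fileName.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      starPos = p++;   // remember the star and first try matching it to nothing
      resumeAt = f;
    } else if (p < pattern.size() && pattern[p] == fileName[f]) {
      ++p;
      ++f;
    } else if (starPos != std::string::npos) {
      p = starPos + 1; // backtrack: let the last star absorb one more character
      f = ++resumeAt;
    } else {
      return false;
    }
  }
  while (p < pattern.size() && pattern[p] == '*') ++p;  // trailing stars match nothing
  return p == pattern.size();
}
```

With this matcher, the pattern *.root* accepts something.root.1, whereas *.root rejects it and would silently drop such files.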
Create your Job object as usual and then submit it:
driver.submit(job, "uniqueJobDirectory");
In case the option you need to use is not available as an explicit option, you can pass it as a generic option:
job.options()->setString (EL::Job::optSubmitFlags, "-x -y -z");
For options that are supported by EventLoop it is preferred to pass them via the explicit option instead of the generic mechanism, as it makes EventLoop aware of what options you chose and gives it the opportunity to do extra actions (if required).
Note that Panda accepts a comma-separated list of datasets as input. This speeds up job submission when multiple datasets should all be processed with the same meta data. To set up such a task, you can do:
std::unique_ptr<SH::SampleGrid> sample(new SH::SampleGrid("AllMyData"));
sample->meta()->setString(SH::MetaFields::gridName, "data15_13TeV.periodA-J.physics_Main.PhysCont.DAOD_EXOT14.grp15_v01_p9999,data16_13TeV.periodA-L.physics_Main.PhysCont.DAOD_EXOT14.grp16_v01_p9999");
sample->meta()->setString(SH::MetaFields::gridFilter, SH::MetaFields::gridFilter_default);
sh.add(sample.release());
where sh is your sample handler. This should be sufficient for data, but
for MC samples we usually want a way to keep track of which output came
from which input sample. With the PrunDriver we can use the option
described above to pass in some extra flags that will help with that:
job.options()->setString (EL::Job::optSubmitFlags, "--addNthFieldOfInDSToLFN=1,2,3 --useContElementBoundary");
Here, useContElementBoundary ensures that only files that come from the
same input dataset are processed together, and the numbers after
addNthFieldOfInDSToLFN will add (in this case) the first, second and
third part of the name of that input dataset to the names of the
produced output files. Note that this only really makes sense if you
submitOnly(), as the retrieve() command would just add all the
histograms back together again.
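The effect of the field numbers can be pictured with a short standalone sketch that picks fields 1, 2 and 3 out of each dataset name in a comma-separated list. The helpers split and selectFields are hypothetical names for this example; joining the selected fields with '.' is only for display, and how prun embeds them in the final output file names is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split text on a single-character delimiter.
std::vector<std::string> split(const std::string& text, char delimiter) {
  std::vector<std::string> parts;
  std::string part;
  std::istringstream stream(text);
  while (std::getline(stream, part, delimiter)) parts.push_back(part);
  return parts;
}

// Return the requested fields (counted from 1) of a dataset name, joined by '.'.
// Illustrative only: this is not how prun builds the output file names internally.
std::string selectFields(const std::string& datasetName,
                         const std::vector<std::size_t>& fieldNumbers) {
  const std::vector<std::string> fields = split(datasetName, '.');
  std::string selected;
  for (std::size_t n : fieldNumbers) {
    if (n < 1 || n > fields.size()) continue;  // skip out-of-range requests
    if (!selected.empty()) selected += '.';
    selected += fields[n - 1];
  }
  return selected;
}
```

Applied to the two containers in the combined sample above (after splitting the gridName string on ','), this gives data15_13TeV.periodA-J.physics_Main and data16_13TeV.periodA-L.physics_Main.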