Using a (supported) batch system works essentially like using the other drivers; all you have to do is create the proper driver:
EL::CondorDriver driver;
EL::TorqueDriver driver;
EL::LSFDriver driver;
EL::LLDriver driver; // LoadLeveler
EL::GEDriver driver; // Grid Engine (this is separate from actual grid submission)
You will normally have to execute some commands on the worker nodes to set up ROOT. You have to add those commands to the driver:
driver.shellInit = "some commands for setting up root on the nodes";
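For illustration, a complete submission with a batch driver might look like the following sketch (the setup command and submission directory are placeholders for whatever your cluster needs):
EL::Job job;
job.sampleHandler (sh); // sample handler prepared earlier
// ... add your algorithms to the job here ...
EL::TorqueDriver driver;
driver.shellInit = "source setup-root.sh"; // placeholder setup command for your cluster
driver.submit (job, "submitDir");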
If you need extra flags for submitting to your batch system, you will also need to specify them (you can do this through either job.options() or driver.options()):
job.options()->setString (EL::Job::optSubmitFlags, "-x -y -z");
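If you prefer to set the flags on the driver instead, the same option can be set there (a sketch; the flags shown are placeholders):
driver.options()->setString (EL::Job::optSubmitFlags, "-x -y -z");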
For condor, you can specify extra lines to place into the condor configuration:
job.options()->setString (EL::Job::optCondorConf, "parameter = value");
The LSFDriver contains a special hack to make it work at some sites, which may cause problems at other sites. You can disable it via:
job.options()->setBool (EL::Job::optResetShell, false);
There are a couple of limitations at the moment. These limitations are not fundamental, but mostly exist because nobody has entered a feature request to fix them. In particular, most drivers assume that your build directory is on a shared filesystem available on all worker nodes. Furthermore, they assume that the output directory is located somewhere that all the worker nodes can write to directly. On most clusters this can be achieved by placing both in the home directory. Note that the Condor driver has some special options to work around this.
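As an illustration (the path is hypothetical), submitting to a location inside your home directory usually satisfies this requirement:
driver.submit (job, "/home/username/analysis/submitDir"); // must be writable from all worker nodes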
If you find that you need a driver for a different batch system, or need extensions to the driver for your batch system, let us know and we will try to help. Adding support for new batch systems has turned out to be fairly easy and straightforward; adding support for Torque took under three hours. Depending on what you ask for, we may need a login to your cluster so that we can test the code.
Sometimes when you run on a large number of files, you don’t want a separate worker process running for each file; this is mostly to avoid per-job overhead.
Whether this affects you is hard to say in general, and there is a tradeoff in that if you have too many small jobs you may be unable to utilize your entire batch system. However, if you run a lot of short jobs and suspect that this affects you, you can try changing the number of files per job and see whether this improves things.
You can do this for the whole job by calling:
job.options()->setDouble (EL::Job::optFilesPerWorker, 5);
And if you decide that there is a sample that needs a different number of files per job, you can use:
sample->setMetaDouble (EL::Job::optFilesPerWorker, 10);
If you run SH::scanNEvents on your sample handler, it will scan the number of events in each ROOT file and store them in the sample handler:
SH::scanNEvents (sh);
If you then submit the sample handler to the batch driver, it will split jobs so that all jobs have an approximately equal number of events.
If you want, you can also configure your job to take a particular number of events per job:
sh.setMetaDouble (EL::Job::optEventsPerWorker, 5000);
This will cause the batch driver to spawn just enough jobs so that no job has more than 5000 events. If you want to, you can also set this separately for each sample:
sh.setMetaDouble ("sampleName", EL::Job::optEventsPerWorker, 10000);
This can be helpful if you have different processing speeds or acceptance rates for different samples.
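Putting the pieces together, a sketch of event-based splitting could look like this (the sample name pattern and event counts are illustrative):
SH::scanNEvents (sh); // count the events in each file
sh.setMetaDouble (EL::Job::optEventsPerWorker, 5000); // default for all samples
sh.setMetaDouble ("ttbar", EL::Job::optEventsPerWorker, 10000); // hypothetical sample pattern
job.sampleHandler (sh);
driver.submit (job, "submitDir");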
For running CondorDriver on NAF at several German institutes, the following option is reported to make it work:
sh.setMetaString (EL::Job::optCondorConf, "+MyProject = \"af-atlas\"\nshould_transfer_files = NO");
The should_transfer_files setting indicates that the user job files do not need to be copied over, but are instead picked up via the shared file system.