Adding a new Driver to the EventLoop Package

Last update: 06 Nov 2019

WARNING: This section may no longer be up-to-date.

This section should not be relevant to the typical user. It documents the details of how to implement a new driver to make the EventLoop package work in a new environment. If this is what you are trying to do, it may be a good idea to contact me up-front with the details of what you are trying to do, so I can give you some additional guidance.

The first decision you have to make is whether your driver should be a part of the EventLoop package or live in a separate package. That’s really up to you, but so far I am keeping everything in one package for simplicity. However, even if you keep your driver in a separate package, there are probably some changes that need to be made to the EventLoop package anyway.

The basic driver design will consist of three or four components:

  • A class deriving from the Driver class which runs all the code that needs to run on the submission node.
  • A class deriving from the Worker class which runs all the code that needs to be run on the worker node.
  • Some steering code which sets up your worker object on the worker node.
  • Optional: A unit test which the user can run to check that the driver actually works in their setup.

This part of the EventLoop design is still very fluid. As we add more drivers some of the interfaces may change to accommodate their needs. That is one of the reasons why it is better if I know which drivers are out there, so that I can go and fix them if I break things. Anyway, this also means that you can request changes to the way EventLoop works behind the scenes to make your driver implementation easier.

When designing your driver, you have the choice of storing additional information inside the unique submission directory, as long as it doesn’t collide with any “official” files put there. If your files are fairly large you may consider removing them after the job has finished in order to save space.

The Driver Class

The Driver class provides an interface for code that runs on the submission node. As such your class needs to derive from that class and override its virtual functions. So far the only virtual function is doSubmit, which is called when submitting a new job.

Depending on the nature of your driver, you may also want to add further configuration options. These can go either into your Driver class itself, or into the driver-independent Job class. Which of the two is preferable mostly depends on whether this is something you expect the user to set on a job-by-job basis, or something that they would want to keep the same for all their jobs. A combination is also possible, with a field in the Driver class that can be overridden by a field in the Job class. Configuration options that affect output datasets have the additional option of going into OutputStream.
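As an illustration, a driver class declaration might look roughly like the sketch below. This is only a sketch: MyBatchDriver and the m_queue option are made-up names, and the exact signature of doSubmit differs between EventLoop versions, so check Driver.h for the version you are building against.

```cpp
#include <EventLoop/Driver.h>

#include <string>

// hypothetical driver, for illustration only
class MyBatchDriver : public EL::Driver
{
public:
  // a driver-level configuration option, acting as a default for all jobs
  std::string m_queue;

protected:
  // called when the user submits a job through this driver;
  // the exact signature may differ in your EventLoop version
  virtual void doSubmit (EL::Job& job, const std::string& location) const;

  // drivers get persistified via ROOT, like the existing driver classes
  ClassDef(MyBatchDriver, 1);
};
```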

Notes:

  • Sometime soon the doSubmit method will be split into separate doSubmit and doGather methods, which will allow drivers to disconnect from a running job and then reconnect at a later stage.
  • So far the only error handling strategy we have is to abort if a user algorithm reports an error, and then resubmit. While this can be somewhat inefficient, it is the way most users operate anyway. However, this does not cover batch system errors, and you are certainly encouraged to add automatic retries where appropriate.

The doSubmit Function

The basic functionality of the doSubmit method can be summarized like this (a rough outline in code follows the list):

  • Persistify the information from the Job object that will be needed on the worker node. This is mostly the list of algorithms, but also the actual samples being run over (for meta-data access), and potentially the list of output datasets.
  • Pack up the persistified information and the current installation of RootCore for delivery to the batch system.
  • Loop over all the input samples to determine which datasets to run over and submit your job for those datasets. This can be a separate submission for each dataset, or one joint submission for all datasets. The latter is typically more efficient.
  • Wait for the job to finish.
  • For each input sample, add the histogram outputs of the different sub-jobs together into a single file location/hist-sample.root. The ROOT tool hadd can be used for merging.
  • For each input sample and each output dataset create a Sample object that describes how to access the output files created. Then create a separate SampleHandler object for each output dataset that contains the samples going with that output dataset. Store those output datasets as location/out-label.
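Put together, a doSubmit implementation might be structured roughly as follows. Everything batch-system specific is reduced to placeholder comments, and the loop over job.sampleHandler() assumes the Job interface as described in this section; treat it as an outline rather than working code.

```cpp
#include <EventLoop/Job.h>
#include <SampleHandler/SampleHandler.h>

#include <string>

void MyBatchDriver::doSubmit (EL::Job& job, const std::string& location) const
{
  // 1) persistify the job information (algorithms, samples, output
  //    streams) somewhere inside `location`
  // 2) pack up the persistified information and the RootCore installation

  // 3) submit sub-jobs, either per dataset or as one joint submission
  for (SH::SampleHandler::iterator sample = job.sampleHandler().begin();
       sample != job.sampleHandler().end(); ++sample)
  {
    // submitSample (**sample, location);   // placeholder for your system
  }

  // 4) wait for all sub-jobs to finish

  // 5) merge the sub-job histograms into location/hist-<sample>.root
  // 6) build one SampleHandler per output stream describing the output
  //    files and save it as location/out-<label>
}
```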

When writing your driver you will have to interact fairly heavily with the SampleHandler package. This package is still fairly new and not yet widely used, which means we can still fix things that seem broken or impractical. The first thing you have to decide is how your samples will be represented. If what you need is a list of files, call makeTChain or makeTDSet to get the list. If, on the other hand, your system is aware of datasets, you may want to use SampleGrid objects and store the information in the meta-data. You may have to define the appropriate meta-data fields if they don’t exist already.
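For instance, a driver that just needs the input file list could use makeTChain along these lines. This is a sketch: printInputFiles is a made-up helper, and it assumes the chain elements report the file name through GetTitle(), as TChain normally does.

```cpp
#include <SampleHandler/Sample.h>

#include <TChain.h>
#include <TObjArray.h>

#include <iostream>
#include <memory>

// made-up helper: print every input file in a sample
void printInputFiles (SH::Sample& sample)
{
  // makeTChain attaches all files of the sample to a new TChain,
  // which the caller owns
  std::unique_ptr<TChain> chain (sample.makeTChain());

  TObjArray *files = chain->GetListOfFiles();
  for (Int_t index = 0; index < files->GetEntries(); ++index)
    std::cout << files->At(index)->GetTitle() << std::endl;
}
```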

For your output datasets the preferred method is not to copy them back to the submission node, but to write them directly to a storage element (see the reasons in the section on output datasets above). That means you need to work out how to do this for your system. Once you have done that, you need to figure out how to access those files and create a new Sample object describing them. In most cases this will be a SampleLocal or a SampleGrid object, but if your storage element is sufficiently special you may need a whole new Sample class. If that is the case, I can help with that.
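If your output files do end up somewhere that SampleLocal can describe, registering them could look roughly like this. The helper and its arguments are made up for illustration; check the SampleLocal and SampleHandler headers for the exact interfaces.

```cpp
#include <SampleHandler/SampleHandler.h>
#include <SampleHandler/SampleLocal.h>

#include <memory>
#include <string>
#include <vector>

// made-up helper: describe the output files of one sample for one
// output stream and register them with the output SampleHandler
void registerOutput (SH::SampleHandler& output,
                     const std::string& sampleName,
                     const std::vector<std::string>& files)
{
  std::unique_ptr<SH::SampleLocal> sample (new SH::SampleLocal (sampleName));
  for (std::size_t index = 0; index != files.size(); ++index)
    sample->add (files[index]);

  output.add (sample.release());   // SampleHandler takes ownership

  // once all samples are added, the caller would save the handler, e.g.
  // output.save (location + "/out-" + label);
}
```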

Now your output histograms have to go a separate way from your output datasets. How this works depends on your batch system. Most batch systems send some information back to the submission node, so you can just include the histogram files there. Once all of them have arrived, you can use hadd to combine the output histogram files into a single one. Or, if you want, you can try to add the histogram files together as they arrive; the latter saves some time when running with a large number of sub-jobs.
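For the merging step you can either shell out to hadd or use ROOT’s TFileMerger class, which is what hadd uses underneath. A minimal sketch with TFileMerger, with made-up file names:

```cpp
#include <TFileMerger.h>

#include <stdexcept>
#include <string>
#include <vector>

// merge the per-sub-job histogram files into a single output file
void mergeHistFiles (const std::string& target,
                     const std::vector<std::string>& inputs)
{
  TFileMerger merger;
  if (!merger.OutputFile (target.c_str()))
    throw std::runtime_error ("could not open " + target);
  for (std::size_t index = 0; index != inputs.size(); ++index)
    merger.AddFile (inputs[index].c_str());
  if (!merger.Merge())
    throw std::runtime_error ("failed to merge into " + target);
}
```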

Notes:

  • There is another class, SampleComposite, that will eventually need to be supported, but is not supported right now. From a practical perspective a SampleComposite holds an entire SampleHandler that you then need to run over and combine. Not too much changes for you, except that you may have to combine histogram and output files over multiple datasets. I hope to address this issue soon.
  • If you want, for efficiency’s sake, you can send out a stripped-down version of the sample, containing only meta-data, to the worker nodes.

The Worker Class

The Worker class contains the code that actually runs on the worker node. As such, it both controls the running of the job and provides all the hooks the user algorithms need to access their inputs and outputs. To facilitate that, the Worker base class contains a fair amount of functionality itself and does some translating between the algorithms and the implementation of the derived classes.

When initializing the Worker object you have to do a couple of things:

  • Pass the meta-data for the sample being worked on and the output list into the base class constructor.
  • Open an output file for each output dataset and register it with the base class.
  • Create all the algorithms and register them with the base class.

Then when actually running you have to do a few things per event, in this order (a sketch follows the list):

  • If you opened a new file, call Worker::tree(tree) with the new input tree.
  • Call Worker::treeEntry(entry) with the index of the tree entry currently processed.
  • If you opened a new file, call Worker::algsChangeInput(), which will notify the algorithms that a new input file is available. It is important that this happens after you register both the tree and the next entry to process.
  • Call Worker::algsExecute to tell the algorithms to do the actual processing of the event.
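A sketch of what this sequence could look like inside a Worker subclass is shown below. MyBatchWorker and processFile are made-up names, the construction and output registration from the initialization list above are omitted, and the exact signatures of the Worker hooks should be checked against Worker.h.

```cpp
#include <EventLoop/Worker.h>

#include <TFile.h>
#include <TTree.h>

#include <memory>
#include <string>

// hypothetical worker, for illustration; initialization is omitted
class MyBatchWorker : public EL::Worker
{
public:
  void processFile (const std::string& fileName,
                    const std::string& treeName);
};

void MyBatchWorker::processFile (const std::string& fileName,
                                 const std::string& treeName)
{
  std::unique_ptr<TFile> file (TFile::Open (fileName.c_str()));
  TTree *inTree = dynamic_cast<TTree*> (file->Get (treeName.c_str()));

  bool newFile = true;
  for (Long64_t entry = 0, num = inTree->GetEntries(); entry != num; ++entry)
  {
    if (newFile)
      tree (inTree);          // register the new input tree
    treeEntry (entry);        // register the entry about to be processed
    if (newFile)
    {
      algsChangeInput ();     // must come after tree() and treeEntry()
      newFile = false;
    }
    algsExecute ();           // let the algorithms process this event
  }
}
```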

After a Worker object has finished processing events, it needs to do a couple more things:

  • It should call Worker::algsFinalize to tell all algorithms that they are done processing and should perform any remaining final work.
  • Then it should save the output list somewhere, so that it can be transported back to the user.
  • It should copy the output files to the appropriate storage element.
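Continuing the MyBatchWorker sketch from above, the post-processing could be wrapped up in a hypothetical finish method like this. The saveOutput call is left as a comment since its exact signature should be taken from Driver.h (see the notes below), and the copy to the storage element is whatever your system provides.

```cpp
// continuing the MyBatchWorker sketch; finish() is a made-up name
void MyBatchWorker::finish ()
{
  algsFinalize ();   // all algorithms wrap up their final work

  // write out the output list so it can be shipped back to the user,
  // e.g. via the static EL::Driver::saveOutput function described in
  // the notes below (check its signature in Driver.h)

  // finally, copy the output files to the appropriate storage element
  // (placeholder for your system's copy mechanism)
}
```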

Notes:

  • The output list can be written to a file using Driver::saveOutput. This is a public static function, so it can be called by either the Driver or the Worker.
  • I may at some point merge algsChangeInput into algsExecute, simplifying the process by one step.
  • Right now errors are reported through exceptions. If that proves impractical we can think of other ways of doing so.
  • If you want to process multiple input samples on the same worker node you should create a new Worker object for each of them, as they will have to create different output objects.

The Steering Code

What you need for your steering code will be highly system dependent. You probably need a shell script that runs on the worker node and a binary that creates your Worker object. For creating that binary you can just add another if clause to the util/event_loop_worker.cxx source file (see the sketch below). Please don’t add a lot of code to that file; just call a function that does everything for your particular driver. Alternatively, you can add a completely new binary, but I prefer to avoid that, since I don’t want a large number of binaries sitting in my path.
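The if clause in util/event_loop_worker.cxx might look roughly like the sketch below. The existing contents of that file and the way the driver name is passed to the binary are not shown here, and mybatch_worker_main is a hypothetical function that would live in your own package.

```cpp
#include <cstring>

// hypothetical entry point for your driver, defined in your package
int mybatch_worker_main (int argc, char **argv);

int main (int argc, char **argv)
{
  // ... existing clauses for the other drivers ...
  if (argc > 1 && std::strcmp (argv[1], "mybatch") == 0)
    return mybatch_worker_main (argc, argv);
  // ...
  return 0;
}
```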

The Unit Test

I’m still working out how best to do the unit test. For now, take a look at test/ut_driver_direct.cxx, which shows how I do it at the moment. However, I am not really happy with it, so it is probably going to change.