Some datasets contain duplicate events, so it is often wise to
check whether your datasets contain duplicate events and, if necessary,
remove them. To that end we provide the DuplicateChecker algorithm,
which provides such functionality. This is mostly a stop-gap measure
until the official ASG solution using EventIndex is in place.
WARNING: This will not fix any meta-data that was constructed incorporating those duplicate events. This is a fundamental issue with any duplicate event removal procedure, and this one is no exception. In other words, you should make sure that you either don't include duplicate events in your meta-data in the first place, or that their presence doesn't affect your result.
In the simplest case you just add this algorithm to your list of algorithms:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
job.algsAdd (duplicates);
It is important that you add it before any other algorithms, because it essentially tells EventLoop to skip all subsequent algorithms whenever a duplicate event is encountered.
This will remove duplicate events within each sub-job. However, if your job is split into multiple sub-jobs, the same (duplicate) event may still sneak in. Still, if you have only a single sub-job per sample (e.g. you use DirectDriver), or if you know that all duplicates get processed by the same sub-job, this is good enough.
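As a minimal sketch of what this per-(sub-)job filtering amounts to (illustrative only, not the actual DuplicateChecker implementation), one can track the (run number, event number) pairs seen so far and flag repeats:

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Sketch of per-(sub-)job duplicate detection: remember every
// (run number, event number) pair seen so far; an event whose pair
// is already in the set is a duplicate within this (sub-)job.
class DuplicateFilter
{
public:
  // returns true if this event was already seen in this (sub-)job
  bool isDuplicate (std::uint32_t runNumber, std::uint64_t eventNumber)
  {
    // emplace().second is false if the pair was already present
    return !m_seen.emplace (runNumber, eventNumber).second;
  }

private:
  std::set<std::pair<std::uint32_t, std::uint64_t>> m_seen;
};
```

Since the set is local to each (sub-)job, this also makes it clear why duplicates spread across sub-jobs are not caught by this mechanism alone.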
If you want to be really sure that you don’t have any duplicate events at all you can instruct the checker to write out the necessary event information:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->setOutputTreeName ("duplicate_info");
job.algsAdd (duplicates);
This will create a tree duplicate_info inside your histogram output
stream. The choice of name is up to you; just pick something that
doesn't collide with anything else.
Then after your jobs are finished you can run a basic checker on that tree:
bool good = EL::DuplicateChecker::processSummary (submitdir, "duplicate_info");
submitdir is the name of the
submission directory and
duplicate_info is the tree name you
configured during submission.
In case everything is OK, i.e. every event was processed exactly once,
good is set to true. Otherwise it is set to false and the
problems are printed on the screen. It also creates a file
inside the submission directory that contains a list of all duplicate
events (independent of whether we filtered them out or not).
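Conceptually, such a cross-sub-job check boils down to counting how often each event id appears across all sub-jobs. The sketch below assumes the per-sub-job event lists have already been read back into memory; the real processSummary() instead reads the configured trees from the submission directory:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// event id as (run number, event number)
using EventId = std::pair<std::uint32_t, std::uint64_t>;

// Sketch of a cross-sub-job duplicate check: count every event id
// across all sub-jobs and collect those that appear more than once.
std::vector<EventId>
findDuplicates (const std::vector<std::vector<EventId>>& subjobs)
{
  std::map<EventId, unsigned> counts;
  for (const auto& subjob : subjobs)
    for (const auto& id : subjob)
      ++counts[id];

  std::vector<EventId> duplicates;
  for (const auto& entry : counts)
    if (entry.second > 1)
      duplicates.push_back (entry.first);
  return duplicates;
}
```

This also illustrates the scaling concern mentioned in the limitations below: the check has to hold the event ids of all sub-jobs at once.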
Suppose you found that you have duplicate events distributed across
sub-jobs; you'd probably still want to filter them out. One option is to
switch to DirectDriver and just process everything in a single big
job. Another is to configure the duplicate checker with an explicit list
of duplicate events to filter out:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->addKnownDuplicatesFile ("duplicates");
job.algsAdd (duplicates);
This will then just assume that all the listed events are duplicates, so they will be filtered out no matter what. The file also includes the sample name with each event, so a single file can describe the duplicates for all your samples.
Ideally you’d want to generate that file using EventIndex, which can do
that automatically for you for every dataset. However, as an
alternative, you can also take the
duplicates file we generated in the
last step. In fact you could submit an extra EventLoop job running only
EL::DuplicateChecker to create that file (as it only reads a few fields
in EventInfo it should be reasonably fast, but I've done no benchmarking).
There are a couple of limitations to the duplicate checker:
- The processSummary() function may not scale well to a large number of events, i.e. it may get really slow. If you hit that limitation, contact me and I'll see if I can speed it up. (There may be a similar issue with the in-job filtering of duplicate events, but it normally has to deal with fewer events.)
- The way the duplicates file is stored and handled assumes that we only have a small number of duplicates. If you instead have a large number of duplicates, you may run into trouble; this would unfortunately require some redesign to fix.