Some datasets contain duplicate events, so it is often wise to check whether your dataset contains any and, if necessary, remove them. For that purpose we provide the DuplicateChecker algorithm. This is mostly a stop-gap measure until the official ASG solution based on EventIndex is in place.
WARNING: This will not fix any meta-data that was constructed incorporating those duplicate events. This is a fundamental issue with any duplicate event removal procedure, and this one is no exception. In other words, make sure that you either don’t include duplicate events in your meta-data in the first place, or that their presence doesn’t affect your result.
In the simplest case you just add this algorithm to your list of algorithms:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
job.algsAdd (duplicates);
It is important that you add it before any other algorithms, because this essentially tells EventLoop to skip all subsequent algorithms when a duplicate event is encountered.
This will remove duplicate events within each sub-job. However, if your job is split into multiple sub-jobs, the same (duplicate) event may still sneak in. Still, if you only have a single sub-job per sample (e.g. you use DirectDriver) or if you know that all duplicates get processed by the same sub-job, this is good enough.
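For concreteness, here is a minimal sketch of this ordering; MyAnalysisAlg and the MyAnalysis package are hypothetical placeholders for your own analysis code, and the EventLoopAlgs include path is my assumption about where DuplicateChecker lives:
#include <EventLoop/Job.h>
#include <EventLoopAlgs/DuplicateChecker.h>
#include <MyAnalysis/MyAnalysisAlg.h>

void configureJob (EL::Job& job)
{
  // the duplicate checker comes first, so that duplicate events never
  // reach the algorithms added after it
  EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
  job.algsAdd (duplicates);

  // your own algorithms go after the checker
  job.algsAdd (new MyAnalysisAlg);
}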
If you want to be really sure that you don’t have any duplicate events at all you can instruct the checker to write out the necessary event information:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->setOutputTreeName ("duplicate_info");
job.algsAdd (duplicates);
This will create a tree duplicate_info inside your histogram output stream. The choice of name is up to you; just pick something that doesn’t collide with anything else.
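If you want to peek at that tree once the job has run, you can open the histogram output file interactively. Note that the hist-<sample>.root file name below is only the usual EventLoop convention, i.e. an assumption on my part; check your submission directory for the actual name:
// quick look at the duplicate_info tree from the ROOT prompt or a macro
TFile *file = TFile::Open ("submitDir/hist-mySample.root");
TTree *tree = dynamic_cast<TTree*> (file->Get ("duplicate_info"));
if (tree)
  tree->Print (); // lists the branches the checker wrote out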
Then after your jobs are finished you can run a basic checker on that tree:
bool good = EL::DuplicateChecker::processSummary (submitdir, "duplicate_info");
Where submitdir
is the name of the
submission directory and duplicate_info
is the tree name you
configured during submission.
In case everything is OK, i.e. every event was processed once and none
twice, good
is set to true. Otherwise it is set to false
and the
problems are printed on the screen. It also creates a file duplicates inside the submission directory that contains a list of all duplicate events (independent of whether we filtered them out or not).
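As an illustration, you could wrap that call in a small standalone program (or a ROOT macro); the hard-coded submission directory is just a placeholder:
// standalone check of the duplicate summary; "submitDir" is whatever
// directory you passed to driver.submit() when running the job
#include <EventLoopAlgs/DuplicateChecker.h>
#include <iostream>
#include <string>

int main ()
{
  const std::string submitdir = "submitDir";
  bool good = EL::DuplicateChecker::processSummary (submitdir, "duplicate_info");
  if (good)
    std::cout << "every event was processed exactly once" << std::endl;
  else
    std::cout << "duplicates found, see the duplicates file in " << submitdir << std::endl;
  return good ? 0 : 1;
}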
Suppose you find that you have duplicate events distributed across sub-jobs; you will probably still want to filter them out. One option is to switch to DirectDriver and just process everything in a single big job. Another is to configure the duplicate checker with an explicit list of duplicate events to filter out:
EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->addKnownDuplicatesFile ("duplicates");
job.algsAdd (duplicates);
This will then simply assume that all the listed events are duplicates, so they are filtered out no matter what. The file also includes the sample name with each event, so a single file can describe the duplicates for all your samples.
Ideally you’d want to generate that file using EventIndex, which can do
that automatically for you for every dataset. However, as an
alternative, you can also take the duplicates
file we generated in the
last step. In fact you could submit an extra EventLoop job running only
EL::DuplicateChecker to create that file (as it only reads a few fields
in EventInfo it should be reasonably fast, but I’ve done no
benchmarks).
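A sketch of such a dedicated job could look like the following; the input path, the sample discovery via SH::ScanDir, and the choice of DirectDriver are assumptions you would adapt to your own setup:
// dedicated job that runs only the duplicate checker and writes the
// per-event information needed for the summary check afterwards
#include <SampleHandler/SampleHandler.h>
#include <SampleHandler/ScanDir.h>
#include <EventLoop/Job.h>
#include <EventLoop/DirectDriver.h>
#include <EventLoopAlgs/DuplicateChecker.h>

void runDuplicateJob ()
{
  // discover the input samples; the path is a placeholder
  SH::SampleHandler sh;
  SH::ScanDir().scan (sh, "/path/to/my/datasets");
  sh.setMetaString ("nc_tree", "CollectionTree");

  EL::Job job;
  job.sampleHandler (sh);

  // only the duplicate checker, writing its per-event tree
  EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
  duplicates->setOutputTreeName ("duplicate_info");
  job.algsAdd (duplicates);

  // run locally, with a single sub-job per sample
  EL::DirectDriver driver;
  driver.submit (job, "submitDirDuplicates");
}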
There are a couple of limitations to the duplicate checker:
- The processSummary() function may not scale well to a large number of events, i.e. it may get really slow. If you hit that limitation, contact me and I'll see if I can speed it up (there may be a similar issue with the in-job filtering of duplicate events, but that normally has to deal with fewer events).
- The way the duplicates file is stored and handled assumes that we only have a small number of duplicates. If you instead have a large number of duplicates, you may run into trouble. This would unfortunately require some redesign to fix.