Removing Duplicate Events

Last update: 16 Aug 2024

Some datasets contain duplicate events, so it is often wise to check whether your datasets contain any and, if necessary, remove them. To that end we provide the DuplicateChecker algorithm. This is mostly a stop-gap measure until the official ASG solution using EventIndex is in place.

WARNING: This will not fix any meta-data that was constructed incorporating the duplicate events. This is a fundamental issue with any duplicate event removal procedure, and this one is no exception. In other words, make sure that you either don’t include duplicate events in your meta-data in the first place, or that their presence doesn’t affect your result.

Removing Duplicate Events Within Each Subjob

In the simplest case you just add this algorithm to your list of algorithms:

EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
job.algsAdd (duplicates);

It is important that you add it before any other algorithms, because it essentially tells EventLoop to skip all subsequent algorithms whenever a duplicate event is encountered.

This will remove duplicate events within each subjob. However, if your job is split into multiple subjobs, the same (duplicate) event may still sneak in. Still, if you have only a single subjob per sample (e.g. you use DirectDriver), or if you know that all duplicates get processed by the same subjob, this is good enough.
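As a rough sketch, a single-subjob setup might look as follows. The header locations and the SH::ScanDir-based sample setup reflect a typical setup and may need adjusting, and MyAnalysisAlg is a hypothetical placeholder for your own algorithm:

#include <string>

#include <EventLoop/DirectDriver.h>
#include <EventLoop/Job.h>
#include <EventLoopAlgs/DuplicateChecker.h>
#include <SampleHandler/SampleHandler.h>
#include <SampleHandler/ScanDir.h>

void submitSingleSubjob (const std::string& inputDir, const std::string& submitDir)
{
  // gather the input samples from a local directory
  SH::SampleHandler sh;
  SH::ScanDir().scan (sh, inputDir);
  sh.setMetaString ("nc_tree", "CollectionTree");

  EL::Job job;
  job.sampleHandler (sh);

  // the duplicate checker goes in first, so that duplicate events
  // are skipped by all subsequent algorithms
  EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
  job.algsAdd (duplicates);

  // your own algorithms follow (MyAnalysisAlg is a placeholder)
  job.algsAdd (new MyAnalysisAlg);

  // DirectDriver processes each sample in a single subjob, so the
  // in-job filtering catches all duplicates
  EL::DirectDriver driver;
  driver.submit (job, submitDir);
}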

Checking For Duplicate Events Across Subjobs

If you want to be really sure that you don’t have any duplicate events at all you can instruct the checker to write out the necessary event information:

EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->setOutputTreeName ("duplicate_info");
job.algsAdd (duplicates);

This will create a tree duplicate_info inside your histogram output stream. The choice of name is up to you; just pick something that doesn’t collide with anything else.
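If you want to inspect that tree by hand, it ends up in the per-sample histogram output file inside the submission directory. A minimal sketch, assuming the usual hist-<sample>.root naming convention (the submitdir path and sample name are placeholders):

#include <iostream>
#include <memory>

#include <TFile.h>
#include <TTree.h>

void inspectDuplicateInfo ()
{
  // hist-<sample>.root is the usual EventLoop histogram output naming;
  // "submitdir" and "mySample" are placeholders for your own job
  std::unique_ptr<TFile> file (TFile::Open ("submitdir/hist-mySample.root"));
  if (!file)
    return;
  TTree *tree = dynamic_cast<TTree*> (file->Get ("duplicate_info"));
  if (tree)
    std::cout << "recorded " << tree->GetEntries () << " events" << std::endl;
}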

Then, after your jobs are finished, you can run a basic checker on that tree:

bool good = EL::DuplicateChecker::processSummary (submitdir, "duplicate_info");

Here submitdir is the name of the submission directory and duplicate_info is the tree name you configured during submission.

If everything is OK, i.e. every event was processed exactly once and none twice, good is set to true. Otherwise it is set to false and the problems are printed on the screen. The function also creates a file duplicates inside the submission directory that contains a list of all duplicate events (independent of whether we filtered them out or not).
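For example, you could wrap the check in a small ROOT macro (checkDuplicates is just an illustrative name; the header path is an assumption about a typical analysis release):

#include <iostream>
#include <string>

#include <EventLoopAlgs/DuplicateChecker.h>

// run e.g. as: root -l -b -q 'checkDuplicates.C("submitdir")'
void checkDuplicates (const std::string& submitdir)
{
  // use the same tree name that was configured during submission
  const bool good = EL::DuplicateChecker::processSummary (submitdir, "duplicate_info");
  if (!good)
    std::cout << "duplicates found, see " << submitdir << "/duplicates" << std::endl;
}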

Removing Duplicate Events Across Subjobs

Suppose you find that you have duplicate events distributed across subjobs; you’d probably still want to filter them out. One option is to switch to DirectDriver and process everything in a single big job. Another is to configure the duplicate checker with an explicit list of duplicate events to filter out:

EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
duplicates->addKnownDuplicatesFile ("duplicates");
job.algsAdd (duplicates);

The checker will then simply assume that all the listed events are duplicates, so they will be filtered out no matter what. The file also includes the sample name with each event, so a single file can describe the duplicates for all your samples.

Ideally you’d want to generate that file using EventIndex, which can do that automatically for every dataset. However, as an alternative, you can also take the duplicates file we generated in the previous step. In fact, you could submit an extra EventLoop job running only EL::DuplicateChecker to create that file (as it only reads a few fields in EventInfo it should be reasonably fast, but I’ve done no benchmarks).
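A minimal sketch of such a scan-only job, with the same caveats about header paths and sample setup as in the earlier sketch:

#include <string>

#include <EventLoop/DirectDriver.h>
#include <EventLoop/Job.h>
#include <EventLoopAlgs/DuplicateChecker.h>
#include <SampleHandler/SampleHandler.h>
#include <SampleHandler/ScanDir.h>

void scanForDuplicates (const std::string& inputDir, const std::string& submitDir)
{
  SH::SampleHandler sh;
  SH::ScanDir().scan (sh, inputDir);
  sh.setMetaString ("nc_tree", "CollectionTree");

  EL::Job job;
  job.sampleHandler (sh);

  // the duplicate checker is the only algorithm: we just record the
  // per-event information, no analysis runs here
  EL::DuplicateChecker *duplicates = new EL::DuplicateChecker;
  duplicates->setOutputTreeName ("duplicate_info");
  job.algsAdd (duplicates);

  EL::DirectDriver driver;
  driver.submit (job, submitDir);

  // DirectDriver runs synchronously, so we can produce the duplicates
  // file right away
  EL::DuplicateChecker::processSummary (submitDir, "duplicate_info");
}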

Limitations

There are a few limitations to the duplicate checker:

  • It doesn’t repair any meta-data corrupted by the presence of duplicate events. If you have only a few duplicates it probably won’t matter; if you have a lot of them you are probably screwed.
  • The output tree can grow to a large size if you have a lot of events. In my tests it was about 5 bytes per event, but that may be too optimistic. For large datasets you may need to put the output tree in a regular output stream instead of the histogram stream. If you run into that limitation, let me know and I’ll add an option for that. I may also be able to shave a bit off the per-event size if needed.
  • The processSummary() function may not scale well to a large number of events either, i.e. it may get really slow. If you hit that limitation, contact me and I’ll see if I can speed it up. (There may be a similar issue with the in-job filtering of duplicate events, but that normally has to deal with fewer events.)
  • The way the duplicates file is stored and handled assumes that there are only a small number of duplicates. If you instead have a large number of duplicates, you may run into trouble; fixing this would unfortunately require some redesign.
  • Currently this only works for xAODs. Support for n-tuples would be possible, though it may take some work; if requested I could look into it.