URL of the GitLab repository: https://gitlab.cern.ch/tiany/FTag-WorkingPointDerivation.git
For how to run this framework, go directly to section 3: What to Do, and consult section 4 when necessary.
The framework:
There are 3 stages in total. Each stage uses the result from the previous one.
Stage 1 is a preparatory stage. It processes your ntuple(s) and produces tagging score distributions in the form of histograms. (See section 2 for the requirements on the input ntuples. See section 5 for details of the histogram.)
Stage 2 uses tagging score distributions to derive the working point for the efficiency you specified and produces two files: one contains plots, the other is your custom CDI file.
Stage 3 contains two Python scripts that produce prettier plots from the stage 2 output. This stage is optional.
(1). Codes:
For stage 1:
For stage 2:
For stage 3 (they are in the folder “PlotScripts”):
(2). Example ntuples subdirectory
(3). outputsExample subdirectory:
This code derives the working point (WP) corresponding to a desired efficiency/rejection specified in a config file by a user, and it provides:
For fixed cut option: the cut value on the discriminant output for the b-tagging efficiency specified by the user.
For flat efficiency option: the cut values vs jet pT in the pT bins specified by the user.
For flat rejection option: the cut values vs jet pT in the pT bins specified by the user.
For hybrid option: the cut vs pT. (Hybrid: a mix of fixed cut and flat efficiency. Fixed cut at lower pT region, flat efficiency at higher.)
2D histograms of eta and jet pT, for making MC/MC efficiency maps.
Various plots like efficiency vs jet pT for easy validation
(!!! The user is advised to use the cut values in higher jet pT region with caution. If the number of entries in the histogram is very small, the cut values are no longer meaningful due to statistical fluctuations.)
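As an illustration of the fixed cut option, the sketch below derives a cut from a binned b-jet tagger score distribution: it scans from the highest score bin downwards until the integrated fraction of jets reaches the requested efficiency. The function name `fixed_cut_from_histogram` and the toy histogram are hypothetical and are not taken from the framework; the actual derivation in DeriveWP.cxx may differ (for example, it may interpolate within a bin).

```python
def fixed_cut_from_histogram(bin_edges, counts, target_eff):
    """Return (cut, achieved_efficiency) for a binned b-jet score distribution.

    Hypothetical helper: the cut is the lower edge of the first bin, scanning
    from high to low scores, at which the fraction of jets above the cut
    reaches target_eff (given as a fraction, e.g. 0.70).
    """
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("need exactly one more bin edge than bin contents")
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("empty histogram: no cut can be derived")
    passed = 0.0
    for i in range(len(counts) - 1, -1, -1):
        passed += counts[i]
        if passed / total >= target_eff:
            return bin_edges[i], passed / total
    # Only reached if target_eff > 1: every jet passes the lowest edge.
    return bin_edges[0], 1.0


# Toy distribution (assumed numbers, for illustration only).
edges = [-6.0, -3.0, 0.0, 3.0, 6.0, 9.0, 12.0]
b_jet_counts = [5, 10, 15, 30, 30, 10]
cut, eff = fixed_cut_from_histogram(edges, b_jet_counts, 0.70)
```

With very few entries in the histogram, the achieved efficiency can jump by a large amount from one bin to the next, which is the statistical limitation mentioned in the warning above.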
For the input file for stage 1, the FTag group now provides centralised ntuples produced using the nonallhadronic ttbar MC sample. The central ntuples will be placed on eos. It is also possible to produce ntuples with AnalysisTop. Code: https://gitlab.cern.ch/mcristof/algontuples/-/tree/master
You could also use your own ntuples as long as they contain the necessary information and have a suitable format. In particular, directly in the file, there should be a branch containing the following leaves:
(1). jet pT, jet eta.
(2). The flag that labels the jet flavour. E.g. label 5 means b jet, 4 means c jet, <4 means light jet. The branch name could be e.g. HadronConeExclTruthLabelID. You may find the label in this Twiki page (especially if you are using the central ntuples): https://twiki.cern.ch/twiki/bin/view/AtlasProtected/FlavourTaggingLabeling
(3). The tagging algorithm output for the chosen tagger(s) for b jet, c jet and light (labelled u) jet. You may have several leaves containing the outputs of different taggers, e.g. MV2c10, DL1r.
The codes for stage 1 are:
(1) Write your “.json” config file: specify the parameters you want. (For detail, see “How to write your config file”.)
(2) Produce the histograms. In the command line, type:
python makeHistograms.py -c stage1small.json
(Replace “stage1small.json” with the name of your config file, as long as it is in “.json” format.)
“makeHistograms.py” processes the json config file and provides default values if some parameters are undefined, then passes the parameters to “MakeHistograms.cxx”, which processes the ntuples and generates a “.root” file with the histograms. This output file is used as input in stage 2.
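The default handling described above can be pictured with the following minimal sketch. It is not the code of makeHistograms.py: the function name `load_stage1_config` and the dictionary `STAGE1_DEFAULTS` are hypothetical, and only defaults that are documented in section 4 are included.

```python
import json

# Documented stage 1 defaults (see the summary table in section 4).
# Hypothetical container; the real script may organise this differently.
STAGE1_DEFAULTS = {
    "jetCollection": "AntiKt4EMPFlowJets_BTagging201903",
    "etaBins": 50,
    "nTaggerDiscriminantBins": 2000,
}


def load_stage1_config(text):
    """Parse a JSON config string and fill in undefined parameters."""
    config = dict(STAGE1_DEFAULTS)
    config.update(json.loads(text))  # user values override the defaults
    # The output file name gets a ".root" extension if it is missing.
    name = config.get("outFileName", "")
    if name and not name.endswith(".root"):
        config["outFileName"] = name + ".root"
    return config


example = load_stage1_config('{"outFileName": "Histos2020-10-25", "etaBins": 25}')
```

In this sketch `example["outFileName"]` becomes "Histos2020-10-25.root", the user value for "etaBins" is kept, and "nTaggerDiscriminantBins" falls back to 2000.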
The codes for stage 2 are:
(1) In the “.json” config file, specify the parameters. (For details, see “How to write your config file”.)
(2) Use deriveWP.py to produce the working points. In terminal, do:
python deriveWP.py -c stage2.json
(Or replace “stage2.json” with the name of your config file)
deriveWP.py runs DeriveWP.cxx and uses the two WorkingPointTool.* files. It produces two “.root” output files: one contains various plots, and the other, a custom CDI file, contains the cut values (either fixed or jet pT dependent). It also produces a “.txt” file with the same name as the custom CDI file, which stores the cut values for the different profiles for easy inspection.
For details of what are in the output files, see section 5.
The code for stage 3 is in the folder “PlotScripts”. There are two scripts:
DrawHybridCuts.py
makeROC_curves.py
The config file from stage 2 should be used in this stage as well.
These scripts use the two “.root” files from stage 2 and produce plots in “.png” format. They are only for plotting and don’t change root files.
If you did not choose hybrid profile, you do not need DrawHybridCuts.py. It creates a folder “HybridCuts”, in which it produces plots related to the hybrid profile.
makeROC_curves.py creates the folder “ROC_curves”. The produced plots are rejection vs b efficiency.
python <codename> -c ../stage2.json
Do not omit the “../” before the config file name, since the config file is not in the same directory as the scripts.
An example of the config file is available in section 4.4.1. It is the same as “stage1central.json” in the repository.
In the config file you can specify the following:
1) “inFileOrDir”: the input file (your ntuples). The file should be in the form of “.root”.
For the input file there are two possibilities:
One input ntuple file. The code checks the specified name: if it ends with “.root”, the code will process it as a single input file. When specifying the name, please include the path.
A folder with all the ntuples placed in it. Please put all ntuples directly in one folder, and specify the folder name including the path in the config file.
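The two possibilities can be sketched as below. This is only an illustration of the rule stated above, not the framework's implementation; the function name `resolve_input_files` is hypothetical, and the sketch assumes that only files ending in “.root” placed directly in the folder are picked up.

```python
import os


def resolve_input_files(in_file_or_dir):
    """Return the list of ntuple files for a given "inFileOrDir" value."""
    # A name ending in ".root" is treated as one single input file.
    if in_file_or_dir.endswith(".root"):
        return [in_file_or_dir]
    # Otherwise the value is treated as a folder holding the ntuples directly
    # (assumption of this sketch: subfolders are not searched).
    return sorted(
        os.path.join(in_file_or_dir, name)
        for name in os.listdir(in_file_or_dir)
        if name.endswith(".root")
    )
```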
E.g.
"inFileOrDir" : "/eos/user/f/fdibello/Ntuples_MCMC/Nominal/user.fdibello.410470.PhPy8EG.DAOD_FTAG1.e6337_s3126_r9364_p4062.mcmc400_output_root/user.fdibello.23304121._000001.output.root"
or
"inFileOrDir" : "/eos/user/f/fdibello/Ntuples_MCMC/Nominal/user.fdibello.410470.PhPy8EG.DAOD_FTAG1.e6337_s3126_r9364_p4062.mcmc400_output_root/"
2) “outFileName”: the name of the output file that stores the output histograms. It should end with “.root”; if it doesn’t, the code will append the extension for you.
3) “jetCollection”: this could be “AntiKt4EMPFlowJets_BTagging201903”, “AntiKtVR30Rmax4Rmin02TrackJets_BTagging201903”, etc. This is only for information and affects the default value for the jet selection criteria “cutString” (see item 8 below for information). Thus if you specified the “cutString” variable by yourself and don’t need the default value, the jet collection would not affect the output at all.
The jets in your input ntuples should belong to the same jet collection, because only one collection can be specified.
The default value for jet collection is “AntiKt4EMPFlowJets_BTagging201903”. I.e. if you do not specify a jet collection, the code would automatically assign this value to the corresponding variable.
E.g.
"jetCollection" : "AntiKt4EMPFlowJets_BTagging201903"
4) “branchName”: the name of the branch/folder in the input ntuple that contains all necessary inputs specified above (jet pT, jet eta, tagger outputs, etc). This name, as well as all the names of the leaves must correspond to the name of the branch and the leaves in your ntuples. E.g.
"branchName" : "nominal"
5) “jetParameterNames”: the name of the leaves in your ntuples that store the jet pT, eta and jvt (Jet Vertex Tagger). The name for the leaf that has the jet jvt is not necessary. It only affects the default value for “cutString” (item 8 in this list). E.g.
"jetParameterNames" : ["jet_pt","jet_eta","jet_jvt"]
6) “flagLeafName1”: the name for the leaf in your ntuples that stores the flag that labels the jet flavour. E.g.
"flagLeafName1" : "HadronConeExclTruthLabelID"
The name depends on your ntuples. For example the above is the name for this leaf in the central ntuples. In the example small ntuples included in the folder, the name is “jet_LabDr_HadF”.
You DO NOT need the name of the leaf that contains other extended jet flavours, since it increases runtime. However if you do not care about runtime or are interested in the histograms for various other jets, specifying the name of the leaf that contains flag that labels the other jet flavours (“flagLeafName2” in addition to “flagLeafName1”) will give you histograms for jets like BB, BD, etc. They are not needed for deriving b tagging working points. It is safe to not mention this parameter in your config file.
E.g. “flagLeafName2” : “HadronConeExclExtendedTruthLabelID”
(Equivalent name in the small ntuples: “jet_DoubleHadLabel”.)
As mentioned before in section 2, you can find the label in the Twiki page below (especially if you are using the central ntuples): https://twiki.cern.ch/twiki/bin/view/AtlasProtected/FlavourTaggingLabeling
7) “taggerLeafNames”: the taggers and tagger scores.
You have to use taggers whose outputs exist in your input ntuple and those that are calibrated. Currently the calibrated taggers are: DL1, DL1r, DL1rmu. Make sure to write the name of the leaves correctly. For example:
"taggerLeafNames" : [
["MV2c10","jet_mv2c10"],
["DL1r","log( DL1r_pb / (0.018*DL1r_pc + 0.982*DL1r_pu ) )"]
]
Note that “MV2c10” is NOT CALIBRATED. It is here ONLY for illustration.
“taggerLeafNames” is a list whose elements are also lists. Each element specifies a tagger. For example, the element [“MV2c10”,”jet_mv2c10”] is also a list. The first element, “MV2c10”, can be an arbitrary name. It defines the string which will be appended to the histogram name in your output file. Therefore you are advised to choose a clear name, preferably the tagger name. The second element “jet_mv2c10” has to correspond to the name of the leaf in your input ntuple that stores the mv2c10 tagger output.
In this example, your input ntuples should have the leaves with the following names: jet_mv2c10, DL1r_pb, DL1r_pc, DL1r_pu.
For the DL1r tagger the second element is the formula for the discriminant instead of a leaf name:
D_DL1 = log( pb / (fc*pc + (1-fc)*pu) )
It is due to the fact that DL1 type taggers are multiclass classifiers which provide 3 output values: the probabilities that a jet is a b-, c- or light jet (pb, pc, pu). These outputs have to be combined via this formula to give the final score. In this example, the fraction fc = 0.018. It controls the importance of the c-jet rejection and was optimised separately for several DL1 versions. You can find the information regarding the recommended value in this page (under “Algorithms Optimizations ” - “DL1”): https://twiki.cern.ch/twiki/bin/view/AtlasProtected/FTAGAlgorithms2019Taggers#Jet_Selection
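For reference, the discriminant formula can be evaluated for a single jet as in this short sketch. The function name `dl1_discriminant` is hypothetical; in the framework itself the expression is passed as a string in the config file and evaluated on the ntuple leaves.

```python
import math


def dl1_discriminant(pb, pc, pu, fc=0.018):
    """Combine the three DL1-type outputs into the final tagger score.

    pb, pc, pu are the b-, c- and light-jet probabilities of one jet;
    fc is the c fraction (0.018 is the value used in the example above).
    """
    return math.log(pb / (fc * pc + (1.0 - fc) * pu))


# A b-like jet gets a positive score, a light-like jet a negative one.
b_like = dl1_discriminant(pb=0.90, pc=0.05, pu=0.05)
light_like = dl1_discriminant(pb=0.02, pc=0.08, pu=0.90)
```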
8) “cutString”: the cuts you want to apply to your jets. You can specify a range for jet pT, eta, and also specify complicated logical relations. Make sure to use correct comparison and logical operators in the expression and express jet pT in MeV. For example:
"cutString":"!(abs(jet_jvt)<0.2&&abs(jet_eta)<2.4&&jet_pt<60000)&&jet_pt>20000&&abs(jet_eta)<2.5"
In this example, you are specifying that the jets should have jet_pt>20000 MeV, abs(jet_eta)<2.5, and should not satisfy the following 3 conditions at the same time corresponding to the JVT selection: abs(jet_jvt)<0.2, abs(jet_eta)<2.4, jet_pt<60000
Again, please make sure to use the correct leaf names from your input ntuples.
This example also corresponds to the default cuts if your jet collection name contains the word “PFlow”. Otherwise, the default value is cutString=””.
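To make the logic of the example cut explicit, the sketch below applies the same selection to a single jet in Python and reproduces the documented default rule. Both function names are hypothetical; the framework evaluates the “cutString” expression directly on the ntuple, and the check for “PFlow” here only mirrors the description above.

```python
PFLOW_CUT = (
    "!(abs(jet_jvt)<0.2&&abs(jet_eta)<2.4&&jet_pt<60000)"
    "&&jet_pt>20000&&abs(jet_eta)<2.5"
)


def default_cut_string(jet_collection):
    """Default "cutString": the PFlow example cut, or no cut otherwise."""
    return PFLOW_CUT if "PFlow" in jet_collection else ""


def passes_example_cut(jet_pt, jet_eta, jet_jvt):
    """Python equivalent of the example cut (jet_pt in MeV)."""
    # A jet fails the JVT selection only if all three conditions hold.
    fails_jvt = abs(jet_jvt) < 0.2 and abs(jet_eta) < 2.4 and jet_pt < 60000
    return (not fails_jvt) and jet_pt > 20000 and abs(jet_eta) < 2.5
```

For example, a central 30 GeV jet with jvt = 0.1 is rejected, while the same jet at 80 GeV is kept because the JVT requirement only applies below 60000 MeV.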
9) 10) 11) “ptBins”, “etaBins”, “nTaggerDiscriminantBins”: the binning for jet pT, eta, and the tagger output discriminant. All the binning settings in stage 1 define the binnings in stage 2.
There are two ways of specifying the binning:
Uneven binning. In this case, define a list with all the edges of the bins. This is recommended for the jet pT bins. The numbers should have format XX.Y (not XX.)
This is not recommended for eta or tagger output bins. Especially, tagger output can have different ranges, thus the binning specified in this way may not be suitable for all the taggers chosen.
Even binning: give the number of bins, the code will divide the bins evenly in these ranges:
pT: 10.0 - 3000.0
eta: 0.0 - 2.5
tagger output discriminant: from -6 to 12 for DL1 series taggers, from -1 to 1 for other taggers.
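The even binning option amounts to the following. This is a minimal sketch with hypothetical function names; in particular, how the framework recognises a “DL1 series” tagger is not documented here, so the name check in `tagger_range` is an assumption.

```python
def even_bin_edges(n_bins, low, high):
    """Return n_bins + 1 equally spaced bin edges between low and high."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins)] + [high]


def tagger_range(tagger_name):
    """Discriminant range used for even binning.

    Assumption of this sketch: a tagger counts as "DL1 series" if its
    name starts with "DL1" (case-insensitive).
    """
    if tagger_name.upper().startswith("DL1"):
        return (-6.0, 12.0)
    return (-1.0, 1.0)


eta_edges = even_bin_edges(50, 0.0, 2.5)               # "etaBins" : 50
dl1r_edges = even_bin_edges(2000, *tagger_range("DL1r"))
```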
Default:
"ptBins": [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 1500.0, 2000.0, 3000.0]
The pT range you see in the stage 2 output files may not be the same as specified here. The reason is that there might be very few entries in some of the bins and the efficiency/rejection cannot reach the specified value.
Since the number of entries in the high pT region is usually small, even binning may not be a good choice.
"etaBins" : 50
eta does not affect the derivation of the WP, but will be used for producing efficiency maps. Even binning is a better idea here.
"nTaggerDiscriminantBins" : 2000
Summary:
# | Variable Name | Example of Possible Value | Default
---|---|---|---
1 | "inFileOrDir" | "/eos/user/f/fdibello/Ntuples_MCMC/Nominal/user.fdibello.410470.PhPy8EG.DAOD_FTAG1.e6337_s3126_r9364_p4062.mcmc400_output_root/" or "/eos/user/f/fdibello/Ntuples_MCMC/Nominal/user.fdibello.410470.PhPy8EG.DAOD_FTAG1.e6337_s3126_r9364_p4062.mcmc400_output_root/user.fdibello.23304121._000001.output.root" |
2 | "outFileName" | "Histos2020-10-25.root" |
3 | "jetCollection" | "AntiKt4EMPFlowJets" | "AntiKt4EMPFlowJets_BTagging201903"
4 | "branchName" | "nominal" |
5 | "jetParameterNames" | ["jet_pt","jet_eta","jet_jvt"] |
6 | "flagLeafName1", "flagLeafName2" (2 is optional) | "HadronConeExclTruthLabelID", "HadronConeExclExtendedTruthLabelID" |
7 | "taggerLeafNames" | [["MV2c10","jet_mv2c10"], ["DL1r","log( DL1r_pb / (0.018*DL1r_pc + 0.982*DL1r_pu ) )"]] |
8 | "cutString" | "!(abs(jet_jvt)<0.2&&abs(jet_eta)<2.4&&jet_pt<60000)&&jet_pt>20000&&abs(jet_eta)<2.5" | For PFlow: same as in the example; otherwise: "" (no restrictions)
9 | "ptBins" | [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 1500.0, 2000.0, 3000.0] | As in example.
10 | "etaBins" | 50 | 50
11 | "nTaggerDiscriminantBins" | 2000 | 2000
The content of the output file is explained in section 5.
!!! It was tested that these ROOT versions can be used without problems: 6.22/00 or 6.22/02.
You can find a list of ROOT releases in this page: https://root.cern/install/all_releases/
Clicking on a certain release you can find the command to set up the release from CVMFS.
Currently, if you run stage 1 on central ntuples, there is a warning:
ReadStreamerInfo, class:string, illegal uid=-2
This is due to the ROOT version. This warning can be ignored.
There is an example in section 4.4.2
In the config file you can specify the following:
1) "inputFileName": input file name with the path (this should be the “.root” file created in stage 1).
2) 3) "plotOutFileName" and "customCDIfile": the names of the two output files, i.e. the file that stores the plots and the custom CDI file. Both are “.root” files; if you forget the extension “.root”, the code will append it. The output .txt file will have the same name as the custom CDI file, since they contain the same information.
4) "taggers": tagger names should correspond to the tagger names in your input ntuple used in stage 1. The format (e.g. upper case, lower case) should also match the names specified in stage 1. For taggers that have a c fraction (fc), please write both the tagger name and fc as a two-element list, as in this example:
"taggers" : ["MV2c10",["DL1r",0.018]]
This fc value should be the same as in the expression of the discriminant specified in stage 1. It will be saved in the custom CDI file. However, if it is not the same as in stage 1, the code will neither report a problem nor crash, so please check this yourself.
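Since the framework does not check this, a small stand-alone check such as the one below may help. It is a sketch under assumptions: both function names are hypothetical, and the regular expression assumes that the stage 1 expression contains the c fraction as a number multiplying a leaf whose name ends in “_pc”, as in the example above.

```python
import re


def fc_from_expression(expression):
    """Extract the c fraction from a stage 1 discriminant expression.

    Returns None if no "<number>*<leaf ending in _pc>" term is found,
    e.g. for a plain leaf name such as "jet_mv2c10".
    """
    match = re.search(r"([0-9]*\.?[0-9]+)\s*\*\s*\w*_pc\b", expression)
    return float(match.group(1)) if match else None


def fc_mismatches(stage1_tagger_leaf_names, stage2_taggers, tol=1e-9):
    """List (tagger, stage 1 fc, stage 2 fc) for every inconsistent tagger."""
    stage1_fc = {name: fc_from_expression(expr)
                 for name, expr in stage1_tagger_leaf_names}
    problems = []
    for entry in stage2_taggers:
        if isinstance(entry, list):  # [tagger name, fc]
            name, fc = entry
            expected = stage1_fc.get(name)
            if expected is None or abs(expected - fc) > tol:
                problems.append((name, expected, fc))
    return problems


stage1 = [["MV2c10", "jet_mv2c10"],
          ["DL1r", "log( DL1r_pb / (0.018*DL1r_pc + 0.982*DL1r_pu ) )"]]
problems = fc_mismatches(stage1, ["MV2c10", ["DL1r", 0.018]])
```

An empty list means that the fc values of the two config files agree.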
5) "jetCollection": this is only for information and does not affect the WP derivation. It is used as the name for the folder in your output files and does not have to be exactly the same as specified in stage 1. For example, you can add dates, etc. as information for yourself.
Currently a date in the name is required, since the calibration framework (the next framework you should run. Your custom CDI file will be the input file for the calibration framework.) requires one.
The default value is “AntiKt4EMPFlowJets_BTagging201903”.
6) "operatingPoints": the working points (also called operating points).
The working point is the percentage of true b-jets tagged as b-jets in the ttbar dilepton MC.
Choose the number you want to obtain the cuts for. You can specify one or multiple working points. If you want a flat rejection profile, you should specify the rejection here. In this case, please write a two-element list, the 1st element should contain either the letter “c” or “u”, indicating whether the specified rejection is c jet rejection or light jet rejection. It is not case-sensitive. The 2nd element is the specified rejection value. Rejection is the inverse of the c or light mistag rate, thus the specified rejection value should be >=1.
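The convention described above can be summarised with the following sketch, which separates efficiency working points from flat rejection requests. The function name `parse_operating_points` is hypothetical and the handling of ambiguous labels (containing both “c” and “u”, or neither) is an assumption of this sketch, not documented behaviour of deriveWP.py.

```python
def parse_operating_points(operating_points):
    """Split "operatingPoints" into efficiencies and rejection requests.

    Plain numbers are b-tagging efficiencies in percent. Two-element lists
    are [label, rejection], where the label contains "c" or "u"
    (case-insensitive) and the rejection is 1 / mistag rate, hence >= 1.
    """
    efficiencies, rejections = [], []
    for point in operating_points:
        if isinstance(point, (list, tuple)):
            label, value = point
            has_c = "c" in label.lower()
            has_u = "u" in label.lower()
            if has_c == has_u:
                raise ValueError("label must contain either 'c' or 'u': %r" % label)
            if value < 1:
                raise ValueError("rejection must be >= 1, got %r" % value)
            rejections.append(("c" if has_c else "u", float(value)))
        else:
            efficiencies.append(float(point))
    return efficiencies, rejections


effs, rejs = parse_operating_points([60, 70, 77, 85, 90, ["uRej", 502]])
```

Here a light jet rejection of 502 corresponds to a light jet mistag rate of 1/502, i.e. roughly 0.2%.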
E.g.
"operatingPoints" : [60,70,77,85,90,["uRej",502]]
7) "WP_profiles": there are 4 possible options you can choose from. However, only 3 can be specified here. E.g.
"WP_profiles" : ["fixedCut","flatEfficiency","hybrid"]
If you want the flat rejection profile, please specify it in “operatingPoints”.
“fixedCut” means that there is one cut value applied at a certain tagger score corresponding to the user specified operating point. This cut value will be applied to all jets: every jet with the score above this cut value will be identified as a b-jet. The overall efficiency is the defined WP. This WP derivation code only derives the cut value corresponding to the desired WP. Applying the cut to data is beyond the scope of this code.
“flatEfficiency” means that for each jet pT bin, the b-jet efficiency is the same and corresponds to the user defined WP while the cut on tagger output score varies accordingly to ensure this same efficiency in jet pT bins. The pT bins correspond to the ones specified in json config file in stage 1 when the histograms were produced.
“hybrid” means when the pT is above a certain jet pT threshold, the profile changes from fixed cut to flat efficiency.
This jet pT value depends on the tagger, the chosen working point and jet collection, and is calculated within the code based on this information. For fixed cut profile, as jet pT goes up, the efficiency first rises to a value higher than the specified WP, then decreases. When the efficiency decreases till the specified WP, the corresponding pT is the threshold pT where the profile effectively changes. (See figure 1 left.) In this case, compared with the fixed cut profile, you get a higher efficiency at the higher pT region.
Figure 1. Left: comparison of the cut value for the hybrid and fixed efficiency profile. Right: comparison of the b efficiency for hybrid and fixed cut profile.
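The three profiles can be compared with the toy sketch below. It is not the algorithm of DeriveWP.cxx: all function names are hypothetical, the inputs are plain lists of b-jet counts per tagger score bin for each jet pT bin, cuts are placed on bin edges only, and the TSpline3 smoothing of the hybrid profile is left out. The hybrid threshold follows the description above: the first pT bin, at or after the efficiency maximum of the fixed cut profile, where that efficiency has dropped to the requested WP.

```python
def cut_for_efficiency(score_edges, counts, target_eff):
    """Lowest score bin edge above which at least target_eff of the jets lie."""
    total = float(sum(counts))
    passed = 0.0
    for i in range(len(counts) - 1, -1, -1):
        passed += counts[i]
        if total > 0 and passed / total >= target_eff:
            return score_edges[i]
    return score_edges[0]


def efficiency_above(score_edges, counts, cut):
    """Fraction of jets in bins whose lower edge is at or above the cut."""
    total = float(sum(counts))
    passed = sum(c for edge, c in zip(score_edges[:-1], counts) if edge >= cut)
    return passed / total if total > 0 else 0.0


def flat_efficiency_cuts(score_edges, counts_per_pt_bin, target_eff):
    """One cut per jet pT bin, each giving the same b-jet efficiency."""
    return [cut_for_efficiency(score_edges, c, target_eff)
            for c in counts_per_pt_bin]


def hybrid_cuts(score_edges, counts_per_pt_bin, target_eff):
    """Fixed cut at low pT, flat efficiency cuts above the threshold bin."""
    inclusive = [sum(column) for column in zip(*counts_per_pt_bin)]
    fixed = cut_for_efficiency(score_edges, inclusive, target_eff)
    effs = [efficiency_above(score_edges, c, fixed) for c in counts_per_pt_bin]
    peak = effs.index(max(effs))
    switch = next((i for i in range(peak, len(effs)) if effs[i] <= target_eff),
                  len(effs))
    flat = flat_efficiency_cuts(score_edges, counts_per_pt_bin, target_eff)
    return [fixed] * switch + flat[switch:]


# Toy input: 4 score bins, 4 jet pT bins ordered from low to high pT.
score_edges = [0, 1, 2, 3, 4]
counts = [[40, 30, 20, 10], [10, 20, 30, 40], [5, 15, 40, 40], [30, 30, 25, 15]]
flat = flat_efficiency_cuts(score_edges, counts, 0.5)
hybrid = hybrid_cuts(score_edges, counts, 0.5)
```

In this toy input the fixed cut efficiency drops below the WP in the last pT bin, so the hybrid profile loosens the cut there and recovers efficiency at high pT.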
8) "minPtForHybridSpline": if a hybrid profile is selected, please specify a value for this variable. The cut values of all jet pT bins below this value are overwritten: the code finds the tagger output cut value for the next bin that has jet pT >= minPtForHybridSpline, and assigns this cut value to all the jet pT bins with lower pT. E.g.:
"minPtForHybridSpline" : 25.0
This min pT should be kept at a minimum. It is needed because the dependence of the cut value on jet pT for hybrid profile is smoothed with a TSpline3 method. This min pT is only to avoid a jump at the beginning of the curve, which would show up because of the spline method, if there is abrupt change (see figure 2).
If this value is not specified, the code would assign a default value. Currently the default value is 25.0.
Figure 2: cut vs pT zoomed in at low pT region, with minPtForHybridSpline=20.0
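The effect of “minPtForHybridSpline” on the cut values can be sketched as follows. The function name is hypothetical, and the sketch assumes that “the next bin that has jet pT >= minPtForHybridSpline” means the first bin whose lower edge is at or above this value; the framework may define it differently.

```python
def flatten_cuts_below_min_pt(pt_edges, cuts, min_pt):
    """Copy the cut of the first bin at or above min_pt to all lower bins."""
    if len(pt_edges) != len(cuts) + 1:
        raise ValueError("need exactly one more pT edge than cut values")
    for i, low_edge in enumerate(pt_edges[:-1]):
        if low_edge >= min_pt:
            return [cuts[i]] * i + list(cuts[i:])
    return list(cuts)  # no bin starts at or above min_pt: nothing to change


# Toy cut values (assumed numbers) for the first default pT bins.
pt_edges = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0]
cuts = [5.0, 4.0, 3.5, 3.2, 3.0, 2.9, 2.8, 2.7]
flattened = flatten_cuts_below_min_pt(pt_edges, cuts, 25.0)
```

All bins below 25.0 now share the cut of the 25.0 to 27.5 bin, which removes the abrupt change at the start of the curve before the spline is applied.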
9) "printout": if set to true, the cut values and errors (vs jet pT for pT dependent profiles) will be printed out in the terminal when running the code.
* About the errors for the cut values: they are not stored in the .txt file because they do not matter. When applying certain WP, the cut values are fixed and have no error.
Summary:
# | Variable Name | Example of Possible Value | Default
---|---|---|---
1 | "inputFileName" | "Histos2020-10-25.root" |
2 | "plotOutFileName" | "Plots_2020-11-22_temp.root" |
3 | "customCDIfile" | "customCDI_2020-11-22_temp.root" |
4 | "taggers" | ["MV2c10",["DL1r",0.018]] |
5 | "jetCollection" | "AntiKt4EMPFlowJets_BTagging201903" | "AntiKt4EMPFlowJets_BTagging201903"
6 | "operatingPoints" | [60,70,77,85,90,["uRej",502]] |
7 | "WP_profiles" | ["fixedCut","flatEfficiency","hybrid"] |
8 | "minPtForHybridSpline" | 25.0 | 25.0
9 | "printout" | true | true
(1) The parameter names should be surrounded by a pair of double quotes. (Not single quotes.)
(2) There should be a comma separating one parameter from the next. But for the last parameter, do not add a comma.
(3) All the elements in an array must have the same type (even though they will be processed as lists in the Python code).
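Rules (1) and (2) can be checked before running the framework with Python's standard json module, which rejects single-quoted names and trailing commas. The helper name `check_config_text` is hypothetical; this is just a convenience sketch and not part of the repository.

```python
import json


def check_config_text(text):
    """Return None if the text is valid JSON, else a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return "line {}, column {}: {}".format(err.lineno, err.colno, err.msg)
    return None


ok = check_config_text('{"etaBins": 50, "printout": false}')
trailing_comma = check_config_text('{"etaBins": 50,}')
single_quotes = check_config_text("{'etaBins': 50}")
```

Here `ok` is None, while the other two calls return a message pointing at the offending position.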
The example config file “stage1.json” looks like this:
{
"inFileOrDir" : "/eos/user/f/fdibello/Ntuples_MCMC/Nominal/user.fdibello.410470.PhPy8EG.DAOD_FTAG1.e6337_s3126_r9364_p4062.mcmc400_output_root/",
"outFileName" : "Histos2020-11-22_central.root",
"jetCollection" : "AntiKt4EMPFlowJets_BTagging201903",
"branchName" : "nominal",
"jetParameterNames":["jet_pt","jet_eta","jet_jvt"],
"flagLeafName1" : "HadronConeExclTruthLabelID",
"taggerLeafNames" : [
["MV2c10","jet_mv2c10"],
["DL1r","log( DL1r_pb / (0.018*DL1r_pc + 0.982*DL1r_pu ) )"]
],
"cutString":"(!(jet_jvt<=0.5&&abs(jet_eta)<2.4&&jet_pt<60000))&&jet_pt>20000&&abs(jet_eta)<2.5",
"ptBins": [10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 250.0, 300.0, 350.0, 400.0, 450.0, 500.0, 550.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0, 2200.0, 2400.0, 2600.0, 2800.0, 3000.0],
"etaBins":50,
"nTaggerDiscriminantBins":2000
}
The example config file “stage2.json” looks like this:
{
"inputFileName" : "outputsExample/Histos2020-10_25.root",
"plotOutFileName" : "Plots_2020-11-22_temp.root",
"customCDIfile" : "customCDI_2020-11-22_temp.root",
"taggers" : ["mv2c10",["DL1r",0.018]],
"jetCollection" : "AntiKt4EMPFlowJets_BTagging201903",
"operatingPoints" : [60,70,77,85,90,["uRej",502]],
"WP_profiles" : ["flatEfficiency","fixedCut","hybrid"],
"minPtForHybridSpline":25.0,
"printout" : false
}
The names of the branches tell you which parameters they contain, the tagger (as written by you in the config file), and the flavour of the jets.
E.g. the name of a 3D tagger score distribution could be: twetaptDL1r_B
which means the histogram shows the distribution of the tagger output (also called tagger weight) for the DL1r tagger, jet eta, and jet pT for jets originating from B hadrons. The name of the corresponding 2D histogram containing these jets is: twptDL1r_B. The name of the 1D histogram is: twDL1r_B.
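The naming pattern of these three examples can be written down as a small helper. The function name `histogram_names` is hypothetical, and the pattern is inferred only from the three names quoted above, so it should be checked against the actual content of your output file.

```python
def histogram_names(tagger, flavour):
    """Build the 1D, 2D and 3D histogram names for a tagger and jet flavour.

    "tw" stands for the tagger weight; "pt" and "eta" are added for the
    2D (weight vs pT) and 3D (weight vs eta vs pT) histograms.
    """
    return {
        "1D": "tw{}_{}".format(tagger, flavour),
        "2D": "twpt{}_{}".format(tagger, flavour),
        "3D": "twetapt{}_{}".format(tagger, flavour),
    }


names = histogram_names("DL1r", "B")
```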
The “.root” file containing all the plots can be inspected in TBrowser to get an idea of which WP is desirable. For different taggers and jet collections, there are:
The rejection vs efficiency plot is normally used to evaluate the performance.
The cut vs pT and efficiency vs pT plots make it straightforward to check the cut values and resulting efficiency (more convenient than checking the values stored in custom CDI files).
The closure plots are obtained by applying the cut values obtained back on the entries. These plots are useful to check that after applying the determined cut values on the discriminant from the flat efficiency/flat rejection/hybrid profile, we indeed get the desired efficiency/rejection back.
The efficiency maps are 2D histograms. These should act as input for making MC/MC efficiency maps for your custom WP. The corresponding plots ending with “ColZ” are intended for easy inspection. However, due to the large range of jet pT, one may not get a lot of information from them.
In the “.root” file that has CDI format, for different taggers and jet collections:
fc: the c fraction, stored as TVector.
To use them, a dedicated code for reading CDI provided by the FTag group is needed. The overall rejection and cut values (to the 6th digit) are also stored in the .txt file for easy inspection.