Run CP Algorithms on the Grid

CPGridRun.py is a script that submits your analysis job to the PanDA grid (we call a job run on this remote computing service a grid job), and you can monitor the job on bigPanDA. You want to submit a grid job when your ROOT files are too big for a local machine, or when you are working with MC samples officially produced by the ATLAS production team.

Submitting a grid job yourself has a steep learning curve because you are exposed to a whole set of grid errors, and most of the time you will be swamped by computing server technicalities while debugging. CPGridRun.py is a centralized script that helps you submit the job in a working, recommended way. The script has a lot of default settings; in particular, it is designed to work smoothly with CPRun.py. In this section we focus on running CPGridRun.py with CPRun.py. The core of CPGridRun.py is generating a working prun (PanDA run) command.

Let's run a demonstration first!

setupATLAS
asetup AnalysisBase,main,latest
touch gridinput.txt
echo "mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490" >> gridinput.txt
echo "mc20_13TeV.700341.Sh_2211_Wmunu_maxHTpTV2_BFilter.deriv.DAOD_PHYS.e8351_s3681_r13145_p6490" >> gridinput.txt

After setting up and creating the input text file, run

CPGridRun.py -i gridinput.txt --testRun --exec "CPRun.py -t test_configuration_Run2.yaml -e 50" --prefix myTutorial

You should see

Py:CPGridRun         INFO
Input: mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490
  Datasetname: mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490
  Projectname: mc20_13TeV
  Campaign: mc20
  Energy: 13TeV
  Dsid: 410470
  Main: PhPy8EG_A14_ttbar_hdamp258p75_nonallhad
  Step: deriv
  Format: DAOD_PHYS
  Tags: ['e6337', 's3681', 'r13167', 'r13146', 'p6490']
  Etag: e6337
  Stag: s3681
  Rtag: r13146
  Ptag: p6490
Py:CPGridRun         INFO Command:
...

test_configuration_Run2.yaml can be used without giving a path because it is a test configuration installed with AnalysisBase. It is a very useful configuration for testing your code on your machine, and it is good practice to use it when you are not sure whether your code is working properly.

The first part of the output is the metadata of your input sample; for details, see the ATLAS Production naming format section below. The second part starts with the prun command, which is the grid submission command you learned in the previous tutorial. CPGridRun.py generates a working prun command for you to run your CP algorithms on the grid with CPRun.py.

Py:CPGridRun         INFO Command:
prun \
--inDS mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad.deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490 \
--outDS user.$USER.myTutorial.410470.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490.test_214093 \
--useAthenaPackages \
--cmtConfig x86_64-el9-gcc13-opt \
--writeInputToTxt IN:in.txt \
--outputs output:output.root \
--exec "CPRun.py --input-list in.txt --output-name output --max-events 50 --text-config test_configuration_Run2.yaml --merge-output-files" \
--memory 2000 \
--addNthFieldOfInDSToLFN 2,3,6 \
--mergeOutput \
--outTarBall cpgrid.tar.gz \
--nEventsPerFile 300 \
--nFiles 10

This is a working prun command line that you can copy and paste on lxplus; of course, you can also let CPGridRun.py run the command line for you. There are a few flags we should discuss.

--outDS user.$USER.myTutorial.410470.DAOD_PHYS.e#####.test_##### starts with the identity (user or group) and the username, which are set automatically, followed by the prefix myTutorial. The suffix at the end, test_#####, is set automatically because we passed --testRun.
--exec "CPRun.py --input-list in.txt --output-name output --max-events 50 --text-config test_configuration_Run2.yaml --merge-output-files"
- The --exec is different from what we entered: CPGridRun.py sets the input and output correctly and makes sure the necessary flags are set.
- It sets --input-list to in.txt, which, as you may have noticed, comes from --writeInputToTxt IN:in.txt. After the grid receives the MC samples you requested, it looks through its database, finds all the related .root files, and writes them into in.txt, which is a format that CPRun.py can take.
--outputs output:output.root is another preset that ensures the IO is set correctly.
--outTarBall asks prun to (re)compress the repository into cpgrid.tar.gz. If you see --inTarBall instead, prun uses the existing cpgrid.tar.gz without re-compressing.
--nEventsPerFile 300 and --nFiles 10 appear because we have --testRun enabled. Sometimes you want to test your code on the grid without waiting a long time for the results. --testRun limits the number of files per job to 10 and the number of events per file to 300, which is useful when you want to test a small run on the grid.
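
To make the structure of the generated command concrete, here is a minimal Python sketch that assembles a comparable prun command line from the flags discussed above. The function name build_prun_command, its parameters, and the username jdoe are hypothetical and only for illustration. This is not how CPGridRun.py is implemented, and it covers only a subset of the flags shown in the demonstration.

```python
import shlex


def build_prun_command(in_ds, out_ds, exec_line, test_run=False):
    """Assemble a prun command line as one shell-quoted string.

    Hypothetical helper for illustration; it only uses flags that
    appear in the demonstration output above.
    """
    args = [
        "prun",
        "--inDS", in_ds,
        "--outDS", out_ds,
        "--useAthenaPackages",
        "--writeInputToTxt", "IN:in.txt",
        "--outputs", "output:output.root",
        "--exec", exec_line,
        "--mergeOutput",
        "--outTarBall", "cpgrid.tar.gz",
    ]
    if test_run:
        # Values shown in the demonstration when --testRun is passed.
        args += ["--nEventsPerFile", "300", "--nFiles", "10"]
    # shlex.join quotes each argument, so the --exec string stays one argument.
    return shlex.join(args)


command = build_prun_command(
    in_ds="mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad"
          ".deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490",
    out_ds="user.jdoe.myTutorial.410470.DAOD_PHYS"
           ".e6337_s3681_r13167_r13146_p6490.test_214093",  # jdoe is a placeholder
    exec_line="CPRun.py --input-list in.txt --output-name output --max-events 50 "
              "--text-config test_configuration_Run2.yaml --merge-output-files",
    test_run=True,
)
print(command)
```

Building the command as a list and quoting it at the end keeps the whole --exec string as a single argument, which is why --exec must be wrapped in double quotes on the command line.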

At the end you will see a confirmation prompt; press y to submit the job to the grid.

ATLAS Production naming format (Optional)

One challenge is getting the input name correctly formatted for the grid. The input name follows the format that the ATLAS Production team uses to name the samples they produce. Getting the name right is crucial because it is the name used on the grid, and it is a format that CPGridRun.py can recognize and use to streamline the submission.

The ATLAS Production naming format is as follows:

Project name: either mc##_%%TeV or data##_%%TeV, for example mc20_13TeV.
DSID: dataset ID, a unique 6-digit number that characterizes your sample. The sample may be a Standard Model process or some exotic simulation.
Main: It can be quite arbitrary but usually contains the simulator information and the process.
Step: the production step, e.g. deriv (derivation), simul, evgen, recon, etc.
Format: the file storage format, e.g. AOD, EVNT, etc. Each format has its own purpose and benefits.
Tags: the simulation configuration, i.e. the settings used in the different steps, which are documented by the Particle model group. Check the link above for more information.

The full format usually follows: ProjectName.DSID.Main.Step.FORMAT.tags
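
As an illustration of this format, the following Python sketch splits a dataset name into the fields printed in the demonstration output. The function parse_dataset_name and the dictionary keys are hypothetical names chosen for this example; CPGridRun.py may parse names differently. The sketch assumes exactly six dot-separated fields and keeps the last tag of each kind, which reproduces the Rtag shown in the demonstration.

```python
def parse_dataset_name(name):
    """Split ProjectName.DSID.Main.Step.FORMAT.tags into a dictionary.

    Hypothetical helper for illustration only.
    """
    fields = name.split(".")
    if len(fields) != 6:
        raise ValueError(
            f"expected 6 dot-separated fields, got {len(fields)}: {name}"
        )
    project, dsid, main, step, data_format, tag_string = fields
    # The project name combines campaign and energy, e.g. mc20_13TeV.
    campaign, _, energy = project.partition("_")
    tags = tag_string.split("_")
    info = {
        "project": project,
        "campaign": campaign,
        "energy": energy,
        "dsid": dsid,
        "main": main,
        "step": step,
        "format": data_format,
        "tags": tags,
    }
    # A later tag of the same kind overwrites an earlier one (r13167 -> r13146).
    for tag in tags:
        if tag:
            info[f"{tag[0]}tag"] = tag
    return info


info = parse_dataset_name(
    "mc20_13TeV.410470.PhPy8EG_A14_ttbar_hdamp258p75_nonallhad"
    ".deriv.DAOD_PHYS.e6337_s3681_r13167_r13146_p6490"
)
for key, value in info.items():
    print(f"{key}: {value}")
```

Splitting on the dots first matters because the format field (DAOD_PHYS) and the Main field contain underscores of their own.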

CPGridRun arguments (Optional)

Let's look at the help message

setupATLAS
asetup AnalysisBase,main,latest
CPGridRun.py -h

There are two main sections: one contains the CPGridRun.py arguments, the other is extracted from CPRun.py. The CPGridRun.py section is divided into 4 subsections. Some argument help messages also contain “(PanDA)”, which means the flag is taken unchanged from prun.

Important Input/Output file configuration

-i or --input-list is NOT identical to the CPRun.py input list. It takes two formats:
- A name that is recognized by the PanDA grid; it should follow the ATLAS Production team naming convention. See the sub-section above.
- A text file that contains multiple names following the ATLAS Production team naming convention.
- Users may also use their own files on the grid, but that is outside the scope of this tutorial.
--output-files: on the grid, NOT all generated files can be downloaded, because it takes extra effort for the grid to collect your files from multiple computing servers into the desired location. Users need to tell the grid in advance what to download. --output-files "A.root,B.txt,B.root" results in outDS/A/A.root, outDS/B/B.txt, outDS/B/B.root in the output directory. If you are using CPRun.py you don’t need to set it.
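
The --output-files mapping described above can be sketched in Python as follows. The helper expected_output_paths is a hypothetical name, and the rule it applies (one sub-directory per file name without its extension) is inferred only from the A.root, B.txt, B.root example; treat it as an illustration of the layout, not as the grid's actual logic.

```python
from pathlib import PurePosixPath


def expected_output_paths(out_ds, output_files):
    """Return where each requested file is expected under the outDS directory.

    Hypothetical helper; the directory rule is inferred from the example above.
    """
    paths = []
    for name in output_files.split(","):
        name = name.strip()
        if not name:
            continue
        # Files that share a stem (B.txt, B.root) land in the same sub-directory.
        stem = PurePosixPath(name).stem
        paths.append(f"{out_ds}/{stem}/{name}")
    return paths


for path in expected_output_paths("outDS", "A.root,B.txt,B.root"):
    print(path)
```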

Important Input/Output naming configuration

Each time a user submits a grid job, it must have a unique outDS. The outDS is a unique identifier on the grid, and every specified output file is placed under the directory outDS. If a duplicate outDS is submitted, the grid returns an error and asks you to change the outDS, even if your previous submission with the same outDS has FAILED. We offer a commonly used preset to simplify the process.

outDS preset: {group/user}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}

username is obtained automatically; DSID, format, and tags are derived from your input samples. Users only need to set the prefix and suffix.
--prefix Normally a fixed name that the user wants to keep using for that sample, for example ttbar2WWnunu.
--suffix Mainly for version control: a name that the user is happy to change to keep the outDS unique, like test_v1, v_05, etc. If a submission failed for v_03, the user can change the suffix to v_04 and submit again.
--outDS The user can override the whole preset and set the outDS manually.
--gridUsername Obtained automatically for a single user. If the user is submitting an official group production, it can be set explicitly, e.g. --gridUsername PHYS-HMBS.
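
The outDS preset can be illustrated with a short Python sketch that joins the fields in the order given above. The function build_out_ds, its parameter names, and the username jdoe are hypothetical; the sketch only mirrors the naming pattern and is not taken from CPGridRun.py.

```python
def build_out_ds(username, prefix, dsid, data_format, tags, suffix, identity="user"):
    """Join the preset fields into an outDS name.

    Pattern: {group/user}.{username}.{prefix}.{DSID}.{format}.{tags}.{suffix}
    Hypothetical helper for illustration only.
    """
    parts = [identity, username, prefix, dsid, data_format, "_".join(tags), suffix]
    # Skip empty fields so a missing suffix does not leave a trailing dot.
    return ".".join(part for part in parts if part)


tags = ["e6337", "s3681", "r13167", "r13146", "p6490"]

# First attempt, then a resubmission with a new suffix to keep the outDS unique.
print(build_out_ds("jdoe", "myTutorial", "410470", "DAOD_PHYS", tags, "v_03"))
print(build_out_ds("jdoe", "myTutorial", "410470", "DAOD_PHYS", tags, "v_04"))
```

Only the suffix changes between the two names, which is the intended workflow when a previous submission has failed.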

Grid configuration

--groupProduction enables some presets for group production, including naming and the arrangement of computing resources. The user is expected to have the proper authentication.
--exec The command line that the user wants to execute on the grid. It must be enclosed in double quotes "". There are a few things users should know before using the CPRun.py preset:
1. Users should not set the input and output flags; they are set automatically to make sure the grid navigation is correct.
2. A working example is simply --exec "CPRun.py -t analysis_config.yaml"
3. To run a custom script: --exec "customRun.py -i inputs -o output --text-config config.yaml --flagA --flagB"

Submission configuration

--noSubmit will NOT submit anything to the grid.
--testRun submits jobs to the grid with a random suffix .test_uuid. It also greatly limits the number of files per job (10) and the number of events (300). It is useful when you want to test a small run on the grid.
--recreateTar When submitting with prun, the user has to ask prun manually to compress the user’s repository with its source code and submit it alongside the job. We found that users often forget to re-compress after updating the source code (and it typically takes a few hours before they realize the mistake), so CPGridRun.py detects whether anything has changed in the source code or build directory. If so, CPGridRun.py asks prun to compress again. Users can also force re-compression with this flag.
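
How CPGridRun.py detects changes is not described here. One common way to implement such a check is to hash the directory contents and compare the result with the digest from the previous submission. The Python sketch below takes that approach as an assumption; tree_digest and tarball_option are hypothetical names, and the choice between --outTarBall and --inTarBall follows the behaviour described earlier in this section.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root):
    """Return a SHA-256 digest over the relative paths and contents under root."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")  # separate the path from the file content
        digest.update(path.read_bytes())
    return digest.hexdigest()


def tarball_option(root, previous_digest, tarball="cpgrid.tar.gz"):
    """Pick --outTarBall when the tree changed, otherwise --inTarBall."""
    current = tree_digest(root)
    flag = "--outTarBall" if current != previous_digest else "--inTarBall"
    return f"{flag} {tarball}", current


# Small demonstration in a temporary directory standing in for the source tree.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "algorithm.py"
    source.write_text("cut = 25\n")
    baseline = tree_digest(tmp)
    print(tarball_option(tmp, baseline)[0])  # unchanged tree
    source.write_text("cut = 30\n")
    print(tarball_option(tmp, baseline)[0])  # edited source
```

Hashing file contents rather than comparing timestamps avoids recompressing after a rebuild that produces identical files, at the cost of reading every file on each submission.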