Basics of using the Grid

Last update: 25 Jun 2022

To run analysis jobs on the Grid you will need some mechanism to send your software and files to the appropriate Grid site with the datasets you want to run over, and some tool to monitor the progress of those jobs. In this tutorial we will show you how to submit your jobs to the Grid using a tool called PanDA (Production and Distributed Analysis).

Below are some monitoring links that can help you track the status of your jobs:

  • Big PanDA: The default PanDA monitoring site for your jobs
  • ADC monitoring tools: Contains all the tools you could need to monitor grid operations.

We will use these tools throughout the tutorial to check on the status of the jobs.

Submitting jobs to the Grid using PanDA

The PanDA clients package contains a number of tools you can use to submit and manage analysis jobs on PanDA.

Detailed information about each of the tools can be found at the following links.

  • pathena is used to submit Athena user jobs to PanDA
  • prun is used to submit more general jobs (e.g. ROOT and Python scripts)
  • phpo is used for hyperparameter optimization jobs
  • pbook is a Python-based bookkeeping tool for all PanDA analysis jobs (see the short example after this list).
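
For example, once you have submitted some analysis tasks, pbook gives you an interactive prompt from which you can inspect and manage them. A minimal sketch (the exact commands can vary between pbook versions; <taskID> stands for a jediTaskID):

pbook
# then, at the pbook prompt, for example:
#   show()           # list your recent tasks and their status
#   show(<taskID>)   # show the details of one task
#   retry(<taskID>)  # retry the failed jobs of a task
#   kill(<taskID>)   # kill a task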

The simplest possible PanDA job using prun

On lxplus, the client is already installed, so to use it you only need to do:

setupATLAS
lsetup panda

Here we have set up the CVMFS software environment and then set up the PanDA clients.
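
If you want to check that the client is now available in your environment, you can, for example, run:

which prun
prun --help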

We will now try to run a ‘Hello world’ job with prun!

Create a new directory and go into it (this is important, as prun sends to the grid (almost) all files in, and below, your current directory):

mkdir prunTest
cd prunTest

Now, create a Python script called HelloWorld.py (using your favourite editor) that contains the following lines:

#!/usr/bin/env python3
print("Hello world!")

We can make the Python script executable with:

chmod u+x HelloWorld.py

And run it locally using:

./HelloWorld.py

Now that we have tested the job locally (it’s always important to do this), we can submit the prun command to run this script on the grid.

prun --outDS user.<nickname>.pruntest --exec HelloWorld.py

Here <nickname> is your grid nickname/grid name (which is the same as your lxplus username).

This will submit a task to the grid with two jobs, one build job that recreates your job environment and then one corresponding to the actual Hello World job. The build job will execute first, and once it has finished the Hello World job will be executed. When the second job has finished, we will try to find the ‘Hello world’ message in the output!

At each submission prun will display a number called the jediTaskID, which identifies the submitted task; we will need this number in a minute.

To monitor the progress of a job and check its log files and output, we can use the search functionality on the BigPanDA monitor front page, https://bigpanda.cern.ch. Scroll down to the Task ID field, enter the number we noted above, and click on search.

This will send you to a page associated with this task, which shows that there are two jobs (in some stage of running). Have a look at this page to see the various pieces of information provided.

We will need both jobs to finish before we can look at the output. If the jobs do not seem to have started running after a few minutes, it is suggested to carry on with the tutorial and to check back frequently on the jobs’ status.

From the web page, search for the link labeled job list (access to job details and logs), and click on it. (If you want to get to this page directly, you can enter the URL http://bigpanda.cern.ch/jobs/?jeditaskid=4203786 and modify the task ID to the corresponding number.)

If you click on a particular job link you should now see ‘Logs’ on the left; hover over this and click ‘Log Files’ in the drop-down menu.

The log containing the Hello World output is called payload.stdout; open it and try to find the “Hello world!” message.

This forms the basis of simple debugging of jobs that fail on the Grid. Since you will have tested the job locally first, any problem here may just be a transient grid error, but it is useful to know how to search the output files for problems. Note, it is also possible to download the log files as a dataset using rucio.

Now, because we did not need to compile any code to run this job - it’s just a simple Python script - we do not actually need the build stage of the job.

To run prun without the build stage, type the following command:

prun --noBuild --outDS user.<nickname>.pruntest --exec HelloWorld.py

Now, only the script will be run. Usually though, you will probably want to compile some code first. You can read more about the PanDA clients here.
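
For example, if your analysis needs a compilation step, prun can run a build command for you in the build job via its --bexec option. A rough sketch (the Makefile, the MyAnalysis executable and the output dataset name here are hypothetical, purely for illustration):

prun --bexec "make" --exec "./MyAnalysis" --outDS user.<nickname>.prunbuildtest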

Using the BigPanDA monitor / ATLAS Dashboard to monitor the job

So far, only the basics of the ATLAS BigPanDA monitoring have been described. As an optional exercise, see if you can find the link that shows all of your jobs. This is a good page to bookmark and come back to in the future.

Retrieving the log file from the Grid

Above, we saw how to find and open the log file within the web browser. But what if we wanted to download it? Here we can use the Rucio tools.

Once one of the above jobs has completed, we will find and download the log file. Note: when using rucio, it is almost always better to run it in a separate terminal from the one where you are running your code or submitting grid jobs, in order to minimise potential conflicts between different Python versions.

Set up the rucio tools if you haven’t already:

lsetup rucio
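
If that terminal does not yet have a valid grid proxy, the rucio commands below will not be able to authenticate. Assuming your grid certificate is installed, you can create a proxy with:

voms-proxy-init -voms atlas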

Go back to the BigPanDA web page and find the page for the taskID that we noted previously. Search for the box labelled “Output containers” and note the log container name, e.g. user.aparker.pruntest.log. Back in the terminal, we will try to find this log dataset on the grid.

$ rucio list-dids user.aparker:*pruntest*log*
+--------------------------------------------------+--------------+
| SCOPE:NAME                                       | [DID TYPE]   |
|--------------------------------------------------+--------------|
| user.aparker:user.aparker.pruntest.log           | CONTAINER    |
| user.aparker:user.aparker.pruntest.log.340520924 | DATASET      |
+--------------------------------------------------+--------------+

This now allows us two options:

  1. Download the container, and all log files within it (e.g. if the task contained many subjobs), or
  2. Download just the dataset specific to this single set of jobs.

Here we download just the dataset (option 2); a sketch of the container download is shown after the extraction step below.

rucio download user.aparker:user.aparker.pruntest.log.340520924

cd user.aparker.pruntest.log.340520924/
tar -xvf user.aparker.pruntest.log.23186476.000001.log.tgz

This will give you access to the log file (and in fact much more information) related to your job. This can be useful for debugging.

There will be a lot of information in here, but once you have extracted the logs the file you are probably looking for is payload.stdout.
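
For completeness, option 1 above (downloading the whole container, and therefore every log dataset inside it) would instead use the container name we noted from the task page:

rucio download user.aparker:user.aparker.pruntest.log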

Running a simple ROOT script using prun

It is possible to set up ROOT and run ROOT-based code on the grid too.

First you should log out of lxplus, and then log back in. We will create a new directory area:

mkdir -p tutorial/grid/RootGridTest
cd tutorial/grid/RootGridTest
setupATLAS

Next we will set up a standalone version of ROOT (and also the PanDA client tools); you can see which versions are available by typing:

lsetup 'root --help'

In our case, we will use the following version (see how we added panda as well, so that all steps are configured together):

lsetup "root 6.20.06-x86_64-centos7-gcc8-opt" panda

Next we will create a simple macro that creates a ROOT file and a histogram, fills the histogram with random values, and writes the output.

Create a file called HistTest.C and copy and paste the following lines into it:

void HistTest() {
  // open an output file, book a histogram, fill it with Gaussian
  // random numbers, then write it out and close the file
  TFile * foo = TFile::Open("foo.root","recreate");
  TH1D * h = new TH1D("h_gaus","h_gaus",30,-5,5);
  TRandom3 rand(0);
  for (unsigned int i=0; i< 100000; ++i) {
    h->Fill(rand.Gaus(0.2,1.0));
  }
  h->Write();
  foo->Close();
}

As usual, we first check that the code runs normally locally:

root -b -q HistTest.C

You should see that a ROOT file called foo.root was created, containing a single histogram.
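
If you want to inspect the file interactively, you can open it with ROOT and list its contents (a quick sketch):

root -l foo.root
# then, at the ROOT prompt:
#   .ls               lists the keys in the file; you should see h_gaus
#   h_gaus->Draw()    draws the histogram (if you have a graphical session)
#   .q                quits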

We will now run the same command on the grid and retrieve the output into a rucio dataset. (Note: remove the local foo.root file you just made before submitting.)

Run the command:

prun --exec="root -b -q HistTest.C" --nJobs=1 --outputs=foo.root \
--outDS=user.<nickname>.prunroottest1 --rootVer=6.20/06 \
--cmtConfig=x86_64-centos7-gcc8-opt

Remember to replace <nickname> again.

We added some arguments to the prun command (the ROOT version and config) so that the same version of ROOT is instantiated on the grid worker node. Note that certain files, such as ROOT files like foo.root, are automatically uploaded to the grid site storage with the grid job.

Once the job completes successfully, you should be able to use rucio to download the dataset containing the root file output.
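
As a rough sketch of that step (the exact output dataset name is built from the --outDS and --outputs values, so it is safest to copy it from the “Output containers” box on the task page):

rucio list-dids user.<nickname>:*prunroottest1*
rucio download user.<nickname>.prunroottest1_foo.root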

Unfortunately, the job will probably take longer than the tutorial session, but if it does finish and you are having trouble with this step, let us know.