Direct File Access With Rucio

Last update: 20 May 2019

Besides letting you download datasets from the grid, rucio provides a mechanism for reading data directly from the grid without downloading it first. Note that this is only a good idea in a few specific cases: usually you should run your jobs as regular grid jobs, which avoids large data transfers and takes advantage of the computing resources on the grid. If that is not feasible, download the entire dataset to a local disk via rucio, which is more efficient than direct access and avoids downloading the same dataset repeatedly. The typical reasons for accessing files directly via rucio are:

  • You just want to run a test on a couple of events of a dataset, instead of processing the full dataset (see the sketch after this list). In that case it is much more important that you use your time efficiently and just do things in the way that is most convenient for you, and you are not reading a lot of data overall.

  • You are confident that you will only process this dataset once, and are only using a subset of the data in each event. In that case you will actually be transferring less data than if you copy entire files.

  • The dataset is actually stored locally, i.e. many Tier 1 and Tier 2 sites have a local Tier 3 with high-speed network access to the Tier 1/2 storage elements. In that case it would be a waste of disk space to copy the data to a Tier 3 disk at the same site. Please note that many sites have their own preferred data access mode that they favor over using rucio like this.

  • You don’t actually have any (or enough) storage at your Tier 3, and the CPU at your site would mostly go unused otherwise. Overall this is not an ideal situation, and you may want to buy more disk in the future, but for now you have to make do. Whether it is better to run your jobs on your Tier 3 or directly at the grid site still depends on how CPU intensive they are.
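For the quick-test case in the first bullet, EventLoop can cap the number of events a job processes, so a direct-access test stays short. A minimal sketch, assuming the EventLoop job options used in this tutorial; the limit of 100 events is just an illustrative choice:

  // in ATestRun_eljob.cxx, before submitting the job:
  // process at most 100 events so the test finishes quickly
  job.options()->setDouble (EL::Job::optMaxEvents, 100);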

Warning: While not strictly necessary, it is strongly recommended that you use TTreeCache when accessing files remotely; otherwise your performance is likely to be very poor. So if you haven’t done so already, you should work through the section above on TTreeCache first.
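In EventLoop you can enable TTreeCache through the job options instead of configuring ROOT by hand. A minimal sketch, assuming the standard EventLoop option names; the 10 MB cache size and the 20 learn entries are illustrative values, not a tuned recommendation:

  // in ATestRun_eljob.cxx, after creating the EL::Job:
  // enable a 10 MB TTreeCache and learn the branch access
  // pattern from the first 20 events
  job.options()->setDouble (EL::Job::optCacheSize, 10*1024*1024);
  job.options()->setDouble (EL::Job::optCacheLearnEntries, 20);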

In the past you needed to set up the grid tools yourself before SampleHandler could use them, but now SampleHandler sets them up internally, which both makes life a little easier for you and avoids potential interference between the grid tools and the rest of the analysis software. This means that the first time you call the grid tools from within your analysis script, you will be prompted for your password to set up your VOMS proxy (if you don’t already have one). If you want to, you can still set up the grid tools and VOMS proxy manually.

Navigate to your working area, and from there set up your Analysis Release, following the recommendations in What to do every time you log in above.

Now it’s time to actually use rucio to access files directly. For that, comment out the part of ATestRun_eljob.cxx where we scan the local directory for samples, and instead scan rucio:

  // const char* inputFilePath = gSystem->ExpandPathName ("$ALRB_TutorialData/r9315/");
  // SH::ScanDir().filePattern("AOD.11182705._000001.pool.root.1").scan(sh,inputFilePath);
 
  SH::scanRucio (sh, "data16_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_ZMUMU.repro21_v01/");

(Note we are using a small DAOD_ZMUMU data dataset so your test job can run quickly.)
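Before submitting, you can check what scanRucio actually found by printing the sample handler’s contents. A minimal sketch; setting the nc_tree metadata to CollectionTree follows the convention used elsewhere in this tutorial, so skip that line if your script already sets it:

  // after the SH::scanRucio call in ATestRun_eljob.cxx:
  // the name of the tree to process (as elsewhere in this tutorial)
  sh.setMetaString ("nc_tree", "CollectionTree");
  // list the datasets and files that were found
  sh.print ();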

That should do it. You can now run your script the same way you did before, and with a little luck you will get the same result as before. The initialization will take a little longer, as SampleHandler will query rucio to find the datasets matching your request, and then again for each dataset to locate the actual files. However, compared to the overall run time this overhead should be small, and the power you gain is most likely worth it.
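Note that scanRucio also accepts wildcards, so a single call can pick up several datasets at once; each matching dataset becomes its own sample in the SampleHandler. A minimal sketch; the pattern below is purely illustrative, and only the dataset name used above is known to exist:

  // match every dataset that fits the pattern
  SH::scanRucio (sh, "data16_13TeV.*.physics_Main.*.DAOD_ZMUMU.*");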