Reporting an issue with ATLAS software

Last update: 06 Aug 2024

While a comprehensive suite of tests and procedures is employed to ensure that ATLAS software runs without issues for all known and supported workflows, given the scope of the project it is inevitable that problems will sometimes occur, resulting in crashes or other failures (missing outputs, configuration options being ignored, heavy technical overheads, etc.). This is especially true when developing new code: aside from issues directly introduced by the code in development, the new code may also trigger paths through existing code that have not previously been fully explored.

Fortunately, there are many experts who are happy to help with such issues. However, to help as efficiently as possible, it is crucial that the appropriate information is provided. Multiple iterations cost time (especially if experts and developers sit in different time zones), so please take note of the following considerations, which apply regardless of which channel is used to report the issue.

Information needed for efficient resolution of problems

Please include the following items if possible when reporting a problem:

  • Excerpt from a log file showing details of the issue. Typically, serious run-time issues in Athena jobs will trigger the printing of ERROR or FATAL messages. These will often allow experts to identify the component and the part of the code where the issue is occurring, and give some information on the nature of the issue. Please include details of any ERROR or FATAL messages directly in your report. If your job crashes without such messages being printed (this should be rare), please include the last few lines of Athena output, as well as e.g. the first lines of any system messages/stack trace.
  • Link to the full log file. Sometimes, the issue causing the crash can be related to a (non-FATAL) ERROR or WARNING message much earlier in the job, or to an incorrect configuration being applied. This may only be apparent when looking through the full log file, which gives the proper context. Since log files can be very long, please do not paste them directly into the mail. Rather, place them in an accessible location (e.g. CERNbox) and provide a link to them.
  • Reproducer for the job. The expert will likely want to know at an absolute minimum which Athena release (or nightly) you are running when the issue occurs (and if the job was working previously, in what release it was last successful). In addition, a way for the expert to run the job themselves to further diagnose the issue should be provided.
    • This should be the simplest and quickest possible way for the issue to be reproduced. This means that in case it is related to a specific event in a file, this should be specified so that it can be skipped to. Any steps or algorithm chains which are not implicated in the crash should be disabled if possible, so that the issue can be reached as quickly as possible.
    • The job should not rely on files residing on local filesystems. Any input files, or other files accessed during the job, should ideally be placed at a location from which the expert can read them. If this is not possible, a way to download the files should be provided. Please avoid requiring experts to download files of multiple GB if at all possible.
    • In case the issue only occurs when applying local changes on top of an Athena release, please make sure this code is committed to git and a link to the relevant branch is provided. It may be practical at this stage to open a Draft Merge Request, to allow the expert to see the full changeset and comment on it inline via the GitLab Web UI.
    • Test your reproducer. Make sure that your reproducer can actually be run and successfully demonstrates the issue. This should be done in a clean working directory and ideally on a resource like lxplus which the expert would also have access to, in order to ensure no dependence on local files/resources which may not be accessible.
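
As an illustration of the first item in the list above, the following is a minimal Python sketch for pulling a log excerpt out of a long log file. It is not an official ATLAS tool: the function names `extract_excerpt` and `excerpt_from_file` are hypothetical, and it assumes only that the message level (ERROR or FATAL) appears as a standalone word on the log line.

```python
import re
from collections import deque
from pathlib import Path

# Assumption: the message level appears as a standalone word on the line.
LEVEL_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def extract_excerpt(lines, tail=20):
    """Return (flagged, last_lines) for an iterable of log lines.

    flagged    -- list of (line_number, text) for lines containing ERROR or FATAL
    last_lines -- the final `tail` lines, useful if the job crashed without
                  printing any ERROR or FATAL message
    """
    flagged = []
    last_lines = deque(maxlen=tail)  # keeps only the most recent `tail` lines
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if LEVEL_PATTERN.search(text):
            flagged.append((number, text))
        last_lines.append(text)
    return flagged, list(last_lines)


def excerpt_from_file(path, tail=20):
    """Read a log file from disk and extract the excerpt (hypothetical helper)."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    with Path(path).open(errors="replace") as handle:
        return extract_excerpt(handle, tail=tail)
```

Calling `excerpt_from_file` on your own log file returns the flagged lines with their line numbers, which can be pasted into the report alongside the link to the full log.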

While this can seem like extra work, please be aware that if you do not provide this, it is very likely that the expert will ask you for it later (or they will have to spend time themselves in order to work out how to reproduce it). Especially if this is a problem with your work which the expert is helping to resolve, the onus should be on the reporter to do as much preparation as possible to reduce the burden on experts.
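
One way to check a report before sending it is to treat the items above as a simple checklist. The sketch below is purely illustrative: the `IssueReport` class, its field names, and the choice of which items count as expected are assumptions made for this example, not part of any ATLAS tool.

```python
from dataclasses import dataclass

# Assumption for this sketch: these items are the ones an expert will
# almost always ask for; the other fields are optional.
EXPECTED_ITEMS = ("release", "log_excerpt", "full_log_link", "reproducer")


@dataclass
class IssueReport:
    """Hypothetical container for the items an issue report should contain."""

    summary: str
    release: str = ""            # Athena release or nightly where the issue occurs
    last_good_release: str = ""  # optional: last release in which the job worked
    log_excerpt: str = ""        # ERROR/FATAL lines, or the last lines of output
    full_log_link: str = ""      # link to the full log in an accessible location
    reproducer: str = ""         # commands to run, tested in a clean directory
    branch_link: str = ""        # optional: branch or Draft Merge Request

    def missing_items(self):
        """Return the names of expected items that are still empty."""
        return [name for name in EXPECTED_ITEMS if not getattr(self, name).strip()]

    def render(self):
        """Build a plain-text report suitable for a mail or ticket description."""
        lines = [f"Summary: {self.summary}", f"Release: {self.release}"]
        if self.last_good_release:
            lines.append(f"Last working release: {self.last_good_release}")
        lines.append(f"Full log: {self.full_log_link}")
        if self.branch_link:
            lines.append(f"Local changes: {self.branch_link}")
        lines.append("Reproducer:")
        lines.append(self.reproducer)
        lines.append("Log excerpt:")
        lines.append(self.log_excerpt)
        return "\n".join(lines)
```

Checking `missing_items()` before calling `render()` makes it harder to forget, for example, the link to the full log file.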

Where to report issues

Before reporting an issue, you should first check that it has not already been reported. To do so, check the egroup archives and JIRA (see below) to see whether the issue is already under discussion.

If you have an expert that you have corresponded with directly in the past, it is often tempting to contact them directly. While in many cases this will lead to a quick resolution of the problem, it is often better to use other channels in order to reach multiple experts simultaneously and also to have the issue and its resolution (as well as other potentially useful information) available for others.

ATLAS-talk (Discourse) / Mailing lists

The dedicated ATLAS-talk Categories, which are also set up as mailing lists, are in many cases the best place to report an issue initially. For Athena issues, a good starting point is Athena Help / atlas-sw-help@cern.ch. This will ensure the widest range of people will see it, which can be useful in case the specific area in which the issue lies is not known. For issues in Trigger software, Trigger Help / atlas-trigger-help@cern.ch will usually be the more appropriate list. Your post may be redirected to other lists or experts (or have them added in cc to the thread) if appropriate. If the thread becomes long, the discussion may then move to JIRA.

Make sure you log in to ATLAS-talk at least once to set up your account. You have significant flexibility around notifications and other settings for the site. You can also check the archive for each Category before asking your question if you wish; if you use the ATLAS-talk web interface for submitting questions, it will automatically suggest some topics that appear similar, if there are any.

JIRA

The majority of software issues and developments are discussed in detail in JIRA tickets. Discussions started in mail threads may end up here, but you can also simply open a ticket relating to an issue directly. There are multiple JIRA projects covering different aspects of ATLAS software. These include:

  • ATLASRECTS: Issues related to reconstruction
  • ATLASSIM: Issues related to simulation
  • ATR: Issues related to trigger
  • ATLINFR: Issues related to software infrastructure (nightly builds, CI system, ART, etc.)

as well as many more for other activities and domains. If you use the wrong project, the ticket can usually be easily moved to a different project. Tickets in different projects may differ slightly in the options available, but typically the following points should be observed:

  • There may be a list of components and labels available - please look through these to select what seems like the most fitting when creating a ticket. This may determine who will get an automatic notification about the ticket, and who it gets assigned to initially.
  • Please select an appropriate priority. Note that tickets marked with priority low may take longer to be looked at, but also consider whether the issue will affect many people or just you before choosing priorities like high, critical or blocker.
  • Other people can also be notified on JIRA tickets. If there are people you would cc on a mail, you can also do so on JIRA tickets. In the issue description or a comment, you can type @ followed by the first letters of a username to get autofill options; the user you select receives a one-time notification. If you think someone should get notifications for all activity on the ticket, you can add them as a watcher (if the permissions allow). Please be aware that they will then get notifications for all activity (including edits to text, settings being changed, etc.).

GitLab/GitHub issues

While we do not use GitLab issues for issue reporting in Athena, other related projects such as GeoModel or ACTS do use these issue trackers, and so you might be asked to open an issue there in some cases.

Mattermost

In addition to the above, several Mattermost channels exist for software help, such as general-atlas-software-discussion and HLT Software. These are frequently very useful for quick questions and advice. However, because mailing lists and JIRA are more easily searchable, issue reporting should be done via those channels.