Hummingbird Summer Maintenance 2024

The Hummingbird cluster will go down for summer maintenance from Wednesday, June 26, 2024 at 5:00pm through Monday, July 8, 2024 at 9:00am. New job submissions will be restricted after 5:00pm on the first day of maintenance, but access for users who are retrieving results and copying data will remain available through Sunday, June 30, 2024. Beginning on Monday, July 1, 2024, access will be restricted for system upgrades until the cluster returns to service on or about July 8, 2024 at 8:00am.
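For anyone scripting around these dates, the schedule above can be sketched as a small Python helper. This is only an illustration: the phase labels and the function name are hypothetical, the times are assumed to be cluster-local, and the return time is approximate.

```python
from datetime import datetime

# Dates and times are taken from the announcement and treated as cluster-local
# time (an assumption; the announcement does not state a time zone).
SUBMISSIONS_CLOSE = datetime(2024, 6, 26, 17, 0)  # new job submissions restricted
ACCESS_CLOSES = datetime(2024, 7, 1, 0, 0)        # start of Monday, July 1
# The announcement mentions both 8:00am and 9:00am on July 8; the later time
# is used here, and the actual return to service is "on or about" that date.
WINDOW_END = datetime(2024, 7, 8, 9, 0)


def access_phase(when: datetime) -> str:
    """Return a hypothetical label for the expected access level at `when`."""
    if when < SUBMISSIONS_CLOSE:
        return "normal"
    if when < ACCESS_CLOSES:
        return "data-retrieval-only"
    if when < WINDOW_END:
        return "restricted"
    return "expected-back"


if __name__ == "__main__":
    for stamp in (
        datetime(2024, 6, 26, 12, 0),
        datetime(2024, 6, 28, 9, 0),
        datetime(2024, 7, 3, 9, 0),
    ):
        print(stamp.isoformat(), access_phase(stamp))
```

A job script could call `access_phase(datetime.now())` to decide whether it still makes sense to queue follow-up work before the window opens.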

Users are encouraged to retrieve results prior to the beginning of the maintenance window.

The main focus of this maintenance will be to upgrade the cluster’s operating system from CentOS 7 to Alma Linux 9. This upgrade is critical for security and continued support of the cluster, and it also opens the door to additional features such as the ability to run containerized workflows (e.g., Docker and Singularity). CentOS 7 reaches end of life on June 30, 2024, after which we will no longer be able to get security updates for it.

Planning ahead will make this transition easier, so please let us know if you have any questions or concerns about completing your jobs or retrieving your results before the maintenance window begins. We will endeavor to complete this upgrade as quickly and efficiently as possible. As always, we hope to have the cluster back by the end of the maintenance period, but please check in on Slack for updates on timing.

Here is a detailed view of the maintenance steps:

  1. Create an external backup of all core operating system files (note: this covers ONLY system-critical files, not client research data)
  2. Break the mirroring of the boot drive (this allows us to roll back to the previous state easily if needed)
  3. Target one of the boot disks from the mirror pool; format the disk and install the new OS on it (This will be Alma Linux 9.4, which has an end of active support in June 2027)
  4. Install InfiniBand drivers (These need to be custom built to get the optimal functionality)
  5. Install BeeGFS drivers (This is how we communicate with our file storage system)
  6. Install and configure OpenHPC supplemental ecosystem (This provides SLURM and all the software needed to run the cluster)
  7. Verify cluster operations
  8. Return cluster to service on or before July 8, 2024
  9. Supplemental work (see below) with additional testing and verification:
    • Small reconfigurations in some of our queues
    • Restructuring our module system, including removing modules that are no longer used
    • Adding cluster maintenance software to ease and enrich the services we can provide (potentially allowing us to provide an open science gateway for web-based job submission)
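
Once the cluster is back in service, users who want to confirm that a node is running the new operating system could check the standard /etc/os-release file. The following is a minimal sketch, not part of the maintenance procedure itself; the function names are hypothetical, and it assumes the new OS reports ID "almalinux" with a VERSION_ID beginning with 9.

```python
import shlex
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex.split removes the shell-style quoting used in os-release values
        parts = shlex.split(raw)
        fields[key] = parts[0] if parts else ""
    return fields


def is_upgraded(fields: dict) -> bool:
    """True if the fields describe Alma Linux with major version 9 (assumed IDs)."""
    major = fields.get("VERSION_ID", "").split(".")[0]
    return fields.get("ID") == "almalinux" and major == "9"


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.exists():
        info = parse_os_release(path.read_text())
        print(info.get("PRETTY_NAME", "unknown"), "-> upgraded:", is_upgraded(info))
    else:
        print("No /etc/os-release found on this system")
```

Running the script on a login node after the maintenance window is one simple way to see whether software built against CentOS 7 may need to be rebuilt.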