All posts by rkparson

Winter Maintenance 2024

Hello Hummingbird user community! We will be offline for a few days in early January to perform our winter maintenance on the cluster.

We will be offline starting Thursday, January 2nd at 8am and will resume service by Sunday, January 5th at 8pm at the latest. We hope this will actually be a shorter window than scheduled, as we only need to perform minor OS updates across the cluster (no big moves and shifts this time).

All cluster access will be shut off starting at 8am, so please ensure all jobs and file transfers are completed prior to this time. I will put the queues into DRAIN mode (no new jobs) on the morning of Tuesday, December 31st at 8am, which will give any remaining jobs 48 hours to complete.
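
For the curious, DRAIN simply means the scheduler stops accepting new work on a partition while jobs that are already running continue. A rough sketch of what this looks like, along with a quick way to check how much time your own jobs have left (the partition name below is illustrative):

    # admin side: stop new jobs on a partition while letting running jobs finish
    scontrol update PartitionName=128x24 State=DRAIN

    # user side: list your jobs with elapsed time (%M) and time remaining (%L)
    squeue -u $USER --format="%.10i %.12P %.10M %.10L %.8T"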

If you have any questions or concerns, please reach out on Slack or open a ticket by emailing hummingbird@ucsc.edu.

Extended Summer Maintenance

Unfortunately, the Hummingbird maintenance window will be extended for another few days due to the complexity of the maintenance and the intervening holiday weekend. We made great progress, but there is still work to do. So far, we have performed the following major improvements:

  • External backup of all core operating system files (not user files)
  • Hummingbird has been upgraded to Alma Linux 9
  • Infiniband and BeeGFS drivers have been updated
  • The newest version of OpenHPC has been installed and configured (HPC management software)
  • Most base packages for the login node have been installed

Remaining work:

  • Network changes pending verification
  • Node provisioning images are still under construction
  • Software modules reconstruction pending
  • Partition rebalancing / queue reconfiguration (we may defer this to a later time)
  • Verify cluster operations (after the above is completed)

While we expect the work to go smoothly, we cannot at this time give a definitive time and date for the restoration to full functionality. We will be posting regular updates to this email list as well as to the Hummingbird Slack channel and website (as appropriate).

Please feel free to contact us at hummingbird@ucsc.edu should you have any questions or concerns. We look forward to a new and improved Hummingbird soon!

Hummingbird Summer Maintenance 2024

The Hummingbird cluster will be going down for our Summer maintenance from Wednesday June 26 at 5:00pm through Monday July 8, 2024 at 9:00am. New job submissions will be restricted after 5pm on the first day of maintenance, but access for users who are retrieving results and copying data will remain available until Sunday June 30, 2024. Beginning on Monday July 1, 2024, access will be restricted for system upgrades until the cluster returns to service on or about Monday July 8, 2024 at 9:00am.

Users are encouraged to retrieve results prior to the beginning of the maintenance window.
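
If you need to copy results off the cluster, something along these lines from your own machine works well (the hostname and paths here are placeholders; substitute your actual username, the login/data-transfer host you normally use, and your directory):

    # copy a results directory from Hummingbird to your local machine
    # -a preserves permissions and timestamps, -v is verbose, -P shows progress and allows resuming
    rsync -avP yourusername@hummingbird-login:/path/to/your/results/ ./results/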

The main focus of this maintenance will be to upgrade the cluster’s operating system from CentOS 7 to Alma Linux 9. This upgrade is critical for security and continued support of the cluster, while also opening the door to additional features such as the ability to run containerized workflows (e.g., Docker and Singularity). The CentOS 7 operating system reaches End-of-Life on June 30, 2024, at which point we will no longer receive security updates.
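
To give a flavor of what a containerized workflow could look like after the upgrade, here is a minimal, purely illustrative sketch of a SLURM batch script that pulls a Docker image with Singularity and runs a command inside it. The image, resource values, and job name are assumptions, not a supported recipe:

    #!/bin/bash
    #SBATCH --job-name=container-demo
    #SBATCH --mem=4G
    #SBATCH --time=00:30:00

    # convert a Docker image into a Singularity image file (one-time step)
    singularity pull python.sif docker://python:3.11

    # run a command inside the container
    singularity exec python.sif python3 --version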

Planning ahead will make this transition easier, so please let us know if you have any questions or concerns about completing your jobs or retrieving your results before the maintenance window begins. We will endeavor to complete this upgrade as quickly and efficiently as possible; as always, we hope to have the cluster back before the end of the scheduled period, but please check in on Slack for updates on timing.

Here is a detailed view of the maintenance steps:

  1. Create external backup of all core operating system files (note – this is not client research data; this is ONLY system-critical files)
  2. Break the mirroring of the boot drive (this allows us to roll back to the previous state easily if needed; see the sketch after this list)
  3. Target one of the boot disks from the mirror pool; format the disk and install the new OS on it (this will be Alma Linux 9.4, which has an end of active support in June 2027)
  4. Install InfiniBand drivers (These need to be custom built to get the optimal functionality)
  5. Install BeeGFS drivers (This is how we communicate with our file storage system)
  6. Install and configure OpenHPC supplemental ecosystem (This provides SLURM and all the software needed to run the cluster)
  7. Verify cluster operations
  8. Return cluster to service on or before July 8, 2024
  9. Supplemental work (see below) with additional testing and verification:
    • Small reconfigurations in some of our queues
    • Restructuring our module system, including removing modules that are no longer used
    • Adding additional cluster maintenance software to ease and enrich the services we can provide (potentially allowing us to provide an open science gateway for web-based job submission)
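
For those curious about step 2, breaking the mirror on a standard Linux software RAID (md) setup looks roughly like the following; the array and device names are illustrative, not our actual configuration:

    # inspect the existing RAID1 boot mirror
    mdadm --detail /dev/md0

    # mark one member as failed and remove it from the array;
    # the removed disk keeps an intact copy of the old OS for easy rollback
    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1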

Winter Maintenance 2023

Heads up! Hummingbird Winter Maintenance will happen from 21 DEC 2023 through 04 JAN 2024. The system will be off-line so that we can migrate data over to the new high-speed, parallel access file servers. Users will need to have all jobs completed and results copied off by midnight on 20 DEC 2023. Jobs that are still running at that time will be terminated.

Since we are copying over home directories, please lend us a hand by taking a few minutes to curate your home directory. If you have old or unused data, please delete it to reduce the overall size of the data we must copy. Be aware that if there are any results or data that you really need to keep safe, it’s best to copy them off of Hummingbird well before the maintenance window begins.
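
A quick way to see what is taking up space in your home directory (the path in the last command is just an example; double-check before deleting anything):

    # list the largest items in your home directory
    du -sh ~/* | sort -rh | head -20

    # check your total usage
    du -sh ~

    # remove data you no longer need
    rm -r ~/old_project_scratch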

We plan to resume normal operations by 05 JAN 2024 (start of the new term), but because we have over 200TB of data to copy, and we cannot resume operations until the copy is completed, we may run over into the weekend.

If you are unsure whether you can complete your work by then, require assistance moving, copying, or deleting data, or need help formulating checkpoints so you can efficiently resume a job after the maintenance period ends, please contact hummingbird@ucsc.edu to open a ticket.
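
As a starting point for checkpointing, here is a minimal sketch of a SLURM batch script that asks the scheduler to signal the job shortly before its time limit so the application can write a checkpoint. The application name, its flags, and the checkpoint file are hypothetical; adapt them to your own code:

    #!/bin/bash
    #SBATCH --job-name=ckpt-example
    #SBATCH --time=04:00:00
    #SBATCH --mem=5000
    # send SIGUSR1 to this batch script roughly 5 minutes before the time limit
    #SBATCH --signal=B:USR1@300

    # launch the (hypothetical) application in the background and record its PID
    ./my_simulation --resume checkpoint.dat &
    APP_PID=$!

    # forward the signal to the application so it can write its checkpoint
    trap "kill -USR1 $APP_PID" USR1

    # the first wait returns if the signal arrives; wait again so the
    # application has time to finish writing its checkpoint and exit
    wait $APP_PID
    wait $APP_PID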

Short Maintenance Window: Wednesday September 20, 2023, 6:00am – 7:00am

Dear Hummingbird Users,


Please be advised that we are conducting a short maintenance cycle on Wednesday September 20, 2023 from 6:00am to 7:00am.

During this window, we will be rebooting the cluster login node. At that time, you will not be able to log in to the cluster. In-flight jobs will continue, but pending jobs may be disrupted; if you submitted a job before the maintenance window and it was in a pending state, please check that it launched properly once we announce that the maintenance window has closed. You can still log into hb-feeder to access files and move data, but please log in to that machine directly (not through the login node). Any “screen” or “tmux” (or similar) sessions should be closed manually before the maintenance window begins; sessions left open will be terminated automatically on reboot.
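
If you are not sure whether you have stray sessions, these commands will list and close them (the session names are illustrative):

    # list and close any screen sessions
    screen -ls
    screen -S mysession -X quit

    # list and close any tmux sessions
    tmux ls
    tmux kill-session -t mysession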

Please feel free to email hummingbird@ucsc.edu if you have any questions or concerns regarding this maintenance window.

Summer Maintenance 2023

July 31 – August 4

The Hummingbird cluster will shut down for a maintenance window at the beginning of August. All jobs must be completed by 8:00am (PDT) July 31, 2023. Access will be turned off starting July 31 and continue through August 4, 2023. Users are strongly encouraged to retrieve results before the maintenance period begins.

For this maintenance cycle, we will make some minor hardware upgrades to prepare for future improvements and perform minor software updates.

Fall 2021 Maintenance

September 9th – September 13th


Changes and updates that were performed:

  • Migrated authentication from using CruzID Blue to using CruzID Gold
  • Updated the head node, login node, storage node, and all compute nodes to the newest CentOS 7 version (CentOS 7.9.2009, kernel 3.10.0-1160.42.2.el7.x86_64) and updated all installed packages
  • Implemented Quality of Service (QoS) features on the cluster (a user-side way to view these limits is sketched at the end of this list):
    • the cluster is now set to MaxJobs=3 and MaxSubmitJobs=9
    • created a QoS per partition, named after the partition, with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9
    • created an account per partition with the default QoS and allowed QoS set to the partition name
    • created all users with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9 and enabled access to the 4 open access partitions
    • configured the partition definitions with their appropriate QoS and accounts
    • set up a script that fires on account login to grant the user access to the 4 default partitions/QoS
    • set up a sudo ACL to allow non-privileged users to run that script upon login
    • modified the 4 open access partitions to allow “ALL” groups rather than ucsc_p_all_usrs
    • net result of all these changes: 3-node limits are now more strictly enforced than before (we had an edge case where users submitting multiple single-node jobs could exceed the 3-node limit)
  • Set an intentionally low “default memory per node” value for the cluster, to encourage users to learn about memory allocation during job submission and to help ensure optimal utilization of cluster resources
    • short version: users should be sure to include #SBATCH --mem=5000 (or whatever value fits their job) in their SLURM script preamble
  • Slightly re-ordered the queue (swapped the positions of nodes 3 and 5, so now the Instruction queue is contiguously numbered)
  • Installed the latest NVIDIA drivers for the GPU compute node
    • Installed new CUDA toolkit as well (using version 11.1 for compatibility reasons, as PyTorch is not built for 11.4 yet)
    • Created new Python stack for GPU (Python 3.9.7, TensorFlow 2.6.0, Torch 1.9.0)
  • Resolved networking issue that was causing intermittent boot problems for the compute nodes
  • Pruned old user accounts that have been unused for >2 years and no longer have valid UCSC login authorization
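
If you want to see how the QoS limits described above apply to your own account, the following sacctmgr queries give a user-side view (output columns may vary slightly with your SLURM version):

    # show the per-partition associations for your user, with their QoS and limits
    sacctmgr show assoc user=$USER format=Account,Partition,QOS,MaxJobs,MaxSubmit,MaxTRES

    # show the QoS definitions themselves
    sacctmgr show qos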