Maintenance Calendar

Winter Maintenance 2023

Heads up! Hummingbird Winter Maintenance will happen from 21 DEC 2023 through 04 JAN 2024. The system will be offline so that we can migrate data over to the new high-speed, parallel-access file servers. Users will need to have all jobs completed and results copied off by midnight on 20 DEC 2023. Jobs that are still running at that time will be terminated.

Since we are copying over home directories, please lend us a hand by taking a few minutes to curate your home directory. If you have old data, or data you are no longer using, please delete it to reduce the overall amount of data we must copy. If there are any results or data you really need to keep safe, it is best to copy them off of Hummingbird well before the maintenance window begins.
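If you need to pull results down to your own machine, rsync over SSH is one straightforward option. Here is a minimal sketch; the login hostname, username, and both paths are placeholders, so substitute your own:

    # Run from your local machine, not from the cluster.
    # <login-host>, <username>, and both paths are placeholders -- substitute your own.
    rsync -avh --progress <username>@<login-host>:~/results/ ~/hummingbird-backup/results/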

We plan to resume normal operations by 05 JAN 2024 (the start of the new term), but because we have over 200 TB of data to copy and cannot resume operations until the copy is complete, we may run over into the weekend.

If you are unsure whether you can complete your work by then, require assistance moving, copying, or deleting data, or need help formulating checkpoints so you can efficiently resume a job after the maintenance period ends, please contact hummingbird@ucsc.edu to open a ticket.
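To illustrate the kind of checkpointing we mean: if your application can save its state to a file and resume from it, your job script can check for that file at startup. The sketch below is hypothetical; my_app, its --resume-from / --checkpoint-every / --checkpoint-file flags, and the partition name are placeholders for whatever your own software supports.

    #!/bin/bash
    #SBATCH --job-name=ckpt_example      # placeholder job name
    #SBATCH --partition=128x24           # placeholder partition; use the one you normally submit to
    #SBATCH --time=24:00:00
    #SBATCH --mem=5000

    CKPT=checkpoint.dat                  # file the application writes its state to

    if [ -f "$CKPT" ]; then
        # Resubmitted after an interruption (e.g. the maintenance window): resume from the last save.
        ./my_app --resume-from "$CKPT"
    else
        # Fresh run: ask the application to write a checkpoint periodically.
        ./my_app --checkpoint-every 30m --checkpoint-file "$CKPT"
    fi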

Short Maintenance Window: Wednesday, September 20, 2023, 6:00am to 7:00am

Dear Hummingbird Users,


Please be advised that we are conducting a short maintenance cycle on Wednesday, September 20, 2023, from 6:00am to 7:00am.

During this window, we will be rebooting the cluster login node, so you will not be able to log in to the cluster at that time. In-flight jobs will continue, but pending jobs may be disrupted. If you submitted a job before the maintenance window and it was in a pending state, please check that it launched properly once we announce that the maintenance window has closed. You can still log into hb-feeder to access files and move data, but please log in to that machine directly (not through the login node). Any “screen” or “tmux” (or similar) sessions should be closed manually before the maintenance window begins; sessions left open will be terminated automatically on reboot.
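For reference, here are a few commands that may help around the window; the session names are placeholders, and the sacct start time simply matches the maintenance date:

    # Before the window: list and close any terminal multiplexer sessions you own.
    screen -ls                              # show your screen sessions
    screen -S <session-name> -X quit        # close one (placeholder session name)
    tmux ls                                 # show your tmux sessions
    tmux kill-session -t <session-name>     # close one (placeholder session name)

    # After the window is announced closed: confirm that pending jobs launched properly.
    squeue -u $USER                         # your queued and running jobs with their states
    sacct -u $USER --starttime=2023-09-20   # recent job records, including any failures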

Please feel free to email hummingbird@ucsc.edu if you have any questions or concerns regarding this maintenance window.

Summer Maintenance 2023

July 31 – August 4

The Hummingbird cluster will shut down for a maintenance window at the beginning of August. All jobs must be completed by 8:00am (PDT) on July 31, 2023. Access will be turned off starting July 31 and will remain off through August 4, 2023. Users are strongly encouraged to retrieve results before the maintenance period begins.

For this maintenance cycle, we will make some minor hardware upgrades to prepare for future improvements and perform minor software updates.

Fall 2021 Maintenance

September 9th – September 13th


Changes and updates that were performed

  • Migrated authentication from CruzID Blue to CruzID Gold
  • Updated the head node, login node, storage node, and all compute nodes to the newest CentOS 7 version (CentOS 7.9.2009, kernel 3.10.0-1160.42.2.el7.x86_64) and updated all installed packages
  • Implemented Quality of Service (QoS) features on the cluster (a rough sketch of the corresponding sacctmgr commands follows this list):
    • The cluster is now set to MaxJobs=3 and MaxSubmitJobs=9
    • Created a QoS per partition, with the same name as the partition, with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9
    • Created an account per partition with the default QoS and allowed QoS set to the partition name
    • Created all users with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9 and enabled access to the 4 open-access partitions
    • Configured the partition definitions to have their appropriate QoS and accounts defined
    • Set up a script that fires on account login to set the user up with access to the 4 default partitions/QoS
    • Set up a sudo ACL to allow non-privileged users to run the add script upon login
    • Modified the 4 open-access partitions to allow "ALL" groups rather than ucsc_p_all_usrs
    • Net result of all these changes: 3-node limits are now more strictly enforced than before (we had an edge case where users submitting multiple single-node jobs could exceed the 3-node limit)
  • We have set an intentionally low “default memory per node” value for the cluster, to encourage users to learn about memory allocation during job submission and to ensure optimal utilization of cluster resources.
    • Short version: users should be sure to include #SBATCH --mem=5000 (or whatever value fits their job) in their Slurm script preamble; an example preamble follows this list.
  • Slightly re-ordered the queue (swapped the positions of nodes 3 and 5, so now the Instruction queue is contiguously numbered)
  • Installed the latest NVIDIA drivers for the GPU compute node
    • Installed a new CUDA toolkit as well (using version 11.1 for compatibility reasons, as PyTorch is not yet built for 11.4)
    • Created a new Python stack for the GPU (Python 3.9.7, TensorFlow 2.6.0, Torch 1.9.0); a quick verification snippet follows this list
  • Resolved a networking issue that was causing intermittent boot problems for the compute nodes
  • Pruned old user accounts that have been unused for more than 2 years and no longer have valid UCSC login authorization
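For those curious what the QoS changes above look like in practice, here is a rough sketch of the kind of sacctmgr commands involved. The partition/QoS name hb_open and the user jdoe are hypothetical, and the field names and values shown are illustrative rather than a transcript of what we actually ran:

    # Create a QoS named after the partition and cap each user's usage (illustrative values).
    sacctmgr add qos hb_open
    sacctmgr modify qos where name=hb_open set MaxTRESPerUser=node=3 MaxJobsPerUser=3 MaxSubmitJobsPerUser=9

    # Create a matching account whose default and allowed QoS is the partition QoS.
    sacctmgr add account hb_open Description="open access" Organization=ucsc DefaultQOS=hb_open QOS=hb_open

    # Add a user under that account with the same per-association limits.
    sacctmgr add user jdoe Account=hb_open
    sacctmgr modify user where name=jdoe set MaxTRES=node=3 MaxJobs=3 MaxSubmitJobs=9

The partition definitions in slurm.conf then reference the matching QoS and account so the limits apply to jobs submitted to each partition.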
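To make the memory request above concrete, a typical preamble might look like the sketch below; the job name, partition, executable, and resource values are placeholders, and note that a bare --mem value such as 5000 is interpreted as megabytes per node:

    #!/bin/bash
    #SBATCH --job-name=mem_example   # placeholder job name
    #SBATCH --partition=128x24       # placeholder partition
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=01:00:00
    #SBATCH --mem=5000               # 5000 MB per node; pick a value that fits your workload

    ./my_program                     # placeholder executable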
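If you use the GPU node, a quick way to confirm that the new stack sees the GPU is shown below; this assumes the GPU Python environment is already on your path (for example, loaded through whatever module you normally use):

    # From an interactive session or job on the GPU node:
    nvidia-smi    # reports the driver and CUDA versions plus current GPU utilization
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
    python -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"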