September 9th – September 13th
Changes and updates that were performed
- Migrated authentication from using CruzID Blue to using CruzID Gold
- Updated the head node, login node, storage node, and all compute nodes to the newest CentOS 7 version (CentOS 7.9.2009, kernel 3.10.0-1160.42.2.el7.x86_64) and updated all installed packages
- Implemented Quality of Service (QoS) features on the cluster:
- the cluster is now set to have MaxJobs=3 and MaxSubmitJobs=9
- created a QoS per partition with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9, named after the partition
- created an account per partition with the default QoS and allowed QoS set to the partition name
- created all users with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9 and enabled access to the 4 open-access partitions
- Configured the partition definitions with their appropriate QoS and accounts
- set up a script that fires on account login to grant each user access to the 4 default partitions/QoS
- set up a sudo ACL to allow non-privileged users to run the add script upon login
- Modified the 4 open-access partitions to allow "ALL" groups rather than ucsc_p_all_usrs
- Net result of all these changes: 3-node limits are now more strictly enforced than before (we had an edge case where users submitting multiple single-node jobs could exceed the 3-node limit)
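For reference, the per-partition setup above could be scripted with sacctmgr roughly as follows. This is a sketch, not our exact procedure: the partition name and username are placeholders, the limit options mirror the values listed above, and exact option names vary between Slurm versions, so check the sacctmgr documentation before running anything.

```shell
# Placeholder partition and user names -- substitute real ones.
PART=128x24
USER_NAME=jdoe

# QoS named after the partition, carrying the 3-node / 3-job / 9-submit limits
sacctmgr -i add qos "$PART"
sacctmgr -i modify qos "$PART" set \
    MaxTRESPerUser=node=3 MaxJobsPerUser=3 MaxSubmitJobsPerUser=9

# Account named after the partition, with that QoS as both default and allowed
sacctmgr -i add account "$PART" DefaultQOS="$PART" QOS="$PART"

# Add the user to the account with matching per-association limits
sacctmgr -i add user "$USER_NAME" Account="$PART" \
    MaxTRES=node=3 MaxJobs=3 MaxSubmitJobs=9
```

The -i flag makes sacctmgr commit each change without an interactive prompt, which is what makes this scriptable from a login hook.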
- We have set an intentionally low “default memory per node” value for the cluster. This encourages users to think about memory allocation when submitting jobs, which helps ensure optimal utilization of cluster resources.
- short version: users should be sure to include
#SBATCH --mem=5000
(or another value appropriate for their job) in their Slurm script preamble.
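Put in context, a minimal batch script that requests memory explicitly might look like the sketch below; the job name, partition, and program are placeholders, not a prescribed template.

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH --partition=128x24        # placeholder; use a partition you have access to
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=5000                # MB per node; size this for your workload

# Without an explicit --mem, the job inherits the intentionally low
# cluster default and may be killed for exceeding it.
srun ./my_program                 # placeholder workload
```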
- Slightly re-ordered the queue (swapped the positions of nodes 3 and 5, so now the Instruction queue is contiguously numbered)
- Installed the latest NVIDIA drivers for the GPU compute node
- Installed new CUDA toolkit as well (using version 11.1 for compatibility reasons, as PyTorch is not built for 11.4 yet)
- Created new Python stack for GPU (Python 3.9.7, TensorFlow 2.6.0, Torch 1.9.0)
- Resolved networking issue that was causing intermittent boot problems for the compute nodes
- Pruned old user accounts that have been unused for more than 2 years and no longer have valid UCSC login authorization
SAFEGUARD AGAINST MISSING DATA
COPY versus MOVE
It has come to our attention that some users moving large amounts of data between locations on Hummingbird (e.g. from one directory to another) have unexpectedly seen the data disappear after the move completed. This behavior seems to occur only when moving terabytes of data, or at least many hundreds of gigabytes, not smaller amounts.
The recommended solution is to COPY data (using the command rsync or cp) and NOT move the data (using the command mv). Once you have successfully copied the data, check that the metadata matches (e.g. the file size in bytes) before you remove the data from its original location.
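As a sketch of that copy-then-verify workflow, the helper below copies a directory and compares total file bytes on both sides before anything is deleted. The function name and paths are illustrative, not an official tool; cp is used here for portability, and rsync -a SRC/ DST/ behaves equivalently and can be safely re-run if interrupted.

```shell
# Copy a directory, then verify that total file bytes match before any deletion.
copy_and_verify() {
    src=$1; dst=$2
    cp -a "$src" "$dst"                              # copy, preserving metadata
    # sum the sizes (in bytes) of every regular file on each side
    s=$(find "$src" -type f -printf '%s\n' | awk '{t+=$1} END{print t+0}')
    d=$(find "$dst" -type f -printf '%s\n' | awk '{t+=$1} END{print t+0}')
    if [ "$s" = "$d" ]; then
        echo "sizes match: $s bytes"                 # now safe to remove the original
    else
        echo "MISMATCH: $s vs $d bytes" >&2          # do NOT delete the original
        return 1
    fi
}

# Example with placeholder paths:
# copy_and_verify /hb/scratch/$USER/dataset /hb/home/$USER/dataset
# rm -r /hb/scratch/$USER/dataset    # only after seeing "sizes match"
```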
If you observe that you have used MOVE and data are missing, we have no way to recover the lost data. This is a reminder that you should always have a copy of the data off Hummingbird for anything critical. We recognize that it’s not practical to make copies of everything on systems other than Hummingbird, but for final results and critical, non-intermediate data, this is good practice.
We will continue to investigate this behavior, but using COPY when relocating large files or directories will help ensure that your data remain intact and available to you.
UC Santa Cruz Research Computing