Migrated authentication from using CruzID Blue to using CruzID Gold
Updated the head node, login node, storage node, and all compute nodes to the newest CentOS 7 version (CentOS 7.9.2009, kernel 3.10.0-1160.42.2.el7.x86_64) and updated all installed packages
Implemented Quality of Service (QoS) features on the cluster:
the cluster-wide limits are now set to MaxJobs=3 and MaxSubmitJobs=9
created a QoS per partition, named after the partition, with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9
created an account per partition with its default QoS and allowed QoS set to the partition name
created all users with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9 and enabled access to the 4 open-access partitions
Configured the partition definitions to have their appropriate QoS and accounts defined
set up a script that fires on account login to grant the user access to the 4 default partitions/QoS
set up a sudo ACL that allows non-privileged users to run the add script at login
Modified the 4 open-access partitions to allow "ALL" groups rather than ucsc_p_all_usrs
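The per-partition QoS/account setup above can be sketched with sacctmgr. This is a hedged outline, not the exact commands we ran: the partition name and username are hypothetical stand-ins, and the limit names follow the entries above (note that for a QoS, Slurm spells some per-user limits differently, e.g. MaxJobsPerUser rather than MaxJobs).

```shell
# Hypothetical partition name; repeat for each of the 4 open-access partitions.
PART=128x24

# QoS named after the partition, with the 3-node / 3-job / 9-submit limits.
sacctmgr -i add qos $PART
sacctmgr -i modify qos $PART set MaxTRESPerUser=node=3 MaxJobsPerUser=3 MaxSubmitJobsPerUser=9

# Account named after the partition, tied to that QoS.
sacctmgr -i add account $PART DefaultQOS=$PART QOS=$PART

# Example user added under the account with the same per-association limits.
sacctmgr -i add user exampleuser Account=$PART \
    MaxTRES=node=3 MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9
```

The `-i` flag answers "yes" to sacctmgr's confirmation prompts, which is what makes this scriptable from a login-time hook.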
Net result of all these changes: 3-node limits are now more strictly enforced than before (previously, an edge case allowed users submitting multiple single-node jobs to exceed the 3-node limit)
We have set an intentionally low default memory-per-node value for the cluster. This encourages users to learn about memory allocation during job submission, which in turn helps ensure optimal utilization of cluster resources.
short version: users should be sure to include #SBATCH --mem=5000 (or whatever other value fits their job) in their Slurm script preamble.
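A minimal job script showing the explicit memory request; the job name, partition name, and program are illustrative placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=myjob      # illustrative job name
#SBATCH --partition=128x24    # hypothetical open-access partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=5000            # request 5000 MB per node; the cluster default is intentionally low

./my_program                  # placeholder for the actual workload
```

Without the `--mem` line, the job runs under the low cluster default and may be killed for exceeding its memory allocation.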
Slightly re-ordered the queue (swapped the positions of nodes 3 and 5, so now the Instruction queue is contiguously numbered)
Installed the latest NVIDIA drivers on the GPU compute node
Installed a new CUDA toolkit as well (version 11.1 for compatibility reasons, since PyTorch is not yet built for 11.4)
Created new Python stack for GPU (Python 3.9.7, TensorFlow 2.6.0, Torch 1.9.0)
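A quick sanity check of the new GPU stack, assuming the new Python environment is first on the user's PATH; versions shown in comments are the ones listed above:

```shell
nvidia-smi       # confirms the driver loads and the GPU is visible
nvcc --version   # should report CUDA 11.1
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import tensorflow as tf; print(tf.__version__, tf.config.list_physical_devices('GPU'))"
```

If `torch.cuda.is_available()` prints False, the usual culprits are a driver/toolkit mismatch or a CPU-only Torch wheel.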
Resolved networking issue that was causing intermittent boot problems for the compute nodes
Pruned old user accounts that had been unused for >2 years and no longer had valid UCSC login authorization