September 9th – September 13th
Changes and updates that were performed
- Migrated authentication from using CruzID Blue to using CruzID Gold
- Updated the head node, login node, storage node, and all compute nodes to the newest CentOS 7 version (CentOS 7.9.2009, kernel 3.10.0-1160.42.2.el7.x86_64) and updated all installed packages
- Implemented Quality of Service (QoS) features on the cluster:
- the cluster is now set to have maxjobs=3 and maxsubmitjobs=9
- created a QoS per partition with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9, each named after its partition
- created an account per partition with the default QoS and allowed QoS set to the partition name
- created all users with MaxTRES="node=3" MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9 and enabled access to the 4 open access partitions
- Configured the partition definitions to have their appropriate QoS and accounts defined
- set up a script that runs on account login to give the user access to the 4 default partitions/QoS
- set up a sudo ACL to allow non-privileged users to run the add script upon login
- Modified the 4 open access partitions to allow “ALL” groups rather than ucsc_p_all_usrs
- Net result of all these changes: the 3-node limit is now enforced more strictly than before (previously, users submitting multiple single-node jobs could exceed it)
- We have intentionally set a low “default memory per node” value for the cluster. This encourages users to learn about memory allocation during job submission, which in turn helps us make better use of cluster resources.
- short version: users should be sure to include
#SBATCH --mem=5000
(or whatever other value is appropriate for their job) in their Slurm script preamble.
- Slightly re-ordered the queue (swapped the positions of nodes 3 and 5, so now the Instruction queue is contiguously numbered)
- Installed the latest NVIDIA drivers for the GPU compute node
- Installed new CUDA toolkit as well (using version 11.1 for compatibility reasons, as PyTorch is not built for 11.4 yet)
- Created new Python stack for GPU (Python 3.9.7, TensorFlow 2.6.0, Torch 1.9.0)
- Resolved networking issue that was causing intermittent boot problems for the compute nodes
- Pruned old user accounts that have been unused for more than 2 years and no longer have valid UCSC login authorization
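
For readers curious how the per-partition QoS and accounts described above might be created, here is a rough dry-run sketch. It only prints sacctmgr commands rather than running them. The partition names (part_a, part_b) are hypothetical placeholders, and the exact commands used on the cluster are not given in these notes, so treat the option spelling as an assumption that mirrors the limits listed above.

```shell
#!/bin/bash
# Dry run: prints sacctmgr commands instead of executing them.
# Partition names are hypothetical placeholders, not the cluster's real ones.
partitions="part_a part_b"

qos_commands() {
    local p
    for p in $partitions; do
        # One QoS per partition, named after the partition, with the limits from the notes.
        echo "sacctmgr -i add qos $p MaxTRES=node=3 MaxJobs=3 MaxNodes=3 MaxSubmitJobs=9"
        # One account per partition, with default and allowed QoS set to the partition name.
        echo "sacctmgr -i add account $p DefaultQOS=$p QOS=$p"
    done
}

qos_commands
```

Printing the commands first makes it easy to review them before anything is changed in the accounting database.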
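
As an illustration of the memory request mentioned above, a minimal job script could look like the sketch below. The job name, task count, and time limit are hypothetical; only the --mem line comes from these notes. By default Slurm reads the --mem value as megabytes per node.

```shell
#!/bin/bash
# Hypothetical job name, task count, and time limit; adjust for your own job.
#SBATCH --job-name=mem_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# Memory per node; megabytes unless a unit suffix is given.
#SBATCH --mem=5000

# Slurm exports SLURM_MEM_PER_NODE when --mem is requested; outside a job it is unset.
mem_report="Memory requested per node (MB): ${SLURM_MEM_PER_NODE:-not set}"
echo "$mem_report"
```

Submit it with sbatch as usual. If a job is killed for exceeding its memory, raise the --mem value and resubmit.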
UC Santa Cruz Research Computing