Hive nodes draining

Incident Report for HPC at UCD

Monitoring

System administrators have identified a specific workload that was performing concurrent writes to the same file from different clients. This eventually caused the Quobyte clients to stop making progress on those files. When that happens, Slurm cannot fully clean up the affected jobs and puts the nodes into maintenance mode (Kill task failed).
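
For users wondering how to avoid this access pattern, one common approach is to have each task write to its own file and merge the results afterwards from a single process. The following is only a minimal sketch under stated assumptions, not official guidance from the HPC team and not specific to Quobyte. The function names, the `stem` parameter, and the file naming scheme are hypothetical. `SLURM_JOB_ID` and `SLURM_PROCID` are environment variables that Slurm sets for tasks launched with `srun`.

```python
import os
from pathlib import Path


def task_output_path(output_dir, stem="result"):
    """Return a per-task file path so no two tasks share an output file."""
    # SLURM_JOB_ID and SLURM_PROCID are set by Slurm for srun-launched tasks.
    # The fallback values are assumptions so the sketch also runs outside a job.
    job_id = os.environ.get("SLURM_JOB_ID", "nojob")
    rank = os.environ.get("SLURM_PROCID", "0")
    return Path(output_dir) / f"{stem}.{job_id}.{rank}.out"


def write_task_output(output_dir, lines, stem="result"):
    """Write this task's lines to its own file and return the path."""
    path = task_output_path(output_dir, stem)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as handle:
        for line in lines:
            handle.write(f"{line}\n")
    return path


def merge_task_outputs(output_dir, merged_path, stem="result"):
    """Concatenate all per-task files into one file, from a single process."""
    parts = sorted(Path(output_dir).glob(f"{stem}.*.out"))
    with open(merged_path, "w") as merged:
        for part in parts:
            merged.write(part.read_text())
    return len(parts)
```

In this sketch every task calls `write_task_output`, and `merge_task_outputs` runs once after all tasks have finished, for example in a final job step, so that only one client ever writes to the merged file.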

The impacted nodes are being drained so they can be rebooted.

The user's jobs have been killed, and the user has been contacted to modify their workload.
Posted Jan 23, 2026 - 13:22 PST

Investigating

Several nodes are in status "Kill task failed" and are now in the draining state in anticipation of a reboot. System administrators are looking for the root cause.
Posted Jan 23, 2026 - 08:31 PST
This incident affects: Compute Nodes and GPU Nodes.