Monitoring - System administrators have traced the problem to a specific workload that performed concurrent writes to the same file from different clients. This eventually caused the Quobyte clients to stop making progress on those files. When that happens, Slurm cannot fully clean up the affected jobs and puts the nodes into maintenance mode with the reason "Kill task failed".
The impacted nodes are being drained so they can be rebooted.
The user's jobs have been killed, and the user has been contacted to modify their workload.
Jan 23, 2026 - 13:22 PST
Investigating - Several nodes are in the "Kill task failed" status and are now draining in anticipation of a reboot. System administrators are looking for the root cause.
Jan 23, 2026 - 08:31 PST
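The concurrent-write pattern described in the incident above can often be avoided by having each task write to its own file and merging the pieces afterwards. The following is a minimal sketch of that idea, not guidance from HPC@UCD or Quobyte, and it does not reflect the actual change requested of the affected user, which is not stated here. The helper names `per_task_path` and `merge_parts` and the `.part` naming scheme are hypothetical; `SLURM_JOB_ID` and `SLURM_PROCID` are standard Slurm environment variables.

```python
import os
from pathlib import Path


def per_task_path(out_dir, stem="result", env=None):
    """Return an output path unique to this Slurm task (hypothetical helper).

    Each task gets its own file, so no two clients write to the same file.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID", "nojob")   # set by Slurm inside a job
    task_id = env.get("SLURM_PROCID", "0")      # task rank within the job step
    return Path(out_dir) / f"{stem}.{job_id}.{task_id}.part"


def merge_parts(out_dir, final_name, stem="result"):
    """Concatenate all part files into one file (hypothetical helper).

    Intended to run from a single task after every writer has finished.
    Parts are merged in lexicographic filename order.
    """
    out_dir = Path(out_dir)
    parts = sorted(out_dir.glob(f"{stem}.*.part"))
    final_path = out_dir / final_name
    with open(final_path, "wb") as dst:
        for part in parts:
            dst.write(part.read_bytes())
    return final_path
```

In this sketch, each task would call `per_task_path(...)` to pick its output file, and a single follow-up step would call `merge_parts(...)` once all tasks have exited, so only one client ever writes to the combined file.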
Update - Some Quobyte clients continue to have issues. HPC@UCD has escalated the case to the Quobyte engineering team. Please stay tuned.
Jan 23, 2026 - 13:08 PST
Investigating - It appears that a new workload is causing the Quobyte client to crash on Hive nodes. We are opening a case with the vendor.
Jan 15, 2026 - 15:53 PST
Hive Login Node - Operational - 100.0% uptime over the past 90 days
Compute Nodes - Degraded Performance - 100.0% uptime over the past 90 days
GPU Nodes - Degraded Performance - 100.0% uptime over the past 90 days
Hive Network - Operational - 100.0% uptime over the past 90 days
Storage - Partial Outage - 96.53% uptime over the past 90 days
Quobyte Parallel File System - Partial Outage - 89.6% uptime over the past 90 days
Hive Home Directories - Operational - 100.0% uptime over the past 90 days
Legacy Storage - Operational - 100.0% uptime over the past 90 days
Module System and Software - Operational - 100.0% uptime over the past 90 days
Hippo User Portal - Operational - 100.0% uptime over the past 90 days