Multiple HTC and HPC servers down


Date: Fri, 19 Jul 2019 10:00:39 -0500
From: chtc-users@xxxxxxxxxxx
Subject: Multiple HTC and HPC servers down

Greetings CHTC Users,

Due to power irregularities likely related to morning fires at two power stations in Madison, CHTC currently has multiple (but not all) servers down in the HTC System and HPC Cluster.

The HPC Cluster has roughly 50 nodes down due to a power-related issue with the cooling system in the cluster's server room (Discovery building). Jobs previously running on those servers will have failed. We may need to shut down more of the HPC Cluster if cooling issues persist, and we'll provide updates as things progress.

The HTC System's submit-2.chtc.wisc.edu submit server and multiple execute servers are down in several server rooms (Discovery and Computer Sciences), and some other group-specific submit servers may have been affected. Jobs running on the affected execute servers will have been interrupted, but will return to "Idle" status to re-run on another server. Jobs queued from submit-2 (or any other submit server with power loss) have been interrupted, but will similarly return to "Idle" to re-run once we have the submit-2 server stably rebooted. As with the HPC Cluster, we may need to take down additional execute or submit servers, and will provide updates as things progress.

Thank you for your patience. We hope you are safe following this morning's fires and that you are able to stay cool this weekend, given any persisting power outages in the city.

Best,
Your CHTC Team

Lauren Michael - Research Computing Facilitator, Center for High Throughput Computing
University of Wisconsin - Madison
lmichael@xxxxxxxx
tinyurl.com/LMichaelCalendar
Discovery 2262, (608) 316-4430