Simulated Failure Analysis of a Distributed Liquid Cooled Data Center
Michael Gagnon, Cool Centric
Energy management of high-performance computing (HPC) data centers is evolving. The removal of heat generated by computing, networking, and storage equipment is changing from the practice of exclusively moving chilled air to one that also includes liquid cooling. Here, liquid cooling means rack door heat exchangers fed and controlled by coolant distribution units (CDUs). Each CDU can effectively remove sensible heat from several 42U racks. The use of CDU technology therefore represents a distributed heat removal paradigm, one that requires less energy than traditional computer room air conditioning methods.
A section of our data center was designed so that racks are cooled by multiple CDUs and their accompanying rack door heat exchangers. The racks are interleaved into groups, each group cooled by several CDUs, so that adjacent racks are not cooled by the same CDU. This design is intended to reduce "hot spots" in the room: when a CDU fails, the heat from the racks that lost cooling is spread over a wider area, and adjacent racks that are still cooled can aid their overheated neighbors.
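The interleaving idea can be illustrated with a minimal sketch. It assumes a single row of racks and a simple round-robin assignment of racks to CDUs; the actual grouping used in our data center is not specified here, and the function names and the rack and CDU counts are hypothetical, chosen only for illustration.

```python
from typing import List


def assign_racks(num_racks: int, num_cdus: int) -> List[int]:
    """Assumed round-robin layout: rack i is cooled by CDU (i % num_cdus)."""
    if num_cdus < 2:
        raise ValueError("interleaving needs at least two CDUs")
    return [i % num_cdus for i in range(num_racks)]


def racks_losing_cooling(assignment: List[int], failed_cdu: int) -> List[int]:
    """Indices of the racks served by the failed CDU."""
    return [rack for rack, cdu in enumerate(assignment) if cdu == failed_cdu]


def has_cooled_neighbor(assignment: List[int], rack: int, failed_cdu: int) -> bool:
    """True if at least one adjacent rack is served by a CDU that is still running."""
    neighbors = [n for n in (rack - 1, rack + 1) if 0 <= n < len(assignment)]
    return any(assignment[n] != failed_cdu for n in neighbors)


if __name__ == "__main__":
    # Hypothetical row of 12 racks served by 3 CDUs.
    layout = assign_racks(num_racks=12, num_cdus=3)
    affected = racks_losing_cooling(layout, failed_cdu=1)
    print("layout:", layout)
    print("racks without cooling:", affected)
    print("all have a cooled neighbor:",
          all(has_cooled_neighbor(layout, r, 1) for r in affected))
```

Under these assumptions, the racks that lose cooling are never adjacent to one another, so each one sits next to at least one rack whose heat exchanger is still active.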
This work presents the results of experiments conducted to simulate the failure of a single CDU in a live HPC data center. By simulating a single CDU failure, we were able to study the reactions of the remaining CDUs and the effects on both the racks that lost cooling and the racks still serviced by the remaining CDUs. We also report on the effects of the intuitive countermeasures taken and on the reaction time required to manage the system relative to an air-only data center cooling system.