Blog: Designing for AI Workloads

Data center cooling is evolving to more efficiently manage higher power densities.
The rapid expansion of artificial intelligence (AI) workloads has reshaped data center thermal design. Traditional air-cooled data centers, once sufficient for racks drawing 10–20 kilowatts (kW), are now overwhelmed by AI IT cabinets that can exceed 100 kW per rack. As a result, operators are transitioning to direct liquid cooling systems engineered for reliability under high thermal loads that can fluctuate widely.

Robert Sty
To support this shift, engineers increasingly rely on industry standards such as those from the American Society of Heating, Refrigerating and Air-Conditioning Engineers Technical Committee 9.9 (ASHRAE TC9.9) and the ASME piping standards to guide engineering and design for dynamic, high-density loads.
 

Re-emergence of liquid cooling

Surprisingly, liquid cooling in data center environments is not a new concept. From the 1960s to the 1980s, liquid cooling was used in supercomputers and mainframes. Eventually, the industry shifted to air-cooled cabinets similar to those used in enterprise and cloud computing, leveraging aisle containment strategies to direct cooling air efficiently and minimize the recirculation of hot air exiting the cabinets. With higher power loads driven by AI, liquid cooling has become a mainstream requirement in the design and operation of data centers.
 

Role of cooling distribution units

The Cooling Distribution Unit (CDU) has become commonplace in high-density liquid-cooled environments. CDUs separate the Facility Water System (FWS) from the Technology Cooling System (TCS), managing heat exchange between the two cooling loops. The CDU's role is to provide consistent performance by regulating flow, temperature, and pressure to each IT cabinet in the data hall, within the IT manufacturer’s recommendations. Since a CDU’s reliability is critical to data center operations, this equipment is typically installed in an N+1 redundant arrangement, similar to data hall air cooling equipment.
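To make the flow and redundancy points concrete, here is a minimal sizing sketch. It assumes plain water as the coolant (roughly 4.18 kJ/kg·K and 1 kg/L) and uses the basic heat balance Q = ṁ·cp·ΔT. The function names, the 10 K temperature rise, and the 400 kW CDU capacity are illustrative assumptions, not figures from any manufacturer or standard.

```python
import math

# Approximate properties of plain water (assumption; glycol mixes differ).
WATER_CP_KJ_PER_KG_K = 4.18
WATER_DENSITY_KG_PER_L = 1.0


def required_flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Coolant flow in L/min needed to carry heat_kw at a given supply/return delta-T."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_kw / (WATER_CP_KJ_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60.0


def cdu_count_n_plus_1(total_heat_kw: float, cdu_capacity_kw: float) -> int:
    """Number of CDUs: enough units to carry the load (N) plus one redundant unit."""
    if cdu_capacity_kw <= 0:
        raise ValueError("cdu_capacity_kw must be positive")
    n = math.ceil(total_heat_kw / cdu_capacity_kw)
    return n + 1


if __name__ == "__main__":
    # A 100 kW cabinet with a hypothetical 10 K rise needs roughly 144 L/min.
    print(f"{required_flow_lpm(100.0, 10.0):.1f} L/min")
    # Ten such cabinets (1,000 kW) on hypothetical 400 kW CDUs: N = 3, so N+1 = 4.
    print(cdu_count_n_plus_1(1000.0, 400.0), "CDUs")
```

Actual flow rates and temperature rises should come from the IT manufacturer's requirements; the sketch only shows how the quantities relate.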
 

Dynamic power loads and thermal transients

Unlike traditional cloud IT equipment with predictable, relatively stable power draw, AI workloads generate extreme power variations in both training and inference cycles. These fluctuations can place stress on cooling loops, CDUs, and facility water systems as they respond to the power increase. A temperature excursion outside the recommended thermal guidelines can degrade the performance and shorten the lifespan of the IT equipment, potentially causing failure.

To manage rapid load fluctuations, CDUs use variable speed drives to modulate pump flow efficiently. TCS cooling piping systems are designed with sufficient volume to buffer temperature increases while the facility control system responds to changes at the central plant. In some cases, this is achieved by adding a small thermal storage tank to the system or increasing the volume of the main cooling loops, with system performance validated through computational fluid dynamics modeling software.
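The buffering effect of loop volume can be estimated with a simple lumped energy balance before any detailed modeling. The sketch below assumes a perfectly mixed water loop and a step increase in load that the plant has not yet picked up; the function names and the example numbers are hypothetical. It is a first-pass estimate only, not a substitute for the transient or CFD analysis described above.

```python
WATER_CP_KJ_PER_KG_K = 4.18    # approximate specific heat of water (assumption)
WATER_DENSITY_KG_PER_L = 1.0   # approximate density of water (assumption)


def ride_through_seconds(loop_volume_l: float,
                         allowable_rise_k: float,
                         unmet_load_kw: float) -> float:
    """Seconds until a well-mixed loop warms by allowable_rise_k under an unmet load."""
    if unmet_load_kw <= 0:
        raise ValueError("unmet_load_kw must be positive")
    mass_kg = loop_volume_l * WATER_DENSITY_KG_PER_L
    stored_kj = mass_kg * WATER_CP_KJ_PER_KG_K * allowable_rise_k
    return stored_kj / unmet_load_kw  # kJ / (kJ/s) = s


def required_volume_l(target_seconds: float,
                      allowable_rise_k: float,
                      unmet_load_kw: float) -> float:
    """Loop (or tank) volume in liters needed to buffer the load for target_seconds."""
    if allowable_rise_k <= 0:
        raise ValueError("allowable_rise_k must be positive")
    energy_kj = unmet_load_kw * target_seconds
    mass_kg = energy_kj / (WATER_CP_KJ_PER_KG_K * allowable_rise_k)
    return mass_kg / WATER_DENSITY_KG_PER_L


if __name__ == "__main__":
    # Hypothetical case: 2,000 L loop, 5 K allowable rise, 500 kW step not yet met.
    print(f"{ride_through_seconds(2000.0, 5.0, 500.0):.1f} s")  # 83.6 s
```

The same relationship shows why adding a small storage tank helps: ride-through time scales linearly with the water volume available to absorb the step.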
 

Liquid cooling temperatures

Within ASHRAE TC9.9’s guiding framework, liquid cooling environments are classified by supply temperature ranges and cooling equipment types. There are two sets of classes: the FWS “W” classes, which address temperatures in the overall facility water system, and the TCS “S” classes, which correspond to the temperatures of the piping system from the CDU to the IT equipment. These classes define appropriate temperature envelopes for facility water systems and technology cooling systems, helping protect both chip-level reliability and facility-wide energy efficiency.

General arrangement of the piping from a CDU to the IT cabinets. Photo: HDR
Care should be taken to design cooling water system temperatures above the dew point of the facility to avoid condensation. In addition, the appropriate ASHRAE S-class temperature should be coordinated with the IT equipment manufacturer, as fluid temperatures that are too high or too low can impact equipment performance.
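A quick way to check a proposed supply temperature against condensation risk is to estimate the room dew point from dry-bulb temperature and relative humidity. The sketch below uses the Magnus approximation with one commonly cited set of coefficients (17.62 and 243.12 °C); the 2 K safety margin and the function names are assumptions for illustration, not values taken from ASHRAE guidance.

```python
import math

# Magnus approximation coefficients (one commonly used set; an assumption here).
MAGNUS_B = 17.62
MAGNUS_C = 243.12  # degrees Celsius


def dew_point_c(air_temp_c: float, relative_humidity_pct: float) -> float:
    """Approximate dew point (deg C) from dry-bulb temperature and relative humidity."""
    if not 0.0 < relative_humidity_pct <= 100.0:
        raise ValueError("relative_humidity_pct must be in (0, 100]")
    gamma = (math.log(relative_humidity_pct / 100.0)
             + MAGNUS_B * air_temp_c / (MAGNUS_C + air_temp_c))
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def supply_is_condensation_safe(supply_temp_c: float,
                                air_temp_c: float,
                                relative_humidity_pct: float,
                                margin_k: float = 2.0) -> bool:
    """True if the supply temperature stays at least margin_k above the dew point."""
    return supply_temp_c >= dew_point_c(air_temp_c, relative_humidity_pct) + margin_k


if __name__ == "__main__":
    # Hypothetical data hall at 24 deg C and 50% RH: dew point is about 12.9 deg C.
    print(f"dew point: {dew_point_c(24.0, 50.0):.1f} C")
    print(supply_is_condensation_safe(17.0, 24.0, 50.0))  # True
    print(supply_is_condensation_safe(13.0, 24.0, 50.0))  # False
```

The check should be run against the worst-case humidity the space can see, including any uninsulated piping routed outside the conditioned data hall.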
 

Piping system considerations

In both liquid-cooled and air-cooled deployments, piping is typically installed in a looped configuration so that water has multiple pathways to the equipment for redundancy. Piping material and joint selection are critical, as leaks can cause catastrophic downtime and damage to equipment. Leak detection sensors can alert the operator to such events.

The ASME B31.9 Building Services Piping standard provides direction on material selection, construction, testing, and wall thickness. ASHRAE standards guide the design engineer on items such as efficient system performance, sizing, insulation thickness, and fluid velocity. ASHRAE Standards 90.1, 90.4, and 189.1 provide guidance on efficiency and performance, and the ASHRAE TC9.9 Datacom Encyclopedia is a resource for applying that guidance to data center facilities.
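Fluid velocity follows directly from flow rate and pipe inner diameter (v = Q / A), so it is easy to screen candidate pipe sizes early in design. The sketch below is illustrative only: the function names are hypothetical, and the velocity limit is a caller-supplied parameter rather than a value from ASHRAE or ASME, which should be consulted for actual design limits.

```python
import math


def pipe_velocity_m_s(flow_lpm: float, inner_diameter_m: float) -> float:
    """Mean fluid velocity (m/s) for a volumetric flow through a circular pipe."""
    if inner_diameter_m <= 0:
        raise ValueError("inner_diameter_m must be positive")
    flow_m3_s = flow_lpm / 1000.0 / 60.0          # L/min -> m^3/s
    area_m2 = math.pi * inner_diameter_m ** 2 / 4.0
    return flow_m3_s / area_m2


def within_velocity_limit(flow_lpm: float,
                          inner_diameter_m: float,
                          max_velocity_m_s: float) -> bool:
    """True if the mean velocity does not exceed the limit chosen by the designer."""
    return pipe_velocity_m_s(flow_lpm, inner_diameter_m) <= max_velocity_m_s


if __name__ == "__main__":
    # Hypothetical header: 600 L/min through a 100 mm inner diameter pipe.
    print(f"{pipe_velocity_m_s(600.0, 0.100):.2f} m/s")  # about 1.27 m/s
```

Note that the calculation uses inner diameter, which depends on the wall thickness selected under the piping standard.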
 

Looking ahead

As AI workloads drive higher IT equipment power densities, direct liquid cooling has become a more common approach to heat rejection. While many familiar principles and standards carry over from air-cooled applications, liquid cooling introduces new considerations: thermal transients, piping material selection, and reliability.

By applying established guidance from both ASHRAE and ASME standards, and coordinating with the IT manufacturer's requirements, the design engineer can provide resilient infrastructure capable of supporting the next generation of AI computing.

Robert Sty, P.E., is vice president and Global Data Center Practice director for HDR Engineering, Inc. Over his almost 30-year career in the consulting engineering field, he has specialized in the development of data centers and mission critical facilities. He is an ASME member and is a professional engineer in the state of Arizona.