Don’t want to be left behind in the AI era? How to solve AI chip cooling issues?
Ensuring chip quality and reliability is crucial.
AI Chip Evolution and Challenges
1. The Challenge of High-Power AI Chips: Thermal Dissipation and Thermal Balance Capabilities
Cloud AI chips deployed in data centers are built for deep learning, so their computing performance must keep rising. This results in very high power consumption (over 200 W for a single chip), which in turn generates substantial heat that accelerates the aging of the chips. Therefore, for cloud AI chips that must operate continuously, 365 days a year, the reliability issues caused by aging need to be carefully evaluated.
The principle of reliability testing is to sample a certain number of ICs and use the results to predict the lifespan and failure probability of the entire population. Typically, 77 chips are sampled. When these 77 high-power chips undergo a 1000-hour reliability test in a single system, the combined power, which can reach tens of thousands of watts, severely tests the system’s thermal dissipation and thermal balance capabilities.
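To put the scale in perspective, the arithmetic can be sketched as follows. The 200 W figure is only the lower bound quoted above, and the function names are illustrative, so treat this as a rough estimate rather than a system specification.

```python
# Rough heat-load estimate for a reliability test run.
# Assumption: 200 W is the lower bound per chip quoted in the article;
# real chips may draw considerably more.
SAMPLE_SIZE = 77
POWER_PER_CHIP_W = 200.0
TEST_DURATION_H = 1000.0


def total_heat_load_w(sample_size: int, power_per_chip_w: float) -> float:
    """Total electrical power, nearly all of which the system must remove as heat."""
    return sample_size * power_per_chip_w


def total_energy_kwh(total_power_w: float, duration_h: float) -> float:
    """Energy delivered over the whole test, in kilowatt-hours."""
    return total_power_w * duration_h / 1000.0


load_w = total_heat_load_w(SAMPLE_SIZE, POWER_PER_CHIP_W)
energy_kwh = total_energy_kwh(load_w, TEST_DURATION_H)
print(f"Continuous heat load: {load_w / 1000:.1f} kW")
print(f"Energy over the test: {energy_kwh:.0f} kWh")
```

At the 200 W lower bound this is already about 15.4 kW of continuous heat, and chips well above 200 W push the total higher.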
Only with precise thermal dissipation and thermal balance can each chip maintain a stable junction temperature (Tj) while performing various computational modes. This stability is crucial for accurately predicting the IC’s lifespan. Therefore, managing and controlling the heat generated by high-performance cloud AI chips is a significant challenge in the design of IC reliability experiments.
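One common way to see why a stable Tj matters is the Arrhenius temperature-acceleration model that is often used with high-temperature life tests. The sketch below is not iST's procedure; the activation energy (0.7 eV) and the temperatures are assumed values chosen only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Acceleration factor of a stress temperature relative to a use temperature."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


# Assumed values: Ea = 0.7 eV, use Tj = 55 C, nominal stress Tj = 125 C.
EA_EV = 0.7
T_USE_C = 55.0

nominal = arrhenius_acceleration(EA_EV, T_USE_C, 125.0)
for tj_c in (120.0, 125.0, 130.0):
    af = arrhenius_acceleration(EA_EV, T_USE_C, tj_c)
    # A few degrees of Tj error shifts the implied lifetime noticeably.
    print(f"Tj = {tj_c:.0f} C -> acceleration factor {af:.1f} "
          f"({af / nominal:.2f}x nominal)")
```

Under these assumptions, a chip that runs a few degrees hotter or cooler than intended ages at a visibly different rate, which is why the lifetime prediction depends on holding Tj steady.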
2. The Challenge of Heterogeneous Integration: Complex Heat Dissipation Path
Additionally, AI chips use heterogeneous integration in advanced packaging. This integrates chips built on different processes into one package to increase the transmission bandwidth among them. The chips are placed side by side or stacked to improve data transmission efficiency and lower power consumption.
This comes at a price: a complex stacked heterogeneous package complicates how heat is generated and dissipated. For example, the chips that consume the most power may sit off-center in the package, and the chips may differ in thickness. Both factors make heat dissipation and temperature sensing different from those of traditional packaging, and they complicate measurement and temperature monitoring throughout the reliability verification period.
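A simple one-dimensional conduction estimate shows how die thickness alone can change the thermal path. This is a minimal sketch under stated assumptions: the conductivities, dimensions, and power are hypothetical, both dies sit under one flat heat sink surface, and heat spreading and the heat sink itself are ignored.

```python
# 1-D conduction: R = thickness / (conductivity * area), Tj = T_ref + P * sum(R).
SILICON_K_W_PER_MK = 150.0  # approximate room-temperature value for silicon
TIM_K_W_PER_MK = 5.0        # hypothetical thermal interface material


def layer_resistance_k_per_w(thickness_m: float, conductivity_w_per_mk: float,
                             area_m2: float) -> float:
    """Thermal resistance of one flat layer, in K/W."""
    return thickness_m / (conductivity_w_per_mk * area_m2)


def junction_temp_c(reference_temp_c: float, power_w: float,
                    layer_resistances: list[float]) -> float:
    """Junction temperature for layers stacked in series above the die."""
    return reference_temp_c + power_w * sum(layer_resistances)


AREA_M2 = 1e-4  # 10 mm x 10 mm die (hypothetical)

# Tall die: 700 um of silicon, 50 um of interface material.
tall = [layer_resistance_k_per_w(700e-6, SILICON_K_W_PER_MK, AREA_M2),
        layer_resistance_k_per_w(50e-6, TIM_K_W_PER_MK, AREA_M2)]
# Short die: 100 um thinner, so the gap to the flat surface is 100 um larger.
short = [layer_resistance_k_per_w(600e-6, SILICON_K_W_PER_MK, AREA_M2),
         layer_resistance_k_per_w(150e-6, TIM_K_W_PER_MK, AREA_M2)]

for name, stack in (("tall die", tall), ("short die", short)):
    print(f"{name}: Tj = {junction_temp_c(40.0, 50.0, stack):.1f} C")
```

In this simplified model the thinner die runs hotter at the same power, because the extra gap is filled with material that conducts heat far worse than silicon.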
3. The Challenge of Low-Power AI Chips: Balancing Performance and Voltage Control Increases Testing Complexity
Low-power AI chips for end devices face a different set of issues, centered on voltage control. These chips, including ASICs and SoCs, are primarily used for computation in end devices such as smartphone assistants, drones, and ADAS (Advanced Driver Assistance Systems). Since these devices rely on battery power, the chips must deliver both high performance and low power consumption. Reducing power consumption has become the biggest design challenge for these chips.
To reduce power consumption, in addition to adopting low operating voltage designs, multi-operating voltage and multi-gate voltage designs are also very common. However, this introduces two challenges for reliability testing:
- Multiple operating voltages mean multiple power supplies need to be tested simultaneously, increasing the complexity of testing and challenging the limits of the power supply capacity of reliability testing equipment.
- When the operating voltage is lowered and high current flows through the circuit board traces, issues such as IR drop and ripple are likely to occur, further complicating hardware design and testing.
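The second point follows from Ohm's law: for the same power, a lower rail voltage means more current, and the same trace resistance then consumes a larger share of the rail. A minimal sketch, with hypothetical power and resistance values:

```python
def rail_current_a(power_w: float, rail_voltage_v: float) -> float:
    """Current drawn from a rail for a given load power (I = P / V)."""
    return power_w / rail_voltage_v


def ir_drop_v(current_a: float, path_resistance_ohm: float) -> float:
    """Voltage lost across the board traces and planes (V = I * R)."""
    return current_a * path_resistance_ohm


def drop_fraction(power_w: float, rail_voltage_v: float,
                  path_resistance_ohm: float) -> float:
    """IR drop as a fraction of the nominal rail voltage."""
    current_a = rail_current_a(power_w, rail_voltage_v)
    return ir_drop_v(current_a, path_resistance_ohm) / rail_voltage_v


# Hypothetical load: 10 W through a 5 milliohm supply path.
POWER_W = 10.0
PATH_RESISTANCE_OHM = 0.005

for rail_v in (1.8, 0.8):
    current = rail_current_a(POWER_W, rail_v)
    drop = ir_drop_v(current, PATH_RESISTANCE_OHM)
    share = drop_fraction(POWER_W, rail_v, PATH_RESISTANCE_OHM)
    print(f"{rail_v} V rail: {current:.2f} A, drop {drop * 1000:.1f} mV "
          f"({share:.1%} of the rail)")
```

Because the relative drop scales with 1/V², moving the same load from 1.8 V to 0.8 V makes the board resistance roughly five times more significant.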
Thus, planning a High Temperature Operating Life (HTOL) reliability testing environment that meets the needs of edge AI chips requires meticulous attention to every detail and design consideration, from equipment selection to PCB simulation and production. This process is inherently more rigorous than it is for standard logic ICs.
In summary, when performing reliability design verification, it is critical to address three challenges: heat dissipation and thermal balance, the voltage limits of the testing system, and the complex thermal dissipation paths of heterogeneous integration.
How to Overcome the Reliability Challenges of AI Chips
1. Stably Control the Heat Generated by High-Power AI Chips with a Liquid Cooling System
Cloud AI chips used in HPC and servers experience high power consumption and heat generation due to prolonged high-performance computing, making traditional air cooling insufficient for effective heat dissipation. Especially when conducting high-temperature reliability tests on such high-power ICs, the testing system must have a rapid cooling capability.
Liquid cooling systems, commonly referred to simply as “liquid cooling,” are expected to become the mainstream cooling solution for cloud AI chips.
The solution proposed by iST’s Reliability Verification Laboratory is a more efficient liquid cooling system paired with custom liquid circulation sockets (Figure 2). This system takes advantage of the superior heat exchange rate of liquid compared to gas, and it monitors the chip’s temperature in real time and adjusts the liquid flow rate accordingly. Together, these measures stably control the heat generated by ultra-high-power AI chips, so that reliability test data can be collected successfully.
Figure 2: Liquid cooling socket (Courtesy: Enplas)
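The idea of adjusting the liquid flow rate from a real-time temperature reading can be sketched as a simple proportional-integral loop. This is a hypothetical illustration, not iST's control scheme: the class name, gains, and flow limits are all assumed values.

```python
from dataclasses import dataclass


@dataclass
class FlowController:
    """PI controller: raises coolant flow when the chip runs above its target Tj."""
    target_tj_c: float
    kp: float = 0.5              # L/min per degree C of error (assumed)
    ki: float = 0.05             # L/min per degree C-second (assumed)
    base_flow_lpm: float = 2.0   # flow at zero error (assumed)
    min_flow_lpm: float = 0.5
    max_flow_lpm: float = 10.0
    _integral: float = 0.0

    def update(self, measured_tj_c: float, dt_s: float) -> float:
        """Return the new flow setpoint in L/min for one control step."""
        error = measured_tj_c - self.target_tj_c  # positive when too hot
        candidate = self._integral + error * dt_s
        flow = self.base_flow_lpm + self.kp * error + self.ki * candidate
        clamped = min(max(flow, self.min_flow_lpm), self.max_flow_lpm)
        if clamped == flow:
            # Anti-windup: accumulate the integral only while not saturated.
            self._integral = candidate
        return clamped


controller = FlowController(target_tj_c=125.0)
for tj_c in (125.0, 128.0, 131.0, 127.0, 125.0):
    flow = controller.update(tj_c, dt_s=1.0)
    print(f"Tj = {tj_c:.0f} C -> flow setpoint {flow:.2f} L/min")
```

A real system would also need sensor filtering, pump dynamics, and safety interlocks, none of which are modeled here.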
2. Using Thermal Diode Circuit to Monitor the Internal Temperature of ICs
The high power consumption of cloud AI chips can lead to unexpected failures during reliability testing, such as thermal runaway, because the temperature inside the chip can change rapidly. Therefore, when the IC incorporates a thermal diode, iST’s Reliability Verification Laboratory can customize a thermal diode monitoring circuit to monitor the internal temperature of the IC, allowing real-time, accurate measurement of the junction temperature (Figure 3).
Figure 3: IC thermal diode monitoring circuit example (Source: iST)
This approach offers fast response times. Combined with the liquid cooling control system described earlier, it is better suited to the rapidly changing temperatures of ultra-high-power AI chips, because cooling can be adjusted in real time. In addition, the thermal diode monitoring circuit can independently measure the temperature of each chip in multi-chip structures, such as 3D packaging, to achieve more accurate reliability data collection.
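One widely used way to read an on-die thermal diode is the two-current method, in which the difference in forward voltage at two bias currents is proportional to absolute temperature. The article does not say which method iST's circuit uses, so the sketch below is an assumption; the ideality factor and current ratio are illustrative values.

```python
import math

K_OVER_Q_V_PER_K = 8.617333262e-5  # Boltzmann constant / elementary charge, V/K


def delta_vbe_v(temp_k: float, current_ratio: float, ideality: float = 1.008) -> float:
    """Forward-voltage difference between two bias currents at a given temperature."""
    return ideality * K_OVER_Q_V_PER_K * temp_k * math.log(current_ratio)


def diode_temperature_c(delta_vbe: float, current_ratio: float,
                        ideality: float = 1.008) -> float:
    """Invert the relation above: T = dVbe / (n * (k/q) * ln(I_high / I_low))."""
    temp_k = delta_vbe / (ideality * K_OVER_Q_V_PER_K * math.log(current_ratio))
    return temp_k - 273.15


# Simulate a reading at Tj = 125 C with a 10:1 current ratio (assumed),
# then recover the temperature from the voltage difference.
CURRENT_RATIO = 10.0
simulated_dvbe = delta_vbe_v(125.0 + 273.15, CURRENT_RATIO)
recovered_c = diode_temperature_c(simulated_dvbe, CURRENT_RATIO)
print(f"dVbe = {simulated_dvbe * 1000:.2f} mV -> Tj = {recovered_c:.2f} C")
```

Because the result depends on a voltage difference rather than an absolute voltage, it is largely insensitive to the diode's saturation current, although an incorrect ideality factor still introduces a proportional error.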
3. Tailor-Made Jigs to Fit Dies of Different Thicknesses
Heterogeneously integrated AI chips may contain dies of different thicknesses. Reliability verification therefore requires jigs with a chip-specific IC socket, heat sink, and sensor that fit dies of different heights, giving better heat dissipation and more accurate temperature measurement and monitoring (Figure 4).
Figure 4: Customized IC test socket (Source: iST)
4. Simulate Burn-in Boards in Advance to Prevent Poor Performance after Assembly
As previously mentioned, AI chips require multiple system power supplies, which can lead to voltage drops or noise issues and thereby increase the complexity and difficulty of designing reliability tests. To address this, iST’s Reliability Lab adopts a new design concept for the burn-in module (BI module) that breaks away from traditional circuit board design thinking. Instead of testing multiple chips on a single board, the circuit board is downsized to test only one chip. Using various layout design assistance tools, issues related to operating voltage, signal source IR drop, and power layer impedance can be identified and improved through software analysis and simulation in the early stages of reliability circuit board design. This prevents performance issues that would otherwise be discovered only after production and assembly are complete.
Drawing on hands-on experience with consumer chips, automotive chips, 5G chips, and now AI chips, iST’s Reliability Verification Lab is confident it can tackle the issues of ultra-high power, ultra-low voltage, and heterogeneous integration that AI chips present when designing reliability verification. The lab is committed to providing you with reliability verification data (e.g., accurate temperature and voltage values) to improve the reliability of AI chips.