Issued Date: 2020/8/11
AI Chip Reliability
Applying AI technology to COVID-19 pandemic prevention is a hot topic.
How can AI be applied to accurate pandemic prediction and prevention?
The key is the quality and reliability of the AI chip.
COVID-19 dominated conversation and caused suffering around the world in the first half of 2020. Once the outbreak is contained in a given country, pandemic prevention measures can be eased; however, we must remain very cautious until a COVID-19 vaccine is widely available. AI (Artificial Intelligence) technology has been playing a key role in pandemic prevention and drug development by providing fast data analysis for “thermal image identification and pandemic prevention,” “virus DNA mutation and pandemic data analysis” and “candidate drug screening.”
AI technology extracts the characteristic parameters of an object and makes judgments by simulating the human brain with neural networks and deep learning. It may sound exotic, but it has already entered our daily lives, for example in the human-level voice recognition available on almost every smartphone.
Alongside the evolution of algorithms and big data, AI chips are advancing rapidly toward high performance, high bandwidth and low power consumption, as required by different application domains (see Table 1). That is, continuous improvement in chip performance is the key to progress in AI applications.
Table 1: Types of AI chip applications
For AI-based COVID-19 pandemic prevention, one of the keys is the reliability and performance of the AI chip. The high power consumption of AI chips for cloud computing and the low operating voltage of AI chips for edge computing not only determine chip performance and lifetime but also pose great challenges to reliability verification design approaches and equipment. The iST reliability verification lab (iST RA lab) identifies three challenges.
1. The Challenge of Super High Power Consumption in Cloud AI Chips: Heat Dissipation and Balance
Cloud AI chips deployed in data centers are used for deep learning, so their computing performance must be high. This results in huge power consumption (over 200 W per chip), and the heat generated in turn ages the chips faster.
Because cloud AI chips run 24 hours a day all year round, the reliability issues caused by aging must be planned for carefully.
Reliability verification is performed on a given number of ICs to predict the lifetime and failure rate of the entire population. In general, a sample of 77 ICs is placed on one reliability test system for a verification run of 1000 hours. The heat generated by these 200 W chips can add up to tens of thousands of watts, which places high demands on the heat dissipation and balance capabilities of the temperature system.
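The arithmetic behind these numbers can be sketched as follows. The sample size (77), per-chip power (200 W) and duration come from the text; the zero-failure binomial bound and the 90% confidence level are assumptions added here for illustration, not something the iST RA lab states, and the function names are hypothetical.

```python
def total_heat_load_w(n_chips: int, power_per_chip_w: float) -> float:
    """Total power the reliability system must remove, in watts."""
    return n_chips * power_per_chip_w


def zero_failure_upper_bound(n_samples: int, confidence: float) -> float:
    """Upper bound on the population defect fraction when all samples pass.

    Assumes a binomial model: solves (1 - p)**n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)


heat_w = total_heat_load_w(77, 200.0)       # 77 chips at 200 W each
p_max = zero_failure_upper_bound(77, 0.90)  # 90% confidence is an assumption

print(f"Total heat load: {heat_w:.0f} W")
print(f"Defect fraction bound with zero failures: {p_max:.2%}")
```

At exactly 200 W per chip the board dissipates 15,400 W; chips above 200 W push the total higher. Under the assumed model, 77 passing samples bound the defect fraction at roughly 3%.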
Only precise heat dissipation and balance can keep every chip at a steady junction temperature (Tj, the temperature of the PN junction) while it runs different computing models, so that a valid IC lifetime can be determined. That is, the challenge for IC reliability verification design is to dissipate and control the heat generated by high-performance cloud AI chips.
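Why a changing workload makes a steady Tj hard to hold can be illustrated with the usual lumped thermal model, Tj = Tcase + P × θjc. This is a minimal sketch: the junction-to-case thermal resistance of 0.1 °C/W and the 120 W workload are assumed values, the 125 °C target is the reliability verification temperature cited in this article, and the function name is hypothetical.

```python
def required_case_temp_c(target_tj_c: float, power_w: float,
                         theta_jc_c_per_w: float) -> float:
    """Case temperature needed to hold the junction at target_tj_c.

    Rearranged from the lumped model Tj = Tcase + P * theta_jc.
    """
    return target_tj_c - power_w * theta_jc_c_per_w


THETA_JC = 0.1  # degC/W, assumed for illustration only

# The same chip running two computing models with different power draw
for power_w in (200.0, 120.0):
    t_case = required_case_temp_c(125.0, power_w, THETA_JC)
    print(f"{power_w:.0f} W workload: hold case at {t_case:.1f} degC")
```

With these assumed numbers the cooling system has to shift the case temperature by several degrees whenever the workload changes, which is the balancing problem described above.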
2. The Challenge of Super Low Voltage in Edge AI Chips: Limits on Reliability Verification Imposed by Multi-System Power Requirements
Because of their application fields, edge AI chips must deliver computing performance at low power consumption: they are most likely embedded in battery-powered devices such as mobile devices, IoT devices, drones and the autopilot assistance systems of electric cars.
Although ever-improving semiconductor processes reduce the dynamic current consumed by the same number of logic gates, the physical effects of shrinking dimensions raise the static leakage current of transistors. Moore’s Law scaling, under which transistor area roughly halves every two years, therefore does not halve the power density of chips. On the contrary, a chip of the same area consumes more current than its earlier generation.
In addition to lowering the operating voltage, multiple supply voltages are widely adopted to reduce power consumption. This places another demand on the reliability verification system: 10 or more system power requirements challenge the power supply capacity of reliability verification equipment.
A core operating voltage of 1 V or less makes the IC’s power margin smaller and smaller. Any IR drop or power ripple caused by the PCB can cause the reliability experiment to fail. It is therefore essential to design the HTOL (high temperature operating life) reliability verification environment for edge AI chips with much more prudent equipment selection, PCB power integrity (PI) simulation and preparation, and other design considerations than for ordinary logic ICs.
3. The Challenge of Heterogeneous Integration: Complex Heat Dissipation Path
AI chips are trending toward heterogeneous integration, which combines chips made with different processes, such as HBM, sensors, MEMS and antennas, in one package to increase the transmission bandwidth among them. The chips are placed side by side or stacked (Figure 1) using technologies including TSV, RDL, bumps and interposers for better data transmission efficiency and lower power consumption among heterogeneous chips.
This comes at a price: a complex stacking structure complicates both how heat is generated and the paths by which it dissipates. For example, chips with greater power consumption may sit off the center of the package, and the chips may have different thicknesses. All of this makes heat dissipation and thermal sensing different from traditional packaging and complicates measurement and temperature monitoring throughout the reliability verification period.
Figure 1: heterogeneous integrated chip
In summary, reliability verification design must face three challenges: heat dissipation and balance capability, the voltage limits of the test system, and the complex heat dissipation paths of heterogeneous integration. The recommendations of the iST RA lab are summarized below:
I. Stabilize and Control the Heat Generated by High-Power AI Chips with a Liquid Cooling System
The thermal design power (TDP) is the specification a CPU sets for the “heat dissipation capability” of the main board. The maximum TDP of desktop CPUs is currently around 150 W. Computer game players tend to upgrade the motherboard, heat sinks, fans and other accessories so that the heat dissipation capacity of the upgraded system exceeds the TDP requirement, keeping the CPU running at high efficiency and frequency for a long time without dropping to a lower frequency or sleeping because of overheating.
This is not the case with servers, HPC and cloud AI chips, whose TDP requirement now reaches a super high 200 W. Limited by packaging structure and materials, it is hard to hold the junction temperature of such chips at the target value with air flow alone.
The target temperature required by reliability verification is 125°C, far above the 70°C required by desktop computers. In most cases, the thermal protection function of the chip has to be disabled during the 125°C reliability experiment, so the chip can easily go into thermal runaway. Therefore, to test the reliability of high-power ICs at high temperature, the test system must offer even faster heat dissipation.
To address this, the iST RA lab employs a high-efficiency liquid cooling system together with a tailor-made liquid circulation socket (Figure 2). Thanks to the faster heat exchange of liquid compared with air, real-time chip temperature monitoring and tuning of the liquid flow rate, the system controls the heat generated by super-high-power AI chips so that test data can be collected successfully.
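One common way to relate hours at the 125°C stress temperature to hours at a lower operating temperature is the Arrhenius acceleration model. The article does not say which model the lab uses, so the sketch below is an assumption: the activation energy of 0.7 eV is an illustrative value, and the 70°C use temperature is simply borrowed from the desktop figure above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_use_c: float,
                           t_stress_c: float) -> float:
    """Acceleration factor of a stress temperature over a use temperature."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / t_use_k - 1.0 / t_stress_k))


# Ea = 0.7 eV is an assumed value; real values depend on the failure mechanism
af = arrhenius_acceleration(0.7, t_use_c=70.0, t_stress_c=125.0)
equivalent_hours = 1000 * af  # 1000 stress hours expressed as use hours

print(f"Acceleration factor: {af:.1f}")
print(f"1000 h at 125 degC ~ {equivalent_hours:.0f} h at 70 degC")
```

Because the factor is exponential in temperature, a junction that drifts a few degrees away from 125°C changes the equivalent lifetime noticeably, which is one reason tight temperature control matters during the experiment.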
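The idea of tuning the liquid flow from a real-time temperature reading can be sketched as a simple incremental controller. This is only an illustration of the principle, not the lab's actual control algorithm; the function name, the gain and the flow limits are hypothetical.

```python
def next_flow_percent(current_flow: float, tj_measured_c: float,
                      tj_target_c: float, gain: float = 5.0,
                      min_flow: float = 10.0, max_flow: float = 100.0) -> float:
    """Return the next pump flow setting as a percentage of full flow.

    The flow is raised when the chip is hotter than the target and lowered
    when it is cooler, then clamped to the pump's usable range.
    """
    error_c = tj_measured_c - tj_target_c
    new_flow = current_flow + gain * error_c
    return max(min_flow, min(max_flow, new_flow))


# Chip is 3 degC above a 125 degC target: increase the flow
print(next_flow_percent(50.0, 128.0, 125.0))
# Chip is 5 degC below target: reduce the flow, but never below the minimum
print(next_flow_percent(20.0, 120.0, 125.0))
```

A production system would likely add integral and derivative terms and per-socket tuning; the sketch only shows how a measured temperature error maps to a flow adjustment.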
Figure 2: Liquid cooling socket (Courtesy: Enplas)
II. Pre-Simulate the Burn-in PCB in Software to Prevent Poor Performance of the Finished Board
Because an advanced process is used, the operating voltage of an AI chip falls below 1 V. A DC IR drop (Figure 3) can occur on the PCB when a large current passes through its circuits. This further lowers the already super low operating voltage and can fail the AI chip because of an inadequate supply voltage margin.
In addition, simultaneous switching noise (SSN) at different frequencies occurs when the IC draws a large current from its power supply.
Different PCB layouts result in different power plane impedance values (Figure 4) at different frequencies. When the impedance exceeds the design target at a certain frequency, it causes heavy AC power noise and power ripple which, in turn, can cause the reliability experiment to fail because of an inadequate power noise margin.
To address these issues, the iST RA lab provides a full range of layout design tools to deal with problems such as IR drop and power plane impedance. By tuning the width and length of power traces, the size and count of vias, the value and location of decoupling capacitors, and other factors, they aim to prevent poor performance of test PCBs after assembly.
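A rough feel for how little resistance a sub-1 V rail tolerates comes from Ohm's law, V_drop = I × R. In the sketch below the 20 A load current, the path resistances and the 3% supply tolerance are assumed values for illustration, and the function names are hypothetical.

```python
def voltage_at_chip(supply_v: float, current_a: float,
                    path_resistance_ohm: float) -> float:
    """Voltage that reaches the chip after the DC IR drop on the PCB."""
    return supply_v - current_a * path_resistance_ohm


def within_margin(v_chip: float, nominal_v: float,
                  tolerance_fraction: float) -> bool:
    """True if the chip still sees at least the minimum allowed voltage."""
    return v_chip >= nominal_v * (1.0 - tolerance_fraction)


# Assumed example: 1.0 V core rail, 20 A load, 3% allowed undervoltage
for r_ohm in (0.001, 0.002):
    v = voltage_at_chip(1.0, 20.0, r_ohm)
    status = "OK" if within_margin(v, 1.0, 0.03) else "FAIL"
    print(f"{r_ohm * 1000:.0f} mOhm path: {v:.3f} V at chip, {status}")
```

With these assumed numbers, going from 1 mΩ to 2 mΩ of path resistance is enough to push the rail out of tolerance, which is why trace width, via count and plane design are simulated before the board is built.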
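The design target mentioned here is often estimated with the widely used target impedance rule, Z_target = V × allowed ripple / transient current. The sketch below applies that rule to a made-up impedance profile; the rule itself is a general power integrity heuristic rather than something the article specifies, and every number and name in the example is hypothetical.

```python
def target_impedance_ohm(supply_v: float, ripple_fraction: float,
                         transient_current_a: float) -> float:
    """Maximum power plane impedance that keeps ripple within the allowance."""
    return supply_v * ripple_fraction / transient_current_a


def frequencies_over_target(impedance_profile: dict, z_target: float) -> list:
    """Frequencies (Hz) at which the simulated impedance exceeds the target."""
    return [f for f, z in sorted(impedance_profile.items()) if z > z_target]


# Assumed rail: 1.0 V, 3% allowed ripple, 10 A transient current
z_target = target_impedance_ohm(1.0, 0.03, 10.0)

# Hypothetical simulation result: frequency in Hz -> impedance in ohms
profile = {1e5: 0.001, 1e6: 0.002, 1e7: 0.005, 1e8: 0.004}

print(f"Target impedance: {z_target * 1000:.1f} mOhm")
print("Over target at (Hz):", frequencies_over_target(profile, z_target))
```

Frequencies flagged this way are where decoupling capacitor values and placement, or the plane geometry, would be adjusted before fabrication.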
Figure 3: IR drop simulation
Figure 4: Power plane impedance simulation
III. Tailor-Made Jigs to Fit Dies of Different Thickness
Heterogeneous integrated AI chips may contain dies of different thicknesses. Reliability verification therefore requires jigs with a chip-specific IC socket, heat sink and sensor that fit dies of different heights, for better heat dissipation and more accurate temperature measurement and monitoring (Figure 5).
Figure 5: Customized IC test socket
IV. Thermal Diode Monitoring Circuit to Watch the Internal Temperature of ICs
The super high power consumption of a cloud AI chip tends to cause unexpected failures (e.g., thermal runaway) when heat dissipation lags behind large fluctuations in chip body temperature. For ICs with internal thermal diodes, the iST reliability system and its test board can include a tailor-made thermal diode monitoring circuit to watch the internal temperature of the ICs and obtain the most immediate and accurate junction temperature (Figure 6).
The fast response of this approach, combined with the high-efficiency liquid cooling control system described above, is ideal for the immediate heat dissipation required by the fast temperature changes of super-high-power AI chips. In addition, the thermal diode monitoring circuit can measure the temperature of individual chips to collect more accurate and reliable data.
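Thermal diodes are commonly read with the two-current method: the diode is forced at two currents with a known ratio N, and the temperature follows from the difference in forward voltage, T = q × ΔVbe / (n × k × ln N). The article does not describe the lab's circuit, so the sketch below is only an illustration of that general principle; the ideality factor of 1.008, the current ratio of 10 and the sample reading are assumed values.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23      # Boltzmann constant, J/K
ELECTRON_CHARGE_C = 1.602176634e-19   # elementary charge, C


def diode_temp_c(delta_vbe_v: float, current_ratio: float,
                 ideality: float = 1.008) -> float:
    """Junction temperature in degC from a two-current thermal diode reading.

    delta_vbe_v is the forward voltage difference measured between the high
    and low forcing currents; current_ratio is high current / low current.
    """
    t_kelvin = (ELECTRON_CHARGE_C * delta_vbe_v
                / (ideality * BOLTZMANN_J_PER_K * math.log(current_ratio)))
    return t_kelvin - 273.15


# Assumed reading: about 79.6 mV difference at a 10:1 current ratio
print(f"Tj = {diode_temp_c(0.079633, 10.0):.1f} degC")
```

Because the result depends on a voltage difference rather than an absolute voltage, the reading is largely independent of the diode's saturation current, which is part of why on-die diodes give a fast and direct view of junction temperature.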
Figure 6: IC thermal diode monitoring circuit example
Drawing on its hands-on experience with consumer chips, automotive chips, 5G chips and now AI chips, the iST RA lab is confident that it can tackle the issues of super high power, super low voltage and heterogeneous integration that AI chips face in reliability verification design. The lab is committed to providing you with reliability verification data (e.g., accurate temperature and voltage values) to improve the reliability of AI chips.
If you need more information on the challenges and details of AI chip reliability solutions, please email us for a copy of the iST RA lab’s illustrations of the challenges faced by reliability verification design for different AI chips and the available solutions. Please contact Mr. Bear Hsu at 886-3-579-9909 ext. 6428 │ Email: email@example.com