Testing of the IBM z-series Modular Refrigeration Unit
Terrence Quinn
IBM Integrated Supply Chain, Procurement Engineering
Overview: This presentation will describe the methodology and hardware used in testing the current z-series MRU, with a focus on the hardware and software of the tester. Data from 4 years of testing the current generation of MRUs will be discussed, showing the evolution of the MRU design, its testing requirements, and the tester.
10/18/2007 TAQ/548A
Acknowledgement: The data and work presented on the Modular Refrigeration Unit tester are the cumulative teamwork of several individuals: Frank Cascio, Frank Desiano, Mark Sinclair and Terrence Quinn. In addition, the Modular Refrigeration Unit itself is the result of the work of a large team at IBM.
History: IBM began using refrigeration to cool its high-end servers in 1997. The unit was called a Modular Cooling Unit because of its modular design: it is field serviceable and can be replaced while the server is running. The design has evolved over time; the current unit is the 3rd generation of the design. The cooling system consists of:
- Cold plate (commonly referred to as the evaporator), which attaches to the processor module
- The Modular Refrigeration Unit (MRU), which contains the rest of the refrigeration system, including the compressor, the condenser and the control valves
- Blower, which moves air through the condenser and air-cools the MRU
- Control card electronics, which provide the power and control signals used to run the MRU
What is the Modular Refrigeration Unit?
- The part of the cooling system containing the compressor, control valves, condenser and thermal sensors. The current MRU is approximately 3 ft x 1 ft x 1 ft in size and weighs about 70 lb. It was designed to be field serviceable.
Basic Refrigeration Diagram
[Figure: compressor, condenser (with air flow), expansion valve and evaporator; refrigerant labeled as gas, high-pressure gas and liquid]
Diagram of the refrigeration cycle used in the current z-series
[Figure: MRU, compressor, condenser, expansion valves, evaporator, blower air flow and control card; refrigerant labeled as gas, high-pressure gas and liquid]
Testing Strategy: What is the best way to test the MRU? Considerations:
- Safe operation
  - Handling (MRU weight is approximately 70 lb)
  - Chemicals
- Component quality
  - Approximately 100 components
- Assembly quality
  - Electrical connections
  - Braze joints
- Refrigeration function
  - The MRU will typically run continuously (24 hours/day, 7 days/week) in the customer environment.
- Throughput
  - Ease of use
  - Need to support IBM Manufacturing demand fluctuations
  - High level of automation
Resulting Tester Design Decisions:
- Cart/stall-based tester design allows easy movement of the MRU.
- Computer controlled, with minimal operator interaction and a single, easy-to-use screen.
- Multiple (4) independent test cells per tester.
- 24-hour test with multiple (10) test conditions and multiple start/stop operations.
- Active monitoring of 10 thermistors and other sensors.
- Automatic pass/fail assessment.
- Permanent storage of all MRU results (downloaded to permanent media periodically).
- Operation of the MRU under PID (Proportional-Integral-Derivative) control at different powers: essentially, the MRU runs the evaporator at a target set point, approximating actual MRU operation in the field.
  - Operation of both loops at maximum power (maximum load on the MRU).
  - Operation of each loop at low power. Verifies that each loop is wired properly and that the MRU has its low-heatload function.
- Operation of the MRU under manual control at 2 different powers. Essentially a capacity test: the valves are set to a predetermined position and the resulting evaporator temperature is monitored.
- No leak test is needed in the tester, since the MRU is helium leak tested during the build operation, before it reaches the tester.
- Refrigerant level is monitored before and after the tester using scales.
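To illustrate what running the evaporator "at a target set point" under PID control means, here is a minimal Python sketch. Every number in it is a hypothetical assumption for illustration only: the first-order evaporator model and its coefficients, the gains, the 10 C set point, and treating the controller output as a valve stepper position limited to 0 to 500 steps. It is not the tester's or the MRU's actual control code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PID:
    """Discrete PID controller; gains and output limits are hypothetical."""
    kp: float
    ki: float
    kd: float
    setpoint: float
    out_min: float
    out_max: float
    integral: float = 0.0
    prev_measurement: Optional[float] = None

    def update(self, measurement: float, dt: float) -> float:
        # Positive error: evaporator warmer than the set point, so open the valve more.
        error = measurement - self.setpoint
        # Integral term, clamped to the output range to limit windup.
        self.integral += self.ki * error * dt
        self.integral = min(max(self.integral, self.out_min), self.out_max)
        # Derivative on the measurement (not the error) avoids a kick on set point changes.
        if self.prev_measurement is None:
            derivative = 0.0
        else:
            derivative = (measurement - self.prev_measurement) / dt
        self.prev_measurement = measurement
        output = self.kp * error + self.integral + self.kd * derivative
        return min(max(output, self.out_min), self.out_max)


def simulate(power_w: float, setpoint_c: float = 10.0, ambient_c: float = 25.0,
             seconds: float = 1800.0, dt: float = 1.0) -> Tuple[float, float]:
    """Run a toy evaporator under PID control; return (final temp C, valve steps).

    Hypothetical model: dT/dt = 0.001*P - 0.004*valve + 0.01*(T_ambient - T)
    """
    pid = PID(kp=20.0, ki=1.0, kd=2.0, setpoint=setpoint_c,
              out_min=0.0, out_max=500.0)
    temp = ambient_c  # evaporator starts at ambient temperature
    valve = 0.0
    for _ in range(int(seconds / dt)):
        valve = pid.update(temp, dt)
        dtemp_dt = 0.001 * power_w - 0.004 * valve + 0.01 * (ambient_c - temp)
        temp += dtemp_dt * dt  # forward-Euler integration step
    return temp, valve


if __name__ == "__main__":
    temp, valve = simulate(power_w=835.0)
    print(f"evaporator {temp:.3f} C at valve position {valve:.1f} steps")
```

In this toy model a higher `power_w` settles at a more open valve position for the same set point, which is why valve position under PID control carries information about how hard each loop is working.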
MRU Tester
[Photos]
MRU Tester Cart (close-up)
[Photo]
Primary Tester Results Monitored:
- First test passes: more of a qualitative measure. MRUs that fail can typically be repaired, but they require additional process time and handling.
- Evaporator temperature: this is the key function of the MRU, maintaining a constant desired temperature at the evaporator.
- Valve position: provides an indication of valve performance. The valve is the key component for controlling refrigeration performance; if it is not functioning well, MRU performance will suffer. The two valves work on the same refrigerant source, so some coupling between the loops occurs and the valves need to be similar.
- Thermistor performance: there are 10 thermistors in the assembly. 8 are non-critical and are used for monitoring MRU performance and for debug; 2 are critical and will shut down the MRU based on their readings. All of them must be verified to be working properly.
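The monitored results above lend themselves to an automatic pass/fail check. The sketch below is hypothetical: only the counts (10 thermistors, 2 of them critical, 2 valves) and the +/- 3 C evaporator specification quoted for the PID test come from this presentation; the valve-mismatch limit, the plausible thermistor range, which thermistor indices are critical, and the record layout are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

TEMP_SPEC_C = 3.0                      # +/- 3 C evaporator spec (PID test)
MAX_VALVE_DELTA_STEPS = 60.0           # hypothetical loop-to-loop valve limit
THERMISTOR_RANGE_C = (-40.0, 125.0)    # hypothetical plausible-reading window
NUM_THERMISTORS = 10
CRITICAL_THERMISTORS = {0, 1}          # hypothetical indices of the 2 critical ones


@dataclass
class CellResult:
    setpoint_c: float
    evap_avg_c: List[float]               # average temperature, evaporator 1 and 2
    valve_avg_steps: List[float]          # average stepper position, valve 1 and 2
    thermistors_c: List[Optional[float]]  # None models an open or shorted sensor


def assess(r: CellResult) -> List[str]:
    """Return failure reasons for one test condition; an empty list means PASS."""
    reasons: List[str] = []
    for i, t in enumerate(r.evap_avg_c, start=1):
        if abs(t - r.setpoint_c) > TEMP_SPEC_C:
            reasons.append(f"evaporator {i} out of spec ({t:.2f} C)")
    # Both valves share one refrigerant source, so they must behave similarly.
    if abs(r.valve_avg_steps[0] - r.valve_avg_steps[1]) > MAX_VALVE_DELTA_STEPS:
        reasons.append("valves too dissimilar")
    if len(r.thermistors_c) != NUM_THERMISTORS:
        reasons.append(f"expected {NUM_THERMISTORS} thermistors")
    lo, hi = THERMISTOR_RANGE_C
    # All thermistors must work; the critical ones are labeled separately.
    for i, t in enumerate(r.thermistors_c):
        if t is None or not lo <= t <= hi:
            kind = "critical" if i in CRITICAL_THERMISTORS else "monitoring"
            reasons.append(f"{kind} thermistor {i} not working")
    return reasons


if __name__ == "__main__":
    good = CellResult(25.0, [25.01, 24.98], [200.0, 210.0], [20.0] * 10)
    print(assess(good) or "PASS")
```

Returning a list of reasons rather than a bare boolean keeps the failure cause with the stored result, which matters for a process where failed MRUs are repaired and retested.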
Typical Test Results: Evaporator temperature
[Chart]
Test Results: Valve positions
[Chart]
Test Results: First Test Passes
[Figure: "Danu MRU Process Yield" chart; x-axis: Quarter or Week ending, 1Q03 through 3Q07; y-axis: 1st Test Pass Yield (%), 40 to 100]
Yield chart:
- A useful way to monitor MRU production.
- Similar to, but not exactly, an SPC (statistical process control) chart.
- The early yield problem (1Q03) was a result of process line startup and the learning curve.
- The recent yield problem was due to two unrelated issues:
  - A gradual change in critical temperatures over time (due to the collective effects of various design changes) caused the failure rate to creep up.
  - A valve motor change by the valve supplier (with no initial notification) severely impacted MRU performance. The MRU test parameters are so well tuned that the performance change immediately impacted results.
- The historic average MRU yield is 85%.
- Approximately one quarter of fails (about 4% of MRUs tested) are due to various component failures (thermistor fails, cable wiring defects). Another quarter of fails (about 4%) are due to assembly mistakes (improperly connected or loose cables).
- The remaining failures are a result of the two valves being too dissimilar to work together. The valve is a commercial part, not designed for this specific operation (dual loops with a single compressor). Due to the nature of the design, it is not possible to pre-sort valves.
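The yield figures above reduce to simple arithmetic. This sketch computes first-test pass yield and splits the historic 15% fail rate (100% minus the 85% yield) by the approximate shares given: one quarter component failures, one quarter assembly mistakes, and, as an inference from "the remaining failures", the other half valve mismatch. The function names are illustrative.

```python
from typing import Dict


def first_pass_yield(first_pass: int, tested: int) -> float:
    """First-test pass yield in percent: units that pass on their first run."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return 100.0 * first_pass / tested


def fail_breakdown(yield_pct: float, shares: Dict[str, float]) -> Dict[str, float]:
    """Split the fail rate (100 - yield) into categories by their share of fails.

    Result values are percentages of all MRUs tested.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    fail_pct = 100.0 - yield_pct
    return {name: fail_pct * share for name, share in shares.items()}


# Approximate shares from the yield discussion; the 0.5 for valves is inferred
# from "remaining failures" (1 - 0.25 - 0.25).
SHARES = {"component": 0.25, "assembly": 0.25, "valve mismatch": 0.50}

if __name__ == "__main__":
    for name, pct in fail_breakdown(85.0, SHARES).items():
        print(f"{name}: {pct:.2f}% of MRUs tested")
```

With an 85% yield this gives 3.75% component, 3.75% assembly and 7.5% valve-mismatch fails, consistent with the rounded 4% figures.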
Test Results: Evaporator temperatures
[Figure: histograms "MRU Performance Evaporator 1, PID, 835W - Average Temp" and "MRU Performance Evaporator 2, PID, 835W - Average Temp"; x-axis: Average Normalized Temperature (C), bins spanning roughly 24.93 to 25.08; y-axis: Percentage of Population]
Test Results: Evaporator valve position
[Figure: histograms "MRU Performance Evaporator 1, PID, 835W - Valve Position" and "MRU Performance Evaporator 2, PID, 835W - Valve Position"; x-axis: Valve Stepper Motor Position, bins spanning roughly 124 to 270; y-axis: Percentage of Population]
Test Results: Evaporator temperatures
[Figure: histograms "MRU Performance Evaporator 1, PID, 400W - Average Temp" (bins spanning roughly 24.2 to 25.8) and "MRU Performance Evaporator 2, PID, 0W - Average Temp" (bins spanning roughly 20.1 to 25.4); x-axis: Average Normalized Temperature (C); y-axis: Percentage of Population]
Test Results: Evaporator temperatures
[Figure: histograms "MRU Performance Evaporator 1, Valve @ 200, 500W - Average Temp" and "MRU Performance Evaporator 2, Valve @ 200, 500W - Average Temp"; x-axis: Average Normalized Temperature (C), bins spanning roughly 6.3 to 13.2; y-axis: Percentage of Population]
Test Results: Evaporator temperatures
[Figure: histograms "MRU Performance Evaporator 1, Valve @ 400, 835W - Average Temp" and "MRU Performance Evaporator 2, Valve @ 400, 835W - Average Temp"; x-axis: Average Normalized Temperature (C), bins spanning roughly 18.1 to 23.6; y-axis: Percentage of Population]
Population Distribution charts:
- Another useful way to monitor MRU production.
- The evaporator temperature distribution shows the population holding within +/- 1 C in the first test condition (MRU under active feedback control), while the specification is +/- 3 C. MRU performance is extremely consistent.
- The valve stepper motor position (average value) in the same test also shows extreme consistency in the design at the time of initial build.
- Single-loop tests show that when only one loop is running, MRU control is even tighter (also under active feedback control). The evaporator of the loop that is off is actually at ambient air temperature.
- The manual/capacity tests show more population variation. Still, they demonstrate very good consistency in production.
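The population distribution charts and the +/- 1 C versus +/- 3 C comparison can be reproduced from stored per-unit averages with a few lines of Python. This is a sketch: the equal-width bins with "<" and ">=" end bins are an assumption loosely modeled on the chart labels, not the tool actually used to draw the charts, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def histogram_pct(values: Sequence[float], lo: float, hi: float,
                  nbins: int) -> List[Tuple[str, float]]:
    """Percentage of the population in each bin.

    Equal-width bins between lo and hi, plus an underflow bin (< lo) and an
    overflow bin (>= hi), loosely modeled on the end bins of the charts.
    """
    if not values or nbins < 1 or hi <= lo:
        raise ValueError("need values, nbins >= 1 and hi > lo")
    width = (hi - lo) / nbins
    counts = [0] * (nbins + 2)  # [underflow, bin 0 .. bin nbins-1, overflow]
    for v in values:
        if v < lo:
            counts[0] += 1
        elif v >= hi:
            counts[-1] += 1
        else:
            # min() guards against float rounding pushing v past the last bin
            counts[1 + min(int((v - lo) / width), nbins - 1)] += 1
    labels = [f"< {lo:.3f}"]
    labels += [f"{lo + i * width:.3f}" for i in range(nbins)]
    labels.append(f">= {hi:.3f}")
    n = len(values)
    return [(label, 100.0 * c / n) for label, c in zip(labels, counts)]


def pct_within(values: Sequence[float], setpoint: float, tol: float) -> float:
    """Percentage of units whose average temperature is within +/- tol of setpoint."""
    if not values:
        raise ValueError("need at least one value")
    inside = sum(1 for v in values if abs(v - setpoint) <= tol)
    return 100.0 * inside / len(values)
```

Comparing `pct_within(avgs, setpoint, 1.0)` with `pct_within(avgs, setpoint, 3.0)` expresses the observed spread against the specification directly, independent of how the histogram bins are chosen.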
Field issues:
- MRU field failures are re-run on the tester as part of the failure analysis process. This helps diagnose problems quickly.
- Field failures are less than 1% (fails / (installs × months of operation)).
- Field failures fall into 4 main categories:
  - Over temperature/loss of refrigeration control
  - Compressor failures
    - The MRU compressor is generally always on (100% duty cycle), which is significantly different from other refrigeration applications.
    - Certain low-heatload operations were changed; particular standby conditions caused significant strain on the compressors.
  - Loss of refrigerant
    - Traced to handling issues and eventually corrected by process improvements in IBM Manufacturing.
  - Component failures
    - Only 3 thermistor failures over the course of the current MRU product.
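The field failure metric, fails / (installs × months of operation), is a rate per unit-month. A minimal sketch, assuming the denominator is accumulated as the sum of months each installed MRU has been in service (equal to installs × months when every unit has run for the same period); the function name and inputs are hypothetical.

```python
from typing import Sequence


def field_fail_rate_pct(fails: int, months_in_service: Sequence[float]) -> float:
    """Field failure rate in percent per unit-month.

    Assumption: the denominator is total unit-months of operation, i.e. the sum
    of months each installed MRU has run (installs * months when all are equal).
    """
    unit_months = sum(months_in_service)
    if unit_months <= 0:
        raise ValueError("no operating time recorded")
    return 100.0 * fails / unit_months
```

For example, a hypothetical 6 failures across 200 units that have each run 10 months gives 0.3% per unit-month. Summing per-unit service time, rather than multiplying a single install count by a single duration, keeps the rate honest while installs are still ramping up.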
Conclusion:
- The MRU tester has proved an effective way of controlling the incoming quality of MRUs to IBM.
- It has caught several issues before they impacted IBM or our customer base.
- It limits the types of defects that escape to the field.
- It has contributed to design improvements, particularly as part of the failure analysis process.