
AI's New Iron Age

  • RCD
  • Sep 16, 2024
  • 8 min read

Updated: Mar 8

AI infrastructure dominates Tech hardware industry growth


It wouldn't be an exaggeration to say that AI could be more consequential to the Tech hardware industry than the PC, the internet, and even the smartphone. Capex spending to develop the next-generation large language models (LLMs) has skyrocketed. Half of that investment flows into Tech hardware. At the current trajectory, the value of AI-related equipment shipped will surpass 10% of the entire Tech hardware industry in the next two years. 


This rapid growth has exceeded even the most bullish forecasts. In its initial attempts to size the AI infrastructure hardware opportunity, RCD Advisors had estimated 44% growth. Actual growth has been closer to 200%!


Earlier this year, Nvidia introduced its new Blackwell processor and two rack systems based on its architecture. The GB200 NVL72 and the GB200 NVL36x2 are mainframes that connect 72 GPUs in a rack configuration and can scale up to clusters of 576 GPUs for LLM training. They are geared for training trillion-parameter foundation models. These new systems are reminiscent of the mainframes (often called Big Iron for their large, bulky metal racks) that were the primary way Tech companies delivered computing before the introduction of the PC.


These giant AI racks solve some of the syncing, coherency, and latency issues with training LLMs over thousands (and soon, hundreds of thousands) of processors. Some gains come from the brute-force tradeoff of packing more processors together and dealing with the resulting thermal management and power delivery challenges. However, a lot of the performance gain is also due to Nvidia's processor-to-processor interconnect, NVLink. See here and here for an excellent tutorial on the various AI hardware tradeoffs and compute approaches to having processors run coherently.


Before the announced redesign, Blackwell was expected to ramp up quickly, representing 10% of all data center GPU shipments by the end of 2024. That expectation may be too high now because of the recently announced delays. RCD Advisors estimates that volume should reach 4M units by the end of 2025.


Updated March 2025: Since this graphic was published initially, RCD Advisors revamped estimates to incorporate new disclosures from ASIC suppliers, hyperscaler build-outs, and more timely visibility into TSMC CoWoS capacity allocation.


When the GB200 NVL rack systems were first announced, most expected they would only appeal to the hyperscalers. The challenges of cooling and power demand (the NVL72 consumes 120kW and requires liquid cooling) were immense. However, since Nvidia's Blackwell delay announcement, the priority has shifted to meeting the demand from foundation model makers and delivering the large-scale rack systems with Grace Blackwell Superchips.


Analysts now anticipate that most Blackwell processors (at least initially) will be delivered through the GB200 Big Iron form factors. Morgan Stanley estimated that Blackwell could ship a staggering $210Bn of revenue in the NVL form factors in 2025. While Nvidia has not offered guidance, comments on recent earnings calls have been bullish. 


Updated March 2025: Since this graphic was published initially, RCD Advisors revamped estimates to incorporate new disclosures from ASIC suppliers, hyperscaler build-outs, and more timely visibility into TSMC CoWoS capacity allocation.


Is the Tech hardware industry entering a new Big Iron Age, driven by next-generation LLMs? If so, this has significant implications for the Tech hardware supply chain.


One of Big Iron's main characteristics is that it pushes the technology envelope. In fact (and we are certainly not the first to point this out), many technologies used in large-scale AI computing systems today are reincarnated from mainframes during their heyday. 



But some technologies are altogether new. Nvidia's GB200 mainframe is the first to incorporate chip interconnects transferring data at 224Gbps. The transition to 224Gbps requires a heavy dose of HDI PCB technology over motherboards with very high layer counts. The GB200NVL racks use a copper Twinax backplane to connect the processor trays to the switch trays. Power delivery to the processor also demands new technologies. Each Blackwell processor requires ~1000A of current at 1V. Multiple VRM modules are mounted directly under the IC, on the opposite side of the PCB, to deliver that power.
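The power-delivery figures above can be put in perspective with some quick arithmetic. The sketch below uses only numbers quoted in this post (~1000A at 1V per processor, 72 GPUs per NVL72 rack, 120kW per rack). It assumes, purely for illustration, that all 72 GPUs draw their full current at the same time; the function names are ours and hypothetical.

```python
# Back-of-the-envelope power arithmetic using figures quoted in this post.
# Assumption (ours): every GPU in the rack draws its full current at once.
CURRENT_PER_GPU_A = 1000.0   # ~1000 A per Blackwell processor
CORE_VOLTAGE_V = 1.0         # ~1 V supply
GPUS_PER_RACK = 72           # GB200 NVL72
RACK_POWER_KW = 120.0        # quoted NVL72 rack consumption


def gpu_power_kw(current_a: float, voltage_v: float) -> float:
    """Electrical power P = I * V, returned in kilowatts."""
    return current_a * voltage_v / 1000.0


def gpu_share_of_rack(gpus: int, per_gpu_kw: float, rack_kw: float) -> float:
    """Fraction of total rack power drawn by the GPUs alone."""
    return gpus * per_gpu_kw / rack_kw


per_gpu = gpu_power_kw(CURRENT_PER_GPU_A, CORE_VOLTAGE_V)
share = gpu_share_of_rack(GPUS_PER_RACK, per_gpu, RACK_POWER_KW)
print(f"Per-GPU power: {per_gpu:.1f} kW")
print(f"GPUs' share of rack power: {share:.0%}")
```

Under these assumptions, each processor draws roughly 1kW and the GPUs alone account for about 60% of the rack's 120kW budget, which helps explain why the VRMs sit directly under the IC.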


Mainframes have always been vehicles for delivering cutting-edge computing technology. As a result, they are always proprietary and delivered by vertically integrated suppliers. IBM's mainframes were (and still are) custom systems. Back in the "old" days, IBM owned fabs throughout New York's Hudson Valley, had its own PCB manufacturing plant, and even, at one point, made cable assemblies. Fujitsu, NEC, and other Big Iron suppliers were similarly structured.


Proprietary systems deliver new technologies faster to market. Nvidia could commercialize 224Gbps PAM4 faster because it had a proprietary NVLink interconnect. The same is true for LPO (linear pluggable optics), liquid cooling, and other technologies. 


Vertically integrated organizations also drive leading-edge technology faster than merchant suppliers because they can control the entire supply chain. The end markets for AI and model development are changing so fast that the only way for organizations to stay competitive is to control as many parts of the value chain as possible. xAI recently abandoned plans to acquire cloud computing services from Oracle to build an AI data center from the ground up in Memphis.


Like almost everything else associated with AI, the AI mainframe originated with Google. Google designed custom TPU processors (with Broadcom) and a unique optical switch technology. However, the bulk of the commercial market today is captured by Nvidia, which has transitioned from an IC maker to a server maker and a full-stack AI supplier. Nvidia's 85+% market share marks another feature of Big Iron value chains: there is typically one dominant supplier. Nvidia has become the AI infrastructure behemoth, just as IBM was over thirty years ago during the first Big Iron Age.


Nvidia's dominant market share, proprietary hardware, and vertical integration strategy have influenced its competitors. In mid-August, AMD announced it would acquire ZT Systems to move upstream and be able to design and engineer rack systems. It is a considerable "acqui-hire," and it is clear that AMD is moving in the same direction. It is also a departure from its past attempts to work with a standardized AI server form factor.


Nvidia's downstream customers are also acquiring custom IC design capability. Every hyperscaler is developing a proprietary processor to break the Nvidia monopoly. It is only natural for folks forking over billions of dollars yearly to one supplier to wonder if they can spend that money more efficiently. 


How will the new Big Iron Age evolve? Well, Nvidia, if not directly, has already laid out its roadmap by stating that it will introduce a new processor every year for the foreseeable future. TSMC, the foundry fabricating the GPUs, isn't on the same cadence. One reason Nvidia can achieve this timetable is that its roadmap lags behind TSMC's. Part of the lag is due to the difficulty in designing with the latest N3 and N2 process nodes. However, it is also because Apple has reserved all the 3nm and 2nm capacity at TSMC and has crowded out other potential customers.


Another reason Nvidia can achieve this timetable is that it isn't deriving improvements simply by moving to new fabrication nodes every year. A vertically integrated Nvidia has many more knobs to tweak across the AI value chain. Nvidia will likely introduce system-level innovations like co-packaged optics in between process node advancements. 


How long will the new Big Iron age last? There is no way to know. As has been quoted often, "These are still the early innings." Technologists are busy trying to break through memory walls, interconnect walls, power, and cooling constraints. Almost everyone in the AI community (see Trevor Cai's keynote at the recent Hot Chips conference) frames this question in terms of scaling trends. There is no end in sight as long as the models continue to churn out 4x to 7x yearly performance improvement. Everyone is looking for an extrapolation.
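To see why everyone is looking for an extrapolation, it helps to compound the quoted 4x to 7x yearly improvement over a few years. The sketch below is only an illustration: the three-year horizon is a hypothetical choice of ours, and the calculation assumes the yearly factor stays constant, which nothing in the scaling data guarantees.

```python
# Compound a constant yearly improvement factor over several years.
# The 4x and 7x factors come from the text; the horizon is hypothetical.
LOW_YEARLY_FACTOR = 4
HIGH_YEARLY_FACTOR = 7
HORIZON_YEARS = 3  # illustrative assumption, not a forecast


def compounded_gain(yearly_factor: int, years: int) -> int:
    """Total improvement after `years` of constant yearly gains."""
    return yearly_factor ** years


low = compounded_gain(LOW_YEARLY_FACTOR, HORIZON_YEARS)
high = compounded_gain(HIGH_YEARLY_FACTOR, HORIZON_YEARS)
print(f"After {HORIZON_YEARS} years: {low}x to {high}x")
```

Even at the low end of the range, three years of compounding yields a 64x improvement, and the spread between the low and high cases widens quickly. That sensitivity is why the choice of extrapolation matters so much.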


But investors don't care about those details, and Wall Street is fretting over "ROI" walls. There are already concerns that Nvidia looks like Cisco during the run-up to the dot-com bubble in the late 1990s. Investor anxieties increase when CEOs justify Capex spending by citing FOMO (Fear Of Missing Out), or when CEOs reportedly tell employees that they are "willing to go bankrupt rather than lose this race." How is that for animal spirits on meth? It is going to be a wild ride.


There are a handful of implications for RCD Advisors' clients in the component supply chain.


First, nothing is stable. Expect rapid design changes. Sometimes, those design changes (like the Blackwell redesign) can happen within the tick of one cycle. It wouldn't be surprising for suppliers to go from winning the socket to being shut out before the product ships. These successes and failures will happen when designs are churning fast to capture markets growing at triple digits. In practical terms, component suppliers must find ways to share manufacturing investment risks with their equipment customers. Suppliers and equipment vendors have to be proportionately vested. 


Second, there is always a viable threat of backward integration. In the first Big Iron Age, IBM was notorious for threatening its suppliers with backward integration during negotiations. Equipment makers did not sell mainframes in high volumes, and the barriers to entry were low for many of the component technologies that IBM did decide to source. It is not hard to see how an equipment maker could rationalize backward integration if a $30 component could potentially be a supply chain risk for a $3M system.
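The asymmetry in that last example is easy to quantify. The sketch below uses the $30 and $3M figures from the text; the function names are hypothetical and the calculation is just a ratio, not a model of actual supply chain risk.

```python
# Ratio of component cost to system price, using the figures in the text.
COMPONENT_COST_USD = 30.0
SYSTEM_PRICE_USD = 3_000_000.0


def component_share(component_usd: float, system_usd: float) -> float:
    """Component cost as a fraction of the system's selling price."""
    return component_usd / system_usd


def revenue_at_risk_multiple(component_usd: float, system_usd: float) -> float:
    """System revenue exposed per dollar of component cost."""
    return system_usd / component_usd


share = component_share(COMPONENT_COST_USD, SYSTEM_PRICE_USD)
multiple = revenue_at_risk_multiple(COMPONENT_COST_USD, SYSTEM_PRICE_USD)
print(f"Component share of system price: {share:.4%}")
print(f"System revenue at risk per component dollar: {multiple:,.0f}x")
```

The component is about 0.001% of the system price, yet a shortage puts 100,000 times its cost in system revenue at risk. Seen that way, bringing the part in-house can look cheap even if the internal cost is several times the merchant price.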


IC suppliers like AMD, Nvidia, and others can design analog and mixed-signal parts like SerDes, PMICs, and board management controllers if needed. They usually don't because the technology is mature, and the merchant market has much better capability at a competitive price.


However, that could change quickly in the next-generation systems (448Gbps SerDes, integrated VRMs, etc.). It wouldn't be odd to see AI suppliers invest in developing these technologies rather than rely on external expertise. This is even more likely if AI is the only major use case. (Nvidia already designs its own optical transceivers.)


Even if the AI equipment vendor is not inclined to take on the design internally, many alternative forms of integration, like equity investments, are possible. For example, as part of a supply agreement with Amazon, Astera Labs, a startup making PCIe SerDes chips, issued warrants allowing Amazon to buy up to 1.5 million shares of stock.


Practically, component suppliers need to act as full design service providers. It's about the internal expertise, not the component. They have to protect that expertise and guard their internal technical capability against poaching.


Third, be prepared to synchronize product developments. System architecture and hyperscaler improvements will likely come in waves and between wafer fab node advancements. That should set the investment timing pattern as the AI industry continues its growth trajectory.


Fourth, be skeptical of Big Iron standardization initiatives. Only a handful of customers will buy these massive rack systems, and they will want a single supplier to be responsible for the entire stack. In the first Iron Age, many second- and third-tier competitors tried to compete on compatibility with IBM mainframes. They never really succeeded. This is the same reason why Open RAN initiatives have failed to gain traction in 5G infrastructure.


The whole point of standardization is to lower acquisition costs (through more competition) and increase the performance-to-cost ratio. But that only works when performance gains slow down. There are still many more levers for a vertically integrated supplier to improve performance. Gavin Baker at Atreides Capital pointed out, in a semi-quantitative way, some of these levers across the hardware/software stack.


A handful of scenarios could play out where the new Big Iron Age gives way to the standardized 6U or 8U rack server like Nvidia's DGX/HGX boxes. But even in these form factors, there is a strong incentive to make modifications in an effort to improve performance.


For the time being, as long as customers are willing to bankrupt their companies to develop the next foundational model with ~50% performance improvements, chasing a ~10% cost reduction through standards activity seems futile. That may be the ultimate reason why AMD acquired ZT Systems to vertically integrate.


Finally, old business indicators don't apply. The semiconductor market has always been a good gauge for measuring the health of the rest of the Tech hardware industry. But those correlations will no longer apply. GPU silicon represents 70-80% of the selling price in a typical AI server. That is far from the 25% average silicon content for every other system in the Tech hardware industry. For most other component sectors (PCB, connectors, passives, etc.), the average content is typically only 2-3%. In AI hardware, those other components have less than 1% of value content. Expect sales correlations to decouple as AI computing becomes a more significant portion of the total Tech hardware industry value.
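A simple weighted average shows how the industry-wide silicon content drifts as AI takes a larger share of hardware value. The sketch below uses 75% (the midpoint of the 70-80% quoted above) for AI servers and 25% for everything else. The 10% AI share echoes the projection at the top of this post; the other share values are hypothetical and included only to show the trend.

```python
# Blended silicon content as AI's share of Tech hardware value grows.
AI_SILICON_CONTENT = 0.75     # midpoint of the quoted 70-80% range
OTHER_SILICON_CONTENT = 0.25  # quoted average for other systems


def blended_silicon_content(ai_fraction: float) -> float:
    """Value-weighted average silicon content across the whole industry."""
    return (ai_fraction * AI_SILICON_CONTENT
            + (1.0 - ai_fraction) * OTHER_SILICON_CONTENT)


# 0.10 echoes the post's projection; 0.25 and 0.50 are hypothetical.
for ai_fraction in (0.0, 0.10, 0.25, 0.50):
    blended = blended_silicon_content(ai_fraction)
    print(f"AI share {ai_fraction:.0%}: silicon content {blended:.1%}")
```

At a 10% AI share, the blended silicon content moves from 25% to about 30%. Semiconductor revenue would then grow faster than the hardware it goes into, which is the decoupling described above.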


Practically, component makers may have to begin treating the AI business as a separate organizational unit, much the same way as many already consolidate their automotive activity.


Yes, it will be a wild ride.


If you find these posts insightful, subscribe above to receive them by email. If you want to learn more about the consulting practice, contact us at info@rcdadvisors.com.


Edited for typographical errors and clarity (10/21)

 
 