Chip Failures, GPU-Weighted: The Authoritative Primer

BLUF

Data-center GPUs fail at roughly 9% annualized at hyperscale, advanced packaging is the new dominant fault surface, the financial cost of that failure is funneled almost entirely back to NVIDIA (a ~$2.81B FY26 reserve growing >100% YoY), and no purpose-built insurance, monitoring, or repair-tooling stack spans all three layers: the physics that produce the failures, the telemetry that detects them, and the balance sheets that absorb the cost. That gap is what this brief maps.

Operations are anchored by two NVIDIA conversations (Lonny Orona, 2026-05-12 and Alex Zhu, 2026-05-27). Physics, detection, and circularity are anchored by the IMEC visit (Leuven), 2026-06-XX, primarily Ben Kaczer (Scientific Director, Advanced Reliability Robustness & Test) on degradation physics and on-chip aging monitors, Cedric Rolin (Program Director, Sustainable Semiconductor Technologies & Systems) on the chip-circularity economics, and Lizzie (LCA researcher) on secondary-market mechanics. External evidence is dominated by Meta’s three reliability papers (Llama-3 2024, Hardware Sentinel ASPLOS 2025, “How Meta Keeps Its AI Hardware Reliable” 2025-07), Meta’s Dixit et al. SDC paper (2021), Google’s “Cores that don’t count” (HotOS 2021), the OCP GPU & Accelerator RAS Requirements v1.7 (2025-10-23), and SEC 10-K warranty rollforwards across NVIDIA, AMD, Dell, HPE, SMCI, Intel, Broadcom, and Marvell.

This is a primer, not a thesis document. It surfaces evidence and ranges. Synthesis remains a human activity, per rdi-methodology.

Outline changes from Phase 0

§2 ASICs: searched specifically for TPU/Trainium/MI300/Maia field-reliability disclosures. The disclosure void is the finding. No hyperscaler publishes per-ASIC field failure rates for its own silicon; one Google goodput claim (>97% at 10K-chip scale) is the closest public number. Flagged as a known gap.
§2 Edge: the public quant data is thinner than expected. Jetson, Hailo, and Mythic publish operating-envelope specs and AEC qualification posture, not field failure rates. The brief pulls in AEC-Q100 DPPM targets and Backblaze drive-stat methodology as the closest analogs, and surfaces the gap explicitly.
§5 NVIDIA reserve figure: the brief uses the filed ~$2.81B FY26 number from NVIDIA’s 10-K, not the WarrantyWeek $8.22B figure that appears in prior briefs. The conflict and its resolution are documented in detail in reverse-logistics-warranty-tam-2026-05-29 §4; that finding is treated as settled here.
Post-publication integration of the IMEC visit (2026-06-XX). After initial publication, the on-site IMEC visit was integrated as a physics co-anchor. The brief now treats IMEC as the primary-source voice on degradation physics (§1), on-chip aging-monitor reality (§3), and chip-circularity economics (§4) — paralleling the operations anchoring on Lonny + Alex Zhu. Specific additions include the probability-game framing of reliability, workload-conditional guarantees as a new product surface, tamper-aware odometers as an emerging detection class, the “repair-doesn’t-exist-only-replacement-does” vocabulary correction, and the third-party-telemetry requirement for any insurance product built on chip data.

Reader’s glossary — key terms used throughout

A plain-English reference for the acronyms and concepts that recur across the brief. Skip if you already speak the language.

People at IMEC (referenced repeatedly):

Ben Kaczer — Scientific Director of IMEC’s Advanced Reliability Robustness & Test (AR2T) group. A reliability physicist who spends his career figuring out how transistors break down over time. Our primary-source authority on §1 (physics) and §3 (on-chip monitoring).
Cedric Rolin — Program Director of IMEC’s Sustainable Semiconductor Technologies & Systems (SSTS) group. Built IMEC.netzero (a virtual-fab carbon-footprint model). Our primary-source authority on chip recycling, secondary markets, and supplier data-sharing dynamics.
Lizzie — Life-cycle-assessment researcher on Cedric’s SSTS team.
Annelise — Investor at IMEC Ventures, IMEC’s venture arm investing in chip-adjacent startups (Olivier Rousseaux’s replacement).
Jeroen Van den Bosch — IMEC’s Chief of Staff / Chief Strategy Officer.

Chip-physics terms:

Wear-out — slow physical degradation of a transistor as it operates, eventually causing failure. Different from “infant mortality” (defects from manufacturing) or “soft errors” (random radiation hits).
Electromigration (EM) — current flowing through a microscopic wire physically pushes the metal atoms around until the wire breaks. Worse when wires are smaller and currents higher.
NBTI / PBTI (Negative/Positive Bias Temperature Instability) — electrical charges getting trapped inside a transistor’s insulating layer over time, slowly shifting its switching behavior.
TDDB (Time-Dependent Dielectric Breakdown) — the thin insulator inside a transistor eventually gives up under sustained voltage stress; once it breaks down, that transistor fails.
HCI (Hot Carrier Injection) — high-energy electrons damage the insulator while the transistor is switching, accumulating wear with use.
Thermal cycling fatigue — repeated heat-up / cool-down cycles cause materials to expand and contract; if neighboring materials expand at different rates, the joints between them crack over time.
CTE (coefficient of thermal expansion) — how much a material grows when it heats up. A mismatch between two bonded materials’ CTEs is the root cause of thermal-cycling damage.
HBM (High-Bandwidth Memory) — the stack of memory chips sitting next to a GPU die. Modern AI accelerators have HBM mounted in the same package as the compute chip.
TSV (Through-Silicon Via) — a microscopic vertical wire drilled through a silicon chip to connect stacked chips electrically. HBM stacks rely on thousands of TSVs.
Microbump — the solder ball connecting one stacked chip to another. Modern HBM uses microbumps at ~10-30 micron spacing.
CoWoS (Chip-on-Wafer-on-Substrate) — TSMC’s advanced packaging technology that bonds a GPU die plus its HBM stacks onto a single shared substrate. Used by NVIDIA H100, H200, GB200.
Monolithic die — a single chip made on a single piece of silicon, as opposed to several chiplets bonded together. Mostly the old model.
Chiplet — a smaller chip designed to be combined with other chiplets in a package. AMD MI300 and Intel Ponte Vecchio use chiplets.
SDC (Silent Data Corruption) — a hardware fault where the chip produces wrong results without flagging an error. A failure that lies about itself.
MTBF / AFR — Mean Time Between Failures / Annualized Failure Rate. AFR is the share of chips that fail in a year. The two describe the same failure behavior in different units and are often quoted side by side in industry; AFR is more intuitive (“9% of GPUs failed in the year”).
DPPM (Defective Parts Per Million) — automotive-industry quality metric. “1 DPPM” means 1 defective part out of every million shipped.
Goodput — the fraction of a chip cluster actually delivering useful work, accounting for failures, retries, slowdowns. Hyperscalers report goodput when they don’t want to reveal a raw failure rate.
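Because the MTBF / AFR entry above treats the two as views of the same quantity, a short conversion may help. This is a minimal sketch, assuming a constant failure rate (exponential lifetimes), which ignores infant mortality and wear-out; the function names are illustrative and not taken from any source cited here.

```python
import math

HOURS_PER_YEAR = 8760.0  # 365 days x 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Share of units expected to fail in one year, given MTBF in hours.

    Assumes a constant failure rate (an illustrative simplification).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """MTBF in hours implied by an annualized failure rate (0 < afr < 1)."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    mtbf = mtbf_from_afr(0.09)  # the ~9% AFR quoted in this brief
    print(f"9% AFR implies an MTBF of about {mtbf:,.0f} hours")
    print(f"round-trip AFR: {afr_from_mtbf(mtbf):.3f}")
```

Under this assumption a 9% AFR corresponds to an MTBF on the order of ten years per device, which is why the per-year percentage tends to be the easier number to reason about at fleet scale.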

Detection / telemetry terms:

Telemetry — sensor data streamed off a chip in real time (temperature, voltage, error counts, etc.).
In-chip monitor / SLM (Silicon Lifecycle Management) — sensors physically built into the chip during manufacturing that measure local temperature, voltage, and aging. Vendors: Synopsys, Cadence, proteanTecs.
Aging monitor / odometer — a specific kind of in-chip monitor that tracks how much “wear” the chip has accumulated. The on-chip equivalent of a car’s odometer.
Ring oscillator differential — the standard on-chip aging-measurement technique. Run two identical oscillator circuits, one almost continuously and one rarely; the frequency difference tells you how aged the active one is.
DCGM / NVML / NVSM — NVIDIA’s family of GPU telemetry interfaces (Data Center GPU Manager, NVIDIA Management Library, System Management). The basic plumbing every higher-level tool sits on top of.
NVSentinel — NVIDIA’s open-source Kubernetes-integrated tool for automated GPU health monitoring and self-healing.
Mission Control / Fleet Intelligence — NVIDIA’s commercial / managed-service versions of the same telemetry layer for large fleets.
ECC (Error-Correcting Code) — memory that can detect and often correct bit errors. ECC counters are one of the earliest signs of HBM stack degradation.
BIST / SLT / burn-in — test methods to catch defective chips before they ship. BIST = Built-In Self-Test (the chip tests itself). SLT = System-Level Test (final functional check). Burn-in = stress-test at high temp and voltage to weed out infant-mortality defects.
OCP RAS — Open Compute Project’s specification for chip Reliability/Availability/Serviceability. The cross-vendor standard hyperscalers and chip designers wrote together (v1.7, October 2025).

Operations / supply chain terms:

RMA (Return Merchandise Authorization) — the customer-side process to return a failed unit to the manufacturer for repair or replacement.
ARMA / Advance Replacement — NVIDIA ships a new unit before receiving the broken one; customer typically has 10 days to return the broken one.
ODM (Original Design Manufacturer) — companies like Quanta, Wistron that design and assemble servers / racks for NVIDIA-stack customers.
CM (Contract Manufacturer) — companies like Foxconn, Wistron that physically build hardware to a vendor’s specification. Often the same firms as ODMs.
FRU (Field Replaceable Unit) — the smallest piece that can be swapped in the field — a board, a memory module, etc. Chips themselves are not FRUs; the boards they’re on are.
ITAD (IT Asset Disposition) — the industry that takes old data-center hardware and resells, refurbishes, recycles, or destroys it. Iron Mountain is the public benchmark.
TPM (Third-Party Maintenance) — companies like Park Place and Service Express that maintain hardware after the OEM warranty expires.

Insurance / financing terms:

Warranty reserve / rollforward — money a manufacturer sets aside on its balance sheet to pay for future warranty claims. The 10-K shows the rollforward: starting reserve + new accruals − claims paid = ending reserve. NVIDIA’s FY26 reserve is ~$2.81B.
Reinsurance — insurance that insurance companies buy from each other to share large risks. Munich Re and Swiss Re are the dominant global reinsurers.
Parametric insurance — a policy that pays out automatically when a measurable trigger is hit (e.g., “if the fleet failure rate exceeds X%, pay Y”) rather than after a claims-adjustment process. Faster than traditional insurance; needs trusted measurements.
ILS (Insurance-Linked Securities) — bond-like instruments that let capital-market investors take insurance risk directly. Used heavily for catastrophe risk.
MGA (Managing General Agent) — a firm that underwrites insurance on behalf of a reinsurer/carrier, often using its own data. Coalition (cyber) is the canonical tech MGA.
Four-pillar test (Preston Wilson) — for any parametric product to work: (1) a trusted third-party measurement agent, (2) an agreed metric, (3) an actuarial model tied to historical loss data, (4) enabling sensor infrastructure. Currently no chip-failure product passes all four.

§1 — Wear-out is back, and advanced packaging is the new dominant fault surface

The plain-English version of this section: chips fail for two big reasons. First, the transistors physically wear out as they’re used (the “wear-out clock” — heat and current literally damage the materials). Second, modern AI accelerators are no longer one chip; they’re a stack of chips bonded together (the GPU die, the memory dies, the substrate, the interposer), and the joints between those materials crack under heat-cycling stress. Both problems get worse as chips get smaller and run hotter, which is exactly the direction NVIDIA, AMD, and the AI industry are headed. There’s a third, sneakier problem — silent data corruption — covered below.

Reliability is a probability game, not a point estimate

The right frame for everything that follows in this section came directly from Ben Kaczer, the reliability physicist who runs IMEC’s wear-out research group:

“It’s all a probability game. The spec is not ‘this chip works for 10 years.’ The spec is, for example, 99 out of 10,000 will work for 10 years.” [Interview: Ben Kaczer, IMEC AR2T, 2026-06-XX]

Three consequences follow, each one a reframe for how to think about chip-failure risk.

Variability has two sources. Every chip is slightly different the moment it’s made (atoms don’t land in exactly the same place; this is called time-zero variability and shows up as a yield problem in manufacturing). On top of that, every chip ages slightly differently in use (electrical charges get trapped in the chip’s insulating layers; this is time-dependent variability and shows up as a reliability problem in service). Modeling reliability means modeling both. [Interview: Ben Kaczer]
There is no single failure point — only a distribution of failure times. Different microscopic bonds in the chip’s insulator hold different amounts of energy, so some break early and some break late. What designers actually engineer against is a curve of failure probability over time, not “the year this chip dies.” [Interview: Ben Kaczer]
Accelerated testing is a projection, not a measurement. No one can wait 10 years for a 10-year reliability test, so reliability engineers crank up the voltage and temperature, measure a few hot points, then mathematically project how the chip would behave at normal conditions. Which math you use for that projection (Arrhenius models, power laws, exponentials, etc.) is genuinely contested — Ben’s line was that it’s “literally why the IRPS conferences exist.” IRPS is the annual International Reliability Physics Symposium where these projection-model debates happen. The implication for a financial product: the failure-rate “number” any insurer would underwrite against is a model output, not measured reality — so the choice of projection model is itself a risk to be priced. [Interview: Ben Kaczer]
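The third point can be made concrete with a toy projection. The sketch below fits two common voltage-acceleration forms (exponential in voltage, and a power law) to the same two stress measurements, then extrapolates both to a lower operating voltage. All numbers are hypothetical and chosen only for illustration; real qualification work uses many more stress points and statistical fitting.

```python
import math


def fit_exponential(v1, t1, v2, t2):
    """Exponential model: ln(t) = a - gamma * V. Returns a projector t(V)."""
    gamma = math.log(t1 / t2) / (v2 - v1)
    a = math.log(t1) + gamma * v1
    return lambda v: math.exp(a - gamma * v)


def fit_power_law(v1, t1, v2, t2):
    """Power-law model: ln(t) = b - n * ln(V). Returns a projector t(V)."""
    n = math.log(t1 / t2) / math.log(v2 / v1)
    b = math.log(t1) + n * math.log(v1)
    return lambda v: math.exp(b - n * math.log(v))


# Hypothetical stress results: (volts, hours to failure).
STRESS = [(1.6, 1000.0), (1.8, 100.0)]
V_USE = 0.8  # hypothetical operating voltage

(v1, t1), (v2, t2) = STRESS
exp_model = fit_exponential(v1, t1, v2, t2)
pow_model = fit_power_law(v1, t1, v2, t2)

exp_life = exp_model(V_USE)
pow_life = pow_model(V_USE)
print(f"exponential model: {exp_life:.3g} h at {V_USE} V")
print(f"power-law model:   {pow_life:.3g} h at {V_USE} V")
print(f"ratio: {pow_life / exp_life:.1f}x")
```

Both models reproduce the stress data exactly, yet their projections at the operating voltage differ by more than an order of magnitude. That spread is the model risk an underwriter would be taking on.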

The natural shape for a parametric insurance product on chip failure is therefore the distribution Ben described — coverage that pays when actual failures exceed the modeled curve — not a single MTBF (mean-time-between-failures) number. This is why §5 ends where it does.
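One way to picture "pays when actual failures exceed the modeled curve" is sketched below. It assumes a Weibull curve as the modeled failure distribution, which is a common reliability choice but not one the interview specified. The shape, scale, fleet size, per-unit payment, and limit are all hypothetical.

```python
import math


def weibull_cdf(t_years: float, shape: float, scale_years: float) -> float:
    """Modeled cumulative share of units failed by time t."""
    return 1.0 - math.exp(-((t_years / scale_years) ** shape))


def excess_failures(fleet_size, observed_failed, t_years, shape, scale_years):
    """Observed failures above the modeled curve (never negative)."""
    expected = fleet_size * weibull_cdf(t_years, shape, scale_years)
    return max(0.0, observed_failed - expected)


def payout(excess_units: float, pay_per_unit: float, limit: float) -> float:
    """Pay a fixed amount per excess failure, capped at the policy limit."""
    return min(limit, excess_units * pay_per_unit)


if __name__ == "__main__":
    # Hypothetical policy: 10,000 units, checked at year 2.
    shape, scale = 1.5, 12.0
    modeled = 10_000 * weibull_cdf(2.0, shape, scale)
    excess = excess_failures(10_000, 900, 2.0, shape, scale)
    print(f"modeled failures by year 2: {modeled:.0f}")
    print(f"excess over model: {excess:.0f}")
    print(f"payout: ${payout(excess, 20_000, 10_000_000):,.0f}")
```

The trigger is the gap between observed and modeled cumulative failures, so the product depends on a trusted failure count and an agreed curve rather than on any single lifetime number.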

Five physical mechanisms run the wear-out clock

Modern chips age through five well-understood degradation pathways. Each is sensitive to voltage, temperature, electrical current, and how often the chip is switching states: the four knobs AI accelerators push to the limit. The first four are damage at the transistor and wiring level; the last is damage at the packaging level. AI workloads stress all five simultaneously.

Mechanism	What it does	Why GPUs are exposed
Electromigration (EM)	Electric current flowing through a wire physically pushes metal atoms along, forming gaps until the wire breaks.	The vertical wires connecting stacked memory chips (TSVs) are now thinner and carry more current; the bottom of each wire is the hot spot. [Public: Cadence Resources, “What Influences TSV Reliability”]
NBTI / PBTI (Bias Temperature Instability)	Charges get trapped inside the transistor’s insulating layer while it’s on, shifting its switching behavior over time. NBTI affects one type of transistor; PBTI the other.	“As feature sizes decrease, the wearout effects of HCI and NBTI become more prevalent… voltage is not scaling as much as physical dimension.” Translation: transistors got smaller faster than the voltage running through them got smaller, so each transistor sees more electrical stress per unit area than ten years ago. [Public: SemiEngineering, “Aging Problems at 5nm and Below”]
TDDB (Time-Dependent Dielectric Breakdown)	The thin insulating layer separating parts of the transistor eventually breaks down under sustained voltage; once it does, the transistor shorts out.	The insulating layer is now about 1 nanometer thick — roughly the size of three atoms — leaving very little margin. [Public: SemiEngineering]
Hot Carrier Injection (HCI)	High-energy electrons collide with the insulator while the transistor is switching, gradually damaging it.	Worse when the chip switches fast and runs hot — the AI training workload profile.
Thermal cycling fatigue	When the chip heats up and cools down repeatedly, materials expand and contract at different rates; the joints between them crack.	An NVIDIA H100 package bonds silicon (which barely expands with heat) to copper (which expands a lot) to a polymer interposer to a laminate substrate. Four materials, four different expansion rates — exactly the conditions that fatigue the joints. [Public: Cadence Resources]

[Public: arxiv 2503.21165 — Extending Silicon Lifetime; NASA NEPP, Microelectronics Reliability; JEDEC JEP122 / JEP148 standards.]

Foundries publish a conservative envelope. The real safe operating area is larger — and uncharacterized.

Every chip-fabrication foundry (TSMC, Samsung, Intel) publishes what it calls the “guaranteed operating area” — a conservative box of voltage and current conditions inside which the foundry promises the transistor will last for the stated lifetime. The actual safe operating area — the conditions where the chip will still work, just outside the foundry’s official guarantee — is larger but mostly uncharacterized. IMEC’s job is to characterize that larger envelope for chip designers, particularly for emerging memory types (MRAM, RRAM) that need to operate at voltages higher than the standard guarantee covers. [Interview: Ben Kaczer]

Why this matters for the financial wedge: any insurance product that prices coverage on real-time chip telemetry (rather than just the nameplate spec) has to know where the real failure boundary is, not just where the foundry’s marketing language stops. That gap is where IMEC’s characterization work lives.

Workload-conditional reliability is genuinely new

Until very recently, every reliability spec was written against one nominal workload: 100% utilization at 125°C for 10 years, with no adjustment for how the chip is actually used. Margins are now so thin that designers are reconsidering. Ben put it plainly: “Margins are so small that people are looking for workarounds. Maybe I operate my transistor only 10% of the time — I should take this into account. So workloads are actually coming into it.” [Interview: Ben Kaczer]

The shift opens a new financial-product surface. A workload-conditional warranty would price coverage on actual usage data — utilization rate, temperature, voltage history pulled from the chip — rather than on nameplate spec. Conceptually it’s much closer to Progressive’s “how you drive determines your auto rate” model than to traditional weather-event insurance. The data layer already exists (the NVIDIA telemetry stack in §3); the financial product to sit on top of it does not. This is the cleanest mechanical bridge between the physics in §1 and the financing in §5.
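A usage-based product needs some way to turn logged telemetry into accumulated wear. The sketch below is one hypothetical form of such an odometer: it converts each logged interval into equivalent hours at the nominal spec condition (100% utilization at 125°C, as described above). The temperature weighting uses a "doubles every N degrees" rule of thumb with N as an illustrative parameter; the `Sample` fields are assumptions, not a real telemetry schema.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hours: float        # length of the telemetry interval
    utilization: float  # duty cycle, 0.0 to 1.0
    temp_c: float       # average die temperature over the interval


def equivalent_stress_hours(samples, ref_temp_c=125.0, doubling_step_c=10.0):
    """Convert logged usage into hours at the nominal spec condition.

    Each interval is weighted by duty cycle and by a temperature factor
    that doubles for every `doubling_step_c` above the reference and
    halves below it. Both parameters are illustrative assumptions.
    """
    total = 0.0
    for s in samples:
        temp_factor = 2.0 ** ((s.temp_c - ref_temp_c) / doubling_step_c)
        total += s.hours * s.utilization * temp_factor
    return total


if __name__ == "__main__":
    light_year = [Sample(hours=8760, utilization=0.10, temp_c=85.0)]
    nominal_year = [Sample(hours=8760, utilization=1.00, temp_c=125.0)]
    print(f"light use:   {equivalent_stress_hours(light_year):.2f} equivalent hours")
    print(f"nominal use: {equivalent_stress_hours(nominal_year):.2f} equivalent hours")
```

A warranty priced on the equivalent-hours figure rather than on calendar time is the Progressive-style model in miniature. The hard part, which the sketch skips, is agreeing on the weighting function and trusting the telemetry that feeds it.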

Advanced packaging multiplies the fault surface

The biggest reliability story of the AI cycle isn’t the GPU die itself; it’s the package it sits in. Until a few years ago, most chips were “monolithic,” meaning one chip on one piece of silicon. Modern AI accelerators are different: a GPU die plus 6–8 stacks of memory chips, all bonded together onto a shared substrate in a process called “advanced packaging.” The dominant flavor TSMC uses for NVIDIA is CoWoS (Chip-on-Wafer-on-Substrate); Intel’s counterparts are EMIB and Foveros, and AMD’s MI300 is also built on TSMC packaging. The reliability problem is that the more pieces you bond together, the more joints you create, and joints fail.

HBM stacks are mechanically fragile by design. Each high-bandwidth memory (HBM3e) stack contains 8 to 12 individual memory dies stacked on top of each other, connected through thousands of microscopic vertical wires (TSVs) and tiny solder balls (microbumps) spaced 20–30 millionths of a meter apart. HBM4 will tighten that spacing further to 10 microns. [Public: Siemens semiconductor packaging blog, 2026-04] A single bad connection in any one of those tens of thousands of microbumps can take down the entire packaged device. [Public: Cadence Resources]
The vertical wires (TSVs) fail two ways. They crack from heat-cycling stress (silicon and copper expand at very different rates), and they break from electromigration (the current eats them, see §1’s wear-out table). [Public: Cadence Resources]
Thermal density is a multiplier on every wear-out mechanism above. An NVIDIA GB200 Grace Blackwell superchip dissipates up to 1,200 watts. A full NVL72 rack runs over 120,000 watts. The surface heat density exceeds 500 watts per square centimeter, beyond what any air-cooled heatsink can physically handle, which is why liquid cooling is now mandatory. [Public: ToneCooling GB200 NVL72 analysis; Network World / Introl, 2026] Heat accelerates every degradation mechanism in §1.
The whole package can physically warp. Quoting Cadence: “When a GPU chip (silicon), an LSI bridge chip (silicon), an organic interposer (polymer), and a substrate (laminate) are bonded together and the system operates at 1400W, the mismatched coefficients of thermal expansion can cause warping, cracking, and connection failures.” In other words: the rates at which the four materials expand under heat are mismatched enough that the whole sandwich bends. [Public: Cadence Resources]
Non-repairability is structural — and important for the financial wedge. Once CoWoS bonds the HBM stacks to the GPU die, the assembly cannot be taken apart in the field; if one HBM stack fails, the whole multi-chip module typically gets scrapped. This is the physical reason behind Alex Zhu’s number: NVIDIA can only recover ~60 of every 100 returned units. The other 40 have die-level damage and can’t be saved. [Interview: Alex Zhu, 2026-05-27]
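The "more joints, more failures" point above is a series-system calculation: if any single connection failing takes down the package, the package survives only if every joint survives. A minimal sketch, assuming joints fail independently with the same probability; the per-joint probability and joint counts below are hypothetical.

```python
import math


def package_failure_probability(joint_failure_prob: float, n_joints: int) -> float:
    """Probability that at least one of n independent joints fails.

    Equivalent to 1 - (1 - p)**n, computed with log1p/expm1 so that
    very small per-joint probabilities do not lose precision.
    """
    return -math.expm1(n_joints * math.log1p(-joint_failure_prob))


if __name__ == "__main__":
    p_joint = 1e-6  # hypothetical per-joint failure probability over the service life
    for n in (5_000, 50_000, 500_000):
        risk = package_failure_probability(p_joint, n)
        print(f"{n:>7,} joints -> package failure probability {risk:.2%}")
```

Even a one-in-a-million joint becomes a percent-level package risk once tens of thousands of joints sit in series, and because the bonded assembly cannot be reworked in the field, each such failure scraps the whole module.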

Silent data corruption (SDC) is a different problem, and worse than wear-out at scale

Most of §1 above is about chips that die. SDC is the opposite — a chip that lies. It computes a wrong answer and doesn’t raise any error flag; the software downstream just sees “two plus two equals five” and proceeds as if nothing happened. SDC has only recently been recognized as a hyperscale-fleet problem, partly because it’s so hard to detect.

The headline rate. Meta’s landmark 2021 paper (Dixit et al.) documented SDC at scale: roughly 1 in every 1,000 silicon devices has a silicon-defect-driven SDC. That rate is orders of magnitude higher than the textbook “cosmic ray hits a bit and flips it” failure mode, which had been the dominant academic concern. [Public: Dixit et al. 2021 (Facebook); SIGARCH summary]
Google’s “mercurial cores.” Google’s 2021 paper Cores That Don’t Count found that SDC issues “afflict specific individual CPU cores rather than entire chips or a family of parts.” In plain English: one CPU has many cores; the defect is usually in one specific core, not the whole chip, and that core might compute correctly almost all of the time yet corrupt one calculation out of millions. [Public: Hochschild et al., HotOS 2021]
Root causes are silicon, not particles. Quoting the SIGARCH summary: “The root cause of such SDCs are silicon chips which are born defective (escaped manufacturing testing), become defective (aging), or just differ from each other (timing variability).” So SDC reads as a mix of all the §1 wear-out mechanisms — chips slightly damaged from birth, or slightly worn over time, producing intermittent computation errors. [Public: SIGARCH SDC summary]
AI accelerators inherit and amplify the problem. From the OCP industry whitepaper (authored by an NVIDIA engineer): “With increased silicon density in accelerators, silent data corruptions now occur at about one fault per thousand devices, much higher than cosmic-ray-induced soft errors.” In other words, the same 1-in-1,000 rate Meta saw in CPUs applies to AI GPUs — and probably worse, because GPUs are denser. [Public: OCP SDC-in-AI Whitepaper v1.1]
Why this matters operationally. During AI training, a corrupted calculation produces a wrong gradient, which can spike or stall the entire training job — sometimes after days of work. During inference, it silently biases the model’s output, which the user has no way to detect. The financial cost shows up not as “chip died and got returned” but as “training job had to restart” or “model produced garbage.” [Public: arxiv 2605.04213 — “The Anatomy of Silent Data Corruption: GPU Error Pattern Study”]
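To see why a 1-in-1,000 device rate matters at scale, the sketch below applies it to clusters of different sizes. It assumes devices are affected independently at the quoted rate, which is a simplification; the cluster sizes are illustrative, with 16,384 matching the Llama-3 cluster cited in §2.

```python
import math

SDC_DEVICE_RATE = 1 / 1000  # ~1 in 1,000 devices, per the sources quoted above


def expected_affected(n_devices: int, rate: float = SDC_DEVICE_RATE) -> float:
    """Expected number of SDC-prone devices in a cluster."""
    return n_devices * rate


def prob_at_least_one(n_devices: int, rate: float = SDC_DEVICE_RATE) -> float:
    """Chance that a cluster contains at least one SDC-prone device."""
    return -math.expm1(n_devices * math.log1p(-rate))


if __name__ == "__main__":
    for n in (8, 512, 16_384):
        print(
            f"{n:>6,} devices: expect {expected_affected(n):.2f} affected, "
            f"P(at least one) = {prob_at_least_one(n):.1%}"
        )
```

At single-node scale an SDC-prone device is unlikely to be present. At training-cluster scale the presence of several is close to certain, so detection has to be designed in rather than treated as an exception.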

Convergence — what our internal interviews said maps cleanly onto the physics

Alex Zhu (NVIDIA reverse logistics): “We can repair only ~60 of every 100 returned units.” [Interview: Alex Zhu, 2026-05-27] This is exactly what we’d expect from the CoWoS-non-repairability point above — the 40 unrepairable units are the ones with damage at the bonded-package level.
Andrzej Strojwas (PDF Solutions, the fab-floor data company): PDF’s Exensio monitoring system is deployed in “every TSMC fab” and PDF’s Symmetrics equipment-connectivity platform reaches 300+ equipment vendors. [Interview: Strojwas, 2026-05-22] PDF is the manufacturing-stage detection layer — they see defects at fab time, not at field time.
Vivian (semiconductor industry advisor): Each layer of the AI accelerator package — substrate, cooling, power-delivery, passives, testing — is a separate company. [Interview: Vivian, 2026-04-29] The industry’s company-level fragmentation maps onto the physical fault interfaces between layers — every joint between two companies’ products is also a joint that can fail.

For deeper material on packaging-driven reverse-flow mechanics, see reverse-supply-chain-research-2026-05-13 §5.

§2 — Failure rates by chip type × environment: the matrix

The plain-English version of this section: how often do chips actually fail in the wild? The honest answer is “it depends enormously on the chip and the workload.” AI training GPUs in a hyperscaler fail at roughly 9% per year. Workstation and server CPUs in one system builder’s data show no recorded failures. Storage drives fail at about 1.4% per year. The matrix below pulls together what’s been publicly disclosed and, just as importantly, flags what hasn’t been (especially for custom AI chips like Google’s TPU and Amazon’s Trainium, which are nearly information voids).

A key term used throughout: AFR (Annualized Failure Rate) = the percentage of units that fail in a year. So “~9% AFR” means roughly nine out of every hundred units fail per year.

The matrix (annualized failure rate; sources mixed; comparability caveats below)

Chip type	Environment	Annualized failure rate	Source / caveat
Data-center GPU (NVIDIA H100)	Hyperscaler training, 700W TDP	~9%	Meta Llama-3 paper, 2024: 466 job interruptions over 54 days on 16,384 H100s; ~78% hardware-related; ~1 failure every 3 hrs at the cluster; GPU faults 30.1% of interruptions, HBM3 17.2%. [Public: Tom’s Hardware Llama-3 coverage]
Data-center GPU (Blackwell GB200)	DC training, 1200W/superchip	Unknown publicly	Higher TDP, denser packaging → directionally higher; no field data yet. [Synthesis]
Data-center CPU (Intel Xeon W-2500/W-3500)	Workstation/server	~0% (no recorded failures)	Puget Systems 2025 Annual Reliability Report. [Public: Puget Systems Most Reliable Hardware 2025]
Custom DC ASIC (TPU, Trainium, Maia, MI300)	Hyperscaler	Disclosed only as goodput; not failure rate	Google: “>97% goodput at 10,000-chip scale” [Public: Futurum / Google TPU disclosure 2025]. No per-ASIC failure-rate disclosure for Trainium/Maia/MI300. Known gap.
Server DRAM (DDR4/DDR5 RDIMM)	Server	0.20-0.27%	Puget Systems 2025: Kingston RDIMM 0.20% (also reported as 0.19%); Micron 0.27%. [Public: Puget Systems 2025]
HBM3 (in-package)	DC training	Subsumed in GPU failure rate; ~17% of Llama-3 interruptions = ~0.44% of GPUs over the 54-day run (roughly 3% annualized)	Meta Llama-3. [Public: Tom’s Hardware Llama-3]
Edge AI accelerator (Jetson Orin Industrial)	-40°C to +85°C industrial, 5G sustained vibration	No field failure-rate publication	NVIDIA publishes the operating envelope, not field data. [Public: Syslogic Jetson Orin NX product page] Standard industrial reliability targets imply <1% AFR but are not field-validated. Known gap.
Edge AI accelerator (Hailo-8, Mythic)	-40°C to +85°C, embedded	No public field data	Hailo claims “hundreds of customer products already in deployment” and “>100K users” of its SDK but does not publish failure statistics. [Public: Hailo product page] Known gap.
Automotive IC (AEC-Q100 qualified)	-40°C to +150°C, 10-15 year life	Industry target: zero defects (<1 DPPM aspirational; in practice 1-10 DPPM)	AEC-Q100/Q200/Q004 framework; “the acceptable failure rate is zero.” [Public: SemiEngineering, “Drive Toward Zero Defects”; Power Electronic Tips, AEC-Q FAQ]
Industrial PLC silicon	24/7 industrial environment	Generally similar to automotive grade	No clean public field data we could find. Known gap.
Power semiconductor (SiC/GaN MOSFET)	High-power industrial / EV	Module-level; thermal cycling-dominated	Not a chip-failure-as-data-product story; module-level service.
Storage (HDD, fleet)	Hyperscaler cold storage	1.36% AFR (2025 fleet aggregate)	Backblaze 344,196-drive 2025 report; Q4 2025 AFR 1.13%; lifetime AFR 1.30%. [Public: Backblaze Drive Stats 2025] Drives are the closest public-data analog to “field reliability as commodity.”
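The brief does not spell out how the ~9% H100 figure is derived from the Llama-3 numbers in the matrix. One reconstruction that lands close to it is sketched below. It assumes the figure counts GPU and HBM3 faults only, that the percentage shares apply to all 466 interruptions, and that the 54-day rate scales linearly to a year; each of those is an assumption, not something the sources confirm.

```python
def annualized_failure_rate(failures: float, units: int, window_days: float) -> float:
    """Failures per unit over the window, scaled linearly to 365 days."""
    return (failures / units) * (365.0 / window_days)


# Figures as quoted in the matrix row for the Llama-3 run.
INTERRUPTIONS = 466
GPUS = 16_384
WINDOW_DAYS = 54
GPU_SHARE = 0.301   # share of interruptions attributed to GPU faults
HBM3_SHARE = 0.172  # share of interruptions attributed to HBM3

# Assumption: only GPU and HBM3 faults count as "GPU failures".
gpu_and_hbm_failures = INTERRUPTIONS * (GPU_SHARE + HBM3_SHARE)
afr = annualized_failure_rate(gpu_and_hbm_failures, GPUS, WINDOW_DAYS)

print(f"GPU + HBM3 failures in window: {gpu_and_hbm_failures:.0f}")
print(f"annualized failure rate: {afr:.1%}")
```

Counting every hardware-related interruption instead would give a higher number, which is one reason the comparability caveats below matter before any of these rates is quoted.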

Comparability caveats (read before quoting any number above)

These caveats matter — the numbers in the table look comparable but aren’t, quite.

“Failure” is defined differently by each source. Meta’s number counts any training-job interruption attributable to hardware. Puget Systems’ number counts only units that came back as defective during warranty (RMAs). Backblaze counts drives pulled from service for any reason. So Meta’s ~9% is broader than a strict “the chip is dead” definition and overcounts compared to Puget’s.
Workload dominates the number. Meta ran its H100s flat-out at training intensity for 54 days. The same H100s used for inference (running already-trained models) would show much lower failure rates, because inference is a far gentler workload. None of the published numbers correct for what the chip was actually doing.
Sample bias. Puget builds, configures, and supports its own systems carefully — well-sized power supplies, conservative cooling, proper burn-in. Hyperscaler fleets are run much closer to the edge. So Puget’s near-zero number doesn’t say much about how a CPU would fare in a less-supervised environment.
HBM is the second-biggest AI-GPU failure mode after the die itself. Meta’s data attributes 17.2% of interruptions to HBM, against 30.1% for GPU faults. The ~0.44% absolute figure in the matrix looks small because it counts HBM faults per GPU over the 54-day run, not over a full year.
Edge and automotive numbers are qualification targets, not field measurements. The AEC-Q100 framework is a stress-test certification, and “<1 DPPM” (one defective part per million shipped) is what auto suppliers promise their OEM customers — it’s a delivery quality target, not a how-many-fail-per-year-in-service number. The “10-15 year life” claim is the spec the chip is designed against, not data on whether real chips actually last that long.

How the environment changes the rate

Heat. Most of the §1 wear-out mechanisms run roughly twice as fast for every 10-15°C of additional temperature (this is the Arrhenius rule from chemistry: chemical reactions, including the ones damaging chips, speed up with heat). Blackwell racks run at inlet temperatures pushing the 45°C ceiling; every degree closer to that ceiling makes the wear-out clock tick faster. [Public: ToneCooling GB200 cooling spec]
Utilization. A Google data-center architect estimated GPU service life at 1-3 years when running at 60-70% utilization. Hyperscaler training workloads push much higher than 70%. reverse-supply-chain-research-2026-05-13 §5 has the deeper dive.
Power density. The GB200’s heat output per square centimeter (>500 watts) is right at the edge of what even direct-to-chip liquid cooling can carry away. If the cooling system briefly under-performs, thermal runaway becomes a real possibility. [Public: ToneCooling]
Altitude. Cosmic rays cause bit flips, and the rate is ~2-3x higher at 2km elevation than at sea level. But the Dixit / Hochschild research showed that silicon-defect SDC (chips with manufacturing imperfections) dominates hyperscale failures — particles are a small piece of the overall failure budget. [Public: Dixit et al. 2021]
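The "doubles every 10-15°C" rule of thumb in the heat item can be checked against the Arrhenius acceleration factor. The sketch below is illustrative only: the 0.7 eV activation energy and the 85°C baseline junction temperature are assumed values chosen for the example, not figures from the sources above, and real mechanisms have different activation energies.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_low_c: float, t_high_c: float, ea_ev: float) -> float:
    """Ratio of the wear-out rate at t_high_c to the rate at t_low_c."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low_k - 1.0 / t_high_k))


EA_EV = 0.7        # assumed activation energy (hypothetical, mechanism-dependent)
BASELINE_C = 85.0  # assumed junction temperature, not a rack inlet temperature

for delta_c in (10, 15):
    factor = arrhenius_acceleration(BASELINE_C, BASELINE_C + delta_c, EA_EV)
    print(f"+{delta_c} C above {BASELINE_C:.0f} C: wear-out rate x{factor:.2f}")
```

With these assumed inputs the factor lands a little under 2x for a 10°C rise and a little over 2x for 15°C, which brackets the rule of thumb. A lower activation energy would stretch the doubling interval.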

What the chip vendors disclose, and what they don’t

Public failure-rate disclosure is asymmetric by industry layer.

Hyperscalers publish. Meta has now published three consecutive papers (Dixit 2021; Llama-3 2024; “How Meta Keeps Its AI Hardware Reliable” 2025-07) [Public: Meta engineering blog 2025-07-22]. Google published “Cores that don’t count” (2021). Microsoft has co-authored OCP standards.
Chip vendors don’t. NVIDIA, AMD, Intel, Broadcom, Marvell, Qualcomm, TI, ADI publish neither field failure rates nor in-service MTTF for AI silicon. The only public chip-vendor signal is via warranty rollforward (§5).
Custom-ASIC owners disclose almost nothing. Google publishes TPU goodput at scale but no failure rates. AWS publishes no Trainium reliability data. Microsoft publishes no Maia reliability data. The custom-ASIC layer is the deepest disclosure void in the matrix.

Divergence — Meta’s 9% does not match Puget’s near-zero CPU number

Meta’s ~9% annualized AI-GPU failure rate sits far above Puget’s near-zero figure for Intel Xeon W-series CPUs; with a baseline that close to zero, the gap cannot be stated as a meaningful ratio. The two are not contradictory: they reflect different workloads, different parts, and different operating envelopes. But the gap is the central finding of the brief: the AI-accelerator reliability problem is not the “semiconductor” reliability problem. Server CPUs work; AI GPUs fail; the difference is advanced packaging, thermal density, and the fact that the AI GPU is run at near-100% utilization for months while the Xeon idles.

Convergence with internal sources: this is exactly the read reverse-supply-chain-research-2026-05-13 §5 produced from the Lonny conversation — AI-accelerator failure is a different category from “chips fail at the normal rate.”

§3 — Detection: telemetry, BIST, ECC, fleet observability

The plain-English version of this section: how does anyone know a chip is failing? There are three layers of detection. Layer 1 is sensors physically embedded inside the chip, measuring its own temperature, voltage, and aging. Layer 2 is the software interface NVIDIA (or AMD) exposes — error codes, error-correcting memory counts, temperature readings — accessible to anyone running the GPU. Layer 3 is the fleet-management software that watches thousands of GPUs at once, spots patterns, and either alerts an operator or self-heals.

There’s also a fourth layer that catches defective chips before they ship: factory test (burn-in, self-test). Together, these four layers are what any insurance product would need to draw on for trustworthy data.

Three layers stack on top of every GPU

Layer	What it sees	Frequency	Source
In-chip / on-die monitors (SLM)	Temperature, voltage droop, aging shifts, per-core timing margin, per-lane PHY margin	Continuous	Synopsys / proteanTecs / Cadence on-chip IP; Synopsys SLM PVT Monitor IP is AEC-Q100 Grade 2 and ASIL-B Ready [Public: Synopsys SLM article]; proteanTecs Proteus monitors HBM PHY in deployed GUC IP [Public: proteanTecs HBM whitepaper]; Ben Kaczer (IMEC AR2T) confirms NVIDIA-class chips “are aware of their temperature, voltage, frequency, and I believe they have aging monitors” [Interview: Ben Kaczer, 2026-06-XX]
Driver/SDK telemetry	XID error codes, ECC counts, power/thermal, NVLink/PCIe link status	Sub-second	NVIDIA DCGM, NVML, NVSM; surfaced through Kubernetes via NVSentinel [Public: Rafay NVSentinel writeup; NVSentinel GitHub]
Fleet observability	Cross-node correlation, longitudinal degradation, capacity availability, in-band + out-of-band telemetry	Continuous, fleet-wide	NVIDIA Mission Control, NVIDIA Fleet Intelligence [Public: NVIDIA developer blog on Fleet Intelligence]; hyperscaler internal stacks (Meta Fleetscanner/Ripple/Hardware Sentinel; CoreWeave Node Lifecycle)

On-chip aging monitors and the tamper-aware odometer

The standard method for on-chip aging detection is the ring-oscillator differential: one oscillator runs 99.99% of the time and ages with the workload; a second runs as a reference; the frequency delta between them reads as cumulative degradation. The technique is mature, the IP is commercial, and Ben Kaczer named the canonical vendor unprompted multiple times across the IMEC visit: “This goes in the direction of this proteanTecs. They’d be interesting for you — look at them.” [Interview: Ben Kaczer, IMEC AR2T, 2026-06-XX]
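The ring-oscillator differential reduces to a single ratio. The sketch below shows the arithmetic, with hypothetical frequency readings and a hypothetical guard band; it is not drawn from any vendor's monitor IP. Because both oscillators sit on the same die, temperature and supply variation should affect them similarly, so the difference is expected to isolate aging. That cancellation is an assumption of the sketch.

```python
def relative_aging_shift(f_stressed_hz: float, f_reference_hz: float) -> float:
    """Fractional slowdown of the stressed oscillator relative to the reference."""
    if f_reference_hz <= 0:
        raise ValueError("reference frequency must be positive")
    return (f_reference_hz - f_stressed_hz) / f_reference_hz


def exceeds_guard_band(shift: float, guard_band: float) -> bool:
    """True when cumulative degradation has consumed the allowed timing margin."""
    return shift >= guard_band


# Hypothetical readings: the reference still runs at 1.000 GHz,
# the always-on oscillator has slowed to 0.980 GHz.
shift = relative_aging_shift(f_stressed_hz=980e6, f_reference_hz=1000e6)
GUARD_BAND = 0.05  # hypothetical 5% timing margin

print(f"cumulative degradation: {shift:.1%}")
print("margin exhausted" if exceeds_guard_band(shift, GUARD_BAND) else "within margin")
```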

The brief’s earlier proteanTecs entries (the HBM analytics whitepaper, the Proteus PHY monitor in 3nm 8.8GT/s GUC silicon) had only marketing-side validation. Ben provides third-party validation from a working reliability physicist: the substrate for a chip-level digital twin exists in deployed product silicon today.

IMEC’s own contribution to this layer is a tamper-aware odometer — an on-chip aging monitor designed to detect whether someone has annealed, reset, or otherwise gamed the monitor to resell a worn chip as new. The canonical case Ben cited: a US Navy P-8 maritime patrol aircraft that turned out to be running counterfeit refurbished chips labeled as new. Counterfeit-relabeling is a real grey-market segment; a tamper-evident odometer is the substrate for any product that needs to underwrite chip provenance — including transferable warranty for the secondary market discussed in §4. [Interview: Ben Kaczer]

This matters for §5 / §6: the parametric four-pillar test requires a trusted third-party measurement agent. A manufacturer-controlled odometer can be reset. A tamper-resistant third-party-attestable aging signal is the foundational data layer any chip-failure insurance product would need.

NVIDIA’s own fleet observability stack (2025-2026)

NVIDIA has its own multi-layer software for monitoring GPUs in production. From bottom to top:

DCGM / NVML / NVSM are the basic plumbing — APIs and command-line tools the driver exposes for reading temperature, error counts, link status, etc. Every higher-level monitoring tool sits on these.
NVSentinel is NVIDIA’s open-source health-monitoring + self-healing tool built for Kubernetes clusters (the standard way large GPU fleets are orchestrated). NVIDIA’s own description: “born out of NVIDIA’s own operational experience managing some of the world’s largest GPU clusters.” It reads from DCGM, decides which GPUs are unhealthy, and triggers automatic recovery actions. [Public: Rafay / NVSentinel blog]
NVIDIA Mission Control is the commercial product (generally available) for managing GB200 NVL72 rack-scale systems. It’s part of NVIDIA’s AI Enterprise software subscription — i.e., NVIDIA charges for it. [Public: NVIDIA Mission Control 1.2.1 GA release notes]
NVIDIA Fleet Intelligence is the newest layer: a managed cloud service that watches large GPU fleets remotely, including subtle power-consumption patterns that can predict failure. The customer opts in to letting NVIDIA pull this data from their hardware. [Public: NVIDIA developer blog; Tom’s Hardware on Fleet Intelligence]

The strategic shape: NVIDIA is steadily climbing from “plumbing” (free, basic) to “managed cloud service” (paid, comprehensive) on the monitoring layer. The data NVIDIA could use to underwrite warranty risk is also the data they could sell back to operators as a managed service — which is structurally relevant to who would partner with whom on a chip-failure insurance product.

Hyperscaler in-fleet detection (the public state of the art)

The most important publicly available picture is Meta’s “How Meta Keeps Its AI Hardware Reliable” (July 2025). Meta runs three different detection tools in parallel, each catching different things:

Fleetscanner runs targeted diagnostic tests on GPUs during scheduled maintenance windows — a thorough but slow sweep that covers the whole fleet every 45-60 days. [Public: Meta engineering blog 2025-07]
Ripple is the opposite: lightweight tests that run alongside normal production workloads, executing in milliseconds. Meta gets fleet-wide coverage in days, not weeks. [Public: Meta engineering blog 2025-07]
Hardware Sentinel is the most novel tool. Instead of running explicit diagnostic tests, it watches normal software failure indicators (segmentation faults, crashes, log patterns) and infers which GPU is silently producing wrong results. The published ASPLOS 2025 paper showed Hardware Sentinel detects silent data corruption 41% better than running dedicated test workloads. This is significant because it doesn’t take any compute capacity away from production. [Public: Meta blog; ACM ASPLOS 2025 Hardware Sentinel paper]
Layered recovery. Once a fault is detected, Meta uses reductive triage (binary-search to find the failing component) and hyper-checkpointing (frequently saving training state so failures lose minimal work). Higher up the stack, gradient clipping and other math tricks protect training stability. The reliability story is multi-layered — no single tool does the job. [Public: Meta blog]
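Reductive triage, as described above, is a bisection over the set of suspect components. Meta's blog does not publish an implementation, so the sketch below is a generic binary search under two stated assumptions: exactly one node is faulty, and a reproduction job fails deterministically whenever that node is included. The node names and the `fake_repro` function are hypothetical.

```python
from typing import Callable, List, Sequence


def reductive_triage(nodes: Sequence[str],
                     job_fails_on: Callable[[Sequence[str]], bool]) -> str:
    """Return the single faulty node by repeatedly halving the candidate set."""
    candidates: List[str] = list(nodes)
    if not candidates:
        raise ValueError("need at least one node")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # If the repro fails on the first half, the fault is in it;
        # otherwise it must be in the remaining nodes.
        candidates = first_half if job_fails_on(first_half) else candidates[mid:]
    return candidates[0]


# Hypothetical 64-node job with one bad node.
fleet = [f"node-{i:03d}" for i in range(64)]
BAD_NODE = "node-037"
runs: List[int] = []


def fake_repro(subset: Sequence[str]) -> bool:
    runs.append(len(subset))  # record how many nodes each rerun used
    return BAD_NODE in subset


culprit = reductive_triage(fleet, fake_repro)
print(f"{culprit} isolated after {len(runs)} reruns")
```

The number of reruns grows with the logarithm of the fleet size, which is why bisection is practical even on very large jobs. Intermittent faults break the deterministic-failure assumption and would need repeated trials per half.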

CoreWeave (the largest “neocloud”, a GPU-cloud provider competing with the hyperscalers) has published its own equivalent process:

Day 1 — “Zap.” First, update the firmware on every component (GPU, network card, server-management controller, BIOS). 1-2 hours per server.
24-hour test. Extensive stress tests including GPU burn-in, multi-GPU communication tests (NCCL), and real ML training workloads — designed to flush out any anomaly.
Day 2+. Continuous passive monitoring for NVIDIA error codes (XID errors) and thermal spikes; active health checks during reboots; CoreWeave’s Fleet Lifecycle Controller tracks long-term degradation patterns and “replaces unhealthy nodes before they impact accuracy or throughput.” [Public: CoreWeave Node Lifecycle docs; CoreWeave NLM blog]
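Passive XID monitoring of the kind described in the Day 2+ step usually starts from the kernel log, where the NVIDIA driver writes lines of the form `NVRM: Xid (PCI:<address>): <code>, ...`. CoreWeave's actual implementation is not public, so the sketch below is only an illustration: the sample log lines are synthetic, and the set of codes treated as "drain the node" is a hypothetical policy, not a vendor recommendation.

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple

# Matches the device address and numeric code in an NVIDIA driver Xid log line.
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def count_xids(log_lines: Iterable[str]) -> Counter:
    """Count Xid events per (PCI address, Xid code)."""
    counts: Counter = Counter()
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def devices_to_drain(counts: Counter, drain_codes: Set[int]) -> Set[str]:
    """Devices that logged at least one code from the drain policy."""
    return {device for (device, code) in counts if code in drain_codes}


# Synthetic sample lines, written for this example.
SAMPLE_LOG = [
    "[  101.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.",
    "[  102.1] eth0: link is up",
    "[  250.9] NVRM: Xid (PCI:0000:86:00): 48, pid=4310, sample message",
    "[  251.0] NVRM: Xid (PCI:0000:86:00): 48, pid=4310, sample message",
]
DRAIN_CODES = {79}  # hypothetical policy

xid_counts = count_xids(SAMPLE_LOG)
print(dict(xid_counts))
print(sorted(devices_to_drain(xid_counts, DRAIN_CODES)))
```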

Microsoft Azure’s published approach focuses on cross-vendor standardization — making the lifecycle process work the same way whether the server is from Dell, HPE, or Supermicro. They claim “95% Nodes-in-Service on large fleet sizes” through this standardization. [Public: Microsoft Tech Community on Azure fleet operations]

OCP RAS Requirements v1.7 — the standard that makes this an industry conversation

The Open Compute Project (OCP) is the hyperscaler-led industry consortium where Meta, Google, Microsoft, and now Amazon agree on shared data-center hardware specs so they’re not each negotiating with vendors independently. “RAS” stands for Reliability, Availability, Serviceability — the standard industry shorthand for “everything to do with hardware not breaking.”

On October 23, 2025, OCP published GPU & Accelerator RAS Requirements v1.7 — and the noteworthy thing is the author list: Microsoft, Meta, Google, AMD, NVIDIA, Arm, and Intel jointly wrote it. Companies that normally guard reliability data as competitive intelligence are now writing it down together. [Public: OCP RAS Requirements v1.7]

What’s in it:

Standardized definitions of what GPU “reliability” means — including locked-in definitions of metrics like MTTF (Mean Time To Failure) and AFR (Annualized Failure Rate) that previously varied vendor by vendor.
Specifications for telemetry signals at every level: silicon-internal sensors, PCIe error reporting, memory error correction, system-level fault handling.
Sister documents: Hyperscale CPU RAS Requirements v0.7 (September 2025) covers CPUs; the SDC in AI Whitepaper v1.1 (authored by an NVIDIA engineer) covers silent data corruption specifically. [Public: OCP CPU RAS; OCP SDC-in-AI]
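MTTF and AFR are two views of the same quantity once a failure model is fixed. The sketch below uses the textbook constant-hazard (exponential) model to convert between them. It does not reproduce the OCP document's exact definitions, and the 10,000-GPU fleet size is a hypothetical input.

```python
import math

HOURS_PER_YEAR = 8760.0


def afr_from_mttf(mttf_hours: float) -> float:
    """Annualized failure rate implied by an MTTF, assuming a constant hazard rate."""
    return -math.expm1(-HOURS_PER_YEAR / mttf_hours)


def mttf_from_afr(afr: float) -> float:
    """MTTF in hours implied by an AFR, assuming a constant hazard rate."""
    return -HOURS_PER_YEAR / math.log1p(-afr)


unit_mttf_hours = mttf_from_afr(0.09)  # the ~9% figure used in this brief
FLEET_SIZE = 10_000                    # hypothetical fleet
fleet_hours_between_failures = unit_mttf_hours / FLEET_SIZE

print(f"per-GPU MTTF: {unit_mttf_hours:,.0f} h ({unit_mttf_hours / HOURS_PER_YEAR:.1f} years)")
print(f"fleet of {FLEET_SIZE:,}: one failure every {fleet_hours_between_failures:.1f} h")
```

The per-unit MTTF looks comfortable while the fleet-level interval does not, which is why a shared definition matters when operators and vendors compare numbers.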

The structural read: hyperscalers and chip vendors converged on a shared spec because hyperscalers stopped accepting whatever vendors offered by default. This is a procurement-power signal as much as a technical document — and it’s exactly the substrate any third-party insurance product would build on (a standardized, cross-vendor reliability metric is requirement #2 of Preston’s four-pillar test from §5).

Manufacturing-stage catch (burn-in, factory test)

Before a chip ever reaches a customer, three layers of factory testing try to catch the bad ones:

Burn-in. Run the chip at elevated temperature and voltage for hours to days to force “infant mortality” defects — the chips that would have failed in their first weeks of service — to fail in the factory instead. For complex multi-chip modules like an H100, 3-8% fail burn-in. [Public: Jason Hoffman GPU failure analysis, March 2026]
System-Level Test (SLT). A final functional check on near-finished assemblies. Teradyne and Advantest are the dominant test-equipment vendors. 1-3% fail SLT.
Built-In Self-Test (BIST). The chip tests itself using circuits designed in for that purpose. There are formal IEEE standards (1149.1, 1838 for stacked chips, P3405) that specify how this should work.
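The burn-in and SLT fallout ranges above can be combined into a rough total factory fallout. The sketch assumes the two screens run in sequence, that the SLT percentage applies to units that already passed burn-in, and that the two fallout rates are independent. The sources quoted above do not state which denominator applies, so treat the result as an order-of-magnitude check.

```python
def combined_fallout(burn_in_fail: float, slt_fail: float) -> float:
    """Fraction of units lost across two sequential screens."""
    survive_both = (1.0 - burn_in_fail) * (1.0 - slt_fail)
    return 1.0 - survive_both


low = combined_fallout(burn_in_fail=0.03, slt_fail=0.01)   # best case from the ranges above
high = combined_fallout(burn_in_fail=0.08, slt_fail=0.03)  # worst case from the ranges above

print(f"combined factory fallout: {low:.1%} to {high:.1%}")
```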

The fab-floor reality is anchored by Andrzej Strojwas (PDF Solutions, the industry-leading fab analytics company): PDF’s Exensio monitoring system runs in “every TSMC fab, whether it’s legacy or whether this is the newest, greatest being built.” PDF also makes characterization vehicles — special test chips placed on the same wafer as production chips, generating “thousands to tens of thousands of high-density data points per wafer” on early-life reliability. PDF therefore owns the closest thing to “ground truth” on which chips will be problems before they ever ship. But Strojwas was explicit: “A single leakage would probably mean the end of PDF.” The data exists but is locked behind customer confidentiality, not shareable for any third-party underwriting product. [Interview: Andrzej Strojwas, 2026-05-22]

Convergence — internal and external read the same

There’s a striking gap between what the published telemetry stack could do and what NVIDIA’s own RMA-side operations actually use:

Lonny Orona (NVIDIA reverse logistics): NVIDIA’s detection-to-action pipeline today is a stitched-together collection of unconnected systems — Salesforce for customer tickets, SAP for material planning, Baxter Planning for demand forecasting, Expeditors for shipping. Mission Control and NVSentinel exist as NVIDIA products, but they’re not what Lonny’s frontline support team actually uses as the primary failure signal. [Interview: Lonny Orona, 2026-05-12]
Alex Zhu (NVIDIA): “Today we don’t even have a signal in the system… CMs [contract manufacturers] are waiting on it because we don’t even have a signal in the system. That’s just ridiculous.” [Interview: Alex Zhu, 2026-05-27]
Minseok Kim (semiconductor industry): Even where data exists, it’s the wrong shape. Compliance reporting produces binary “pass/fail” flags. A real reliability twin would need per-lane, per-bump, per-cycle telemetry that no party currently aggregates across vendor lines. [Interview: Minseok Kim, 2026-05-05]

The asymmetry to file away: hyperscalers’ customer-side detection (Meta’s Hardware Sentinel beating dedicated test workloads by 41%) is materially ahead of NVIDIA’s own OEM-side workflow (still spreadsheets at the contract manufacturer). This asymmetry is the central operational finding, developed in §4.

Cross-link: reverse-supply-chain-research-2026-05-13 §3 walks through the full RMA-initiation flow that sits downstream of the detection signal.

§4 — Operations: what happens when a chip fails

The plain-English version of this section: when a GPU fails inside a data center, a long chain of activity gets triggered. The customer (Meta, CoreWeave, etc.) detects the failure through their own monitoring, opens a return-merchandise authorization (RMA) with NVIDIA, and NVIDIA ships out a replacement — usually before the broken one comes back, to minimize downtime. The broken unit goes to one of NVIDIA’s contract manufacturers in Asia (Foxconn, Wistron, Quanta), which strips it down, diagnoses what’s wrong, and either fixes the board around the silicon or — for ~40% of returns — scraps it entirely because the chip itself is dead. The customer pays nothing for any of this. NVIDIA absorbs the cost on its warranty reserve (see §5).

A separate but related question is what happens to chips that don’t outright fail but become uneconomic — i.e., a year-old H100 isn’t broken, but a hyperscaler upgrades to Blackwell anyway. Those go into a growing secondary market for tier-2 and tier-3 data centers — companies that are happy to run older GPUs at lower per-hour rates.

Vocabulary: “repair” mostly doesn’t exist at the chip level

The Lonny + Alex Zhu workflow below uses “repair” because that is NVIDIA’s internal language. Ben Kaczer (IMEC AR2T) pushed back hard during the IMEC visit: “I don’t understand what it means to repair a chip. Once a chip fails, it’s finished. The H100 cannot be repaired once it shows signs of failure.” [Interview: Ben Kaczer, 2026-06-XX]

The reconciliation: when NVIDIA’s CMs (Wistron, Foxconn) “repair” a returned unit, what gets fixed is the board / PCB / cable assembly / power-delivery module around the silicon, plus connector and substrate-level rework. The chip itself — the H100 die or HBM stack — is either functional and reused, or non-functional and scrapped. There is no field-level chip-die “repair.” Cross-source: this is also what Alex Zhu’s “60 of every 100 returned units repairable” number means — the 40 with chip-level damage become full module scrap, not repair candidates.

This brief therefore uses “advanced replacement + second-life grading” in financial-product context (where the workflow being priced is replacement of the device, not repair of the silicon) and preserves “repair” only when describing NVIDIA’s own internal RMA vocabulary.

The OEM-side flow (re-derived from internal anchors)

The most operationally specific account in the vault is the union of Lonny Orona and Alex Zhu. Stepwise:

Detection / RMA initiation. The customer opens a Return Merchandise Authorization (RMA) request through NVIDIA’s Salesforce portal. Per Alex Zhu, “major pain point for enterprise customers — they’re filling out a form by hand.” NVIDIA is building an automated RMA system (QR-code scanning + electronic data interchange) to remove the manual step. [Interview: Alex Zhu]
Triage. Lonny’s frontline team validates the serial number, checks the customer’s warranty status, and confirms eligibility for advance replacement. Whether the customer has a standard or extended warranty determines how fast NVIDIA promises to act (SLA = Service Level Agreement). [Interview: Lonny; Public: NVIDIA Enterprise Support User Guide]
Advance replacement. NVIDIA ships a fresh replacement unit before receiving the broken one. The customer has 10 days to send the broken one back. NVIDIA pays shipping both ways. [Public: NVIDIA Enterprise Support docs]
The hyperscaler approval bottleneck. Inside a hyperscaler, the data-center operations team can’t unilaterally take a rack offline for a swap. The internal business unit running the workload on it (Instagram, Facebook training) has to sign off — and they often prefer to keep limping along with the failing chip rather than accept a service interruption. Result: the replacement chip NVIDIA already shipped sits in a box for weeks or months while the failing one keeps running. [Interview: Lonny]
Return logistics. The standard return route is through the ODM (Original Design Manufacturer) — Quanta or similar — that assembled the rack in the first place. But Lonny’s view is the ODM “isn’t really adding a lot of value” on the return leg. NVIDIA is piloting picking up directly from the hyperscaler, cutting the ODM out of the reverse flow. [Interview: Lonny]
Receive at CM warehouse, repair. Once back at the contract manufacturer (CM — Foxconn, Wistron, etc.), NVIDIA’s spare parts are held two ways: consignment for high-value parts (the chip and board — NVIDIA still owns them while they sit at the CM), and turnkey for low-value parts (cables, third-party items — the CM owns them). Each CM tracks repair-line work on what Alex calls “suspect sheets — 1990s-style spreadsheets.” NVIDIA is building software APIs to replace this. [Interview: Alex Zhu]
Repair vs. replace decision. Lonny: “We give them the playbook — here’s how you’re going to diagnose it, repair it, acceptance criteria.” The economics: NVIDIA can salvage only about 60 of every 100 returned units. The remaining 40 get replaced from new inventory — meaning NVIDIA effectively gives the customer a brand-new chip and absorbs the cost. CoWoS-bonded HBM stacks (see §1) are typically not salvageable; the whole multi-chip module gets scrapped. [Interview: Alex Zhu]
All repairs are free to the customer. Alex: “I believe all of these — NVIDIA, we fix them for free, so there’s no charge for it.” Roughly 90% of these repairs aren’t ad-hoc fixes but remanufacturing — applying engineering change orders (ECOs) or full product recalls, often on multiple units at once. [Interview: Alex Zhu]
Repaired units feed back into the spare pool. Baxter Planning (NVIDIA’s incumbent supply-chain planning vendor) updates the demand forecast. NVIDIA is paying SAP over $2M to automate the planning layer end-to-end. [Interview: Alex Zhu]

Cross-link: reverse-supply-chain-research-2026-05-13 §3 has the full step-by-step with timing estimates; this brief surfaces the operational rhythm. The TAM/SAM sizing for a unified platform sits in reverse-logistics-warranty-tam-2026-05-29 §5.

Customer-side workflow at the hyperscaler / neocloud

What happens before the customer ever opens an RMA with NVIDIA? Three published patterns:

Meta. Detection via the three-tool stack (Fleetscanner, Ripple, Hardware Sentinel from §3 above) → reductive triage (binary-search to find the failing component) → save the training-job state to disk frequently so failures don’t lose too much work → move the workload to a healthy GPU → physically swap the broken FRU (field-replaceable unit — typically the GPU board) during a scheduled maintenance window. [Public: Meta engineering blog 2025-07]
CoreWeave. Passive health checks → active diagnostics → the Node Lifecycle Controller cordons off the bad server (so no new workloads get scheduled on it) → the Fleet Lifecycle Controller decides whether to attempt repair, swap, or quarantine. [Public: CoreWeave NLM docs]
Microsoft Azure. Microsoft’s emphasis is standardization across vendors — getting Dell, HPE, and Supermicro servers to behave the same way through industry-standard management interfaces (Redfish, PLDM). They report 95% of nodes in service at any time on large fleets. [Public: Microsoft Azure infra blog]

The common shape: hyperscalers handle detection, triage, and decommissioning themselves before NVIDIA ever sees an RMA. NVIDIA’s RMA process kicks in only after the customer has already isolated the fault. The customer-side workflow is software-automated; the OEM-side workflow (per Lonny and Alex) is still spreadsheet-driven. That asymmetry — sophisticated detection at the customer, manual triage at NVIDIA’s contract manufacturers — is probably the most operationally important finding in the brief, and the most obvious target for the kind of orchestration layer NVIDIA’s Dallas repair-line buildout is trying to address.

Secondary market handoff

The IMEC visit gave the strongest single primary-source confirmation that data-center GPUs are the one chip case where circularity economics work. Cedric Rolin (Program Director, IMEC SSTS) was unambiguous: industry-wide chip recycling is ~0% (volumes too big, materials nanometer-scale, recyclers crush-and-burn for packaging gold). “The exception is data center GPUs. Concentrated, documented, high-value — $100k+ racks. Probably their better case for reduce / refurbish / repair.” [Interview: Cedric Rolin, IMEC SSTS, 2026-06-XX]

Cedric and Jeroen Van den Bosch independently described an emerging tier-2 / tier-3 data-center secondhand market: “There’s also a market for secondhand GPUs in tier-two and tier-three data centers. Before they trash all this actually-functional material, they need to test it and put it on the secondhand market.” [Interview: Cedric Rolin] Olivier Rousseaux (IMEC Ventures) cited an unnamed Swedish professor whose ~2024 VOC interview found Amazon and others rotating racks on a 3-year accounting depreciation schedule, with removed equipment increasingly being bought by tier-2 secondary-market operators. [Interview: Olivier Rousseaux]

Jeroen added an important caveat that bounds the secondary-market thesis: secondhand DC infrastructure is only a viable business model if technology cycles slow. “If your secondhand technology is two times as slow and consumes two times as much energy… per compute the euro you need to charge, you’re not competitive.” [Interview: Jeroen Van den Bosch] The secondary-market thesis therefore depends on either (a) AI-accelerator generations slowing, or (b) the tier-2/tier-3 customer’s compute-quality requirements being structurally tolerant of older silicon.
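Jeroen's caveat can be written as a break-even calculation. The sketch below reads "two times as slow and consumes two times as much energy" as half the throughput and twice the energy per unit of compute, which is one possible interpretation of the quote. Every number is hypothetical and in arbitrary currency units; the point is the shape of the constraint, not the values.

```python
def cost_per_compute_unit(capital_cost_per_hour: float,
                          throughput_units_per_hour: float,
                          energy_kwh_per_unit: float,
                          energy_price_per_kwh: float) -> float:
    """Capital plus energy cost for one unit of compute."""
    return (capital_cost_per_hour / throughput_units_per_hour
            + energy_kwh_per_unit * energy_price_per_kwh)


def break_even_capital_cost(target_cost_per_unit: float,
                            throughput_units_per_hour: float,
                            energy_kwh_per_unit: float,
                            energy_price_per_kwh: float) -> float:
    """Hourly capital cost at which older hardware matches the target cost per unit."""
    energy_cost = energy_kwh_per_unit * energy_price_per_kwh
    return throughput_units_per_hour * (target_cost_per_unit - energy_cost)


# Hypothetical new-generation baseline.
ENERGY_PRICE = 0.10
new_cost = cost_per_compute_unit(2.00, 1.0, 1.0, ENERGY_PRICE)

# Secondhand unit: half the throughput, twice the energy per unit of compute.
used_break_even = break_even_capital_cost(new_cost, 0.5, 2.0, ENERGY_PRICE)

print(f"new hardware cost per unit: {new_cost:.2f}")
print(f"secondhand break-even capital cost per hour: {used_break_even:.2f}")
```

Under these assumed inputs the secondhand unit has to be acquired at well under half the hourly capital cost of new hardware just to tie. If the energy price rises far enough, the break-even turns negative and no resale price makes the older silicon competitive.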

Lizzie (IMEC SSTS LCA researcher) added a structural friction worth surfacing: “They buy a certain memory but they don’t know exactly what’s in that chip. Sometimes they ask us: can you identify which chip is actually inside this memory I bought.” [Interview: Lizzie, IMEC SSTS] Even the buyers of finished memory products often don’t know what die is inside the package — which constrains any reliability assessment built on chip provenance. This converges with Minseok Kim’s point that compliance-style binary data is insufficient for a real reliability twin.

When a unit either ages out of warranty or simply becomes uneconomical to keep running, it can go to one of three places:

OEM-certified refurbished. Dell sells “Recertified” servers, HPE has “Renew,” and there’s a similar NVIDIA-channel program. Typical price: 30-40% below new, with a 1-2-year warranty included. Covered in detail in reverse-logistics-warranty-tam-2026-05-29 §6.
Specialist refurbishers / ITAD. ITAD = IT Asset Disposition — the industry that takes old hardware and resells, refurbishes, recycles, or destroys it. Iron Mountain’s ITAD-equivalent business unit (Asset Lifecycle Management) is the public benchmark: $153M revenue in Q2 2025 (+70% year-over-year), $232M in Q1 2026 (+92% YoY). Expected full-year 2025 revenue $575-600M, of which ~40% is data-center decommissioning specifically. [Public: Resource Recycling Iron Mountain Q2 2025; Iron Mountain Q3 2025 8-K]
Used-GPU brokers. Companies like ALTA Technologies; pricing tracked by HashrateIndex (originally a crypto-mining-economics site, now also covering AI GPUs). Reseller numbers say an H100 retains 75-85% of its new value through 24 months — surprisingly high. But Princeton’s Center for IT Policy (December 2025) argues the secondary market is too thin to absorb supply at scale, so the reseller numbers may be misleading. [Public: HashrateIndex used GPU pricing; Princeton CITP secondary market note] This contradiction is also flagged in reverse-logistics-warranty-tam-2026-05-29 §6.

A subtle but important problem: warranty doesn’t follow the chip when it’s resold. NVIDIA’s consumer-grade GPU warranty is explicitly voided if the card ends up in a data center. NVIDIA’s enterprise (DGX) warranties are negotiated per contract and typically don’t transfer to a new owner. So a secondhand buyer carries the reliability risk alone, with no recourse to NVIDIA. This is one structural reason resale prices drop even when the unit is functionally fine, and one possible market opening for a third-party warranty product (the “Carfax for GPUs” idea from the IMEC visit debrief, where an independent operator could underwrite second-life risk).

Edge / industrial reverse chain

The reverse chain for edge and automotive chips is structurally different from the data-center side for three reasons: (a) failure rates are designed to be near-zero (the automotive industry’s AEC-Q100 qualification standard treats any in-service failure as a serious quality event); (b) volumes per buyer are smaller; (c) when a chip does fail in a car, the contract chain pushes liability back through the Tier-1 integrator (Bosch, Continental, Aptiv) to the chip vendor, not through a consumer-facing RMA.

Two operational facts from Sean’s May 6 conversation (Sean works on the authorized-distributor side — Arrow, Avnet, Future):

Distributors hold the inventory now. After the COVID-era shortages, OEMs and Tier 1s prefer to push inventory holding onto distributors for demand flexibility. Distributors compete on 365-day payment terms instead of the old 30-day standard. [Interview: Sean, 2026-05-06]
Automotive warranty flows up the supply chain. The Tier-1 system integrator takes the warranty hit at the system level; component-level claims then flow back through the distributor up to the chip designer or IDM (Integrated Device Manufacturer — a company that designs and fabs its own chips, like Texas Instruments or Renesas). Tier-1 margins are “razor-thin, some negative in recent years.” [Interview: Sean]

Holly Rawlins (Renesas, industrial/automotive chip designer) adds the chipmaker-side view. Renesas runs all its order management through SAP (the dominant enterprise resource planning software — “the gravity well” in Holly’s words). Compliance pressure flows downward from the auto OEM through the Tier 1 to Renesas. The reverse-logistics burden at Renesas is fundamentally different from NVIDIA’s because (a) failure rates are AEC-Q100 ultra-low to start with, (b) franchise distribution carries some of the customer-facing risk, and (c) end customers don’t expect overnight advance replacement on a car. [Interview: Holly Rawlins, 2026-04-29]

Public anchors:

The AEC-Q100/Q200/Q004 automotive qualification framework — the zero-defect standard. [Public: SemiEngineering, “Drive Toward Zero Defects”]
Synopsys’s SLM (Silicon Lifecycle Management) PVT monitor IP, which is qualified to ASIL-B (a functional-safety standard for automotive) and AEC-Q100 Grade 2. The implication: on-die monitors are how edge silicon proves field reliability without the customer-side fleet telemetry that hyperscalers have. [Public: Synopsys SLM]

§5 — Financing chip-failure risk: warranty reserves, reinsurance, and third-party warranty

The plain-English version of this section: when chips fail at scale, someone pays for it. Today, almost all of that cost lives on NVIDIA’s balance sheet as a “warranty reserve” — money NVIDIA sets aside to pay for future repairs and replacements. NVIDIA’s reserve grew from $82M to about $2.8 billion in three years, almost entirely because of AI GPU failures. AMD’s reserve is on the same trajectory, one cycle behind. The component-IDM vendors (Broadcom, Marvell, Intel) and the system OEMs (Dell, HPE) disclose almost no warranty pressure — for reasons that say something structural about who actually owns the failure risk.

What doesn’t exist yet: a clean way to transfer that warranty risk to an insurer or reinsurer. There are partial precedents (Munich Re’s product for AI model performance, the TWAICE-Munich Re-Hithium deal for batteries), and an adjacent market — third-party data-center maintenance (Park Place, Service Express) — that lets customers shift their post-warranty risk without transferring it off NVIDIA’s books. But a purpose-built warranty-reinsurance product for compute hardware does not visibly exist today.

Cross-vendor warranty disclosure: who carries the burden?

reverse-logistics-warranty-tam-2026-05-29 §2-§4 has the deepest cross-vendor dive. The headline finding survives here: the warranty burden is concentrated almost entirely at NVIDIA, with AMD one cycle behind. System-OEMs (Dell, HPE, SMCI) are flat or declining. Component-IDM and custom-ASIC vendors disclose essentially nothing.

Quick note for non-finance readers on what the table shows. Companies that sell physical products are required by US accounting rules to reserve money on their balance sheets for future warranty claims. Each year they show a “rollforward”: the starting reserve, plus money added for new product warranties (accruals), minus money paid out for repairs (claims paid), equals the ending reserve. A fast-growing ending reserve therefore means the company expects to pay much more in future warranty claims than it used to.
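The rollforward identity can be checked directly against the NVIDIA rows in the comparative table in this section. The sketch below uses those table values ($M) and the simple three-term identity described above. Actual filings can carry additional adjustment lines, so an exact tie is not guaranteed for every company.

```python
def roll_forward(opening: float, accruals: float, claims_paid: float) -> float:
    """Ending reserve = opening reserve + accruals - claims paid."""
    return opening + accruals - claims_paid


def yoy_growth(current: float, prior: float) -> float:
    """Year-over-year growth as a fraction."""
    return current / prior - 1.0


# (fiscal year, opening reserve, accruals, claims paid, reported ending reserve), in $M.
# Each opening reserve is the prior year's ending reserve from the table.
NVIDIA_ROWS = [
    ("FY24", 82, 278, 54, 306),
    ("FY25", 306, 1203, 219, 1290),
    ("FY26", 1290, 2474, 957, 2807),
]

for fiscal_year, opening, accruals, claims, reported_end in NVIDIA_ROWS:
    implied_end = roll_forward(opening, accruals, claims)
    status = "ties" if implied_end == reported_end else "does not tie"
    print(f"{fiscal_year}: implied {implied_end}, reported {reported_end} ({status})")

reserve_growth = yoy_growth(2807, 1290)
claims_growth = yoy_growth(957, 219)
print(f"FY26 reserve growth: {reserve_growth:.0%}; claims growth: {claims_growth:.0%}")
```

All three years tie, and the growth figures match the +118% and +337% quoted in the table notes.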

Comparative warranty disclosure table ($M; cross-checked against SEC 10-K filings)

Firm	Latest FY	Reserve (end)	Claims paid	Accruals	Segment driver / notes
NVIDIA	FY26 (1/25/26)	$2,807	$957	$2,474	Compute & Networking (data-center), explicit in 10-K. Reserve +118% YoY; claims +337% YoY [Public: NVIDIA FY26 10-K accn 0001045810-26-000021]
NVIDIA	FY25	$1,290	$219	$1,203
NVIDIA	FY24	$306	$54	$278
NVIDIA	FY23	$82	$109	$145
AMD	FY25 (12/27/25)	$308	$238	$358	Not segmented; ~1.03% accrual rate. Reserve +64% YoY; claims +116% YoY [Public: AMD FY25 10-K accn 0000002488-26-000018]
AMD	FY24	$188	$110	$213
Broadcom	FY25	—	—	—	No product-warranty rollforward disclosed [Public: Broadcom FY25 10-K accn 0001730168-25-000121]. Component / IP model pushes warranty to the system OEM.
Marvell	FY25	—	—	—	Fragmentary; no clean rollforward [Public: Marvell FY25 10-K, CIK 1835632].
Intel	FY25 (12/27/25)	—	—	—	No product-warranty XBRL concept filed [Public: Intel SEC company facts CIK 50863]. Server-CPU warranty burden is near zero. Aligns with Puget Systems’ near-zero Xeon W failure data.
Qualcomm	FY25 (9/28/25)	—	—	—	No standalone rollforward; warranty buried in accrued liabilities (typical for modem / mobile-IP vendors that ship through OEMs).
Texas Instruments	FY25	—	—	—	TI’s 10-K language flags warranty risk but no rollforward. [Public: TI 2025 annual report] Industrial/auto model — warranty primarily a Tier-1 problem.
Analog Devices	FY25	—	—	—	Same structural shape as TI; component-level warranty diluted across thousands of SKUs.
Microchip	FY25	—	—	—	Same.
Dell	FY26 (1/30/26)	$450	$952	—	Whole-system; not AI-attributed. Flat despite AI-server boom [Public: Dell FY26 10-K accn 0001571996-26-000008].
Dell	FY25	$424	$884	—
HPE	FY25 (10/31/25)	$284	—	—	Declining from $318M (FY23) despite server-business growth [Public: HPE FY25 10-K accn 0001645590-25-000130].
Supermicro	FY25 (6/30/25)	$17.0	—	$59.2	The AI-server pure-play; ~0.2-0.3% accrual rate. Tiny. [Public: SMCI FY25 10-K accn 0001375365-25-000027]
Cisco	FY25 (7/26/25)	Disclosed; small relative to revenue	—	—	Networking-equipment warranty, generally low; not data-center-AI-attributed. Lifecycle warranty terms vary by product. [Public: Cisco warranty docs]
Arista	FY25	Small	—	—	Same pattern as Cisco — networking gear at modest accrual rate.
Foxconn / Quanta / Wistron	—	—	—	—	TIFRS / IAS 37 “provisions”; no SEC warranty disclosure. Disclosure gap.
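
The rollforward identity described above (beginning reserve + accruals - claims paid = ending reserve) can be checked against the NVIDIA and AMD rows of the table. The sketch below is a minimal illustration, not part of any filing: the helper names are hypothetical, and it assumes each year's beginning reserve equals the prior year's reported ending reserve, with no other adjustments.

```python
# Warranty rollforward check: ending = beginning + accruals - claims_paid.
# Figures ($M) are copied from the comparative table; helper names are illustrative.
# Assumption: beginning reserve = prior-year ending reserve, no other adjustments.


def ending_reserve(beginning, accruals, claims_paid):
    """Roll a warranty reserve forward by one fiscal year."""
    return beginning + accruals - claims_paid


def yoy_pct(current, prior):
    """Year-over-year change, in percent."""
    return (current / prior - 1.0) * 100.0


# (firm, fiscal year) -> (beginning, accruals, claims paid, reported ending)
ROLLFORWARDS = {
    ("NVIDIA", "FY24"): (82, 278, 54, 306),
    ("NVIDIA", "FY25"): (306, 1203, 219, 1290),
    ("NVIDIA", "FY26"): (1290, 2474, 957, 2807),
    ("AMD", "FY25"): (188, 358, 238, 308),
}


def check_rollforwards(rows=ROLLFORWARDS):
    """Return {key: bool}: does each row tie to its reported ending reserve?"""
    return {
        key: ending_reserve(beg, acc, paid) == end
        for key, (beg, acc, paid, end) in rows.items()
    }


if __name__ == "__main__":
    for key, ties in check_rollforwards().items():
        print(key, "ties" if ties else "does not tie")
    print(f"NVIDIA reserve YoY: {yoy_pct(2807, 1290):+.0f}%")
    print(f"NVIDIA claims YoY:  {yoy_pct(957, 219):+.0f}%")
    print(f"AMD reserve YoY:    {yoy_pct(308, 188):+.0f}%")
    print(f"AMD claims YoY:     {yoy_pct(238, 110):+.0f}%")
```

All four rows tie under that assumption, and the computed year-over-year changes match the percentages quoted in the notes column.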

Industry aggregate (where NVIDIA sits in the pool)

WarrantyWeek’s 23rd Annual Product Warranty Report (2026-04-16):

2025 average U.S. product warranty claims rate: 1.30%; average accrual rate 1.43%.
Collectively, $30B+ paid in product warranty claims, $33B set aside in accruals, $72B held in reserves across all U.S. industries. [Public: WarrantyWeek 23rd Annual Report]
Semiconductor & PCB industry specifically accrued $1.743B in 2024, “just about double 2023’s total of $878M. This increase is almost entirely explained by GPU manufacturer Nvidia.” [Public: WarrantyWeek 23rd Annual Report]
NVIDIA 2024 → 2025: claims paid +1000%, accruals +173%, reserve balance +218%. AMD 2024 → 2025: claims +116%, accruals +68%, reserves +76%. [Public: WarrantyWeek 23rd Annual Report]
At end of 2024, the entire U.S. semiconductor industry reserve was $1.691B — a 58% YoY increase, almost entirely NVIDIA-driven. [Public: WarrantyWeek 23rd Annual Report]

The structural read: NVIDIA is roughly 74% of the U.S. semiconductor warranty book on the corrected ~$2.81B reserve, and ~80%+ if you sum NVIDIA + AMD against the industry aggregate. This is also where reverse-logistics-warranty-tam-2026-05-29 §4 landed. The brief carries that finding forward without re-litigating it.
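
The denominator behind the ~74% share is not stated here, and it cannot be the end-2024 industry figure of $1.691B, which NVIDIA's FY26 reserve alone exceeds. As a back-of-envelope consistency check only, the sketch below backs out the aggregate implied by the 74% figure and tests whether NVIDIA + AMD against that same aggregate lands near the quoted "~80%+". The function names are hypothetical, and the shared-denominator premise is an assumption, not something the source confirms.

```python
# Back-of-envelope check on the quoted shares of the U.S. semiconductor warranty book.
# Assumption (not stated in the brief): both shares use the same industry aggregate.

NVIDIA_RESERVE_M = 2807  # FY26 ending reserve, $M, from the 10-K table
AMD_RESERVE_M = 308      # FY25 ending reserve, $M, from the 10-K table


def implied_total(part, share):
    """Aggregate implied by 'part is `share` of the total'."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return part / share


def share_of(part, total):
    """Fraction of `total` represented by `part`."""
    return part / total


implied_aggregate_m = implied_total(NVIDIA_RESERVE_M, 0.74)
combined_share = share_of(NVIDIA_RESERVE_M + AMD_RESERVE_M, implied_aggregate_m)

if __name__ == "__main__":
    print(f"Implied industry aggregate: ${implied_aggregate_m:,.0f}M")
    print(f"NVIDIA + AMD share of that aggregate: {combined_share:.1%}")
```

Under that assumption the implied aggregate is roughly $3.8B and the combined share is about 82%, which is consistent with the "~80%+" figure.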

Warranty reinsurance for compute hardware

The structural question: who, if anyone, is reinsuring NVIDIA’s $2.8B reserve — or AMD’s $308M, or HPE’s $284M? The visible public answer is “no one, in any named compute-hardware deal.” A reminder of why this matters: reinsurance is what insurance companies buy to share large risks. If NVIDIA could lay off some of its warranty exposure to Munich Re or Swiss Re, NVIDIA’s balance sheet would be steadier and there’d be a natural opening for a TBD-style product to mediate. But three adjacent precedents matter, because they sketch the shape of a deal that hasn’t been done yet for chips:

Munich Re aiSure (AI model performance warranty). A 2024 Munich Re product that covers AI model owners against losses if their model fails to perform as advertised — described as “parametric-like structure allowing claims to be settled quickly based on measurable performance data.” Mosaic Insurance is the front-end MGA; Munich Re is the reinsurer. Maximum initial coverage ~$15M per deal. What it covers is wrong-model-output risk — not silicon failure. Adjacent but distinct. [Public: Munich Re aiSure product page; Mosaic + Munich Re partnership]
Munich Re + TWAICE + Hithium (battery performance warranty). This is the closest structural analog to what we’re imagining for chips. Hithium is a Chinese energy-storage battery manufacturer. They wanted to offer a long-term performance warranty to their customers, but didn’t want the risk on their own balance sheet. So they partnered with TWAICE (a German battery-analytics company that captures deep telemetry from deployed batteries) and Munich Re. The deal: Munich Re reinsures a 15-year performance warranty, using TWAICE’s telemetry as the underwriting data; the warranty covers repair, maintenance, and downtime; the customer is also protected if Hithium goes bankrupt. The structure is sensor data → independent analytics provider → reinsurer. [Public: Munich Re + Hithium press 2023-10-26; TWAICE Munich Re partnership page] This is the exact shape Max Mirgoli proposed unprompted on May 22 (“the potential to reinsure that warranty risk”) and Preston Wilson’s parametric specialist gestured at on May 22 (“device-based parametric trigger concept”).
General extended-warranty market. Companies like Assurant (publicly traded), Asurion (private), Allstate, AIG, and AXA dominate consumer and small-business extended warranties (the contract you sign at Best Buy when you buy a TV). The industry is roughly $147-150B in 2025 and growing ~8.5% per year. But we couldn’t find a single publicly disclosed B2B compute-hardware product line; these firms cover phones, appliances, and cars, not data-center GPUs. [Public: Mordor / Grand View 2025; Assurant 10-K]

Third-party data-center maintenance — the real “outside NVIDIA” market

The form of “warranty insurance for customers” that already works at scale is third-party maintenance (TPM) — companies like Park Place Technologies, Service Express, Curvature, Evernex, and Procurri. The deal: a customer with hardware out of OEM warranty buys a fixed-price service contract from a third party (not from Dell or HPE) that covers any repairs needed going forward. The customer is betting their actual repair cost over the contract period will be less than the OEM’s renewal price plus the third party’s margin.

Market sizing:

$12.5B in 2024 growing to $20.3B by 2033 (~6.9% per year). [Public: Verified Market Reports / OpenPR]
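Park Place ~10% share; Service Express ~5% share. After the Park Place + Service Express merger, the combined entity is reported to hold roughly 28% of global TPM, a figure that does not reconcile with the ~15% sum of the standalone shares and likely rests on a different market definition. [Public: Verified Market Research TPM blog 2026]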
Park Place stocks $200M+ of replacement parts across 150+ OEM brands. [Public: Verified Market Research]
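
The market-sizing endpoints above can be sanity-checked with the standard compound-annual-growth formula. The sketch below is illustrative only; the function names are hypothetical. Taken alone, the two endpoints imply a somewhat lower growth rate than the quoted ~6.9%, which suggests the source report applies its CAGR to a shorter forecast window or a different base year. The brief does not say which.

```python
import math

# Sanity check on the TPM market-sizing figures ($B) quoted in the list above.


def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def years_to_grow(start_value, end_value, rate):
    """Years needed to grow from start to end at a constant annual rate."""
    return math.log(end_value / start_value) / math.log(1.0 + rate)


implied_rate = cagr(12.5, 20.3, 2033 - 2024)
implied_window = years_to_grow(12.5, 20.3, 0.069)

if __name__ == "__main__":
    print(f"CAGR implied by the 2024 and 2033 endpoints: {implied_rate:.1%}")
    print(f"Years needed at 6.9% per year: {implied_window:.1f}")
```

The endpoints imply roughly 5.5% per year over nine years; at 6.9% per year the same growth would take a little over seven years.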

Important caveat for the strategic framing: TPM does not transfer balance-sheet warranty risk away from NVIDIA, AMD, Dell, or HPE. It transfers post-warranty operating cost away from the customer. The two are separate trades:

A customer-side product (TPM) is what already exists — and consolidates fast.
A vendor-side product (warranty reinsurance) is what’s missing — NVIDIA still carries the $2.8B reserve on its own books.

A TBD-style product could theoretically operate on either side; the financial structure looks very different depending on which.

The trusted-third-party-telemetry requirement (load-bearing for any chip-side product)

The IMEC visit produced the clearest single articulation of the structural requirement that has to be solved before any of the §5/§6 products can be built. During the lab tour, Bliss raised the point that insurance buyers won’t trust telemetry data that NVIDIA controls — the underwriting signal has to come from somewhere NVIDIA can’t influence. Ben Kaczer (IMEC’s reliability physicist) agreed unequivocally. [Interview: Ben Kaczer + Bliss, 2026-06-XX]

This is the same structural requirement Preston Wilson identified from the insurance-broker side as one of his “four pillars” for a parametric product: a trusted third-party measurement agent. Now we have independent confirmation from a reliability physicist. Combined with the tamper-aware odometer discussion in §3, it sketches what the underlying data layer for a chip-failure insurance product would need:

1. Aging-monitor data captured on the chip itself. This already exists: proteanTecs sells it as commercial IP, and Ben confirmed NVIDIA-class chips already have these monitors inside them.
2. Independently attestable. Someone other than NVIDIA has to be able to verify the readings haven’t been tampered with. This is what the IMEC tamper-aware-odometer research targets, but there’s no commercial product yet.
3. A standardized metric. What does “GPU aging” actually mean numerically? OCP RAS Requirements v1.7 (the cross-vendor spec from §3) covers part of this.
4. Accessible to non-OEM underwriters. Even if (1) through (3) are solved technically, NVIDIA has to be willing to let an outside insurer see the data. This is the political problem, not the technical one.

None of the four is fully productized today. The closest precedent — the TWAICE → Munich Re → Hithium battery deal above — got past it because TWAICE was an independent battery-analytics provider, embedded by contract in the customer’s battery management system, sending data to Munich Re directly. The chip-side equivalent would be a third-party telemetry vendor (or consortium of hyperscalers, à la OCP) with the same independence posture.

Convergence — internal and external read the same shape

Max Mirgoli (May 22) independently raised “the potential to reinsure [NVIDIA’s] warranty risk” without prompting.
Preston Wilson (May 7 + May 22 + May 22b) named GPU failure as a structural gap in the parametric four-pillar test (no trusted third-party measurement agent, no agreed metric, no actuarial model tied to historical loss data, no enabling sensor infrastructure) and proposed an ILS product with a parametric trigger as a structural unlock.
Ronit Jain (May 22) is selling GPU price-depreciation insurance via Pluto ($60M of H200 coverage sold on 2-3 yr windows) — adjacent but different risk class (price depreciation, not failure rate).
Andrzej Strojwas (May 22) provides the sensor-data analog for the underwriting layer (Exensio, characterization vehicles) — though PDF’s data is not shareable.
Ben Kaczer + Cedric Rolin (IMEC, 2026-06-XX) independently converged on three structural points: chip-level repair doesn’t exist (only board-level), data-center GPUs are the one chip-circularity case where economics work, and any insurance product requires a tamper-resistant third-party telemetry signal. [Interview: IMEC visit, 2026-06-XX]

For the full math behind why insurance gets bought corporately and where Wedge 2 (warranty transfer) and Wedge 3 (fab/SC insurance) sit, see insurance-market-overview-2026-06-15 §1-§3 and berk-independent-study-report-2026-06-09 §6.

Divergence to flag

Munich Re aiSure is described both as performance-warranty insurance (Munich Re language) and as a parametric-like product (Munich Re language). It is structurally closer to warranty than to traditional parametric, because it indemnifies against measurable performance shortfalls rather than codified physical triggers. That ambiguity is consequential when comparing it to Preston’s “ILS with parametric trigger” frame.
WarrantyWeek’s $8.22B vs. NVIDIA 10-K’s $2.81B — resolved in reverse-logistics-warranty-tam-2026-05-29 §4 in favor of the 10-K. Brief uses $2.81B.

§6 — Competitive landscape: monitoring, warranty/insurance, and where they meet

Three categories, then the white space where they touch.

Category A — Real-time monitoring / predictive failure analytics

Vendor	Category	What they do	Public status
NVIDIA (internal product)	Stack OEM	DCGM, NVML, NVSM, NVSentinel (open-source on K8s), Mission Control (GA), Fleet Intelligence (managed service)	All NVIDIA-side, increasingly bundled with AI Enterprise subscription [Public: NVIDIA developer blog]
proteanTecs	Standalone	In-chip monitors (Proteus) + deep-data analytics; HBM reliability; embedded in production silicon via GUC	Private; 7nm GUC HBM controller deployment; 3nm 8.8 GT/s HBM3 PHY visibility [Public: proteanTecs HBM whitepaper]
Synopsys SLM (IP + software)	EDA-bundled	On-die PVT monitor IP + lifecycle software; ASIL-B Ready, AEC-Q100 Grade 2	Embedded in Synopsys EDA stack [Public: Synopsys SLM article]
Cadence	EDA-bundled	DDR / HBM analytics, Tempus power-integrity sign-off — less standalone SLM product than Synopsys	Same shape as Synopsys
PDF Solutions	Standalone	Exensio FDC (TSMC-wide), characterization vehicles, Symmetrics equipment connectivity (300+ clients), Securewise remote engineer access, blockchain traceability + DAX for OSAT-to-foundry data exchange, SAP Sapiens Manufacturing Hub	Public; 900 employees; embedded in every TSMC fab [Interview: Andrzej Strojwas, 2026-05-22]
Datadog / ServiceNow ITOM / Splunk	Observability	Generic infra observability; ServiceNow ITOM integrates with Datadog for AI-fleet ticketing	Horizontal; do not natively model GPU XID errors, but integrate with DCGM via custom checks [Public: Datadog ServiceNow integration]
Rafay	Standalone	Kubernetes platform that has integrated NVSentinel for managed GPU health	Private; recent NVSentinel integration [Public: Rafay NVSentinel blog]
Internal hyperscaler tools	In-house	Meta Fleetscanner/Ripple/Hardware Sentinel; Google internal; CoreWeave Node Lifecycle Controller / Fleet Lifecycle Controller	Published but not productized [Public: Meta engineering blog; CoreWeave docs]
Test equipment OEMs	Manufacturing	Teradyne, Advantest, KLA — SLT and burn-in equipment + analytics	Public; deeply embedded upstream
IMEC AR2T (research-org, pre-commercial)	Research IP	Aging-monitor and tamper-aware-odometer IP; system-level reliability program (atomic → defect → SPICE → system); active partner pitches to NVIDIA / Meta on SLR	Public research org; €1.2B budget, 6,500 staff; commercial path via spin-offs / iMec Ventures [Interview: Ben Kaczer + Annelise, 2026-06-XX]
IC Link	Specialist warranty / distribution	Provides a warranty offering for ASIC distribution / IP — exact terms unclear; flagged by Cedric Rolin as worth investigating	Private; warranty structure not publicly documented [Interview: Cedric Rolin, 2026-06-XX]

Category B — Specialist warranty/insurance providers for compute hardware

Vendor	Coverage	What they do	Status
Park Place Technologies + Service Express	Post-warranty DC maintenance	Combined ~28% global TPM share after integration; 150+ OEM brands; $200M+ parts inventory	Private; PE-backed; the TPM consolidator [Public: Verified Market Research]
Curvature, Evernex, Procurri	Post-warranty DC maintenance	Other top-5 TPM players	Private / public hybrid
Munich Re aiSure (incl. Mosaic partnership)	AI model performance	Performance warranty insurance for AI model outputs (not silicon failure); parametric-like settlement	Live, expanding; up to $15M initial coverage [Public: Munich Re; Reinsurance News on Mosaic deal]
Munich Re / TWAICE / Hithium	Battery performance warranty (analog)	15-year reinsured performance warranty for Li-ion ESS, TWAICE deep-data analytics as underwriting input	Live since 2023; the strongest structural analog for a chip-side product [Public: Munich Re Hithium press]
Assurant B2B / Asurion	Consumer + B2B extended warranty	Multi-line warranty administrators; primarily consumer / mobile; no public B2B compute-hardware product line we could verify	Public (Assurant) / private (Asurion) [Public: Assurant 2025 annual report]
Lloyd’s syndicates (Beazley parametric)	Property / contingency parametric	Parametric cover for technology / contingency events; no named compute-hardware warranty product	Live; structural template for parametric-on-chip risk [Public: Beazley parametric page]
Parametrix	DC SLA parametric	Quantifies and transfers tech-downtime risk; Lloyd’s Coverholder	Live; closest existing parametric product to the data-center risk space [Public: Parametrix]
Pluto (Ronit Jain)	GPU price-depreciation	CFTC-designated derivatives exchange + clearinghouse; $60M H200 depreciation coverage sold via swap structure	Pre-launch summer 2026 [Interview: Ronit Jain, 2026-05-22]
Coalition	Cyber MGA (structural analog)	$5B valuation; Aspen Specialty capacity (2024); data-rich MGA template	Live; the structural template Preston named [Public: CB Insights on Coalition]

Category C — Partnerships between telemetry and insurance (the white space)

The TWAICE / Munich Re / Hithium structure remains the only public precedent we have found that puts sensor-driven analytics underneath a warranty-reinsurance product for a hardware class. No equivalent exists for compute / data-center / AI accelerators in public material.

What would the equivalent look like?

Telemetry side: NVIDIA DCGM + NVSentinel + Fleet Intelligence; or proteanTecs Proteus on-die data; or OCP RAS v1.7 standardized metrics.
Underwriting side: Munich Re / Swiss Re specialty or a Coalition-style MGA; or a Lloyd’s syndicate writing a novel risk class; or the ILS/parametric structure Preston gestured at.
Trigger: parametric on per-fleet failure rate (e.g., >X% AFR triggers payment) or on specific physical signals (HBM PHY error counts, NVLink lane retraining rate, sustained temperature excursion).
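
To make the per-fleet trigger concrete, here is a minimal sketch of a parametric payout keyed to annualized failure rate (AFR). Everything in it is hypothetical: the class and function names, the 9% trigger, the payout per percentage point, and the limit are illustrative assumptions, not terms from any existing product. AFR is computed as failures per device-year of powered-on time, which is one simple convention among several.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760.0


@dataclass(frozen=True)
class FleetWindow:
    """Observed fleet exposure for one measurement window (hypothetical schema)."""
    failures: int        # confirmed device failures in the window
    device_hours: float  # total powered-on device-hours in the window


def annualized_failure_rate(window):
    """Failures per device-year, as a fraction (0.09 means 9% AFR)."""
    if window.device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return window.failures / window.device_hours * HOURS_PER_YEAR


def parametric_payout(window, afr_trigger, payout_per_point, limit):
    """Pay a fixed amount per percentage point of AFR above the trigger, capped at limit."""
    excess_points = max(0.0, (annualized_failure_rate(window) - afr_trigger) * 100.0)
    return min(limit, excess_points * payout_per_point)


if __name__ == "__main__":
    # Illustrative fleet: 10,000 GPUs observed for 90 days, 270 confirmed failures.
    window = FleetWindow(failures=270, device_hours=10_000 * 90 * 24)
    afr = annualized_failure_rate(window)
    payout = parametric_payout(window, afr_trigger=0.09,
                               payout_per_point=1_000_000, limit=5_000_000)
    print(f"AFR: {afr:.2%}, payout: ${payout:,.0f}")
```

The design choice worth noting is that the payout depends only on the measured rate and the contract parameters, not on an adjuster's assessment of individual losses. That is what makes the measurement agent's independence the binding constraint.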

This is exactly the structural gap Preston, Max, and Andrzej circled from three different angles. None of them framed it as a product yet; each saw the missing piece. The gap they converge on is the most frequently named research question across the corpus.

For the parametric four-pillar test (third-party measurement, agreed metric, actuarial model, sensor infrastructure) see insurance-market-overview-2026-06-15 §4 and berk-independent-study-report-2026-06-09 §6.1.

One operational note from the IMEC visit that doesn’t fit cleanly in the vendor categories but bears directly on the white space. Cedric Rolin on data sharing in the semiconductor industry: “People always say collaboration is important. In fact they do not collaborate. In sustainability, we do.” [Interview: Cedric Rolin, IMEC SSTS, 2026-06-XX] The origin story is concrete: Apple came to IMEC and said “we don’t trust our suppliers’ carbon numbers, we need a baseline.” That customer pressure pulled IMEC.netzero (a virtual fab LCA model — process-step-level methodology public, partner data private) into existence.

The structural lesson: sustainability framing has unlocked data sharing that the industry refuses under any other label, driven by hyperscaler / brand customer pressure and CSRD-style reporting. The same data layer that powers an LCA — process-step electricity, water, emissions, defect rates, yield — overlaps materially with the reliability data layer needed to underwrite a warranty product. Embedding chip-failure-data sharing inside a sustainability-linked LCA framework may be the path of least resistance for solving the §5 third-party-telemetry political problem. Worth treating as a strategic option, not a default.

Two related caveats from the same conversation: the EU regulatory tailwind for mandatory product-level disclosure has reversed (CSRDD retracted, Digital Product Passport stalled — “Draghi report. Competitiveness is the new religion.” — [Interview: Cedric Rolin]). And the operational-vs-embodied emissions split has crossed over to ~50/50, headed toward embodied dominance as grids decarbonize — which is what gives extended-life pressure its commercial teeth.

§7 — Convergences, divergences, surprises

Convergences

AI accelerators fail at hyperscale. Meta Llama-3 shows ~9% AFR; NVIDIA’s reserve has grown >100% YoY in each of the last three fiscal years (FY24 through FY26); and Alex Zhu reports “60 of 100 repairable.” The financial, operational, and physical signals all point the same direction.
Manufacturing-side detection is mature; OEM RMA-side workflow is not. PDF Solutions has Exensio FDC in every TSMC fab; NVIDIA runs RMAs on Salesforce + spreadsheets. The fab data and the field data don’t talk.
Hyperscalers are standardizing. OCP RAS Requirements v1.7 (Oct 2025), co-authored by Microsoft, Meta, Google, AMD, NVIDIA, Arm, and Intel, is the procurement-side admission that “the OEM defaults aren’t sufficient.”
Warranty burden concentrates at the chip vendor that owns the customer relationship. NVIDIA carries it; ODMs (Foxconn/Wistron/Quanta) and OEMs (Dell/HPE/SMCI) bounce it back via supplier indemnity. The flow has been re-confirmed by every cross-source check.
Chip-level “repair” doesn’t exist; only board-level replacement does. Ben Kaczer (IMEC AR2T physics), Lonny Orona (NVIDIA OEM-side operations), and Alex Zhu (NVIDIA repair-line workflow) converge on this from three angles. The financial-product framing should be “advanced replacement + second-life grading,” not “repair.”
Workload-conditional reliability is the natural unlock. Ben Kaczer (physics), Vivian (cooling/substrate value chain), and the SDC literature all point the same direction: utilization-aware coverage maps to how chips actually fail, and the data substrate to price it already exists. The brief surfaces this as the cleanest mechanical bridge from §1 to §5.
Data-center GPUs are the one chip case where circularity economics work. Cedric Rolin (IMEC sustainability), HashrateIndex reseller data, and Iron Mountain ALM segment growth all confirm — the rest of the chip industry’s recycling rate is ~0% by mass.

Divergences (flag prominently)

Server CPU failure rates are ~0% (Puget Systems 2025) but AI-GPU failure rates are ~9% (Meta Llama-3). The gap spans orders of magnitude, though a precise ratio cannot be stated against a near-zero baseline. The “chips fail” question has two distinct answers depending on what kind of chip.
NVIDIA reserve: filed ~$2.81B vs. WarrantyWeek’s $8.22B. Documented at length in reverse-logistics-warranty-tam-2026-05-29 §4. The 10-K is authoritative.
Secondary GPU value: 75-85% retention through 24 months (reseller view) vs. “given away at $2.10/hr” (Princeton CITP). The two camps disagree because resellers sell GPUs and CITP doesn’t; both arguments survive scrutiny. Material for any obsolescence-driven reverse-flow volume assumption.
ASIC vendors disclose nothing publicly. Google’s “97% goodput at 10K-chip scale” is the only quantitative ASIC signal we found. AWS Trainium, Microsoft Maia: silent. If hyperscaler ASICs reduce reliance on NVIDIA, the warranty burden could shift dramatically, and we wouldn’t see it in time.
Are hyperscalers running chips to failure against engineering recommendations? Bliss reported (via NVIDIA contact) that some hyperscaler business units knowingly run chips past preventive-replacement thresholds to maximize uptime, against their own engineering teams’ advice. Ben Kaczer’s response: “I never heard that.” [Interview: Ben Kaczer, 2026-06-XX] If true, it materially changes the failure-curve assumption underlying any insurance product. Worth pressing on in the next NVIDIA conversation.

Surprises (what we didn’t expect)

The OEM-side reverse logistics workflow at NVIDIA is materially less sophisticated than the hyperscaler-side detection workflow at the customer. Meta’s Hardware Sentinel beat test-based methods by 41% on SDC detection; NVIDIA still uses spreadsheets at the contract manufacturer. The asymmetry is large.
The Munich Re + TWAICE + Hithium structure is real and live. The “sensors + analytics + reinsurance” template was deployed for batteries three years before the chip-side conversation we’re having. The lag is interesting.
Iron Mountain’s ALM segment grew 70% YoY in Q2 2025 and 92% in Q1 2026. Decommissioning is the fastest-growing reverse-flow business in the public market. That’s not “tooling for NVIDIA returns” — it’s the post-warranty mass.
Park Place + Service Express is now ~28% of global TPM. Consolidation is happening on the customer-protection side without much VC-press attention.
Hyperscalers and chip vendors are jointly writing the spec. OCP RAS v1.7 (Oct 2025) was co-authored across firms that normally guard reliability data as competitive. That cooperation is itself a signal — the cost of NOT cooperating has exceeded the cost of cooperating.
Workload-conditional reliability is brand-new. Until very recently every chip carried a single nameplate guarantee (100% utilization, 125°C, 10 years). Ben Kaczer: the industry is just now starting to design for workload-aware guarantees. Direct primary-source confirmation that the data-substrate-meets-product-structure conversation hasn’t happened yet — which is the financialization wedge’s window.
Counterfeit-relabeled chips are a real grey-market segment. US Navy P-8 maritime patrol aircraft turned out to be running refurbished chips marked as new. IMEC has IP on tamper-aware odometers specifically because the market exists. That’s a previously unaccounted-for failure class: not “chip dies in service,” but “chip was already worn when shipped and the buyer didn’t know.” [Interview: Ben Kaczer]
Sustainability is the one data-sharing dimension that actually works in this industry. Cedric Rolin: collaboration on sustainability data happens because Apple-style customers demanded baselines and CSRD created reporting pressure. Reliability data has seen no comparable industry collaboration. Embedding chip-failure-data sharing in a sustainability framing may be the politically viable path.
Embodied vs. operational emissions have crossed over (~50/50, headed embodied-dominant). A 28nm chip embodies 35,000 g CO₂-eq per gram, the same order as gold, which makes it the mass-manufactured object with the highest embodied carbon footprint on the planet. Sustainability-driven life-extension pressure on data-center hardware is real and growing. [Interview: Cedric Rolin]
EU regulatory tailwind has reversed. CSRDD retracted, Digital Product Passport stalled, “Draghi report — competitiveness is the new religion.” [Interview: Cedric Rolin] Aligns with compliance_wedge_killed — any wedge premised on EU mandatory disclosure is structurally unwise.

§8 — Tiered open questions, with named contacts

Tier 1 — would resolve the GPU-failure thesis at the core

What’s NVIDIA’s real per-RMA cost (logistics + repair + scrap + lost opportunity)? Method D in reverse-logistics-warranty-tam-2026-05-29 §5 had this as its weakest input. → Greg DeLoccio (NVIDIA Service Team), intro offered by both Lonny and Alex; Alex Zhu follow-up (back from PTO mid-June).
Does the ~$2.81B reserve cover Lonny/Alex’s operational spend, or only the accounting estimate of free repairs? Hyperscaler-side cost (downtime, replanning, lost work) is borne by the customer; this number is just NVIDIA’s. → Greg DeLoccio + NVIDIA CFO-office contact.
What does a hyperscaler actually do on detect → triage → return? Customer-side workflow at scale. → Joel Sng or someone on the Meta AI Infra team (the Hardware Sentinel authors are listed on the ASPLOS 2025 paper); Patrick Yin (CoreWeave) or someone at Crusoe for the neocloud-side; Yvonne Lam if Microsoft Azure infrastructure is reachable.
3a. Does on-die aging-monitor data from NVIDIA’s GPUs actually flow off-chip in production deployments, and to whom? Ben Kaczer’s primary-source claim that NVIDIA-class chips “have aging monitors” is necessary but not sufficient: the question is whether the data is read, by whom, and under what contract. → Ben Kaczer (IMEC AR2T) for an intro to proteanTecs founders / BD; proteanTecs directly for the customer relationship inventory.

Tier 2 — size and shape the gap

Does the MI300 carry NVIDIA’s failure profile? AMD’s warranty curve is up but not segmented. → AMD contact (perspective gap — we have none yet); Holly Rawlins (Renesas) for non-NVIDIA flow-down mechanics.
What’s the operational reality at the ODM / CM layer? Wistron / Foxconn / Quanta won’t publish, but a former operator might. → Expeditors account manager (NVIDIA’s new 3PL); Greg DeLoccio for the in-house view.
Is anyone underwriting silicon-failure risk transfer on compute hardware today? → Preston Wilson (Guy Carpenter) for a structured intro to the parametric specialist; Munich Re specialty / aiSure team; Coalition if a chip-side analog has been considered.
6a. Is sustainability-driven data-sharing pressure (Apple-style customer push + CSRD-style frameworks) a viable Trojan horse for the reliability-data sharing problem? → Cedric Rolin (IMEC SSTS), who left an explicit open door (“email me anytime”); specifically ask about the IMEC.netzero data-sharing governance model and whether the same structure could carry reliability data; also Lizzie (IMEC SSTS LCA) for the LCA-auditor mechanics.
6b. Does the IMEC tamper-aware-odometer line have a productization path? This is the foundational data layer for any third-party-attestable chip-aging signal. → Annelise (iMec Ventures) for the venture-side view; Ben Kaczer for the research-status update; potential joint-pitch scoping per the IMEC visit debrief.
Is there a stealth / private purpose-built reverse-logistics platform we missed? Method search risk. → Baxter Planning competitive intel; PTC ServiceMax sales team; Gartner analyst on supply-chain execution.

Tier 3 — neighborhood questions

Edge AI and automotive failure rates — does anyone have real field data? → Hailo / Mythic founders; NXP / Renesas automotive reliability team (Holly Rawlins for intro); Tesla / Waymo platform reliability teams (long shot).
Hyperscaler ASIC reliability disclosures — when, if ever? → Google TPU team; AWS Annapurna / Trainium; Microsoft Maia team — all hard to reach but worth tracking.
What are the actual data-center-decommissioning unit economics inside Iron Mountain ALM? → Iron Mountain ALM segment lead; SK tes.

§9 — Confidence summary

Claim	Confidence	Basis
Wear-out mechanisms (EM, BTI, TDDB, HCI, thermal cycling) are the physical drivers of chip failure	High	Decades of peer-reviewed work; cross-checked across NASA NEPP, SemiEngineering, arxiv survey
Advanced packaging multiplies failure modes via CTE mismatch + microbump density	High	Cadence + Siemens + Tom’s Hardware coverage; coherent across sources
SDC rate ~1/1,000 silicon devices at hyperscale	High	Dixit 2021; Hochschild 2021; OCP whitepaper extends to AI accelerators
Meta Llama-3 ~9% annualized failure rate	High	Primary Meta paper; widely re-reported
Server CPUs show near-zero failures (Puget 2025)	Medium-High	Sample selection bias caveat; Puget’s stricter standards may overcount failures elsewhere
ASIC (TPU / Trainium / Maia) failure rates	Low / Unknown	Disclosure void; only Google goodput claim is public
Edge AI accelerator (Jetson / Hailo / Mythic) failure rates	Low / Unknown	Operating envelopes published; field data not
NVIDIA $2.81B FY26 reserve (filed)	High	Primary SEC 10-K XBRL
AMD on the same trajectory, ~10x smaller	High	Primary SEC 10-K
Intel / Broadcom / Marvell disclose no warranty	High	Absence of XBRL concept
System OEMs flat-to-declining	High	Primary SEC 10-Ks
OCP RAS v1.7 is the industry-standard convergence point	High	Primary spec; co-authored across hyperscalers + chip vendors
Munich Re + TWAICE + Hithium analog exists and is live	High	Multiple public press; TWAICE / Munich Re partnership page
No public warranty-reinsurance product for compute hardware	Medium	Absence-of-evidence; private structures may exist
Iron Mountain ALM 70-92% YoY growth	High	Primary 8-K + Resource Recycling reporting
Park Place + Service Express ~28% of global TPM	Medium-High	Single Verified Market Research estimate; not cross-verified
GPU secondary value (75-85% retention vs. CITP “given away”) contradiction	Medium / unresolved	Two credible sources disagree; surfaced
Reliability is a probability game with no point guarantee — “99 of 10,000 work for 10 years”	High	Primary source: Ben Kaczer, IMEC AR2T, 2026-06-XX
Workload-conditional reliability is a recent industry shift	High	Primary source: Ben Kaczer; consistent with arxiv 2503.21165 and Synopsys SLM trajectory
On-chip aging monitors are widely deployed in modern AI silicon (ring-oscillator differential method standard)	High	Primary source: Ben Kaczer + proteanTecs deployed product evidence
Tamper-aware odometers are an emerging detection class with research-side IP but no commercial product yet	Medium-High	Primary source: Ben Kaczer; P-8 Navy counterfeit-relabeling case; no commercial vendor identified
Data-center GPUs are the single chip-circularity case where economics work	High	Primary source: Cedric Rolin, IMEC SSTS, 2026-06-XX
Sustainability is the one industry data-sharing dimension that actually works	Medium-High	Primary source: Cedric Rolin; Apple-IMEC.netzero origin story; CSRD-side context
Chip-level “repair” doesn’t exist; only board-level replacement does	High	Convergent across Ben Kaczer (physics), Lonny Orona, and Alex Zhu
Insurance buyers won’t trust manufacturer-reported telemetry — third-party measurement is required	Medium-High	Bliss → Ben Kaczer agreement; structurally identical to Preston Wilson’s parametric four-pillar

§10 — Falsification check: what would make this brief wrong?

This section is mandatory per RDI methodology. If the read in this brief is wrong, here is how we would find out.

If the AMD MI300 turns out to have an NVIDIA-equivalent failure rate but a flat warranty reserve, then NVIDIA’s reserve growth is mostly an internal-accounting story (more conservative provisioning, tighter SLAs) rather than a physical-failure story. We would see this in two ways: AMD’s MI300-specific failure rate published in a Llama-3-style paper, or AMD’s warranty rollforward continuing to lag NVIDIA’s despite shipping volume catching up.
If hyperscalers’ in-house tooling (Meta Fleetscanner/Ripple/Hardware Sentinel; CoreWeave NLC) is enough and they don’t need external partners, then the “white space” between telemetry and insurance is occupied by internal platforms and never externalizes. The signal would be hyperscalers not joining OCP RAS in earnest, or NVIDIA Mission Control / Fleet Intelligence eating the segment.
If the Blackwell GB200 generation runs much better than H100 (better cooling, better lane repair, tighter binning), then the ~9% Llama-3 rate was a leading-edge-specific issue, not a structural problem. Watch Blackwell-era Llama-4 / Gemini equivalents for the next data point.
If the secondary GPU market is as thin as CITP argues, the obsolescence-churn volume thesis weakens and the reverse-flow opportunity shrinks toward warranty-only.
If the ASIC fleet at hyperscalers is materially more reliable than NVIDIA GPUs, the AI-accelerator failure problem becomes an NVIDIA problem specifically, and its importance shrinks as Trainium / Maia / TPU share grows. The disclosure void is exactly what prevents us from seeing this today.
If insurers (Munich Re, Lloyd’s, Coalition) survey the market and conclude the risk class is uninsurable — because the failure correlation across a hyperscaler fleet is too high to diversify away — then the warranty-reinsurance wedge is a structural non-starter, not a “no one has tried it” gap.
If on-chip aging-monitor data turns out to be game-able and tamper-resistance never productizes, then the foundational data substrate for any chip-failure insurance product is unreliable. The IMEC tamper-aware-odometer line is the proof-of-concept — if that research doesn’t reach commercial silicon, the third-party-measurement pillar (§5) collapses and the entire underwriting structure has to find a different telemetry source (in-network observation, customer-side independent sensors, or an audit-based proxy). [Interview: Ben Kaczer, 2026-06-XX]
If a workload-conditional reliability spec turns out to be impractical to underwrite (because workload telemetry is too OEM-controlled, or because correlation between fleets nullifies the diversification benefit), then the §1 mechanical bridge to a workload-priced product fails, and the brief’s structural case for “telematics-for-chips” loses its main support.
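The correlation falsifiers above (the uninsurable-risk-class case and the workload-conditional case) rest on a standard piece of arithmetic: when failures across a fleet are correlated, pooling more devices stops reducing the spread of the loss ratio. The sketch below is illustrative only. It assumes identical devices with a common failure probability `p` and a single pairwise correlation `rho`; the function name and the `rho` values are hypothetical, and only the ~9% rate is taken from this brief.

```python
import math


def fleet_fraction_std(p: float, rho: float, n: int) -> float:
    """Std dev of the failed fraction of a fleet of n devices.

    Assumes each device fails with probability p and every pair of
    devices shares the same correlation rho (an equicorrelated model).
    Var(fraction) = p(1-p) * (rho + (1-rho)/n), so the floor as n grows
    is rho * p(1-p) rather than zero.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0 and n >= 1):
        raise ValueError("need 0<=p<=1, 0<=rho<=1, n>=1")
    variance = p * (1.0 - p) * (rho + (1.0 - rho) / n)
    return math.sqrt(variance)


# p = 0.09 mirrors the ~9% annualized rate; rho values are hypothetical.
for rho in (0.0, 0.05, 0.30):
    small = fleet_fraction_std(0.09, rho, 1_000)
    large = fleet_fraction_std(0.09, rho, 100_000)
    print(f"rho={rho:.2f}  std(n=1k)={small:.4f}  std(n=100k)={large:.4f}")
```

With `rho = 0` the spread keeps shrinking as the fleet grows; with any positive `rho` it flattens out near `sqrt(rho * p * (1 - p))`, which is the sense in which a highly correlated fleet cannot be diversified away.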

The single best falsifier of the brief’s core direction would be a Meta-style failure-rate paper on Blackwell that comes in at <2% AFR — that would mean the AI-GPU reliability problem is generational and self-correcting, not structural.
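To make the <2% AFR test concrete, here is a minimal sketch of how an annualized failure rate could be derived from a short observation window and compared against that threshold. The function names and the input counts are hypothetical and chosen only for illustration; the calculation assumes a constant failure rate over the window and ignores device replacement, which a real fleet study would need to handle.

```python
def annualized_failure_rate(failures: int, devices: int, days: float) -> float:
    """Failures per device-year, assuming a constant rate over the window."""
    if devices <= 0 or days <= 0:
        raise ValueError("devices and days must be positive")
    device_years = devices * days / 365.0
    return failures / device_years


def is_generational(afr: float, threshold: float = 0.02) -> bool:
    """True when the observed AFR falls below the falsifier threshold."""
    return afr < threshold


# Hypothetical inputs, not taken from any published fleet study.
afr = annualized_failure_rate(failures=150, devices=10_000, days=60)
print(f"AFR = {afr:.4f}, below 2% threshold: {is_generational(afr)}")
```

Short windows make the estimate noisy, so a single paper clearing or missing the threshold is a data point rather than a verdict.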

Sources

Internal (semicolon-separated wikilinks):

Lonny Orona, 2026-05-12; Alex Zhu, 2026-05-27; IMEC visit (Leuven) — Ben Kaczer, Cedric Rolin, Lizzie, Annelise, Olivier Rousseaux, Jeroen Van den Bosch, 2026-06-XX; Holly Rawlins, 2026-04-29; Andrzej Strojwas, 2026-05-22; Vivian, 2026-04-29; Sean, 2026-05-06; Preston Wilson, 2026-05-07; Preston Wilson, 2026-05-22; Preston Wilson solo, 2026-05-22; Max Mirgoli, 2026-05-22; Minseok Kim, 2026-05-05; Ronit Jain, 2026-05-22; Josh, 2026-04-30

Cross-linked prior briefs:

reverse-supply-chain-research-2026-05-13; reverse-logistics-warranty-tam-2026-05-29; berk-independent-study-report-2026-06-09; data-centers-research-2026-05-24; insurance-market-overview-2026-06-15; financialization-primer-2026-05-29; glencore-of-semiconductors-2026-05-13; independent-distributors-research-2026-05-13

External — failure physics & SDC:

External — failure rates & detection:

External — RAS standards & telemetry:

External — warranty 10-Ks & industry aggregates:

NVIDIA FY2026 10-K (SEC accession 0001045810-26-000021)
AMD FY2025 10-K (SEC accession 0000002488-26-000018; CIK 2488)
Dell FY2026 10-K (SEC accession 0001571996-26-000008)
HPE FY2025 10-K (SEC accession 0001645590-25-000130)
Supermicro FY2025 10-K (SEC accession 0001375365-25-000027)
Broadcom FY2025 10-K
Intel FY2025 10-K (SEC accession 0000050863-26-000011; CIK 50863)
Marvell FY2025 10-K (CIK 1835632)
WarrantyWeek 23rd Annual Product Warranty Report (2026-04-16)
WarrantyWeek — Discrete GPU Warranty Expenses (2026-04-09)
WarrantyWeek — US Semiconductor Warranty Expenses (2025-07-24)
Assurant 2025 Annual Report (SEC)
Texas Instruments 2025 Annual Report
Cisco Hardware Warranty FAQ

External — secondary market, ITAD, TPM, warranty insurance:

Prepared per RDI methodology. Synthesis is a human activity — this brief surfaces evidence, ranges, convergences, and contradictions. It does not conclude.
