Report, Benchmark 2026-04-25 · By Joshua Dalton, Chief of Staff to the CEO at Seentio

Moore's Law Reborn: The AI Infrastructure Stack Thesis

Executive Summary

In 2019 and 2022, Nvidia CEO Jensen Huang declared "Moore's Law is dead"—referring to the premise that transistor density alone would drive exponential compute growth. By late 2024, he reframed the narrative: Nvidia's AI systems are advancing "way faster than Moore's Law" via full-stack co-design. This article decodes that shift, explains why it matters technically and economically, and identifies the investors and companies positioned to benefit.

The core insight is straightforward: progress in AI infrastructure is no longer primarily a function of smaller transistors; it is a function of coordinated improvement across hardware, memory, networking, software, and algorithms. That compounds faster than node shrinkage alone can achieve, and it has profound implications for where value accrues in the AI supply chain.


The Original Moore's Law and Its Limits

Historical Context

Gordon Moore's 1965 observation was empirical: the number of components on integrated circuits was doubling roughly every year, a pace he revised in 1975 to every two years. The semiconductor industry adopted this as "transistor count doubling every ~18–24 months," and it became the North Star of computing progress.

\[ N(t) = N_0 \cdot 2^{t/\tau} \]

where:
- \(N(t)\) = transistor count at time \(t\)
- \(N_0\) = initial transistor count
- \(\tau\) = doubling period (~2 years)

Plain English: If a chip had 1 million transistors in 1995 and followed Moore's Law exactly, it would have 2 million in 1997, 4 million in 1999, and so on. Each transistor's cost-per-unit fell accordingly, making bigger, faster chips economically viable.
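As a quick sanity check, here is a minimal Python sketch of the doubling formula, using the 1995 example above and an assumed two-year doubling period:

```python
def transistor_count(n0: float, years: float, doubling_period: float = 2.0) -> float:
    """Moore's Law projection: N(t) = N0 * 2^(t / tau)."""
    return n0 * 2 ** (years / doubling_period)

# The 1995 example: 1 million transistors, projected forward.
for year in (0, 2, 4, 10):
    print(f"1995 + {year:>2} yr: {transistor_count(1e6, year):,.0f} transistors")
# 1995 +  0 yr: 1,000,000
# 1995 +  2 yr: 2,000,000
# 1995 +  4 yr: 4,000,000
# 1995 + 10 yr: 32,000,000
```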

Why Physics Slowed It Down

Process-node shrinkage (feature size reduction: 10μm → 1μm → 65nm → 28nm → 7nm → 3nm) follows the same exponential trajectory that Moore observed. But by ~2015, the physics imposed hard constraints:

  1. Quantum tunneling and leakage. At sub-10nm scales, electrons tunnel through gate oxides, wasting power.
  2. Heat dissipation. With voltage scaling stalled, power density (\(\text{W/mm}^2\)) rises at smaller nodes, making high-frequency operation difficult.
  3. Manufacturing yield. Defect densities and variability increase; cost-per-chip does not fall as sharply.
  4. Design complexity. Routing signals at 5nm demands far larger engineering teams and far more sophisticated design tools.

By 2022, Moore's doubling period had stretched from 2 years to 2.5–3 years in practical terms. At TSMC and Samsung, moving from one node to the next now costs $5–10 billion and takes 3+ years—a stark contrast to the 1990s, when node transitions happened every 18 months at lower cost.

Result: Raw transistor count still grows, but cost-per-transistor stopped declining sharply. Huang's declaration that "Moore's Law is dead" was acknowledging this economic reality.


The Full-Stack Paradigm Shift

What Huang Actually Means

Huang's pivot in late 2024–early 2025 is not a retraction; it is a redefinition of the relevant metric. Instead of asking "How many transistors fit on a chip?", the question becomes: "How much AI throughput and useful inference output do we get per dollar of total system cost?"

This metric depends on six tightly coupled layers:

graph TB A["Algorithm Design
Flash Attention, Inference Optimization"] B["Chip Architecture
Tensor cores, memory hierarchy"] C["Memory & Interconnect
HBM, NVLINK, networking"] D["Software Stack
CUDA, TensorRT, cuBLAS kernels"] E["System Integration
Power, cooling, rack layout"] F["Model Training & Inference
RLHF, DPO, KV-cache optimization"] A --> B B --> C C --> D D --> E E --> F F -.-> A style A fill:#1a3a5c,color:#fff,stroke:#2563eb style B fill:#1e3a5f,color:#fff,stroke:#3b82f6 style C fill:#162d50,color:#fff,stroke:#60a5fa style D fill:#172554,color:#fff,stroke:#3b82f6 style E fill:#1e293b,color:#fff,stroke:#475569 style F fill:#1a3a5c,color:#fff,stroke:#2563eb

Key insight: Optimizing only the chip (layer B) while ignoring memory bandwidth (layer C) or kernel efficiency (layer D) leaves throughput on the table. Nvidia's advantage is that it controls or deeply influences all six layers.

Quantifying the Gains: Pre-Training, Post-Training, Test-Time Compute

Huang tied full-stack progress to three AI scaling dimensions:

1. Pre-training efficiency. Better hardware reduces the cost of training large language models from scratch. If a model requires \(10^{24}\) FLOPs to train and compute costs \(10^{-12}\) dollars per FLOP, total cost is \(10^{12}\) dollars, or $1 trillion. Full-stack optimization (faster memory, better kernels) can reduce cost-per-FLOP by 2–3×, cutting this to $300–500 billion.

\[ \text{Training Cost} = N_{\text{params}} \cdot C_{\text{compute}} \cdot \alpha \]

where:
- \(N_{\text{params}}\) = model parameters (e.g., \(7 \times 10^{10}\) for a 70B-parameter model)
- \(C_{\text{compute}}\) = cost per effective FLOP ($/FLOP)
- \(\alpha\) = compute-to-parameter ratio (≈ 20–100 for typical training)

Full-stack optimization lowers \(C_{\text{compute}}\).

2. Post-training (RLHF, DPO, SFT). Reward model training and fine-tuning require smaller models but higher iteration rates. Memory bandwidth becomes critical; optimized inference kernels reduce cost per forward pass, enabling more iterations.

3. Test-time compute. Reasoning-style inference (e.g., OpenAI's o1, DeepSeek-R1) uses many "thinking tokens" per output. Cost scales with tokens generated. Better hardware and software mean lower cost per token, making reasoning more economically viable for more users.

\[ \text{Inference Cost per Output} = C_{\text{token}} \times (\text{input tokens} + \text{thinking tokens} + \text{output tokens}) \]

where \(C_{\text{token}}\) is the effective cost per token processed.

Plain English: Cheaper compute at test time makes expensive reasoning accessible to more applications. Both cost formulas are tied together in the sketch below.
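A minimal Python sketch of the two cost formulas above. All dollar figures are the article's illustrative assumptions (a \(10^{-12}\) $/FLOP effective rate, a hypothetical per-token price), not market quotes:

```python
def training_cost(total_flops: float, cost_per_flop: float) -> float:
    """Total pre-training cost in dollars; total_flops plays the role of
    N_params * alpha in the formula above."""
    return total_flops * cost_per_flop

def inference_cost(cost_per_token: float, input_toks: int,
                   thinking_toks: int, output_toks: int) -> float:
    """Cost per request scales linearly with total tokens handled."""
    return cost_per_token * (input_toks + thinking_toks + output_toks)

# Pre-training: 10^24 FLOPs at an assumed 10^-12 $/FLOP -> ~$1 trillion.
base = training_cost(1e24, 1e-12)
optimized = training_cost(1e24, 1e-12 / 2.5)  # 2.5x cheaper effective FLOPs
print(f"baseline ${base:.2e}, optimized ${optimized:.2e}")  # $1.00e+12 -> $4.00e+11

# Test-time compute: a reasoning-heavy request, at a hypothetical $2e-6/token.
print(f"${inference_cost(2e-6, 500, 4_000, 800):.4f} per request")  # $0.0106
```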


Evidence: Nvidia's Claims and Reality Checks

The 30–40× and 1000× Claims

Huang told TechCrunch that:
- Nvidia's latest datacenter superchip (likely the Blackwell architecture) is 30–40× faster for certain AI inference workloads than the H100 (released 2022).
- Nvidia's AI chips are 1,000× faster than a decade ago.

Interpretation:

The 30–40× claim is plausible but workload-specific. A few examples:
- Token-per-second improvement on small-batch inference: an H100 achieves ~100–200 tokens/s per GPU for 70B models; newer systems with better kernels and memory bandwidth can reach 3,000–5,000 tokens/s. That is ~30× faster.
- Latency reduction for transformer attention: FlashAttention-2 (implemented in newer stacks) reduces attention's memory traffic from \(O(N^2)\) to \(O(N)\); on typical sequences it is ~4–5× faster than standard attention. Combined with other optimizations (fused operations, quantization), 30–40× is within range for specific workloads.

The 1,000× claim is harder to pin down and likely mixes multiple dimensions:
- 10 years ago: the flagship GPU accelerator was the Tesla K80 (~5.6–8.7 TFLOPS fp32, released 2014).
- Today: H100 (2022) and Blackwell (2024) deliver hundreds of TFLOPS in tensor-core precisions.
- Raw throughput improvement: ~150–300×, comparing K80 fp32 against modern tensor-core FLOPS.
- Adding software improvements, kernel efficiency, and sparsity: plausible that total effective throughput is 1,000× or more for specific tasks (e.g., sparse inference, quantized models).

Verdict: Directionally true for selected benchmarks, but heavily workload-dependent. Marketing-inflected, not a literal engineering claim.

Industry Benchmarks: Measuring Full-Stack Progress

The table below tracks the evolution of Nvidia's data-center GPUs and their efficiency metrics:

| GPU Model | Year | Peak TFLOPS* | Memory Bandwidth (GB/s) | Power (W) | Efficiency (TFLOPS/W) |
|---|---|---|---|---|---|
| V100 | 2017 | 15.7 | 900 | 300 | 0.052 |
| A100 | 2020 | 19.5 | 2,039 | 400 | 0.049 |
| H100 | 2022 | 141 | 3,352 | 700 | 0.20 |
| Blackwell | 2024 | >400 | ~7,000 | ~750 | >0.53 |

*The V100 and A100 figures are fp32; the H100 and Blackwell entries do not correspond to fp32 (H100 fp32 is roughly 67 TFLOPS) and are best read as vendor-reported peak throughput at tensor-core precisions, so the column mixes precisions.

What this shows:
- Raw FLOPS jumped ~25× from V100 to Blackwell (2017–2024).
- Memory bandwidth jumped ~7.8× over the same period, reducing compute–memory bottlenecks.
- Efficiency (TFLOPS per watt) improved ~10×, reflecting better utilization and lower power per operation.

This is full-stack progress: better chip design (more tensor cores), better memory (HBM3, wider paths), better power delivery, and better kernels to use it all.
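The ratios quoted above follow directly from the table; a quick check in Python, with Blackwell's approximate entries taken at their stated bounds:

```python
specs = {  # (year, peak_tflops, bandwidth_gb_s, power_w), values as listed above
    "V100":      (2017,  15.7,   900, 300),
    "A100":      (2020,  19.5, 2_039, 400),
    "H100":      (2022, 141.0, 3_352, 700),
    "Blackwell": (2024, 400.0, 7_000, 750),  # ">400" and "~7,000" at face value
}

v100, bw = specs["V100"], specs["Blackwell"]
print(f"FLOPS gain:      {bw[1] / v100[1]:.1f}x")      # ~25.5x
print(f"Bandwidth gain:  {bw[2] / v100[2]:.1f}x")      # ~7.8x
eff = lambda s: s[1] / s[3]                            # TFLOPS per watt
print(f"Efficiency gain: {eff(bw) / eff(v100):.1f}x")  # ~10.2x
```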


Why This Reframing Matters: The "AI Factory" Concept

System-Level Economics

The traditional chip metric (transistor count, FLOPS) is a supply-side measure. Huang's framing shifts focus to demand-side efficiency: cost per useful result.

For an AI company running inference at scale:

\[ \text{Cost per Output} = \frac{\text{GPU cost} + \text{Power cost} + \text{Cooling cost}}{\text{Throughput (tokens/s)} \times \text{Uptime (s)}} \]

Improving any numerator or denominator term lowers cost per output. A 30% improvement in kernel efficiency raises throughput (the denominator), cutting cost per output by roughly 23%, the economic equivalent of a price cut. Full-stack optimization compounds these gains; the sketch below makes the equivalence concrete.
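A minimal sketch of the cost-per-output identity. All input rates here are hypothetical placeholders, but the kernel-efficiency point comes through:

```python
def cost_per_token(hourly_gpu_cost: float, hourly_power_cooling: float,
                   tokens_per_sec: float) -> float:
    """Dollars per generated token for one GPU over one hour of uptime."""
    hourly_total = hourly_gpu_cost + hourly_power_cooling
    return hourly_total / (tokens_per_sec * 3600)

base = cost_per_token(2.50, 0.40, tokens_per_sec=1_000)    # assumed rental + power rates
faster = cost_per_token(2.50, 0.40, tokens_per_sec=1_300)  # +30% kernel efficiency
print(f"baseline ${base:.2e}/token, optimized ${faster:.2e}/token")
print(f"cost reduction: {1 - faster / base:.0%}")          # ~23%, like a price cut
```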

Constraints Beyond Transistors

Modern AI inference is bottlenecked by:

  1. Memory bandwidth. Transformer models spend 80–90% of time on memory-bound operations (attention, linear layers with small batch sizes). Adding transistors without adding bandwidth does not help.
  2. Communication latency. Multi-GPU inference requires fast interconnects (NVLink, InfiniBand). Older systems used PCIe, which is 10–50× slower.
  3. Software kernel quality. A poorly written matrix-multiplication kernel can leave 50% of hardware throughput unused.
  4. Sparsity and quantization. Modern models exploit sparse weights and low-precision arithmetic (fp8, int8). Hardware and software must be co-designed to extract gains.

Example: the H100 offers about 65% more memory bandwidth than the A100 (3.35 vs. 2.04 TB/s) alongside a much larger (~7×) jump in the table's peak FLOPS. For small-batch inference, the bandwidth gain is what shows up in realized throughput, because the workload is memory-bound. The full-stack view explains why; the transistor-count view does not.
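To see why, here is a rough roofline-style sketch. The numbers are assumptions (70B fp16 weights, dense fp16 tensor throughput of ~312 and ~990 TFLOPS for A100 and H100, the bandwidths above): generating one token at batch size 1 must stream essentially all model weights from memory, so memory time dwarfs compute time.

```python
# Rough roofline check for decoding one token of a 70B-parameter model in fp16.
# All hardware figures are approximate/assumed, for illustration only.
params = 70e9
bytes_per_param = 2                  # fp16 weights
flops_per_token = 2 * params         # ~2 FLOPs per parameter per generated token

def decode_time(tflops: float, bw_tb_s: float) -> tuple[float, float]:
    t_compute = flops_per_token / (tflops * 1e12)            # seconds of math
    t_memory = params * bytes_per_param / (bw_tb_s * 1e12)   # seconds of weight streaming
    return t_compute, t_memory

for name, tflops, bw in [("A100", 312, 2.04), ("H100", 990, 3.35)]:
    tc, tm = decode_time(tflops, bw)
    bound = "memory-bound" if tm > tc else "compute-bound"
    print(f"{name}: compute {tc*1e3:.2f} ms, memory {tm*1e3:.1f} ms -> {bound}")
# Memory time is ~100x compute time, so extra bandwidth, not extra FLOPS,
# is what speeds up small-batch decoding.
```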


Who Wins and Loses: The Investment Map

Tier 1: Chip Designers and Systems Integrators

| Ticker | Company | Price (approx.) | Market Cap | Role |
|---|---|---|---|---|
| NVDA | Nvidia | ~$200 | $4.9T | GPU architect, software stack, full-stack integration |
| AMD | AMD | ~$160 | $250B | GPU competitor (Instinct/CDNA GPUs, EPYC CPUs), system partnerships |
| INTC | Intel | ~$30 | $110B | CPU/GPU maker, foundry ambitions (Intel Foundry Services) |

Nvidia's advantage: Controls the full stack—hardware design, CUDA software, system architecture, and increasingly, hyperscaler partnerships. Can optimize across all six layers. CUDA moat remains strong (90%+ market share in AI accelerators).

AMD's position: Strong in CPUs and server processors; Instinct GPUs (MI300-class) gain traction with hyperscalers such as Microsoft, but the CUDA ecosystem lag is real. Building equivalent optimized software (ROCm libraries and compilers) is costly.

Intel's challenge: Its accelerator designs (Gaudi, Ponte Vecchio) lack equivalent software depth, and its foundry business (making chips for others) is not the same as controlling the full stack.

Tier 2: Foundries and Chipmakers

| Ticker | Company | Price | Market Cap | Role |
|---|---|---|---|---|
| TSM | Taiwan Semiconductor Manufacturing Company | ~$200 | $1.3T | Leading-edge process nodes (3nm, 2nm), pure-play foundry |
| AVGO | Broadcom | ~$220 | $200B | Networking (Ethernet, PCIe, InfiniBand PHYs), interconnects |
| MU | Micron Technology | ~$120 | $175B | Memory (HBM3, GDDR6X), crucial for bandwidth |

TSMC's role: N3 and N2 node quality directly enables Blackwell and next-generation designs. Huang's full-stack thesis does not reduce reliance on advanced nodes; if anything, it increases it (denser designs allow more memory bandwidth per square mm).

Broadcom: High-speed interconnects (Ethernet, InfiniBand) are critical to multi-GPU systems. Improved routing and switching reduce latency between GPUs.

Micron: HBM (High Bandwidth Memory) is the most bandwidth-dense memory available. The A100 used HBM2e; the H100 moved to HBM3; Blackwell uses HBM3e. Increasing capacity and speed is essential to the full-stack story.

Tier 3: Hyperscalers and Model Makers

| Ticker | Company | Price | Market Cap | Role |
|---|---|---|---|---|
| GOOGL | Alphabet | ~$180 | $2.3T | Cloud (GCP), TPU design, LLM inference (Gemini) |
| MSFT | Microsoft | ~$400 | $3.2T | Cloud (Azure), OpenAI partnership, inference services |
| META | Meta Platforms | ~$580 | $1.8T | LLaMA model family, on-device and cloud inference |
| AMZN | Amazon | ~$200 | $1.9T | AWS cloud, Trainium/Inferentia chips, Claude partnership |

Hyperscalers benefit directly from full-stack optimization: lower inference cost per token allows lower prices (or higher margins) on cloud services, which drives higher demand for inference. They also build custom chips (TPU, Trainium, Inferentia) to capture some of Nvidia's margin.

Model makers (OpenAI, Anthropic, Meta) benefit from cheaper test-time compute, making reasoning models more economically viable.


Scaling Laws and the Math Behind Full-Stack Optimization

Chinchilla and Subsequent Scaling Work

Recent scaling laws show that compute spent on pre-training, post-training, and test-time should scale in specific proportions. A simplified form:

\[ \text{Loss}(C) = E + \frac{A}{C^{\alpha}} + \frac{B}{D^{\beta}} \]

where:
- \(C\) = pre-training compute (FLOPs)
- \(D\) = post-training or test-time compute
- \(\alpha, \beta \approx 0.07\) to \(0.1\) (exponents of this magnitude appear in Chinchilla and related scaling research)
- \(E\) = irreducible loss (data-quality floor)
- \(A, B\) = constants

Interpretation: Loss decreases with compute, but with diminishing returns. An exponent of \(\alpha \approx 0.07\) means a 10× increase in compute shrinks the reducible loss term by only \(10^{0.07} \approx 1.17\)×, roughly a 17% improvement.

Why this matters for full-stack optimization: If hardware/software improvements reduce effective cost-per-FLOP by 50% (e.g., from \(10^{-12}\) to \(5 \times 10^{-13}\) $/FLOP), then for the same budget practitioners can afford 2× more compute, shrinking the reducible loss term by ~5–7%. Compounded across multiple layers (algorithms, kernels, memory), that is meaningful.
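The arithmetic in one place, a sketch of the simplified scaling relation above with the assumed 0.07–0.1 exponent range:

```python
def reducible_loss_gain(compute_multiplier: float, alpha: float) -> float:
    """Factor by which the A / C^alpha term shrinks when compute grows."""
    return compute_multiplier ** alpha

for mult in (2, 10):
    lo, hi = reducible_loss_gain(mult, 0.07), reducible_loss_gain(mult, 0.10)
    print(f"{mult:>2}x compute -> reducible loss shrinks {lo:.2f}x to {hi:.2f}x")
#  2x compute -> 1.05x to 1.07x  (~5-7% improvement)
# 10x compute -> 1.17x to 1.26x  (~17-26% improvement)
```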

Predicting Future Efficiency

Assuming full-stack efficiency improves 30–40% per year (an extrapolation from recent trends; the risks section below explains why this may prove optimistic):

\[ C_{\text{eff}}(t) = C_{\text{eff}}(0) \cdot (1.35)^{-t} \]

where \(t\) is years and \(C_{\text{eff}}\) is effective cost-per-FLOP.

Over 5 years: \((1.35)^5 \approx 4.5\)×, meaning models costing $1T to train today could cost roughly $220B to train in 2030 (if compute cost were the only constraint). This is Huang's "hyper Moore's Law" in action.
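And the compounding itself, using the article's assumed annual rates:

```python
def cumulative_efficiency(annual_gain: float, years: int) -> float:
    """Total cost-per-FLOP reduction factor after compounding annual gains."""
    return (1 + annual_gain) ** years

for rate in (0.20, 0.30, 0.35, 0.40):
    print(f"{rate:.0%}/yr for 5 yr -> {cumulative_efficiency(rate, 5):.1f}x cheaper")
# 20%/yr -> 2.5x; 30%/yr -> 3.7x; 35%/yr -> 4.5x; 40%/yr -> 5.4x
```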


The Counterarguments and Risks

Physical Limits Still Apply

Full-stack optimization is powerful, but not infinite:

  1. Memory bandwidth ceiling. Parallel paths (HBM, chiplet interconnects) eventually saturate. Beyond ~10 TB/s per GPU, adding more requires disproportionately more die area and power.
  2. Power dissipation. 700W per GPU is already pushing cooling infrastructure. 1000W+ per chip requires immersion cooling or other exotic methods, adding system cost.
  3. Latency laws. Signal propagation at light speed means no data center can have zero inter-GPU latency. Distributed training and inference have fundamental costs.
  4. Software complexity. Maintaining CUDA, TensorRT, cuBLAS, cuDNN, and other libraries is expensive. Competitive optimization pressure drives costs up.

Risk: If Huang's "hyper Moore's Law" is too aggressive, investors may overpay for Nvidia stock on the assumption that efficiency gains will continue 30–40% annually. Reversion to 20% annual improvement would not be a collapse but could justify lower valuations.

AMD and Custom Chip Threat

AMD's CDNA3 (MI300-series) and upcoming CDNA4 accelerators are not far behind Nvidia in raw specs. The gap is software (CUDA vs. ROCm) and ecosystem trust. If AMD or hyperscalers invest enough in ROCm libraries and compiler quality, they could close the gap in 3–5 years.

Similarly, custom chips (Google TPU v5e, Amazon Trainium2) are improving. At trillion-parameter scale, a 10% efficiency advantage is worth billions in infrastructure savings, justifying internal chip development.


How to Track This on Seentio

Screeners and Custom Alerts

Search the Seentio Technology Screener for:
- Gross margin trends in semiconductor and cloud companies (margin expansion = successful optimization).
- R&D spend as % of revenue in chip and software companies (high R&D reflects the full-stack arms race).
- Data-center revenue growth (hyperscalers gaining efficiency).

Set alerts for:
- NVDA data-center margins staying >65% (indicates continued pricing power).
- AMD GPU revenue acceleration (indicates CUDA moat erosion).
- TSM N2 utilization >90% (indicates node bottleneck).


Key Takeaways

  1. Moore's Law (transistor count doubling) has slowed, but Huang reframes the metric: cost-per-useful-AI-result can still improve exponentially via full-stack optimization.

  2. The six layers—algorithms, chip architecture, memory, software, system integration, and model training—must improve together. Nvidia's advantage is control across all six; competitors must invest heavily in multiple domains simultaneously.

  3. Scaling laws show that a 2× reduction in cost-per-FLOP buys 2× more compute for the same budget, shrinking reducible loss by only ~5–7% given the exponents above. The meaningful multiples come from compounding gains across many optimization targets (algorithms, kernels, memory, interconnect) year after year.

  4. Beneficiaries: Nvidia (chip + software + system architect), TSMC (advanced nodes), Broadcom (interconnects), Micron (memory), and hyperscalers (lower infrastructure cost, higher margins on inference).

  5. Risks: Physical limits on memory bandwidth and power dissipation will eventually bind. Competition from AMD and custom chips (TPU, Trainium) could erode Nvidia's moat in 3–5 years if software quality catches up.

  6. The investment thesis hinges on whether full-stack efficiency truly improves 30–40% annually. Evidence supports 20–30% near-term; 40% annually for 5+ years is optimistic.



Disclaimer

This article is for informational purposes only and is not investment advice. Seentio is not a registered investment adviser. Readers should conduct their own due diligence and consult a qualified financial advisor before making investment decisions.

Frequently Asked Questions

What did Moore's Law originally mean?

Gordon Moore's 1965 observation that the number of components on integrated circuits doubles roughly every year (revised in 1975 to every two years), enabling exponential growth in compute density and steady declines in cost per operation. This drove semiconductor and computing progress for decades.

Why does Jensen Huang say Moore's Law is 'dead'?

Process-node shrinkage alone (7nm → 3nm → 2nm) no longer delivers the same performance gains relative to power and cost. Scaling laws in semiconductor physics impose hard limits. But Huang argues full-stack optimization—chips, memory, interconnects, software—compounds faster than node shrinks alone.

What is 'hyper Moore's Law' in Huang's framing?

The claim that AI infrastructure is advancing faster than classic Moore's Law (2× every two years) because multiple system layers improve in parallel: architecture, algorithms, software kernels, memory bandwidth, and networking. Progress is exponential but not limited to transistor count.

How does this affect AI model performance and cost?

Lower effective cost-per-inference and cost-per-training FLOP. Better memory bandwidth and interconnects reduce communication bottlenecks. Improved kernels (e.g., Flash Attention, fused operations) increase hardware utilization. Post-training and test-time compute become more affordable, enabling larger models and more capable reasoning.

Which companies benefit most from full-stack optimization?

Nvidia (chip design, software, system architecture), AMD (GPU competition), TSMC (leading-edge foundry), and ecosystem players like Broadcom (networking), Micron (memory). Model makers (OpenAI, Anthropic) gain efficiency. Customers (hyperscalers, enterprises) see lower inference costs.
