Etched Comes Out of Stealth Already a Giant
Etched came out of stealth with working racks, a successful A0 tapeout, more than $1bn in customer contracts, $800m raised, and a bold thesis: the next AI bottleneck is frontier inference, not just model training.

Most startups come out of stealth with a product demo, a funding round, or a few design partners.
Etched came out this week with something much heavier: working racks, a successful A0 tapeout, more than $1bn in customer contracts, $800m raised, and a team of 400+ engineers from NVIDIA, Google TPUs, Broadcom, SK Hynix, TSMC, and others. The funding includes a $500m round led by Stripes that closed in December at a $5bn post-money valuation, with Jane Street (over $100m invested across rounds), Peter Thiel, and TSMC-linked VentureTech Alliance on the cap table, alongside angel checks from Karpathy, Hinton, and Fei-Fei Li.
That is why Etched feels unusual. This is not just another AI chip startup saying it wants to compete with NVIDIA. Etched is trying to build a new category of AI hardware: frontier inference clusters.
Its bet is simple, but extremely bold:
The future of AI will not only be limited by who has the biggest models. It will be limited by who can run those models cheaply, quickly, and power-efficiently at massive scale.
That is the Etched thesis.
For a beginner's guide to chips, see my piece AI Infrastructure 101: Semiconductor - The Industry that Powers Everything.
What Etched Is Building
Etched is building AI inference systems.
In simple terms, inference is what happens after an AI model has already been trained. When you type a prompt into ChatGPT and the model generates an answer, that is inference.
Training creates the model.
Inference serves the model to users.
That difference matters because the AI bottleneck is shifting. In the first phase of the AI boom, the question was mostly:
Who can get enough GPUs to train the next frontier model?
Now the question is increasingly:
Who can afford to run these models for millions or billions of users every day?
Etched is targeting that second problem.
Its first chip is called Sohu. Unlike NVIDIA GPUs, which are general-purpose and can run many types of workloads, Sohu is highly specialised for transformer inference. Transformers are the architecture behind most modern generative AI systems, including large language models.

Etched's headline claim is concrete: an 8-chip Sohu server can reportedly process around 500,000 tokens per second on Meta's Llama 70B, and one Sohu server can replace up to 160 NVIDIA H100 GPUs. That's the kind of falsifiable number worth watching once independent testing lands — it's a much sharper thing to hold the company to than "SOTA throughput."
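As a quick sanity check, here is what those two figures imply when taken together. This is a minimal sketch: the function and variable names are illustrative, and the inputs are simply the claimed numbers quoted above, not measured results.

```python
# Back-of-the-envelope check of the claimed figures (not measured data).
CLAIMED_SERVER_TOKENS_PER_SEC = 500_000  # claimed: 8-chip Sohu server, Llama 70B
CLAIMED_H100_EQUIVALENTS = 160           # claimed: H100 GPUs one server replaces
CHIPS_PER_SERVER = 8


def implied_rates(server_tps: float, h100_equivalents: int, chips: int) -> dict:
    """Derive the per-device rates that the two headline claims imply."""
    per_h100 = server_tps / h100_equivalents
    per_sohu = server_tps / chips
    return {
        "tokens_per_sec_per_h100": per_h100,
        "tokens_per_sec_per_sohu_chip": per_sohu,
        "sohu_chip_vs_h100": per_sohu / per_h100,
    }


rates = implied_rates(
    CLAIMED_SERVER_TOKENS_PER_SEC, CLAIMED_H100_EQUIVALENTS, CHIPS_PER_SERVER
)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Taken at face value, the claims imply about 3,125 tokens per second per H100 and a 20x per-chip advantage. Independent testing would need to confirm both the Sohu figure and the H100 baseline it is being compared against.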
This is the core difference:
| NVIDIA GPU | Etched Sohu |
|---|---|
| Flexible and general-purpose | Narrow and specialised |
| Can train models and run many AI workloads | Designed for transformer inference |
| Strong software ecosystem | Optimised for one dominant workload |
| Safer if AI architectures change | Potentially much more efficient if transformers dominate |
A GPU is like a Swiss Army knife. It can do many things well.
Sohu is more like a Formula 1 car. It is not useful for every road, but on the right track, it can be much faster.
Etched is betting that the "right track" is large-scale transformer inference.
Why Specialisation Matters
Specialisation sounds risky, because it narrows what the chip can do. But that narrowness is also the point.
General-purpose chips carry a flexibility tax. They need hardware and software support for many workloads: training, inference, graphics, scientific computing, different neural network architectures, and future model types.
Etched wants to remove that tax.
If transformers continue to dominate frontier AI, then a chip designed specifically for transformer inference could be much more efficient than a general-purpose GPU. The goal is not just more theoretical compute. The goal is better real-world serving:
more tokens per second
lower latency
lower power consumption
lower cost per token
better rack-level efficiency
This matters because inference is becoming much harder. Users no longer ask AI systems simple one-shot questions. They ask models to read long documents, reason across multiple steps, use tools, write code, search the web, remember context, and act more like agents.
That means more tokens, longer context windows, more memory pressure, and more power consumption.
Etched is basically saying:
Today's AI infrastructure was not fully designed for the inference-heavy, agentic AI world we are moving into.
The Two Technical Ideas
Etched is currently highlighting two major technical ideas: Low-Voltage Inference and Cluster-Scale Memory.
Both are trying to solve the same big problem: frontier inference is no longer just a chip problem. It is a full-system problem in which the chip, rack, cooling, networking, memory, compiler, software, supply chain, and power system all have to work together.
1. Low-Voltage Inference
The first problem Etched is attacking is power.
Modern AI chips generate huge amounts of heat. As utilisation rises, chips draw more power. If the chip gets too hot or hits its power limit, it has to slow down. This is called throttling: thermal throttling when heat is the trigger, power capping when the power limit is.
That means a chip can advertise huge theoretical FLOPs, but fail to sustain that performance in real workloads.
FLOPs here means floating-point operations per second (often written FLOPS). It measures how many floating-point calculations a chip can perform each second. But for AI inference, the better question is not simply:
How many FLOPs can this chip theoretically deliver?
The better question is:
How many useful tokens can this system generate per second, at acceptable latency, within a real power budget?
Etched's answer is Low-Voltage Inference, or LVI.
The company says it designed its math blocks to run at under half the voltage of most AI chips. Lower voltage can reduce power consumption and heat, which lets the system pack more compute into the same power and thermal envelope.
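To see why voltage is such a strong lever, recall the standard first-order model for CMOS switching power: P ≈ α·C·V²·f, so power falls with the square of the supply voltage. The sketch below applies that textbook relation. The voltages are made-up illustrative values, not Etched's figures, and the model ignores leakage and the fact that lower voltage usually limits how fast the clock can run.

```python
def relative_dynamic_power(v_new: float, v_ref: float, f_ratio: float = 1.0) -> float:
    """Dynamic power relative to a reference design under P ~ a*C*V^2*f.

    Activity factor and capacitance are assumed unchanged, so they cancel.
    f_ratio is the new clock frequency divided by the reference frequency.
    """
    if v_new <= 0 or v_ref <= 0 or f_ratio <= 0:
        raise ValueError("voltages and frequency ratio must be positive")
    return (v_new / v_ref) ** 2 * f_ratio


# Hypothetical voltages chosen only to illustrate "half the voltage".
same_clock = relative_dynamic_power(v_new=0.4, v_ref=0.8)
slower_clock = relative_dynamic_power(v_new=0.4, v_ref=0.8, f_ratio=0.7)

print(f"Half voltage, same clock:      {same_clock:.3f}x dynamic power")
print(f"Half voltage, 70% clock speed: {slower_clock:.3f}x dynamic power")
```

Under this simplified model, halving the voltage at the same clock cuts switching power to roughly a quarter. That is the headroom a design could, in principle, spend on packing more compute into the same power and thermal envelope.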

But this is not just a chip-level trick. Etched says LVI requires co-design across the whole stack:
math arrays
circuits
tiling algorithms
scheduling
power delivery
voltage regulators
advanced packaging
cold plates
cluster design
That is the important point. AI hardware is no longer just about one heroic chip. It is about the full system around the chip.
2. Cluster-Scale Memory
The second problem is memory.
In AI inference, especially for long-context and agentic workloads, memory becomes a bottleneck very quickly. The model needs to store weights, move activations, and manage the KV cache — a memory structure that grows as the model processes longer conversations or documents.
The longer the context, the more memory pressure.
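A rough sizing formula makes that growth concrete. The sketch below is a minimal estimate: the model shape (80 layers, 8 key/value heads, head dimension 128, 16-bit values) is an assumption roughly in line with a 70B-class model using grouped-query attention, and it is not tied to any specific Etched configuration.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every token seen so far."""
    # Factor of 2: one key vector and one value vector per layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch_size


# Assumed shape, roughly that of a 70B-class model with grouped-query attention.
ASSUMED_MODEL = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **ASSUMED_MODEL) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB per sequence")
```

Under these assumptions the cache grows linearly with context, from 1.25 GiB at 4,096 tokens to 40 GiB at 131,072 tokens, and that is for a single sequence before any batching across users.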
There are two common memory types to understand:
| Memory type | What it is good at | Limitation |
|---|---|---|
| HBM | Large capacity and high bandwidth | More latency and data movement |
| SRAM | Extremely fast and low-latency | Expensive in chip area and lower capacity |
So AI chips face a tradeoff. HBM gives capacity, but moving data can become painful. SRAM gives speed, but there is not enough of it for huge models.
Etched's answer is Cluster-Scale Memory, or CSM.
The company says it created a shared low-latency memory pool across the scale-up domain, using a proprietary ultra-low-latency, high-bandwidth interconnect. The idea is to combine the capacity of HBM with the speed benefits of SRAM-like access patterns.

This is especially important for mixture-of-experts models, or MoEs.
In MoE models, not every part of the model is used for every token. The model routes tokens to different "experts." That can improve efficiency, but it creates a systems problem: data needs to move between chips, memory layers, and networking switches to reach the right expert.
Every extra memory hop adds latency.
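The routing step itself is small. Here is a minimal sketch of top-k expert selection plus a count of how many chosen experts sit on another chip. The gate scores, the eight-expert layout, and the expert-to-chip mapping are all hypothetical, and real MoE routers differ in details such as how the weights are normalised.

```python
import math


def top_k_experts(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Return (expert_index, weight) for the k highest-scoring experts.

    Weights are a softmax over only the selected logits, which is one
    common convention among several.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def remote_dispatches(experts: list[int], expert_to_chip: dict[int, int], home_chip: int) -> int:
    """Count selected experts that live on a different chip than the token."""
    return sum(1 for e in experts if expert_to_chip[e] != home_chip)


# Hypothetical layout: 8 experts spread over 4 chips, two experts per chip.
EXPERT_TO_CHIP = {expert: expert // 2 for expert in range(8)}

logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
routed = top_k_experts(logits, k=2)
selected = [expert for expert, _ in routed]
remote = remote_dispatches(selected, EXPERT_TO_CHIP, home_chip=0)

print("selected experts:", selected)
print("remote dispatches from chip 0:", remote)
```

Each remote dispatch is a trip across the interconnect, repeated for every token at every MoE layer, which is why the latency of that path matters so much.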
Etched's thesis is that the cluster should behave more like one shared low-latency memory system, instead of forcing data through a deep and slow hierarchy.
In plain English:
For frontier inference, memory movement can matter as much as raw compute.
Why This Is Urgent
Etched is appearing at exactly the moment when AI is shifting from training scarcity to inference scarcity.
Training scarcity means: who has enough GPUs to build the next model?
Inference scarcity means: who can afford to serve the model repeatedly, reliably, and cheaply?
That second problem is becoming more important because AI usage is exploding. More users, more apps, longer prompts, tool use, coding agents, document agents, and multimodal workflows all increase inference demand.
There are three pressure points:
| Pressure point | Why it matters |
|---|---|
| Usage | More users and more applications mean more inference calls |
| Power | Data centres are running into energy and thermal constraints |
| Memory | Long context, MoE routing, and KV cache growth make data movement harder |
Etched fits directly into this moment. It is not trying to make models smarter at the research layer. It is trying to make frontier models cheaper, faster, and more practical to run.
That is a very valuable layer of the AI stack — which is presumably why investors like Jane Street, Thiel, and a TSMC-linked fund got in early and stayed quiet about it for years.
Etched vs the Field
NVIDIA is still the giant. Its GPUs are powerful, flexible, and supported by the strongest software ecosystem in AI. It's also not standing still on inference specifically — Blackwell already leans hard into inference-optimised configurations.

AMD is competing seriously too, especially with high-memory accelerators such as MI300X.

Then there's the more direct comp set: Groq, which is already shipping inference-specific chips (LPUs) at scale, and the hyperscalers building their own custom silicon in-house (AWS Inferentia and Trainium, Google TPUs), with the advantage of captive demand from their own cloud customers.
These are all impressive, but Etched's potential edge is depth of specialisation. Because Sohu is designed around transformer inference specifically, Etched can optimise for the exact operations, memory patterns, and serving constraints of modern generative AI. In theory, this can lead to:
better throughput per watt
lower latency
lower cost per token
higher rack-level efficiency
better performance on transformer inference
The advantage is not simply "more FLOPs." The advantage is more useful inference output under real constraints: power, heat, memory, networking, and latency.
That is why Etched is not only selling a chip. It is selling racks and frontier inference clusters.
The product is the full-stack system.
The Tradeoff: Flexibility vs Efficiency
The risk is also clear.
NVIDIA GPUs can adapt. If model architectures change, GPUs can still run new software. If transformers become less dominant, NVIDIA is protected by flexibility.
Etched is making a much sharper bet.
If transformers remain the dominant architecture for frontier AI, Sohu could become extremely valuable. If AI shifts heavily toward different architectures, Etched's specialisation could become a weakness.
This is the classic ASIC tradeoff:
Better efficiency, less flexibility.
Etched is effectively saying that the transformer opportunity is now large enough to justify a dedicated hardware platform.
If inference becomes the largest cost centre in AI, even a small improvement in cost per token can become enormous at scale.
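The arithmetic behind that is simple. The sketch below uses entirely hypothetical inputs (one trillion tokens served per day at $1 per million tokens, with a 5% cost improvement); none of these figures come from Etched or any provider.

```python
def annual_savings_usd(
    tokens_per_day: float,
    cost_per_million_tokens_usd: float,
    improvement: float,
) -> float:
    """Yearly savings from cutting cost per token by a given fraction."""
    if not 0.0 <= improvement <= 1.0:
        raise ValueError("improvement must be a fraction between 0 and 1")
    daily_cost = tokens_per_day / 1_000_000 * cost_per_million_tokens_usd
    return daily_cost * improvement * 365


# Hypothetical workload and pricing, chosen only to show the scale effect.
savings = annual_savings_usd(
    tokens_per_day=1e12,
    cost_per_million_tokens_usd=1.00,
    improvement=0.05,
)
print(f"Annual savings: ${savings:,.0f}")
```

With these made-up inputs, a 5% improvement is worth roughly $18 million a year, and the figure scales linearly with volume.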
The honest caveat: a successful first-pass (A0) tapeout is a real technical milestone — most chip startups don't get functional silicon on their first spin. But it's a different bar from sustaining that efficiency in production, at scale, under real customer load. Etched says first racks ship this summer. That's when this thesis actually gets tested.


