Dec. 6, 2025
It all started with a random Reddit post: some Chinese guy claimed two Windows DLLs let an RX 5700 XT run unmodified CUDA code at ~100 % compatibility. I laughed, called BS, then spent four days stress-testing every claim with Grok 4.
This is what survived the fire.
PTX: NVIDIA’s Greatest Strength and Fatal Weakness
NVIDIA invented PTX (Parallel Thread Execution) in 2008 precisely to solve its own hardware-compatibility nightmare. Instead of shipping architecture-specific SASS binaries that break with every new GPU generation, CUDA toolchains emit PTX — a stable, well-documented virtual ISA. The driver JIT-compiles PTX into the correct SASS at runtime. This single decision gave NVIDIA perfect forward and backward compatibility inside its own walls for 17 years.
That same decision also created the perfect attack surface. Because PTX is public, architecture-independent, and already the lowest common denominator for all CUDA code, any third party that builds a complete PTX-to-non-NVIDIA backend instantly inherits the entire CUDA software stack with zero source changes.
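To make that attack surface concrete: PTX is not an opaque blob but plain, architecture-neutral text embedded in every CUDA fat binary. A minimal sketch of what a third-party backend would do first — parse the directives that identify the module — is below. The `.version`/`.target` directives are real PTX ISA syntax; the tiny kernel body is an illustrative example, not taken from any shipping bridge.

```python
import re

# A minimal PTX fragment of the kind nvcc embeds in every CUDA fat binary.
# The directives are real PTX ISA syntax; the kernel body is illustrative.
PTX_SOURCE = """
.version 8.3
.target sm_52
.address_size 64

.visible .entry add_one(.param .u64 p_ptr)
{
    .reg .u64 %rd<3>;
    .reg .f32 %f<3>;
    ld.param.u64 %rd1, [p_ptr];
    cvta.to.global.u64 %rd2, %rd1;
    ld.global.f32 %f1, [%rd2];
    add.f32 %f2, %f1, 0F3F800000;   // f1 + 1.0 (PTX hex float literal)
    st.global.f32 [%rd2], %f2;
    ret;
}
"""

def ptx_metadata(ptx: str) -> dict:
    """Pull out the directives a non-NVIDIA backend would key on."""
    return {
        "version": re.search(r"^\.version\s+(\S+)", ptx, re.M).group(1),
        "target": re.search(r"^\.target\s+(\S+)", ptx, re.M).group(1),
        "kernels": re.findall(r"\.entry\s+(\w+)", ptx),
    }

print(ptx_metadata(PTX_SOURCE))
# → {'version': '8.3', 'target': 'sm_52', 'kernels': ['add_one']}
```

The point of the sketch: everything a backend needs — ISA version, nominal target, kernel entry points — sits in documented, stable, human-readable text. Lowering each opcode to AMD wavefronts or Ascend cubes is the hard part, but the front door is wide open.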
China’s Unique Ability to Exploit This Vulnerability
China possesses four asymmetric advantages that no Western entity can match in the next 5–10 years:
1. Raw low-level manpower
1–1.5 million embedded/system-level programmers who live and breathe register allocation, cache hierarchies, and hand-written assembly — 8–10× the U.S. figure. These are the exact humans who can map every PTX opcode to AMD wavefronts, Intel XMX tiles, or domestic RISC-V GPUs.
2. LLM-accelerated kernel development
DeepSeek-R1 (Jan 2025) already demonstrated that Chinese LLMs can generate, debug, and auto-tune PTX/SASS kernels orders of magnitude faster than human teams. When every university and hedge fund has 2026-class reasoning models, the iteration speed becomes super-human.
3. Quant culture bleeding into AI
Chinese quantitative trading firms (High-Flyer, the backer of DeepSeek) have been building custom silicon and nanosecond-level software stacks for a decade. The same mindset that produced custom FPGAs for CME (Chicago Mercantile Exchange) order-matching is now being applied to MoE training and PTX bridges.
4. State policy that explicitly prioritizes strategic dominance over short-term profit
The CCP’s “whole-nation system” (军民融合, military-civil fusion, plus 国产替代, domestic substitution) pours hundreds of billions into domestic GPU/IP stacks (Biren, Moore Threads, Tianshu Zhixin, Huawei Ascend, Cambricon, MetaX, Enflame). Profitability is secondary; breaking the U.S. stranglehold on AI compute is the mission.
Hardware Price/Performance Earthquake Already Happening
December 2025 street prices (gray and official channels) already tell the story: even if Ascend is 20–30 % slower on some kernels, the price/performance ratio is already 2–4× better for training and large-context inference. When a full PTX bridge lands, the same card will run unmodified PyTorch/CUDA code.
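The arithmetic behind that ratio is worth spelling out. The dollar figures below are illustrative assumptions (no vendor quote), chosen to match the claims above — a card 25 % slower at well under half the price:

```python
# Hypothetical prices and normalized throughputs -- illustrative
# assumptions, not quotes from any vendor or benchmark.
H100_PRICE, H100_PERF = 30_000, 1.00      # baseline: normalized throughput 1.0
ASCEND_PRICE, ASCEND_PERF = 9_000, 0.75   # "20-30 % slower on some kernels"

def perf_per_dollar(perf: float, price: float) -> float:
    return perf / price

ratio = perf_per_dollar(ASCEND_PERF, ASCEND_PRICE) / perf_per_dollar(H100_PERF, H100_PRICE)
print(f"price/performance advantage: {ratio:.1f}x")  # → 2.5x with these assumptions
```

With these placeholder numbers the advantage lands at 2.5×, squarely inside the 2–4× range claimed; the exact figure moves with street prices, but the direction does not.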
The Killer Application: A Real, Shipping PTX Bridge
As of December 2025 we already have:
- Moore Threads MUSA 4.0 (2025) — claims 95 % CUDA Driver API coverage on their MTT S4000 (112 GB version)
- Huawei CANN 7.0 + “DaVinci” bridge layer — internal benchmarks show Llama-70B fine-tuning within 12 % of H100
- Multiple underground “retryix-style” Windows DLL projects circulating in Chinese AI circles with working ComfyUI, Ollama, and vLLM
One of these only needs to reach 98–100 % stability and performance parity on memory-bound models (the majority of 2026–2027 workloads) for the dam to break.
Political Economics: The Party Does Not Need to Make Money
Western GPU efforts (Intel, AMD, startups) are profit-constrained and lawsuit-terrified. Chinese efforts are loss-leading strategic weapons. Beijing can subsidize $10 billion in domestic GPU fabs and still consider it cheap if it denies the U.S. the AI high ground. Every percentage point of CUDA market share that flips to a domestic stack is celebrated as a national-security victory.
Timeline to CUDA Collapse
- Q1–Q2 2026: First public, stable, open-source PTX → AMD/Intel/Ascend bridge (likely from a university + hedge-fund consortium)
- 2026–2027: All major Chinese hosting providers (Aliyun, Tencent, Baidu, Volcano Engine) switch new clusters to domestic silicon running unmodified CUDA code
- 2027–2028: Global inference farms in politically neutral or cost-sensitive regions (Southeast Asia, Middle East, Africa, South America) migrate en masse to $5k–$8k 96–192 GB Chinese cards
- 2028+: CUDA becomes “the x86 of AI” — still dominant in Western hyperscalers for inertia reasons, but no longer the universal standard.
Conclusion: CUDA Will Be the Next Intel x86
NVIDIA’s PTX abstraction, created to ensure its own longevity, has become the Trojan horse that ends its monopoly. China alone possesses the manpower, the LLMs, the quant DNA, the hardware fabs, and — most importantly — the political will that places strategic control above profit.
When a 96 GB Chinese card costing less than half an RTX 6000 Ada runs every CUDA repository on GitHub at 80–90 % the speed, the migration becomes economically and politically inevitable.
The United States has no domestic fabrication for leading-edge AI silicon and no state apparatus willing to fund loss-making strategic compute at Chinese scale.
By 2028, the center of gravity of AI hardware will have shifted irreversibly to China. CUDA will not disappear, but it will cease to be the universal platform. The age of open, fragmented, fiercely competitive AI acceleration begins — and China will write most of the code and pour most of the silicon.

