
Generating GPU Kernels for Moore Threads Silicon: Can LLMs Break CUDA Lock-In?

MusaCoder trains a 9B model to emit native GPU kernels for Moore Threads' MUSA architecture, claiming parity with frontier models on vendor-controlled benchmarks.


The core proposition in arXiv:2606.04847 is blunt: a 9-billion-parameter language model can generate native GPU kernels for Moore Threads’ MUSA architecture that match or exceed what frontier closed-source models produce. If that holds, the hardest part of moving inference workloads off NVIDIA silicon, writing hand-tuned kernels, becomes a problem an LLM can automate. The question is whether anyone outside Moore Threads’ own labs can verify it.

What MusaCoder Does

MusaCoder is a training framework that produces code-generation models targeting both CUDA and MUSA, the parallel-computing API behind Moore Threads’ domestic Chinese GPUs. The system generates native GPU kernels from problem specifications: given a computational task, the model emits the kernel code, which is then compiled, executed, and scored against correctness and performance targets.
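The compile, execute, and score loop can be illustrated with a minimal sketch. It uses plain Python functions as stand-ins for a reference implementation and a generated kernel; the names `score`, `is_correct`, and `median_runtime` are hypothetical and are not taken from the paper or from any Moore Threads tool.

```python
import math
import time
from typing import Callable, Dict, List, Sequence

# Stand-in type: a "kernel" here is just a Python function over a list of floats.
Kernel = Callable[[Sequence[float]], List[float]]


def reference_softmax(x: Sequence[float]) -> List[float]:
    """Trusted baseline the candidate is compared against."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(x: Sequence[float]) -> List[float]:
    """Stand-in for model-generated code: same math, one reciprocal."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    inv = 1.0 / sum(exps)
    return [e * inv for e in exps]


def broken_softmax(x: Sequence[float]) -> List[float]:
    """Stand-in for a faulty generation: forgets to normalize."""
    m = max(x)
    return [math.exp(v - m) for v in x]


def is_correct(candidate: Kernel, reference: Kernel,
               inputs: List[List[float]], tol: float = 1e-6) -> bool:
    """Compare outputs elementwise within a tolerance on every test input."""
    for x in inputs:
        got, want = candidate(x), reference(x)
        if len(got) != len(want):
            return False
        if any(not math.isclose(g, w, rel_tol=tol, abs_tol=tol)
               for g, w in zip(got, want)):
            return False
    return True


def median_runtime(fn: Kernel, x: Sequence[float], repeats: int = 5) -> float:
    """Median wall-clock time over a few runs, to damp timing noise."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def score(candidate: Kernel, reference: Kernel,
          inputs: List[List[float]]) -> Dict[str, object]:
    """Correctness gates performance: an incorrect kernel gets no speedup credit."""
    if not is_correct(candidate, reference, inputs):
        return {"correct": False, "speedup": 0.0}
    ref_t = median_runtime(reference, inputs[0])
    cand_t = median_runtime(candidate, inputs[0])
    return {"correct": True, "speedup": ref_t / max(cand_t, 1e-12)}


inputs = [[float(i % 17) for i in range(512)], [0.5 * i for i in range(64)]]
good = score(candidate_softmax, reference_softmax, inputs)
bad = score(broken_softmax, reference_softmax, inputs)
```

On real hardware the timing step would need device synchronization and warm-up runs; the sketch only shows the ordering of the checks, with correctness evaluated before any speedup is reported.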

The MUSA half of that pairing is what matters. Most LLM kernel-generation work targets CUDA alone, where training data is abundant and the execution environment is trivially accessible. MusaCoder treats MUSA as a first-class backend, which means training on a sparser corpus of kernel code and verifying on hardware that is not available on any major cloud provider.

How the Training Stack Works

The paper describes a three-stage pipeline.

Progressive kernel-oriented data synthesis generates training examples of increasing difficulty, building up from simple elementwise operations to complex fused kernels.

Diversity-preserving rejection fine-tuning filters candidate generations not just for correctness but for coverage across the solution space, preventing mode collapse toward a single kernel pattern.
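One way to picture a diversity-preserving filter is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: the `strategy` label, the `Candidate` record, and the per-bucket cap are all hypothetical. The idea is to reject incorrect generations as usual, then cap how many samples survive per task and strategy so that one dominant pattern cannot crowd out the rest.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    task_id: str
    source: str
    correct: bool
    speedup: float
    strategy: str  # hypothetical label, e.g. "tiled" or "vectorized"


def select_for_finetuning(candidates: List[Candidate],
                          per_strategy: int = 1) -> List[Candidate]:
    """Keep correct candidates, at most `per_strategy` per (task, strategy)."""
    buckets: Dict[Tuple[str, str], List[Candidate]] = defaultdict(list)
    for c in candidates:
        if c.correct:  # plain rejection step: drop failures
            buckets[(c.task_id, c.strategy)].append(c)

    kept: List[Candidate] = []
    for key in sorted(buckets):  # sorted for deterministic output
        ranked = sorted(buckets[key], key=lambda c: c.speedup, reverse=True)
        kept.extend(ranked[:per_strategy])  # diversity step: cap each bucket
    return kept


pool = [
    Candidate("gemm", "k1", True, 1.9, "tiled"),
    Candidate("gemm", "k2", True, 1.7, "tiled"),
    Candidate("gemm", "k3", True, 1.6, "tiled"),
    Candidate("gemm", "k4", True, 1.2, "vectorized"),
    Candidate("gemm", "k5", False, 3.0, "vectorized"),
]
kept = select_for_finetuning(pool)
```

Selecting purely by speedup would keep three near-identical tiled kernels here; the cap keeps the best tiled kernel and the slower but structurally different vectorized one.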

Execution-feedback reinforcement learning compiles and runs generated kernels on real hardware through MooreEval, a distributed verifier that scores both functional correctness and empirical speedup, then feeds these execution signals back into the model.
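How execution signals might be folded into a scalar reward can be sketched as follows. The tiers, constants, and the `ExecutionResult` record are assumptions made for illustration; the reward MusaCoder actually uses is not specified here.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    compiled: bool   # did the kernel build?
    correct: bool    # did outputs match the reference?
    speedup: float   # baseline time / candidate time


def reward(result: ExecutionResult, speedup_cap: float = 4.0) -> float:
    """Tiered reward: compile failure < wrong output < correct, plus a capped bonus.

    All constants are illustrative assumptions, not values from the paper.
    """
    if not result.compiled:
        return -1.0
    if not result.correct:
        return -0.5
    # Clamp the speedup so one noisy timing cannot dominate the gradient.
    bonus = min(max(result.speedup, 0.0), speedup_cap) / speedup_cap
    return 1.0 + bonus


r_build_fail = reward(ExecutionResult(False, False, 0.0))
r_wrong = reward(ExecutionResult(True, False, 10.0))
r_ok = reward(ExecutionResult(True, True, 2.0))
r_capped = reward(ExecutionResult(True, True, 100.0))
```

Gating the speedup bonus on correctness means a fast but wrong kernel never outscores a slow correct one, and capping the bonus limits the payoff from measurement noise.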

The RL stage introduces three stabilization techniques: PrimeEcho (anchoring multi-turn rewards to the first-turn response), Buffered Dynamic Retry (salvaging training signal from samples where every attempt failed), and MirrorPop (filtering off-policy sequences to prevent reward hacking). The problems they address are standard in code-generation RL. The contribution is the specific architecture for handling them in a GPU-kernel context where the reward signal is sparse and noisy.

The Results, and the Benchmark Question

According to the paper, the 9B-parameter MusaCoder matches or exceeds frontier closed-source models on both KernelBench and a MUSA-ported variant, while the 27B model establishes a new state of the art.

What the abstract does not provide is specific speedup percentages or absolute throughput numbers comparing MusaCoder-generated kernels against hand-tuned baselines. Both headline results are relative claims against other LLM-generated kernels, not necessarily against expert-written code. The distinction between “best LLM output” and “competitive with hand-tuned CUDA” is where the practical question lives, and the abstract alone does not settle it.

Moore Threads Hardware and the MUSA Ecosystem

Moore Threads’ current flagship, the MTT S4000, is a 48 GB GPU supporting FP64 through INT8 precision with MTLink 1.0 multi-card interconnect. It has passed China’s CAICT AI chip certification running DeepSeek-R1 671B inference, using vLLM-MUSA, MUTLASS, and Triton-MUSA from the MUSA software stack.

The ecosystem around MUSA has grown substantially. The Moore Threads developer portal lists a VS Code extension with CUDA-to-MUSA migration tools, the Moore Perf Compute profiler, and standard math libraries (muDNN, muBLAS, muFFT). Moore Threads has also open-sourced tilelang_musa, a TVM-based domain-specific language for high-performance GPU kernels including GEMM and FlashAttention, which supports the S5000, S4000, and M1000 GPUs and reports a test pass rate above 95%.

This is not a toy stack. It is a credible attempt at building a CUDA-alternative software ecosystem. The MUSA SDK documentation covers the standard GPU programming model. The question is whether credible is enough.

The CUDA Lock-In Question

NVIDIA’s moat is not purely hardware. The H100 and its successors have clear silicon advantages in raw throughput and memory bandwidth for training. But for inference workloads, where the kernel set is more constrained and the performance bar is lower, the moat is partly tooling: CUDA has a 15-year head start in libraries, profiling tools, and the accumulated body of hand-tuned kernel code that every framework depends on.

MusaCoder attacks the tooling part of that moat directly. If an LLM can generate working, performant kernels for MUSA with competitive correctness rates, the cost of porting an inference framework drops from “rewrite your kernel library” to “run the model, verify the output, and fix the failures.” That is a different engineering problem, and a cheaper one.
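The "run the model, verify the output, and fix the failures" workflow amounts to a triage loop, sketched below. The `generate` and `verify` callables are hypothetical placeholders for a kernel-generation model and a hardware verifier; the fake implementations exist only so the sketch runs.

```python
from typing import Callable, Dict, List, Optional, Tuple

Generate = Callable[[str, Optional[str]], str]           # (spec, feedback) -> source
Verify = Callable[[str, str], Tuple[bool, Optional[str]]]  # (spec, source) -> (ok, feedback)


def port_kernels(specs: List[str], generate: Generate, verify: Verify,
                 max_attempts: int = 3) -> Tuple[Dict[str, str], List[str]]:
    """Try to generate each kernel; return verified sources and the leftovers."""
    ported: Dict[str, str] = {}
    failed: List[str] = []
    for spec in specs:
        feedback: Optional[str] = None
        for _ in range(max_attempts):
            source = generate(spec, feedback)
            ok, feedback = verify(spec, source)
            if ok:
                ported[spec] = source
                break
        else:
            # No attempt verified: this kernel goes to a human engineer.
            failed.append(spec)
    return ported, failed


def fake_generate(spec: str, feedback: Optional[str]) -> str:
    """Placeholder model: succeeds on retry, never succeeds on 'scan'."""
    if spec == "scan" or feedback is None:
        return "bad"
    return "good:" + spec


def fake_verify(spec: str, source: str) -> Tuple[bool, Optional[str]]:
    """Placeholder verifier: accepts sources that start with 'good'."""
    ok = source.startswith("good")
    return ok, None if ok else "compile error"


ported, failed = port_kernels(["gemm", "softmax", "scan"],
                              fake_generate, fake_verify)
```

The cost argument rests on the size of the `failed` list: the smaller it is, the less of the kernel library still has to be written by hand.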

The paper’s claim that Moore Threads GPUs can support the complete LLM post-training stack is the logical extension: if you can train the model that generates your kernels on the same hardware that runs the kernels, you have a self-contained loop that does not touch NVIDIA at any point. For Chinese organizations facing export controls on NVIDIA silicon, that self-sufficiency is the point.

What Cannot Be Verified From Here

Every claim in this article rests ultimately on either Moore Threads’ own publications or on benchmarks that require Moore Threads hardware to reproduce. The MTT S4000 is not available on AWS, Azure, GCP, or any Western cloud provider. There is no public emulator. To verify MusaCoder’s kernel correctness or throughput claims, you need physical access to a Moore Threads GPU running the MUSA toolchain.

This is not unusual for Chinese domestic-accelerator research; it is the default state. But it means the article you are reading is structurally limited in what it can confirm. The paper’s methodology appears sound in description: execution-feedback RL with a distributed verifier is a legitimate training approach, and KernelBench is a recognized benchmark in the LLM-for-code literature. The specific results on the MUSA-ported variant are, by definition, not reproducible by anyone who cannot run MUSA.

The practical takeaway is conditional. If MusaCoder’s results hold under independent scrutiny, it is evidence that the CUDA moat for inference workloads is narrower than it looks, because the hardest part of porting, writing performant kernels, is automatable with current LLM techniques. If the results are inflated by benchmark construction or selective reporting, it is still evidence that Moore Threads is investing seriously in the tooling layer, which is where the actual lock-in lives. Either way, the CUDA-alternative question in China is no longer hypothetical.

Frequently Asked Questions

Does MusaCoder address multi-GPU communication, or only single-device kernels?

The KernelBench and MUSA-ported benchmarks evaluate single-kernel operations like GEMM, reductions, and elementwise transforms. Multi-device collective operations (all-reduce, scatter-gather across MTLink) fall outside the benchmark scope. Those communication primitives are typically the harder bottleneck in distributed training, and generating them requires reasoning about interconnect topology and latency, not just compute correctness.

How does LLM-generated kernel code compare to AMD’s ROCm or Intel’s oneAPI approach to CUDA displacement?

ROCm relies on HIPIFY, a source-to-source translator from CUDA to HIP, and oneAPI abstracts hardware through SYCL. Both assume an existing CUDA codebase to port or abstract away. MusaCoder generates target-native code from a problem specification, skipping the translation step. That avoids the unsupported-intrinsic and memory-model-divergence problems that plague source-to-source translators, but it only works when you can describe the kernel’s intent declaratively rather than pointing at existing code.

What would a team need to reproduce the MusaCoder benchmark results?

A Moore Threads S4000, S5000, or M1000 GPU, the MUSA SDK (including the mcc compiler and the muDNN library), and the MooreEval distributed verifier. None of these is available through any public cloud provider or hardware emulator. A team without physical Moore Threads hardware could review the generated kernel source for structural correctness, but throughput and speedup claims require native execution on the target silicon.

What does KernelBench actually test, and where are the blind spots?

KernelBench evaluates self-contained operations: matrix multiplies, elementwise transforms, reductions, and limited fusion patterns. It does not test dynamic shape handling, sparse operations, or production-framework scheduling where multiple kernels chain together with shared memory management. An LLM that scores well on KernelBench can generate correct individual kernels, but real inference workloads require fused compound kernels with framework-specific orchestration, which is a qualitatively harder generation target.
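For context on how such a benchmark collapses per-task results into one number, KernelBench's headline metric, fast_p, is the fraction of tasks whose generated kernel is both correct and more than p times faster than the baseline. The sketch below is a minimal rendering of that definition; the sample results are invented for illustration and do not come from the paper.

```python
from typing import List, Tuple


def fast_p(results: List[Tuple[bool, float]], p: float) -> float:
    """Fraction of tasks that are correct AND have speedup strictly above p.

    `results` holds one (correct, speedup) pair per benchmark task.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)


# Invented example data: four tasks, one of them incorrect.
sample = [(True, 1.4), (True, 0.8), (False, 3.0), (True, 2.1)]

correctness_rate = fast_p(sample, 0.0)   # correct at any speed
faster_than_base = fast_p(sample, 1.0)   # correct and faster than baseline
twice_as_fast = fast_p(sample, 2.0)      # correct and over 2x faster
```

The metric depends entirely on what the baseline is: a high fast_1 against an untuned reference implementation says little about standing against hand-tuned kernels.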

sources · 6 cited

  1. MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU (primary; accessed 2026-06-05)
  2. MTT S4000 | Moore Threads (vendor; accessed 2026-06-05)
  3. Moore Threads MTT S4000 Passes CAICT AI Chip and Large-Model Adaptation Verification [摩尔线程 MTT S4000 通过中国信通院 AI 芯片和大模型适配验证] (analysis; accessed 2026-06-05)
  4. Moore Threads Developer Portal (vendor; accessed 2026-06-05)
  5. GitHub - MooreThreads/tilelang_musa: Domain-specific language for high-performance GPU kernels (community; accessed 2026-06-05)
  6. MUSA SDK | Moore Threads (vendor; accessed 2026-06-05)