Fish-Speech: The Open-Source TTS Model That's Threatening ElevenLabs

Fish Audio's S2 model posted state-of-the-art benchmark results in March 2026 with sub-100ms latency, 80+ languages, and open weights, directly challenging ElevenLabs' commercial dominance while exposing the real costs of 'free' voice AI.

Fish Audio’s open-source Fish-Speech S2 model matches or outperforms closed commercial TTS systems on objective benchmarks—achieving sub-100ms time-to-first-audio, a 0.515 Audio Turing Test score (beating ElevenLabs in head-to-head arena rankings), and support for 80+ languages from a single model. For practitioners considering ElevenLabs at current pricing, S2 resets the evaluation entirely.

What Is Fish-Speech?

Fish-Speech is the open-source TTS project maintained by Fish Audio, a Chinese AI audio startup. The repository sits at 26,000+ GitHub stars as of March 2026, making it one of the most-watched speech synthesis projects on the platform. (GitHub. “fishaudio/fish-speech: SOTA Open Source TTS.”)

The latest generation, Fish Audio S2, released March 10, 2026, is the model that’s drawing direct comparisons to ElevenLabs. Unlike prior releases, S2 ships with model weights, training code, and the full inference engine, all released under the Fish Audio Research License, which is free for research and non-commercial use. Fish Audio simultaneously published a technical report on arXiv detailing architecture decisions and benchmark results. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)

The parent company also operates a commercial API built on the same model. This dual-track approach—open weights plus a paid API—mirrors the playbook used by Mistral, Qwen, and other frontier open-weight labs: community adoption funds enterprise deals, and enterprise deals fund model development.

How Does Fish-Speech Work?

Architecture

Fish-Speech S2 is built on a Dual-AR (Dual-Autoregressive) architecture with three primary components:

  • A Slow AR backbone (Qwen3-4B) that processes interleaved text and reference audio tokens, autoregressively generating the primary semantic codebook
  • A Fast AR decoder (4-layer Transformer, ~400M parameters) that generates the remaining nine residual acoustic codebooks at each time step
  • A VQGAN vocoder that reconstructs waveforms from the full discrete token sequence
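
The division of labor between the two autoregressive stages can be sketched with stand-in functions. This is a toy illustration, not Fish Audio's implementation: the random "models", the `CODEBOOK_SIZE` value, and the choice to feed only the semantic token back into the slow context are all assumptions made to keep the example small. What it does show is the shape of the output: one semantic token plus nine residual tokens per time step.

```python
import random
from typing import List

NUM_CODEBOOKS = 10    # 1 semantic + 9 residual codebooks, as described above
CODEBOOK_SIZE = 1024  # hypothetical vocabulary size, for illustration only


def slow_ar_step(context: List[int], rng: random.Random) -> int:
    """Stand-in for the Slow AR backbone: emits one semantic token per time step."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token: int, rng: random.Random) -> List[int]:
    """Stand-in for the Fast AR decoder: emits the nine residual tokens for one frame."""
    residuals: List[int] = []
    for _ in range(NUM_CODEBOOKS - 1):
        # A real decoder would condition on semantic_token and the residuals so far.
        residuals.append(rng.randrange(CODEBOOK_SIZE))
    return residuals


def generate_frames(prompt_tokens: List[int], num_frames: int, seed: int = 0) -> List[List[int]]:
    """Run the outer (slow) loop and the inner (fast) loop for num_frames time steps."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    frames: List[List[int]] = []
    for _ in range(num_frames):
        semantic = slow_ar_step(context, rng)
        residuals = fast_ar_step(semantic, rng)
        frames.append([semantic] + residuals)
        context.append(semantic)  # simplification: only the semantic token is fed back
    return frames


frames = generate_frames(prompt_tokens=[1, 2, 3], num_frames=5)
print(len(frames), "frames x", len(frames[0]), "codebooks")
```

In the real pipeline, the resulting frames-by-codebooks grid is what the vocoder would turn back into a waveform.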

The pipeline requires as little as 4GB of GPU VRAM for inference—accessible on consumer hardware. For production serving, S2 integrates with SGLang and inherits LLM-native optimizations including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. Fish Audio reports an average prefix-cache hit rate of 86.4%, with peaks above 90%. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
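
To see why prefix caching suits voice cloning, consider repeated requests that share the same reference-audio tokens and differ only in the script. The sketch below is a toy model and not SGLang's RadixAttention: it uses a linear scan instead of a radix tree, and it assumes a token-level definition of hit rate (reused prompt tokens divided by total prompt tokens), which may differ from how the reported 86.4% figure was measured.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the common leading run of tokens in a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers past prompts and counts how many leading tokens each new prompt can reuse."""

    def __init__(self) -> None:
        self.seen: List[List[int]] = []
        self.reused_tokens = 0
        self.total_tokens = 0

    def submit(self, tokens: Sequence[int]) -> int:
        # Best match against any earlier prompt; 0 when the cache is empty.
        reused = max((shared_prefix_len(tokens, old) for old in self.seen), default=0)
        self.seen.append(list(tokens))
        self.reused_tokens += reused
        self.total_tokens += len(tokens)
        return reused

    def hit_rate(self) -> float:
        return self.reused_tokens / self.total_tokens if self.total_tokens else 0.0


reference_prefix = list(range(80))  # stand-in for one speaker's reference-audio tokens
cache = ToyPrefixCache()
for i in range(3):
    script_tokens = [1000 + i] * 20  # three different scripts for the same voice
    cache.submit(reference_prefix + script_tokens)

print(f"hit rate: {cache.hit_rate():.3f}")
```

With three 100-token prompts sharing an 80-token prefix, the first request reuses nothing and the next two reuse 80 tokens each, so the toy hit rate is 160/300. The longer the shared reference prefix is relative to the script, the higher that fraction climbs.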

Zero-Shot Voice Cloning

The standout capability is zero-shot voice cloning: provide 3–10 seconds of reference audio and S2 replicates that voice across any script, in any supported language, without fine-tuning. The model was trained on over 10 million hours of audio across approximately 80 languages, enabling cross-lingual transfer—clone an English voice and read Japanese copy. (MarkTechPost. “Fish Audio Releases Fish Audio S2: A New Generation of Expressive TTS with Absurdly Controllable Emotion.” March 10, 2026)

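# Fish-Speech inference example (local deployment).
# The TTSEngine interface shown here is illustrative; check the fish-speech
# repository for the current module path and entry points.
from fish_speech.inference import TTSEngine

# Load the S2 weights.
engine = TTSEngine.load("fishaudio/fish-speech-s2")

# Zero-shot cloning: the reference clip sets the voice, with no fine-tuning step.
audio = engine.synthesize(
    text="The market shifted overnight.",
    reference_audio="speaker_sample.wav",  # 3-10 seconds
    language="en",
)
audio.save("output.wav")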
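
Since the recommended reference clip is 3 to 10 seconds long, it can help to check a file's duration before sending it to the model. The following is a minimal sketch using only Python's standard `wave` module. It assumes an uncompressed WAV file, and the helper names and default bounds are illustrative rather than part of any Fish Audio tooling.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True when the clip length falls inside the 3-10 second window cited above."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A caller might run `is_usable_reference("speaker_sample.wav")` and trim or re-record the clip when it returns `False`.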

Emotion Control via Natural Language Tags

S2’s most technically differentiated feature is inline emotion control using free-form natural language tags. Rather than a fixed predefined set, users insert bracketed instructions anywhere in a script:

[whispering] The acquisition closed this morning. [normal]
Nobody outside this room knows yet.
[urgent, hushed] Keep it that way.

The system recognizes 15,000+ distinct tags covering emotion, tone, volume, pitch, and pacing—written in plain language rather than a proprietary syntax. (Fish Audio Docs. “Emotion & Expression Control.”) Fish Audio describes this as “absurdly controllable emotion,” and the EmergentTTS-Eval benchmark backs the claim: S2 achieves a 91.61% win rate specifically on paralinguistic tasks. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
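
For tooling around tagged scripts (linting, previewing, or logging which tags are in use), the bracketed format is easy to split with a regular expression. The sketch below is a hypothetical client-side helper, not part of Fish Audio's API. It assumes a tag applies to all text up to the next tag, which matches the `[normal]` reset in the example above but is not confirmed by the documentation cited here.

```python
import re
from typing import List, Optional, Tuple

# A tag is any bracketed run of text that contains no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_script(script: str) -> List[Tuple[Optional[str], str]]:
    """Split a script into (tag, text) pairs; text before the first tag gets tag None."""
    segments: List[Tuple[Optional[str], str]] = []
    current_tag: Optional[str] = None
    position = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[position:match.start()].strip()
        if text:
            segments.append((current_tag, text))
        current_tag = match.group(1).strip()
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


script = (
    "[whispering] The acquisition closed this morning. [normal]\n"
    "Nobody outside this room knows yet.\n"
    "[urgent, hushed] Keep it that way."
)
for tag, text in split_tagged_script(script):
    print(f"{tag!r}: {text}")
```

On the example script this yields three segments, tagged `whispering`, `normal`, and `urgent, hushed` in that order.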

Fish-Speech vs. ElevenLabs: The Benchmark Picture

ElevenLabs is the benchmark target the community uses, so let’s examine what the numbers actually show.

| Metric | Fish Audio S2 | ElevenLabs | Notes |
|---|---|---|---|
| Audio Turing Test score | 0.515 | ~0.387–0.417 range | S2 surpasses Seed-TTS (0.417) by 24% (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026) |
| TTS-Arena ranking (Oct 2025) | #1 (S1 model) | ~#2 | Independent arena eval (AI Tool Analysis. “Fish Audio Review 2026: The ElevenLabs Killer That’s 6x Cheaper?”) |
| WER (Seed-TTS Eval) | Lowest among all | Not publicly disclosed | S2 beats Qwen3-TTS, MiniMax, Seed-TTS (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026) |
| TTFA (H200 GPU) | ~100ms | Not published | Suitable for real-time conversational agents |
| API pricing | 50–70% cheaper | Baseline | Per-character at comparable quality (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026) |
| Voice cloning input | 3–10 seconds | ~1 minute for best results | Zero-shot vs. ElevenLabs instant voice cloning |
| Languages | 80+ | 32 | Single model, no per-language switching |
| Commercial license | Requires separate agreement | Included in paid plans | Fish Audio Research License for open weights |
| Model weights | Open (Fish Audio Research License) | Closed | Training code also available for S2 |

The arena rankings deserve nuance. TTS-Arena scores reflect aggregate human preference across diverse test cases, and ElevenLabs retains advantages in voice cloning fidelity for high-quality input audio—some independent reviewers find ElevenLabs clones more natural-sounding for narration use cases. (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026) S2’s edge is strongest in paralinguistic expressivity and breadth of language coverage.

Why Does Fish-Speech Matter?

The Economics Are Upside Down for ElevenLabs

ElevenLabs raised $500M in a Series D at an $11B valuation in February 2026, ended 2025 at roughly $350M ARR, and announced it had crossed $500M ARR in early May 2026—pulling in additional investors including BlackRock, NVIDIA, and Sequoia in a third close of the round. (ElevenLabs. “ElevenLabs crosses $500M ARR and welcomes new investors.” May 2026) That’s a legitimate business, growing fast. But a model that matches their output quality at 50–70% lower API cost, backed by open weights, is a structural threat that doesn’t resolve through product iteration alone.

The relevant precedent is Stable Diffusion versus Midjourney. Once open-source image generation reached comparable quality to the leading commercial product, pricing power for premium-only players collapsed rapidly. ElevenLabs is a more defensible business—enterprise contracts, multi-year commitments, product integrations—but the directional pressure is the same.

What Open-Source TTS Competition Looks Like in 2026

Fish-Speech isn’t the only pressure point. The open-source TTS ecosystem has matured considerably:

The pattern across all three is the same: open weights with non-commercial restrictions. True Apache or MIT-licensed TTS at commercial quality remains an open problem.

Consumer Hardware Performance Is Finally Viable

S2 achieves sub-100ms TTFA on NVIDIA H200 GPUs—suitable for real-time applications. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026) But more practically, on consumer hardware:

  • RTX 3060 (12GB VRAM): 1 minute of audio in ~15 seconds, roughly four times faster than real time
  • RTX 4090: ~1 real-time ratio
  • Server-scale H200: 3,000+ acoustic tokens per second

The 4GB minimum VRAM requirement means S2 runs on most mid-range GPUs released since 2022. For developers self-hosting voice generation pipelines, the inference economics are increasingly compelling relative to per-character API billing.
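
The self-hosting arithmetic is simple enough to write down. The sketch below turns the RTX 3060 figure quoted in this article (about 15 seconds of compute per minute of audio) into a real-time factor and a GPU-hours estimate. The function names are illustrative, and `gpu_cost_per_hour` is a placeholder to fill in with your own hardware or rental cost, not a quoted price.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio length; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def gpu_hours_needed(audio_hours: float, rtf: float) -> float:
    """GPU time required to synthesize audio_hours of speech at a given real-time factor."""
    return audio_hours * rtf


def self_hosted_cost(audio_hours: float, rtf: float, gpu_cost_per_hour: float) -> float:
    """Compute-only cost estimate; ignores storage, idle time, and engineering overhead."""
    return gpu_hours_needed(audio_hours, rtf) * gpu_cost_per_hour


rtf_3060 = real_time_factor(compute_seconds=15, audio_seconds=60)
print(f"RTF: {rtf_3060}, speedup: {1 / rtf_3060}x")
print(f"GPU hours for 100 h of audio: {gpu_hours_needed(100, rtf_3060)}")
```

At a real-time factor of 0.25, 100 hours of finished audio needs about 25 GPU hours. Comparing that figure against a per-character API bill requires your own character counts and pricing, which this sketch deliberately leaves out.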

What ElevenLabs Still Does Better

Credit where it’s due: ElevenLabs maintains genuine advantages in specific areas.

Voice cloning fidelity for studio-quality input: When fed high-quality reference recordings, ElevenLabs produces clones that independent reviewers consistently describe as more natural and coherent for long-form narration. (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026) S2 clones can exhibit occasional artificiality, particularly on voices with unusual prosodic patterns.

Enterprise infrastructure: Compliance certifications, SLAs, dedicated support, and the kind of organizational trust that gets TTS deployed in financial services and healthcare aren’t things you can open-source. ElevenLabs’ transition toward 60–70% enterprise revenue by 2027 reflects this moat. (MLQ.ai. “ElevenLabs Lands $500M Series D Round at $11B Valuation.”)

Stability and versioning: Commercial APIs maintain stable endpoints. Open-source models iterate rapidly—which is a feature for researchers and a risk for production pipelines that can’t absorb breaking changes.

Frequently Asked Questions

Q: Can I use Fish-Speech S2 in a commercial product? A: Not directly with the open weights—S2 is licensed under the Fish Audio Research License, which permits free use for research and non-commercial purposes only. You’ll need to negotiate a commercial license with Fish Audio or use their paid API, which permits commercial deployment under standard API terms.

Q: How does Fish-Speech perform on languages other than English and Chinese? A: S2 was trained on 10M+ hours across approximately 80 languages, and supports zero-shot cross-lingual voice transfer. Quality varies by language—English and Chinese have the strongest coverage given training data distribution, while lower-resource languages may show degraded accuracy.

Q: What hardware do I need to run Fish-Speech locally? A: A minimum of 4GB GPU VRAM for inference. An RTX 3060 generates audio at roughly four times real-time speed (15 seconds to produce 1 minute of speech). For real-time applications, a high-end consumer or server GPU is required.

Q: Does Fish-Speech S2 support streaming output? A: Yes. S2’s inference engine supports streaming with a time-to-first-audio under 100ms on server hardware, making it viable for conversational agents and live applications—not just batch voiceover generation.
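
If you want to verify time-to-first-audio on your own hardware, you can time the arrival of the first non-empty chunk from whatever streaming interface you use. The sketch below is client-agnostic and assumes only that the stream is an iterable of `bytes`. `fake_stream` is a stand-in generator for demonstration, not a Fish Audio API.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    ttfa: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if ttfa is None and chunk:
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    return ttfa, total_bytes


def fake_stream(first_delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response: waits, then yields fixed-size chunks."""
    time.sleep(first_delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320


ttfa, total = measure_ttfa(fake_stream())
print(f"TTFA: {ttfa * 1000:.1f} ms, bytes received: {total}")
```

The timer starts before the first chunk is requested, so the measurement includes request setup as well as model latency. That is usually the number that matters for a conversational agent.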

Q: Is this actually better than ElevenLabs, or just cheaper? A: Both. On objective benchmarks (Audio Turing Test, WER, TTS-Arena), S2 matches or exceeds ElevenLabs in most categories as of March 2026. ElevenLabs retains an edge in voice cloning fidelity for high-quality input audio and in long-form narration naturalness per subjective reviews—but the gap is narrow and the price differential is 50–70%.


Sources:

  1. GitHub. "fishaudio/fish-speech: SOTA Open Source TTS." Community source. Accessed 2026-04-24.
  2. Fish Audio. "Fish Audio S2 Technical Report." arXiv, March 2026. Primary source. Accessed 2026-05-22.
  3. MarkTechPost. "Fish Audio Releases Fish Audio S2: A New Generation of Expressive TTS with Absurdly Controllable Emotion." March 10, 2026. Analysis. Accessed 2026-04-24.
  4. Fish Audio Docs. "Emotion & Expression Control." Vendor source. Accessed 2026-04-24.
  5. AI Tool Analysis. "Fish Audio Review 2026: The ElevenLabs Killer That's 6x Cheaper?" Analysis. Accessed 2026-04-24.
  6. SaaS24 Reviews. "Fish.audio vs ElevenLabs: My Hands-on Experience with Both." 2026. Analysis. Accessed 2026-04-24.
  7. MLQ.ai. "ElevenLabs Lands $500M Series D Round at $11B Valuation." Analysis. Accessed 2026-04-24.
  8. BentoML. "The Best Open-Source Text-to-Speech Models in 2026." Vendor source. Accessed 2026-04-24.
  9. Fish Audio. Analysis. Accessed 2026-04-24.
  10. UBOS. "Fish Audio S2 Launches Next-Gen Real-Time Expressive TTS." Analysis. Accessed 2026-04-24.
  11. "Fish Audio vs ElevenLabs: Pricing & Feature Comparison 2025." Analysis. Accessed 2026-04-24.
  12. Emelia. "Fish Speech S2: The Open Source TTS Rivaling ElevenLabs." Analysis. Accessed 2026-04-24.
  13. DEV Community. "Fish Audio S2-Pro: A TTS Model with Emotion in Speech Controlled with Natural Language." Analysis. Accessed 2026-04-24.
  14. "FISH-SPEECH: Leveraging Large Language Models for Advanced Multilingual TTS Synthesis." arXiv. Primary source. Accessed 2026-04-24.
  15. CompWorth. "ElevenLabs: Revenue, Worth, Valuation & Competitors 2026." Analysis. Accessed 2026-04-24.
  16. Fish Audio. "What We Mean by Open Source, and Why It Matters for S2." Vendor source. Accessed 2026-05-22.
  17. ElevenLabs. "ElevenLabs crosses $500M ARR and welcomes new investors." May 2026. Primary source. Accessed 2026-05-22.