
Fish Audio’s open-source Fish-Speech S2 model matches or outperforms closed commercial TTS systems on objective benchmarks—achieving sub-100ms time-to-first-audio, a 0.515 Audio Turing Test score (beating ElevenLabs in head-to-head arena rankings), and support for 80+ languages from a single model. For practitioners considering ElevenLabs at current pricing, S2 resets the evaluation entirely.

What Is Fish-Speech?

Fish-Speech is the open-source TTS project maintained by Fish Audio, a Chinese AI audio startup. The repository sits at 26,000+ GitHub stars as of March 2026, making it one of the most-watched speech synthesis projects on the platform.1

The latest generation—Fish Audio S2, released March 10, 2026—is the model that’s drawing direct comparisons to ElevenLabs. Unlike prior releases, S2 ships with model weights, training code, and the full inference engine all available under open-source terms. Fish Audio simultaneously published a technical report on arXiv detailing architecture decisions and benchmark results.2

The parent company also operates a commercial API built on the same model. This dual-track approach—open weights plus a paid API—mirrors the playbook used by Mistral, Qwen, and other frontier open-weight labs: community adoption funds enterprise deals, and enterprise deals fund model development.

How Does Fish-Speech Work?

Architecture

Fish-Speech S2 combines three primary components:

  • A LLaMA-based transformer backbone that processes input text and generates discrete speech tokens
  • A VQGAN (Vector Quantized Generative Adversarial Network) vocoder that reconstructs waveforms from those tokens
  • A VITS synthesis layer for final audio refinement and speaker conditioning

The pipeline requires as little as 4GB of GPU VRAM for inference—accessible on consumer hardware. For production serving, S2 integrates with SGLang and inherits LLM-native optimizations including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. Fish Audio reports an average prefix-cache hit rate of 86.4%, with peaks above 90%.2
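The three-stage flow can be sketched as plain functions. This is a structural sketch only: the real S2 components are neural networks, and every name below is illustrative, not the actual fish-speech API.

```python
from dataclasses import dataclass

@dataclass
class Waveform:
    samples: list   # placeholder for a real PCM audio buffer
    sample_rate: int

def text_to_tokens(text: str) -> list:
    """Stage 1: the LLaMA-based backbone maps text to discrete speech tokens.
    Stubbed here as a character-level encoding; the real model uses a learned codebook."""
    return [ord(c) % 1024 for c in text]

def tokens_to_waveform(tokens: list, sample_rate: int = 44100) -> Waveform:
    """Stage 2: the VQGAN vocoder reconstructs audio from speech tokens.
    Stubbed as silence of proportional length (assumes ~50 tokens per second of audio)."""
    samples_per_token = sample_rate // 50
    return Waveform(samples=[0.0] * (len(tokens) * samples_per_token),
                    sample_rate=sample_rate)

def refine(wave: Waveform, speaker_embedding=None) -> Waveform:
    """Stage 3: the VITS layer applies speaker conditioning and final polish.
    Stubbed as a pass-through."""
    return wave

def synthesize(text: str) -> Waveform:
    """Chain the three stages, mirroring the S2 pipeline shape."""
    return refine(tokens_to_waveform(text_to_tokens(text)))

wave = synthesize("The market shifted overnight.")
print(len(wave.samples), wave.sample_rate)
```

The point of the structure is that everything up to the vocoder is an ordinary autoregressive token stream, which is why LLM serving machinery (batching, KV cache, prefix caching) applies directly.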

Zero-Shot Voice Cloning

The standout capability is zero-shot voice cloning: provide 3–10 seconds of reference audio and S2 replicates that voice across any script, in any supported language, without fine-tuning. The model was trained on over 10 million hours of audio across approximately 80 languages, enabling cross-lingual transfer—clone an English voice and read Japanese copy.3

# Fish-Speech inference example (local deployment)
from fish_speech.inference import TTSEngine

engine = TTSEngine.load("fishaudio/fish-speech-s2")
audio = engine.synthesize(
    text="The market shifted overnight.",
    reference_audio="speaker_sample.wav",  # 3-10 seconds of the target voice
    language="en",
)
audio.save("output.wav")

Emotion Control via Natural Language Tags

S2’s most technically differentiated feature is inline emotion control using free-form natural language tags. Rather than a fixed predefined set, users insert bracketed instructions anywhere in a script:

[whispering] The acquisition closed this morning. [normal]
Nobody outside this room knows yet.
[urgent, hushed] Keep it that way.

The system recognizes 15,000+ distinct tags covering emotion, tone, volume, pitch, and pacing—written in plain language rather than a proprietary syntax.4 Fish Audio describes this as “absurdly controllable emotion,” and the EmergentTTS-Eval benchmark backs the claim: S2 achieves a 91.61% win rate specifically on paralinguistic tasks.2
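Because the tags are free-form bracketed text, a client can split a script into (tags, text) segments with a few lines of parsing. The sketch below illustrates the tag syntax shown above; it is not Fish Audio's implementation, and how S2 consumes tags internally is not part of the public spec.

```python
import re

TAG = re.compile(r"\[([^\]]+)\]")

def split_tagged_script(script: str) -> list:
    """Split a script into (tags, text) segments, where each segment carries
    the emotion tags in force when that text is spoken. A tag like
    [urgent, hushed] can hold several comma-separated instructions."""
    segments = []
    tags = []
    pos = 0
    for m in TAG.finditer(script):
        text = script[pos:m.start()].strip()
        if text:
            segments.append((tags, text))
        tags = [t.strip() for t in m.group(1).split(",")]
        pos = m.end()
    tail = script[pos:].strip()
    if tail:
        segments.append((tags, tail))
    return segments

script = ("[whispering] The acquisition closed this morning. [normal] "
          "Nobody outside this room knows yet. [urgent, hushed] Keep it that way.")
for tags, text in split_tagged_script(script):
    print(tags, "->", text)
```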

Fish-Speech vs. ElevenLabs: The Benchmark Picture

ElevenLabs is the benchmark target the community uses, so let’s examine what the numbers actually show.

| Metric | Fish Audio S2 | ElevenLabs | Notes |
| --- | --- | --- | --- |
| Audio Turing Test score | 0.515 | ~0.387–0.417 range | S2 surpasses Seed-TTS (0.417) by 24%2 |
| TTS-Arena ranking (Oct 2025) | #1 (S1 model) | ~#2 | Independent arena eval5 |
| WER (Seed-TTS Eval) | Lowest among all | Not publicly disclosed | S2 beats Qwen3-TTS, MiniMax, Seed-TTS2 |
| TTFA (H200 GPU) | ~100ms | Not published | Suitable for real-time conversational agents |
| API pricing | 50–70% cheaper | Baseline | Per-character at comparable quality6 |
| Voice cloning input | 3–10 seconds | ~1 minute for best results | Zero-shot vs. ElevenLabs instant voice cloning |
| Languages | 80+ | 32 | Single model, no per-language switching |
| Commercial license | Requires separate agreement | Included in paid plans | CC-BY-NC-SA for open weights |
| Model weights | Open (CC-BY-NC-SA) | Closed | Training code also available for S2 |

The arena rankings deserve nuance. TTS-Arena scores reflect aggregate human preference across diverse test cases, and ElevenLabs retains advantages in voice cloning fidelity for high-quality input audio—some independent reviewers find ElevenLabs clones more natural-sounding for narration use cases.6 S2’s edge is strongest in paralinguistic expressivity and breadth of language coverage.

Why Does Fish-Speech Matter?

The Economics Are Upside Down for ElevenLabs

ElevenLabs raised $500M in a Series D at an $11B valuation in February 2026 and is reporting $330M in annual recurring revenue—up 175% year-over-year.7 That’s a legitimate business. But a model that matches their output quality at 50–70% lower API cost, backed by open weights, is a structural threat that doesn’t resolve through product iteration alone.
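The per-character arithmetic makes the pressure concrete. The 50–70% differential comes from the comparison above; the workload and the $150-per-million-characters baseline rate below are hypothetical round numbers for illustration, not ElevenLabs' actual pricing.

```python
def annual_tts_cost(chars_per_month: int, usd_per_million_chars: float) -> float:
    """Annual spend for a per-character-billed TTS API."""
    return chars_per_month * 12 * usd_per_million_chars / 1_000_000

# Hypothetical workload: 50M characters/month of generated speech.
# $150 per 1M characters is an illustrative baseline rate, not a quoted price.
baseline = annual_tts_cost(50_000_000, 150.0)
cheaper = baseline * (1 - 0.60)  # midpoint of the reported 50-70% differential
print(f"baseline: ${baseline:,.0f}/yr, at -60%: ${cheaper:,.0f}/yr")
```

At any baseline rate, a 60% discount on a per-character bill compounds linearly with volume, which is why the threat lands hardest on high-throughput customers.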

The relevant precedent is Stable Diffusion versus Midjourney. Once open-source image generation reached comparable quality to the leading commercial product, pricing power for premium-only players collapsed rapidly. ElevenLabs is a more defensible business—enterprise contracts, multi-year commitments, product integrations—but the directional pressure is the same.

What Open-Source TTS Competition Looks Like in 2026

Fish-Speech isn’t the only pressure point. The open-source TTS ecosystem has matured considerably:

  • Kokoro (Apache 2.0): 82M parameters, processes text in under 0.3 seconds, best-in-class for speed on constrained hardware—but no voice cloning support.8
  • XTTS-v2 (Coqui): Zero-shot voice cloning from 6-second clips across 17 languages, but restricted to non-commercial use under the Coqui Public Model License.8
  • Fish-Speech S2: The current quality leader with the broadest language support and inline emotion control, but also non-commercial without a separate license agreement.

The pattern across all three is the same: open weights with non-commercial restrictions. True Apache or MIT-licensed TTS at commercial quality remains an open problem.

Consumer Hardware Performance Is Finally Viable

S2 achieves sub-100ms TTFA on NVIDIA H200 GPUs—suitable for real-time applications.2 But more practically, on consumer hardware:

  • RTX 3060 (12GB VRAM): 1 minute of audio in ~15 seconds (roughly 4× real time)
  • RTX 4090: faster still, comfortably above real time
  • Server-scale H200: 3,000+ acoustic tokens per second
The 4GB minimum VRAM requirement means S2 runs on most mid-range GPUs released since 2022. For developers self-hosting voice generation pipelines, the inference economics are increasingly compelling relative to per-character API billing.
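Two small helpers capture the self-hosting math: the real-time factor (seconds of audio produced per second of compute) and the 4GB VRAM floor reported above. Both are trivial arithmetic on figures from this article, not measurements.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute.
    A value above 1.0 means the GPU synthesizes faster than playback."""
    return audio_seconds / wall_seconds

def fits_s2(vram_gb: float, required_gb: float = 4.0) -> bool:
    """Whether a GPU meets the reported 4GB VRAM minimum for S2 inference."""
    return vram_gb >= required_gb

# RTX 3060 figure from above: 1 minute of audio in ~15 seconds.
rtx3060 = realtime_factor(60.0, 15.0)
print(f"RTX 3060: ~{rtx3060:.0f}x real time, meets VRAM floor: {fits_s2(12.0)}")
```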

What ElevenLabs Still Does Better

Credit where it’s due: ElevenLabs maintains genuine advantages in specific areas.

Voice cloning fidelity for studio-quality input: When fed high-quality reference recordings, ElevenLabs produces clones that independent reviewers consistently describe as more natural and coherent for long-form narration.6 S2 clones can exhibit occasional artificiality, particularly on voices with unusual prosodic patterns.

Enterprise infrastructure: Compliance certifications, SLAs, dedicated support, and the kind of organizational trust that gets TTS deployed in financial services and healthcare aren’t things you can open-source. ElevenLabs’ transition toward 60–70% enterprise revenue by 2027 reflects this moat.7

Stability and versioning: Commercial APIs maintain stable endpoints. Open-source models iterate rapidly—which is a feature for researchers and a risk for production pipelines that can’t absorb breaking changes.

Frequently Asked Questions

Q: Can I use Fish-Speech S2 in a commercial product? A: Not directly with the open weights—S2 is licensed CC-BY-NC-SA-4.0, which prohibits commercial use. You’ll need to negotiate a commercial license with Fish Audio or use their paid API, which permits commercial deployment under standard API terms.

Q: How does Fish-Speech perform on languages other than English and Chinese? A: S2 was trained on 10M+ hours across approximately 80 languages, and supports zero-shot cross-lingual voice transfer. Quality varies by language—English and Chinese have the strongest coverage given training data distribution, while lower-resource languages may show degraded accuracy.

Q: What hardware do I need to run Fish-Speech locally? A: A minimum of 4GB GPU VRAM for inference. An RTX 3060 generates audio at roughly 4× real time (15 seconds to produce 1 minute of speech). For real-time applications, a high-end consumer or server GPU is required.

Q: Does Fish-Speech S2 support streaming output? A: Yes. S2’s inference engine supports streaming with a time-to-first-audio under 100ms on server hardware, making it viable for conversational agents and live applications—not just batch voiceover generation.
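From the client side, the relevant pattern is simple: start playback on the first chunk instead of waiting for the full utterance, and measure time-to-first-audio as the latency until that chunk arrives. The generator below is a stand-in for a streaming endpoint (a real S2 deployment would stream over HTTP or WebSocket); chunk sizes and delays are simulated, not measured.

```python
import time

def fake_stream(chunks: int, chunk_delay_s: float = 0.02):
    """Stand-in for a streaming TTS endpoint yielding audio chunks."""
    for _ in range(chunks):
        time.sleep(chunk_delay_s)       # simulated generation latency
        yield b"\x00" * 3200            # e.g. 100ms of 16kHz 16-bit mono PCM

def measure_ttfa(stream):
    """Return (time-to-first-audio, total bytes received).
    A real client would hand the first chunk to the audio device immediately."""
    start = time.monotonic()
    first = next(stream)
    ttfa = time.monotonic() - start
    total = len(first) + sum(len(c) for c in stream)
    return ttfa, total

ttfa, total = measure_ttfa(fake_stream(5))
print(f"TTFA: {ttfa * 1000:.0f}ms, received {total} bytes")
```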

Q: Is this actually better than ElevenLabs, or just cheaper? A: Both. On objective benchmarks (Audio Turing Test, WER, TTS-Arena), S2 matches or exceeds ElevenLabs in most categories as of March 2026. ElevenLabs retains an edge in voice cloning fidelity for high-quality input audio and in long-form narration naturalness per subjective reviews—but the gap is narrow and the price differential is 50–70%.


Footnotes

  1. GitHub. “fishaudio/fish-speech: SOTA Open Source TTS.” https://github.com/fishaudio/fish-speech

  2. Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026. https://arxiv.org/html/2603.08823v2

  3. MarkTechPost. “Fish Audio Releases Fish Audio S2: A New Generation of Expressive TTS with Absurdly Controllable Emotion.” March 10, 2026. https://www.marktechpost.com/2026/03/10/fish-audio-releases-fish-audio-s2-a-new-generation-of-expressive-text-to-speech-tts-with-absurdly-controllable-emotion/

  4. Fish Audio Docs. “Emotion & Expression Control.” https://docs.fish.audio/developer-guide/best-practices/emotion-control

  5. AI Tool Analysis. “Fish Audio Review 2026: The ElevenLabs Killer That’s 6x Cheaper?” https://aitoolanalysis.com/fish-audio-review/

  6. SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026. https://saas24reviews.com/fish-audio-vs-elevenlabs

  7. MLQ.ai. “ElevenLabs Lands $500M Series D Round at $11B Valuation.” https://mlq.ai/news/elevenlabs-lands-500m-series-d-round-at-11-billion-valuation/

  8. BentoML. “The Best Open-Source Text-to-Speech Models in 2026.” https://www.bentoml.com/blog/exploring-the-world-of-open-source-text-to-speech-models
