Fish-Speech: The Open-Source TTS Model That's Threatening ElevenLabs

Fish Audio’s open-source Fish-Speech S2 model matches or outperforms closed commercial TTS systems on objective benchmarks, with sub-100ms time-to-first-audio, a 0.515 Audio Turing Test score, and support for 80+ languages from a single model. For practitioners considering ElevenLabs at current pricing, S2 resets the evaluation entirely.

What Is Fish-Speech?

Fish-Speech is the open-source TTS project maintained by Fish Audio, a Chinese AI audio startup, and one of the most-watched open-source speech synthesis repositories on GitHub.

The latest generation, Fish Audio S2 (released March 10, 2026), is the model that’s drawing direct comparisons to ElevenLabs. Unlike prior releases, S2 ships with model weights, training code, and the full inference engine, all publicly available under the Fish Audio Research License (free for research and non-commercial use). Fish Audio simultaneously published a technical report on arXiv detailing architecture decisions and benchmark results. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)

Fish Audio also operates a commercial API built on the same model. This dual-track approach, open weights plus a paid API, mirrors the playbook used by Mistral, Qwen, and other frontier open-weight labs: community adoption feeds enterprise deals, and enterprise deals fund model development.

How Does Fish-Speech Work?

Architecture

Fish-Speech S2 is built on a Dual-AR (Dual-Autoregressive) architecture with three primary components:

A Slow AR backbone (Qwen3-4B) that processes interleaved text and reference audio tokens, autoregressively generating the primary semantic codebook
A Fast AR decoder (4-layer Transformer, ~400M parameters) that generates the remaining nine residual acoustic codebooks at each time step
A VQGAN vocoder that reconstructs waveforms from the full discrete token sequence
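
The division of labour between the two autoregressive stages can be sketched in a few lines. This is a toy illustration only: `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that return random token ids, the codebook size of 1024 is an assumption, and none of it reflects Fish Audio's actual code. It shows just the control flow described above: one semantic token per time step from the slow model, then nine residual tokens from the fast model.

```python
import random

NUM_RESIDUAL_CODEBOOKS = 9   # "remaining nine residual acoustic codebooks"
CODEBOOK_SIZE = 1024         # assumed for illustration, not from the report


def slow_ar_step(history, rng):
    """Hypothetical stand-in for the Slow AR backbone: next semantic token."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, partial_frame, rng):
    """Hypothetical stand-in for the Fast AR decoder: next residual token."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_steps, seed=0):
    """Return num_steps frames, each holding 1 semantic + 9 residual tokens."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_steps):
        # Slow path: one semantic token conditioned on everything so far.
        semantic = slow_ar_step(frames, rng)
        frame = [semantic]
        # Fast path: fill in the residual codebooks for this time step only.
        for _ in range(NUM_RESIDUAL_CODEBOOKS):
            frame.append(fast_ar_step(semantic, frame, rng))
        frames.append(frame)
    return frames


frames = generate_frames(num_steps=5)
```

In the real system the resulting frames would go to the vocoder for waveform reconstruction; here they are only lists of integers.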

The pipeline requires as little as 4GB of GPU VRAM for inference, accessible on consumer hardware. For production serving, S2 integrates with SGLang and inherits LLM-native optimizations including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. Fish Audio reports an average prefix-cache hit rate of 86.4%, with peaks above 90%. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
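
To see why prefix caching suits TTS workloads, consider requests that reuse the same reference voice: the shared leading tokens only need to be processed once. The sketch below is a toy trie-based simulation, not SGLang's RadixAttention implementation, and the token-level definition of "hit rate" used here is an assumption rather than the metric defined in the technical report.

```python
def longest_cached_prefix(cache, tokens):
    """Count how many leading tokens are already present in the trie."""
    node, matched = cache, 0
    for tok in tokens:
        if tok not in node:
            break
        node = node[tok]
        matched += 1
    return matched


def insert(cache, tokens):
    """Add a token sequence to the trie (nested dicts keyed by token)."""
    node = cache
    for tok in tokens:
        node = node.setdefault(tok, {})


def prefix_hit_rate(requests):
    """Fraction of prompt tokens served from previously cached prefixes."""
    cache, hit, total = {}, 0, 0
    for tokens in requests:
        hit += longest_cached_prefix(cache, tokens)
        total += len(tokens)
        insert(cache, tokens)
    return hit / total if total else 0.0


# Three requests share an 8-token "reference voice" prefix (made-up ids).
voice = list(range(8))
requests = [voice + [100, 101], voice + [200, 201], voice + [300, 301]]
rate = prefix_hit_rate(requests)
```

The first request misses entirely and the next two each reuse the 8 shared tokens, so the rate climbs as more requests share a voice.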

Zero-Shot Voice Cloning

The standout capability is zero-shot voice cloning: provide 3-10 seconds of reference audio and S2 replicates that voice across any script, in any supported language, without fine-tuning. The model was trained on over 10 million hours of audio across approximately 80 languages, enabling cross-lingual transfer: clone an English voice and have it read Japanese copy. (MarkTechPost. “Fish Audio Releases Fish Audio S2: A New Generation of Expressive TTS with Absurdly Controllable Emotion.” March 10, 2026)

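# Fish-Speech inference example (local deployment)
# Illustrative sketch: verify the import path and method names against the
# fish-speech repository before use.
from fish_speech.inference import TTSEngine

# Load the S2 checkpoint
engine = TTSEngine.load("fishaudio/fish-speech-s2")

# Synthesize speech in the voice of the reference clip
audio = engine.synthesize(
    text="The market shifted overnight.",
    reference_audio="speaker_sample.wav",  # 3-10 seconds
    language="en",
)

# Write the result to disk
audio.save("output.wav")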
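
Since the cloning workflow depends on a 3-10 second reference clip, it can help to check a clip's duration before handing it to the model. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV input, and the helper names (`is_valid_reference`, `make_silent_wav`) are hypothetical, not part of Fish-Speech.

```python
import io
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # range quoted above


def wav_duration_seconds(source):
    """Duration of a PCM WAV given a file path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(source):
    """True if the clip length falls inside the 3-10 second window."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


ok = is_valid_reference(make_silent_wav(5))
too_short = is_valid_reference(make_silent_wav(1))
```

A production check would also look at sample rate, channel count, and whether the clip actually contains speech; duration is only the cheapest gate.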

Emotion Control via Natural Language Tags

S2’s most technically differentiated feature is inline emotion control using free-form natural language tags. Rather than a fixed predefined set, users insert bracketed instructions anywhere in a script:

[whispering] The acquisition closed this morning. [normal]
Nobody outside this room knows yet.
[urgent, hushed] Keep it that way.

The system recognizes 15,000+ distinct tags covering emotion, tone, volume, pitch, and pacing, written in plain language rather than a proprietary syntax. (Fish Audio Docs. “Emotion & Expression Control.”) Fish Audio describes this as “absurdly controllable emotion,” and the EmergentTTS-Eval benchmark backs the claim: S2 achieves a 91.61% win rate specifically on paralinguistic tasks. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
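
For teams that want to lint or preview tagged scripts before synthesis, the bracketed tags are easy to pull apart with a regular expression. The sketch below is a preprocessing aid only, under one assumption that the source does not confirm: that a tag stays in effect until the next tag appears. S2 itself consumes the raw tagged text, and `split_by_tags` is a hypothetical helper, not part of any Fish Audio SDK.

```python
import re

# Matches one bracketed tag such as [whispering] or [urgent, hushed].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_by_tags(script, default_tag="normal"):
    """Return (tag, text) pairs; text before any tag gets default_tag."""
    segments = []
    current_tag = default_tag
    position = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[position:match.start()].strip()
        if text:
            segments.append((current_tag, text))
        current_tag = match.group(1).strip()
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


script = (
    "[whispering] The acquisition closed this morning. [normal]\n"
    "Nobody outside this room knows yet.\n"
    "[urgent, hushed] Keep it that way."
)
segments = split_by_tags(script)
```

Applied to the example script, this yields three segments tagged `whispering`, `normal`, and `urgent, hushed`, which is handy for spotting a tag that was never reset.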

Fish-Speech vs. ElevenLabs: The Benchmark Picture

ElevenLabs is the benchmark target the community uses, so let’s examine what the numbers actually show.

Metric	Fish Audio S2	ElevenLabs	Notes
Audio Turing Test score	0.515	~0.387-0.417 range	S2 surpasses Seed-TTS (0.417) by 24% (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
WER (Seed-TTS Eval)	Lowest among all	Not publicly disclosed	S2 beats Qwen3-TTS, MiniMax, Seed-TTS (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026)
TTFA (H200 GPU)	<100ms	Not published	Suitable for real-time conversational agents
API pricing	50-70% cheaper	Baseline	Per-character at comparable quality (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026)
Voice cloning input	3-10 seconds	~1 minute for best results	Zero-shot vs. ElevenLabs instant voice cloning
Languages	80+	29–74 (model-dependent) [Updated June 2026]	ElevenLabs Multilingual v2: 29; Flash v2.5: 32; Eleven v3: 74
Commercial license	Requires separate agreement	Included in paid plans	Fish Audio Research License for open weights
Model weights	Open (Fish Audio Research License)	Closed	Training code also available for S2

These benchmarks deserve nuance. ElevenLabs retains advantages in voice cloning fidelity for high-quality input audio. Some independent reviewers find ElevenLabs clones more natural-sounding for narration use cases. (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026) [Updated June 2026] On language breadth, ElevenLabs has closed part of the gap: its Eleven v3 model now covers 74 languages, up from 32 in earlier models, though Fish Audio S2’s 80+ still leads and runs from a single model without per-language switching. S2’s edge is strongest in paralinguistic expressivity and breadth of language coverage.

Why Does Fish-Speech Matter?

The Economics Are Upside Down for ElevenLabs

ElevenLabs raised $500M in a Series D at an $11B valuation in February 2026, after ending 2025 at roughly $350M ARR. In early May 2026 it announced it had crossed $500M ARR and brought in additional investors, including BlackRock, NVIDIA, and Sequoia, in a third close of the round. (ElevenLabs. “ElevenLabs crosses $500M ARR and welcomes new investors.” May 2026) That’s a legitimate business, growing fast. But a model that matches its output quality at 50-70% lower API cost, backed by open weights, is a structural threat that doesn’t resolve through product iteration alone.

The relevant precedent is Stable Diffusion versus Midjourney. Once open-source image generation reached comparable quality to the leading commercial product, pricing power for premium-only players collapsed rapidly. ElevenLabs is a more defensible business, with enterprise contracts, multi-year commitments, and product integrations, but the directional pressure is the same.

What Open-Source TTS Competition Looks Like in 2026

Fish-Speech isn’t the only pressure point. The open-source TTS ecosystem has matured considerably:

Kokoro (Apache 2.0): 82M parameters, processes text in under 0.3 seconds, best-in-class for speed on constrained hardware, but no voice cloning support. (BentoML. “The Best Open-Source Text-to-Speech Models in 2026.”)
XTTS-v2 (Coqui): Zero-shot voice cloning from 6-second clips across 17 languages, restricted to non-commercial use under the Coqui Public Model License. [Updated June 2026] Coqui AI shut down in early 2024; the model weights remain on Hugging Face and a community fork is maintained at idiap/coqui-ai-TTS, but development by the original company has stopped. (BentoML. “The Best Open-Source Text-to-Speech Models in 2026.”)
Chatterbox (Resemble AI, MIT): [Updated June 2026] Released in 2025 and updated to Chatterbox Multilingual V3 in 2026, this 0.5B-parameter model supports 20+ languages under a fully permissive MIT license. It offers zero-shot voice cloning from 10 seconds of audio, built-in PerTh watermarking, and an emotion exaggeration control dial. In Resemble AI’s own blind-test listening study, 63.75% of listeners preferred Chatterbox over ElevenLabs. Being MIT-licensed makes it the only voice-cloning model in this list that permits unrestricted commercial use without a separate agreement.
Fish-Speech S2: The current quality leader with the broadest language support and inline emotion control, but, like XTTS-v2, restricted to non-commercial use without a separate license agreement.

The pattern across XTTS-v2 and Fish-Speech is the same: open weights with non-commercial restrictions. Kokoro’s Apache 2.0 license is permissive, but the model lacks voice cloning, which leaves Chatterbox’s MIT license as the notable exception: the first voice-cloning model at this quality tier to remove the commercial-use barrier entirely.

Consumer Hardware Performance Is Finally Viable

S2 achieves sub-100ms TTFA on NVIDIA H200 GPUs, suitable for real-time applications. (Fish Audio. “Fish Audio S2 Technical Report.” arXiv, March 2026) Throughput on consumer hardware is the more practical measure, with the H200 shown for reference:

RTX 3060 (12GB VRAM): 1 minute of audio in ~15 seconds (about 4x faster than real time)
RTX 4090: 1 minute of audio in ~7 seconds (about 8.6x faster than real time)
Server-scale H200: 3,000+ acoustic tokens per second
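
The consumer-GPU figures above (1 minute of audio in about 15 seconds on an RTX 3060, about 7 seconds on an RTX 4090) are easier to compare once converted to a speed-up over real time or to the real-time factor (RTF) commonly used in speech benchmarks. The arithmetic is sketched below; the function names are illustrative.

```python
def speedup(audio_seconds, wall_seconds):
    """How many times faster than real time generation runs."""
    return audio_seconds / wall_seconds


def real_time_factor(audio_seconds, wall_seconds):
    """RTF: compute time per second of audio (below 1.0 beats real time)."""
    return wall_seconds / audio_seconds


rtx_3060_speedup = speedup(60, 15)           # 60 s of audio in ~15 s
rtx_4090_speedup = speedup(60, 7)            # 60 s of audio in ~7 s
rtx_3060_rtf = real_time_factor(60, 15)
```

Note that throughput says nothing about latency: a card can generate faster than real time overall and still take too long to emit the first audio chunk for a conversational agent.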

The 4GB minimum VRAM requirement means S2 runs on most mid-range GPUs released since 2022. For developers self-hosting voice generation pipelines, the inference economics are increasingly compelling relative to per-character API billing. [Updated June 2026] Fish Audio’s current cloud API pricing is $15 per 1M UTF-8 bytes, with subscription tiers at $11/month (~200 minutes), $75/month (~27 hours, 3 seats), and $749/month (~104 hours, 10 seats).
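
Because the cloud API is billed per UTF-8 byte rather than per character, the same number of characters can cost different amounts depending on the script. The sketch below applies the $15 per 1M bytes rate quoted above. The helper name is hypothetical, and actual invoices may round or meter differently.

```python
PRICE_PER_MILLION_BYTES_USD = 15.0  # rate quoted in the article


def api_cost_usd(text):
    """Estimated API cost for synthesizing `text`, billed by UTF-8 bytes."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes * PRICE_PER_MILLION_BYTES_USD / 1_000_000


english = "The market shifted overnight."
japanese = "市場は一夜にして変わった。"

english_bytes = len(english.encode("utf-8"))    # ASCII: 1 byte per character
japanese_bytes = len(japanese.encode("utf-8"))  # kana and kanji: 3 bytes each
english_cost = api_cost_usd(english)
```

For self-hosting comparisons, the estimate gives a per-script baseline to weigh against GPU time at the throughput figures above.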

Critical: Voice cloning at this quality level (3-10 second reference audio, zero fine-tuning) raises immediate concerns for deepfake audio, identity fraud, and social engineering across large user populations. Fish Audio operates without the institutional trust and compliance infrastructure that enterprise SaaS providers maintain. Deployers should implement their own consent and verification controls. For a concrete illustration of the threat model, the Mercor breach handed attackers 40,000 pre-verified voice samples — exactly the kind of audio that feeds zero-shot cloning pipelines like S2.
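
One way a deployer might start on consent controls is to refuse cloning unless the reference clip matches a recorded, unexpired consent entry. The sketch below is a deliberately simple illustration: `ConsentRecord`, `fingerprint`, and `may_clone` are hypothetical names, and matching on an exact SHA-256 of the file only gates known uploads. It does not verify who is speaking, so it would need to sit alongside real speaker verification and audit logging.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    audio_sha256: str
    expires_at: datetime


def fingerprint(audio_bytes):
    """Exact-content fingerprint of the uploaded reference clip."""
    return hashlib.sha256(audio_bytes).hexdigest()


def may_clone(audio_bytes, registry, now=None):
    """Allow cloning only if this exact clip has unexpired consent on file."""
    now = now or datetime.now(timezone.utc)
    record = registry.get(fingerprint(audio_bytes))
    return record is not None and now < record.expires_at


clip = b"stand-in for reference audio bytes"
registry = {
    fingerprint(clip): ConsentRecord(
        speaker_id="speaker-001",
        audio_sha256=fingerprint(clip),
        expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc),
    )
}
allowed = may_clone(clip, registry, now=datetime(2026, 6, 1, tzinfo=timezone.utc))
```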

What ElevenLabs Still Does Better

Credit where it’s due: ElevenLabs maintains genuine advantages in specific areas.

Voice cloning fidelity for studio-quality input: When fed high-quality reference recordings, ElevenLabs produces clones that independent reviewers consistently describe as more natural and coherent for long-form narration. (SaaS24 Reviews. “Fish.audio vs ElevenLabs: My Hands-on Experience with Both.” 2026) S2 clones can exhibit occasional artificiality, particularly on voices with unusual prosodic patterns.

Enterprise infrastructure: Compliance certifications, SLAs, dedicated support, and the kind of organizational trust that gets TTS deployed in financial services and healthcare aren’t things you can open-source. ElevenLabs’ transition toward 60-70% enterprise revenue by 2027 reflects this moat. (MLQ.ai. “ElevenLabs Lands $500M Series D Round at $11B Valuation.”)

Stability and versioning: Commercial APIs maintain stable endpoints. Open-source models iterate rapidly, which is a feature for researchers and a risk for production pipelines that can’t absorb breaking changes. Separately, any system ingesting user-supplied audio faces adversarial codec attacks that self-hosted deployments are responsible for mitigating without vendor support.

Frequently Asked Questions

Q: Can I use Fish-Speech S2 in a commercial product? A: Not directly with the open weights. S2 is licensed under the Fish Audio Research License, which permits free use for research and non-commercial purposes only. You’ll need to negotiate a commercial license with Fish Audio or use their paid API, which permits commercial deployment under standard API terms.

Q: How does Fish-Speech perform on languages other than English and Chinese? A: S2 was trained on 10M+ hours across approximately 80 languages, and supports zero-shot cross-lingual voice transfer. Quality varies by language. English and Chinese have the strongest coverage given training data distribution, while lower-resource languages may show degraded accuracy.

Q: What hardware do I need to run Fish-Speech locally? A: A minimum of 4GB GPU VRAM for inference. An RTX 3060 generates audio roughly 4x faster than real time (about 15 seconds to produce 1 minute of speech). For latency-sensitive real-time applications, a high-end consumer or server GPU is required.

Q: Does Fish-Speech S2 support streaming output? A: Yes. S2’s inference engine supports streaming with a time-to-first-audio under 100ms on server hardware, making it viable for conversational agents and live applications, not just batch voiceover generation.

Q: Is this actually better than ElevenLabs, or just cheaper? A: Both, with caveats. On the objective benchmarks Fish Audio reports, S2 leads on Audio Turing Test score and posts the lowest WER among the systems compared as of March 2026, though ElevenLabs has not publicly disclosed comparable WER or TTFA figures. ElevenLabs retains an edge in voice cloning fidelity for high-quality input audio and in long-form narration naturalness per subjective reviews, but the gap is narrow and the price differential is 50-70%.