An open-weight text-to-speech model that placed on the TTS Arena leaderboard in January 2026 above much larger models like XTTS v2 (467M) and MetaVoice (1.2B).
Apache 2.0 licensed. No API key. No usage caps. No network calls after the initial model download.
System Verdict
Pick Kokoro if the use case is offline, high-volume, or privacy-constrained English TTS with a fixed voice. The model is a ~300MB download, runs on a laptop, and costs nothing past electricity. Community ONNX builds ship in three sizes, from 88MB to 310MB, for mobile and browser deployment.
Skip it if the job needs voice cloning, fine-grained emotion control, or real-time streaming. ElevenLabs keeps the quality ceiling and the voice-library breadth. Cartesia owns low-latency conversational use cases. MiniMax Speech undercuts ElevenLabs on price for multilingual workloads that still want a hosted API.
Kokoro’s moat is size-efficiency, not features. The 82M parameter count means laptop-local inference at commercial-grade quality for a narrow slice of jobs.
Key Facts
| Fact | Value |
|---|---|
| Model size | 82M parameters (~300MB download) |
| License | Apache 2.0 (commercial use permitted) |
| Architecture | Modified StyleTTS 2 |
| Voices (v1.0) | 54 voices across 8 languages |
| Languages (v1.0) | English (US + UK), Spanish, French, Hindi, Italian, Japanese, Mandarin Chinese |
| Inference | CPU and CUDA GPU; Apple Silicon via ONNX |
| Deployment formats | PyTorch, ONNX (fp32 310MB, fp16 169MB, int8 88MB) |
| Hosted API cost | Under $1 per 1M input characters via third-party providers |
| Released | November 2024; v1.0 early 2026 |
What it actually is
A small neural TTS model that turns text into audio locally. The architecture is a modified StyleTTS 2 trained on permissive, non-copyrighted audio with IPA phoneme labels.
The Python package (pip install kokoro) wraps inference with a minimal API. ONNX builds target mobile, browser, and non-Python runtimes. A Gradio demo ships for no-code local testing.
The moat is size. At 82M parameters Kokoro takes about 300MB on disk and runs in real time on CPU. Competing open models at comparable quality (XTTS v2, Tortoise) are several times larger (XTTS v2 is roughly 5.7x the size at ~467M parameters) and need a GPU for acceptable latency.
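As a rough sanity check on the size claim, raw weight storage is parameter count times bytes per parameter. The sketch below is a back-of-envelope estimate, not a description of how the files are packaged: it assumes 4, 2, and 1 bytes per weight for fp32, fp16, and int8, and ignores file metadata and any layers left unquantized. The published figures are the ONNX build sizes quoted in the Key Facts table.

```python
PARAMS = 82_000_000  # parameter count quoted in this review

# Assumed storage cost per weight; real files also carry metadata.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# ONNX build sizes quoted in the Key Facts table, in MB.
PUBLISHED_MB = {"fp32": 310, "fp16": 169, "int8": 88}


def estimated_mb(params: int, bytes_per_param: int) -> float:
    """Raw weight storage in decimal megabytes."""
    return params * bytes_per_param / 1_000_000


estimates = {name: estimated_mb(PARAMS, width) for name, width in BYTES_PER_PARAM.items()}

# Relative gap between each estimate and the published build size.
gaps = {name: abs(estimates[name] - PUBLISHED_MB[name]) / PUBLISHED_MB[name] for name in estimates}

for name in estimates:
    print(f"{name}: estimated {estimates[name]:.0f}MB, "
          f"published {PUBLISHED_MB[name]}MB, gap {gaps[name]:.0%}")
```

Under these assumptions each estimate lands within about 10% of the published build size, which is consistent with the 82M parameter count.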
When to pick Kokoro
- Self-hosted AI stacks that must stay offline. Pair with a local LLM for end-to-end air-gapped audio pipelines.
- High-volume narration where per-character fees hurt. Audiobooks, podcasts, subtitles, game VO at scale.
- Privacy-sensitive text (medical, legal, financial). No outbound API call means no data egress.
- Edge and mobile deployments. The int8 ONNX build is 88MB. Fits on a phone.
- Research and reproducibility. Fixed weights and deterministic inference avoid the drift introduced by hosted-model upgrades.
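For the edge and mobile case, choosing a build comes down to a size budget. A minimal sketch, assuming the ONNX sizes quoted in the Key Facts table and assuming the largest build that fits is the one you want (higher precision first). The function name and the budgets in the usage lines are hypothetical.

```python
from typing import Optional

# Build sizes in MB, as quoted in the Key Facts table.
ONNX_BUILDS_MB = {"fp32": 310, "fp16": 169, "int8": 88}


def pick_build(budget_mb: float) -> Optional[str]:
    """Return the largest build that fits the budget, or None if none fits."""
    fitting = {name: size for name, size in ONNX_BUILDS_MB.items() if size <= budget_mb}
    if not fitting:
        return None
    # Largest size wins, on the assumption that less quantization is preferable.
    return max(fitting, key=fitting.get)


print(pick_build(100))  # a hypothetical 100MB mobile asset budget
print(pick_build(200))  # a hypothetical 200MB budget
print(pick_build(50))   # too small for any listed build
```

Quality differences between the three builds are not covered by this review, so treat the "largest that fits" rule as a starting point and listen to the output before shipping.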
When to pick something else
- Voice cloning from a reference clip: ElevenLabs, Fish Audio, or MiniMax Speech. Kokoro ships fixed voices only.
- Fine-grained emotional control: ElevenLabs v3 or MiniMax Speech-02. Kokoro’s prosody controls stay basic.
- Real-time streaming for conversational agents: Cartesia is built for this. Kokoro generates full audio before playback.
- Languages beyond the 8 in v1.0: ElevenLabs and MiniMax cover 30+ languages with native prosody.
- Studio production UI with takes and timeline editing: Murf or ElevenLabs Studio. Kokoro is code-first.
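The list above reduces to a lookup: name the requirement, get the tools that cover it. A minimal sketch that transcribes the list into a table. The requirement keys (`voice_cloning`, `studio_ui`, and so on) are hypothetical labels chosen for illustration, not part of any tool's API.

```python
# Requirement -> alternatives, transcribed from the list above.
ALTERNATIVES = {
    "voice_cloning": ["ElevenLabs", "Fish Audio", "MiniMax Speech"],
    "emotion_control": ["ElevenLabs v3", "MiniMax Speech-02"],
    "realtime_streaming": ["Cartesia"],
    "extra_languages": ["ElevenLabs", "MiniMax"],
    "studio_ui": ["Murf", "ElevenLabs Studio"],
}


def blockers(requirements: set[str]) -> dict[str, list[str]]:
    """Map each requirement Kokoro does not meet to the tools that do."""
    return {req: ALTERNATIVES[req] for req in sorted(requirements) if req in ALTERNATIVES}


def fits_kokoro(requirements: set[str]) -> bool:
    """True when none of the requirements appears in the blocker table."""
    return not blockers(requirements)


print(fits_kokoro({"offline", "high_volume"}))
print(blockers({"offline", "voice_cloning"}))
```

Requirements missing from the table, such as offline use or high volume, are treated as compatible with Kokoro, which mirrors the split between the two lists in this review.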
Pricing
| Path | Cost |
|---|---|
| Self-hosted model | Free (Apache 2.0) |
| Own hardware | Electricity only |
| Hosted API (third-party) | Under $1 per 1M input characters (Together AI, others) |
| Commercial use | Permitted under Apache 2.0 without royalty |
Reverified 2026-05-13 via the Kokoro-82M Hugging Face repo and ONNX community builds. Self-hosted inference is free; hosted APIs price per million characters.
Against the alternatives
| | Kokoro (82M) | XTTS v2 (~467M) | ElevenLabs (hosted) |
|---|---|---|---|
| License | Apache 2.0 | CPML (non-commercial by default) | Proprietary |
| Parameter count | 82M | 467M | Not disclosed |
| Voice cloning | No | Yes (instant) | Yes (best-in-class) |
| Languages | 8 (v1.0) | 17 | 32+ |
| Real-time streaming | No | Limited | Yes |
| Emotion control | Basic | Basic | Fine-grained |
| Cost at 10M chars | Electricity | Electricity | ~$300+ on paid tier |
| Best viewed as | Small, offline-first | Mid-size clone-capable | Hosted quality ceiling |
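The cost row is simple per-character arithmetic. The sketch below works it through at 10M characters using only figures quoted on this page: the hosted Kokoro rate ("under $1 per 1M characters") is treated as an upper bound, and the ElevenLabs figure ("~$300+") as a lower bound. Both are approximations from this review, not quotes from a price list.

```python
KOKORO_HOSTED_MAX_PER_M = 1.00  # upper bound, USD per 1M characters
ELEVENLABS_AT_10M = 300.00      # lower bound, USD for 10M characters


def cost(chars: int, usd_per_million: float) -> float:
    """Cost in USD for a character volume at a per-million rate."""
    return chars / 1_000_000 * usd_per_million


chars = 10_000_000

# Ceiling for hosted Kokoro at this volume.
kokoro_ceiling = cost(chars, KOKORO_HOSTED_MAX_PER_M)

# Per-million rate implied by the ElevenLabs figure in the table.
elevenlabs_rate = ELEVENLABS_AT_10M / (chars / 1_000_000)

# How many times more the hosted incumbent costs, at minimum.
min_ratio = ELEVENLABS_AT_10M / kokoro_ceiling

print(f"Hosted Kokoro at 10M chars: at most ${kokoro_ceiling:.2f}")
print(f"ElevenLabs implied rate: at least ${elevenlabs_rate:.2f} per 1M chars")
print(f"Gap: at least {min_ratio:.0f}x")
```

Because one input is a ceiling and the other a floor, the ratio is a minimum. Self-hosting removes the per-character term entirely and leaves only hardware and electricity.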
Failure modes
- No voice cloning. Fixed pre-trained voices only. Custom-voice work requires a different model.
- Prosody is basic. No fine-grained emotion sliders. Tone is controlled mainly by text wording and punctuation.
- No streaming. Full audio generates before playback. Latency is not viable for real-time agent loops.
- English quality leads; other languages lag. The 8-language v1.0 list is functional but native-speaker critique can show gaps against specialist models.
- No hosted first-party API. Third-party providers (Together, Replicate) exist, but there is no vendor SLA.
- CPU is only real-time. CPU inference keeps pace with playback, but GPU is 10-20x faster, so long-document batches on CPU get slow.
- Community-driven release cadence. Version bumps depend on hexgrad’s time. Update frequency is irregular.
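The CPU versus GPU point is easiest to see as wall-clock time. A minimal sketch, assuming "real-time" means a CPU renders one hour of audio in about one hour and that the 10-20x GPU figure is measured against that baseline. The 10-hour audiobook is a hypothetical job size.

```python
def synthesis_hours(audio_hours: float, speedup: float) -> float:
    """Wall-clock hours to render audio_hours of speech at a multiple of real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours / speedup


AUDIOBOOK_HOURS = 10.0  # hypothetical job size

cpu = synthesis_hours(AUDIOBOOK_HOURS, 1.0)        # real-time baseline
gpu_slow = synthesis_hours(AUDIOBOOK_HOURS, 10.0)  # low end of the 10-20x range
gpu_fast = synthesis_hours(AUDIOBOOK_HOURS, 20.0)  # high end of the range

print(f"CPU: {cpu:.1f}h, GPU: {gpu_fast:.1f}h to {gpu_slow:.1f}h")
```

Under these assumptions a single audiobook ties up a CPU for a working day, while a GPU clears it in an hour or less. Actual throughput depends on the hardware and the text, so measure on the target machine before planning a batch.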
Methodology
This page was produced by the aipedia.wiki editorial pipeline, an automated system that ingests vendor documentation, verifies claims against primary sources, and generates the editorial analysis shown here. No individual human wrote this review. Scoring follows the four-dimension rubric at /about/scoring/ (Utility, Value, Moat, Longevity, unweighted average). Last verified 2026-05-13 against the Kokoro-82M Hugging Face repo, VOICES.md, hexgrad GitHub, and onnx-community Kokoro-82M-v1.0-ONNX builds.
FAQ
Is Kokoro free for commercial use? Yes. The model is Apache 2.0 licensed, which allows commercial use without royalties (Hugging Face).
How does Kokoro compare to ElevenLabs? Kokoro matches ElevenLabs on fixed-voice English narration quality in blind TTS Arena tests. ElevenLabs still wins on voice cloning, emotion sliders, real-time streaming, and language breadth. Kokoro wins on cost (free vs per-character) and privacy (local vs hosted).
How do I run Kokoro?
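Install with pip install kokoro soundfile. Basic inference:
from kokoro import KPipeline
import soundfile as sf
pipeline = KPipeline(lang_code='a')  # 'a' selects American English
# Calling the pipeline returns a generator of (graphemes, phonemes, audio), one item per text segment
for i, (_, _, audio) in enumerate(pipeline("Your text here.", voice='af_heart')):
    sf.write(f'{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio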
ONNX builds exist for deployment outside Python (onnx-community).
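For narration jobs it is usually more convenient to end up with one file than one file per segment. The sketch below joins segments into a single mono WAV using only the standard library. It assumes the pipeline yields items that unpack to (graphemes, phonemes, audio), that each audio object exposes .tolist() and holds float samples in [-1, 1], and that the output rate is 24 kHz, as in the upstream README. `write_wav` and `synthesize_to_file` are hypothetical helper names.

```python
import struct
import wave
from typing import Iterable, Sequence

SAMPLE_RATE = 24_000  # assumed Kokoro output rate in Hz


def write_wav(chunks: Iterable[Sequence[float]], path: str,
              sample_rate: int = SAMPLE_RATE) -> int:
    """Append float chunks to one mono 16-bit PCM file. Returns frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clamp to [-1, 1] before scaling so stray peaks cannot overflow.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            frames += len(pcm)
    return frames


def synthesize_to_file(text: str, path: str, voice: str = "af_heart") -> int:
    """Render text with Kokoro and write every segment into a single WAV."""
    # Imported lazily so write_wav stays usable without the model installed.
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code="a")
    segments = (audio.tolist() for _, _, audio in pipeline(text, voice=voice))
    return write_wav(segments, path)
```

Calling synthesize_to_file("Your text here.", "out.wav") triggers the model download on first use. write_wav consumes segments lazily, so memory use stays at roughly one segment at a time even for long documents.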
How many voices and languages does Kokoro support? V1.0 ships 54 voices across 8 languages: English (US, UK), Spanish, French, Hindi, Italian, Japanese, and Mandarin Chinese. Voices come in US female, US male, UK female, UK male, and regional variants. See the VOICES.md reference for the full list.
Can Kokoro clone my voice? No. Kokoro supports fixed voices only. For zero-shot voice cloning from a short reference clip, use ElevenLabs, Fish Audio, or MiniMax Speech.
Sources
- Kokoro-82M on Hugging Face: official model, voicepacks, documentation
- Kokoro VOICES.md: canonical voice/language reference
- hexgrad GitHub: source code and Python library
- Kokoro-82M-v1.0-ONNX: community ONNX builds for mobile and browser
- Together AI hosted Kokoro-82M: hosted API pricing reference