Google Launches Gemini 3.1 Flash TTS With Audio Tags and 70+ Languages (April 2026)
Google on April 15, 2026 released Gemini 3.1 Flash TTS in preview — an expressive speech model with natural-language audio tags, 70+ languages and a 1,211 Elo on Artificial Analysis. Output costs $10/1M tokens.
On April 15, 2026, Google released Gemini 3.1 Flash TTS, a preview text-to-speech model that introduces natural-language "audio tags" for directing vocal style, pace and delivery, native support for more than 70 languages, and multi-speaker dialogue. It posted an Elo score of 1,211 on the Artificial Analysis TTS leaderboard at launch, placing it in what Google calls "the most attractive quadrant for quality-versus-price."
What Happened
The model was announced by Vilobh Meshram (Senior Product Manager) and Max Gubin (Principal Research Engineer) on the Gemini team blog. Unlike previous Google TTS generations that relied on a fixed set of preset voices, Flash TTS lets developers steer the output with director's-notes-style instructions embedded directly in the prompt — controlling regional accent, pace, pitch, emphasis, and scene direction at a per-speaker level.
It shipped immediately in preview through three entry points: the Gemini API and Google AI Studio for individual developers (model ID gemini-3.1-flash-tts-preview), Vertex AI for enterprises, and as a voice option inside Google Vids for Workspace customers. Every generated audio file is watermarked with SynthID to flag AI-generated content.
Key Details
- Pricing: $0.50 per 1M input tokens and $10.00 per 1M output tokens, with audio tokens metered at 25 tokens per second of generated audio. A free tier with generous limits is available on AI Studio for evaluation.
- Audio tags: Developers embed natural-language tags directly in the prompt to control style ("whispering," "excited"), accent (Brixton, for example), pace, emphasis and scene direction. Independent developer Simon Willison called the accompanying prompting guide "surprising, to say the least," before publishing his own browser-based testbed.
- Multi-speaker dialogue: Native support for multiple named speakers in a single generation, each with per-speaker "Director's Notes" for character consistency across long passages.
- Leaderboard position: The 1,211 Elo score on Artificial Analysis' TTS leaderboard is built from thousands of blind human preference votes and ranks Flash TTS competitively against higher-priced rivals.
- Safety: SynthID watermarking is applied to every output by default, letting Google's detector identify Gemini-generated audio after the fact.
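The pricing figures above translate into a per-minute cost with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the rates quoted in this article ($0.50 and $10.00 per 1M tokens, 25 audio tokens per second). The function name estimate_cost_usd is illustrative, and preview pricing may change, so treat the output as a rough estimate rather than a billing reference.

```python
# Rates as quoted in the article; preview pricing may change.
INPUT_USD_PER_MILLION_TOKENS = 0.50
OUTPUT_USD_PER_MILLION_TOKENS = 10.00
AUDIO_TOKENS_PER_SECOND = 25


def estimate_cost_usd(input_tokens: int, audio_seconds: float) -> float:
    """Estimate the cost of one generation from prompt size and audio length."""
    # Generated audio is metered as output tokens at a fixed rate per second.
    output_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    input_cost = input_tokens * INPUT_USD_PER_MILLION_TOKENS / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION_TOKENS / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # One minute of audio is 60 * 25 = 1,500 output tokens.
    print(f"1 minute: ${estimate_cost_usd(0, 60):.4f}")
    print(f"1 hour:   ${estimate_cost_usd(0, 3600):.2f}")
```

On these numbers the output side dominates: a 200-token prompt adds only $0.0001, while each minute of generated audio adds about a cent and a half.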
What Developers Are Saying
The release drew two early submissions on Hacker News within hours, and developer Simon Willison published an analysis the same day that focused on the unusual prompting style. "The prompting guide is surprising, to say the least," he wrote. He walked through Google's example prompt, which mixes character descriptions with director's-notes-style instructions for style, pace and accent, and then released his own UI for testing different regional accents.
The broader developer reaction has emphasized two things. The first is pricing: at $10 per 1M output tokens and 25 tokens per second of audio, one minute of speech is 1,500 tokens, or roughly $0.015 (about $0.90 per hour), which undercuts most expressive-voice rivals while benchmarking competitively against them. The second is the control surface: embedding audio direction in natural language rather than a per-voice preset menu is a prompting pattern more familiar from image models like Imagen than from speech APIs.
What This Means for Developers
For teams already on the Gemini API, Flash TTS is a drop-in addition: call the same endpoint, pass gemini-3.1-flash-tts-preview, and receive an audio file. Existing TTS integrations that rely on per-voice presets (ElevenLabs, Amazon Polly, Azure Neural TTS) will need prompt-engineering work to port their style control into Flash TTS's audio-tag format, but Google's promise is that the same prompt now travels across 70+ languages without a voice-per-locale matrix.
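As a rough illustration of what "the same endpoint" could look like, the sketch below builds a generateContent request body in Python without sending it. Only the model ID comes from this article. The URL pattern and the responseModalities field are assumptions carried over from earlier Gemini TTS preview models, and the wording of the direction text is hypothetical rather than Google's documented audio-tag syntax, so check the official API reference before relying on any of it.

```python
import json

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID given in the article

# Assumption: the URL follows the existing Gemini API generateContent pattern.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL_ID}:generateContent"
)


def build_tts_request(direction: str, transcript: str) -> dict:
    """Build a request body that asks for audio output (nothing is sent)."""
    # Hypothetical layout: natural-language direction first, then the text
    # to speak. The real audio-tag format may differ.
    prompt = f"{direction}\n\n{transcript}"
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Assumption: audio output is requested the same way as in earlier
        # Gemini TTS preview models.
        "generationConfig": {"responseModalities": ["AUDIO"]},
    }


if __name__ == "__main__":
    body = build_tts_request(
        direction="Director's notes: whispering, slow pace, Brixton accent.",
        transcript="Welcome back to the show.",
    )
    print(ENDPOINT)
    print(json.dumps(body, indent=2))
```

Keeping the direction and the transcript as separate arguments means the same direction can be reused across languages, which is the portability Google is promising here.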
The SynthID watermarking also changes the compliance calculation for regulated industries — every Flash TTS output is provenance-tagged by default, which removes a manual step for teams that already needed to label AI-generated audio under emerging EU AI Act and US FTC guidance.
What's Next
Google has called this a preview release and plans general availability later in 2026. The model is expected to graduate from the Google AI Studio and Vertex AI preview tiers and become selectable as a standard voice in Google Vids, with Google hinting at deeper integration into Workspace products. The official blog post and the Gemini API pricing page are the canonical references for launch details and rate limits.
Sources
- Google Blog — Gemini 3.1 Flash TTS: the next generation of expressive AI speech — the official announcement from Vilobh Meshram and Max Gubin.
- Simon Willison — Gemini 3.1 Flash TTS — independent developer analysis and a live demo UI.
- SiliconANGLE — Gemini 3.1 Flash TTS offers unparalleled control over AI voices.
- MarkTechPost — A new benchmark in expressive and controllable AI voice.
- Google AI — Gemini Developer API pricing — the $0.50 / $10 per 1M token rates.
- Artificial Analysis TTS leaderboard — where Flash TTS posted its 1,211 Elo score.