Google just dropped Gemini 3.1 Flash TTS, and honestly, this is the most interesting speech model they’ve put out in a while. Not because it’s faster or cheaper (though it is both), but because they finally let you tell the AI exactly how to talk.
What’s actually new here
The headline feature is something they call “audio tags.” You embed natural language commands directly into your text input, and the model adjusts the vocal style, pace, and delivery accordingly. Want the AI to sound excited about something? Just tag it. Need it to slow down for a dramatic pause? You can do that too.
This is a big step up from the old approach where you’d tweak a dozen sliders and still end up with something that sounds like a robot reading a script. The tags are intuitive enough that you can just write “[excitedly]” or “[whisper]” and the model handles the rest.
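To make the tagging idea concrete, here's a minimal sketch of assembling tagged input text in Python. The `tag` and `build_script` helpers are hypothetical names of my own, not part of any Google SDK, and the only tag spellings I'm relying on are the two mentioned above. Check the official docs for the supported tags and exact syntax before sending anything to the API.

```python
from typing import Optional


def tag(style: str, text: str) -> str:
    """Prefix text with a bracketed audio tag, e.g. "[whisper] hello"."""
    style = style.strip()
    if not style or "[" in style or "]" in style:
        raise ValueError("style must be non-empty and contain no brackets")
    return f"[{style}] {text.strip()}"


def build_script(segments: list[tuple[Optional[str], str]]) -> str:
    """Join (style, text) pairs into one input string.

    A style of None leaves that segment untagged.
    """
    parts = [tag(style, text) if style else text.strip() for style, text in segments]
    return " ".join(parts)


# Hypothetical example script mixing untagged and tagged segments.
script = build_script([
    (None, "Here's the update."),
    ("excitedly", "We shipped it!"),
    ("whisper", "Don't tell anyone yet."),
])
print(script)
```

The point of keeping the tags in a small helper is that your styling lives next to the text it applies to, so you can review a script the same way you'd review dialogue with stage directions.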
The numbers game
On the Artificial Analysis TTS leaderboard, which ranks models by blind human preference, 3.1 Flash TTS scored an Elo of 1,211. That’s higher than I expected for a model that’s also supposed to be cost-effective. Artificial Analysis placed it in their “most attractive quadrant,” the sweet spot between quality and cost.
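If you're wondering what an Elo number means in practice, here's a rough sketch. I'm assuming the textbook Elo formula with a 400-point scale; the leaderboard's exact rating procedure isn't described here, so treat this as intuition rather than their method. The 1,100 rating below is a made-up comparison point, not a real model's score.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head wins for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1211 is the score quoted above; 1100 is a hypothetical opponent rating.
p = elo_expected_score(1211, 1100)
print(f"{p:.2f}")  # prints 0.65
```

Under that assumption, a gap of roughly 110 points means listeners would pick the higher-rated model about two times out of three in a blind comparison.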
It also supports native multi-speaker dialogue out of the box. So if you’re generating a conversation between two characters, you don’t need to stitch separate audio clips together. The model handles the back-and-forth naturally.
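For the multi-speaker case, the practical upshot is that you send one transcript instead of one request per line. Here's a small sketch of preparing that transcript in Python. The "Name: line" layout and the `format_dialogue` and `speakers_in` helpers are my own assumptions for illustration, so confirm the expected input format in the API docs.

```python
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    """Render (speaker, line) turns as a single "Speaker: line" transcript."""
    lines = []
    for speaker, line in turns:
        speaker = speaker.strip()
        if not speaker or ":" in speaker:
            raise ValueError("speaker names must be non-empty and contain no colon")
        lines.append(f"{speaker}: {line.strip()}")
    return "\n".join(lines)


def speakers_in(turns: list[tuple[str, str]]) -> list[str]:
    """Return the distinct speaker names in order of first appearance."""
    seen: list[str] = []
    for speaker, _ in turns:
        name = speaker.strip()
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical two-character exchange.
turns = [
    ("Ava", "Did the build pass?"),
    ("Ben", "First try."),
    ("Ava", "I don't believe you."),
]
transcript = format_dialogue(turns)
print(transcript)
print(speakers_in(turns))
```

Pulling out the speaker list separately is handy because you'll typically need to assign a voice to each name before generating the audio.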
Language support and availability
70+ languages is the claim here. I’ve seen this number before from Google, and it usually holds up reasonably well for major languages. The less common ones tend to be hit or miss, but the core European and Asian languages should be solid.
Right now it’s in preview on:
- Gemini API and Google AI Studio for developers
- Vertex AI for enterprise customers
- Google Vids for Workspace users
The watermarking situation
Every piece of audio generated with this model gets watermarked with SynthID. This is Google’s way of making AI-generated audio identifiable and curbing misinformation, and it’s becoming standard practice across their generative AI products. The watermark is inaudible to humans but detectable by their tools. Whether this actually stops bad actors is debatable, but it’s better than nothing.
My take
This is the first TTS model from Google that feels like it was designed for people who actually want to use it creatively. The audio tags approach has been tried before by smaller players, but Google’s implementation seems more polished. The multi-speaker support alone makes it worth a look for anyone building conversational AI or interactive voice applications.
That said, it’s still in preview. Expect rough edges, especially with less common languages and complex audio tag combinations. And as always with Google, I’d keep an eye on pricing once it leaves preview. The current sweet spot on the leaderboard might not last forever.