OpenAI's Powerful New Voice Agents Bring Better Control Over Tone, Emotion, Energy Level, and Delivery Style


OpenAI has just unveiled a game-changing suite of voice technology tools that promises to transform how we interact with AI. In their recent livestream announcement, the company introduced three powerful new models and significant updates to their Agents SDK, all designed to make voice-based AI interactions more natural, reliable, and accessible to developers.


The Evolution Toward Voice-First AI

For years, text has dominated our interactions with AI systems, but OpenAI is betting big that voice represents the future. As Olivier Godement, who leads OpenAI's platform product, explained during the announcement: "Many people prefer to speak and to listen over writing and reading. So in a way, voice is a very natural human interface."

This shift makes intuitive sense. Humans evolved to communicate through speech long before written language, and our brains are naturally wired for verbal communication. What's exciting about OpenAI's announcement is that it brings us significantly closer to AI systems that can engage with us through this most natural form of interaction.

According to Jeff Harris from the API product team, voice agents are AI systems that can independently act on behalf of users or developers through spoken conversation. Think of the difference between a basic transcription service and a voice agent that can understand your request, reason about it, take actions, and respond appropriately—all through the medium of speech. Common applications include:

  • Customer service voice assistants that can handle product inquiries or process orders
  • Language learning companions that provide pronunciation feedback and conversation practice
  • Virtual assistants that perform complex tasks through conversational dialogue


OpenAI highlighted two approaches developers can take when building voice agents:

  • Speech-to-speech models: These advanced models directly process audio input and generate audio output. They're faster and more seamless but currently less reliable for complex tasks.
  • Chained approach: This modular system combines three specialized components: speech-to-text conversion, language model processing, and text-to-speech synthesis. This approach offers greater control, reliability, and flexibility.

The announcement focused primarily on enhancing this second, chained approach, making it easier than ever for developers to create sophisticated voice experiences.
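To make the chained approach concrete, here is a minimal sketch using the OpenAI Python SDK. The model names come from the announcement, but the file names, prompt wording, and voice choice are placeholder assumptions rather than OpenAI's reference implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: speech-to-text -- transcribe the user's spoken request.
with open("user_question.wav", "rb") as audio_file:  # hypothetical recording
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# Step 2: language model -- reason about the request in text.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)
answer_text = reply.choices[0].message.content

# Step 3: text-to-speech -- speak the answer back to the user.
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input=answer_text,
)
speech.write_to_file("answer.mp3")
```

Each stage can be swapped or tuned independently, which is exactly the control and flexibility the chained approach trades a little latency for.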

Three Groundbreaking Models to Power Voice AI

OpenAI unveiled three new models that collectively transform what's possible with voice technology. The first two are speech-to-text models:

  • GPT-4o Transcribe: OpenAI's flagship transcription model, priced at $0.006 per minute
  • GPT-4o Mini Transcribe: A smaller, more efficient model priced at $0.003 per minute

According to Shen from OpenAI's research team, both models are built on the GPT-4o model family and trained on "trillions of audio tokens." The performance improvements are dramatic: both models outperform OpenAI's previous Whisper models across every tested language.


The charts showed these models achieving state-of-the-art word error rates (the share of words a model gets wrong, counting substituted, deleted, and inserted words against a reference transcript), beating both previous internal models and competing options on the market. Beyond raw accuracy, these new APIs include crucial features for real-world applications:

  • Noise cancellation to handle background sounds
  • Semantic voice activity detection to naturally segment speech
  • Streaming capabilities for real-time processing
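In its simplest form, a transcription call with the new flagship model looks like the sketch below; the file name is a stand-in for your own audio, and `response_format="text"` returns the transcript as a plain string:

```python
from openai import OpenAI

client = OpenAI()

# Transcribe a recording with the new flagship transcription model.
# "meeting.mp3" is a placeholder for your own audio file.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",  # return the raw transcript string
    )

print(transcript)
```

The same endpoint also supports streaming for real-time use cases, though the snippet above keeps to the basic one-shot call.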

Perhaps the most exciting announcement was the new GPT-4o Mini TTS (Text-to-Speech) model that introduces unprecedented control over vocal delivery. Unlike traditional TTS systems that focus solely on converting text to speech, GPT-4o Mini TTS allows developers to shape how the text is spoken through simple prompts. The team demonstrated this through a retro-styled interface called OpenAI.fm, where users can:

  • Choose from various voice options
  • Type text to be spoken
  • Provide instructions about tone, emotion, energy level, and delivery style

In a demonstration, Yaroslav from the engineering team showed how the same voice could deliver identical text in completely different ways—first as an excitable "mad scientist" character with chaotic energy, then in a calm, encouraging tone—simply by changing the prompt instructions.
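In code, that demo comes down to changing a single string. Here is a sketch with the OpenAI Python SDK, assuming the `instructions` parameter shown in the announcement; the prompt wording, voice choice, and file names are invented for illustration:

```python
from openai import OpenAI

client = OpenAI()

text = "Welcome back! Let's pick up right where we left off."

# Same voice, same text -- only the delivery instructions change.
styles = {
    "mad_scientist": "Speak like an excitable mad scientist: fast, chaotic, gleeful.",
    "calm_coach": "Speak in a calm, warm, encouraging tone at a relaxed pace.",
}

for name, instructions in styles.items():
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input=text,
        instructions=instructions,
    )
    speech.write_to_file(f"{name}.mp3")
```

Listening to the two files side by side makes the point: the voice is identical, and only the prompted delivery differs.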


"The personality is not tuned into the model, it's just prompted," Yaroslav explained, highlighting the flexibility of this approach. "You can be as specific as you want. You can tell it exactly what kind of pacing, what kind of emotion you want."


This model is available for an estimated $0.015 per minute, making it remarkably accessible for developers looking to create expressive voice experiences.

Transforming Text Agents into Voice Agents with Minimal Code

The final component of OpenAI's announcement was an update to their recently released Agents SDK, which now makes it remarkably simple to convert existing text-based agents into voice agents.


During the demonstration, Yaroslav showed how an existing AI stylist and customer support agent—which could previously only interact through text—could be transformed into a fully functional voice agent with just nine lines of additional code. The updated SDK introduces a "voice pipeline" concept that:

  • Captures audio input from users
  • Converts it to text using the new transcription models
  • Processes it through the existing agent logic
  • Converts the response back to speech
  • Delivers it to the user

This approach allows developers to leverage all their existing work on text-based agents while adding voice capabilities with minimal effort.
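Based on the Agents SDK voice documentation, the conversion looks roughly like the sketch below. `VoicePipeline`, `SingleAgentVoiceWorkflow`, and `AudioInput` come from the SDK's voice module; the agent definition and the silent placeholder buffer standing in for microphone audio are invented for illustration:

```python
import asyncio

import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An existing text-based agent, unchanged from the text-only version.
agent = Agent(
    name="Stylist",
    instructions="You are a friendly AI stylist. Keep answers short and speakable.",
)

async def main():
    # Wrap the text agent in a voice pipeline: speech-to-text in,
    # existing agent logic in the middle, text-to-speech out.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # Three seconds of silence at 24 kHz stands in for real microphone input.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Stream synthesized audio chunks back as they arrive.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # feed event.data to your audio output device here

asyncio.run(main())
```

The agent itself needs no changes, which is the point: the pipeline wraps existing text logic in the speech-to-text and text-to-speech steps listed above.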

For developers concerned about debugging these voice interactions, the SDK also includes an updated tracing UI that supports audio playback and provides detailed metrics on latency, errors, and other performance indicators.

Real-World Applications: How Voice Agents Change the Game

The implications of these new tools extend far beyond technical improvements. They represent a fundamental shift in how we might interact with AI systems in daily life. Consider these potential applications:

  • Customer Service: Rather than navigating complex phone menus or typing into chat interfaces, customers could simply speak naturally to a voice agent that understands their needs and can access relevant information to help them.
  • Language Learning: Students learning a new language could practice conversation with a voice agent that adapts to their skill level, provides pronunciation feedback, and creates realistic scenarios for practice—all through natural speech.
  • Accessibility: For people with visual impairments, limited mobility, or literacy challenges, voice agents could make digital services and information far more accessible.
  • Professional Assistance: Professionals like doctors could use voice agents to take notes during patient visits, summarize key information, and even suggest potential diagnoses or treatments based on the conversation.

The key advantage of OpenAI's approach is that it combines the natural feel of voice interaction with the reasoning capabilities of their most advanced language models.


The Future of Voice AI: What's Next?

OpenAI hinted that more voice capabilities are coming in the months ahead, suggesting this announcement represents just the beginning of their push into voice-first experiences. As these tools become more widely adopted, we can expect several trends to emerge:

  • Greater personalization: Future voice agents will likely adapt their speaking style, vocabulary, and pacing to match individual user preferences and contexts.
  • Multimodal integration: Voice will increasingly combine with other modalities like vision, allowing agents to respond to what they see as well as what they hear.
  • Specialized domain expertise: We'll see voice agents with deep knowledge in specific domains like healthcare, legal, finance, and education.
  • Emotion recognition: Future systems will likely detect and respond appropriately to emotional cues in human speech.
  • Ambient intelligence: Voice agents may eventually become ambient interfaces that are always available but only engage when needed, similar to how we interact with other humans in shared spaces.

As a fun closing note to the announcement, OpenAI revealed that the OpenAI.fm demo tool is now live for public use. They're even running a contest through Friday night: users can tweet their most creative uses of the text-to-speech technology for a chance to win a Teenage Engineering OB-4 portable speaker.


