OpenAI expands AI capabilities with new audio models for voice agents

ChatGPT maker OpenAI has launched new speech-to-text and text-to-speech audio models in its API to enhance voice agents. OpenAI said the new models "set a new state-of-the-art benchmark, outperforming existing solutions in accuracy and reliability—especially in challenging scenarios involving accents, noisy environments, and varying speech speeds."

These enhancements improve transcription accuracy, making the models particularly effective for applications such as customer service call centres, meeting note-taking, and similar use cases.

Developers will now be able to instruct the text-to-speech model to speak in a specific way. To illustrate, OpenAI gave the example of a developer telling the voice agent to "talk like a sympathetic customer service agent." In its blog post, the Sam Altman-led company claimed that such instructions unlock a new level of customisation for voice agents.

Speech-to-Text audio models: What do we know

OpenAI has introduced the new GPT-4o Transcribe and GPT-4o Mini Transcribe models, which are said to offer a lower word error rate (WER), improved language recognition, and greater transcription accuracy than the original Whisper models.

The GPT-4o Transcribe model is claimed to deliver lower WER across multiple benchmarks, marking a significant advance in speech-to-text performance.

With these upgrades, the new models are said to be more effective at capturing speech nuances, minimising errors, and ensuring higher transcription reliability. OpenAI has claimed that they perform particularly well in challenging conditions, such as strong accents, background noise, and varying speech speeds. These models are now available through the speech-to-text API.
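
As a minimal sketch of what calling the new transcription models looks like in practice (assuming the official OpenAI Python SDK; the file name and client setup below are illustrative, not part of the announcement):

```python
# Minimal sketch: transcribing a recording with the new speech-to-text model.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the
# environment; "meeting.mp3" is a placeholder file name.
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe"
        file=audio_file,
    )

print(transcription.text)
```

The request shape mirrors the existing Whisper transcription endpoint, so swapping in the new model name is the main change for developers already using speech-to-text.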

Text-to-Speech audio model: What do we know

OpenAI has introduced the GPT-4o Mini TTS model, offering improved steerability in text-to-speech generation. For the first time, developers can guide the model not only on what to say but also on how to say it, allowing for more personalised and dynamic voice outputs. This advancement enhances applications such as customer service and creative storytelling.

The model is now accessible through the text-to-speech API. The blog post read, “Note that these text-to-speech models are limited to artificial, preset voices, which we monitor to ensure they consistently match synthetic presets.”
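
A minimal sketch of the steerability described above, using the OpenAI Python SDK's streaming speech endpoint (the voice name, sample text, and output file are placeholders, not from OpenAI's post):

```python
# Minimal sketch: steering delivery with the `instructions` parameter of the
# new text-to-speech model. Voice name and output path are placeholders.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the preset voices
    input="Your refund has been processed. Is there anything else I can help with?",
    instructions="Talk like a sympathetic customer service agent.",
) as response:
    response.stream_to_file("reply.mp3")
```

The `input` field carries what to say, while `instructions` carries how to say it, which is the separation OpenAI describes as new with this model.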

All of the latest models are now accessible to developers via OpenAI's API. Additionally, OpenAI has integrated these models with its Agents SDK, streamlining the development process.
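
As a rough, hedged sketch of that integration: the Agents SDK's voice extension wraps an agent in a pipeline that transcribes incoming audio, runs the agent, and speaks the reply back. The agent definition and silent input buffer below are placeholders, and exact class names may vary by SDK version:

```python
# Hedged sketch: wrapping an agent in a voice pipeline with the Agents SDK
# (pip install "openai-agents[voice]"). The agent below is a placeholder.
import asyncio

import numpy as np

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(
    name="Support agent",
    instructions="You are a sympathetic customer service agent.",
)

async def main() -> None:
    # Speech-to-text feeds the agent; its reply is rendered via text-to-speech.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    # Three seconds of silence stands in for real microphone audio (24 kHz mono).
    buffer = np.zeros(24_000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play or buffer event.data (PCM audio chunks) here

asyncio.run(main())
```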

For applications requiring real-time, low-latency speech-to-speech functionality, OpenAI recommends utilising its Realtime API for optimal performance.
