Amazon unveils a new AI voice model, Nova Sonic


On Tuesday, Amazon debuted a new generative AI model, Nova Sonic, capable of natively processing voice and generating natural-sounding speech. Amazon claims that Sonic’s performance is competitive with frontier voice models from OpenAI and Google on benchmarks measuring speed, speech recognition, and conversational quality.

Nova Sonic is Amazon’s answer to newer AI voice models such as the one powering ChatGPT’s Voice Mode, which feels more natural to speak with than the rigid models of Amazon Alexa’s early days. Recent technological breakthroughs have made those legacy models, and the digital assistants they underpin, such as Alexa and Apple’s Siri, seem incredibly stilted by comparison.

Nova Sonic is available through Bedrock, Amazon’s developer platform for building enterprise AI applications, via a new bi-directional streaming API. In a press release, Amazon called Nova Sonic “the most cost-efficient” AI voice model on the market, claiming it is around 80% less expensive than OpenAI’s GPT-4o.
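Amazon didn’t publish sample code alongside the announcement, but the shape of a bi-directional streaming session is worth making concrete: microphone audio streams upstream while transcripts and synthesized audio stream back concurrently. The sketch below is a toy model of that pattern only; every name in it (`VoiceSession`, `send_audio`, the event types) is a hypothetical stand-in, not Bedrock’s actual API surface.

```python
import asyncio

# Hypothetical stand-in for a bi-directional streaming session. These names
# are illustrative only; they are not Bedrock's real API.
class VoiceSession:
    def __init__(self) -> None:
        self._events: asyncio.Queue = asyncio.Queue()

    async def send_audio(self, chunk: bytes) -> None:
        # A real session would ship mic audio upstream; this fake echoes back
        # a transcript event and a synthesized-audio event per chunk.
        await self._events.put({"type": "transcript", "text": f"<heard {len(chunk)} bytes>"})
        await self._events.put({"type": "audio", "data": b"\x00" * 320})

    async def recv(self) -> dict:
        return await self._events.get()

async def stream_microphone(session: VoiceSession, frames: int) -> None:
    for _ in range(frames):                    # pretend these are 20 ms mic frames
        await session.send_audio(b"\x01" * 640)
        await asyncio.sleep(0.02)

async def play_responses(session: VoiceSession, expected: int) -> None:
    for _ in range(expected):                  # both directions run concurrently
        event = await session.recv()
        if event["type"] == "transcript":
            print("transcript:", event["text"])
        else:
            print("audio chunk:", len(event["data"]), "bytes")

async def main() -> None:
    session = VoiceSession()
    # 3 uploaded frames produce 2 events each in this fake, hence expected=6.
    await asyncio.gather(stream_microphone(session, 3), play_responses(session, 6))

asyncio.run(main())
```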

Components of Nova Sonic are already powering Alexa+, Amazon’s upgraded digital voice assistant, according to Amazon SVP and Head Scientist of AGI Rohit Prasad.

In an interview, Prasad told TechCrunch that Nova Sonic builds on Amazon’s expertise in “large orchestration systems,” the technical scaffolding that makes up Alexa. Compared to rival AI voice models, Nova Sonic excels at routing user requests to different APIs, said Prasad. This capability helps Nova Sonic “know” when it needs to fetch real-time information from the internet, parse a proprietary data source, or take action in an external application — and use the appropriate tool to do it.
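To illustrate what routing requests to different tools means in practice, here is a deliberately simplified sketch: a registry of handlers and a rule that picks one per utterance. In Nova Sonic the model itself makes this decision as part of generation; the keyword rules and handler names below are invented stand-ins, not Amazon’s implementation.

```python
from datetime import datetime, timezone

# Illustrative tool handlers; names and behavior are assumptions for the sketch.
def web_search(query: str) -> str:
    return f"(would fetch live results for: {query})"

def query_database(query: str) -> str:
    return f"(would look up a proprietary data source for: {query})"

def current_time(_: str) -> str:
    return datetime.now(timezone.utc).isoformat()

TOOLS = {"search": web_search, "database": query_database, "time": current_time}

def route(utterance: str) -> str:
    """Pick a tool for the user's request; a real router would let the model decide."""
    text = utterance.lower()
    if "time" in text:
        return TOOLS["time"](utterance)
    if "order" in text or "account" in text:
        return TOOLS["database"](utterance)
    return TOOLS["search"](utterance)

print(route("What time is it?"))
print(route("Where is my order?"))
```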

During a two-way dialogue, Nova Sonic waits to speak “at the appropriate time,” taking into account a speaker’s pauses and interruptions, according to Amazon. It also generates a text transcript of the user’s speech, which developers can use for various applications.
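The interruption handling Amazon describes is often called barge-in: when the user starts talking over the assistant, playback stops and the model yields the floor. A toy asyncio sketch of that cancellation pattern (all names and timings invented; not Nova Sonic’s internals):

```python
import asyncio

async def speak(reply: str) -> None:
    # Pretend playback: each word takes ~50 ms to "play".
    try:
        for word in reply.split():
            print("assistant:", word)
            await asyncio.sleep(0.05)
    except asyncio.CancelledError:
        print("assistant: (stopped mid-sentence; the user started talking)")
        raise

async def main() -> None:
    playback = asyncio.create_task(speak("Here is a long answer to your question"))
    await asyncio.sleep(0.12)   # user barges in roughly 120 ms into the reply
    playback.cancel()           # yield the floor immediately
    try:
        await playback
    except asyncio.CancelledError:
        pass

asyncio.run(main())
```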

Nova Sonic is less prone to speech recognition errors than other AI voice models, according to Prasad, meaning the model is relatively good at understanding a user’s intent even if they mumble, misspeak, or are in a noisy setting. On Multilingual LibriSpeech, a benchmark measuring speech recognition across languages and dialects, Amazon says Nova Sonic achieved a word error rate (WER) of just 4.2% averaged across English, French, Italian, German, and Spanish. That means roughly four out of every 100 words the model transcribed differed from a human transcription in those languages.
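For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the model’s transcript and a human reference, divided by the number of words in the reference. A minimal implementation makes the 4.2% figure concrete; the transcripts below are made up for illustration, not drawn from the benchmark.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a 10-word reference = 10% WER.
ref = "play the next song on my living room speaker now"
hyp = "play the next song on my living room speakers now"
print(f"WER: {wer(ref, hyp):.1%}")
```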

On another benchmark measuring noisy, multi-participant interactions, Augmented Multi-Party Interaction, Amazon says Nova Sonic was 46.7% more accurate in terms of WER than OpenAI’s GPT-4o-transcribe model. Nova Sonic also boasts industry-leading speed, according to Amazon, with an average perceived latency of 1.09 seconds. That makes it faster than the GPT-4o model powering OpenAI’s Realtime API, which responds in 1.18 seconds, per benchmarking by Artificial Analysis.
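“46.7% more accurate in terms of WER” is most naturally read as a relative reduction in error rate. Since Amazon didn’t publish the underlying scores, the numbers below are hypothetical and only show what the claim implies arithmetically.

```python
# Hypothetical: if the rival model scored 15.0% WER on the same audio, a 46.7%
# relative reduction would put Nova Sonic at roughly 8.0% WER.
rival_wer = 0.150               # assumed, not a published figure
relative_improvement = 0.467    # Amazon's claimed margin over GPT-4o-transcribe
nova_wer = rival_wer * (1 - relative_improvement)
print(f"implied Nova Sonic WER: {nova_wer:.1%}")   # -> 8.0%
```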

Prasad says Nova Sonic is a part of Amazon’s broader strategy to build AGI (artificial general intelligence), which the company defines as “AI systems that can do anything a human can do on a computer.” Moving forward, Prasad says Amazon plans to release more AI models that can understand different modalities, including image, video, and voice, as well as “other sensory data that are relevant if you bring things into the physical world.”

Amazon’s AGI division, which Prasad oversees, seems to be playing a larger role in the company’s product strategy these days. Just last week, Amazon launched a preview of Nova Act, a browser-using AI model that appears to be powering elements of Alexa+ and Amazon’s Buy for Me feature. Starting with Nova Sonic, Prasad says the company wants to offer more of its internal AI models for developers to build with.

