Introducing VORA v1
A research preview of our most advanced voice synthesis model. Available to Enterprise users and developers worldwide.

Image generated by MUSE
We're releasing a research preview of VORA v1—our most advanced model for voice synthesis. VORA v1 achieves a new state-of-the-art in generating hyperrealistic speech and audio while maintaining remarkable efficiency. By leveraging novel neural architectures, VORA v1 produces natural-sounding voices with unprecedented emotional range and tonal accuracy.
Early testing shows that VORA v1 creates voices that are nearly indistinguishable from human speech. Its advanced understanding of linguistic nuance, emotional context, and natural speech patterns makes it ideal for applications ranging from accessibility tools to entertainment production. We also find that it requires significantly fewer computational resources than previous models.
We're sharing VORA v1 as a research preview to better understand its strengths and limitations. We're eager to see how people use it in ways we might not have expected.
Model evaluation scores
| Metric | VORA v1 | Previous SOTA | Aura-2 |
| --- | --- | --- | --- |
| Mean Opinion Score (MOS) | 4.76 | 4.21 | 4.38 |
| Word Error Rate (WER) | 2.1% | 5.7% | 4.2% |
| Speech Naturalness (MUSHRA) | 92.5% | 78.4% | 83.7% |
| Emotion Recognition Accuracy | 87.4% | 65.9% | 72.3% |
| Generation Speed (RTF) | 0.08 | 0.26 | 0.17 |
| Model Size (parameters) | 1.6B | 2.8B | 3.5B |
*Numbers shown represent best internal performance.
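To put the generation speed numbers in context: real-time factor (RTF) is the ratio of synthesis time to audio duration, so an RTF below 1.0 means audio is generated faster than it plays back. A minimal sketch of the arithmetic, using the RTF values reported in the table above:

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = synthesis time / audio duration,
    so synthesis time = RTF * audio duration."""
    return rtf * audio_seconds

# Time to generate one minute of audio at each model's reported RTF:
for name, rtf in [("VORA v1", 0.08), ("Previous SOTA", 0.26), ("Aura-2", 0.17)]:
    print(f"{name}: {generation_time(60, rtf):.1f} s")
# VORA v1: 4.8 s, Previous SOTA: 15.6 s, Aura-2: 10.2 s
```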
Deeper audio quality metrics
Charts: Voice Quality Metrics (higher is better) and Generation Latency (ms, lower is better).
VORA v1 delivers superior voice quality with significantly lower computational requirements.
Comparative evaluations with human listeners
Human evaluators rated VORA v1 outputs across multiple dimensions, with results showing strong preference over previous models.
Use cases
Accessibility
VORA v1 enables more natural-sounding text-to-speech for people with visual impairments or reading difficulties, with emotional nuance that conveys meaning more effectively.
Content Creation
Creators can generate high-quality voiceovers for videos, podcasts, and other media without professional recording equipment, saving time and resources.
Localization
Efficiently localize audio content across languages while preserving speaker identity, emotions, and natural speech patterns.
Stronger synthesis capabilities
VORA v1 represents a significant advancement in neural voice synthesis technology. Unlike previous models that often sound robotic or lack emotional range, VORA v1 synthesizes speech that captures the full spectrum of human vocal expression. Compared to other voice models, VORA v1 is more general-purpose and inherently more natural-sounding.
We believe voice synthesis will be a core capability of future AI systems, and that the two approaches to scaling—pre-training and acoustic modeling—will complement each other. As models like VORA v1 become more capable through advanced neural architectures, they will serve as an even stronger foundation for expressive and efficient speech generation.
Safety
Each increase in model capabilities presents new opportunities and responsibilities. VORA v1 was developed with safety as a priority, incorporating multiple layers of protections against potential misuse. These include voice authentication systems, audio watermarking, and content filtering mechanisms.
To evaluate our safety measures, we conducted extensive testing before deployment, in accordance with our Responsible AI Framework. We found that our multi-layered approach significantly reduced potential risks while maintaining the model's performance. We are publishing the detailed results from these evaluations in the accompanying system card.
How to use VORA v1 in the API
We're making VORA v1 available in the Speech Synthesis API to developers on all paid usage tiers. The model supports key features like voice cloning (with proper authentication), emotion control, and real-time generation. It also supports advanced audio capabilities including background noise preservation and seamless audio transitions.
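As a rough illustration of how a synthesis request might be assembled, here is a minimal sketch. The function name, parameter names, and payload shape are illustrative assumptions, not the documented Speech Synthesis API; consult the API reference for the actual request format.

```python
def build_synthesis_request(text, voice, emotion=None, stream=False):
    """Assemble a JSON payload for a hypothetical speech-synthesis call.

    `emotion` sketches the emotion-control feature; `stream=True`
    sketches real-time generation. Both names are assumptions.
    """
    payload = {
        "model": "vora-v1",
        "input": text,
        "voice": voice,   # a preset voice, or a cloned-voice ID (with authentication)
        "stream": stream,
    }
    if emotion is not None:
        payload["emotion"] = emotion
    return payload

req = build_synthesis_request(
    "Hello, and welcome back to the show.",
    voice="narrator",
    emotion="warm",
    stream=True,
)
# The payload would then be POSTed to the synthesis endpoint with your API key,
# e.g. requests.post(API_URL, headers={"Authorization": f"Bearer {KEY}"}, json=req)
```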
Based on early testing, developers may find VORA v1 particularly useful for applications that benefit from its higher naturalness and emotional intelligence—such as dialogue systems, audiobook narration, and personalized content delivery. It also shows strong capabilities in multi-lingual speech synthesis and cross-lingual voice preservation.
VORA v1 is a very large and compute-intensive model, making it more expensive than previous models. Because of this, we're evaluating whether to continue offering it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If VORA v1 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
Conclusion
With every new order of magnitude of compute come novel capabilities. VORA v1 sits at the frontier of what is possible in voice synthesis, and we continue to be surprised by the quality and emotional range achievable with this technology. With VORA v1, we invite you to explore that frontier and uncover novel capabilities with us.