Introducing VORA v1
A research preview of our most advanced voice synthesis model. Available to Enterprise users and developers worldwide.

Image generated by MUSE
We're releasing a research preview of VORA v1—our most advanced model for voice synthesis. VORA v1 achieves a new state-of-the-art in generating hyperrealistic speech and audio while maintaining remarkable efficiency. By leveraging novel neural architectures, VORA v1 produces natural-sounding voices with unprecedented emotional range and tonal accuracy.
Early testing shows that VORA v1 creates voices that are nearly indistinguishable from human speech. Its advanced understanding of linguistic nuance, emotional context, and natural speech patterns makes it ideal for applications ranging from accessibility tools to entertainment production. We also find that it requires significantly fewer computational resources than previous models.
We're sharing VORA v1 as a research preview to better understand its strengths and limitations. We're eager to see how people use it in ways we might not have expected.
Model evaluation scores
| Metric | VORA v1 | Previous SOTA | Aura-2 |
| --- | --- | --- | --- |
| Mean Opinion Score (MOS) | 4.76 | 4.21 | 4.38 |
| Word Error Rate (WER) | 2.1% | 5.7% | 4.2% |
| Speech Naturalness (MUSHRA) | 92.5% | 78.4% | 83.7% |
| Emotion Recognition Accuracy | 87.4% | 65.9% | 72.3% |
| Generation Speed (RTF) | 0.08 | 0.26 | 0.17 |
| Model Size (parameters) | 1.6B | 2.8B | 3.5B |
*Numbers shown represent best internal performance.
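To put the generation speed numbers in context: real-time factor (RTF) is the ratio of synthesis time to audio duration, so an RTF below 1.0 means audio is generated faster than it plays back. A minimal sketch of the arithmetic, using the RTF values reported in the table above:

```python
def generation_time(audio_seconds: float, rtf: float) -> float:
    """Real-time factor (RTF) = synthesis time / audio duration,
    so synthesis time = RTF * audio duration."""
    return rtf * audio_seconds

# Time to generate one minute of audio at each model's reported RTF:
for name, rtf in [("VORA v1", 0.08), ("Previous SOTA", 0.26), ("Aura-2", 0.17)]:
    print(f"{name}: {generation_time(60, rtf):.1f} s")
# VORA v1: 4.8 s, Previous SOTA: 15.6 s, Aura-2: 10.2 s
```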
Deeper audio quality metrics
Charts: Voice Quality Metrics (higher is better) and Generation Latency (ms, lower is better).
VORA v1 delivers superior voice quality with significantly lower computational requirements.
Comparative evaluations with human listeners
Human evaluators rated VORA v1 outputs across multiple dimensions, with results showing strong preference over previous models.
Use cases
Accessibility
VORA v1 enables more natural-sounding text-to-speech for people with visual impairments or reading difficulties, with emotional nuance that conveys meaning more effectively.
Content Creation
Creators can generate high-quality voiceovers for videos, podcasts, and other media without professional recording equipment, saving time and resources.
Localization
Efficiently localize audio content across languages while preserving speaker identity, emotions, and natural speech patterns.
Stronger synthesis capabilities
VORA v1 represents a significant advancement in neural voice synthesis technology. Unlike previous models that often sound robotic or lack emotional range, VORA v1 synthesizes speech that captures the full spectrum of human vocal expression. Compared to other voice models, VORA v1 is more general-purpose and inherently more natural-sounding.
We believe voice synthesis will be a core capability of future AI systems, and that the two approaches to scaling—pre-training and acoustic modeling—will complement each other. As models like VORA v1 become more capable through advanced neural architectures, they will serve as an even stronger foundation for expressive and efficient speech generation.
Safety
Each increase in model capabilities presents new opportunities and responsibilities. VORA v1 was developed with safety as a priority, incorporating multiple layers of protections against potential misuse. These include voice authentication systems, audio watermarking, and content filtering mechanisms.
To evaluate our safety measures, we conducted extensive testing before deployment, in accordance with our Responsible AI Framework. We found that our multi-layered approach significantly reduced potential risks while maintaining the model's performance. We are publishing the detailed results from these evaluations in the accompanying system card.
How to use VORA v1 in the API
We're making VORA v1 available in the Speech Synthesis API to developers on all paid usage tiers. The model supports key features like voice cloning (with proper authentication), emotion control, and real-time generation. It also supports advanced audio capabilities including background noise preservation and seamless audio transitions.
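As a rough illustration of how a synthesis request might be assembled, here is a minimal sketch. The function name, parameter names, and payload shape are illustrative assumptions, not the documented Speech Synthesis API; consult the API reference for the actual request format.

```python
def build_synthesis_request(text, voice, emotion=None, stream=False):
    """Assemble a JSON payload for a hypothetical speech-synthesis call.

    `emotion` sketches the emotion-control feature; `stream=True`
    sketches real-time generation. Both names are assumptions.
    """
    payload = {
        "model": "vora-v1",
        "input": text,
        "voice": voice,   # a preset voice, or a cloned-voice ID (with authentication)
        "stream": stream,
    }
    if emotion is not None:
        payload["emotion"] = emotion
    return payload

req = build_synthesis_request(
    "Hello, and welcome back to the show.",
    voice="narrator",
    emotion="warm",
    stream=True,
)
# The payload would then be POSTed to the synthesis endpoint with your API key,
# e.g. requests.post(API_URL, headers={"Authorization": f"Bearer {KEY}"}, json=req)
```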
Based on early testing, developers may find VORA v1 particularly useful for applications that benefit from its higher naturalness and emotional intelligence—such as dialogue systems, audiobook narration, and personalized content delivery. It also shows strong capabilities in multi-lingual speech synthesis and cross-lingual voice preservation.
VORA v1 is a very large and compute-intensive model, making it more expensive than previous models. Because of this, we're evaluating whether to continue offering it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If VORA v1 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
Conclusion
With every new order of magnitude of compute come novel capabilities. VORA v1 sits at the frontier of what is possible in voice synthesis, and we continue to be surprised by the quality and emotional range achievable with this technology. With VORA v1, we invite you to explore that frontier and uncover novel capabilities with us.