Senior Applied Researcher, Audio Generation
Cartesia.com
200k - 350k USD/year
Office
San Francisco, CA
Full Time
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.
The Role
We are seeking a Senior Applied Researcher to contribute to the development of our next-generation speech models. You will be responsible for designing, training, and deploying novel generative models for tasks like multilingual text-to-speech (TTS), voice conversion, music generation, and sound effect synthesis.
The challenge is no longer just about creating high-fidelity audio; it's about generating it with near-zero latency and giving users precise creative control. We aim to set new standards for accuracy, speed, and usability in production systems.
What You’ll Do
- Develop and optimize speech and audio models for production.
- Work with engineering to ship and scale your models across our target platforms: cloud, on-premise, and on-device.
- Develop model architectures and inference strategies specifically for low-latency, real-time performance on consumer hardware.
- Implement and refine mechanisms for fine-grained controllability, allowing for the manipulation of attributes like speaker identity, emotion, prosody, and acoustic style.
- Pioneer research on new architectures for generative modeling.
What We’re Looking For
- Proven experience in developing and training novel generative models, preferably for audio or speech.
- Clear understanding of the architectural trade-offs between model quality, inference speed, and memory footprint.
- Hands-on experience with model conditioning and control mechanisms.
Our Culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
Our Perks
🍽 Lunch, dinner and snacks at the office.
🏥 Fully covered medical, dental, and vision insurance for employees.
🏦 401(k).
✈️ Relocation and immigration support.
🦖 Your own personal Yoshi.
September 15, 2025