
Manager, Engineering (Production Orchestration)
Cockroach Labs
Posted about 4 hours ago
Category-defining tech. Career-defining work.
Lots of tech companies disrupt. But many fail when they try to scale. We're different. CockroachDB makes it easier for companies to build and scale apps. This is how and why we're helping some of the most innovative companies on the planet. We tackle problems head-on and focus on solutions that create lasting impact.
Because when our customers win, we all win.
The Role
At the heart of CockroachDB is our Production Orchestration team: the stewards of availability, reliability, and scalability across our cloud offerings and beyond. Built on a foundation of SRE principles and carrying forward years of operational practice, our core commitment is clear: ensuring our customers have a secure, reliable, and performant production service at scale.
We're looking for an Engineering Manager to lead our Production Orchestration team as part of a global Production Engineering organization. You'll drive foundational architectural changes to how we operate our fleet, champion AI-driven approaches to both development and operations, and foster a culture of operational excellence, ensuring CockroachDB meets and exceeds our SLAs while keeping pace with rapid growth.
You'll report to Tom Schmidt, Director of Production Engineering, who has led this team for 4+ years and will continue to be deeply involved in its technical direction. You'll be responsible for the growth and development of the team's engineers, day-to-day execution, and operational health, while bringing your own leadership and ideas to the table.
You Will
- Lead the Production Orchestration team, focused on the reliability, availability, and scalability of CockroachDB in production.
- Own operational excellence. Ensure the team is meeting or exceeding our SLAs, running effective incident response, and continuously improving our operational posture. Every incident is treated as a learning opportunity.
- Partner across the global Production Engineering organization to align on shared goals, ensure smooth coordination across time zones, and drive cohesive execution.
- Drive automation and tooling. Relentlessly reduce operational toil by building systems that improve observability and scale our fleet without scaling headcount linearly.
- Leverage AI to improve how the team builds and operates. Help the team adopt AI-assisted development practices and identify applied AI opportunities to improve operational workflows, from alert triage to capacity planning to incident response.
- Contribute to foundational architecture. The team is building a new architectural initiative that will reshape how we operate our fleet. You'll help lead execution on this work and ensure the team has the space and support to deliver.
- Coach and develop your engineers. Provide direct, constructive feedback. Guide personal development and career growth beyond just technical skills. Managing performance and ensuring engineers are achieving their goals is essential to retaining a high-performing team.
- Partner with engineering and product leadership to shape the roadmap for CockroachDB's operational capabilities and future products.
- Collaborate across teams to build and establish the tools and processes that empower everyone to make our customers successful.
The Expectations
In your first 30 days, you will become an integrated member of our engineering team. You'll spend time learning about the Production Orchestration team's domain, processes, and people, as well as CockroachDB and CockroachDB Cloud. You'll shadow on-call rotations, review recent incidents, and begin to understand the operational landscape. We believe it's essential for you to take this first month to become familiar with our technology and our company.
After 3 months, you will be fully integrated into the team and comfortable leading the Production Orchestration team's execution. You'll have built an understanding of our infrastructure, observability stack, and operational tooling. You'll understand the team's priorities and roadmap, have established working relationships with partner teams across Production Engineering, and be actively contributing to our incident response and operational review processes.
After 6 months, you'll be confidently managing the team and driving their work forward. You'll be shaping how the team approaches its new architectural work, identifying opportunities to apply AI to operational challenges, and ensuring that each member of your team is working on projects that align with both our needs and their interests. You'll be a key voice in Production Engineering's strategic direction.
You Have
- A passion for building relationships and a deep sense of responsibility for the welfare of the engineering team you manage, including their professional development and growth. We're looking for managers who want to empower their team to achieve their professional and personal goals.
- Experience leading global operations and/or incident management and response.
- Experience working on complex technical products with exposure to distributed systems, cloud infrastructure, container orchestration, or large-scale fleet management.
- A strong SRE or Production Engineering background. You understand the principles of reliability engineering, SLOs/SLAs, error budgets, and the engineering approach to operations.
- Comfort with programming languages like Go and Python. We use Go, but if you don't know it, you'll learn while you're here.
- Solid systems architecture knowledge and an understanding of how a variety of teams' interactions may impact operational reliability.
- Experience with performance management, understanding the importance of building an effective team that can function independently while collaborating and supporting each other.
- Partnered across departments, ensuring coordination with internal teams and external partner teams across time zones.
Bonus (You Have)
- Grown or managed teams that coordinate across multiple time zones.
- Experience supporting workloads across multiple cloud providers (GCP, AWS, Azure).
- Leveraged or, even better, built observability tooling for your team and the rest of your org.
- Experience applying AI/ML to operational workflows (e.g., intelligent alerting, automated remediation, capacity forecasting).
- Familiarity with CockroachDB or distributed SQL databases.
The Team
Tom Schmidt, Director, Production Engineering
Tom leads Cockroach Labs' Production Engineering org, responsible for the operational reliability and scalability of CockroachDB. He joined Cockroach Labs in August 2022 as manager of Site Reliability Engineering and has since taken responsibility for the broader production engineering organization. Before CRL, Tom spent 15 years at IBM, initially in technical leadership roles spanning compiler development, test frameworks, and CI/CD, before dedicating the latter half of his career to championing SRE across the organization. An enthusiastic advocate of the discipline, Tom has presented at conferences, developed certification curriculum, secured multiple patents, and was recognized as one of IBM's first three SRE Thought Leaders. Outside of work, Tom is a proud father of a 5-year-old boy and enjoys hiking, camping, and gaming.
Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at [email protected].
Cockroach Labs has a hybrid work model, with Roachers who are local to one of our offices coming in on Mondays, Tuesdays, and Thursdays and working flexibly the rest of the week. While we’ve learned valuable lessons working remotely, nothing can replace the connection, creativity, and fun that occur when Roachers get together, and we are committed to fostering a workplace that encourages collaboration and allows us all to do our best work.
Benefits
- Stock Options
- Medical Insurance
- Vision Insurance
- Dental Insurance
- Life and Disability Insurance
- Professional Development Funds
- Flexible Time Off
- Paid Holidays
- Paid Sick Days
- Paid Parental Leave
- Retirement Benefits
- Mental Wellbeing Benefits
- And more!
The annual anticipated base salary range for U.S. candidates for this role is listed in USD below. Salary is one component of the Cockroach Labs Total Rewards package, which also includes, for each employee: stock options, medical insurance, vision insurance, dental insurance, life and disability insurance, funds towards professional development resources, flexible paid time off, 11 paid holidays a year, 10 paid sick days a year, paid parental leave, a 401(k) plan, and wellbeing benefits.
We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar-stage growth companies. Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies.