
Staff Software Engineer - Reliability
About Team & About Role
The Site Reliability Engineering (SRE) team at Rubrik ensures the absolute reliability, availability, performance, and security of our enterprise infrastructure services, spanning both global SaaS platforms and government-compliant environments. We operate at the intersection of software development and systems engineering, prioritizing hyperscale platform automation, self-healing architectures, and structural resiliency. As a Staff Site Reliability Engineer, you will serve as a primary technical leader and architect across our broader distributed cloud systems. You will drive long-term technical roadmaps, establish cross-organizational reliability standards, and solve complex distributed systems challenges that safeguard both enterprise and public sector environments.
Beyond the core SRE charter, this Staff role also leads the Application-SRE team — a US-based group that partners closely with engineering, Sales, and Support to unblock POCs, drive complex customer escalations to resolution, and convert recurring field signals into engineering and reliability roadmap items. You will be the technical leader and project owner for Application-SRE: setting direction, tracking commitments, and ensuring the team operates as a high-leverage bridge between the field and the broader engineering org.
What You'll Do
As a Staff Site Reliability Engineer, you will have engineering-wide influence and take ownership of the following critical areas:
- Infrastructure Strategy & Architecture: Formulate and execute the architectural vision for Rubrik's Cloud Platform, optimizing backend infrastructure systems like Kubernetes, MySQL, and cloud-native services for performance, security, and multi-region scale.
- Hyperscale Automation & Platform Tooling: Build, scale, and maintain sophisticated custom internal tools, platform controllers, and automation frameworks in Go or Python to systematically eliminate operational toil.
- Cross-Functional Leadership: Wield engineering-wide influence to create technical consensus among component, platform, and security engineering teams, effectively "shifting left" to embed structural resilience, capacity guards, and compliance from initial feature designs.
- Reliability Governance: Define, audit, and enforce robust Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets across all critical enterprise platform services, translating telemetry insights into actionable product roadmaps during executive reviews.
- Incident Command & Operations Review: Serve as a primary Incident Commander for high-severity cloud outages, establishing roles, directing mitigation vectors under pressure, and orchestrating comprehensive, blameless post-mortems that drive durable systemic fixes.
- Cost Governance & Capacity Modeling: Architect cost-observability tools and attribution frameworks, leading cloud infrastructure capacity forecasting, resource quota optimization, and vendor SLA management.
- Application-SRE Leadership: Set the technical direction for the Application-SRE team, raising the bar on how the team diagnoses, mitigates, and durably resolves the most complex customer-impacting issues across our platform.
- Technical Multiplier & Mentorship: Champion SRE best practices, mentoring senior and junior individual contributors across the organization, participating in interview frameworks, and actively raising the collective technical bar.
- On-Call Rotations: Participate in on-call rotations.
Experience You'll Need
- Citizenship & Residency: Must be a US citizen currently residing in the continental United States (CONUS); this is a strict regulatory requirement to enable support for federal and FedRAMP environments when required.
- Education: BS, MS, or PhD in Computer Science, Computer Engineering, or a highly related technical discipline.
- Industry Experience: 8–12+ years of software engineering and production cloud infrastructure experience, with at least 5 years dedicated to a formal SRE, DevOps, or Platform engineering role operating hyperscale SaaS products.
- Technical Depth: Comprehensive, hands-on programming expertise in Golang, Python, or Java with a deep grasp of concurrency models, data structures, and test-driven software design patterns.
- Distributed Systems Expertise: Proven proficiency designing, deploying, analyzing, and auditing complex, large-scale distributed systems, database topologies, and high-availability public cloud meshes.
- Systems Internals: Authoritative operational command of Unix/Linux operating system environments (process models, file systems, kernels), systems administration, and advanced L4/L7 networking protocols.
- Field-to-Product Feedback Loop: Experience institutionalizing the channel that converts patterns from customer escalations and POCs into prioritized product and reliability feedback, partnering directly with Product, Sales Engineering, and Support leadership.
- Customer & Field Fluency: Track record of partnering directly with Sales, Support, and customers on escalations and POCs, and translating field signals into engineering action.
- Leadership Capability: Demonstrated history of technical leadership, mapping architectural dependencies, managing multi-team technical projects, and guiding organizations through critical platform shifts with high technical judgment.
Preferred Qualifications
- Extensive production experience provisioning, lifecycle-managing, and recovering enterprise-scale Kubernetes (GKE, EKS) deployments and large-scale relational/non-relational databases (MySQL).
- Prior experience building, certifying, or auditing infrastructure environments under compliance structures such as FedRAMP (High/Moderate), SOC 2, ISO 27001, or CJIS.
- Fluency in Infrastructure-as-Code (Terraform, Pulumi) module design, multi-tenant state isolation, and enterprise observability fabrics (Prometheus, Grafana, OpenTelemetry).
Security and Privacy Responsibilities
This position carries special Security and Privacy Responsibilities for protecting the U.S. Federal Government’s interests:
- Know, acknowledge, and follow system-specific security policies and procedures;
- Protect data and individual privacy per requirements and regulations;
- Perform ongoing activities in compliance with service and contractual obligations;
- Participate in role-based training, completing assignments on a timely basis;
- Report security issues promptly, and aid investigation when needed;
- Support controlled changes and vulnerability remediation activities; and
- Work collaboratively with Information Security in designing, implementing, assessing or enhancing system-specific security and privacy controls.
Position Risk Designation
This position carries duties and responsibilities involving the U.S. Federal Government’s interests. The selected incumbent may be subject to one or more of the additional background checks noted below, with periodic re-screening:
Position Risk Designation: Non-Sensitive, Low Risk, Tier 1
Incumbents without access to U.S. Government data may be required to complete Standard Form 85 and undergo a Tier 1 Investigation (T1) for non-sensitive positions of Low Risk. (Baseline screening; formerly National Agency Check and Inquiries (NACI)).
Position Risk Designation: Non-Sensitive, Moderate Risk, Tier 2 (Public Trust)
Incumbents with access to U.S. Government data may be required to complete Standard Form 85P and undergo a Tier 2 Investigation (T2) for non-sensitive positions designated Moderate Risk.
Position Risk Designation: Moderate Risk Law Enforcement (CJIS)
When hired for a position where access to Moderate Risk criminal justice information is required, the employee must complete a fingerprint-based national criminal history background check within 30 days after the employee’s start date.
Join Us in Securing and Accelerating the World's AI Transformation
Rubrik (RBRK), the Security and AI Operations Company, leads at the intersection of data protection, cyber resilience, and enterprise AI acceleration. Rubrik Security Cloud delivers complete cyber resilience by securing, monitoring, and recovering data, identities, and workloads across clouds. Rubrik Agent Cloud accelerates trusted AI agent deployments at scale by monitoring and auditing agentic actions, enforcing real-time guardrails, fine-tuning for accuracy and undoing agentic mistakes.
Inclusion @ Rubrik
At Rubrik, we are dedicated to fostering a culture where people from all backgrounds are valued, feel they belong, and believe they can succeed. Our commitment to inclusion is at the heart of our mission to secure the world’s data.
Our goal is to hire and promote the best talent, regardless of background. We continually review our hiring practices to ensure fairness and strive to create an environment where every employee has equal access to opportunities for growth and excellence. We believe in empowering everyone to bring their authentic selves to work and achieve their fullest potential.
Our inclusion strategy focuses on three core areas of our business and culture:
- Our Company: We are committed to building a merit-based organization that offers equal access to growth and success for all employees globally. Your potential is limitless here.
- Our Culture: We strive to create an inclusive atmosphere where individuals from all backgrounds feel a strong sense of belonging, can thrive, and do their best work. Your contributions help us innovate and break boundaries.
- Our Communities: We are dedicated to expanding our engagement with the communities we operate in, creating opportunities for underrepresented talent and driving greater innovation for our clients. Your impact extends beyond Rubrik, contributing to safer and stronger communities.
Equal Opportunity Employer/Veterans/Disabled
Rubrik is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.
Rubrik provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. In addition to federal law requirements, Rubrik complies with applicable state and local laws governing nondiscrimination in employment in every location in which the company has facilities. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.