Job Purpose

Lead a team of site reliability engineers who deliver incident detection, triage, and runbook-based remediation for production cloud-native environments in support of our North American customers. Set the operational standard for triage and recovery, act as the senior escalation point, and serve as the primary technical liaison to the Service Delivery Manager.

Key Responsibilities

• Lead incident detection, triage, and first response across production cloud and Kubernetes environments in support of our North American customers.

• Execute and oversee approved runbooks for service restoration — workload and node restarts, scaling, rollbacks, and database stabilization — within agreed operational boundaries.

• Act as the senior escalation authority; prepare clear escalation summaries covering impact, actions taken, current state, and recommended next steps.

• Author, review, and maintain operational runbooks; continuously improve detection, alerting, and automation.

• Engage cloud-provider support (AWS, GCP) for platform-level failures and vendor escalations.

• Technically supervise and mentor the SRE team; review handoffs and ensure consistency across shifts.

• Own daily shift handoffs and contribute to monthly service reporting and reviews.

People Management

• Provides technical leadership and day-to-day supervision of the SRE team.

• Contributes to coaching, performance input, and skills development; formal line management sits with the Service Delivery Manager.

Financial Responsibility

• Accountable for protecting service levels and cost-to-serve through efficient, automation-first operations.

Key Performance Indicators (KPIs)

• Service-level (SLO/SLA) attainment

• Mean time to acknowledge / mean time to resolve

• Runbook coverage and quality

• Escalation accuracy and completeness

• Shift-handoff quality and reporting timeliness

• Repeat-incident reduction and automation adoption

Requirements

Education & Certifications

• Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

• Certified Kubernetes Administrator (CKA) required; CKAD, AWS, and Google Cloud certifications strongly preferred.

Experience

• 7+ years in SRE, DevOps, or production infrastructure operations, including 3+ years operating Kubernetes in production.

• Proven track record leading incident response for production cloud workloads.

• Managed-services / MSP or 24×7 operations experience preferred.

Skills & Competencies

Technical Skills

• Kubernetes operations across AWS EKS and GCP GKE

• AWS and GCP core services (compute, storage, networking, scaling, IAM)

• Relational database operational recovery (e.g., PostgreSQL)

• Observability platforms (e.g., Datadog)

• Scripting and automation (Bash, Python, Go, or equivalent); ability to read Terraform/IaC

• Incident command and structured troubleshooting

Soft Skills

• Calm, decisive incident leadership under pressure

• Clear written and verbal English

• Mentoring and team collaboration

• Time management

Tools / Software

• Datadog

• Jira / ServiceNow

• Confluence / GitHub Wiki

• AWS & GCP consoles

• Slack / Microsoft Teams

Benefits

What We Offer

• Competitive compensation package

• Competitive benefits package

• Company perks, Good Life gym, and various brand discounts

• Company events, recognitions, and celebrations

• Career development and growth opportunities

Senior Site Reliability Engineer