STN Inc

Site Reliability Engineer

Posted 12 days ago

Remote

Platform and software · shared across customers

Reports to: Director, Site Reliability

Location: Remote (US)

Department: Cloud Platform Engineering / SRE / Reliability

Position summary

The Site Reliability Engineer (SRE) owns reliability, observability, and incident response for the GPU One (GPUaaS) platform. The SRE defines and enforces SLOs aligned with contractual SLAs, builds the observability stack, and leads major incidents to resolution.

Key responsibilities

  • Define and operate Service Level Objectives (SLOs) aligned with customer SLAs

  • Build and maintain the observability stack including metrics, logs, traces, and alerting

  • Lead incident response and chair post-incident reviews

  • Drive automation to reduce toil and shorten mean time to recovery (MTTR)

  • Author and maintain operational runbooks alongside the NOC

  • Manage on-call rotation, escalation paths, and incident-management tooling

  • Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering

  • Drive chaos engineering, game days, and reliability testing programs

  • Produce SLA performance reports in coordination with the SLA Manager

  • Mentor junior engineers and contribute to engineering culture

Required qualifications

  • 5+ years in SRE, DevOps, or production engineering roles

  • Strong programming skills in Go, Python, or both

  • Hands-on experience operating Kubernetes-based platforms at scale

  • Deep familiarity with observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry)

  • Strong incident management experience including major-incident command

Preferred qualifications

  • GPU or HPC platform operational experience

  • Familiarity with SLA-driven customer environments and credit calculations

  • Experience with chaos engineering tools (Gremlin, Litmus, or similar)

  • Published SRE content or contributions

Job details

Workplace: Remote

Location: Remote

Secure, production-grade GPU cloud for AI teams. SOC 2 & HIPAA compliant with 99.999% uptime, no noisy neighbors, and expert human support.

Employees: 83

Industry: IT Services and IT Consulting

Headquarters: Pleasanton, California

Founded: 2016

Specialties: Managed Services, SOC2 Certified, Cyber Security, Risk Assessments, HIPAA, Compliance, Managed SIEM, Backup, Recovery, Incident Response, Ransomware Prevention, Penetration Testing, Social Engineering, Network Engineering, and VAR Reseller

Key team members

Sabur Mian

Christopher Chua

Trevor Walker

Tom Genn
