Senior Site Reliability Engineer

Swoogo.com

Remote

United States

Full Time

Who You Are

You thrive in complex systems, but your solutions are anything but. You see patterns others miss, know when to dive deep, and when to step back and look across the architecture. You're not just solving problems. You’re designing platforms that prevent them.

As a key member of the Platform Engineering team, reporting to the VP of Platform Engineering, you’ll help build the backbone of our infrastructure. Reliability isn’t just a metric to you; it’s a principle. You bring a pragmatic mindset to stability, scaling, and automation, always thinking long-term. You document as you go, automate what you can, and build tools that make life easier for everyone who ships code.

You’re driven by curiosity and a deep love for building things that last. You’re always learning, always iterating, and just as committed to sharing knowledge as gaining it. Whether you’re debugging a tricky issue, designing new internal tooling, or helping evolve our platform architecture, you keep a clear head and collaborative spirit. If you're excited to help shape a platform team where reliability and developer experience go hand in hand, you’ll feel right at home here.

This role will report to the VP of Platform Engineering and IT.

About The Role

As a Senior Site Reliability Engineer at Swoogo, you’ll help shape the future of our platform. This is a hands-on, collaborative role where you’ll build reliable, automated systems and work closely with engineers across the company to champion operational excellence. Here, reliability means more than uptime. It means trust. And you'll be at the core of delivering it.

In this role, you will be responsible for:

Reliability & Uptime: Ensure high availability and resilience of our production systems. Anticipate problems before they arise.
Automation & Tooling: Build, improve, and maintain automation for infrastructure provisioning, deployments, and system operations.
Incident Management: Lead and participate in on-call rotations, troubleshoot production incidents, and drive incident reviews to prevent recurrence. Help shape culture around incident response.
Performance & Scalability: Resolve bottlenecks to improve system performance and define Swoogo’s operational standards.
Security & Compliance: Implement best practices for cloud infrastructure, identity, and access management.
Collaboration: Lead projects and partner closely with developers to improve observability, deployment pipelines, and overall developer productivity.
Continuous Improvement: You’re always looking to leave things better than you found them.

What You’ve Done Before

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Deep experience with cloud platforms (AWS preferred).
Expertise with infrastructure-as-code tools (Terraform preferred).
Deep knowledge of containers and orchestration.
Proficiency with CI/CD pipelines and deployment strategies.
Experience with monitoring and observability tools.
Coding/scripting ability in at least one language.
Solid understanding of networking, distributed systems, and systems-level troubleshooting.
Growth mindset and desire to learn and mentor.

It’d Be Great If You’ve Done This

Experience with chaos engineering and resilience testing.
Familiarity with compliance frameworks (SOC 2, ISO 27001, PCI DSS).
Experience in the events industry.

Swoogo & How We Work

Learn more about Swoogo, how we work, and our Perks & Benefits.