Job Description

Role Overview

As a Senior Lead Site Reliability Engineer (SRE) specializing in Azure, you will be a hands-on technical owner of our cloud infrastructure. You will architect, build, and operate the systems that underpin our Azure-based SaaS offerings, owning reliability, scalability, and security from the infrastructure layer up. You will work in close partnership with R&D to embed operational excellence into the software delivery lifecycle, and you will take full ownership of every system within Cloud Operations' purview. You bring deep Azure and DevOps expertise, thrive in complex distributed environments, and raise the technical bar through the quality of your engineering work.

Key Responsibilities

Design, implement, and continuously improve Azure-based infrastructure for high-availability, mission-critical SaaS services — owning the full lifecycle from architecture through to production operation.
Own, operate, and continuously improve CI/CD pipelines across Jenkins, Azure DevOps, and GitHub Actions — including pipeline architecture, build performance, deployment reliability, secrets handling, and migration work as we evolve our toolchain. This is active ownership, not support.
Configure and maintain Ansible playbooks for configuration management, provisioning automation, and drift remediation across the infrastructure estate.
Build and maintain Infrastructure as Code using Terraform and/or ARM/Bicep, covering the full provisioning lifecycle — from initial environment build through to day-two operations and ongoing change management.
Work directly and continuously with R&D engineering teams to embed reliability, operability, and deployment quality into the software development lifecycle — including pipeline design reviews, pre-production environment ownership, release readiness, and incident learnings fed back into build practices.
Own the observability and alerting stack across Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and Pingdom — including metric collection, synthetic monitoring coverage, alerting thresholds, and dashboard design. Own the PagerDuty configuration end-to-end: escalation policies, routing rules, service integrations, and on-call schedule management. Act as the technical escalation point for complex incidents and participate in the team's on-call rotation.
Design, operate, and optimize AKS clusters for production workloads — including node pool configuration, autoscaling, network policy, ingress architecture, workload identity, and persistent storage patterns. Own cluster health, upgrade lifecycle, and capacity planning end-to-end.
Instrument Kubernetes workloads with Prometheus exporters and build Grafana dashboards that give engineering teams genuine operational visibility into service health, latency, error rates, and resource consumption.
Take full technical ownership of all systems within Cloud Operations' scope — infrastructure, tooling, pipelines, observability, and security controls. If it lives in our environment, you own its reliability, its documentation, and its improvement roadmap.
Lead root cause analysis on production incidents; author post-mortems with actionable engineering remediation, not just process changes.
Define, instrument, and own SLOs, SLIs, and error budgets for Azure-hosted SaaS services; use data to drive reliability investment decisions.
Engineer and enforce security controls across identity, access, secrets, and certificate management in Azure — including hands-on implementation, not just policy definition. Contribute directly to the technical controls, evidence collection, and continuous compliance posture required to maintain SOC 2 Type II, ISO 27001, and ISO 9001 certification across the Cloud Operations environment.
Evaluate emerging Azure services and features against real production requirements; build proof-of-concepts, validate at scale, and drive adoption where the engineering case is clear.
Produce and maintain architecture documentation, runbooks, and operational playbooks that are technically precise enough for an on-call engineer to execute under pressure — and meet the documentation standards required under our ISO 9001 quality management obligations.

Qualifications

Required

7+ years in SRE, Cloud Operations, or DevOps roles, with at least 4 years of hands-on Microsoft Azure focus.
Deep expertise across Azure services including App Services, AKS, Azure SQL, Storage, Networking, Security Center, and Monitor.
Hands-on experience building, maintaining, and improving CI/CD pipelines in Jenkins, Azure DevOps, and GitHub Actions — including real ownership of pipeline failures, performance, and evolution, not just consumption.
Working experience with Ansible for configuration management and infrastructure automation.
Production-grade Kubernetes/AKS experience — cluster operations, workload troubleshooting, RBAC, network policies, Helm, and upgrade management in a live SaaS environment.
Hands-on experience with Prometheus and Grafana in a production context — metric instrumentation, alerting rule design, and dashboard development, not just consumption.
Experience with Pingdom for synthetic monitoring and PagerDuty for incident alerting and on-call management — including configuration of escalation policies, alert routing, and participation in a 24/7 on-call rotation.
Strong scripting and automation skills in PowerShell, Python, Bash, or equivalent — with a track record of using code to eliminate operational toil.
Proven, production-grade experience with Infrastructure as Code using Terraform and/or ARM/Bicep.
Advanced troubleshooting ability across distributed systems, network layers, and application performance in Azure — comfortable owning a complex outage end-to-end.
Demonstrated ability to work closely and effectively with software development teams — contributing to SDLC processes, pipeline standards, and release quality as a technical peer, not a service desk.
Strong working knowledge of security protocols, certificate lifecycle management, secrets management, and compliance controls in Azure — including practical experience supporting or maintaining SOC 2 Type II, ISO 27001, or ISO 9001 audits in an infrastructure or cloud operations context.
Demonstrated experience leading incident response and driving post-mortem remediation to completion.

Preferred

Azure certifications (Azure Solutions Architect Expert, DevOps Engineer Expert, or equivalent).
Experience with hybrid or multi-cloud environments, including AWS.
Familiarity with Azure cost management tooling and hands-on optimisation work.
Experience operating large-scale SaaS platforms with multi-tenant infrastructure.
Experience with Grafana alerting, Grafana OnCall, or similar on-call routing tooling.

Additional Information

What We’re Offering

Salary Range: $133k to $151k CAD
Permanent, Full-time

Use of Artificial Intelligence in Recruitment
As part of our recruitment process, we may use automated tools, including artificial intelligence, to help screen and assess applications based on job‑related criteria such as skills, experience, and qualifications.
These tools do not make hiring decisions. All employment decisions are reviewed and made by members of our hiring team.

We embrace flexibility and hybrid work opportunities to support diverse needs and lifestyles, while also valuing inclusive workplace experiences. By fostering a sense of community, we drive innovation, strengthen connections, and nurture belonging. Our commitment ensures you can work in a way that suits you best, while also engaging with colleagues to share ideas and build meaningful relationships.
