Senior Site Reliability Engineer (SRE)
CME.com
Hybrid
Remote
Full Time
This is a remote position.
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team. The ideal candidate will have a strong understanding of DevOps and Service Level Management (SLM) metrics, as well as experience working on event-driven infrastructure projects using tools like Terraform, New Relic, Kubernetes, AWS, and Kafka.
As a representative of Platform Engineering, you will play a critical role working with other engineering teams to ensure our platform infrastructure tooling fulfils their needs and has a positive impact on Developer Experience. You will also help them determine the right settings and thresholds for triggering alerts or automations on their applications.
Key Responsibilities:
- Scalability and High Availability: Design, implement, and maintain scalable and highly available systems using load balancing, auto-scaling patterns, canary releases, and blue-green deployments.
- Monitoring, Logging, and Observability: Develop and maintain monitoring and logging dashboards using tools like New Relic, Prometheus, Grafana, and Datadog. Ensure observability through metrics, tracing, log aggregation, and alerting.
- Alerting and Automation: Help teams determine the right settings and thresholds for triggering alerts or automations on their applications, recognizing that each application has different performance requirements, such as varying acceptable response times or resource constraints.
- System Performance and Reliability: Monitor, optimize, and ensure system reliability and performance using tools like New Relic. Apply DORA metrics to measure and improve development and operational performance. Ensure compliance with SLM metrics like SLAs, SLOs, and SLIs by tracking uptime, response times, and resolution times.
- Resiliency: Implement and advocate for "Chaos" engineering practices to ensure system resiliency.
- Collaboration: Work with cross-functional teams to enhance platform engineering practices and gather the right information for metrics analysis.
Requirements:
- Proven experience working with Infrastructure-as-Code tooling, like Terraform, for infrastructure management.
- Strong understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments.
- Strong understanding of DevOps metrics (like DORA) and their application in measuring and improving development and operational performance.
- Strong understanding of Service Level Management (SLM) metrics (like SLAs, SLOs, and SLIs) and their importance in defining, monitoring, and ensuring compliance for the services bound to them.
- Experience with monitoring, logging, and observability tools like New Relic, Prometheus, Grafana, and Datadog.
- Experience working with Kafka and improving the performance of event-driven, real-time data processing and streaming projects and architectures.
- Familiarity with tooling used for SLM, DevOps, and DORA metrics, like Apache DevLake, Grafana, and New Relic.
- Experience working with AWS, Azure, or GCP for cloud infrastructure management.
- Experience working with CI/CD pipeline tools such as GitHub Actions, Jenkins, GitLab CI, or similar.
- Strong analytical skills, with the ability to analyze and interpret metrics to drive improvements.
- Strong communication skills to effectively collaborate with team members and stakeholders.
Nice-to-haves:
- Familiarity with Observability-as-Code tooling and practices.
- Familiarity with "Chaos" engineering practices for system resiliency.
September 17, 2025