
About this role
Technical Summary
You will support the reliability and scalability of services across AWS, Azure, GCP, and Oracle Cloud by carrying out automation, CI/CD, observability, and container orchestration work. You will work closely with senior engineers to keep production systems stable, well monitored, and continuously improving.
Responsibilities
- Implement and maintain monitoring, alerting, and logging systems (Prometheus, Grafana, ELK, OpenTelemetry)
- Build and maintain CI/CD pipelines and automation for deployments and testing
- Support containerized workloads using Docker and Kubernetes; manage Helm charts and deployments
- Contribute to incident response, troubleshooting, and postmortem documentation
- Implement IaC patterns (Terraform, CloudFormation, ARM templates) under guidance
- Collaborate with developers to improve service reliability and operational readiness
- Participate in continuous platform improvements led by senior and principal engineers
Must-have Qualifications
- 3–5 years of experience in operations, DevOps, or SRE roles
- Hands-on experience with containers and orchestration (Docker, Kubernetes)
- Familiarity with IaC tools (Terraform, Ansible, or similar)
- Experience with CI/CD tools (Jenkins, GitHub Actions, Argo CD, or similar)
- Proficiency in at least one scripting or programming language (Python, Bash, or Go)
- Associate-level cloud certification (AWS, Azure, GCP, Oracle, or CompTIA Cloud+)
- Availability for weekend and holiday shifts as part of the standard scheduling rotation
Nice-to-have Skills
- Exposure to SLOs/SLIs and error budgets
- Familiarity with chaos testing or service mesh