Site Reliability Engineer
RiskSpan
Office
India
Full Time
ONLY IMMEDIATE JOINERS WILL BE CONSIDERED.
Hybrid position for Bangalore office.
Shift Timings: Rotational shifts between 8:00 AM and 2:00 AM (next day)
About RiskSpan Technologies
RiskSpan Technologies is a leading technology and data solutions company specializing in delivering innovative and scalable solutions to complex challenges in the financial services and technology sectors. We pride ourselves on a collaborative culture, technical excellence, and a passion for problem-solving. Join us to enhance system reliability, observability, and performance at scale!
Job Overview
We are looking for a Site Reliability Engineer (SRE) with 2.5 to 5 years of experience to join our team. The ideal candidate will be responsible for ensuring the availability, scalability, and reliability of our distributed systems, improving observability, automating infrastructure, and enhancing system performance. This role provides an opportunity to work in high-scale, mission-critical environments and contribute to building a resilient infrastructure.
Key Responsibilities
- Improve observability by implementing and managing monitoring, logging, and alerting solutions using Prometheus, ELK stack, and Grafana.
- Work with APM tools such as Dynatrace and New Relic to monitor performance metrics and define SLIs, SLOs, and error budgets.
- Participate in incident management, including on-call rotation and Root Cause Analysis (RCA).
- Automate infrastructure provisioning using Terraform and Infrastructure as Code (IaC) principles.
- Ensure system scalability, reliability, and performance in a distributed environment.
- Strengthen security by applying cybersecurity best practices, vulnerability assessments, and compliance policies.
- Collaborate with cross-functional teams to establish SRE best practices, improve release pipelines, and minimize deployment risks.
- Maintain and improve disaster recovery plans to enhance resilience.
- Manage and optimize workflows using Apache Airflow to ensure efficient scheduling and execution of data pipelines.
- Support Snowflake data operations, ensuring high availability, performance optimization, and security compliance.
Skills and Experience
- Monitoring and observability using Prometheus, ELK, and Grafana.
- Application Performance Monitoring (APM) tools like Dynatrace, New Relic, or Datadog.
- Incident response and on-call rotation management.
- Infrastructure automation using Terraform.
- Distributed systems operations and scaling.
- Load testing and performance analysis using tools like JMeter, k6, or Locust.
- Security at scale, including vulnerability scanning and compliance automation.
- Workflow automation and orchestration using Apache Airflow.
- Experience with Snowflake, including query optimization, data management, and security controls.
Technical Skills:
- Strong knowledge of cloud platforms (AWS preferred).
- Experience with troubleshooting distributed systems and high-traffic environments.
- Hands-on knowledge of Linux, networking, and security fundamentals.
- Familiarity with containers and container orchestration (Docker, Kubernetes).
- Ability to write automation scripts using Python, Bash, or Go.
Certifications
- AWS Certified DevOps Engineer – Professional (or equivalent AWS certification).
- HashiCorp Certified: Terraform Associate.
- Certified Kubernetes Administrator (CKA).
- Google SRE Professional Certificate (preferred but not mandatory).
August 8, 2025