Site Reliability Engineer (SRE)
BETSOL.com
Office
Bengaluru, KA, India
Full Time
Company Description
BETSOL is a cloud-first digital transformation and data management company offering products and IT services to enterprises in over 40 countries. The BETSOL team holds several engineering patents, has been recognized with industry awards, and maintains a net promoter score twice the industry average.
BETSOL’s open source backup and recovery product line, Zmanda (Zmanda.com), delivers up to 50% savings in total cost of ownership (TCO) and best-in-class performance.
BETSOL Global IT Services (BETSOL.com) builds and supports end-to-end enterprise solutions, reducing time-to-market for its customers.
BETSOL offices are set against the vibrant backdrops of Broomfield, Colorado, and Bangalore, India.
We take pride in being an employee-centric organization, offering comprehensive health insurance, competitive salaries, a 401(k), volunteer programs, and scholarship opportunities. Office amenities include a fitness center, cafe, and recreational facilities.
Job Description
Own the reliability, availability, performance, and scalability of customer- and employee-facing platforms. Partner with application, infrastructure, security, and NOC teams to engineer resilient services and automate operations across Azure and on-prem environments. Drive incident response and post-incident reviews, implement observability, and continuously improve service health through automation and best practices.
Responsibilities:
- Build and operate production platforms across Azure (e.g., AKS, App Services, Functions), Windows/Linux, and networking layers in partnership with Platform/Server/Network teams.
- Engineer end-to-end observability: metrics, logs, and traces via Azure Monitor, Application Insights, Log Analytics, Prometheus, Grafana, and centralized logging.
- Automate provisioning and configuration using Infrastructure as Code (Terraform/Bicep) and configuration management (Ansible/PowerShell DSC).
- Design and maintain CI/CD pipelines (Azure DevOps/GitHub Actions) with automated testing, canary/blue-green deployments, and change control alignment.
- Establish runbooks, SOPs, and self-healing automations to reduce MTTR and ticket volume from the NOC and Service Desk.
- Harden platform security (identity, secrets, certificates, network segmentation) leveraging Azure Key Vault, managed identities, and policy guardrails.
- Perform capacity planning, performance tuning, and cost optimization (FinOps) for compute, storage, and networking.
- Partner with Data/ETL teams to ensure the reliability of batch and streaming jobs, their scheduling, and their dependencies.
- Create and maintain documentation (architecture, runbooks, dashboards) and support audits and compliance requirements.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
- 2–5+ years in SRE/DevOps/Platform Engineering with hands-on production ownership.
- Proficiency with Azure services (AKS, App Services, Functions, Azure Monitor, Log Analytics, Application Insights).
- Strong Kubernetes/Docker skills; Helm, ingress, service mesh (e.g., Istio/Linkerd) experience is a plus.
- IaC (Terraform or Bicep) and scripting (PowerShell and/or Python); Git-based workflows.
- CI/CD (Azure DevOps or GitHub Actions), artifact management, and release strategies (canary/blue-green).
- Observability tooling (Prometheus, Grafana, ELK/OpenSearch, Azure Monitor) and alert design to minimize noise.
- Experience with ITIL processes (incident, change, problem) and tools (ServiceNow/Jira).
- Knowledge of networking, DNS, TLS/certificates, load balancers, and security fundamentals.
- Excellent troubleshooting, communication, and cross-functional collaboration skills.
- Certifications such as Microsoft Azure Administrator/DevOps, CKA/CKAD, or ITIL Foundation are a plus.
Additional Information
All your information will be kept confidential according to EEO guidelines.
October 10, 2025