
About this role
Job Summary
We are seeking a highly skilled Senior DevOps Engineer with deep expertise in Azure cloud infrastructure, automation, and MLOps practices. The ideal candidate will be responsible for designing, implementing, and maintaining scalable, secure, and efficient cloud infrastructure while bridging the gap between development, operations, and machine learning teams.
Key Responsibilities
Infrastructure & Cloud Management
- Design, implement, and manage Azure cloud infrastructure using Infrastructure as Code (IaC) principles
- Architect and maintain scalable, highly available, and cost-effective cloud solutions on Azure
- Implement and manage Azure DevOps pipelines for CI/CD automation
- Monitor cloud resource utilization and optimize costs while maintaining performance standards
- Ensure security best practices and compliance across all cloud environments
Automation & Development
- Develop Python-based automation tools and scripts for infrastructure management, monitoring, and deployment workflows
- Create reusable Python modules for DevOps tooling and automation frameworks
- Build custom integrations between various DevOps tools and platforms
- Automate repetitive operational tasks to improve efficiency and reduce manual intervention
Infrastructure as Code
- Write, maintain, and optimize Terraform configurations for multi-environment deployments
- Manage infrastructure state and versioning using Terraform Cloud
- Develop reusable, composable Terraform modules following best practices
- Perform infrastructure code reviews and ensure adherence to coding standards
MLOps Implementation
- Design and implement end-to-end MLOps pipelines for model training, validation, and deployment
- Build automated workflows for model versioning, experiment tracking, and model registry management
- Integrate ML model deployment pipelines with existing CI/CD infrastructure
- Implement monitoring and observability solutions for ML models in production
- Collaborate with data science teams to operationalize machine learning models
- Manage containerized ML workloads using Azure Kubernetes Service (AKS) or Azure Container Instances
Collaboration & Leadership
- Promote DevOps best practices across teams
- Collaborate with development, QA, and data science teams to streamline workflows
- Participate in architectural decisions and technical strategy planning
- Document infrastructure designs, processes, and runbooks
Required Skills & Qualifications
Core DevOps Skills
- Experience: 5+ years in DevOps, Site Reliability Engineering, or related roles
Azure Expertise:
Deep knowledge of Azure services, including:
- Azure Virtual Machines, App Services, Azure Functions
- Azure Storage (Blob, File, Queue, Table)
- Azure Networking (VNet, NSG, Application Gateway, Load Balancer)
- Azure Monitor, Log Analytics, Application Insights
- Azure Key Vault, Microsoft Entra ID (formerly Azure Active Directory)
- Azure Container Registry, Azure Kubernetes Service (AKS)
Azure DevOps:
Extensive hands-on experience with:
- Azure Pipelines (Build and Release)
- Azure Repos (Git)
- Azure Artifacts
- Azure Boards for work item tracking
- Service connections and variable groups
- Pipeline templates and YAML pipelines
Programming & Scripting
Python Development:
- Strong proficiency in Python 3.x for automation and tooling
- Experience with Python libraries: boto3, requests, paramiko, fabric
- Understanding of Python best practices, virtual environments, and package management
- Ability to write unit tests and maintain code quality
- Experience with Python frameworks like Flask/FastAPI for building APIs (preferred)
Infrastructure as Code
Terraform:
- Strong hands-on experience writing and managing Terraform configurations
- Knowledge of Terraform state management and backends
- Experience with Terraform modules, workspaces, and remote backends
- Understanding of Terraform lifecycle and dependency management
Terraform Cloud (Bonus):
- Experience with Terraform Cloud workspaces and VCS integration
- Knowledge of remote state management in Terraform Cloud
- Familiarity with policy as code using Sentinel (preferred)
- Understanding of cost estimation and drift detection features
MLOps Expertise
ML Pipeline Development:
- Design and implementation of automated ML training and deployment pipelines
- Experience with Azure Machine Learning services and ML workspace management
- Knowledge of model versioning and experiment tracking tools (MLflow, Azure ML, Weights & Biases)
- Understanding of feature stores and data versioning (DVC, Delta Lake)
Model Deployment & Serving:
- Experience deploying ML models using Azure ML endpoints, AKS, or Azure Functions
- Knowledge of model serving frameworks (TensorFlow Serving, TorchServe, ONNX Runtime)
- Implementation of A/B testing and canary deployments for ML models
- Experience with batch and real-time inference architectures
ML Monitoring & Observability:
- Implementation of model performance monitoring and data drift detection
- Setup of alerting systems for model degradation and anomalies
- Experience with MLOps platforms (Kubeflow, MLflow, Azure ML)
- Understanding of model explainability and fairness monitoring
Data Engineering for ML:
- Knowledge of data pipeline orchestration tools (Apache Airflow, Azure Data Factory)
- Experience with data preprocessing and feature engineering automation
- Understanding of data validation and quality checks in ML pipelines
- Familiarity with big data technologies (Spark, Databricks) is a plus
Additional Technical Skills
Containerization & Orchestration:
- Docker containerization and multi-stage builds
CI/CD Tools:
- Jenkins, GitLab CI, GitHub Actions (in addition to Azure DevOps)
- Artifact management and versioning strategies
Monitoring & Logging:
- Prometheus, Grafana, ELK Stack
- Distributed tracing and APM tools
Version Control:
- Advanced Git workflows (branching strategies, PR reviews)
- Git hooks and automation
Security:
- Secrets management (Azure Key Vault, HashiCorp Vault)
- Security scanning and vulnerability assessment
- RBAC and access control implementation
Soft Skills
- Strong problem-solving and analytical thinking abilities
- Excellent communication and collaboration skills
- Ability to work independently and manage multiple priorities
- Proactive approach to learning new technologies
- Experience working in Agile/Scrum environments
- Strong documentation skills
Preferred Qualifications
- Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
- Azure certifications (Azure Administrator, Azure DevOps Engineer Expert, Azure Solutions Architect)
- Terraform Associate Certification
- Experience with multi-cloud environments (AWS, GCP)
- Knowledge of machine learning frameworks (TensorFlow, PyTorch, scikit-learn)
- Experience with data science workflows and Jupyter notebooks
- Understanding of FinOps principles and cloud cost optimization