Job Description

Principal Software Engineer – Site Reliability & Application Support, Chennai

We are looking for a Principal Software Engineer in Site Reliability Engineering (SRE) who defines and drives the reliability strategy for large‑scale, distributed, and cloud‑native applications. This role operates at a company and platform level, bridging the gap between software engineering and operations to ensure our applications are highly available, performant, and resilient at scale. The scope spans the full application stack (Angular front‑end, Node.js services, Java back‑end, and Python tooling) and encompasses reliability engineering, observability, incident management, and continuous improvement of application health across production environments.
You will act as a technical authority for application reliability and support, leading triage efforts, driving automation to eliminate toil, setting company‑wide SRE standards, and collaborating with development, platform, and architecture teams to embed reliability as a first‑class engineering concern.

Responsibilities

Application Reliability & Support

Own end‑to‑end reliability of multi‑tier applications spanning Angular, Node.js, Java, and Python stacks
Monitor, triage, and resolve production incidents with speed and precision, minimizing customer impact and MTTR
Perform root cause analysis (RCA) on recurring issues and drive permanent fixes through development or platform teams
Define and track SLIs, SLOs, and error budgets aligned to business criticality
Lead blameless post‑mortems and ensure actionable follow‑through on learnings
Proactively identify reliability risks and work with engineering teams to address them before they impact production

Incident Management & Technical Triage

Lead technical triage bridges during P1/P2 incidents, coordinating across application, infrastructure, and vendor teams
Rapidly diagnose issues across the full stack — front‑end rendering, API failures, JVM issues, database bottlenecks, and network anomalies
Establish and maintain runbooks, escalation paths, and incident response playbooks
Drive structured incident timelines, stakeholder communications, and resolution documentation
Champion fast feedback loops between on‑call, engineering, and leadership during high‑severity events

Observability & Monitoring

Design and implement end‑to‑end observability strategies covering logs, metrics, traces, and synthetic monitoring
Build and maintain dashboards, alerting rules, and anomaly detection for Angular, Node.js, Java, and Python applications
Define golden signals (latency, traffic, errors, saturation) and SLO‑based alerting for all critical services
Drive adoption of distributed tracing and correlation of signals across service boundaries
Evaluate and integrate observability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK)
Continuously improve signal‑to‑noise ratio to reduce alert fatigue and improve detection confidence

Automation & Toil Reduction

Identify and eliminate operational toil through automation, scripting, and self‑healing mechanisms
Build and maintain automation scripts in Python, Shell/Bash, or Node.js for diagnostics, remediation, and reporting
Develop automated health checks, smoke tests, and canary validations for releases and deployments
Automate repetitive support workflows such as log analysis, data reconciliation, and environment reset procedures
Contribute to the internal tooling ecosystem to improve operational efficiency across teams

Release & Change Management

Coordinate application releases in alignment with change management processes and release calendars
Conduct pre‑release readiness reviews, validating deployment readiness, rollback plans, and monitoring coverage
Collaborate with development and DevOps teams to define and enforce safe deployment practices (blue‑green, canary, feature flags)
Participate in change advisory board (CAB) processes, providing technical assessment of risk and impact
Maintain deployment runbooks and ensure change traceability across environments

Collaboration — Development, Architecture & Platform Teams

Serve as the operational voice in engineering discussions, advocating for reliability, observability, and supportability
Partner with development teams during design and sprint cycles to embed SRE best practices early
Engage with architects to review designs for failure modes, observability gaps, and operability concerns
Provide production insights and telemetry data to inform architectural decisions and technical debt prioritization
Drive feedback loops from production back to development and architecture teams in a structured, data‑driven manner

Cloud & Infrastructure

Support and operate cloud‑native applications on Azure, AWS, or GCP, leveraging managed services effectively
Manage and troubleshoot containerized workloads using Docker and Kubernetes (AKS / EKS / GKE)
Understand and operate CI/CD pipelines, supporting deployment automation and pipeline reliability
Apply an understanding of Infrastructure‑as‑Code (Terraform, Bicep, or similar) to diagnose and support environment‑level issues
Collaborate with platform and cloud teams on capacity planning, cost optimization, and scaling strategies

AI & Engineering Innovation

Leverage AI‑assisted tooling (e.g., AIOps, GenAI‑based log analysis, intelligent alerting) to accelerate diagnosis and reduce resolution time
Evaluate and adopt AI/ML‑driven observability and anomaly detection capabilities
Apply GenAI tools responsibly to improve runbook generation, RCA summaries, and incident documentation quality
Contribute to organizational knowledge by documenting patterns, solutions, and operational best practices

Required Technical Skills

Application Stack

Angular (component lifecycle, API integration, front‑end performance profiling, browser diagnostics)
Node.js (event loop, async patterns, memory management, npm ecosystem, service debugging)
Java (Spring Boot, JVM diagnostics, heap/thread analysis, REST APIs, microservices)
Python (scripting, automation, data analysis, diagnostic tooling)

SRE & Reliability Engineering

SLI / SLO / SLA definition, tracking, and error budget management
Incident management frameworks (ITIL, PagerDuty, Opsgenie, or equivalent)
Root cause analysis methodologies (5 Whys, Fishbone, fault tree analysis)
Reliability patterns: circuit breakers, retries, timeouts, bulkheads, graceful degradation
Capacity planning, performance profiling, and load analysis

Observability & Monitoring

Logging: ELK Stack / Splunk / Loki — structured logging, log correlation, query analysis
Metrics: Prometheus, Grafana, Datadog, CloudWatch, Azure Monitor
Tracing: OpenTelemetry, Jaeger, Zipkin, distributed trace correlation
Synthetic monitoring, uptime checks, and real‑user monitoring (RUM)
Alert design: thresholds, multi‑condition rules, SLO burn rate alerts

Automation & Scripting

Python, Shell/Bash, PowerShell for automation, diagnostics, and remediation scripts
REST API automation and integration testing tools (Postman, curl, pytest, JUnit)
CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps, GitLab CI)
Infrastructure tooling: Terraform, Ansible, or similar

Cloud & Platforms

Cloud platforms: Azure / AWS / GCP (managed services, networking, IAM, storage, compute)
Containers and orchestration: Docker, Kubernetes (kubectl, Helm, namespaces, resource limits)
Service mesh basics (Istio, Linkerd) and API gateway management
Database operations: SQL query analysis, connection pool diagnostics, slow query identification

AI / Data (Working Knowledge)

AIOps platforms and AI‑assisted alert correlation
GenAI tooling for documentation, RCA assistance, and knowledge management
Basic understanding of ML model deployment and observability for AI‑driven systems

Qualifications

Must have 10–15+ years of hands‑on software engineering and/or SRE experience
Proven experience designing and operating enterprise‑grade, large‑scale production systems
Demonstrated impact at Staff / Principal / Architect level in SRE, platform engineering, or application reliability
Strong background in influencing reliability and observability strategy across multiple teams or platforms
Demonstrated experience leading incident triage and driving resolution in high‑pressure, high‑stakes environments
Bachelor's or master's degree in Computer Science, Information Technology, or a related field

Leadership & Soft Skills

Exceptional analytical, diagnostic, and structured problem‑solving skills
Strong written and verbal communication, with the ability to convey technical issues clearly to both technical and non‑technical stakeholders
Ability to lead under pressure and bring calm and clarity to high‑severity incidents
High ownership, accountability, and bias for action
Collaborative mindset with the ability to influence development and architectural decisions through data and evidence
Continuous improvement orientation: always looking to reduce toil, improve quality, and raise the reliability bar

Nice to Have

Experience with Kafka, event‑driven architectures, and streaming system observability
Exposure to security monitoring, compliance frameworks, and vulnerability management in production
Experience with large‑scale analytics platforms (Spark, BigQuery, Databricks)
Familiarity with chaos engineering principles and tooling (Chaos Monkey, Litmus, Gremlin)
Prior role as Principal SRE, Staff Engineer, or Platform Reliability Architect
Certifications: AWS/Azure/GCP Associate or Professional, CKA (Certified Kubernetes Administrator), or equivalent

Additional Information

Our Benefits

Flexible working environment
Volunteer time off
LinkedIn Learning
Employee Assistance Program (EAP)

NIQ may utilize artificial intelligence (AI) tools at various stages of the recruitment process, including résumé screening, candidate assessments, interview scheduling, job matching, communication support, and certain administrative tasks that help streamline workflows. These tools are intended to improve efficiency and support fair and consistent evaluation based on job-related criteria. All use of AI is governed by NIQ’s principles of fairness, transparency, human oversight, and inclusion. Final hiring decisions are made exclusively by humans. NIQ regularly reviews its AI tools to help mitigate bias and ensure compliance with applicable laws and regulations. If you have questions, require accommodations, or wish to request human review where permitted by law, please contact your local HR representative. For more information, please visit NIQ’s AI Safety Policies and Guiding Principles: https://nielseniq.com/global/en/info/niqs-ai-safety-policies/

About NIQ

NIQ is the world’s leading consumer intelligence company, delivering the most complete understanding of consumer buying behavior and revealing new pathways to growth. In 2023, NIQ combined with GfK, bringing together the two industry leaders with unparalleled global reach. With a holistic retail read and the most comprehensive consumer insights—delivered with advanced analytics through state-of-the-art platforms—NIQ delivers the Full View™. NIQ is an Advent International portfolio company with operations in 100+ markets, covering more than 90% of the world’s population.

For more information, visit NIQ.com

Our commitment to Diversity, Equity, and Inclusion

At NIQ, we are steadfast in our commitment to fostering an inclusive workplace that mirrors the rich diversity of the communities and markets we serve. We believe that embracing a wide range of perspectives drives innovation and excellence. All employment decisions at NIQ are made without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, age, disability, genetic information, marital status, veteran status, or any other characteristic protected by applicable laws. We invite individuals who share our dedication to inclusivity and equity to join us in making a meaningful impact. To learn more about our ongoing efforts in diversity and inclusion, please visit https://nielseniq.com/global/en/news-center/diversity-inclusion
