
Principal Software Engineer

Posted about 3 hours ago

Office | Chennai, TN, India | SE

Job Description

Principal Software Engineer – Site Reliability & Application Support, Chennai

We are looking for a Principal Software Engineer in Site Reliability Engineering (SRE) who defines and drives the reliability strategy for large‑scale, distributed, and cloud‑native applications. This role operates at a company and platform level, bridging the gap between software engineering and operations to ensure our applications are highly available, performant, and resilient at scale. The scope spans the full application stack (Angular front‑end, Node.js services, Java back‑end, and Python tooling) and encompasses reliability engineering, observability, incident management, and continuous improvement of application health across production environments.

You will act as a technical authority for application reliability and support, leading triage efforts, driving automation to eliminate toil, setting company‑wide SRE standards, and collaborating with development, platform, and architecture teams to embed reliability as a first‑class engineering concern.

Responsibilities

Application Reliability & Support

  • Own end‑to‑end reliability of multi‑tier applications spanning Angular, Node.js, Java, and Python stacks
  • Monitor, triage, and resolve production incidents with speed and precision, minimizing customer impact and MTTR
  • Perform root cause analysis (RCA) on recurring issues and drive permanent fixes through development or platform teams
  • Define and track SLIs, SLOs, and error budgets aligned to business criticality
  • Lead blameless post‑mortems and ensure actionable follow‑through on learnings
  • Proactively identify reliability risks and work with engineering teams to address them before they impact production

Incident Management & Technical Triage

  • Lead technical triage bridges during P1/P2 incidents, coordinating across application, infrastructure, and vendor teams
  • Rapidly diagnose issues across the full stack — front‑end rendering, API failures, JVM issues, database bottlenecks, and network anomalies
  • Establish and maintain runbooks, escalation paths, and incident response playbooks
  • Drive structured incident timelines, stakeholder communications, and resolution documentation
  • Champion fast feedback loops between on‑call, engineering, and leadership during high‑severity events

Observability & Monitoring

  • Design and implement end‑to‑end observability strategies covering logs, metrics, traces, and synthetic monitoring
  • Build and maintain dashboards, alerting rules, and anomaly detection for Angular, Node.js, Java, and Python applications
  • Define golden signals (latency, traffic, errors, saturation) and SLO‑based alerting for all critical services
  • Drive adoption of distributed tracing and correlation of signals across service boundaries
  • Evaluate and integrate observability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK)
  • Continuously improve signal‑to‑noise ratio to reduce alert fatigue and improve detection confidence

Automation & Toil Reduction

  • Identify and eliminate operational toil through automation, scripting, and self‑healing mechanisms
  • Build and maintain automation scripts in Python, Shell/Bash, or Node.js for diagnostics, remediation, and reporting
  • Develop automated health checks, smoke tests, and canary validations for releases and deployments
  • Automate repetitive support workflows such as log analysis, data reconciliation, and environment reset procedures
  • Contribute to the internal tooling ecosystem to improve operational efficiency across teams

Release & Change Management

  • Coordinate application releases in alignment with change management processes and release calendars
  • Conduct pre‑release readiness reviews, validating deployment readiness, rollback plans, and monitoring coverage
  • Collaborate with development and DevOps teams to define and enforce safe deployment practices (blue‑green, canary, feature flags)
  • Participate in change advisory board (CAB) processes, providing technical assessment of risk and impact
  • Maintain deployment runbooks and ensure change traceability across environments

Collaboration — Development, Architecture & Platform Teams

  • Serve as the operational voice in engineering discussions, advocating for reliability, observability, and supportability
  • Partner with development teams during design and sprint cycles to embed SRE best practices early
  • Engage with architects to review designs for failure modes, observability gaps, and operability concerns
  • Provide production insights and telemetry data to inform architectural decisions and technical debt prioritization
  • Drive feedback loops from production back to development and architecture teams in a structured, data‑driven manner

Cloud & Infrastructure

  • Support and operate cloud‑native applications on Azure, AWS, or GCP, leveraging managed services effectively
  • Manage and troubleshoot containerized workloads using Docker and Kubernetes (AKS / EKS / GKE)
  • Understand and operate CI/CD pipelines, supporting deployment automation and pipeline reliability
  • Apply Infrastructure‑as‑Code (Terraform, Bicep, or similar) understanding to diagnose and support environment‑level issues
  • Collaborate with platform and cloud teams on capacity planning, cost optimization, and scaling strategies

AI & Engineering Innovation

  • Leverage AI‑assisted tooling (e.g., AIOps, GenAI‑based log analysis, intelligent alerting) to accelerate diagnosis and reduce resolution time
  • Evaluate and adopt AI/ML‑driven observability and anomaly detection capabilities
  • Apply GenAI tools responsibly to improve runbook generation, RCA summaries, and incident documentation quality
  • Contribute to organizational knowledge by documenting patterns, solutions, and operational best practices

Required Technical Skills

Application Stack

  • Angular (component lifecycle, API integration, front‑end performance profiling, browser diagnostics)
  • Node.js (event loop, async patterns, memory management, npm ecosystem, service debugging)
  • Java (Spring Boot, JVM diagnostics, heap/thread analysis, REST APIs, microservices)
  • Python (scripting, automation, data analysis, diagnostic tooling)

SRE & Reliability Engineering

  • SLI / SLO / SLA definition, tracking, and error budget management
  • Incident management frameworks (ITIL, PagerDuty, Opsgenie, or equivalent)
  • Root cause analysis methodologies (5 Whys, Fishbone, fault tree analysis)
  • Reliability patterns: circuit breakers, retries, timeouts, bulkheads, graceful degradation
  • Capacity planning, performance profiling, and load analysis

Observability & Monitoring

  • Logging: ELK Stack / Splunk / Loki (structured logging, log correlation, query analysis)
  • Metrics: Prometheus, Grafana, Datadog, CloudWatch, Azure Monitor
  • Tracing: OpenTelemetry, Jaeger, Zipkin, distributed trace correlation
  • Synthetic monitoring, uptime checks, and real‑user monitoring (RUM)
  • Alert design: thresholds, multi‑condition rules, SLO burn rate alerts

Automation & Scripting

  • Python, Shell/Bash, PowerShell for automation, diagnostics, and remediation scripts
  • REST API automation and integration testing tools (Postman, curl, pytest, JUnit)
  • CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps, GitLab CI)
  • Infrastructure tooling: Terraform, Ansible, or similar

Cloud & Platforms

  • Cloud platforms: Azure / AWS / GCP (managed services, networking, IAM, storage, compute)
  • Containers and orchestration: Docker, Kubernetes (kubectl, Helm, namespaces, resource limits)
  • Service mesh basics (Istio, Linkerd) and API gateway management
  • Database operations: SQL query analysis, connection pool diagnostics, slow query identification

AI / Data (Working Knowledge)

  • AIOps platforms and AI‑assisted alert correlation
  • GenAI tooling for documentation, RCA assistance, and knowledge management
  • Basic understanding of ML model deployment and observability for AI‑driven systems

 

Qualifications

  • Must have 10–15+ years of hands‑on software engineering and/or SRE experience
  • Proven experience designing and operating enterprise‑grade, large‑scale production systems
  • Demonstrated impact at Staff / Principal / Architect level in SRE, platform engineering, or application reliability
  • Strong background in influencing reliability and observability strategy across multiple teams or platforms
  • Demonstrated experience leading incident triage and driving resolution in high‑pressure, high‑stakes environments
  • Bachelor's or master's degree in Computer Science, Information Technology, or a related field

Leadership & Soft Skills

  • Exceptional analytical, diagnostic, and structured problem‑solving skills
  • Strong written and verbal communication, with the ability to convey technical issues clearly to both technical and non‑technical stakeholders
  • Ability to lead under pressure and bring calm and clarity to high‑severity incidents
  • High ownership, accountability, and bias for action
  • Collaborative mindset with the ability to influence development and architectural decisions through data and evidence
  • Continuous improvement orientation: always looking to reduce toil, improve quality, and raise the reliability bar

Nice to Have

  • Experience with Kafka, event‑driven architectures, and streaming system observability
  • Exposure to security monitoring, compliance frameworks, and vulnerability management in production
  • Experience with large‑scale analytics platforms (Spark, BigQuery, Databricks)
  • Familiarity with chaos engineering principles and tooling (Chaos Monkey, Litmus, Gremlin)
  • Prior role as Principal SRE, Staff Engineer, or Platform Reliability Architect
  • Certifications: AWS/Azure/GCP Associate or Professional, CKA (Certified Kubernetes Administrator), or equivalent

Additional Information

Our Benefits

  • Flexible working environment
  • Volunteer time off
  • LinkedIn Learning
  • Employee Assistance Program (EAP)

NIQ may utilize artificial intelligence (AI) tools at various stages of the recruitment process, including résumé screening, candidate assessments, interview scheduling, job matching, communication support, and certain administrative tasks that help streamline workflows. These tools are intended to improve efficiency and support fair and consistent evaluation based on job-related criteria. All use of AI is governed by NIQ’s principles of fairness, transparency, human oversight, and inclusion. Final hiring decisions are made exclusively by humans. NIQ regularly reviews its AI tools to help mitigate bias and ensure compliance with applicable laws and regulations. If you have questions, require accommodations, or wish to request human review where permitted by law, please contact your local HR representative. For more information, please visit NIQ’s AI Safety Policies and Guiding Principles: https://nielseniq.com/global/en/info/niqs-ai-safety-policies/

About NIQ

NIQ is the world’s leading consumer intelligence company, delivering the most complete understanding of consumer buying behavior and revealing new pathways to growth. In 2023, NIQ combined with GfK, bringing together the two industry leaders with unparalleled global reach. With a holistic retail read and the most comprehensive consumer insights—delivered with advanced analytics through state-of-the-art platforms—NIQ delivers the Full View™. NIQ is an Advent International portfolio company with operations in 100+ markets, covering more than 90% of the world’s population.

For more information, visit NIQ.com

Want to keep up with our latest updates?

Follow us on: LinkedIn | Instagram | Twitter | Facebook

Our commitment to Diversity, Equity, and Inclusion

At NIQ, we are steadfast in our commitment to fostering an inclusive workplace that mirrors the rich diversity of the communities and markets we serve. We believe that embracing a wide range of perspectives drives innovation and excellence. All employment decisions at NIQ are made without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, age, disability, genetic information, marital status, veteran status, or any other characteristic protected by applicable laws. We invite individuals who share our dedication to inclusivity and equity to join us in making a meaningful impact. To learn more about our ongoing efforts in diversity and inclusion, please visit https://nielseniq.com/global/en/news-center/diversity-inclusion

Job details
Workplace: Office
Location: Chennai, TN, India
Experience: SE

Key team members

Dimiter Gergishanov
