Engineering-L2-Hyderabad-Vice President-Software Engineering

Goldman Sachs.com

Office

Hyderabad, Telangana, India

Full Time

Senior Site Reliability Engineer (SRE) Job Description (12+ Years Experience)

Short Description for Internal Candidates

The Senior Site Reliability Engineer (SRE) will serve as a technical leader and subject matter expert, responsible for defining, implementing, and optimizing the reliability, performance, and scalability of our most critical, large-scale distributed systems. This role requires a blend of deep technical expertise, strategic thinking, and the ability to mentor and guide other engineers, fostering a culture of operational excellence and continuous improvement across the engineering organization.

ABOUT GOLDMAN SACHS
At Goldman Sachs, we commit our people, capital and ideas to help our clients, shareholders and the communities we serve to grow. Founded in 1869, we are a leading global investment banking, securities and investment management firm. Headquartered in New York, we maintain offices around the world. We believe who you are makes you better at what you do. We're committed to fostering and advancing diversity and inclusion in our own workplace and beyond by ensuring every individual within our firm has a number of opportunities to grow professionally and personally, from our training and development opportunities and firmwide networks to benefits, wellness and personal finance offerings and mindfulness programs. Learn more about our culture, benefits, and people at GS.com/careers.

We are seeking highly skilled Senior C++ Developers with 8 to 10 years of experience to take ownership of critical aspects of the Software Development Life Cycle (SDLC). The ideal candidates will have a strong background in C++ programming, experience mentoring junior developers, and a proactive approach to software upgrades and product enhancements. This role requires technical leadership, collaboration with cross-functional teams, and a deep understanding of system architecture and performance optimization.

Key Responsibilities:

Strategic Reliability Leadership: Lead the development and execution of SRE strategies, best practices, and roadmaps to enhance system reliability, availability, scalability, and efficiency across multiple domains or the entire platform.
Architectural Guidance & Design: Provide expert guidance and hands-on contributions in designing, building, and maintaining robust, fault-tolerant, and highly available architectures for distributed systems, including microservices and orchestrators. This includes influencing product and service roadmaps to ensure reliability is a first-class feature.
Advanced Monitoring & Observability: Architect and implement sophisticated monitoring, alerting, and logging systems to provide deep insights into system health, performance, and user experience. Define and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to drive continuous improvement and manage expectations.
Complex Incident Management & Resolution: Act as a primary point of contact and lead the response to major incidents, perform deep root cause analyses, and implement strategic improvements to prevent recurrence. Drive a culture of blameless post-mortems and learning.
Advanced Automation & Toil Reduction: Champion and lead the development of advanced automation tools and frameworks to eliminate toil, streamline operational tasks, and improve overall system efficiency, including deployment, configuration management, and incident response.
Performance Engineering & Capacity Planning: Proactively identify and mitigate potential system risks, leading efforts in performance optimization, capacity planning, and efficiency improvements for large-scale production environments.
CI/CD & Release Excellence: Drive the evolution of Continuous Integration/Continuous Deployment (CI/CD) pipelines and release management processes, ensuring safe, efficient, and reliable software delivery at scale.
Mentorship & Technical Leadership: Mentor and guide junior and mid-level SREs and other engineering teams, fostering a culture of knowledge sharing, technical growth, and operational maturity. Provide expert advice on technical and business-related issues.
Cross-functional Collaboration: Collaborate extensively with development, product, security, and infrastructure teams to embed reliability practices throughout the software development lifecycle and ensure alignment with organizational goals.
Technology Evaluation & Adoption: Continuously evaluate emerging tools, technologies, and industry best practices, making recommendations and leading their adoption to enhance operational efficiency and reliability.

Qualifications:

Experience: 12+ years of progressive experience in Site Reliability Engineering, Production Engineering, Software Development, or related roles with a strong focus on large-scale distributed production systems.
Technical Leadership: Demonstrated ability to lead technical initiatives, influence architectural decisions, and drive significant improvements in system reliability and performance.
Programming Mastery: Expert-level proficiency in multiple programming languages, such as Python, Go, Java, Ruby, or Bash, with a strong emphasis on writing high-quality, maintainable code for automation and tooling.
Operating Systems: Deep expertise in Linux/Unix operating systems and systems engineering.
Cloud Platforms: Extensive hands-on experience with major cloud providers (e.g., AWS, GCP, Azure) and designing cloud-native solutions.
Containerization & Orchestration: Mastery of container technologies (e.g., Docker) and advanced orchestration tools (e.g., Kubernetes), including designing and managing large-scale Kubernetes deployments.
CI/CD & IaC: Proven experience with advanced CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions) and Infrastructure as Code (IaC) principles and tools (e.g., Terraform, Ansible).
Monitoring & Observability Stack: Expertise in designing and implementing comprehensive monitoring, logging, and alerting solutions using tools like Prometheus, Grafana, Datadog, ELK Stack, Splunk, or similar.
Distributed Systems: In-depth understanding and experience with the design, development, and operation of complex distributed systems.
Networking: Advanced knowledge of networking concepts (TCP/IP, DNS, load balancing) and network observability.
Databases: Strong understanding of various database technologies (SQL and NoSQL) and data platforms, especially in a high-performance, high-availability context.
Problem-Solving: Exceptional analytical, problem-solving, and debugging skills for complex, multi-layered systems.
Communication: Excellent written and verbal communication skills, with the ability to articulate complex technical concepts to diverse audiences, including executive leadership.
Interpersonal Skills: Strong ability to collaborate effectively across teams, influence stakeholders, and lead technical discussions.

Preferred Qualifications:

Experience with chaos engineering principles and practices.
Familiarity with compliance and security best practices in large-scale environments.
Contributions to open-source SRE tooling or publications.
Experience in a regulated industry (e.g., financial services, healthcare).

© The Goldman Sachs Group, Inc., 2023. All rights reserved. Goldman Sachs is an equal opportunity employer and does not discriminate on the basis of race, color, religion, sex, national origin, age, veterans status, disability, or any other characteristic protected by applicable law.