Site Reliability Engineers (SREs) are responsible for supporting and improving the reliability of production systems. In this role, you will work hands-on with real systems, contribute to incident response, and develop the technical skills needed to operate and troubleshoot modern distributed applications.

We are looking for engineers who are proactive, detail-oriented, and eager to take on technical challenges. This role is ideal for candidates who want to build a strong foundation in systems, automation, and reliability engineering.

Monitoring & Observability:

Monitor system health using metrics, logs, and traces.
Respond to alerts and perform initial investigation using runbooks.
Identify patterns in alerts and contribute to improving alert quality and signal-to-noise ratio.

Incident Support & Troubleshooting:

Participate actively in incident response and assist in troubleshooting issues.
Perform initial diagnosis by analyzing logs, metrics, and system behavior.
Escalate issues with clear context and supporting data.
Contribute to incident documentation and post-incident reviews.

Operations & System Support:

Support operational tasks such as deployments, rollbacks, and system checks.
Assist in maintaining production systems by identifying and reporting anomalies or risks.
Follow and improve standard operating procedures (SOPs) and runbooks.

Automation & Improvement:

Develop scripts or small tools to automate repetitive operational tasks.
Contribute to improving monitoring, alerting, and operational workflows.
Identify inefficiencies and suggest improvements to enhance reliability.

Performance & System Understanding:

Assist in analyzing system performance and identifying basic bottlenecks.
Develop an understanding of how services interact and fail in production.
Support troubleshooting efforts across application and infrastructure layers.

Collaboration & Learning:

Work closely with developers and cross-functional teams.
Communicate findings, issues, and progress clearly.
Continuously learn from incidents, system behavior, and team feedback.

AI & Learning (Nice to Have / Exposure):

Gain exposure to AI/ML tools used in monitoring and operations.
Explore how AI can assist in anomaly detection and incident analysis.
Show curiosity about AIOps and automation-driven operations.

Technical Skills

Good understanding of systems and networking concepts (HTTP, DNS, TCP/IP).
Ability to write code or scripts (e.g., Python, JavaScript, or similar) for basic automation.
Ability to analyze logs and metrics to support troubleshooting.
Basic understanding of how web applications and services operate in production.
Willingness to learn and work with observability tools (e.g., Datadog).

Education / Background

Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.
1–3 years of experience in software engineering, DevOps, or related areas (including internships or projects).

Nice to Have

Internship or project experience involving backend systems, APIs, or cloud platforms.
Exposure to cloud environments (AWS, GCP, or Azure).
Basic understanding of CI/CD pipelines and deployment workflows.
Basic knowledge or hands-on experience with AI/ML concepts or tools.
Interest in applying AI to monitoring, automation, or operational efficiency.

Soft Skills

Strong willingness to learn and take on technical challenges.
Problem-solving mindset with attention to detail.
Ability to communicate clearly and work collaboratively.
Proactive attitude and sense of ownership for assigned tasks.

Site Reliability Engineer II

Yum! Brands