Site Reliability Engineer - USDS
TikTok.com
Office
Sydney, New South Wales, Australia
Full Time
Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. In our team, you’ll have the opportunity to manage the complex challenges of scale, while using expertise in coding, algorithms, complexity analysis, and large-scale system design. We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving. We encourage close collaboration while promoting self-direction.
Responsibilities:
- Develop and maintain automation procedures to maximize system efficiency and minimize human intervention.
- Work closely with software engineering teams to design, deploy, and operate elements to ensure that systems are functionally robust.
- Ensure system scalability to handle growth in web traffic and data.
- Implement monitoring tools and set up metrics to keep track of system health and performance.
- Participate in on-call rotations, assist with incident management, and diagnose, resolve, and prevent production issues.
- Conduct performance tests to find and address system bottlenecks.
- Collaborate with teams across the organization to define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs).
- Practice sustainable user support, incident response, and blameless postmortems.
To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
