Site Reliability Engineer, Recommendation Infrastructure - USDS

TikTok.com

Office

San Jose, California, United States

Full Time

About the Team
The USDS TikTok Recommendations Infra SRE team works with engineering and product teams to build and run large-scale, globally distributed, observable, fault-tolerant systems. SREs on this team will deliver on production ownership and be responsible for observability and automation across complex, large-scale service mesh architectures.

To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.

Responsibilities
• Engage in and improve the whole lifecycle of Recommendation systems, from system design consulting through launch reviews, deployment, operation and refinement
• Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R&D efficiency
• Build high availability into large-scale services deployed across global data centers
• Plan, manage and optimize cloud resource utilization to meet the SLAs of large-scale clusters
• Measure and monitor availability, latency and overall service health
• Practice sustainable incident response and postmortems