HPC Engineer
Posted 19 days ago
Office | Sunnyvale, CA | 150k - 300k USD
About MBZUAI
The Institute for Foundation Models (IFM) operates some of the world's largest AI supercomputing environments.
Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.
### Responsibilities
• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.
### Education
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.
### Experience
• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.
### Preferred Qualifications
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.
Benefits Include
• Comprehensive medical, dental, and vision benefits
• Bonus
• 401K Plan
• Generous paid time off, sick leave and holidays
• Paid Parental Leave
• Employee Assistance Program
• Life insurance and disability
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) is a graduate research university dedicated to advancing AI as a global force for good.