Site Reliability Engineer

Firmus Technologies

Posted about 12 hours ago

Firmus Technologies

Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs. Our mission is to create the most energy-efficient AI infrastructure, combining cutting-edge technology with a steadfast commitment to sustainability.

Through ground-breaking research and development, we invented a verticalized AI Factory - a new class of digital infrastructure that replaces traditional data centres. Built on new approaches to liquid cooling, energy management, water use and modular construction methodology, the Firmus AI Factory delivers low-cost AI tokens across Asia-Pacific.

Firmus AI Cloud

We provide customers with access to energy savings via our large-scale GPU cloud, Firmus AI Cloud. Rated Silver in the GPU Cloud ClusterMAX Rating System, our cloud empowers developers, enterprise, education and government users to train AI models with unmatched efficiency and cost savings. With an ever-growing list of services and applications, we are committed to building a cloud experience for our customers that is market-leading, proprietary and built to scale.

Why you’ll love working here

  • A fast-paced and dynamic environment working with next-gen technology. You’ll be operating at the intersection of sustainability and artificial intelligence – helping to transform an industry.
  • Work alongside colleagues who are true innovators and leaders in their field.
  • As an emerging company, we work as a close-knit team. Work with the founders, grow a strong network, and witness the impact you make first-hand as we democratise AI tools for everyone – more sustainably, and more affordably.
  • We believe that people from diverse backgrounds come together to do their best work, be their authentic selves, and build great things. We are proud to be an equal opportunity employer.

ROLE SUMMARY

Firmus Technologies is seeking a skilled Site Reliability Engineer to join our Operations team, supporting the daily operations and maintenance of our AI-accelerated High-Performance Computing (HPC) infrastructure. This role will work closely with Field Service Engineers, HPC and Network Engineering teams, and assist the Global Operations Centre (GOC). This is a unique opportunity to contribute directly to the stability and growth of cutting-edge AI infrastructure.


KEY RESPONSIBILITIES

  • Support the deployment, configuration, and maintenance of high-end GPU servers, storage servers, networking equipment and software components in highly secure environments.
  • Perform hardware diagnostics, system functionality checks and firmware updates as required.
  • Collaborate with engineering teams to assist in deploying tailored customer environments (e.g. bare-metal systems, HPC clusters, Kubernetes, Slurm).
  • Serve as the first line of engineering support for onsite operational issues, including troubleshooting hardware, network and software problems, and firmware compliance.
  • Troubleshoot incidents, escalate critical issues and provide feedback to appropriate teams for improvements.
  • Participate in an on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
  • Provide technical support to the GOC Support Specialist team in troubleshooting problems related to compute infrastructure.
  • Document incident details, resolutions, and lessons learned to enhance future problem-solving.
  • Maintain clear, accurate, and up-to-date documentation to promote effective knowledge sharing across the team.
  • Communicate effectively with GOC, HPC Engineers, internal teams, stakeholders, and end-users to ensure alignment on issue resolution.
  • Take part in team meetings and knowledge-sharing sessions to foster collaboration and continuous learning.

SKILLS AND EXPERIENCE

  • Bachelor’s degree in computer engineering, computer science, or a related technical field.
  • 5+ years of experience in technical field service roles.
  • Strong understanding of server hardware, the firmware lifecycle, Linux environments and hardware troubleshooting, with adherence to physical and system-level security standards.
  • Experience with scripting languages (e.g. Bash, Python).
  • Familiarity with configuration management, CI/CD tools, workload managers and cluster software (e.g. Slurm, Kubernetes, NVIDIA BCM), and observability tools (e.g. Prometheus, Grafana, ELK).
  • Excellent problem-solving and analytical skills.
  • Ability to work independently and as part of a team.
  • Strong communication skills, both written and verbal.

Location & Reporting

  • Based in: Singapore
  • Reporting to: Senior Operations Manager

Employment Basis

Full-time

Job details

Workplace: Office

Location: Singapore
