Infrastructure Engineer – AI HPC Compute
CloudRaft.com
Remote
Full Time
About CloudRaft
CloudRaft is a dynamic and forward-thinking company specializing in cutting-edge AI and cloud-native solutions. We thrive on creativity, collaboration, and innovation, empowering our team to solve complex challenges and deliver impactful results. Join us to be part of a team that values growth, excellence, and a passion for technology.
Overview
We are seeking an Infrastructure Engineer to lead the lifecycle management of cutting-edge GPU-based platforms powering AI, ML, and HPC workloads. You will design, automate, and optimize high-performance compute clusters, ensuring robust provisioning, observability, and operational excellence at scale.
Key Responsibilities
NOTE: This position includes participation in a 24x7 on-call rotation to provide operational support as needed. This is a remote position open to candidates currently residing in India.
GPU & Baremetal Lifecycle Management
- Own end-to-end lifecycle management of high-performance GPU servers (NVIDIA/AMD), optimizing for AI/ML training, inference, and HPC workloads.
- Install, upgrade, and maintain operating systems (Ubuntu, RHEL), NVIDIA GPU drivers, DCGM, firmware, BIOS, and BMC, with a focus on system stability and performance.
- Perform advanced diagnostics, health checks, and RMA/hardware swaps in collaboration with data center vendors (a minimal health-check sketch follows this list).
- Standardize and maintain golden images tuned for AI/ML performance, ensuring fast and reproducible deployments.
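To give candidates a concrete flavor of the health-check work above, here is a minimal sketch of a node-local GPU check in Python. The temperature threshold, the queried fields, and the script itself are illustrative assumptions rather than CloudRaft tooling; in practice this class of check is often built on DCGM.

```python
#!/usr/bin/env python3
"""Minimal GPU health-check sketch: queries nvidia-smi on the local node and
flags GPUs reporting uncorrected ECC errors or high temperature."""
import csv
import io
import subprocess

TEMP_LIMIT_C = 85  # hypothetical alerting threshold, not a CloudRaft standard

def query_gpus():
    # nvidia-smi exposes these fields via --query-gpu with CSV output.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return list(csv.reader(io.StringIO(out)))

def main():
    for index, name, temp, ecc in query_gpus():
        problems = []
        if int(temp) >= TEMP_LIMIT_C:
            problems.append(f"temperature {temp.strip()}C")
        if ecc.strip().isdigit() and int(ecc) > 0:
            problems.append(f"{ecc.strip()} uncorrected ECC errors")
        status = "OK" if not problems else "CHECK: " + ", ".join(problems)
        print(f"GPU {index} ({name.strip()}): {status}")

if __name__ == "__main__":
    main()
```

Checks like this are typically run per node on a schedule and fed into the alerting stack described under Monitoring & Observability.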
Cluster Networking & Storage
- Apply a working understanding of high-speed network fabrics (InfiniBand, 400G Ethernet) and distributed storage (e.g., Ceph, Lustre, VAST Data) to build scalable, low-latency compute clusters.
- Liaise with network engineering to diagnose network bottlenecks and optimize for distributed AI workloads.
Provisioning & Automation
- Design and implement automation for GPU node imaging, provisioning, access control, and scheduling in multi-tenant environments (Bash, Python, Ansible).
- Ensure rapid and consistent provisioning of GPU resources tailored to workload-specific requirements (AI training, simulation, etc.).
- Integrate real-time GPU metrics into dashboards and billing systems for transparency and usage reporting (see the usage-sampling sketch after this list).
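As an illustration of the usage-reporting responsibility above, the following sketch samples per-GPU utilization with the nvidia-ml-py (pynvml) bindings. The tenant label and JSON output are placeholders; a real pipeline would join GPU indices to tenant allocations taken from the scheduler.

```python
"""Sketch of per-GPU utilization sampling for usage reporting via pynvml."""
import json
import time

import pynvml

def sample_utilization(tenant="tenant-placeholder"):  # placeholder tenant label
    pynvml.nvmlInit()
    try:
        records = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            records.append({
                "timestamp": int(time.time()),
                "tenant": tenant,
                "gpu_index": i,
                "gpu_util_pct": util.gpu,
                "mem_used_bytes": mem.used,
            })
        return records
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(json.dumps(sample_utilization(), indent=2))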
Monitoring & Observability
- Deploy and manage observability stacks (Prometheus, Grafana) for deep monitoring of GPU/server performance, thermal profiles, and system health (a minimal exporter sketch follows this list).
- Build dashboards and intelligent alerting systems for proactive incident response (e.g., thermal throttling, resource contention).
- Deliver actionable insights to engineering teams and end users on resource utilization and efficiency.
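For the observability work above, a minimal Prometheus exporter might look like the sketch below. The scrape port, metric names, and 15-second loop are assumptions; in production, NVIDIA's DCGM-Exporter usually fills this role.

```python
"""Minimal Prometheus exporter sketch for GPU temperature and utilization."""
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU core temperature", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])

def collect():
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        GPU_TEMP.labels(gpu=str(i)).set(temp)
        GPU_UTIL.labels(gpu=str(i)).set(util)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # hypothetical scrape port
    while True:
        collect()
        time.sleep(15)       # align with a typical Prometheus scrape interval
```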
DevOps and Infrastructure Engineering
- Manage and operate Kubernetes clusters with GPU scheduling (see the capacity-reporting sketch after this list).
- Champion Infrastructure-as-Code (Terraform, Ansible) for reproducible, scalable deployment of compute platforms.
- Develop and maintain CI/CD pipelines for infrastructure workflows supporting AI Factory or GPUaaS operations.
- Apply best practices for security, resiliency, and scaling in a distributed, high-throughput environment.
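As a small example of the Kubernetes GPU scheduling work above, the sketch below reports how many NVIDIA GPUs each node advertises as allocatable, using the official kubernetes Python client. It assumes the NVIDIA device plugin is deployed so nodes expose the nvidia.com/gpu extended resource.

```python
"""Sketch: report schedulable NVIDIA GPUs per node via the Kubernetes API."""
from kubernetes import client, config

def gpu_capacity():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        # Nodes without the device plugin simply report no nvidia.com/gpu resource.
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")

if __name__ == "__main__":
    gpu_capacity()
```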
Qualifications & Skills
- 3+ years of experience managing large-scale, bare metal compute infrastructure, with a specialization in NVIDIA GPUs.
- Expert-level Linux system administration (Ubuntu/RHEL) in production HPC/AI environments.
- In-depth, hands-on experience with NVIDIA drivers, DCGM, network fabrics, BIOS/firmware, and BMC management.
- Proficiency in automation and scripting (Python, Bash, Ansible); advanced debugging and troubleshooting skills.
- Experience in deploying and operating monitoring solutions (Prometheus, Grafana, or equivalent).
- Proven track record of implementing Infrastructure as Code (Terraform, Ansible, etc.).
- Solid grasp of HPC/AI workload patterns, complex job schedulers (SLURM, Kubernetes), and best practices for distributed training/inferencing.
- Strong communication skills and ability to document operational runbooks, images, and workflows.
Bonus Points
- Experience operating Kubernetes clusters in GPU-accelerated, bare metal environments.
- Familiarity with distributed storage (Ceph, Lustre, VAST Data, DDN, or similar), GPU virtualization, or multi-tenant hardware isolation.
- Background in building AI clouds or large-scale private clouds.
September 15, 2025