Infrastructure Engineer – AI HPC Compute
CloudRaft.com
Remote
Full Time
About CloudRaft
CloudRaft is a dynamic and forward-thinking company specializing in cutting-edge AI and cloud-native solutions. We thrive on creativity, collaboration, and innovation, empowering our team to solve complex challenges and deliver impactful results. Join us to be part of a team that values growth, excellence, and a passion for technology.
Overview
We are seeking an Infrastructure Engineer to lead the lifecycle management of cutting-edge GPU-based platforms powering AI, ML, and HPC workloads. You will design, automate, and optimize high-performance compute clusters, ensuring robust provisioning, observability, and operational excellence at scale.
Key Responsibilities
NOTE: This position includes participation in a 24x7 on-call rotation to provide operational support as needed. This is a remote position open to candidates currently residing in India.
GPU & Baremetal Lifecycle Management
- Own end-to-end lifecycle management of high-performance GPU servers (NVIDIA/AMD), optimizing for AI/ML training, inference, and HPC workloads.
- Install, upgrade, and maintain operating systems (Ubuntu, RHEL), NVIDIA GPU drivers, DCGM, firmware, BIOS, and BMC, with a focus on system stability and performance.
- Perform advanced diagnostics, health checks, and RMA/hardware swaps in collaboration with data center vendors (a minimal health-check sketch follows this list).
- Standardize and maintain golden images tuned for AI/ML performance, ensuring fast and reproducible deployments.
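To give candidates a concrete flavor of the health-check work above, here is a minimal sketch of a node-local GPU check in Python. The temperature threshold, the queried fields, and the script itself are illustrative assumptions rather than CloudRaft tooling; in practice this class of check is often built on DCGM.

```python
#!/usr/bin/env python3
"""Minimal GPU health-check sketch: queries nvidia-smi on the local node and
flags GPUs reporting uncorrected ECC errors or high temperature."""
import csv
import io
import subprocess

TEMP_LIMIT_C = 85  # hypothetical alerting threshold, not a CloudRaft standard

def query_gpus():
    # nvidia-smi exposes these fields via --query-gpu with CSV output.
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,temperature.gpu,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return list(csv.reader(io.StringIO(out)))

def main():
    for index, name, temp, ecc in query_gpus():
        problems = []
        if int(temp) >= TEMP_LIMIT_C:
            problems.append(f"temperature {temp.strip()}C")
        if ecc.strip().isdigit() and int(ecc) > 0:
            problems.append(f"{ecc.strip()} uncorrected ECC errors")
        status = "OK" if not problems else "CHECK: " + ", ".join(problems)
        print(f"GPU {index} ({name.strip()}): {status}")

if __name__ == "__main__":
    main()
```

Checks like this are typically run per node on a schedule and fed into the alerting stack described under Monitoring & Observability.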
Cluster Networking & Storage
- Apply a working understanding of high-speed network fabrics (InfiniBand, 400G Ethernet) and distributed storage (e.g., Ceph, Lustre, VAST Data) to build scalable, low-latency compute clusters.
- Liaise with network engineering to diagnose network bottlenecks and optimize for distributed AI workloads.
Provisioning & Automation
- Design and implement automation for GPU node imaging, provisioning, access control, and scheduling in multi-tenant environments (Bash, Python, Ansible).
- Ensure rapid and consistent provisioning of GPU resources tailored to workload-specific requirements (AI training, simulation, etc.).
- Integrate real-time GPU metrics into dashboards and billing systems for transparency and usage reporting (see the usage-sampling sketch after this list).
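As an illustration of the usage-reporting responsibility above, the following sketch samples per-GPU utilization with the nvidia-ml-py (pynvml) bindings. The tenant label and JSON output are placeholders; a real pipeline would join GPU indices to tenant allocations taken from the scheduler.

```python
"""Sketch of per-GPU utilization sampling for usage reporting via pynvml."""
import json
import time

import pynvml

def sample_utilization(tenant="tenant-placeholder"):  # placeholder tenant label
    pynvml.nvmlInit()
    try:
        records = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            records.append({
                "timestamp": int(time.time()),
                "tenant": tenant,
                "gpu_index": i,
                "gpu_util_pct": util.gpu,
                "mem_used_bytes": mem.used,
            })
        return records
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(json.dumps(sample_utilization(), indent=2))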
Monitoring & Observability
- Deploy and manage observability stacks (Prometheus, Grafana) for deep monitoring of GPU/server performance, thermal profiles, and system health (a minimal exporter sketch follows this list).
- Build dashboards and intelligent alerting systems for proactive incident response (e.g., thermal throttling, resource contention).
- Deliver actionable insights to engineering teams and end users on resource utilization and efficiency.
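For the observability work above, a minimal Prometheus exporter might look like the sketch below. The scrape port, metric names, and 15-second loop are assumptions; in production, NVIDIA's DCGM-Exporter usually fills this role.

```python
"""Minimal Prometheus exporter sketch for GPU temperature and utilization."""
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU core temperature", ["gpu"])
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])

def collect():
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        GPU_TEMP.labels(gpu=str(i)).set(temp)
        GPU_UTIL.labels(gpu=str(i)).set(util)

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # hypothetical scrape port
    while True:
        collect()
        time.sleep(15)       # align with a typical Prometheus scrape interval
```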
DevOps and Infrastructure Engineering
- Manage and operate Kubernetes clusters with GPU scheduling (see the capacity-reporting sketch after this list).
- Champion Infrastructure-as-Code (Terraform, Ansible) for reproducible, scalable deployment of compute platforms.
- Develop and maintain CI/CD pipelines for infrastructure workflows supporting AI Factory or GPUaaS operations.
- Apply best practices for security, resiliency, and scaling in a distributed, high-throughput environment.
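As a small example of the Kubernetes GPU scheduling work above, the sketch below reports how many NVIDIA GPUs each node advertises as allocatable, using the official kubernetes Python client. It assumes the NVIDIA device plugin is deployed so nodes expose the nvidia.com/gpu extended resource.

```python
"""Sketch: report schedulable NVIDIA GPUs per node via the Kubernetes API."""
from kubernetes import client, config

def gpu_capacity():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        # Nodes without the device plugin simply report no nvidia.com/gpu resource.
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")

if __name__ == "__main__":
    gpu_capacity()
```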
Qualifications & Skills
- 3+ years of experience managing large-scale, bare metal compute infrastructure, with a specialization in NVIDIA GPUs.
- Expert-level Linux system administration (Ubuntu/RHEL) in production HPC/AI environments.
- In-depth, hands-on experience with NVIDIA drivers, DCGM, network fabrics, BIOS/firmware, and BMC management.
- Proficiency in automation and scripting (Python, Bash, Ansible); advanced debugging and troubleshooting skills.
- Experience in deploying and operating monitoring solutions (Prometheus, Grafana, or equivalent).
- Proven track record of implementing Infrastructure as Code (Terraform, Ansible, etc.).
- Solid grasp of HPC/AI workload patterns, complex job schedulers (SLURM, Kubernetes), and best practices for distributed training/inferencing.
- Strong communication skills and ability to document operational runbooks, images, and workflows.
Bonus Points
- Experience operating Kubernetes clusters in GPU-accelerated, bare metal environments.
- Familiarity with distributed storage (Ceph, Lustre, VAST Data, DDN, or similar), GPU virtualization, or multi-tenant hardware isolation.
- Background in building AI clouds or large-scale private clouds.
September 15, 2025