At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundation models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work. With over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for an Inference Optimization MLE to help build and operate the systems that make our foundation models run fast and efficiently in production. You'll be responsible for squeezing maximum performance out of large multimodal models across cloud and on-robot deployment targets. You'll work closely with research and robotics teams to close the gap between training and real-world deployment.

What You'll Do

Own inference performance end-to-end — diagnose and improve latency, throughput, and efficiency of large foundation models in production
Build systematic performance attribution: latency decomposition (compute vs. memory bandwidth vs. I/O), bottleneck identification, and prioritization across model families
Apply and develop optimization techniques including quantization, pruning, distillation, operator fusion, and model compilation (e.g., TensorRT, torch.compile, XLA)
Optimize attention mechanisms, KV caching, and memory layouts for large multimodal models (vision, video, language, proprioception)
Work with kernel-level tooling (e.g., CUDA, Triton) to identify hotspots and implement or tune custom kernels where needed
Build benchmarking and regression detection infrastructure: latency baselines, throughput curves, and automated detection of performance regressions across model versions
Collaborate closely with research engineers to translate model innovations into optimized, deployment-ready implementations

What We're Looking For

3+ years of experience in inference optimization, ML systems, or a closely related field
Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
Strong understanding of compute, memory bandwidth, and I/O bottlenecks in large model inference
Experience with model optimization techniques: quantization (INT8/FP8/AWQ), distillation, pruning, and compilation
Familiarity with inference serving frameworks and runtimes (e.g., Triton Inference Server, TensorRT, vLLM, TorchServe)
Exceptional debugging and measurement ability: turn "inference is slow" into clear bottlenecks, experiments, and validated improvements
High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
Experience with multimodal or video model inference (variable-length sequences, packing/bucketing)
Familiarity with edge/cloud hybrid deployment patterns and on-robot inference constraints
Experience with speculative decoding, continuous batching, or other LLM serving optimizations
Background in streaming or low-latency systems relevant to real-time robot control

Why This Role

Direct leverage on research velocity and real-world robot performance — every efficiency gain you make accelerates model iteration and tightens the loop between model and robot behavior
Own the optimization layer that determines how quickly and efficiently our foundation models run in the real world — high ownership, high impact, small elite team

Inference Optimization ML Engineer
