Senior / Principal Site Reliability Engineer

DataCrunch.com

Hybrid

Remote (US)

Full Time

Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.

We’re ambitious, curious, and gutsy doers. We keep hierarchy low across the company and morale high in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just a job - we offer a career-defining opportunity to be part of building something big!

As a cherry on top, we’ve recently raised $64M in Series A and are ready to reach new heights.

We’re seeking a Senior or Principal Site Reliability Engineer (SRE) to become our first U.S. hire, based in the Bay Area. This is a pivotal role as we expand our operations across the West Coast. You’ll work closely with our European engineering teams to scale our high-performance compute (HPC) and cloud infrastructure globally. As our initial U.S.-based engineer, you’ll set the standard for reliability, automation, and operational excellence.

  • Generous cash + equity compensation along with various fringe benefits (e.g., healthcare, lunch, wellbeing, etc.).
  • Profitable operations, in addition to fast growth.
  • Role that offers plenty of space to both make a business-critical impact and become a QA team lead or an engineer.
  • Small yet mighty team of 65, challenging the status quo to positively impact the lives of many people.
  • 27 nationalities in total, with 6 different ones in the management team.
  • Work mode: Remote (with plans to open our first U.S. office next year)
  • Seniority level: Senior
  • Employment type: Full-time, permanent
  • Ensure the reliability, scalability, and performance of HPC and cloud systems.
  • Build and maintain automation, observability, and monitoring frameworks for compute clusters.
  • Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
  • Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
  • Participate in architecture design and long-term infrastructure strategy discussions.
  • Help establish local infrastructure and contribute to the setup of our future San Francisco office.
  • Play a key role in recruiting and mentoring as our U.S. team grows.
  • 7+ years in SRE, DevOps, or Infrastructure Engineering—preferably in HPC or large-scale distributed systems.
  • Linux expertise (Ubuntu or Debian preferred).
  • Strong experience with scripting and automation (Python, Go, Bash).
  • Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
  • Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible).
  • Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
  • Familiarity with ML model training environments.
  • Understanding of Kubernetes (nice to have).
  1. Intro chat with our Talent Acquisition Partner - an initial online conversation to learn more about you and share details about the role.
  2. Technical assignment - a short task (around 15 minutes) to understand your approach and problem-solving style.
  3. Online technical interview with the Hiring Manager - a deeper discussion about your technical experience and ways of working.
  4. In-person interview with one of our team members - a chance to get to know the team and our culture.
  5. Final interview with our CTO & CEO - to align on vision and expectations.

October 30, 2025