
Forward Deployed Data Engineer
Ellison Institute of Technology
Posted about 18 hours ago
At the Ellison Institute of Technology (EIT), we’re on a mission to translate scientific discovery into real-world impact. We bring together visionary scientists, technologists, policy makers, and entrepreneurs to tackle humanity’s greatest challenges in four transformative areas:
- Health, Medical Science & Generative Biology
- Food Security & Sustainable Agriculture
- Climate Change & Managing CO₂
- Artificial Intelligence & Robotics
This is ambitious work - work that demands curiosity, courage, and a relentless drive to make a difference. At EIT, you’ll join a community built on excellence, innovation, tenacity, trust, and collaboration, where bold ideas become real-world breakthroughs. Together, we push boundaries, embrace complexity, and create solutions to scale ideas from lab to society. Explore more at www.eit.org
Requirements
Job Summary:
Our platform connects physical hardware - robotic systems, ambient sensor rigs, lab hardware - to a cloud-native data platform that captures, stores, and serves high-frequency multimodal data. The platform spans an edge-to-cloud telemetry stack (MQTT, Kafka, multimedia streams), REST and streaming APIs, and a growing set of hardware integrations across multiple sites.
As a Forward Deployed Data Engineer, you’ll work directly alongside hardware engineers, scientists, and the core platform team to build the data layer that makes our autonomous operations possible. This is a hands-on, high-impact role for someone who is comfortable combining hardware data streams with reproducible, version-controlled, and well-structured data pipelines, and who is at ease bringing their own expertise into diverse groups.
Day-to-Day, You Might
- Integrate new lab instruments and robotic systems into the platform - writing device APIs, defining event schemas, and validating data at the edge before it reaches cloud storage.
- Build and maintain high-throughput, low-latency ingestion pipelines for streaming data: live video feeds, sensor telemetry, robot joint state, and instrument control signals.
- Work across the edge-to-cloud stack - from configuring edge devices and MQTT brokers through to Kafka topics, cloud storage, and the APIs that expose data to scientists and downstream ML pipelines.
- Design and evolve the common data model for hardware execution and scientific outcome data, with a strong focus on schema stability, versioning, and provenance.
- Collaborate with scientists and hardware engineers to turn raw instrument output into research-ready, schema-validated data for model training.
- Contribute to an engineering culture that values maintainability, testing, robust system design, and deep collaboration, but allows flexibility for rapid prototyping and responsiveness to changing landscapes.
What Makes You a Great Fit
Nobody checks every box - if you’re not sure whether you’re qualified, we still encourage you to apply.
- You have strong programming experience in Python, and value code quality, reliability, and readability as much as performance.
- You have experience working on cloud compute platforms, containers and Linux environments.
- You think in systems and own them end-to-end - from device output to APIs - and embrace long-term engineering rather than one-off scripts.
- You have hands-on experience building data pipelines for physical systems: autonomous vehicles, robotics, clinical/lab instruments, industrial control systems, or similar environments.
- You have experience with real-time data streaming: Kafka or equivalent message brokers, MQTT or similar protocols, and the challenges that come with high-frequency, low-latency data capture.
Great to Also Have
- Familiarity with live video or high-bandwidth media streams as data engineering problems.
- Experience in automated lab, clinical lab, or life sciences environments - understanding of instrument APIs, lab protocols, and the data quality expectations of scientific workflows.
- Comfort with time-series data from sensors and control systems, including sampling rates, data loss handling, and operational (driving live systems) vs analytical (modelling) use cases of the same data.
- Understanding of closed-loop control systems and the data infrastructure needed to support real-time decision making.
Why This Role
You'll be part of a nimble team building infrastructure that directly enables autonomous science at scale. The hardware is real, the data volumes are high, and the use cases range from live lab control to training foundation models. The decisions you make now about schemas, protocols, and pipeline architecture will shape how the platform grows across new sites and hardware categories in 2027 and beyond.