The data-centric AI platform powered by programmatic labeling and foundation models
Applied Research Engineer
Location
United States
Posted
4 days ago
Salary
$150K - $180K / year
Job Description
Role Description
As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role where you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that allows our research and engineering teams to run experiments reliably and at scale. You will work closely with research scientists and engineers, translating training requirements into robust, reproducible systems—and proactively removing infrastructure blockers before they slow down the work that matters most.
Snorkel AI operates in a fast-paced, high-impact environment. We are looking for someone who takes pride in operational excellence, loves solving complex distributed systems problems, and thrives when given real ownership.
Location: Redwood City or San Francisco, or remote
Main Responsibilities
- Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
- Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
- Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
- Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
- Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures.
- Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workload needs change.
Qualifications
- Hands-on experience managing GPU clusters on major cloud providers, including provisioning, network configuration, and cost management.
- Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent cluster management systems.
- Working knowledge of distributed training concepts: parallelism strategies, memory optimization techniques, and inter-node communication.
- Experience setting up, managing, and integrating ML experiment tracking and data/model versioning tools.
- Strong Python proficiency and solid software engineering fundamentals such as version control, modular design, and automation.
- Ability to work in a fast-moving, iterative environment and take end-to-end ownership of ambiguous infrastructure problems.
- Hands-on experience with post-training workflows such as supervised fine-tuning (SFT) or reinforcement learning (RLHF, GRPO, or similar) is a strong plus, but not required.
Salary Range
$150,000.00 – $180,000.00
Benefits
Joining Snorkel AI means becoming part of a company that has market-proven solutions and robust funding and is scaling rapidly, offering a unique combination of stability and the excitement of high growth. As a member of our team, you’ll have meaningful opportunities to shape priorities and initiatives, influence key strategic decisions, and directly impact our ongoing success. Whether you’re looking to deepen your technical expertise, explore leadership opportunities, or learn new skills across multiple functions, you’re fully supported in building your career in an environment designed for growth, learning, and shared success.