Reddit, Inc.

Dive into anything

Staff Research Engineer – Pre-training Data

Research Engineer · Full Time · Remote · Team 501-1,000 · Since 2005 · H1B No Sponsor

Location

United States

Posted

41 days ago

Salary

$230K - $322K / year

Bachelor Degree · 8 yrs exp · English · Python · Ray · Rust · Spark

Job Description

  • Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
  • Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
  • Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
  • Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
  • Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
  • Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
  • Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.

Job Requirements

  • 8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
  • Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
  • Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
  • Strong mathematical foundation in probability, statistics, and importance sampling theory.
  • Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
  • Experience working with graph data structures or serializing conversation trees is highly valued.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k with Employer Match
  • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave
