poolside

World's most capable AI for software development

Member of Engineering – Pre-training, Synthetic Data

Software Engineer, Full Time, Remote, Team 51-200, Since 2023, H1B No Sponsor

Location

United States

Posted

45 days ago

Salary

Not specified

Bachelor Degree, English, Python

Job Description

  • You'll work on our data team, which focuses on the quality of the datasets delivered for training our models.
  • This is a hands-on role: your #1 mission is to improve the quality of the pretraining datasets by leveraging your previous experience, intuition and training experiments.
  • The role focuses particularly on generating synthetic data at scale and determining the best strategies for using such data to train large models.
  • You'll collaborate closely with other teams like Pretraining, Post-training, Evals, and Product to define high-quality data needs that map to missing model capabilities and downstream use cases.
  • Staying in sync with the latest research in synthetic data generation and pretraining is key to success in this role.
  • You will continually lead original research initiatives through short, time-bounded experiments while deploying highly technical engineering solutions into production.
  • Because the volumes of data to process are massive, you'll have a performant distributed data pipeline and a large GPU cluster at your disposal.
  • The goal is to deliver large, high-quality, and diverse synthetic datasets mixing natural language and code modalities to train best-in-class coding agents.

Job Requirements

  • Strong machine learning and engineering background
  • Experience with Large Language Models (LLM)
  • Understanding of how LLMs learn
  • Data ablations and scaling laws
  • Post-training techniques
  • Training reasoning and agentic models
  • Experience implementing cost-efficient, complex pipelines that generate synthetic datasets at scale, optimizing for data quality, correctness, diversity, etc.
  • Experience with evals tracking model capabilities (general knowledge, reasoning, math, coding, long-context, etc.)
  • Experience in building trillion-scale pretraining datasets, and familiarity with concepts like data curation, deduplication, data mixing, tokenization, curriculum, impact of data repetition, etc.
  • Excellent programming skills in Python
  • Strong prompt engineering skills
  • Experience working with large-scale GPU clusters and distributed data pipelines
  • Strong obsession with data quality
  • Research experience: authorship of scientific papers on topics such as applied deep learning, LLMs, or source code generation is a nice-to-have
  • Can freely discuss the latest papers and go into fine detail
  • Is reasonably opinionated

Benefits

  • Fully remote work & flexible hours
  • 37 days/year of vacation & holidays
  • Health insurance allowance for you and dependents
  • Company-provided equipment
  • Wellbeing, always-be-learning and home office allowances
  • Frequent team get-togethers
  • Great, diverse & inclusive people-first culture
