Flower Labs

Train AI on distributed data

Founding ML Engineer – Flower Frontier Model Team

Machine Learning Engineer | Full Time | Remote | Team 11-50 | Since 2023 | H1B No Sponsor

Location

California, New York

Posted

88 days ago

Salary

Not specified

Postgraduate Degree | Experience accepted | English | Distributed Systems | Docker | Linux | Node.js | Python | PyTorch

Job Description

  • Play a critical role in building SOTA LLMs and foundation models within a small, high-impact team
  • Help build a reliable, maintainable and scalable software stack
  • Produce world-leading models that are open-sourced and integrated into new Flower Labs products
  • Design, implement and optimize core components across every stage of frontier model building: data curation, evals, pre-training, post-training
  • Diagnose and resolve GPU/kernel issues, memory/storage bottlenecks, and multi-node failures at scale
  • Collaborate on debugging training instabilities and related issues
  • Devise the surrounding infrastructure, tooling, monitoring, and observability essential for large-scale LLM development
  • Contribute ideas, be heard and influence the direction of the company across the board.

Job Requirements

  • Exceptional software engineering skills (Python, deep learning frameworks, testing, profiling, refactoring, reproducibility)
  • Expertise with modern ML training stacks: PyTorch, JAX or equivalent; experience implementing model architectures from scratch and working within libraries like DeepSpeed, Megatron or equivalent
  • Ability to tune, debug, and profile large-scale training runs
  • Hands-on experience working with large GPU clusters, including job orchestration, scheduling, multi-node runs, NCCL/RDMA issues, and GPU performance optimization
  • Ability to collaborate effectively with both research-oriented and engineering-oriented colleagues; comfortable turning research ideas into robust, maintainable implementations
  • Good engineering hygiene: modular design, code reviews, documentation, reproducibility, versioning of data/models/configurations
  • Familiarity with common tools (Linux command line, git, Docker, …)
  • Openness to adopting new tooling
  • Solid understanding of distributed systems and networking
  • Strong written English
  • Open, honest and transparent communication skills.

Benefits

  • Flexible working hours
  • Professional development opportunities
