Flower Labs
Train AI on distributed data
Founding ML Engineer – Flower Frontier Model Team
Machine Learning Engineer, Full Time, Remote, Team 11-50, Since 2023, H1B No Sponsor
Location
California, New York
Posted
88 days ago
Salary
Not specified
Postgraduate Degree, Experience accepted, English, Distributed Systems, Docker, Linux, Node.js, Python, PyTorch
Job Description
• Play a critical role in building SOTA LLMs and foundation models within a small, high-impact team
• Help build a reliable, maintainable and scalable software stack
• Produce world-leading models that are open-sourced and integrated into new Flower Labs products
• Design, implement and optimize core components across the full spectrum of stages relevant to frontier model building: data curation, evals, pre-training, post-training
• Diagnose and resolve GPU/kernel issues, memory/storage bottlenecks, and multi-node failures at scale
• Collaborate on the debugging of training instabilities and related issues
• Devise the surrounding infrastructure, tooling, monitoring, and observability essential for large-scale LLM development
• Contribute ideas, be heard, and influence the direction of the company across the board
Job Requirements
- Exceptional software engineering skills (Python, deep learning frameworks, testing, profiling, refactoring, reproducibility)
- Expertise with modern ML training stacks: PyTorch, JAX or equivalent; experience implementing model architectures from scratch and working within libraries like DeepSpeed, Megatron or equivalent
- Ability to tune, debug, and profile large-scale training runs
- Hands-on experience working with large GPU clusters, including job orchestration, scheduling, multi-node runs, NCCL/RDMA issues, and GPU performance optimization
- Ability to collaborate effectively with both research-oriented and engineering-oriented colleagues; comfortable turning research ideas into robust, maintainable implementations
- Good engineering hygiene: modular design, code reviews, documentation, reproducibility, versioning of data/models/configurations
- Familiarity with common tools (Linux command line, git, Docker, …)
- Openness to adopting new tooling
- Solid understanding of distributed systems and networking
- Strong written English
- Open, honest and transparent communication skills.
Benefits
- Flexible working hours
- Professional development opportunities