Train AI on distributed data

Founding ML Engineer – Flower Frontier Model Team

Machine Learning Engineer · Full Time · Remote · Team 11-50 · Since 2023 · H1B No Sponsor

Location

California + 1 more

Posted

88 days ago

Salary

Not specified

Postgraduate Degree · Experience accepted · English · Distributed Systems · Docker · Linux · Node.js · Python · PyTorch

Job Description

• Play a critical role in building SOTA LLMs and foundation models within a small, high-impact team
• Help build a reliable, maintainable and scalable software stack
• Produce world-leading models that are open-sourced and integrated into new Flower Lab products
• Design, implement and optimize core components across the full spectrum of stages relevant to frontier model building: data curation, evals, pre-training, post-training
• Diagnose and resolve GPU/kernel issues, memory/storage bottlenecks, and multi-node failures at scale
• Collaborate on the debugging of training instabilities and related issues
• Devise surrounding infrastructure, tooling, monitoring, and observability, essential for large-scale LLM development
• Contribute ideas, be heard and influence the direction of the company across the board

Job Requirements

Exceptional software engineering skills (Python, deep learning frameworks, testing, profiling, refactoring, reproducibility)
Expertise with modern ML training stacks: PyTorch, JAX or equivalent; experience implementing model architectures from scratch and working within libraries like DeepSpeed, Megatron or equivalent
Ability to tune, debug, and profile large-scale training runs
Hands-on experience working with large GPU clusters, including job orchestration, scheduling, multi-node runs, NCCL/RDMA issues, and GPU performance optimization
Ability to collaborate effectively with both research-oriented and engineering-oriented colleagues; comfortable turning research ideas into robust, maintainable implementations
Good engineering hygiene: modular design, code reviews, documentation, reproducibility, versioning of data/models/configurations
Familiarity with common tools (Linux command line, git, Docker, …)
Openness to adopting new tooling
Solid understanding of distributed systems and networking
Strong written English
Open, honest and transparent communication skills