Senior AI-HPC Cluster Engineer – MLOps
Machine Learning Engineer | Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor
Location
California, Texas
Posted
21 days ago
Salary
$184K - $356.5K / year
Bachelor Degree | 8 yrs exp | English | Docker | Kubernetes | Linux | Python | Rust | Go
Job Description
• Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage
• Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
• Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
• Support our researchers in running their workloads, including performance analysis and optimization
• Conduct root cause analysis and suggest corrective action
• Proactively find and fix issues before they occur
• Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale
Job Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- 8+ years of experience crafting and operating large-scale compute infrastructure
- Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
- Applied experience with AI/HPC workflows that use MPI and NCCL
- Proficiency with Linux, including CentOS/RHEL and/or Ubuntu distributions
- A solid understanding of container technologies like Enroot, Docker and Podman
- Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
- Experience analyzing and tuning performance for a variety of AI/HPC workloads
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
- Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
- Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields
Benefits
- equity
- benefits