Senior AI-HPC Cluster Engineer – MLOps

Machine Learning Engineer | Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California, Texas

Posted

21 days ago

Salary

$184K - $356.5K / year

Bachelor Degree | 8 yrs exp | English | Docker | Kubernetes | Linux | Python | Rust | Go

Job Description

  • Provide leadership and strategic mentorship on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
  • Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions
  • Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
  • Support our researchers in running their workloads, including performance analysis and optimizations
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix issues before they occur
  • Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale

Job Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • 8+ years of experience designing and operating large-scale compute infrastructure
  • Experience with AI/HPC job schedulers and orchestrators, such as Slurm, Kubernetes (K8s), or LSF
  • Applied experience with AI/HPC workflows that use MPI and NCCL
  • Proficiency with Linux, including CentOS/RHEL and/or Ubuntu distributions
  • A solid understanding of container technologies like Enroot, Docker and Podman
  • Proficiency in at least one scripting language (Python, Bash) and at least one compiled language (Go, Rust, C, C++...)
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
  • Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields

Benefits

  • equity
  • benefits
