NVIDIA

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.

Senior AI and ML HPC Cluster Engineer

Artificial Intelligence | Full Time | Remote | Senior | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California, Colorado, Illinois, Texas, Washington

Posted

26 days ago

Salary

$152K - $287.5K / year

Seniority

Senior

Bachelor Degree | 5 yrs exp | Experience accepted | English | Ansible | Cloud | Docker | Kubernetes | Linux | Puppet | Python | SaltStack

Job Description

  • Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
  • Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions
  • Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud
  • Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
  • Support our researchers in running their workloads, including performance analysis and optimization
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix issues before they occur

Job Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, PBS, RTDA or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt
  • In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud
  • Proficiency in Python programming and bash scripting
  • Applied experience with AI/HPC workflows that use MPI
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Benefits

  • Eligible for equity and benefits
