Vultr

Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.

AI Cluster Architect

Artificial Intelligence | Full Time | Remote | Team 201-500 | Since 2014

Location

United States

Posted

14 days ago

Salary

$165K - $185K / year

Professional Certificate | 7 yrs exp | English | Node.js

Job Description

  • Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving the necessary headroom for compute services, storage, and networking.
  • Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
  • Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, SpectrumX) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
  • Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
  • Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
  • Develop power-aware cluster configuration templates and capacity-planning models that scale across sites with varying constraints and allow for quick iteration and ideation.
  • Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
  • Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
  • Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

Job Requirements

  • 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
  • Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
  • Strong familiarity with InfiniBand, RoCE, and SpectrumX networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
  • Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
  • Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
  • Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
  • Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
  • Strong documentation, communication, and cross-functional collaboration skills.

Benefits

  • Excellent medical benefits w/ 100% company-paid premiums for the employee-only plan + 100% company-paid dental & vision premiums
  • 401(k) plan that matches 100% up to 4% with immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
  • Commitment matters to Vultr! Increased PTO at 3-year & 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
  • $500 first year remote office setup + $400 each following year for new equipment
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company-paid Wellable subscription
