Vultr
Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.
AI Cluster Architect
Artificial Intelligence, Full Time, Remote, Team 201-500, Since 2014
Location
United States
Posted
14 days ago
Salary
$165K - $185K / year
Professional Certificate, 7 yrs exp, English, Node.js
Job Description
• Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving the necessary headroom for compute services, storage, and networking.
• Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
• Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, Spectrum-X) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
• Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
• Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
• Develop power-aware cluster configuration templates and capacity-planning models that can scale across sites with varying constraints and allow for quick iteration and ideation.
• Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
• Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
• Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).
Job Requirements
- 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
- Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
- Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
- Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
- Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
- Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
- Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
- Strong documentation, communication, and cross-functional collaboration skills.
Benefits
- Excellent medical benefits with 100% company-paid premiums for the employee-only plan + 100% company-paid dental & vision premiums
- 401(k) plan that matches 100% up to 4% with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
- Commitment matters to Vultr! Increased PTO at the 3-year & 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 first year remote office setup + $400 each following year for new equipment
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription