Andromeda Cluster

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, which filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

Performance Engineer - AI Infrastructure

Infrastructure Engineer | Full Time | Remote | Team 11-50

Location: United States

Posted: 21 days ago

Salary: Not specified

Job Description

Role Description

We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers.

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
  • Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

Qualifications

  • You love optimizing performance and digging into systems to understand how every layer interacts—from the training loop to the hardware.
  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
  • Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
  • A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Requirements

  • Experience with Linux kernel tuning, eBPF, and understanding systems design tradeoffs at the hardware level.
  • Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
  • Expertise in security best practices for high-scale infrastructure.
  • Familiarity with monitoring tools like Prometheus and Grafana.

Benefits

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.
