Andromeda
Where technology meets empathy – pioneering the future of human-robot interaction.
Performance Engineer – AI Infrastructure
Location: California
Posted: 6 days ago
Salary: Not specified
Bachelor Degree, English, Cloud, Kubernetes, Python, PyTorch, Rust, TensorFlow
Job Description
- Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
- Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
- Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
- Design technical processes that help the team operate effectively and avoid repeating performance regressions
Job Requirements
- Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
- Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
- Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
- Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
- Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements
Benefits
- Ownership and autonomy to shape how systems run
- A team that celebrates diversity and fosters an inclusive environment