Yotta Labs is building a protocol that serves as the Decentralized OS for AI workload orchestration at planet scale. The Decentralized Operating System (DeOS) from Yotta is designed to maximize the utilization of available resources by optimizing LLM training and inference flows and by efficiently scheduling AI workloads across decentralized networks of geo-distributed GPUs worldwide, pushing the aggregated processing limit to Yottascale. (Yottascale is 1,000,000 times exascale, the current limit of the fastest supercomputers in the world.) Founded by industry and academic experts in AI and HPC (High-Performance Computing), the Yotta Labs team has a proven track record of delivery. Using approaches the team invented to optimize resource orchestration and intra-/inter-node communication, we strive to unlock the full potential of decentralized AI. For more information about Yotta Labs, please refer to our Whitepaper: https://yottalabs.ai/whitepaper
GPU Cloud Platform Engineer
Location
United States, Canada, Brazil, Mexico, Argentina
Posted
30 days ago
Salary
Not specified
Job Description
Job Requirements
- Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
- 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
- Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl and Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
- Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
- Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
- Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
- Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
- Understanding of high-performance communication protocols such as IB, RoCE, NVLink, and PCIe.
- Strong communication skills, self-motivation, and the ability to collaborate within a team.
🌟 Preferred Experience
- Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
- Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the H100.
- Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
- Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
- Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.
🌐 Why Join Yotta Labs?
- Be part of a visionary team aiming to redefine AI infrastructure.
- Work on cutting-edge technologies that bridge AI and decentralized computing.
- Collaborate with experts from leading institutions and tech companies.
- Enjoy a flexible, remote work environment that values innovation and autonomy.
📩 How to Apply
Interested candidates should apply directly or send their resume and a brief cover letter to careers@yottalabs.ai. Please include links to any relevant projects or contributions.