Senior DGX Cloud Performance Engineer
Location
California, Texas, Washington
Posted
47 days ago
Salary
$152K - $287.5K / year
Bachelor Degree, 5 yrs exp, English, AWS, Azure, Cloud, Google Cloud Platform, Python, PyTorch, TensorFlow
Job Description
• Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of important applications and services.
• Construct carefully designed experiments to analyze and study performance bottlenecks and dependencies from an end-to-end perspective, and to develop critical insights into them.
• Develop ideas on how to improve end-to-end system performance and usability by driving changes in the HW, the SW, or both.
• Collaborate with AI researchers, developers, and application service providers to understand internal developer and external customer pain points and requirements, project future needs, and share best practices.
• Develop the modeling framework and TCO (total cost of ownership) analysis needed to enable efficient exploration and sweeps of the architecture and design space.
• Develop the methodology needed to drive the engineering analysis that informs the architecture, design, and roadmap of DGX Cloud.
Job Requirements
- Expertise in working with large-scale parallel and distributed accelerator-based systems
- Expertise in optimizing performance and AI workloads on large-scale systems
- Experience with performance modeling and benchmarking at scale
- Strong background in Computer Architecture, Networking, Storage systems, Accelerators
- Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM, among others)
- Experience with AI/ML models and workloads, in particular LLMs, as well as an understanding of DNNs and their use in emerging AI/ML applications and services
- Bachelor's/Master's in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
- 5+ years of experience in the above areas
- Proficiency in Python, C/C++
- Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, …)
Benefits
- equity
- benefits