Rad AI

Made for radiologists, by radiologists.

Staff Machine Learning Engineer – Infrastructure

Full Time, Remote, Team 51-200, Since 2018, H1B Sponsor

Location

United States

Posted

87 days ago

Salary

$200K - $240K / year

Bachelor Degree, 8 yrs exp, English, Airflow, Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, JavaScript, Kubernetes, Python, PyTorch, Terraform, TypeScript

Job Description

  • Architect the infrastructure that supports our machine learning applications, services, and workflows
  • Architect and maintain our ML platform that supports continuous integration, continuous delivery, and continuous training for our machine learning models
  • Develop cloud-native services and serverless architectures to build scalable and resilient systems
  • Partner with data scientists to design the data pipelines that enable various machine learning models in production
  • Write code that meets our internal standards for security, style, maintainability, and best practices for a high-scale HIPAA web environment
  • Design, deploy, and maintain the full ML platform stack, including monitoring and observability, data analytics, backend integration with customer-facing products, and the full model R&D lifecycle
  • Work with Product Management, Research, and Engineering to iterate on new features and address inefficiencies across our AI/ML infrastructure

Job Requirements

  • 8+ years of industry experience in ML Engineering in cloud-native environments
  • In-depth knowledge of Python (required), JavaScript/TypeScript (nice to have), or other modern languages in the ML domain
  • Strong experience with infrastructure and DevOps tools such as Kubernetes, Docker, and Ansible
  • Strong knowledge of cloud computing platforms such as AWS (preferable), GCP, and Azure
  • Experience architecting distributed systems, storage systems, and databases
  • Experience working with machine learning frameworks such as PyTorch and LangGraph
  • Experience with Airflow (preferable) or other orchestration tools
  • Experience with infrastructure-as-code tools such as Terraform (preferable), Pulumi, CloudFormation, etc.
  • Experience with monitoring, tracing, and logging tools such as CloudWatch, New Relic, Grafana, etc.
  • Excellent communication skills, with a strong sense of ownership and a systematic approach to problem-solving
  • Proven ability to manage and lead active incidents, address what caused them, and establish systems to avoid them in the future via blameless postmortems

Benefits

  • Comprehensive Medical, Dental, Vision & Life insurance
  • HSA (with employer match), FSA, & DCFSA
  • 401(k)
  • 11 Paid Company Holidays
  • Location Flexibility (Remote-first company!)
  • Flexible PTO policy
  • Annual company-wide offsite
  • Periodic team offsites
  • Annual equipment stipend
  • For roles based outside the US, your recruiter can share more details
