Principal Machine Learning Engineer

AI Engineer, Machine Learning Engineer, Full Time, Remote

Location

United States

Posted

13 days ago

Salary

Not specified

Job Description

This role reports to the Director of MLE and works closely with Engineering, Data Science, Product, and the Principal SRE. You will influence cross team platform standards and help elevate engineering rigor across ML and infrastructure. In addition to system design, you will mentor engineers on ML reliability, architecture decision making, and operational excellence.

  • Own ML systems architecture
  • Define ML lifecycle standards
  • Push event driven ML integration
  • Design model packaging and deployment strategy
  • Introduce systemic improvements
  • Reduce architectural and data debt
  • Establish testing and QA standards across ML workflows

Build a resilient, scalable ML platform that:

  • Trains distributed models at scale
  • Supports event driven feature computation
  • Enables portable model deployment (internal + external)
  • Standardizes ML lifecycle across products
  • Aligns infrastructure to product usage patterns

ML Platform Architecture

  • Define and evolve training orchestration standards
  • Batch vs. streaming inference strategy
  • Feature store direction
  • State store patterns and tooling
  • CPU/GPU scaling strategy
  • When to extend current tooling, and when to replace it

Event Driven ML Integration

  • Design feature pipelines as first-class ML system components
  • Integrate queuing and event systems with ML workflows
  • Build reactive retraining triggers
  • Define model drift detection and automated response systems
  • Ensure retraining pipelines are reproducible and fault tolerant

Model Packaging & Distribution

  • Define model artifact standardization
  • Deterministic builds
  • Dependency isolation
  • Runtime configuration injection
  • Security constraints
  • Version compatibility contracts

ML Observability, Testing & Reliability Standards

  • Define model performance SLIs
  • Drift detection frameworks
  • Data freshness guarantees
  • Latency SLOs
  • Model failure modes

Establish standards for:

  • Automated testing of feature pipelines
  • Training pipeline validation
  • Model artifact verification
  • CI/CD workflows for ML systems
  • Safe promotion from experiment to production

Work closely with the Principal SRE to integrate telemetry and operational standards across the full stack.

Operational Excellence & On Call

You will help define and operate a sustainable ML on call model in partnership with Engineering and SRE. This includes:

  • Clear ownership boundaries between ML systems and infrastructure
  • Incident classification and severity alignment
  • Runbooks for model failures and data drift
  • Postmortem processes focused on systemic improvement
  • Reducing operational toil through automation

You are comfortable being accountable for production ML systems, as well as designing systems that make firefighting rare.

Reduce Data Architecture Debt

  • Evaluate service landscape alignment to product usage
  • Improve or redefine streaming feature architecture
  • Reduce batch rigidity
  • Recommend infrastructure simplifications

Job Requirements

  • 10–15+ years of experience building and operating production systems
  • Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience. Advanced degrees are welcome but not required.
  • Deep production experience with distributed ML systems
  • Strong PyTorch and large-scale data engineering expertise
  • Experience with Ray or comparable distributed frameworks
  • Experience operating ML systems in production at scale
  • Exposure to event driven architectures
  • Experience improving testing and CI/CD practices for ML workflows
  • Adtech experience preferred but not required
  • Strong architectural opinions backed by real production experience

Current Technology Environment

  • ML Frameworks: PyTorch, Ray (Train, Tune, Datasets), PySpark ML
  • Data Platform: Databricks (Delta Lake, Unity Catalog), Snowflake, AWS (S3, EC2)
  • MLOps: MLflow (experiment tracking, model registry), GitHub Actions
  • Observability: Prometheus, Grafana, Datadog
  • Languages: Python, SQL, JavaScript/TypeScript
  • External LLM integrations (AWS Bedrock and OpenAI)

What They’re Looking For

  • Has designed ML systems from zero
  • Has migrated or rebuilt broken ML infrastructure
  • Has owned production model failures
  • Understands cost implications of ML design
  • Challenges architectural assumptions constructively

Anticipated Interview Process

  • Conversational + Architecture Discussion: A live discussion focused on past systems, tradeoffs, and a collaborative diagramming / troubleshooting exercise.
  • Take Home GitHub Exercise: A practical ML systems exercise evaluating structure, testing, reproducibility, and clarity.
  • DS/MLE Deep Dive: Technical and strategic discussion around platform evolution and leadership approach.
  • CEO Conversation: Focused on long term platform direction and company alignment.
