Domino Data Lab

The Enterprise MLOps platform powering over 20% of the Fortune 100

Staff Platform Reliability Engineer

Platform Engineer | Other | Remote | Senior | Team 201-500 | Since 2013 | H1B Sponsor

Location

United States

Posted

2 days ago

Salary

$185K - $230K / year

Seniority

Senior

Job Description

Who we are

At Domino, we build software that helps the largest, AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility — all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers — like Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA and the US Navy — are using our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented — and we want to be the ones building it. For more information, visit www.domino.ai

What we are building

The Automation Team at Domino acts as a force multiplier for engineering, building the tools and systems that enable teams to ship code confidently and consistently. A core part of this mission is Tempest, an in-house platform that orchestrates realistic, long-duration workloads against live Kubernetes clusters and validates the results against real observability data. Today, when scale testing surfaces a bottleneck, a resource misconfiguration, or a regression in system behavior, the team can identify and report the issue — but we need someone who can take the next step: profiling services, tracing root causes through Prometheus and New Relic data, and partnering with platform engineers to drive durable fixes. Focused on iteration and continuous improvement, the team looks for targeted enhancements that create outsized impact, and this role will close the gap between detection and resolution at the infrastructure level.

What your impact will be

In your first year, you will:

  • Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
  • Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing — working directly with platform and infrastructure teams to ship fixes, not just file tickets
  • Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on rigorous empirical testing across deployment sizes
  • Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
  • Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
  • Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
  • Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow

What we look for in this role

  • Background in SRE, platform engineering, or infrastructure with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
  • Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
  • Experience with observability stacks (Prometheus, Grafana, New Relic, or similar) — writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
  • Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
  • Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
  • Clear ownership mindset — self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment

What we value

  • We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success
  • We believe in individuals who seek truth and speak the truth and can be their whole selves at work
  • We value everyone who believes that improving is always possible. At Domino, everything is a work in progress: we can do better at everything
  • We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and at the company
  • We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply


The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.

Compensation Range

$185,000 - $230,000 USD

Benefits

  • 401(K), Childcare benefits, Commuter benefits, Company equity, Company-sponsored outings, Company sponsored family events, Continuing education stipend, Customized development tracks, Dental insurance, Disability insurance, Documented equal pay policy, Volunteer in local community, Family medical leave, Fitness stipend, Flexible Spending Account (FSA), Flexible work schedule, Generous parental leave, Company-sponsored happy hours, Health insurance, Job training & conferences, Open door policy, Life insurance, Mean gender pay gap below 10%, Online course subscriptions available, Onsite gym, Open office floor plan, Paid holidays, Paid industry certifications, Pair programming, Paid sick days, Partners with nonprofits, Pet friendly, Pet insurance, Promote from within, Recreational clubs, Lunch and learns, Remote work program, Free snacks and drinks, Team based strategic planning, OKR operational model, Continuing education available during work hours, Mandated unconscious bias training, Unlimited vacation policy, Vision insurance, Wellness programs, Some meals provided, Mental health benefits, Home-office stipend for remote employees, Diversity employee resource groups, Hiring practices that promote diversity
