Spotify

Passionate music fans. Innovative tech pros. Perfect harmony. Join our band.

Senior Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Team 5,001-10,000 | Since 2008 | H1B Sponsor

Location

New York

Posted

1 day ago

Salary

$164.4K - $234.9K / year

5 yrs exp, English, AWS, Cloud, Distributed Systems, Google Cloud Platform, Java, Kubernetes, Python, React, Terraform, TypeScript, Go

Job Description

  • Own fleet reliability. Lead the reliability, security, and scalability strategy for Portal’s SaaS infrastructure, including the runtime environments that power our platform and LLM-driven agent workflows. Define SLOs, drive capacity planning, and ensure our systems meet the demands of a rapidly growing product.
  • Architect for the agentic era. Design and evolve infrastructure on GCP and AWS using Terraform and infrastructure-from-code patterns. Shape how we structure environments for non-deterministic AI workloads, including sandboxing, resource isolation, cost governance, and security boundaries.
  • Drive operational excellence. Evolve our incident management, on-call, and postmortem practices. Leverage AI assistants to accelerate root cause analysis and build increasingly self-healing capabilities into our production systems.
  • Lead fullstack reliability. Operate across a modern web stack (TypeScript, React, Python). While not frontend-heavy, you’ll diagnose and resolve issues across the stack and drive reliability improvements end-to-end.
  • Mentor and multiply. Raise the reliability IQ of the broader engineering team. Establish SRE best practices, conduct production-readiness reviews, and mentor engineers on operational thinking.
  • Shape the roadmap. Partner with engineering and product leadership to evolve our infrastructure in step with generative AI features. Translate operational insights into strategic input on the product roadmap.

Job Requirements

  • You have 5+ years of hands-on experience operating cloud infrastructure (GCP and/or AWS), using Terraform and Kubernetes to run production systems at scale.
  • You have practical experience with, or a strong demonstrated interest in, operating LLM-based systems, RAG pipelines, or agentic workloads, and you understand the reliability challenges of non-deterministic systems.
  • You think in distributed systems first principles (consistency, availability, partition tolerance) and translate that thinking into pragmatic infrastructure decisions.
  • You are proficient in at least one modern language (TypeScript, Java, Go, or Python) and comfortable navigating large, heterogeneous codebases, including environments where AI-generated PRs are common.
  • You build automation and improve systems so that whole categories of operational issues disappear over time.
  • You communicate complex infrastructure trade-offs clearly to both technical and non-technical stakeholders, and write postmortems that lead to meaningful change.

Benefits

  • health insurance
  • six-month paid parental leave
  • 401(k) retirement plan
  • monthly meal allowance
  • 23 paid days off
  • paid flexible holidays
  • paid sick leave

More DevOps Engineer Jobs

Senior DevOps Engineer I

TrueML

TrueML is a fintech company building software to create positive experiences for consumers seeking financial health.

DevOps Engineer | 2 days ago
Full Time | Remote | Team 51-200 | H1B No Sponsor

As a Senior DevOps Engineer, you will enhance our cloud-native infrastructure, manage IaC with Terraform, and optimize CI/CD processes, focusing on AWS and Kubernetes operations.

ArgoCD, AWS, GitHub Actions, Go, Helm, Kubernetes, Python, Terraform, TypeScript
Indiana

DevOps Architect

Effectual

Cloud Confidently®

DevOps Engineer | 2 days ago
Full Time | Remote | Team 201-500 | H1B Sponsor

DevOps Architect designing and optimizing DevOps practices at Effectual

Ansible, AWS, Azure, Cloud, Docker, Google Cloud Platform, Jenkins, Kubernetes, Python, Terraform
United States

DevOps Engineer | 2 days ago
Full Time | Remote | Team 405

This position is on the DevOps team, supporting the MNTN platform and Engineers. The right person will not only have a deep knowledge of system administration and GCP, but will also be able to work with a variety of Developers. You will work closely with our Engineering team and ...

Google Cloud, Kubernetes, GKE, Python, Terraform, Helm, ArgoCD, GitOps, CI/CD, GitHub Actions, Docker, microservices, IAM, FinOps, monitoring
United States

Senior Site Reliability Engineer

Akamai

Akamai powers and protects life online. Leading companies worldwide choose Akamai to build, deliver, and secure their digital experiences, helping billions of people live, work, and play every day. With the world's most distributed compute platform, from cloud to edge, we make it easy for customers to develop and run applications, while we keep experiences closer to users and threats farther away. Join us. Are you seeking an opportunity to make a real difference in a company with a global reach and exciting services and clients? Come join us and grow with a team of people who will energize and inspire you!

DevOps Engineer | 2 days ago
Full Time | Remote | Team 5,001-10,000

Do you enjoy collaborating with teams to solve complex challenges? Do you enjoy solving large-scale distributed content delivery challenges? Join our critical Platform and Reliability Engineering Team! The Platform & Reliability Engineering team is responsible for defining, measu...

United States