Your app, Enterprise Ready.

Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Team 51-200, Since 2019

Location

United States

Posted

15 days ago

Salary

$175K - $275K / year

Bachelor Degree, English, AWS, Cloud, Grafana, Kubernetes, Prometheus, TypeScript

Job Description

• Design and evolve the systems, tooling, and processes that improve the reliability and performance of WorkOS
• Collaborate with product and infrastructure teams to ensure services are production-ready, observable, and resilient to failure
• Define and measure SLIs/SLOs to guide reliability improvements
• Write and optimize backend systems (in TypeScript) with a focus on performance, maintainability, and graceful degradation
• Improve our incident response process, lead postmortems, and drive follow-through on reliability risks
• Develop internal tools and automations that make it easier to operate and scale our systems
• Participate in our on-call rotation: responding to, resolving, and learning from production incidents
• Contribute to design and architecture discussions with a focus on operability and long-term sustainability
• Document systems, share learnings, and help grow a reliability-minded engineering culture

Job Requirements

Experience operating and scaling production systems in cloud environments (we use AWS)
Familiarity with service reliability concepts—monitoring, alerting, incident response, and root cause analysis
Comfort working across infrastructure layers (e.g. compute, networking, storage, observability tooling)
Strong debugging and systems thinking skills—you can follow problems across services and layers
Ability to work independently, take ownership, and drive projects from problem discovery through resolution
Nice to have
Familiarity with Kubernetes or similar orchestration systems
Exposure to observability stacks (e.g. Prometheus, Grafana, Datadog, OpenTelemetry)
Exposure to TypeScript or interest in working in a TypeScript-based codebase