The Best Way to Move People: high-capacity, on-demand, and affordable mobility

Distributed Systems & Reliability Engineer

DevOps Engineer, Full Time, Remote, Team 51-200, Since 2016, H1B Sponsor

Location

United States

Posted

87 days ago

Salary

Not specified

Bachelor Degree, English, Ansible, Distributed Systems, Kubernetes, Go

Job Description

• Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
• Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
• Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
• Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
• Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
• Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
• Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
• Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
• Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
• Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.

Job Requirements

Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections) and of distributed databases with internal or external message queues.
Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.

Benefits

Equal employment opportunities
Prohibits discrimination and harassment

