Glydways
The Best Way to Move People: High-capacity, on-demand, and affordable mobility
Distributed Systems & Reliability Engineer
Location: United States
Posted: 87 days ago
Salary: Not specified
Bachelor Degree, English, Ansible, Distributed Systems, Kubernetes, Go
Job Description
• Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
• Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
• Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
• Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
• Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
• Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
• Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
• Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
• Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
• Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.
Job Requirements
- Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
- Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections), plus experience with distributed databases that use internal or external message queues.
- Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
- Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
- Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
- Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
- Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
- Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
- Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
- Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.
Benefits
- Equal employment opportunities
- Prohibits discrimination and harassment