Glydways

The Best Way to Move People: high-capacity, on-demand, and affordable mobility

Distributed Systems & Reliability Engineer

DevOps Engineer | Full Time | Remote | Team 51-200 | Since 2016 | H1B Sponsor

Location

United States

Posted

87 days ago

Salary

Not specified

Bachelor Degree | English | Ansible | Distributed Systems | Kubernetes | Go

Job Description

  • Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
  • Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
  • Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
  • Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
  • Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
  • Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
  • Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
  • Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
  • Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
  • Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.

Job Requirements

  • Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
  • Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections), and of distributed databases with internal or external message queues.
  • Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
  • Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
  • Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
  • Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
  • Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
  • Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
  • Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
  • Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.

Benefits

  • Equal employment opportunities
  • Prohibits discrimination and harassment
