Railway - We're investigating a single host failure in the us-east4 region. Some deployments and volumes in the region may be inaccessible. – Incident details

We're investigating a single host failure in the us-east4 region. Some deployments and volumes in the region may be inaccessible.

Resolved
Partial outage
Started about 2 months ago
Lasted about 1 hour

Affected

Deployments

Partial outage from 9:05 PM to 10:08 PM

US East (us-east4 / Virginia, USA)

Partial outage from 9:05 PM to 10:08 PM

Updates
  • Resolved

    This incident has been resolved. We found a bug in our scheduling algorithm that caused an instance in the US East region to schedule replica workloads on non-unique hosts, resulting in oversubscription. To mitigate this, we have provisioned extra capacity in our non-primary regions (US East, Europe, and Southeast Asia).

    Additionally, we will keep investigating the scheduling issue to prevent a recurrence.

  • Monitoring

    The impacted host has been successfully restarted and all affected workloads are back online. We are investigating the root cause of the failure and monitoring the host closely.

  • Identified

    We've begun a failover process to recover the impacted host.

  • Investigating
    We're investigating a single host failure in the us-east4 region. Some deployments and volumes in the region may be inaccessible.
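
For readers unfamiliar with the failure mode named in the Resolved update (replicas scheduled on non-unique hosts, leading to oversubscription), here is a minimal illustrative sketch of replica placement that enforces distinct hosts and checks capacity. It is not Railway's scheduler. The `Host` class, the `place_replicas` function, and the capacity units are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical host with an abstract, unitless capacity."""
    name: str
    capacity: int
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


def place_replicas(hosts: list[Host], replicas: int, demand: int) -> list[str]:
    """Place each replica on a different host that has room for it.

    Raises RuntimeError instead of doubling up on a host, so a shortage
    of hosts surfaces as a scheduling failure rather than oversubscription.
    """
    # Only hosts with enough free capacity are candidates; prefer the emptiest.
    candidates = sorted(
        (h for h in hosts if h.free() >= demand),
        key=lambda h: h.free(),
        reverse=True,
    )
    if len(candidates) < replicas:
        raise RuntimeError(
            f"need {replicas} distinct hosts with {demand} free, "
            f"found {len(candidates)}"
        )
    # Slicing the candidate list guarantees every chosen host is unique.
    chosen = candidates[:replicas]
    for host in chosen:
        host.used += demand
    return [host.name for host in chosen]
```

In this sketch, a request that cannot be satisfied with distinct hosts fails loudly. A scheduler that instead reused a host for a second replica would lose redundancy and could exceed that host's capacity, which is the kind of outcome the update describes.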