Leaf Switch Outage
Incident Report for Faithlife
Postmortem

First, we would like to apologize for the unavailability and severe degradation of the majority of Faithlife sites and services, starting at approximately 10:30 PM PDT on September 3rd and lasting until approximately 2:30 PM PDT on September 4th. Faithlife Operations is dedicated to giving our customers the most reliable and performant experience possible, and it frustrates us deeply when we do not deliver that experience.

Scope of Outage

Between approximately 10:30 PM PDT on September 3rd and approximately 2:30 PM PDT on September 4th, most Faithlife family sites and services were unavailable or severely degraded.

Background Information

We use a spine-and-leaf topology running OSPF. Each cabinet has a stacked MLAG pair of leaf switches. Each compute node in a cabinet has a bonded link aggregation to that leaf pair, with one physical link to each leaf switch. Traffic between compute nodes in different cabinets travels up through the local leaves to the spine, then down through the leaves in the destination cabinet.
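
For readers less familiar with link aggregation, here is a toy sketch of the behavior we rely on (an illustration only, not our actual configuration, and a deliberate simplification of LACP): flows are hashed across every member link that is reported as up, and when a member is marked down, flows should move to the surviving link.

    # Toy model of a two-link aggregation: traffic is hashed across all links
    # reported "up"; when a link is marked down, flows should move to the
    # remaining link. Illustrative only; not our production configuration.
    import hashlib

    class LinkAggregation:
        def __init__(self, links):
            self.links = dict(links)   # e.g. {"to-leaf-1": True, "to-leaf-2": True}

        def mark_down(self, link):
            self.links[link] = False

        def pick_link(self, flow_id):
            up = [name for name, ok in self.links.items() if ok]
            if not up:
                raise RuntimeError("no usable links in the aggregation")
            digest = hashlib.md5(flow_id.encode()).hexdigest()
            return up[int(digest, 16) % len(up)]

    bond = LinkAggregation({"to-leaf-1": True, "to-leaf-2": True})
    print(bond.pick_link("10.0.1.5->10.0.2.9"))   # hashed across both links
    bond.mark_down("to-leaf-1")                   # what should happen when a leaf fails
    print(bond.pick_link("10.0.1.5->10.0.2.9"))   # now always "to-leaf-2"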

Root Cause

At 10:00 PM on September 3rd, we began a planned upgrade on a pair of leaf switches. During the upgrade of the first switch, we lost all connectivity to it, and it stopped passing traffic. This should not have been a problem, since each compute node has a link to each switch in the pair. The compute nodes should have marked that link in the aggregation as “down” and stopped using it. However, our compute nodes continued sending traffic over that link. To verify that the compute nodes weren’t incorrectly seeing the downed link as “up”, we physically disconnected the links to the degraded switch. Unfortunately, the compute nodes still attempted to send traffic over that link.

At this point, we could have removed the bad link from the link aggregation on each compute node in the cabinet, but that would have required rebooting each node once to remove the link and again when we re-add it later. Rebooting the compute nodes would mean restarting more than 150 zones/VMs, which would itself require an enormous amount of orchestration and carry its own risks. Because of this, at approximately 4:15 AM, we opted to leave the compute nodes in their current state while we rebuilt the degraded switch, which we believed to be the faster and less risky option.
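
To illustrate the kind of check involved: on a Linux-style bonding driver, the kernel reports a per-member link state, and a script like the following can confirm whether the OS actually sees a member as down. Treat this as an assumption-laden sketch; our compute nodes’ operating system is not detailed here, and the path and file format below may not apply to it.

    # Sketch only: a Linux-style bonding driver exposes per-member state under
    # /proc/net/bonding/<bond>. This prints each member's MII status so you can
    # confirm whether the OS actually sees a link as down. The path and file
    # format are assumptions borrowed from Linux, not necessarily our nodes.
    def slave_states(bond="bond0"):
        states = {}
        slave = None
        with open(f"/proc/net/bonding/{bond}") as f:
            for line in f:
                if line.startswith("Slave Interface:"):
                    slave = line.split(":", 1)[1].strip()
                elif line.startswith("MII Status:") and slave is not None:
                    states[slave] = line.split(":", 1)[1].strip()
                    slave = None
        return states

    if __name__ == "__main__":
        for iface, status in slave_states().items():
            print(f"{iface}: {status}")   # expect "down" for the unplugged link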

After the switch was rebuilt and brought back online at approximately 5:15 AM, we continued to see erratic behavior from our applications. We eventually narrowed the problem down to a configuration error on the rebuilt switch: an incorrect default route. The correcting change had been made on the switch previously, but it was never properly saved to the backup configuration, so the rebuild brought the bad route back.
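
Going forward, a routine check that diffs each switch’s running configuration against its saved backup would surface this kind of drift before a rebuild. The sketch below is hypothetical; the file names and plain-text configuration format are assumptions, not our actual tooling.

    # Hypothetical sketch: flag lines that differ between a switch's saved
    # backup and its running configuration. A default route present in one
    # file but not the other would show up as a +/- line. File names and
    # format are illustrative assumptions.
    import difflib

    def config_drift(backup_path, running_path):
        with open(backup_path) as f:
            backup = f.read().splitlines()
        with open(running_path) as f:
            running = f.read().splitlines()
        return [
            line for line in difflib.unified_diff(
                backup, running, "backup", "running", lineterm=""
            )
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))
        ]

    if __name__ == "__main__":
        for change in config_drift("leaf1-backup.conf", "leaf1-running.conf"):
            print(change)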

Future Steps

First, we have since learned of a better method for upgrading the switch software. This method allows us to quickly roll back an upgrade should it go wrong; using it would have had the switch back online within minutes instead of hours.

Second, we will connect all core infrastructure gear to our Opengear infrastructure manager. Having the infrastructure manager configured and connected to our hardware would have saved us the time spent contacting remote hands for support. We will do this as soon as possible.

Finally, we will investigate why the compute nodes did not properly stop using the downed link in their link aggregations. On September 4th, a bug fix was pushed to our compute nodes’ operating system to address this issue. We will not perform any further leaf switch maintenance until we have fully tested the fix and confirmed that the issue is, in fact, resolved.
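
As part of that testing, one simple signal is a continuous reachability probe run from a test node while a single member of its link aggregation is taken down: if the fix works, traffic should keep flowing over the surviving link with few or no failed probes. A rough sketch follows; the host and port are placeholders, not real systems.

    # Rough failover test sketch: repeatedly attempt TCP connections to a peer
    # in another cabinet while one aggregation member is down, and count any
    # failures. Host and port below are placeholders.
    import socket
    import time

    def probe(host, port, timeout=1.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def watch(host, port, duration=120, interval=0.5):
        failures = 0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            if not probe(host, port):
                failures += 1
                print(time.strftime("%H:%M:%S"), "probe to", host, "failed")
            time.sleep(interval)
        print("done:", failures, "failed probes in", duration, "seconds")

    if __name__ == "__main__":
        watch("test-peer.example.net", 22)   # placeholder peer in another cabinet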

Timeline of Events

10:00 PM - Planned software upgrade on the first leaf switch began

10:30 PM - Lost contact with the switch; shortly afterward, sites and services started showing signs of degradation

11:10 PM - Logged tickets with switch vendor and remote hands in datacenter

11:30 PM - Began contact with remote hands to get a serial connection to the unresponsive switch

12:15 AM - Got console access to unresponsive switch

12:30 AM - Rebooted the switch after determining the upgrade did not succeed and was no longer proceeding

12:40 AM - Got on the phone with switch vendor to help us determine what went wrong

4:15 AM - Decided to rebuild the switch

5:15 AM - The switch was rebuilt and rejoined the topology

5:30 AM - Most sites and services became somewhat responsive, but timed out about 50% of the time

1:00 PM - Recognized the misconfiguration on the newly provisioned leaf switch: an incorrect default route

2:00 PM - Contacted datacenter remote hands to get us a reliable console session into the switch before making the needed configuration change

2:30 PM - Made the configuration change; sites and services fully restored

Posted Sep 08, 2015 - 16:24 PDT

Resolved
All sites and services should be fully functional at this point. We have found and fixed a configuration issue in the problematic leaf switch. Stay tuned for a formal post-mortem.
Posted Sep 04, 2015 - 14:52 PDT
Update
We're continuing to investigate residual issues that are causing some services to be intermittently unavailable.
Posted Sep 04, 2015 - 07:57 PDT
Monitoring
We've brought back the failed switch and are monitoring sites and services. Please let us know if you see any problems!
Posted Sep 04, 2015 - 05:26 PDT
Identified
One of our leaf switches died during an upgrade last night and we are working to bring it back online. Most sites and services will be down until we restore functionality to the failed switch.
Posted Sep 04, 2015 - 04:39 PDT