Documents services in degraded state
Incident Report for Faithlife

We apologize for the Sync-related outage lasting from 7:40 PM PDT September 21st to 8:50 PM PDT September 22nd. This outage was caused by a sequence of human errors, compounded by inadequate safety checks in our automation.

We appreciate your patience and understanding.

Scope of outage

Multiple Sync services were unavailable starting at 7:40 PM PDT September 21st, partially restored after 4:25 PM PDT September 22nd, and fully restored at 8:50 PM PDT September 22nd.

Impacted services: Bibliography, Clippings, Favorites, Guides, Handouts, Hidden Resources, History, Layouts, Library Metadata, Personal Books, Preferences, Resource Collections, Sentence Diagrams, and Word Lists.

Background

This maintenance continued our Sync-related database separation work, intended to improve disaster recovery, shorten backup times, and reduce the hardware required to recover from a physical failure.

Long version

These services were previously part of a 4TB MS SQL Server database in our Bellingham datacenter. Over the past year, Development and Operations have worked together to migrate to MySQL, breaking this database up into smaller MySQL instances in our Seattle datacenter.

Earlier this year, databases for these services were split off from the 4TB database onto a 1.1TB server, mysql03. Our MySQL convention includes a read-only replication slave in our Bellingham datacenter, but because of Bellingham's lower capacity (27% of Seattle's) and mysql03's size, that replication slave had not yet been set up.

Our goal for the maintenance, started at 2:15 PM PDT September 21st, was to split the Layouts database (400GB of mysql03's 1.1TB) onto mysql08, then set up Bellingham replication slaves for mysql03 and mysql08.

What happened

Intended events

In order to migrate Layouts (400GB) to its own server with minimal downtime, we needed to make mysql08 a replication slave of mysql03. This process, using in-house scripts, had been performed a number of times without issue. Our intended three-hour maintenance window comprised these steps:

  1. Disable scheduled mysql backup on mysql03
  2. Prepare remote slave (mysql08), via remote SSH: Disable MySQL, remove existing /var/mysql contents, and create /var/tmp/mysql
  3. Take a backup of master (mysql03) and stream, via SSH: $backup_command | ssh **mysql08** 'sudo xbstream -x -C /var/tmp/mysql/'
  4. Finalize restore, via remote SSH (to mysql08): remove /var/mysql, decompress backup, apply log file, move /var/tmp/mysql to /var/mysql, and start mysql
  5. Create slave replication user on master (mysql03)
  6. Configure replication on slave (mysql08), via remote SSH
  7. Disable Layouts API service and switch connection strings to mysql08
  8. Stop replication on mysql08, unset global mysql08 read-only
  9. Enable Layouts API service, using mysql08 backend
  10. Delete mysql03:Layouts and everything but Layouts on mysql08
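A minimal dry-run sketch of step 3's streaming pipeline, assuming xtrabackup's xbstream streaming mode. The steps above abbreviate the master-side flags as $backup_command, so the flags shown here are illustrative, not our exact tooling:

```shell
# Build the pipeline that runs on the master (mysql03) and streams a
# backup into the slave's staging directory over SSH. The xtrabackup
# flags are illustrative; the real flags are abbreviated above as
# $backup_command.
master_backup_pipeline() {
  local slave=$1
  echo "xtrabackup --backup --stream=xbstream | ssh $slave 'sudo xbstream -x -C /var/tmp/mysql/'"
}

# Print the command rather than executing it (a dry run).
master_backup_pipeline mysql08
```

Printing the pipeline before running it also makes the destination host visible, which matters given what happened next.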

Actual events

Steps 1, 2, and 3 completed successfully, but step 4 was accidentally aborted while detaching from the screen session. Detaching from screen requires Ctrl+A followed by D, but Ctrl+C was hit out of habit. This interrupted the mysql03-to-mysql08 backup process during the final stage of step 4, the --apply-log run.

At 7:40 PM PDT, we tried to complete step 4 but accidentally ran the restore finalization on mysql03 instead of mysql08. The script deleted /var/mysql on the master, causing the outage. The mistake was realized immediately after the finalization script executed.

We attempted to fail over to mysql08, but its setup was not complete enough to start serving traffic. Its xtrabackup_logfile was corrupted, which prevented us from manually completing mysql08's restore, leaving only a partially restored, three-hour-old copy of mysql03.

While gathering more information about the outage, we noticed mysql03 still had 1.1TB of disk allocated. Upon further investigation, the MySQL process was still running, and /proc/$MYSQL_PID/fd/ contained 176 open files totaling 1.1TB. This discovery turned out to be our saving grace.
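The reason those file descriptors were recoverable: on Linux, deleting a file only unlinks its name; as long as a process holds the file open, its contents remain readable through /proc/$PID/fd. A minimal sketch of the trick, using a throwaway temp file rather than the real database files:

```shell
tmp=$(mktemp)
echo "row data" > "$tmp"
exec 3< "$tmp"                     # hold the file open, as mysqld did
rm "$tmp"                          # unlink the name; the inode survives
cp "/proc/$$/fd/3" /tmp/recovered  # copy the still-open file's contents
exec 3<&-                          # release the descriptor
cat /tmp/recovered
```

This is why keeping the MySQL process running was critical: killing it would have closed the descriptors and released the unlinked data for good.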

As we approached 11 PM PDT, we launched three recovery options to copy overnight.

  1. rsync of mysql03:/proc/$MYSQL_PID/fd/ to a backup file server
  2. Restoring a September 21st 2 PM PDT backup of mysql03 to a new MySQL server, mysql09
  3. rsync of mysql08:/var/tmp/mysql, the partially restored backup from mysql03, to a backup file server

The team arrived at the office around 8 AM PDT the next day and evaluated our options.

  1. proc rsync completed, giving us a backup of mysql03's file descriptors, verified with md5sum
  2. The mysql09 restore failed overnight; we kicked off another restore attempt at 8:15 AM PDT
  3. mysql08 partial restore rsync completed

Throughout the day, the team proceeded to use data from these three recovery options to put together mysql10. Alongside this work, another team worked through system logs to determine impacted users and run data integrity checks. They had built scripts to iterate over all impacted users and verify data consistency against other endpoints.

The 176 file descriptors from mysql03 all had integer names, per file descriptor mechanics and InnoDB's innodb_file_per_table file structure. The team compared the sizes of mysql03's file descriptors with the sizes of mysql08's partially restored databases to manually map and rename the 176 integer-named files to table files. The primary databases were easy to identify, and many of the smaller files had older timestamps, suggesting mostly-static databases or schema files.
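The size-matching step can be sketched as follows, assuming GNU stat. Directory names are hypothetical: recovered/ holds the integer-named copies from mysql03's file descriptors, and reference/ the partially restored tree from mysql08:

```shell
# For each integer-named recovered file, find the reference table file
# with the same byte size and propose a rename mapping.
map_by_size() {
  local recovered=$1 reference=$2
  local f r size
  for f in "$recovered"/*; do
    size=$(stat -c %s "$f")
    for r in "$reference"/*; do
      if [ "$(stat -c %s "$r")" -eq "$size" ]; then
        echo "$(basename "$f") -> $(basename "$r")"
      fi
    done
  done
}
```

Identical sizes across two tables would make this mapping ambiguous, which is why file timestamps served as a second signal and the results were confirmed by hand.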

After a few hours of work, we had mapped all 84 recently changed .ibd files from mysql03 and verified their integrity with innochecksum. Once verified, we copied the recovered files over to mysql10.

We then copied general MySQL files and .frm schema definitions from the 2 PM restore on mysql09 to mysql10 to fill in static files we hadn't recovered from file descriptors. ZFS snapshots of mysql10 were taken frequently, providing rollback points if any part of this process corrupted the restored files.

At 4:05 PM PDT, we launched MySQL and were greeted by a few minor warnings, but MySQL had successfully started!

At 4:25 PM PDT, starting with the lowest throughput service, we ran consistency checks and brought services slowly back online.

At 8:50 PM PDT, we brought the final service back online.

We ended up using file descriptors from the running MySQL process on mysql03, file sizes from mysql08, and unchanged files from mysql09's restore to produce a patchwork mysql10 with no data loss.

Future steps

This outage was caused by human error, compounded by poor environment verification in our automation and the lack of a MySQL replication slave.

To prevent a recurrence, our automation will abort if the environment doesn't match the script's assumptions, rather than making the environment match those assumptions. We will also improve automation to verify whether a server is the master or a slave and abort on the wrong server, and make MySQL backups more frequent and more granular. Replication slaves will be created in Seattle when Bellingham can't accommodate the resource requirements.
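As one concrete form of such a guard, a destructive step can refuse to run unless it is on the host it expects. A minimal sketch, with a function name of our choosing rather than anything from our actual tooling:

```shell
# Abort a destructive automation step unless it is running on the
# expected host. A guard like this on step 4's finalization would have
# refused to run on mysql03.
require_host() {
  local expected=$1
  local actual
  actual=$(hostname -s)
  if [ "$actual" != "$expected" ]; then
    echo "aborting: expected host '$expected', running on '$actual'" >&2
    return 1
  fi
}

# Example: require_host mysql08 && finalize_restore
```

The key property is that the script fails closed: a mismatch stops the run instead of being "fixed" by operating on whatever host the shell happens to be on.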

Posted Sep 27, 2016 - 11:17 PDT

Resolved
This incident has been resolved.
Posted Sep 22, 2016 - 21:16 PDT
Monitoring
All services are back online. Data integrity has been verified for every user who made sync requests during the outage.
Posted Sep 22, 2016 - 21:06 PDT
Update
Most services are back online, verifying the final databases.
Posted Sep 22, 2016 - 18:33 PDT
Update
Preliminary data verification complete, bringing services slowly online.
Posted Sep 22, 2016 - 16:37 PDT
Update
Databases have been recovered. We're verifying data integrity before bringing services back online.
Posted Sep 22, 2016 - 16:18 PDT
Update
Recovered data integrity is being verified on a new database. This is the primary recovery option which will prevent data loss if all goes as planned.
Posted Sep 22, 2016 - 13:56 PDT
Update
Multiple recovery options are being evaluated after data transfers completed overnight.
Posted Sep 22, 2016 - 08:51 PDT
Update
Impacted services are being kept offline overnight to minimize possible data loss while multiple recovery options are evaluated.
Posted Sep 21, 2016 - 22:36 PDT
Update
We're in the process of restoring impacted database tables from a ~30 minute old backup and investigating replaying possible missing content.
Posted Sep 21, 2016 - 21:11 PDT
Identified
Guides, Favorites, Handouts, and other sync services are currently in a degraded state after a failed mysql database migration
Posted Sep 21, 2016 - 20:17 PDT