We apologize for the Sync-related outage lasting from 7:40 PM PDT September 21st to 8:50 PM PDT September 22nd. This outage was caused by a sequence of human errors, compounded by inadequate safety checks in our automation.
We appreciate your patience and understanding.
Multiple Sync services were unavailable starting at 7:40 PM PDT September 21st, partially restored after 4:25 PM PDT September 22nd, and fully restored at 8:50 PM PDT September 22nd.
Impacted services: Bibliography, Clippings, Favorites, Guides, Handouts, Hidden Resources, History, Layouts, Library Metadata, Personal Books, Preferences, Resource Collections, Sentence Diagrams, and Word Lists.
This maintenance continued our ongoing separation of Sync-related databases, intended to improve disaster recovery, shorten backup times, and reduce the hardware required to recover from a physical failure.
These services were previously part of a 4TB MS SQL Server database in our Bellingham datacenter. Over the past year, Development and Operations have worked together to migrate to MySQL and break this database up into smaller MySQL instances in our Seattle datacenter.
Earlier this year, databases for these services were split off from the 4TB database onto a 1.1TB server, mysql03. Our MySQL convention is to keep a read-only replication slave in our Bellingham datacenter, but because the Bellingham datacenter has only 27% of Seattle's capacity and mysql03 is large, that replication slave had not yet been set up.
Our goal for the maintenance, which started at 2:15 PM PDT September 21st, was to split the Layouts database (400GB of mysql03's 1.1TB) onto mysql08, then set up Bellingham replication slaves for mysql03 and mysql08.
In order to migrate Layouts (400GB) to its own server with minimal downtime, we needed to make mysql08 a replication slave of mysql03. This process, using in-house scripts, has been performed a number of times without issue. Our intended 3-hour maintenance window consisted of these steps:
$backup_command | ssh **mysql08** 'sudo xbstream -x -C /var/tmp/mysql/'
Steps 1, 2, and 3 completed successfully, but step 4 was accidentally aborted while trying to detach from the screen session. Detaching from screen requires `ctrl+a` followed by `d`, but `ctrl+c` was hit out of habit. This interrupted the mysql03-to-mysql08 backup process during the final stage of step 4, while `--apply-log` was executing.
At 7:40 PM PDT, we tried to complete step 4 and accidentally ran the restore finalization on mysql03 rather than mysql08, which deleted /var/mysql and caused the outage. We realized the mistake immediately after executing the finalization script.
We attempted to fail over to mysql08, but its setup was not complete enough to start serving traffic. The xtrabackup_logfile was corrupted, which prevented us from manually completing mysql08's restore, leaving only a partially restored, three-hour-old copy of mysql03.
While gathering more information about the outage, we noticed mysql03 still had 1.1TB of disk allocated. On further investigation, the MySQL process was still running, and /proc/$MYSQL_PID/fd/ contained 176 open files totaling 1.1TB. This discovery turned out to be our saving grace.
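The mechanism that made this possible can be shown with a small, self-contained sketch (file names here are invented; this demonstrates the general Linux behavior, not our recovery tooling):

```python
import os

# On Linux, deleting a file only removes its directory entry; the data stays
# on disk as long as some process holds the file open, and it can be read
# back through /proc/<pid>/fd/<n>.
path = "/tmp/demo_table.ibd"   # hypothetical file, standing in for a table file
f = open(path, "w")
f.write("table data")
f.flush()

os.unlink(path)                    # "delete" the file, as happened to /var/mysql
assert not os.path.exists(path)    # the name is gone from the filesystem...

# ...but the contents survive via the still-open descriptor:
recovered = open(f"/proc/self/fd/{f.fileno()}").read()
print(recovered)  # -> "table data"
f.close()
```

This is why the still-running mysqld process on mysql03 mattered so much: killing it would have released the last references to the deleted files.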
As we approached 11 PM PDT, we kicked off three recovery options to copy data overnight.
The team arrived at the office around 8 AM PDT the next day and evaluated our options.
Throughout the day, the team used data from these three recovery options to assemble mysql10. In parallel, another team worked through system logs to determine impacted users and run data integrity checks; they built scripts to iterate over all impacted users and verify data consistency across other endpoints.
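As a rough sketch of what such a consistency check does (all names and data below are hypothetical, not our actual scripts or schema):

```python
# Hypothetical per-user consistency check: compare each impacted user's
# recovered records against a reference copy from another endpoint, and
# flag any user whose data disagrees.
def find_inconsistent(users, recovered, reference):
    """Return the users whose recovered records differ from the reference."""
    return [u for u in users if recovered.get(u) != reference.get(u)]

# Invented example data:
users = ["alice", "bob", "carol"]
recovered = {"alice": {"layouts": 3}, "bob": {"layouts": 5}, "carol": {"layouts": 1}}
reference = {"alice": {"layouts": 3}, "bob": {"layouts": 4}, "carol": {"layouts": 1}}
print(find_inconsistent(users, recovered, reference))  # -> ['bob']
```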
The 176 files under /proc/$MYSQL_PID/fd/ had only integer names, per file-descriptor mechanics, rather than the table-file names used by the `innodb_file_per_table` layout. The team compared the sizes of mysql03's file descriptors with the sizes of mysql08's partially restored databases to manually map and rename the 176 integer-named files back to table files. The primary databases were easy to identify, and many of the smaller files had older timestamps, suggesting mostly-static databases or schema files.
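The size-matching step can be sketched as follows (a simplification: the real mapping also used timestamps and manual judgment, and all sizes and names here are invented):

```python
# Hypothetical sketch: pair each integer-named file recovered from
# /proc/<pid>/fd with a table file from the partial restore whose size
# matches exactly one candidate.
def map_by_size(recovered_sizes, reference_sizes):
    """recovered_sizes: {fd_number: bytes}; reference_sizes: {filename: bytes}.
    Returns {fd_number: filename} for unambiguous size matches."""
    by_size = {}
    for name, size in reference_sizes.items():
        by_size.setdefault(size, []).append(name)
    mapping = {}
    for fd, size in recovered_sizes.items():
        candidates = by_size.get(size, [])
        if len(candidates) == 1:      # accept only unambiguous matches
            mapping[fd] = candidates[0]
    return mapping

fds = {12: 4096, 37: 1_048_576, 90: 524_288}
tables = {"layouts.ibd": 1_048_576, "prefs.ibd": 4096, "history.ibd": 524_288}
print(map_by_size(fds, tables))
# -> {12: 'prefs.ibd', 37: 'layouts.ibd', 90: 'history.ibd'}
```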
After a few hours of work, we had mapped all 84 recently changed .ibd files from mysql03 and verified their integrity with innochecksum. Once these recovered .ibd files passed innochecksum, we copied them over to mysql10.
We then copied general MySQL files and .frm schema definitions from the 2 PM restore on mysql09 to mysql10 to fill in static files we hadn't recovered from file descriptors. ZFS snapshots of mysql10 were taken frequently, giving us a rollback point if any part of this process corrupted the restored files.
At 4:05 PM PDT, we launched MySQL and were greeted by a few minor warnings, but MySQL had successfully started!
At 4:25 PM PDT, starting with the lowest throughput service, we ran consistency checks and brought services slowly back online.
At 8:50 PM PDT, we brought the final service back online.
We ended up using file descriptors from the running MySQL process on mysql03, file sizes from mysql08, and unchanged files from mysql09's restore to produce a patchwork mysql10 with no data loss.
This outage was caused by human error combined with poor environment verification in our automation, compounded by the lack of a MySQL replication slave.
To prevent this from happening again, we will make automation abort when the environment doesn't match the script's assumptions, rather than making the environment match those assumptions. We will also improve automation to verify whether a server is a master or a slave and abort on the wrong server, and we will make MySQL backups more frequent and granular. Replication slaves will be created in Seattle when Bellingham can't accommodate the resource requirements.
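As an illustration of that first remediation (a sketch only; our actual scripts, hostnames, and role checks differ), a destructive step can verify its environment up front and refuse to run on a mismatch:

```python
# Hypothetical environment guard: abort a destructive automation step unless
# the host and replication role match the script's assumptions, instead of
# "fixing" the environment to match them.
class EnvironmentMismatch(RuntimeError):
    pass

def require_environment(actual_host, actual_role, expected_host, expected_role):
    """Raise when the script's assumptions about the environment don't hold."""
    if actual_host != expected_host or actual_role != expected_role:
        raise EnvironmentMismatch(
            f"expected {expected_role} {expected_host}, "
            f"got {actual_role} {actual_host}; aborting"
        )

# A restore finalization should only ever run on the slave being built:
require_environment("mysql08", "slave", "mysql08", "slave")       # ok, proceeds
try:
    require_environment("mysql03", "master", "mysql08", "slave")  # wrong box
except EnvironmentMismatch as e:
    print(e)
```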