This is a short post to cover the basics of the events of the last 72 hours. Once I've caught up on some sleep I'll go over this posting and prepare another if there's anything I missed.
As noted in the previous posting, we took downtime on 5th July to allow us to change out a failed hard disk. The new disk went in fine. In attempting to sync data from the remaining old drive we discovered it too had hard errors, which meant we could no longer trust it to hold data. We decided to copy the data one logical volume at a time from the old drive onto the new so we could locate the problem area. All data transferred successfully except for a single test volume that is only used by developers. Service was restored on the morning of Monday 6th July.
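For the curious, the volume-by-volume copy works along the lines of the sketch below: a plain block copy of each logical volume in turn, where any copy that aborts on a read error tells us which volume contains the damaged area. The volume group and volume names are placeholders rather than our actual layout.

```python
#!/usr/bin/env python3
"""Rough sketch of the volume-by-volume copy used to locate the bad area.
Volume group and LV names are made up for illustration; needs to run as root."""
import subprocess

# Hypothetical source/target volume groups and the LVs to move across.
SOURCE_VG = "vg_old"
TARGET_VG = "vg_new"
VOLUMES = ["root", "var", "mailstore", "devtest"]

for lv in VOLUMES:
    src = f"/dev/{SOURCE_VG}/{lv}"
    dst = f"/dev/{TARGET_VG}/{lv}"
    # A plain dd aborts on the first unreadable sector, so a non-zero
    # exit status tells us which logical volume holds the damaged area.
    result = subprocess.run(
        ["dd", f"if={src}", f"of={dst}", "bs=1M"],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print(f"{lv}: copied cleanly")
    else:
        print(f"{lv}: read errors, problem area is in this volume")
        print(result.stderr.strip())
```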
Our plan was to obtain a further new drive and, come Sunday, swap out the last old drive and mirror the new drives to provide resilience.
Roll forward to Saturday. Somewhere just after midday the new disk started suffering from unrecoverable errors. Such a failure in a new drive is rare but not unheard of, sitting at the early-failure end of the bathtub curve. At this point we were faced with the problem of having no reliable disk in place to copy data onto for recovery purposes. We made various attempts to force remapping of the damaged areas, but had to resort to taking the primary machine down whilst we went into recovery mode. The secondary server went into a special mode where incoming mail was accepted without the more stringent acceptance rules it usually enforces, so no inbound email was lost from that point forward.
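For those wondering what "forcing remapping" means: modern drives keep spare sectors, and overwriting an unreadable sector usually prompts the firmware to reallocate it to one of the spares. The sketch below shows the general idea only; the device name and sector range are placeholders, and overwriting sectors is obviously a last resort on a disk whose data has already been copied elsewhere.

```python
#!/usr/bin/env python3
"""Sketch of forcing a drive to remap bad sectors by overwriting them.
Device and sector range are placeholders; writing zeroes destroys the
data in those sectors, so this is only done once rescue copies exist."""
import os

DEVICE = "/dev/sdb"                          # placeholder for the failing disk
SECTOR_SIZE = 512
SCAN_START, SCAN_END = 1_000_000, 1_001_000  # placeholder LBA range

fd = os.open(DEVICE, os.O_RDWR)
try:
    for lba in range(SCAN_START, SCAN_END):
        os.lseek(fd, lba * SECTOR_SIZE, os.SEEK_SET)
        try:
            os.read(fd, SECTOR_SIZE)
        except OSError:
            # Unreadable sector: writing over it prompts the firmware to
            # reallocate the LBA to one of its spare sectors.
            os.lseek(fd, lba * SECTOR_SIZE, os.SEEK_SET)
            os.write(fd, b"\x00" * SECTOR_SIZE)
            print(f"rewrote sector {lba}")
    os.fsync(fd)
finally:
    os.close(fd)
```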
We started a two-fold approach to data recovery. Firstly, we created another logical volume on the damaged disk and transferred what data we could from the damaged logical volume. This copied the OS and program data correctly, but the mail data suffered a number of problems: partly because of the damaged source area, and partly because having to use the same disk as the target triggered a cascade failure there as well.
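"Transferred what data we could" means a block-by-block copy that skips over the unreadable regions rather than giving up, much as tools like ddrescue do. A rough sketch of the idea, with placeholder device paths:

```python
#!/usr/bin/env python3
"""Sketch of salvaging readable data from a damaged logical volume.
Paths are placeholders; the real rescue used comparable low-level tools."""
import os

SOURCE = "/dev/vg_new/mailstore"         # damaged LV (placeholder)
TARGET = "/dev/vg_new/mailstore_rescue"  # freshly created LV (placeholder)
CHUNK = 64 * 1024                        # copy in 64 KiB chunks

src = os.open(SOURCE, os.O_RDONLY)
dst = os.open(TARGET, os.O_WRONLY)
offset, bad_chunks = 0, 0
try:
    while True:
        os.lseek(src, offset, os.SEEK_SET)
        try:
            data = os.read(src, CHUNK)
        except OSError:
            # Unreadable region: write zero padding so everything that
            # follows stays at the same offset in the rescued copy.
            data = b"\x00" * CHUNK
            bad_chunks += 1
        if not data:
            break
        os.lseek(dst, offset, os.SEEK_SET)
        os.write(dst, data)
        offset += len(data)
finally:
    os.close(src)
    os.close(dst)
print(f"copied {offset} bytes, {bad_chunks} unreadable chunk(s) padded")
```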
The second part of the recovery was to accelerate the transfer of data from a further pair of disks in the machine out to a second server. We had already started this, but allocated more bandwidth in light of the ongoing problem. Once these disks were freed we prepared them as a fresh mirror pair, but initially put only a single disk online in an attempt to restore service sooner.
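Running a mirror with only one of its two disks present is a standard trick: the array is created in a degraded state with the second slot marked as missing, and the other disk is added later. A sketch of that, assuming Linux software RAID (mdadm) purely for illustration and using placeholder device names:

```python
#!/usr/bin/env python3
"""Sketch of creating a one-disk ("degraded") mirror so service can be
restored before the second disk is ready. Assumes Linux md RAID purely
for illustration; device names are placeholders."""
import subprocess

ARRAY = "/dev/md0"
FIRST_DISK = "/dev/sdc1"   # the disk put online straight away

# "missing" reserves the slot for the second mirror member, so the array
# starts degraded but usable.
subprocess.run(
    ["mdadm", "--create", ARRAY, "--level=1", "--raid-devices=2",
     FIRST_DISK, "missing"],
    check=True,
)
print(f"{ARRAY} running degraded on {FIRST_DISK}")
```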
Once both of the above were complete we copied the rescued volume onto the prepared disk of the new mirror pair, and then brought the primary mail server online. This occurred some time after 01:00 on 12th July, and the mail spooled on the secondary machine was given priority for delivery to inboxes.
We then worked on transferring the other logical volumes from the new dying disk to the new working disk and bringing the various other systems online. This completed mid-morning on 12th July. The next step was to get the system into a resilient state. This is a slow process whilst the disks are in use, and it took until midday on 13th July.
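The slow part is the mirror resynchronising in the background once the second disk is added, whilst the array stays in live use. Continuing the illustrative mdadm sketch above, again with placeholder names:

```python
#!/usr/bin/env python3
"""Sketch of adding the second disk to the degraded mirror and watching
the resync. Continues the illustrative mdadm example; names are placeholders."""
import subprocess
import time

ARRAY = "/dev/md0"
SECOND_DISK = "/dev/sdd1"

# Add the second member; the kernel then copies every in-use block across
# in the background while the array stays online.
subprocess.run(["mdadm", "--manage", ARRAY, "--add", SECOND_DISK], check=True)

# /proc/mdstat reports recovery progress, e.g. "recovery = 42.7%".
while True:
    with open("/proc/mdstat") as f:
        status = f.read()
    if "recovery" not in status:
        print("resync complete")
        break
    print(status)
    time.sleep(60)
```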
Whilst the disks were syncing we performed system checks and monitoring to confirm that we had as much of the system back as we could at that point, without impacting the mirroring.
Once the mirroring had completed, we brought the new dying disk back online in a way that would not interfere with normal operations, and performed a number of searches to determine which emails were still recoverable but might not have been transferred in the initial rescue attempt. A fair number of these emails were identified and restored; these may appear as unread again, or appear to have been undeleted.
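Those searches were essentially a comparison between a read-only view of the dying disk and the live mail store: any message file present on the old copy but missing from the rescued one is a candidate for restoration. A simplified sketch, assuming one-file-per-message (Maildir-style) storage and placeholder paths:

```python
#!/usr/bin/env python3
"""Sketch of locating messages that exist on the (read-only mounted) dying
disk but are missing from the rescued mail store. Assumes one-file-per-
message (Maildir-style) storage; both paths are placeholders."""
import os
import shutil

OLD_ROOT = "/mnt/dying-disk/mailstore"   # read-only mount of the dying disk
NEW_ROOT = "/srv/mailstore"              # live, rescued mail store

def message_files(root):
    """Return message paths relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found

missing = message_files(OLD_ROOT) - message_files(NEW_ROOT)
restored = 0
for rel in sorted(missing):
    src = os.path.join(OLD_ROOT, rel)
    dst = os.path.join(NEW_ROOT, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        shutil.copy2(src, dst)   # restored messages may show up as unread again
        restored += 1
    except OSError:
        print(f"could not read {rel} from the dying disk")
print(f"restored {restored} of {len(missing)} missing message file(s)")
```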
There are a few damaged directories on the new dying disk, but most of these have been recovered from the old dying disk. There are only five or so directories where we cannot yet determine the extent of the loss; apart from those, we believe only some two dozen emails failed to copy. We will be performing further analysis of the disks in an attempt to recover all we can.