The internal disk of my primary computer had a near-death experience the other day.
(Recent heat waves have been hard to endure, both for people and for hardware devices.)
That got it kicked out of the RAID1 with a larger external disk, and the system kept on running, as expected.
Today I set out to bring them back in sync, so I rebooted the machine.
That's generally much easier than figuring out exactly how to get the failed devices released and removed, and then bringing them back up manually.
Not so today.
The boot loader would recognize the internal disk and decrypt it, but for some reason it would not find a filesystem where one was expected.
It would recognize the external disk too, but, surprisingly, not even try to decrypt it to find the updated filesystem.
I suppose last time I installed the boot loader, I only had the internal disk on or something.
Well, I could try to boot from the external disk, and that would probably look for both disks and get things back up.
So I told the boot loader to start from the external disk, and wasn't much luckier.
In these attempts, I was using the boot loaders installed on the disks themselves, loaded by SeaBIOS in the freedom-respecting BIOS I use.
Maybe I could pull things off by using the GRUB embedded in the BIOS itself?
It sure saw both disks, and I could cryptomount the internal disk all right. But that still wasn't enough to give me a boot filesystem.
As for the external disk, it failed to cryptomount, claiming my passphrase was wrong. Uh oh.
How could I have got myself into such a screwy situation?
It was my turn to go through a near-death experience, as a very scary theory formed in my mind.
Here are some technical details of my setup, so that the theory makes sense, hopefully.
Each of my disks is fully encrypted: they have different sizes, I intended to use the extra TBs on the external disk, and I didn't want any of my data to end up accidentally in plain text on either of them.
Each fully-encrypted disk is configured as an LVM physical volume in a separate volume group, so that they can be brought up independently without hassle (that was the theory anyway).
Each volume group has a logical volume for root and one for swap, and the corresponding devices form md RAID1 devices. Again, so that they can be brought up independently, in case either disk fails.
GRUB has had no trouble decrypting the devices, finding the LVMs, forming the MDs, and loading files from them. I'd tested that setup with either disk offline long ago, when I started using it.
But I also avoid having to enter the passphrases too many times by holding alternate key files in the initramfs, so that all I have to do is enter enough passphrases for GRUB to be able to load the kernel and the initrd from the encrypted devices, and then the booting-up system can decrypt and mount root all by itself.
Could it be that I had removed or otherwise lost the passphrase for the external disk, and it kept working only because I most often booted off of the internal disk, and it had the key file in initramfs?
If that was so, and the internal disk died, that would leave my redundant copy on the external disk perfectly intact, but entirely inaccessible. That was a scary moment.
So I plugged the external disk into another computer, and tested that theory.
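To make the worry concrete, here is a minimal Python sketch of that dependency. It is only an illustration under simplifying assumptions: the Disk class, its fields, and the accessible function are made up for this example, and it models nothing more than "GRUB needs a working passphrase to reach the initramfs, and the initramfs can then unlock any disk it holds a key file for".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    # Hypothetical model of one fully-encrypted disk.
    name: str
    passphrase_works: bool      # can I still type a passphrase that opens it?
    keyfile_in_initramfs: bool  # does the initramfs hold a key file for it?


def accessible(disks, failed=frozenset()):
    """Return the names of surviving disks that can still be unlocked."""
    alive = [d for d in disks if d.name not in failed]
    # The initramfs lives on the encrypted devices, so it is only reachable
    # if at least one surviving disk opens with a passphrase.
    initramfs_reachable = any(d.passphrase_works for d in alive)
    return {
        d.name
        for d in alive
        if d.passphrase_works
        or (initramfs_reachable and d.keyfile_in_initramfs)
    }


# The feared situation: the external disk's passphrase is gone, and it only
# ever came up thanks to the key file loaded via the internal disk.
feared = [
    Disk("internal", passphrase_works=True, keyfile_in_initramfs=True),
    Disk("external", passphrase_works=False, keyfile_in_initramfs=True),
]

# What one would hope for: both passphrases still work.
hoped = [
    Disk("internal", passphrase_works=True, keyfile_in_initramfs=True),
    Disk("external", passphrase_works=True, keyfile_in_initramfs=True),
]

print(accessible(feared))                       # both disks come up
print(accessible(feared, failed={"internal"}))  # nothing left to unlock
print(accessible(hoped, failed={"internal"}))   # the external disk survives
```

In the feared case everything looks healthy for as long as the internal disk lives, which is exactly why such a problem could go unnoticed for a long time.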
Phew! No problem, cryptsetup accepted my passphrase and brought the device up.
So I copied /boot to a USB flash device, and created an alternate grub.cfg offering two choices: load the current kernel and initrd files from the USB flash device itself, or cryptomount both devices and try to load the kernel and initrd from there.
I also tried to add some more modules to the GRUB installed on the external disk. And then I plugged the external disk and the USB flash drive back into the primary computer, and tried the solutions I'd prepared in reverse order.
GRUB from the disk didn't find the external disk, as before. I suppose it needs the device to be fully synced for grub-install to probe both devices properly and request all needed modules from mkimage.
The attempt to use the BIOS's copy of GRUB to cryptomount both devices failed. There's something about that GRUB that's unable to bring up the external disk's encrypted volume. Maybe I'm missing some module or something. To be investigated once the crisis is addressed.
The last attempt succeeded in loading the kernel and initrd from the flash drive, and so I'm typing this blog post while the devices resync. Phew!
So blong,