Since setting up my first home server over a decade ago, I’ve performed numerous operating system upgrades. Each one used to take me several hours, if not days, to complete and to verify that everything worked as expected. Over the years, whenever time permitted, I have worked hard to make several pieces of software work flawlessly together with as little manual maintenance as possible. Yet even though the deployment of my services has reached a high level of automation, I recently spent almost a whole day upgrading CentOS on one of my remote boxes.
According to my initial plan, this procedure shouldn’t have taken longer than 2-3 hours. I had simulated it in VirtualBox at home and knew exactly what to expect. Unfortunately, I didn’t follow the plan strictly. I deviated from it twice, and this cost me almost the whole day.
The first thing that went wrong had to do with testing my backup, a step that was not in my original plan. I keep my server data in encrypted containers on Amazon S3 using duplicity. Although I had restored data from the backup numerous times and was certain it worked, I had the strange idea of testing a restore to a virtual machine at home, just to make sure. For that purpose I happened to use a VM whose state had been saved several days earlier, meaning that its clock was way out of sync. That was a detail I hadn’t taken into account. So, when I tried to restore the data on that box, I got a glorious exception from duplicity informing me that it could not find any signatures in the S3 bucket. That message was really unhelpful, and I wasted many hours trying to figure out what was wrong with my backup or with duplicity, until I finally realized that the box’s wrong time had caused the exception. Once the time was corrected, duplicity worked like a charm.
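A pre-flight clock check along the following lines might catch a stale VM clock before a restore is attempted. This is only a minimal sketch, not something from the original procedure: it compares the local clock with the `Date` header of an HTTP response, and the function names, the 300-second tolerance, and the URL you pass in are all hypothetical choices made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Hypothetical tolerance; pick whatever your tools can live with.
MAX_SKEW_SECONDS = 300


def skew_seconds(date_header, now):
    """Return local time minus remote time, in seconds.

    date_header is an HTTP Date header value; now is an aware datetime.
    """
    remote = parsedate_to_datetime(date_header)
    if remote.tzinfo is None:
        # A "-0000" offset parses as naive; HTTP dates are always UTC.
        remote = remote.replace(tzinfo=timezone.utc)
    return (now - remote).total_seconds()


def fetch_date_header(url):
    """Fetch the Date header from any reachable HTTP(S) server."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=10) as response:
            return response.headers["Date"]
    except HTTPError as err:
        # Error responses (403, 405, ...) normally still carry a Date header.
        return err.headers["Date"]


def clock_is_sane(url):
    """True if the local clock is within MAX_SKEW_SECONDS of the server's."""
    header = fetch_date_header(url)
    now = datetime.now(timezone.utc)
    return abs(skew_seconds(header, now)) <= MAX_SKEW_SECONDS
```

Calling `clock_is_sane(...)` with the URL of a server you trust, and aborting the restore when it returns `False`, turns a confusing error from the backup tool into an explicit "fix the clock first" message.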
The second thing that went wrong had to do with pvGRUB, which is based on the GRUB 0.97 code and is used to boot Xen DomUs (guests). Due to some limitations of the VPS provider regarding pvGRUB, I have to use a very small partition that contains a GRUB configuration file, which eventually boots CentOS (root on LVM). This small partition was initially formatted as ext3. Again, I had a strange hunch to reformat that small partition as ext4! This would have had absolutely no benefit, but at that moment I just thought “why not?”. I was completely unaware that GRUB 0.97, and consequently pvGRUB, did not support the ext4 filesystem. To make things even worse, pvGRUB deceptively reported that it had recognized the partition as ext2, but could not locate the file I had configured it to load. Disaster. It was a few hours later, after going through several bug trackers and mailing lists, that I realized that pvGRUB could not actually read from ext4. I reformatted the small partition as ext3 and everything went smoothly from there.
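A small sanity check on the boot partition's filesystem could flag this kind of mismatch before a reboot. The sketch below parses text in the format of `/proc/mounts` and checks the filesystem type of one mount point. The `/boot` mount point, the function names, and the allowed set are assumptions made for illustration: the set contains only ext2 and ext3 because those are the types this account shows working or being reported, not because it is a complete list of what pvGRUB can read.

```python
# Assumption: only the filesystem types mentioned in this post as
# working with (or being reported by) the boot loader.
SUPPORTED_BY_BOOTLOADER = {"ext2", "ext3"}


def fstype_of(mounts_text, mount_point):
    """Return the filesystem type mounted at mount_point, or None.

    mounts_text uses the /proc/mounts layout:
    device, mount point, fs type, options, dump, pass.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            # Later entries shadow earlier ones at the same mount point.
            found = fields[2]
    return found


def boot_partition_ok(mounts_text, mount_point="/boot"):
    """True if the partition at mount_point uses a supported filesystem."""
    return fstype_of(mounts_text, mount_point) in SUPPORTED_BY_BOOTLOADER
```

On a Linux box, passing the contents of `/proc/mounts` to `boot_partition_ok` as the last step of an upgrade script would refuse to continue if the small partition had been reformatted to something the boot loader is not known to read.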
If I had stuck to the original plan, neither of the incidents above would have taken place. No matter how much I trust free software, deciding to experiment with it when I should be doing a specific job is admittedly one of the worst decisions possible. Regardless of how popular a piece of free software might be, it can still have serious bugs and limitations hidden in the last place you’d ever look. Lesson learned: stay on your path and strictly follow the plan.
The Lessons learned from a recent OS upgrade by George Notaras, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.