Two days ago a strange phenomenon caused several of my boxes at 012 to fail. I am in Cyprus, so I had to rely on the on-location staff, and on friends who went to the server farm to check whether anything could be done. It later turned out that one power supply had failed completely and had to be replaced. Another machine had a corrupt FS. Things were made worse by the fact that Linux does not yet “love” SATA drives: it showed the partition as mounted RW (Read/Write) while it really was RO (Read Only). I confirmed this strange behavior with colleagues on channel #debian on freenode, and once that psychological hurdle had passed, I called the on-location staff and walked them through a reboot of the box, including how to properly answer the questions asked by fsck.ext3.
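For anyone who wants to catch the "reported RW, actually RO" situation without relying on what the mount table says, here is a minimal sketch. It is not what I ran at the time. It assumes a Linux box with a `/proc/mounts`-style table, and the function names `mount_options` and `really_writable` are my own invention for illustration. The idea is simply to compare the reported options with an actual write attempt.

```python
import os
import tempfile


def mount_options(mount_point, mounts_text):
    """Return the options a /proc/mounts-style table reports for mount_point.

    mounts_text is the raw text of the table, for example the contents of
    /proc/mounts. Returns None if the mount point is not listed.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides an earlier one
    return options


def really_writable(directory):
    """Try to write and sync a small temporary file inside directory.

    Returns False on any OS-level error (read-only filesystem, missing
    directory, permission denied), True if the write went through.
    """
    try:
        with tempfile.TemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())
        return True
    except OSError:
        return False
```

If `mount_options("/", open("/proc/mounts").read())` contains `rw` but `really_writable("/")` returns False, the table and reality disagree, which is the confusing state I ran into.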
Yet another machine could not see its SATA drive. I asked the on-location staff to inspect the cables, and they found nothing wrong. My friend Milez, on the other hand, found that the plastic enclosure of the SATA power connector was broken! Sabotage? Maybe… and maybe the on-location staff broke it while removing and re-inserting it… But the bigger question remains: why did all this happen in the first place?
Some more background: I was using kernel 2.6.9 up until two weeks ago. I know it’s old, and 2.6.13 stable has been out for a few weeks, but I am using Debian, and I like using the debianized kernel source packages, which I then modify myself. Anyway, version 2.6.9 had serious issues with certain sensor chips on my motherboards, and I could not read motherboard temperatures or voltages. It was not until early this morning that I realized this was fixed in 2.6.11 and later versions.
I mainly make two routine modifications to the kernel: 1) enabling various grsecurity & PaX features, and 2) manually patching in a newer sk98lin kernel module for the SysKonnect Marvell Yukon gigabit onboard ethernet adapter on my Intel D915GUX motherboards.
So today I installed lm-sensors again, and hurray, it now properly detects my sensors and shows temperatures and voltages, including ALARM signals when they happen. I also installed sensord and librrd (to plot graphs). I am now investigating simple ways to generate such plots and have them update every few minutes. If I had had such graphs before, I might have been able to point a blaming finger at 012 for poor electricity, or for unusually high temperatures in the farm. It would then have been much easier for me to ask for compensation for the damages incurred. So remember, boys & girls: saving a log of temperatures and voltages is important! It’s not just a fun tool. It may not save you the headache (disasters happen anyway), but at least the compensation will help a little.
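As a starting point for such a log, here is a minimal sketch that writes readings to a plain CSV file instead of an RRD database (a deliberate simplification, not the sensord/librrd setup mentioned above). It assumes the `sensors` command from lm-sensors is on the PATH and prints lines shaped like `CPU Temp:  +45.0°C  (high = +80.0°C)  ALARM`. The exact output format varies between chips and lm-sensors versions, so the regular expression is an assumption that may need adjusting. The function names and the CSV column layout are hypothetical.

```python
import csv
import re
import subprocess
import time

# Assumed line shape: "<label>:  <signed number> <unit> ...", e.g.
#   "VCore:  +1.20 V  (min = +0.00 V, max = +1.74 V)"
#   "CPU Temp:  +45.0°C  (high = +80.0°C)  ALARM"
READING = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<value>[+-]?\d+(?:\.\d+)?)\s*(?P<unit>°C|V|RPM)"
)


def parse_sensors_output(text):
    """Extract label, value, unit and alarm flag from `sensors`-style text."""
    readings = []
    for line in text.splitlines():
        match = READING.match(line)
        if match is None:
            continue  # chip names, adapter lines, blank lines
        readings.append({
            "label": match.group("label").strip(),
            "value": float(match.group("value")),
            "unit": match.group("unit"),
            "alarm": "ALARM" in line,
        })
    return readings


def append_readings(path, readings, timestamp=None):
    """Append one CSV row per reading: timestamp, label, value, unit, alarm."""
    if timestamp is None:
        timestamp = int(time.time())
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for reading in readings:
            writer.writerow([
                timestamp,
                reading["label"],
                reading["value"],
                reading["unit"],
                int(reading["alarm"]),
            ])


def log_once(path):
    """Run `sensors` once and append whatever it reports to the CSV at path."""
    result = subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    )
    append_readings(path, parse_sensors_output(result.stdout))
```

Calling `log_once` from a cron job every few minutes would build up exactly the kind of timestamped history that can later be plotted, or shown to a hosting provider.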