Jan 10 2011
 

After aptitude or a similar upgrade tool completes its job, new versions of files are installed into the filesystem. However, some running processes may still be using the old files (Linux file systems actually keep deleted files on disk, invisible in the directory tree, for as long as running processes have them open or mapped).

In case of a security update, old versions that are still running keep the system vulnerable even after the update completes, which is not good. A reboot is possible, but that’s an ugly solution.

Fortunately, tools exist to find processes that have deleted files mapped. One of these tools is checkrestart (from the debian-goodies package).

Just run checkrestart -p after an upgrade, and you will see what to restart to be safe, and even which init.d scripts to run with the restart argument.
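To get a feeling for what such a tool looks for, here is a rough sketch that scans /proc/PID/maps for mappings marked "(deleted)". This is not how checkrestart itself is implemented. The function name list_stale_procs and the list of directory prefixes are my own assumptions, picked to skip noise like deleted temp files.

```shell
#!/bin/sh
# list_stale_procs [PROC_ROOT]
# Hypothetical helper: prints "PID NAME" for every process that still maps
# a deleted file under a system directory. PROC_ROOT defaults to /proc.
list_stale_procs() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Skip processes whose maps we cannot read (not root, or already gone).
        [ -r "$dir/maps" ] || continue
        # The kernel appends " (deleted)" to the path of an unlinked file.
        if grep -Eq ' /(usr|lib|lib64|bin|sbin)/.* \(deleted\)$' "$dir/maps" 2>/dev/null; then
            pid="${dir##*/}"
            name=$(cat "$dir/comm" 2>/dev/null)
            printf '%s %s\n' "$pid" "${name:-?}"
        fi
    done
}
```

Run list_stale_procs as root to see all processes. It only tells you which processes are suspicious, not which init.d script restarts them, so checkrestart remains the more useful tool.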

P.S.
On a server with hundreds of running processes, checkrestart -p is terribly slow, likely because it makes a separate query to dpkg for each file mapped by each process. Perhaps it could be seriously accelerated if all names to query were gathered first, and then each name was queried only once. Any volunteers to do that? :)
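The "gather first, query once" idea could look roughly like the sketch below. It is only an illustration of the idea, not a patch for checkrestart: the function names are made up, the paths are collected from /proc/PID/maps (an assumption about where to get them), and the xargs options -r and -d are GNU extensions.

```shell
#!/bin/sh
# unique_deleted_paths [PROC_ROOT]
# Hypothetical helper: prints every deleted-but-still-mapped path exactly once.
unique_deleted_paths() {
    proc_root="${1:-/proc}"
    # A maps line is: address perms offset dev inode <padding> pathname
    cat "$proc_root"/[0-9]*/maps 2>/dev/null \
        | sed -nE 's|^[^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ +(/.*) \(deleted\)$|\1|p' \
        | grep -Ev '^/(dev/|SYSV|memfd:)' \
        | sort -u
}

# owning_packages [PROC_ROOT]
# Passes the de-duplicated names to dpkg in batches instead of one call per
# file per process. Paths unknown to dpkg just produce errors, hidden here.
owning_packages() {
    unique_deleted_paths "$@" | xargs -r -d '\n' dpkg -S 2>/dev/null
}
```

With hundreds of processes mapping the same few libraries, the number of dpkg lookups drops to the number of distinct files.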

 

Imagine you suffered a hard drive failure. Your system is still running, and you might not even have noticed a problem, but you found a message from mdadm in your mailbox informing you about a degraded array. You first tried to reboot, and all looked ok (after a cold restart the failed drive seemed to work again), but a bit later you got a new failure event.

So you purchased a new drive, shut down the computer, opened its chassis…

… and are at a loss: which of these 4 hard drives needs replacement? There is not much space in the chassis, and it’s hard or impossible to read what is written on the drives.

Here is a relatively easy way to physically locate the failed drive.

Disconnect the signal cable from one drive. Then boot the system from knoppix or another livecd. To make things faster, boot into the command line (in case of knoppix, type ‘knoppix 2’ at the boot prompt).

Important: don’t try a normal boot with a drive disconnected! In case you did not guess, this will result in an array with two failed components, and it won’t start, even after you reconnect the drive!

Once you have a command line, run mdadm -E on each of the remaining array components in turn. Note that the drive ordering (and so the device names) could have changed.

If you run mdadm -E on a component that is ok, the raid component table will list one of the components as failed. If you run mdadm -E on the failed array component, the raid component table will list all components as ‘active sync’, because its own superblock was last written before the failure was recorded.

So if you guessed right and disconnected the failed drive, then each of the remaining drives will have information about the failed component in its raid superblock. And if you disconnected a good drive, then one of the remaining drives will not have that information. In the latter case turn the computer off, reconnect the disconnected drive and repeat the attempt with another one.
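The check on the remaining components can be wrapped into a small loop. This is a minimal sketch: the function name examine_components, the MDADM override variable and the device names in the usage line are my assumptions. The grep keywords are just a loose filter for state-related lines, since the exact mdadm -E output depends on the superblock version.

```shell
#!/bin/sh
# MDADM can be overridden (e.g. for a dry run); by default the real mdadm is used.
MDADM="${MDADM:-mdadm}"

# examine_components DEV...
# Hypothetical helper: for each component, prints the superblock lines that
# mention a state, so a failed member is easy to spot.
examine_components() {
    for dev in "$@"; do
        printf '== %s ==\n' "$dev"
        # The failed drive may be unreadable, so report and keep going.
        "$MDADM" -E "$dev" 2>&1 | grep -iE 'state|active|faulty|removed|spare' \
            || printf '(no state lines found)\n'
    done
}
```

From the livecd, as root, something like examine_components /dev/sda1 /dev/sdb1 /dev/sdc1 (adjust to your component names). If every component reports a failed member, the drive you disconnected is the bad one.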

 

Those who worked with PC internals in pre-SATA times remember that PATA (or “IDE”) allowed connecting two drives to a single bus, as master and slave.

The SATA protocol is almost the same ATA that ran on the parallel bus, so master/slave is still supported at the protocol level. And some motherboards do have SATA connectors that correspond to the master and slave of the same controller.

Beware of using that! Especially if you use linux software raid.

Because sooner or later one of the two disks connected to the same controller will fail, and that will result in a bus reset affecting both master and slave. At that moment your raid array will lose not just the failed drive, but the second drive as well. Likely the entire array will break, and your system will hang and not boot until you do low-level recovery.

To find out if you are in danger, run

dmesg | grep -i 'sata link'

soon after boot (i.e. while boot messages are still in the log buffer). This is a safe output:

...
[    2.147767] ata5.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.147825] ata5.01: SATA link down (SStatus 0 SControl 300)
[    2.147984] ata6.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.148042] ata6.01: SATA link down (SStatus 0 SControl 300)
...

Here the ata5 and ata6 controllers have two connectors each, but only one connector per controller is used.

If you see that both ataN.00 and ataN.01 have link up, then it is better to move one of the drives to another SATA connector. Or at least double check that the drives connected to these lines are not in the same raid array.
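On a box with many ports it is easy to miss a pair by eye, so the check can be automated. A minimal sketch, assuming the kernel prints link messages in the ataN.MM form shown above; the function name shared_sata_ports is made up.

```shell
#!/bin/sh
# shared_sata_ports
# Hypothetical helper: reads dmesg output on stdin and prints every ataN
# that has more than one device with "SATA link up".
shared_sata_ports() {
    awk '
        /ata[0-9]+\.[0-9]+: SATA link up/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                    port = $i
                    sub(/\..*$/, "", port)   # "ata5.00:" becomes "ata5"
                    up[port]++
                }
            }
        }
        END {
            for (port in up)
                if (up[port] > 1) print port
        }
    ' | sort
}
```

Usage: dmesg | shared_sata_ports. Empty output means no controller has both master and slave links up.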

 

Here is a self-memo :)

To make the kernel find a hotplugged drive, ask it to rescan the SATA bus:

echo "- - -" > /sys/class/scsi_host/hostN/scan

To remove a hard drive from the system:

echo 1 > /sys/class/scsi_host/hostN/device/targetN:0:0/N:0:0:0/delete

But first make sure that the drive and all its partitions are not mounted, not part of a raid array, etc!

To find out the controller number for a particular drive (sdc in this example):

echo /sys/class/scsi_host/*/device/*/*/block/sdc

But remember that the numbers may be different on the next boot.
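Since the numbers change, it is handy to look up the host name right before using it. A small sketch, assuming that the resolved sysfs path of the drive contains a hostN component (which is the case on the systems I have seen); the function name scsi_host_of is made up.

```shell
#!/bin/sh
# scsi_host_of SYSFS_PATH
# Hypothetical helper: extracts the "hostN" component from a resolved
# sysfs path such as the output of: readlink -f /sys/block/sdc
scsi_host_of() {
    printf '%s\n' "$1" | sed -n 's|.*/\(host[0-9][0-9]*\)/.*|\1|p'
}
```

For example, host=$(scsi_host_of "$(readlink -f /sys/block/sdc)") and then echo "- - -" > /sys/class/scsi_host/$host/scan to rescan that controller.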

P.S.
If anyone knows a higher level tool for these and similar operations, please leave a link in a comment :) .

 

We are still running several servers under Xen. Choosing Xen was perhaps a mistake, but now over 20 domUs are there and moving to something better is hard.

Today during system maintenance, I found that one server was unable to start any domU, failing with the error

Error: Device 0 (vif) could not be connected. Hotplug scripts not working

Searching the net did not give an answer: people do ask about it, but answers either don’t exist, or clearly don’t match my situation.

So I had to search for a solution myself.

At some point I noticed that the server had several udevd processes running. That looked wrong to me. I attached to one of those with strace -p and found that it was hung in sendto() on a unix domain socket with multipathd at the other end.

Restarting multipathd restored normal server functionality.

So who would have thought that a hung multipathd could affect starting Xen domUs …

© 2011 yoush.homelinux.org