Archive for 29 December 2008

I usually have a very long-running desktop session, with dozens of open windows spread over up to 12 virtual desktops. Together they hold the current state of the many different activities I switch between (or hope to return to). Losing such a session is always very unpleasant.

Unfortunately, KDE3, which is my current desktop, sometimes misbehaves. There are actually several different issues, but today I’d like to talk about one in particular.

The symptom: I press Alt+F2, paste a URL, press Enter, but no konqueror window appears. Or amarok does not start playback after a click on a music file. Or something similar. Running ps in a terminal window shows that the processes in question were actually started, but hung before doing anything useful.

A possible reason for this is a DCOP-level freeze. DCOP is the IPC protocol used by KDE2 and KDE3. Many KDE applications make DCOP queries at startup, to all registered peers in turn. The problem is that these queries are synchronous and have no timeout. If some peer does not respond (e.g. because it has hung), the query issuer hangs too. This is exactly what happens to all those processes that are visible in ps output but don’t work.

To check whether the system is in a DCOP-level freeze, try starting kdcop from the command line. If its window appears but kdcop then hangs before filling the tree, a DCOP-level freeze exists.
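The same kind of check can be approximated without a GUI. This is only a sketch and not something from my actual setup: it assumes the KDE3 dcop command-line client is installed (with no arguments it lists the registered peers, with a peer name it asks that peer for its objects) and that coreutils timeout is available. The function name and the DCOP_PROBE_TIMEOUT variable are made up for illustration.

```shell
# Sketch: probe every registered DCOP peer and print the ones that do not
# answer in time. Assumes KDE3's `dcop` client and coreutils `timeout`.
find_hung_dcop_peers() {
    # Hypothetical knob: seconds to wait for each peer (default 5).
    limit=${DCOP_PROBE_TIMEOUT:-5}

    # `dcop` with no arguments asks dcopserver for the list of peers.
    # It is wrapped in timeout too, in case even that call blocks.
    peers=$(timeout "$limit" dcop) || return 1

    for peer in $peers; do
        # `dcop PEER` queries the peer itself; a hung peer never replies.
        timeout "$limit" dcop "$peer" >/dev/null 2>&1
        # timeout exits with status 124 when it had to kill the command.
        if [ $? -eq 124 ]; then
            echo "$peer"
        fi
    done
}
```

Any name it prints is a candidate for the hung peer. A slow but healthy peer may show up as well, so treat the output as a hint rather than a verdict.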

To recover from this without losing the running session, one needs to locate the original hung process and kill it (or make it unhang if possible). Note the word original: by the time the freeze is detected, there are usually several processes already hung.

A possible method is as follows.

  • Run the following command:

    strace -s 128 -o /tmp/log kdcop
  • Once it freezes, kill it with Ctrl-C.
  • Run tail /tmp/log to check the last several lines of the strace log. It should contain a line that looks like
    write(9, "\0\0\0\20anonymous-23233\0\0\0\0\20konqueror-23214\0\0\0\0\27konqueror-mainwindow#1\0\0\0\0\7icon()\0\0\0\0\0"..., 82) = 82

    The second string (konqueror-23214 in this case) is the DCOP name of the non-responding peer. Usually it contains the PID.

  • Kill (or unhang) the process in question.
  • Repeat these steps until kdcop starts without freezing.

At this point the desktop should be recovered.
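The log-reading step can be scripted. The sketch below assumes that the last write() in the strace log is the DCOP call that never got its reply, and that peer names have the name-PID shape shown in the example line. Both function names are made up for illustration.

```shell
# Sketch: pull the non-responding DCOP peer out of an strace log.
# Assumes the last write() in the log is the call that never got a reply.
last_dcop_peer() {
    # $1: path to the strace log, e.g. /tmp/log
    # 1. take the last write() line
    # 2. blank out strace escapes (\0, \20, \n, ...) so they cannot stick to names
    # 3. list strings shaped like name-PID
    # 4. keep the second one: the first is the caller, the second the callee
    grep 'write(' "$1" | tail -n 1 |
        sed 's/\\[0-7]\{1,3\}/ /g; s/\\[abfnrtv]/ /g' |
        grep -o '[A-Za-z_][A-Za-z0-9_]*-[0-9][0-9]*' |
        sed -n '2p'
}

# The PID is whatever follows the last dash in the peer name.
peer_pid() {
    echo "${1##*-}"
}
```

Typical use would be something like peer=$(last_dcop_peer /tmp/log), followed by ps -p "$(peer_pid "$peer")" to look at the process before deciding to kill it.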

However, sometimes the original cause of the hang persists, and the freeze happens again very soon. In this case, use strace on the freezing process itself to find out what is going on.

I’ve seen situations where konqueror hung inside a connect() call on a unix domain socket with the string fam in its name. In this case, the process that is actually hung is gam_server, and that is what should be killed.

Another possible cause of a hang is an NFS mount from a no-longer-reachable server. In this case umount -l helps.

Today’s hero is a company named Port, with their can4linux drivers.

These people demonstrate wonderful knowledge of how to do mutual exclusion in the kernel. Here is an example:

    if(0 == atomic_read(&Can_isopen[minor])) {
        /* first time called, initialize hardware and global data */
        ...
    }
    ... /* much more code, without any locks held */
    atomic_inc(&Can_isopen[minor]);

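Looks like they are sure that using atomic here will help them catch the first call reliably. It will not: the read and the increment are two separate operations, so two concurrent open() calls can both see zero and both run the initialization.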

One more example:

    spin_lock(&waitflag_lock);
    for(i = 0; i < CAN_MAX_OPEN; i++) {
        if(CanWaitFlag[minor][i] == 0) break;
    }
    spin_unlock(&waitflag_lock);

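And now they think that i is a reliable index of a zero array element. It is not: the lock is released right after the search, so by the time i is used another open() may have found the same slot.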

Both examples are taken from their open() routine.

There are numerous other issues in the code: races, improper use of kernel infrastructure, etc. A very good example of out-of-community, no-review development.

Anyway, they are still in business, and do offer all that to their customers. Let’s wish them good luck :)

Libtool has a widely used feature called Libtool Convenience Libraries.

Today we discovered a not-very-expected side effect of using this feature: it breaks make -n.

This happens because when Makefile.am contains

    libxxx_la_LIBADD = dir/libyyy.la

the generated Makefile makes libxxx.la depend on dir/libyyy.la.

In a normal build, subdirectories are processed first and dir/libyyy.la is generated, so by the time make starts processing the libxxx.la goal, this dependency is already satisfied.

However, when running make -n, nothing is generated, so the dependency on dir/libyyy.la is not satisfied. Since the local Makefile has no information on how to build non-local goals, make aborts with a dependency error.

Btw, could something similar happen on parallel builds? Most likely make and/or autotools use some magic to work around that. And the dependency on dir/libyyy.la may be a part of that magic; I don’t know.

Anyway, if make -n functionality is needed, it can be restored by removing such dependencies from the generated Makefiles. Physically, these dependencies look like references to variables such as $(libxxx_la_DEPENDENCIES). These variables are used only for these dependencies, so it is safe to just remove the references.

So the command to restore make -n is

    find . -name Makefile | xargs sed -i 's/\$([^ ]*_la_DEPENDENCIES)//'
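A slightly more defensive variant of that command, as a sketch: it keeps a Makefile.orig copy next to every edited file, copes with spaces in directory names by using find -exec instead of xargs, and adds the g flag in case one line holds more than one such reference. The function name is made up for illustration.

```shell
# Sketch: strip $(..._la_DEPENDENCIES) references from every generated
# Makefile under a directory, leaving a Makefile.orig backup beside each one.
strip_la_dependencies() {
    # $1: top of the build tree (defaults to the current directory)
    find "${1:-.}" -name Makefile -exec \
        sed -i.orig 's/\$([^ ]*_la_DEPENDENCIES)//g' {} +
}
```

With this, a generated rule such as libxxx.la: $(libxxx_la_OBJECTS) $(libxxx_la_DEPENDENCIES) loses its second prerequisite, and the .orig files give a quick way back if something goes wrong.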

Once done with make -n, the Makefiles can be regenerated by running the config.status script.