FNAL upgraded many of their worker nodes to cvmfs-2.4.4, and 45 of them hung for several hours in cvmfs_config reload. The admin then ran cvmfs_config killall on those nodes and it got worse: a number of repositories hung with 'Transport endpoint is not connected', and a mount of the config repository hung as well. Then they called me in to investigate. On one machine I got the mount to proceed with rmdir /var/run/cvmfs/cvmfs.pause. I could umount two of the repositories, but two others reported that the mountpoint was busy. I couldn't run fuser on the mountpoint because it was stale, so I had to use umount -l to unmount them.
So there are two questions: why did the original reload hang, and why did the killall not clean it up? I think the latter is because one of the repositories couldn't be unmounted; killall does not proceed to clean /var/run/cvmfs if an unmount fails. Maybe it's just a matter of using umount -l. As for the former, about 10% of the nodes did not get upgraded (I don't know why), so we should be able to catch the hang tomorrow before they attempt a killall, and maybe I can tell something from examining a machine at that point.
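
For reference, the manual recovery I did on that one machine could be sketched roughly as below. This is only an illustration, not what cvmfs_config killall does: the function names recover_cvmfs_node and run and the DRY_RUN variable are made up for this sketch, and the mountpoints are passed in as arguments rather than discovered. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper, not part of cvmfs. DRY_RUN=1 (the default) only
# prints the commands; set DRY_RUN=0 to actually execute them as root.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_cvmfs_node() {
    # Remove the leftover pause directory that was blocking the mount.
    run rmdir /var/run/cvmfs/cvmfs.pause
    for mnt in "$@"; do
        # Try a normal unmount first; if it fails (busy or stale
        # mountpoint), fall back to a lazy unmount.
        run umount "$mnt" || run umount -l "$mnt"
    done
}
```

Calling recover_cvmfs_node with the stuck mountpoints as arguments in dry-run mode prints one rmdir line and one umount line per mountpoint; the umount -l fallback only comes into play in a real run, when the plain umount fails.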