[ale] Lab Workstation Mystery

Todor Fassl fassl.tod at gmail.com
Fri Apr 22 11:41:08 EDT 2016


For completeness' sake, I am going to reply to my own post with what I 
think is a solution...

The problem: Machines in 2 labs in different buildings on different 
physical switches but on the same sub-lan would get into a state where 
home directories on an NFS server could not be mounted or unmounted. 
Attempts to mount or unmount a user's home directory would simply hang. 
I discovered that even after a user logged out, not all of his processes 
would die. For most users, 4 processes in particular were left behind: 
systemd, sd-pam, ibus-daemon, and ibus-dconf.

Solution: I've been periodically logging in remotely to all 15 of these 
machines and killing off the processes for any user who has logged out 
and has those 4 processes and only those 4 processes running. The 
automounter then umounts their home directory. The nfsv4 kernel module 
hasn't gotten wedged on any machine since I started doing that 2 days ago.
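
The check described above could be scripted roughly like this. It is a 
minimal sketch in Python, not a tested tool: the function names 
(parse_ps, parse_who, stale_users, find_stale_users) are made up for 
illustration, and it assumes the procps version of ps and the who 
command are available. It only reports candidates and kills nothing.

```python
import subprocess
from collections import defaultdict

# The four leftover process names. ps reports sd-pam as "(sd-pam)".
LEFTOVERS = {"systemd", "(sd-pam)", "ibus-daemon", "ibus-dconf"}


def parse_ps(ps_output):
    """Map user -> [(pid, command)] from `ps -eo user:32,pid,comm --no-headers`."""
    procs = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3:
            user, pid, comm = fields
            procs[user].append((int(pid), comm.strip()))
    return dict(procs)


def parse_who(who_output):
    """Return the set of user names that `who` lists as logged in."""
    return {line.split()[0] for line in who_output.splitlines() if line.strip()}


def stale_users(procs, logged_in):
    """Users who are logged out and have exactly the four leftovers running."""
    stale = {}
    for user, entries in procs.items():
        names = {comm for _pid, comm in entries}
        if user not in logged_in and names == LEFTOVERS:
            stale[user] = sorted(pid for pid, _comm in entries)
    return stale


def find_stale_users():
    """Run ps and who on the local machine and return {user: [pids]}."""
    ps_out = subprocess.run(
        ["ps", "-eo", "user:32,pid,comm", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    who_out = subprocess.run(
        ["who"], capture_output=True, text=True, check=True
    ).stdout
    return stale_users(parse_ps(ps_out), parse_who(who_out))
```

Calling find_stale_users() on a lab machine returns a dict of user names 
and their PIDs. Comparing the set of command names for equality (rather 
than "contains") mirrors the "those 4 and only those 4" rule, so a user 
with anything else still running is left alone.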

Two days probably isn't long enough to be sure I've found a solution. 
But there are 15 machines and at least one of them (usually more) got 
wedged every day. I think that what was happening is that nfs couldn't 
save changes or even close a file after a certain period of time. Either 
it accumulated too many mounts, open files, or whatever, or it had some 
problem when a user logged back in and tried to access files that were 
already hosed. At any rate, killing off those processes so the nfs share 
can be unmounted seems to have solved the problem.

Well, sort of ... It turns out the read-only file system thing is a 
different problem. In the past couple of days, I've seen 3 machines 
where nfsv4 is not messed up, users can log in okay and it mounts their 
home directory just fine. But all of the partitions on the hard disk are 
mounted read-only. I reinstalled so that /, /var, /tmp, and /usr/local 
are all on separate partitions. That did not help.

At first I thought the kernel was getting so messed up with the nfsv4 
problem that it would eventually remount its local file systems 
read-only. But this is probably a different problem, possibly related to 
power surges or something. If I find a cause/solution to this problem, 
I'll post about it too. But don't hold your breath.
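
In the meantime, spotting an affected machine can be automated. Here is 
a minimal sketch in Python that reads text in the /proc/mounts format 
and lists local disk partitions mounted read-only. The function name 
read_only_mounts is made up for illustration, and treating any device 
under /dev/ as "a partition on the hard disk" is an assumption.

```python
def read_only_mounts(mounts_text):
    """Return [(device, mount_point)] for /dev/* filesystems mounted read-only.

    mounts_text is the content of /proc/mounts, where each line is:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, _fs_type, options = fields[:4]
        # Assumption: local disk partitions show up as devices under /dev/.
        if device.startswith("/dev/") and "ro" in options.split(","):
            read_only.append((device, mount_point))
    return read_only
```

On a lab machine this would be called as 
read_only_mounts(open("/proc/mounts").read()). Splitting the options on 
commas keeps an option such as errors=remount-ro from being mistaken 
for ro.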

On 04/20/2016 01:00 PM, Todor Fassl wrote:
> I verified that if you log in and then just log back out immediately,
> those same 4 processes remain running, systemd, sd-pam, ibus-daemon, and
> ibus-dconf. I don't have a plain ubuntu 15.10 system handy but I'll bet
> it does the same thing. Anybody have a machine like that? Login at the
> console, log out, ssh to the machine as another user, and see if there
> are any processes still running for the user who just logged out.
>
> I tried switching a machine to use gdm instead of lightdm, no joy.  I
> think I'm logging in via gnome. I can try unity too.
>
> I think it's a systemd issue. In fact, I think it's a "feature" of
> systemd. It messes up autofs though.
>
>
>
>
>
> On 04/20/2016 12:31 PM, Jim Kinney wrote:
>> Anyone using screen, tmux or nohup?
>> On Wed, 2016-04-20 at 11:52 -0500, Todor Fassl wrote:
>>> I posted about this problem a couple of weeks ago and still have not
>>> figured it out. The problem is that on a group of machines running
>>> ubuntu 15.10, after a period of time, mounting home directories via
>>> NFS hangs. Attempting to mount or unmount home directories via NFS
>>> simply hangs. Eventually, the root filesystem gets remounted read-only
>>> and the machine becomes unusable even as a local user. One thing I've
>>> discovered since my first post about this is that when end-users log
>>> out, some processes do not get killed off. The automounter can't
>>> umount the home directory because the user still has some processes
>>> running. Eventually, the machine has several home directories mounted
>>> via NFS for users who are no longer logged in. I am thinking that what
>>> is happening is that eventually this causes NFS to get wedged which in
>>> turn leads to the kernel freaking out. Or something. Here is an
>>> example of the output from listing the processes for a user who has
>>> logged out:
>>>
>>> # ps -u enduser1
>>>       PID TTY          TIME CMD
>>>    101794 ?        00:00:00 systemd
>>>    101795 ?        00:00:00 (sd-pam)
>>>    103049 ?        00:00:00 ibus-daemon
>>>    103057 ?        00:00:00 ibus-dconf
>>>
>>>
>>> So frequently, even though a user logged out days ago, the
>>> systemd and ibus-daemon might still be running. I am thinking after
>>> enough time, these things mess up the nfsv4 kernel module which
>>> eventually messes up the kernel itself.
>>>
>>> But why would logging out *not* kill off all of an end-user's
>>> processes?
>>>
>>>
>>> _______________________________________________
>>> Ale mailing list
>>> Ale at ale.org
>>> http://mail.ale.org/mailman/listinfo/ale
>>> See JOBS, ANNOUNCE and SCHOOLS lists at
>>> http://mail.ale.org/mailman/listinfo
>

-- 
Todd

