The Linux Questions Thread: a bunch of pitfalls, but technically it's possible


Wedge of Lime: Sep 4, 2003; I lack indie hair superpowers.

Aoi-chan posted:

Also, since we ran on NFS, sometimes we REALLY needed to kill processes that were uselessly stuck in disk wait.

I am not entirely sure this is correct for Linux, but under Solaris it works like this. Imagine a program that opens a file over NFS: it must ask the kernel to do it, as this is a privileged operation.

When this occurs, the process effectively switches from user land into the kernel. In this example, imagine that the process running in the 'user process space' is "cat" reading a file from an NFS mount.

code:

   ----function calls --->

  | User Process Space | Kernel Space  | File System (NFS)
t | open a file(x)     |               | 
i | ------------------>|  open file(x) |
m |                    |-------------->|
e |                    |<--------------|
  V <------------------|               |

Now, that's a rather crude diagram, but it should demonstrate my point.

When you 'kill' a process, you simply set a flag in the kernel which says: next time you deal with this process, terminate it.

These 'flags' are processed when a process passes through the boundary between the user land and the kernel space. A process which has got stuck waiting for NFS is stuck all the way over to the right in kernel space.

Until this NFS file open / read / write / whatever times out and fails, the process will never cross the user/kernel boundary and thus your request for the process to be killed is never handled.

I may have got some things wrong there, but I'm fairly sure that's the gist of it without getting out my Unix system internals book.
The solution to this problem (which has already been mentioned) is to fix NFS, or in some cases you can 'umount -f' the offending file system.
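On Linux (the explanation above is for Solaris), a process stuck in the kernel like this normally shows state `D`, uninterruptible sleep, which is the letter `ps` prints in its STAT column and the third field of `/proc/<pid>/stat`. A minimal Python sketch for listing such processes, assuming a Linux `/proc`; `proc_state` and `stuck_in_disk_wait` are made-up names for illustration:

```python
import os


def proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the *last* closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 2:]
    return after_comm.split()[0]


def stuck_in_disk_wait():
    """Yield (pid, command) for every process currently in state 'D'."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        if proc_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            yield int(entry), comm


if __name__ == "__main__":
    for pid, comm in stuck_in_disk_wait():
        print(pid, comm)
```

Processes found this way are the ones a plain `kill` cannot finish off until the NFS operation returns or fails.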

Mar 27, 2007 18:38

Wedge of Lime: Sep 4, 2003; I lack indie hair superpowers.

teapot posted:

Killing init...

One can kill init with a live kernel debugger; other than that, I do not think it's possible with standard system utilities. From my experience of killing init on a Solaris 9 system, you end up with hundreds of zombie processes. Eventually you run out of free PIDs (because of the zombies) and... well, you're pretty stuffed then.
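Why a dead init leaves zombies behind: a process that exits keeps its process-table entry and PID until its parent collects the exit status with `wait()`, and orphaned processes are handed to init (PID 1) so that it can reap them. The sketch below (Linux only; `state_of` and `demo_zombie` are illustrative names) shows the first half of that: a child that has exited but has not been waited for sits in state `Z`, and its PID is released only after `waitpid()`. It demonstrates the zombie mechanism only; it does not touch init.

```python
import os
import time


def state_of(pid: int) -> str:
    """One-letter state of `pid`, read from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        line = f.read()
    return line[line.rindex(")") + 2:].split()[0]


def demo_zombie():
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping Python cleanup
    # Parent: wait (up to 5 s) for the child to finish exiting. Until we call
    # waitpid(), its process-table entry and PID stay allocated as a zombie.
    deadline = time.monotonic() + 5.0
    while state_of(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = state_of(pid)
    os.waitpid(pid, 0)  # reap the child, freeing its PID
    return before, os.path.exists(f"/proc/{pid}")


if __name__ == "__main__":
    print(demo_zombie())  # expected on Linux: ('Z', False)
```

If nothing ever calls `waitpid()` for such children, every one of them keeps holding a PID, which is how the PID table eventually fills up.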

Jun 3, 2007 10:48

Wedge of Lime: Sep 4, 2003; I lack indie hair superpowers.

MonikaTSarn posted:

Hm, we have an NFS server that's a bit overloaded at times, that might be it. Why would portmap use 100% CPU then? If it's not getting any response, does it keep retrying constantly and use 100% CPU? Not nice!
I thought we had some configuration problem causing this, some misbehaving server doing weird broadcasts or something. If this is the expected result of a swamped NFS server or network, I guess I know where to look. Or I can at least blame the hardware and it's not my problem anymore.

Output from nfsstat will help you: high numbers under the retrans column indicate a saturated network. High levels of null calls are also bad.

You can look this up on the clients or the server for extra information. You might also want to strace the portmap process while it's running away to see what it's doing.
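On a Linux NFS client, the counters nfsstat reports under "Client rpc stats" (calls, retrans, authrefrsh) come from the `rpc` line of `/proc/net/rpc/nfs`. A rough Python sketch that turns them into a retransmission ratio; the helper name `client_retrans_ratio` is hypothetical, and what counts as "high" is a judgment call this does not settle:

```python
def client_retrans_ratio(proc_text: str) -> float:
    """Fraction of client RPC calls that had to be retransmitted.

    `proc_text` is the contents of /proc/net/rpc/nfs on a Linux NFS client;
    its 'rpc' line holds: calls, retransmissions, auth refreshes.
    """
    for line in proc_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "rpc":
            calls, retrans = int(fields[1]), int(fields[2])
            return retrans / calls if calls else 0.0
    raise ValueError("no 'rpc' line found")


if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfs") as f:
            ratio = client_retrans_ratio(f.read())
        print(f"{ratio:.2%} of RPC calls were retransmitted")
    except FileNotFoundError:
        print("no NFS client statistics on this machine")
```

Comparing the ratio across a few clients helps tell a saturated network (all clients high) from one misbehaving host.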

I was going to suggest running pstack against portmap, but most Linux binaries are stripped, so it won't give you useful information.

However, portmap is not actually serving NFS; it just provides port lookups for clients that want to use RPC services on that system. Portmap listens on port 111 (TCP and UDP), so if you run tcpdump during the periods when portmap is at 100% CPU and watch for RPC packets, you will be able to see what RPC requests are coming in from the network.
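To make sense of what tcpdump captures on port 111, it helps to know that every ONC RPC call (RFC 5531) starts with six 32-bit big-endian fields: transaction ID, message type (0 for a call), RPC version (2), program, program version and procedure. A minimal Python sketch that decodes those fields from a packet's payload, for example one pulled out of a capture written with `tcpdump -w`; `decode_rpc_call` and the small program table are illustrative, not part of any library:

```python
import struct

# A few well-known ONC RPC program numbers (not exhaustive).
PROGRAMS = {100000: "portmapper", 100003: "nfs", 100005: "mountd", 100021: "nlockmgr"}


def decode_rpc_call(payload: bytes, tcp: bool = False):
    """Decode the fixed part of an ONC RPC CALL header.

    Returns (xid, program_name, version, procedure), or None if the payload
    is not an RPC call. Over TCP each record is preceded by a 4-byte
    record-marking header; tcp=True skips it, assuming the segment starts
    at the beginning of a record.
    """
    if tcp:
        payload = payload[4:]
    if len(payload) < 24:
        return None
    xid, msg_type, rpcvers, prog, vers, proc = struct.unpack(">6I", payload[:24])
    if msg_type != 0 or rpcvers != 2:  # 0 = CALL; RPC protocol version is 2
        return None
    return xid, PROGRAMS.get(prog, str(prog)), vers, proc
```

For the portmapper, procedure 3 is GETPORT, so a steady stream of `portmapper` calls with procedure 3 would mean clients keep asking where some service lives, which may be worth correlating with the CPU spikes.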

Dec 7, 2007 21:51
