On Thu, Dec 03, 2009 at 02:02:05PM -0800, Mike Ryan wrote:
>
> An NFS client which tries to delete an inexistent file will trigger the
> bug. This can result from rebooting the ops node while experiments are
> running.

I have a little clarification here:

The bug isn't exercised by *any* attempt to delete a nonexistent file.
The problems we saw were caused by trying to delete a file from a
directory to which the client had a stale NFS filehandle. As Mike says,
a likely cause of this would be a server reboot while one or more
clients were in a recursive rm.

Parenthetically, it also looks like a couple of other corner cases are
not well handled - things like attempting to delete the root of an NFS
mounted hierarchy. Importantly, a remove request that simply fails with
a standard ENOENT (no such file or directory) is properly handled.

The situation is somewhat insidious in that if an experimental node
issues an NFS remove for a file in a stale directory, the file server
never sends a response to it. So the experimental nodes keep sending
the requests. These requests are coming from the kernel - the NFS RPC
retransmission code is the resender; the rm process may have been
killed. Meanwhile, the file server keeps slowly leaking memory. The
situation persists across reboots or panics of the file server, only
stopping when all the experimental nodes resending the request are
rebooted.

One can spot the problem by looking at a tcpdump and seeing NFS remove
requests to the server at regular intervals that are completely ignored
rather than being responded to. (That's in addition to the other
diagnoses Mike suggested.)

Fortunately, the bug doesn't get exercised a lot, but, on the other
hand, once it's affecting your testbed, it will continue to do so until
the nodes tickling it are rebooted.

-- 
Ted Faber
http://www.isi.edu/~faber
PGP: http://www.isi.edu/~faber/pubkeys.asc
Unexpected attachment on this mail? See http://www.isi.edu/~faber/FAQ.html#SIG
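
As a rough illustration of the tcpdump check described above, here is a
minimal Python sketch that flags NFS remove calls which are retransmitted
but never answered. It is not taken from the original message. It
assumes the capture has already been reduced to simple
(timestamp, client, xid, kind) records for REMOVE traffic only, and that
retransmissions reuse the same RPC transaction ID (XID). The record
layout, the function name find_unanswered_removes, and the min_retries
threshold are all hypothetical.

```python
from collections import defaultdict


def find_unanswered_removes(packets, min_retries=3):
    """Return {(client, xid): [timestamps]} for calls that never got a reply.

    packets: iterable of (timestamp, client, xid, kind) tuples, where kind
    is "call" or "reply". This layout is an assumption for illustration;
    producing it from a real capture is left to whatever tool you use.
    """
    calls = defaultdict(list)   # (client, xid) -> timestamps of each call seen
    replied = set()             # (client, xid) pairs the server answered

    for ts, client, xid, kind in packets:
        key = (client, xid)
        if kind == "call":
            calls[key].append(ts)
        elif kind == "reply":
            replied.add(key)

    # Keep only requests that were resent at least min_retries times
    # in total and never received any reply.
    return {
        key: sorted(times)
        for key, times in calls.items()
        if key not in replied and len(times) >= min_retries
    }
```

A client that shows up in the result with timestamps at regular
intervals would be a candidate for one of the nodes that needs
rebooting.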