Can't you just use soft mounts? Especially since in the scenario you described y...

acdha · on June 29, 2016

It depends on the nature of the failure. If you're doing something which triggers hard failures – i.e. where a read() call on a socket would unblock with an error – soft mounts will eventually recover. The problem is that, in my experience anyway, the vast majority of failures aren't that clean – things like a server which processes packets but never responds, a network connection which drops packets but doesn't change the link status, etc. – and in those cases soft mounts behave no better than hard mounts. There also used to be kernel bugs in *BSD, Darwin, and Linux where the client could deadlock under heavy activity, and those were hard to reproduce and hard to get fixed.

In all of those cases, anything which tries to access something on the NFS mount will block in the kernel (i.e. “kill -9” won't work) and the mount cannot be unmounted normally.

I wrote https://github.com/acdha/mountstatus a while back – if memory serves, 2004 or so – because we found that on Linux a lazy unmount would still work in this case. That meant you could have a process monitor the mount status (fork() a child to check the mount, alert if it doesn't get a response within a set interval), and a watchdog could respond to an alert by issuing a “umount -l” and remounting. That doesn't fix the processes which are already blocked, but it's less disruptive than rebooting, and new processes won't block when they try to access that mount.
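
For anyone curious what the fork-and-timeout check looks like, here is a rough Python sketch of the idea. It is not the actual mountstatus code; the function name check_mount and the timeout and poll_interval parameters are made up for illustration, and it assumes a POSIX system where os.fork() is available.

```python
import os
import signal
import time


def check_mount(path, timeout=5.0, poll_interval=0.05):
    """Probe `path` from a forked child; return "ok", "error", or "timeout".

    Hypothetical helper for illustration only, not the mountstatus code.
    """
    pid = os.fork()
    if pid == 0:
        # Child: on a hung NFS mount this call may block in the kernel
        # indefinitely, which is exactly why the parent never makes it.
        code = 1
        try:
            os.statvfs(path)
            code = 0
        except OSError:
            code = 1
        finally:
            # Always exit here so the child never runs the parent's code.
            os._exit(code)

    # Parent: poll without blocking so a stuck child cannot hang us too.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
                return "ok"
            return "error"
        time.sleep(poll_interval)

    # No answer in time. The kill is best-effort: a child stuck in
    # uninterruptible sleep will ignore it until the mount comes back,
    # so the child is deliberately not waited on here.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return "timeout"
```

A watchdog would call something like check_mount("/mnt/nfs") on a schedule (the path is just an example) and treat "timeout" as the signal to alert, or to run “umount -l” and remount. The key design point is that the parent only ever polls, so it stays responsive even when the child is wedged.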