
No, monit doesn't manage services. Monit follows the clues you've given it about what's running and polls them once in a while. If something appears not to be running (as measured by the instructions you've given it), it runs the one-liner you've given it that should start the thing up again.
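To make the polling model concrete, here is a rough Python sketch of one check pass. This is not monit's actual code: the function names, the pid-file convention, and the start command are all assumptions for illustration, and it assumes a POSIX system.

```python
import os
import subprocess

def pid_from_file(pidfile):
    """Return the PID recorded in pidfile, or None if missing or garbled."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def looks_alive(pid):
    """Best guess at whether pid exists. Signal 0 checks without delivering."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def check_once(pidfile, start_cmd):
    """One polling pass. Returns True if a restart was attempted."""
    if looks_alive(pid_from_file(pidfile)):
        return False
    # The "one-liner" that should start the thing up again (hypothetical).
    subprocess.run(start_cmd, check=False)
    return True
```

Note the guesswork: a stale pid file whose PID has been recycled by an unrelated process passes this check, and a crash is only noticed at the next poll.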

Monit does a thing that approximates managing a process, for certain values of "approximates", "managing", and "process". Supervisory process management is one of Linux's absolute weakest points. I cut my teeth on fault-tolerant HA minicomputers, and it pains me to think that 30 years later, we still don't have a way to say "make sure apache is always running. period."

As a great blog pointed out, there is exactly one process that KNOWS when a service has stopped running, and it doesn't need .pid files or polling or anything else to tell it: process 1.
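The underlying mechanism is that the kernel reports a child's exit to its parent. A minimal Python sketch of that idea, not of init itself: `supervise` and `max_runs` are made-up names, and the cap exists only so the example terminates.

```python
import subprocess

def supervise(cmd, max_runs):
    """Run cmd, restarting it each time it exits, for max_runs runs in total.

    Returns the exit codes in order. A real supervisor would loop forever
    and rate-limit restarts; max_runs is a hypothetical cap for this sketch.
    """
    exit_codes = []
    for _ in range(max_runs):
        child = subprocess.Popen(cmd)
        # wait() blocks until the kernel reports the child's exit to us,
        # its parent. No pid file and no polling interval are involved.
        exit_codes.append(child.wait())
    return exit_codes
```

Calling `supervise([sys.executable, "-c", "raise SystemExit(3)"], max_runs=3)` would restart a stand-in service that dies immediately. This only works for direct children, which is why the parent is the one process that can know without guessing.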

I'm not a systemd advocate - I don't know enough about it, and we're using Ubuntu so I'll end up learning upstart anyway - but read this, it's way more eloquent than I can be:

http://dustin.github.com/2010/02/28/running-processes.html



Fair points. And thanks, by the way, for actually advancing the discussion.

Init can and does manage processes, somewhat crudely, mostly via the 'respawn' directive. One thing it isn't particularly good at is telling whether a process is doing something useful (say, serving out web pages successfully), but it will let you know that it's running. There was a semi-popular hack some years back to run sshd out of init (via respawn) to ensure you always had an SSH daemon on your box (Dustin mentions this). The downside is that while it will ensure sshd is running, it doesn't give you much flexibility over the process (you've got to edit inittab and run 'init q' to make changes).

What monit and kin can do, above and beyond process-level monitoring, is check that the service attributes of a process are sane: that a webserver, say, kicks out a 200 OK response rather than a 4xx or 5xx error, restarting the service if it doesn't. Checking for correct operation can be more useful than simply verifying a process is running (though going too far overboard in defining "correctness" can also cause problems).
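A service-level check of that kind might look roughly like this in Python, using only the standard library. The names `probe` and `is_healthy` are hypothetical, and treating any 2xx as healthy is an assumption; what counts as "correct" is a policy choice.

```python
import http.client
import urllib.error
import urllib.request

def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, garbled response, etc.
        return None

def is_healthy(status):
    """Assumed policy: any 2xx counts as sane, everything else does not."""
    return status is not None and 200 <= status < 300
```

A checker would call something like `is_healthy(probe("http://localhost/"))` on a schedule and trigger the restart command on failure. Note that `urlopen` follows redirects by default, so a 3xx is reported as whatever status the final page returns.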

For realtime/HA tools, attacking things on the single-system level is probably the wrong way to roll. You want a load balancer in front of multiple hosts with response detection -- is host A still up or not? Whether or not this ties into mitigation (restart) or alerting (notifications to staff) is another matter.
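The response-detection half of that can be sketched in a few lines of Python. This is a toy model, not any particular load balancer's behaviour: `BackendPool`, `fail_threshold`, and the "N consecutive failures" rule are assumptions for illustration.

```python
class BackendPool:
    """Track probe results per host and decide which hosts stay in rotation."""

    def __init__(self, hosts, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.failures = {host: 0 for host in hosts}

    def record(self, host, ok):
        """Feed in the result of one probe: success resets the counter."""
        self.failures[host] = 0 if ok else self.failures[host] + 1

    def is_up(self, host):
        """A host is down after fail_threshold consecutive failed probes."""
        return self.failures[host] < self.fail_threshold

    def in_rotation(self):
        """Hosts that should currently receive traffic."""
        return [h for h in self.failures if self.is_up(h)]
```

Whether a host dropping out of rotation triggers a restart or just pages someone is deliberately left out; the pool only answers "is host A still up or not?".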

There are also places other than init you can watch things from. /proc contains within it multitudes, including a lot of interesting/useful process state. Daemons can be written with control/monitoring sockets instrumented directly into themselves. Debuggers, strace, ltrace, dtrace, and systemtap all provide resolution inside a running process/thread. Creating something sane, effective, efficient, and sufficient out of all these tools ... interesting problem.
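As a small taste of the /proc route, here is a Linux-only Python sketch that reads /proc/<pid>/status. The function names are made up, and it uses only the "State" field of that file.

```python
def proc_status(pid):
    """Parse /proc/<pid>/status into a dict, or None if it can't be read."""
    try:
        with open(f"/proc/{pid}/status") as f:
            lines = f.read().splitlines()
    except OSError:
        return None  # process is gone, or this system has no /proc
    fields = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields

def is_zombie(pid):
    """True if the kernel reports the process state as Z (zombie)."""
    status = proc_status(pid)
    return status is not None and status.get("State", "").startswith("Z")
```

This distinguishes "exists" from "exists but is a zombie", which a bare pid check cannot, but it still says nothing about whether the process is doing useful work.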



