This was an old nemesis of mine. Once again, a post on Serverfault showed me the light.
A simple cron job can turn into a forkegeddon when your back is turned. It is easy to see how it happens after you've been bitten.
- Some applications require more than PID monitoring to assess process state. A custom script gets banged out to test the application's health, using curl -s to test web services, mysql -e to test databases, etc.
- These scripts restart the process if the check fails. Easy: service, initctl, invoke-rc.d, and supervisorctl all have restart capabilities.
- You'll want to test health all the time, so into cron it goes. Into /etc/cron.d/scriptname it goes with * * * * * root /usr/local/sbin/scriptname (files in /etc/cron.d need the user field).
What could go wrong?
But then, someday it happens. You start getting emails every minute, sometimes two at a time, nagging about process restarts. It's been working for months, so you log in and run top. There are dozens of CRON forks invoking your script, all trying to restart the service!
Checking and restarting the service took longer than 60 seconds. That's what happened. Once that threshold is breached, it's mayhem. Restart requests queue up, each later service check finds the process still down, and it joins the restart conga line.
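To see the pile-up in miniature, here is a scaled-down sketch. It assumes a "cron" that fires every second and a hypothetical slow_check function that takes four seconds, standing in for the real check-and-restart that overruns its minute.

```shell
#!/bin/bash
# Scaled-down simulation: "cron" fires every 1s, but each check takes 4s.
# slow_check is a hypothetical stand-in for the real check-and-restart.
slow_check() { sleep 4; }

pids=()
for tick in 1 2 3; do
  slow_check &                # cron does not wait for the previous run
  pids+=("$!")
  sleep 1
done

# Count how many checks are still in flight after three ticks.
running=0
for pid in "${pids[@]}"; do
  kill -0 "$pid" 2>/dev/null && running=$((running + 1))
done
echo "overlapping checks: $running"
wait
```

Nothing stops the runs from stacking up, so every tick adds another one to the pile.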
nginx restarts, mysql backups, and daylight saving time are all well-known culprits for this bug. What is needed is a way to prevent the script from running if it is already running. This is where locking comes in, and in this case the flock utility. Checking all my systems, it seems to be widely installed by default.
The manpage for flock leaves much to be desired, which is probably why it is underutilized by sysadmins. It just sits there in the toolbox, smoldering at us every time we reinvent this wheel. flock can be invoked three different ways, and it can get confusing. We are going to illustrate just one.
#!/bin/bash
# Fail if already running: take an exclusive lock on fd 200, or exit
exec 200>>/var/lock/check-myBackend-service.lock
flock --nonblock --exclusive 200 || exit 1
# Restart the backend only if the health check fails
curl -s -H 'Host: example.com' 'http://localhost/service?test=1' ||
  supervisorctl restart myBackendService
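If you want to convince yourself the lock behaves as advertised, a throwaway demo along these lines should do it. This is only a sketch: the temporary lock file and the sleep timings are my own assumptions, and the two subshells stand in for two overlapping cron runs.

```shell
#!/bin/bash
# Demo: a second non-blocking flock on the same file fails while the first holds it.
lockfile=$(mktemp)            # hypothetical throwaway lock file for the demo

# First "run": take the lock on fd 200 and hold it for a few seconds.
(
  exec 200>>"$lockfile"
  flock --nonblock --exclusive 200 || exit 1
  sleep 3
) &
holder=$!
sleep 1                       # give the holder time to grab the lock

# Second "run": same steps, but the lock is already taken.
(
  exec 200>>"$lockfile"
  flock --nonblock --exclusive 200
)
second_status=$?

wait "$holder"
first_status=$?
rm -f "$lockfile"
echo "first run: $first_status, second run: $second_status"
```

The first run should exit cleanly and the second should come back non-zero right away instead of queueing. Another of flock's forms wraps a command directly, for example flock -n /var/lock/scriptname.lock /usr/local/sbin/scriptname, which can go straight into the cron line if you would rather not touch the script.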
Who watches this watcher?
I've already covered the black magic of noticing what didn't happen.
To notice if the watcher stops watching, make sure syslog knows about it. Add
logger -t "$0" "Checking ____ for ____." at the beginning.
Then create an entry in Logstash and a passive freshness check in Icinga. If you are clever, you can use JSON in the logger event and write one generic Logstash stanza to organize all of these. Then all you would need is the Nagios bit, and check_mk can organize that.
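The JSON idea might look something like this. It is a rough sketch: the watcher_event helper and its field names (event, service, reason) are made up for illustration, and whatever your Logstash stanza expects may differ.

```shell
#!/bin/bash
# Hypothetical helper: format a watcher event as a one-line JSON string.
# Field names are illustrative, not something Logstash requires.
# Values are not JSON-escaped, so keep them free of quotes and backslashes.
watcher_event() {
  printf '{"event":"%s","service":"%s","reason":"%s"}' "$1" "$2" "$3"
}

msg=$(watcher_event check myBackendService "scheduled health check")
echo "$msg"

# Send it to syslog with a tag (the tag name here is an assumption);
# skipped quietly if logger or a syslog socket is not available.
if command -v logger >/dev/null 2>&1; then
  logger -t check-myBackend-service "$msg" 2>/dev/null || true
fi
```

With every watcher emitting the same shape, one filter keyed on the tag can parse them all.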
To get a cheap Kibana histogram of restarts, add:
logger -t "$0" "Restarting ____ because ____." in there too.
While you are in there, add an email notification, too.
Now it's done.