Dan
Published Sun 15 December 2013

Noticing what didn't happen with syslog, icinga, and logstash.

Context

If no one complains, the problem doesn't exist. Yet. System administrators know better. We have scars.

A failing backup regime can sit there busily wasting tapes for years. Backend APIs could be left un-rsynced. MySQL replication may replican't. Our tools are pretty good at logging everything they do. None of them log what they don't.

I want to know what didn't happen. So I asked serverfault:

How does an administrator generalize alerting when an event doesn't happen?

Often my users require me to be just as responsible for knowing if an event hasn't happened. I've always had to build custom and brittle solutions with cron'ed shell scripts and lots of date edge case testing.

Centralized logging ought to allow for a better, more maintainable way to get a grip on what didn't happen in the last N hours. Something like logstash noticing and nagios alerting.

User toppledwagon came through with a shockingly thorough answer.

Seriously, if this page helps you, go vote up his answer.

The general idea

It works like this: use syslog as a distributed message queue, feed the messages to logstash, emit events to nagios/icinga, and finally alert if a freshness threshold has been exceeded. Brilliant.

I want to illustrate how I worked his insights into our pipeline.
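End to end, the pieces line up like this:

cron -> jenkins -> logger/syslog -> beaver -> logstash -> nsca -> icinga passive check, which goes stale and alerts if nothing shows up in time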

Jenkins

  1. It starts with a cron job. Doesn't everything? It checks if there is new data to be assembled into a product build.

  2. If new data is found, nudge jenkins via its REST API.

    We generally follow a convention of naming jenkins jobs and git repos after their data sources. So if Atlanta.zip comes in, we search for all jenkins jobs named atlanta and build them (a rough sketch of this trigger script appears after the list).

  3. The atlanta job checks out the latest from git and runs assemble.sh if it exists.

    This is where we keep the scripts that integrate the new data into the workspace. When/if the atlanta job completes, it calls another parameterized job, "Package", with repo=atlanta.

  4. Jenkins job "Package" transfers the atlanta job's workspace, but with an artificial filesystem root.

    Using more conventions in the git repo, it stages the fake filesystem, putting supervisor.conf into $fake/etc/supervisor/conf.d/$repo.conf, and so on.

    Then it uses fpm to create Debian packages for various distros and architectures. Finally it calls another parameterized jenkins job, "Install on QA".

  5. Jenkins job "Install on QA" uses salt to apt-get install the new package on the QA servers.

    salt makes this trivial. Then an email is sent to the QA testers, with links back to Jenkins to kick off "Install on Cluster" once everything checks out.

  6. Jenkins job "Install on Cluster" apt-gets the deployment.

    Again, salt orchestrates the apt-get rollout to the production servers. Then we emit a line to syslog:

    logger -t jenkins "job=\"$JENKINS_JOB\" repo=\"$repo\""

That last line is the new sauce. It starts a whole new pipeline.
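For completeness, here is a minimal sketch of the cron trigger from steps 1 and 2. The drop directory, state file, and Jenkins URL are hypothetical, and a real installation would likely also need credentials and a CSRF crumb on the POST:

# Cron-driven: look for new data drops and nudge the matching jenkins job.
# Paths and URL below are made up for illustration.
import os
import urllib.request

DROP_DIR = "/srv/incoming"                  # hypothetical: where Atlanta.zip and friends land
SEEN_FILE = "/var/lib/databuild/seen.txt"   # hypothetical: zips we have already handled
JENKINS = "http://jenkins.example.com"

seen = set()
if os.path.exists(SEEN_FILE):
    with open(SEEN_FILE) as f:
        seen = set(f.read().split())

for name in sorted(os.listdir(DROP_DIR)):
    if not name.endswith(".zip") or name in seen:
        continue
    # Convention: Atlanta.zip -> jenkins job "atlanta"
    job = os.path.splitext(name)[0].lower()
    # POST to the job's build endpoint to queue a run
    req = urllib.request.Request("%s/job/%s/build" % (JENKINS, job), data=b"")
    urllib.request.urlopen(req)
    seen.add(name)

with open(SEEN_FILE, "w") as f:
    f.write("\n".join(sorted(seen)) + "\n")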

Logstash and Icinga

We use logstash to process logs, and beaver to ship the logs to logstash.

Beaver tags the logs as type 'syslog' before shipping them. On the logstash side, two sections need to be configured: 'filter' and 'output'.

Logstash filter

filter {
  if [type] == "syslog" {
    grok {
      match => [ "message", '%{SYSLOGBASE} job="%{DATA:job}"(?: repo="%{DATA:repo}")?$',
                 "message", "%{SYSLOGLINE}" ]
      break_on_match => true
    }
    date { match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ] }
  }
}

The magic is in the two message/pattern pairs in grok's match parameter, along with break_on_match => true. Logstash tries each pattern in turn and stops at the first one that matches.
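For example, the logger line from the jenkins job reaches logstash as a syslog entry roughly like this (hostname and timestamp are illustrative), and the first pattern pulls out the fields we care about:

Dec 15 04:23:01 lead jenkins: job="Install on Cluster" repo="atlanta"

program => "jenkins"
job     => "Install on Cluster"
repo    => "atlanta"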

Logstash output

We use the logstash nagios_nsca output plugin to let icinga know we saw the jenkins job in syslog.

output {
  if [type] == "syslog"
    and [program] == "jenkins"
    and [job] == "Install on Cluster"
    and "_grokparsefailure" not in [tags] {
      nagios_nsca {
        host => "icinga.example.com"
        port => 5667
        send_nsca_config => "/etc/send_nsca.cfg"
        message_format => "%{job} %{repo}"
        nagios_host => "jenkins"
        nagios_service => "deployed %{repo}"
        nagios_status => "2"
      }
  } # if type=syslog, program=jenkins, job="Install on Cluster"
} # output
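For an atlanta deploy, icinga then receives a passive check result over nsca that looks roughly like this. Any such result updates the service's last check time, which is what keeps its freshness clock from running out:

host:    jenkins
service: deployed atlanta
status:  2
output:  Install on Cluster atlanta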

icinga (nagios)

Finally, we have arrived at icinga (nagios) by way of nsca. Now we need passive service checks defined for each and every job we want to notice hasn't happened on time. That can be a lot of jobs, so let's use check_mk to transform python lists of jobs into nagios object definitions.

check_mk is cool like that.

check_mk

# check_mk requires local variables be prefixed with '_'

_dailies = [ 'newyork' ]
_day_stale = 86400 * 1.5

_weeklies = [ 'atlanta', 'denver', ]
_week_stale = 86400 * 8

_monthlies = [ 'stlouis' ]
_month_stale = 86400 * 32

_service_opts = [
    ("active_checks_enabled", "0"),
    ("passive_checks_enabled", "1"),
    ("check_freshness", "1"),
    ("notification_period", "workhours"),
    ("contacts", "root"),
    ("check_period", "workhours"),
]

# Define a new command 'check-periodically' that sets the service to UNKNOWN.
# Nagios runs it once a check's freshness threshold has passed without the service checking in.

extra_nagios_conf += """
  define command {
    command_name check-periodically
    command_line $USER1$/check_dummy 3 $ARG1$
  }

  """
# Loop through all the repos and attach the new check-periodically command to each 'deployed' service on host 'lead'.

for _repo in _dailies + _weeklies + _monthlies:
    _service_name = 'deployed %s' % _repo
    legacy_checks += [(('check-periodically', _service_name, False), ['lead'])]


# Look before you leap - python needs the list defined before we can append to it.
# We can't just assign a fresh list either, because it may already have been defined elsewhere.

if "freshness_threshold" not in extra_service_conf:
    extra_service_conf["freshness_threshold"] = []

# Some check_mk wizardry to set when the check has passed its expiration date.
# Results in (691200, ALL_HOSTS, [ 'deployed atlanta', 'deployed denver' ]) for the weeklies, etc.

extra_service_conf["freshness_threshold"] += [
    (_day_stale,   ALL_HOSTS, ["deployed %s"   % _x for _x in _dailies]  ),
    (_week_stale,  ALL_HOSTS, ["deployed %s"  % _x for _x in _weeklies] ),
    (_month_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _monthlies] ),
]

# Now we assign all the other nagios directives listed in _service_opts

for _k, _v in _service_opts:
    if _k not in extra_service_conf:
        extra_service_conf[_k] = []

    extra_service_conf[_k] += [(_v, ALL_HOSTS, ["deployed "])]

Icinga (nagios)

check_mk -N will now generate nagios configuration objects for you, like this:

define service {
  use                           check_mk_default
  host_name                     lead
  service_description           deployed atlanta
  check_command                 check-periodically
  contact_groups                +admins
  active_checks_enabled         0
  passive_checks_enabled        1
  check_period                  workhours
  notification_period           workhours
  freshness_threshold           691200.0
  contacts                      root
  max_check_attempts            1
  check_freshness               1
}

Notice "check_period" and "notification_period" are set to "workhours". This lets the poor sysadmin sleep through the night and preserves cell phone integrity by not being hurled out the window. Icinga (nagios) will wait until workhours to check if these periodic jobs are stale.
