Dan

Published Thu 17 July 2014


My journey into the SaltMine to keep nagios fed with check_mk.

My journey into SaltStack's salt mine, Python's file locking, and check_mk's alternative datasources.

Lesson 1: salt-call peer.publish doesn't scale

My first attempt was over a year ago in 2013. I used Salt's peer publishing interface from the monitoring server to target minions. Each minion then ran check_mk_agent and returned the results. This didn't scale well, fell on its face, and resulted in nagios waking me up to ACKNOWLEDGE false alarms. I reverted and gave up.
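For the record, peer publishing needs a peer section in the master config plus a publish call from the monitoring box. A minimal sketch of what that looks like, assuming the monitoring server's minion id is mon01 (the id is a placeholder, not my actual config):

peer:
  mon01:
    - cmd.run

The datasource command on the monitoring server is then something like:

salt-call publish.publish <HOST> cmd.run /usr/bin/check_mk_agent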

Lesson 2: salt-call mine.get doesn't scale

To configure salt to mine data you need to list what to run in the mine_functions option. You can set this in /etc/salt/minion, grains, or pillars. I went with pillars - it seems like the best place for it.

Add saltmine.sls to the list of pillars in /srv/pillar/top.sls.

base:
  '*':
    - saltmine

Now create /srv/pillar/saltmine.sls:

mine_functions:
  check_mk.agent: []
  network.interfaces: []
  status.uptime: []

I included status.uptime to confirm that mine.update is actually running on the minions. network.interfaces is also handy.

Previous attempts to finagle check_mk_agent into the mine via cmd.run exposed a "feature": mine_functions is keyed by function name, so you can only have exactly one entry per function. Want to mine more than one cmd.run you say? Start coding. I opted to create my own salt module to namespace it:

$ cat /srv/salt/_modules/check_mk.py

#!/usr/bin/env python
''' Support for running check_mk_agent over salt '''
import os

def __virtual__():
    ''' Only load the module if check_mk_agent is installed '''
    if os.path.exists('/usr/bin/check_mk_agent'):
        return 'check_mk'
    return False

def agent():
    ''' Return the output of check_mk_agent '''
    return __salt__['cmd.run']('/usr/bin/check_mk_agent')

Push both module and pillar data to all minions:

salt '*' saltutil.sync_all

Confirm mine_functions are recognized by all minions:

salt '*' config.get mine_functions

Minions update mine data at startup and every mine_interval minutes thereafter. Force 20 minions at a time to run mine.update:

salt '*' -b 20 mine.update

Confirm you can retrieve mined data for minion zinc:

salt-call mine.get zinc check_mk.agent

Finally on the monitoring server I configure /etc/check_mk/main.mk to mine data:

mineget_cmd = (
    "sudo salt-call mine.get <HOST> check_mk.agent |"
    " sed -n 's/^        //p'"
)

datasource_programs = [
    ( mineget_cmd, ['salt'], ALL_HOSTS ),
]

It works! So I point the new datasource at a dozen hosts and... Boom! Just five simultaneous salt-calls from minion to master spike both CPUs, so much so that checks are failing (timeouts) and munin runs are overlapping each other.

I up the VM's CPUs to 4; no dice, it just locks up faster.

Abandon ship! I revert all HOSTS back to ssh and spend the next hour cleaning up nagios caches and notifications.

Lesson 3: redis scales, but doesn't work

I try --return redis. I install python-redis on all minions and again restart all salt-minions (a multistep process). It doesn't work. Maybe it is --return redis_return? Some docs say so. No go.
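For reference, the redis returner wants its connection settings in each minion's config, something like this (host and port here are illustrative):

redis.db: '0'
redis.host: redis.example.com
redis.port: 6379

A quick smoke test would then be:

salt '*' test.ping --return redis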

I give up on redis for now. I still think this is the way to go. Someday I'll post again with redis success.

Lesson 4: Batch processing still rules

The next day I thought . o O ( Maybe I can just make one salt-call for all minions and parse a json dump? ) So I did it.

$ cat /etc/cron.d/get_mined_check_mk_agents

*/1 * * * * root salt-call mine.get '*' check_mk.agent --out=json > /tmp/cmk.json

$ cat /etc/check_mk/mine_agent.py

#!/usr/bin/env python
import sys
import json

data = json.load(open("/tmp/cmk.json"))
print data["local"][sys.argv[1]]

$ grep mine_agent /etc/check_mk/main.mk

('/etc/check_mk/mine_agent.py <HOST>', ['mine'], ALL_HOSTS),

This works! But check_mk complains about stale data. I revert back to ssh and call it a night.

Lesson 5: Minion's mine_interval configuration

The default mine interval is 60 minutes, which at least makes my previous observations make sense. OK, I set it to 1 minute along with the other mine-related pillar data.

Wrong.

Let me save you hours. Through a lot of trial and error, and IRC, I decipher the mine_interval docs. Unlike mine_functions, mine_interval must be set in the minion's runtime config. The option is ignored in grains or pillar data.

This causes no end of trouble, because you can't set a config value on a running salt-minion. Doing so requires restarting the salt-minion, and that isn't exactly a science yet. So how do you get this config option into newly salt-cloud'ed servers?
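For brand-new machines, salt-cloud can bake minion config right into the profile, so something along these lines should work (the profile and provider names are placeholders):

web-server:
  provider: my-cloud-provider
  minion:
    mine_interval: 1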

Configure all minions to set mine_interval to one minute:

salt '*' file.touch /etc/salt/minion.d/mine.conf
salt '*' file.append /etc/salt/minion.d/mine.conf "mine_interval: 1"

salt -G os:debian cmd.run '/etc/init.d/salt-minion stop; /etc/init.d/salt-minion start;'

salt -G os:ubuntu service.full_restart salt-minion

Confirm that the settings took:

salt '*' config.get mine_interval --out json --static

At this point I'm pretty close. There remains a race condition where the cron job rewrites cmk.json while check_mk is reading it. I've decided to keep the cron job running over the weekend while reading up on locking.

Lesson 6: file locking, revisited

I already knew how to solve read/write race conditions in shell scripts, but now was the time to learn how to flock in Python.
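For comparison, the shell version of that trick looks roughly like this (a sketch, not my actual script):

#!/bin/sh
# hold a shared lock on fd 9 while reading the cached data
(
  flock -s 9
  cat /dev/shm/cmk.json
) 9>/dev/shm/cmk.json.lock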

While I was at it I added do_update() to handle the mine.get '*' call and switched from /tmp/cmk.json to the RAM filesystem at /dev/shm/cmk.json. That shaved 100ms per check, and those milliseconds add up.

$ cat /etc/cron.d/cron_get_mined_agents

*/1 * * * * root /etc/check_mk/mine_agent.py --update

$ cat /etc/check_mk/mine_agent.py

#!/usr/bin/env python
import sys
import json
import fcntl

DATAFILE = "/dev/shm/cmk.json"
NAG_UID = 105   # uid of the nagios user
NAG_GID = 107   # gid of the nagios group

def do_update():
    ''' Fetch mined check_mk output for all minions and cache it. '''
    import os
    import salt.client

    caller = salt.client.Caller()
    data = caller.function('mine.get', '*', 'check_mk.agent')

    # Take an exclusive lock so readers never see a half-written file.
    lockfile = open(DATAFILE + ".lock", "w")
    fcntl.flock(lockfile, fcntl.LOCK_EX)

    datafile = open(DATAFILE, "w")
    datafile.write(json.dumps(data))
    datafile.close()

    # Let the nagios user read both files.
    for f in (DATAFILE, DATAFILE + ".lock"):
        os.chmod(f, 0644)
        os.chown(f, NAG_UID, NAG_GID)

def get_agent(minion):
    ''' Return the cached check_mk output for one minion. '''
    # Shared lock: readers can overlap each other, but not an update.
    lockfile = open(DATAFILE + ".lock", "w")
    fcntl.flock(lockfile, fcntl.LOCK_SH)

    data = json.load(open(DATAFILE))
    return data[minion]

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print "Usage: mine_agent.py --update | <minion id>"
    elif sys.argv[1] in ['--update', '-u']:
        do_update()
    else:
        minion = sys.argv[1]
        print get_agent(minion)

Notice that the ["local"] top-level key is missing when mine.get is called from within a Python script. That was a surprise.

Sub-second Performance

check_mk now returns 20 checks for a single host in under a second:

$ time check_mk -v zinc
Check_mk version 1.1.12p7
Calling external programm /etc/check_mk/mine_agent.py zinc
CPU load             OK - 15min Load 0.05 at 8 CPUs                         
CPU utilization      OK - user: 0.3%, system: 0.0%, wait: 0.0%              
Disk IO SUMMARY      OK - 0.00B/sec read, 2.36KB/sec write                  
Kernel Context Switches OK - 176/s in last 60 secs                             
Kernel Major Page Faults OK - 0/s in last 60 secs                               
Kernel Process Creations OK - 1/s in last 60 secs                               
Memory used          OK - 0.22 GB used (0.21 GB RAM + 0.01 GB SWAP, this is 22.3% of 0.98 GB RAM)
Mount options of /   OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/html OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/logs OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Number of threads    OK - 136 threads                                       
TCP Connections      OK - ESTABLISHED: 2, TIME_WAIT: 3                      
Uptime               OK - up since Fri Apr 18 23:40:10 2014 (107d 23:16:21) 
Vmalloc address space OK - total 120.0 MB, used 5.6 MB, largest chunk 114.3 MB
fs_/                 OK - 30.5% used (5.90 of 19.3 GB), (levels at 92.0/95.0%), trend: +431.14KB / 24 hours
proc_MySQL           OK - 1 processes                                       
proc_Nginx           OK - 1 processes                                       
proc_beaver          OK - 3 processes                                       
proc_openvpn         OK - 1 processes                                       
proc_salt-minion     OK - 2 processes                                       
OK - Agent version 1.2.0p3, execution time 0.7 sec|execution_time=0.663

Notes

2014-07-17 - Daniel Jagszent pointed out that users should be aware that salt-mined data is available to all minions. I don't see an issue with that in my architecture, but caveat emptor.

2014-08-04 - This is now my daily driver for gathering check_mk output from all minions.
