My journey into the saltstack's salt mine, python's file locking, and check_mk's alternative datasources.
Lesson 1: salt-call peer.publish doesn't scale
My first attempt was over a year ago in 2013. I used Salt's peer publishing
interface from the monitoring server to target minions. Each minion then ran
check_mk_agent and returned the results. This didn't scale well, fell on its
face, and resulted in nagios waking me up to ACKNOWLEDGE false alarms. I
reverted and gave up.
Lesson 2: salt-call mine.get doesn't scale
To configure salt to mine data you need to list what to run in the mine_functions option. You can set this in /etc/salt/minion, grains, or pillars. I went with pillars; it seemed like the best place to put it.
Add saltmine.sls to the list of pillars in /srv/pillar/top.sls:
base:
  '*':
    - saltmine
Now create /srv/pillar/saltmine.sls:
mine_functions:
  check_mk.agent: []
  network.interfaces: []
  status.uptime: []
I included status.uptime to confirm that mine.update is actually running on
the minions. network.interfaces is also handy.
Previous attempts to finagle salt.mine '*' cmd.run check_mk_agent exposed a
"feature".  You can only have exactly one of each function signature in your
salt.mine configuration. Want to mine more than one cmd.run you say?  Start
coding. I opted to create my own salt module to namespace it:
$ cat /srv/salt/_modules/
#!/usr/bin/env python
''' Support for running check_mk_agent over salt '''
import os
import salt.utils
from salt.exceptions import SaltException
def __virtual__():
    ''' Only load the module if check_mk_agent is installed '''
    if os.path.exists('/usr/bin/check_mk_agent'):
        return 'check_mk'
    return False
def agent():
    ''' Return the output of check_mk_agent '''
    return __salt__['cmd.run']('/usr/bin/check_mk_agent')
Push both module and pillar data to all minions:
salt '*' saltutil.sync_all
Confirm mine_functions are recognized by all minions:
salt '*' config.get mine_functions
Minions update mine data at startup and every mine_interval minutes thereafter. Force 20 minions at a time to run mine.update:
salt '*' -b 20 mine.update
Confirm you can retrieve mined data for minion zinc:
salt-call mine.get zinc check_mk.agent
Finally on the monitoring server I configure /etc/check_mk/main.mk to mine data:
mineget_cmd = (
    "sudo salt-call mine.get <HOST> check_mk.agent |"
    " sed -n 's/^        //p'"
)
datasource_programs = [
    ( mineget_cmd, ['salt'], ALL_HOSTS ),
]
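For readers who don't speak sed: the pipeline in mineget_cmd keeps only the lines that start with eight spaces and strips that prefix, which is how the agent output is peeled out of salt-call's indented display. A rough Python equivalent follows. The function name strip_salt_indent and the sample text are made up for illustration; the sample only assumes the nested layout that the sed expression implies.

```python
def strip_salt_indent(text, indent=8):
    """Mimic sed -n 's/^        //p': keep lines that begin with
    `indent` spaces and drop exactly that prefix."""
    prefix = " " * indent
    kept = [line[indent:] for line in text.splitlines()
            if line.startswith(prefix)]
    return "\n".join(kept)


# Hypothetical salt-call output; the real layout may differ by version.
sample = (
    "local:\n"
    "    ----------\n"
    "    zinc:\n"
    "        <<<check_mk>>>\n"
    "        Version: 1.2.0p3\n"
)

print(strip_salt_indent(sample))
```

Lines indented less than eight spaces (the "local:" wrapper and the minion id) are dropped, so only the agent sections reach check_mk.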
It works! So I set the new datasource on a dozen hosts and... Boom! Just five simultaneous salt-calls from minion to master spike both CPUs, so much so that checks fail with timeouts and munin runs overlap one another.
I up the VM's CPUs to 4. No dice, it just locks up faster.
Abandon ship! I revert all HOSTS back to ssh and spend the next hour cleaning up nagios caches and notifications.
Lesson 3: redis scales, but doesn't work
Next I tried --return redis. I installed python-redis on all minions and again
restarted all salt-minions (a multistep process). It didn't work. Maybe it is
--return redis_return? Some docs say so. No go.
I give up on redis for now. I still think this is the way to go. Someday I'll post again with redis success.
Lesson 4: Batch processing still rules
The next day I thought . o O ( Maybe I can just make one salt-call for all minions and parse a json dump? ) So I did it.
$ cat /etc/cron.d/get_mined_check_mk_agents
*/1 * * * * root salt-call mine.get '*' check_mk.agent --out=json > /tmp/cmk.json
$ cat /etc/check_mk/mine_agent.py
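#!/usr/bin/env python
# Print the mined check_mk agent output for the host given as argv[1].
import json
import sys
with open("/tmp/cmk.json") as dump:
    mined = json.load(dump)
# salt-call --out=json nests the result under a top-level "local" key.
print(mined["local"][sys.argv[1]])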
$ grep mine_agent /etc/check_mk/main.mk
('/etc/check_mk/mine_agent.py <HOST>', ['mine'], ALL_HOSTS),
This works! But it is complaining of stale data. I revert back to ssh and call it a night.
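To make the shape of the cron dump concrete, here is a small sketch of the same lookup with a friendlier failure when a host has no mined data yet. The helper name agent_output and the sample payload are invented for illustration; the only thing taken from the script above is that the JSON maps "local" to a dictionary of minion id to agent output.

```python
import json


def agent_output(dump_text, minion):
    """Return one minion's agent output from a salt-call JSON dump."""
    data = json.loads(dump_text)
    # salt-call --out=json wraps everything in a top-level "local" key.
    hosts = data.get("local", {})
    if minion not in hosts:
        raise KeyError("no mined check_mk.agent data for %r" % minion)
    return hosts[minion]


# Hypothetical dump with a single minion.
sample = json.dumps({"local": {"zinc": "<<<check_mk>>>\nVersion: 1.2.0p3"}})

print(agent_output(sample, "zinc"))
```

A minion that has not run mine.update yet simply won't appear in the dump, so an explicit error is easier to spot in the check_mk output than a bare traceback.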
Lesson 5: Minion's mine_interval configuration
The default mine interval is 60 minutes, which at least explained my previous observations. OK, I set it to 1 minute along with the other mine-related pillar data.
Wrong.
Let me save you hours. Through a lot of trial and error, and IRC, I deciphered
the mine_interval docs. Unlike mine_functions, mine_interval must
be set in the minion's runtime config. The configuration option is ignored in
grains or pillar data.
This causes no end of trouble, because you can't change a config value on a running salt-minion. Doing so requires restarting the salt-minion, and that isn't exactly a science yet. So how to get this config option onto newly salt-cloud'ed servers?
Configure all minions to set mine_interval to one minute:
salt '*' file.touch /etc/salt/minion.d/mine.conf
salt '*' file.append /etc/salt/minion.d/mine.conf "mine_interval: 1"
salt -G os:debian cmd.run '/etc/init.d/salt-minion stop; /etc/init.d/salt-minion start;'
salt -G os:ubuntu service.full_restart salt-minion
Confirm that the settings took:
salt '*' config.get mine_interval --out json --static
At this point I'm pretty close. There remains a race condition where cron
may be rewriting cmk.json while check_mk is reading it. I've decided to keep
the cron job running over the weekend while reading up on locking.
Lesson 6: file locking, revisited
I already knew how to solve read/write race conditions in shell scripts but now was the time to learn how to flock in python.
While I was at it I added do_update() to handle the mine.get '*' call
and switched from /tmp/cmk.json to the RAM filesystem at
/dev/shm/cmk.json. I shaved 100ms per check by switching to /dev/shm. They
add up.
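Before the full script, a minimal sketch of the flock behaviour it relies on: an exclusive lock keeps shared lockers out, and shared locks can coexist. This is a demonstration only. The temporary lock path is made up, and it uses LOCK_NB so that a refused lock raises instead of blocking, which the real script does not do.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file in a throwaway directory.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

# The "writer" takes an exclusive lock, as an updater would.
writer = open(lock_path, "w")
fcntl.flock(writer, fcntl.LOCK_EX)

# A "reader" opens the file separately; flock treats each open() on its own.
reader = open(lock_path, "r")
try:
    fcntl.flock(reader, fcntl.LOCK_SH | fcntl.LOCK_NB)
    reader_blocked = False
except (IOError, OSError):
    # Non-blocking request refused while the exclusive lock is held.
    reader_blocked = True

# Once the writer lets go, shared locks are granted, even several at once.
fcntl.flock(writer, fcntl.LOCK_UN)
fcntl.flock(reader, fcntl.LOCK_SH | fcntl.LOCK_NB)
second_reader = open(lock_path, "r")
fcntl.flock(second_reader, fcntl.LOCK_SH | fcntl.LOCK_NB)
readers_coexist = True

for handle in (writer, reader, second_reader):
    handle.close()  # closing a file drops its lock

print("reader blocked during write: %s" % reader_blocked)
```

Without LOCK_NB the reader would simply wait until the writer finished, which is the behaviour wanted for a check that fires in the middle of an update.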
$ cat /etc/cron.d/cron_get_mined_agents
*/1 * * * * root /etc/check_mk/mine_agent.py --update
$ cat mine_agent.py
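#!/usr/bin/env python
''' Cache mined check_mk agent output and serve it to check_mk '''
import fcntl
import json
import os
import sys

DATAFILE = "/dev/shm/cmk.json"
LOCKFILE = DATAFILE + ".lock"
NAG_UID = 105  # owner uid for the cache files (the nagios user here)
NAG_GID = 107  # owner gid for the cache files

def do_update():
    ''' Fetch mined agent output for all minions and rewrite the cache '''
    import salt.client  # imported here so per-host lookups don't load salt
    caller = salt.client.Caller()
    data = caller.function('mine.get', '*', 'check_mk.agent')
    with open(LOCKFILE, "w") as lockfile:
        # Exclusive lock: readers wait until the new data is fully written.
        fcntl.flock(lockfile, fcntl.LOCK_EX)
        with open(DATAFILE, "w") as datafile:
            json.dump(data, datafile)
        for f in (DATAFILE, LOCKFILE):
            os.chmod(f, 0o644)
            os.chown(f, NAG_UID, NAG_GID)
    # Leaving the with block closes the lock file and releases the lock.

def get_agent(minion):
    ''' Return the cached agent output for one minion '''
    with open(LOCKFILE, "w") as lockfile:
        # Shared lock: many checks may read at once, but never mid-write.
        fcntl.flock(lockfile, fcntl.LOCK_SH)
        with open(DATAFILE) as datafile:
            data = json.load(datafile)
    return data[minion]

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit("Usage: mine_agent.py --update | <minion id>")
    elif sys.argv[1] in ['--update', '-u']:
        do_update()
    else:
        print(get_agent(sys.argv[1]))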
Notice that the ["local"] top-level key is missing when mine.get is
called from within a python script. That was a surprise.
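If the same reader ever has to cope with both shapes (the salt-call JSON dump and the Caller() result), a tiny normalizing helper is one way to do it. This is a sketch, not something the script above needs. The name unwrap_local is invented, and it assumes no minion is literally named "local".

```python
def unwrap_local(data):
    """Return the minion -> output mapping from either result shape."""
    # salt-call --out=json yields {"local": {...}}; Caller() yields {...}.
    if (isinstance(data, dict)
            and list(data.keys()) == ["local"]
            and isinstance(data["local"], dict)):
        return data["local"]
    return data


# Hypothetical payloads in both shapes.
from_cli = {"local": {"zinc": "<<<check_mk>>>"}}
from_api = {"zinc": "<<<check_mk>>>"}

print(unwrap_local(from_cli) == unwrap_local(from_api))
```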
Sub-second Performance
check_mk now returns 20 checks for a single host in under a second:
$ time check_mk -v zinc
Check_mk version 1.1.12p7
Calling external programm /etc/check_mk/mine_agent.py zinc
CPU load             OK - 15min Load 0.05 at 8 CPUs                         
CPU utilization      OK - user: 0.3%, system: 0.0%, wait: 0.0%              
Disk IO SUMMARY      OK - 0.00B/sec read, 2.36KB/sec write                  
Kernel Context Switches OK - 176/s in last 60 secs                             
Kernel Major Page Faults OK - 0/s in last 60 secs                               
Kernel Process Creations OK - 1/s in last 60 secs                               
Memory used          OK - 0.22 GB used (0.21 GB RAM + 0.01 GB SWAP, this is 22.3% of 0.98 GB RAM)
Mount options of /   OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/html OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/logs OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Number of threads    OK - 136 threads                                       
TCP Connections      OK - ESTABLISHED: 2, TIME_WAIT: 3                      
Uptime               OK - up since Fri Apr 18 23:40:10 2014 (107d 23:16:21) 
Vmalloc address space OK - total 120.0 MB, used 5.6 MB, largest chunk 114.3 MB
fs_/                 OK - 30.5% used (5.90 of 19.3 GB), (levels at 92.0/95.0%), trend: +431.14KB / 24 hours
proc_MySQL           OK - 1 processes                                       
proc_Nginx           OK - 1 processes                                       
proc_beaver          OK - 3 processes                                       
proc_openvpn         OK - 1 processes                                       
proc_salt-minion     OK - 2 processes                                       
OK - Agent version 1.2.0p3, execution time 0.7 sec|execution_time=0.663
Notes
2014-07-17 - Daniel Jagszent pointed out that users should be aware that salt-mined data is available to all minions. I don't see an issue with that in my architecture, but caveat emptor.
2014-08-04 - This is now my daily driver for gathering check_mk output from all minions.