My journey into SaltStack's salt mine, Python's file locking, and check_mk's alternative datasources.
Lesson 1: salt-call peer.publish doesn't scale
My first attempt was over a year ago in 2013. I used Salt's peer publishing
interface from the monitoring server to target minions. Each minion then ran
check_mk_agent and returned the results. This didn't scale well, fell on its
face, and resulted in nagios waking me up to ACKNOWLEDGE false alarms. I
reverted and gave up.
Lesson 2: salt-call mine.get doesn't scale
To configure salt to mine data you need to list what to run in the mine_functions option. You can set this in /etc/salt/minion, grains, or pillars. I went with pillars, since that seems like the best place to put it.
Add saltmine.sls to the list of pillars in /etc/pillar/top.sls:
base:
  '*':
    - saltmine
Now create /srv/pillar/saltmine.sls:
mine_functions:
  check_mk.agent:
  network.interfaces:
  status.uptime:
I mine status.uptime to confirm that mine.update is actually running; network.interfaces is also handy.
Previous attempts to finagle salt.mine '*' cmd.run check_mk_agent exposed a "feature": you can only have one of each function signature in your salt.mine configuration. Want to mine more than one cmd.run, you say? Start coding. I opted to create my own salt module to namespace it:
$ cat /srv/salt/_modules/
#!/usr/bin/env python
'''
Support for running check_mk_agent over salt
'''
import os
import salt.utils
from salt.exceptions import SaltException

def __virtual__():
    '''
    Only load the module if check_mk_agent is installed
    '''
    if os.path.exists('/usr/bin/check_mk_agent'):
        return 'check_mk'
    return False

def agent():
    '''
    Return the output of check_mk_agent
    '''
    return __salt__['cmd.run']('/usr/bin/check_mk_agent')
Push both module and pillar data to all minions:
salt '*' saltutil.sync_all
Confirm mine_functions are recognized by all minions:
salt '*' config.get mine_functions
Minions update mine data at startup and every mine_interval minutes thereafter. Force 20 minions at a time to run mine.update:
salt '*' -b 20 mine.update
Confirm you can retrieve mined data for minion zinc:
salt-call mine.get zinc check_mk.agent
Finally on the monitoring server I configure /etc/check_mk/main.mk to mine data:
mineget_cmd = (
    "sudo salt-call mine.get <HOST> check_mk.agent |"
    " sed -n 's/^ //p'"
)
datasource_programs = [
    ( mineget_cmd, ['salt'], ALL_HOSTS ),
]
It works! So I set the new datasource on a dozen hosts and... Boom! Just five simultaneous salt-calls from minion to master spike both CPUs, so much so that checks fail with timeouts and munin runs overlap each other.
I up the VM's CPUs to 4. No dice, it just locks up faster.
Abandon ship! I revert all HOSTS back to ssh and spend the next hour cleaning up nagios caches and notifications.
Lesson 3: redis scales, but doesn't work
--return redis. I install python-redis on all minions. I again restart
all salt-minions (a multistep process). It doesn't work. Maybe it is
redis_return? Some docs say so. No go.
I give up on redis for now. I still think this is the way to go. Someday I'll post again with redis success.
Lesson 4: Batch processing still rules
The next day I thought . o O ( Maybe I can just make one salt-call for all minions and parse a json dump? ) So I did it.
$ cat /etc/cron.d/get_mined_check_mk_agents
*/1 * * * * root salt-call mine.get '*' check_mk.agent --out=json > /tmp/cmk.json
$ cat /etc/check_mk/mine_agent.py
#!/usr/bin/env python
import sys, json
a = json.load(open("/tmp/cmk.json", "r"))
# salt-call nests the results under a top-level "local" key
print(a["local"][sys.argv[1]])
$ grep mine_agent /etc/check_mk/main.mk
('/etc/check_mk/mine_agent.py <HOST>', ['mine'], ALL_HOSTS),
This works! But it is complaining of stale data. I revert back to ssh and call it a night.
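A slightly more defensive version of that lookup might look like the sketch below. It is an illustration under assumptions, not the script above: the function name lookup_agent_output and the error handling are made up for the example, and the only thing taken from the post is the {"local": {minion: output}} layout of the salt-call dump.

```python
import json


def lookup_agent_output(dump_path, minion):
    """Return the mined check_mk_agent output for one minion.

    Assumes the dump has the layout written by
    `salt-call mine.get '*' check_mk.agent --out=json`,
    i.e. {"local": {"<minion id>": "<agent output>"}}.
    """
    with open(dump_path) as handle:
        dump = json.load(handle)
    minions = dump.get("local", {})
    if minion not in minions:
        # Fail loudly instead of printing nothing for an unknown host
        raise KeyError("no mined data for minion %r" % minion)
    return minions[minion]
```

Raising on a missing minion makes a host that has never reported to the mine show up as an error rather than as an empty agent output.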
Lesson 5: Minion's mine_interval configuration
The default mine interval is 60 minutes, which at least made my previous observations make sense. OK, I set it to 1 minute along with the other mine-related pillar data.
Let me save you hours. Through a lot of trial and error, and IRC, I decipher the mine_interval docs. Unlike mine_functions, mine_interval must be set in the minion's runtime config. The configuration option is ignored in grains or pillar data.
This causes no end of trouble, because you can't set a config value on a running salt-minion. Changing it requires restarting the salt-minion -- and that isn't exactly a science yet. So how to get this config option into newly salt-cloud'ed servers?
Configure all minions to set mine_interval to one minute:
salt '*' file.touch /etc/salt/minion.d/mine.conf
salt '*' file.append /etc/salt/minion.d/mine.conf "mine_interval: 1"
salt -G os:debian cmd.run '/etc/init.d/salt-minion stop; /etc/init.d/salt-minion start;'
salt -G os:ubuntu service.full_restart salt-minion
Confirm that the settings took:
salt '*' config.get mine_interval --out json --static
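With more than a handful of minions, eyeballing that output gets tedious. The sketch below picks out the minions that did not take the setting. It assumes that --out json --static prints a single JSON object mapping each minion id to its value; the function name and the sample minion ids other than zinc are made up for illustration.

```python
import json


def minions_with_wrong_interval(static_json, expected=1):
    """Return sorted minion ids whose mine_interval differs from `expected`.

    Assumes `static_json` is the text printed by
    `salt '*' config.get mine_interval --out json --static`,
    i.e. one JSON object mapping minion id to the configured value.
    """
    reported = json.loads(static_json)
    return sorted(
        minion for minion, value in reported.items() if value != expected
    )


if __name__ == "__main__":
    sample = '{"zinc": 1, "iron": 60, "lead": 1}'  # made-up sample data
    print(minions_with_wrong_interval(sample))
```

Anything that is not exactly the expected value is flagged, which also catches a minion that returned an error string instead of a number.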
At this point I'm pretty close. There remains a race condition where the cron
job can overwrite cmk.json while check_mk is reading it. I've decided to keep
the cron job running over the weekend while reading up on locking.
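For comparison, one common way to avoid a reader ever seeing a half-written file, without any locking, is to write to a temporary file and rename it over the target. This is a different technique from the flock approach in the next lesson, shown only as a sketch: the function name is made up, and it relies on rename being atomic within one filesystem on POSIX.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write `data` as JSON to `path` without exposing a partial file.

    The data goes to a temporary file in the same directory, which is then
    renamed over `path`. A reader sees either the old file or the new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        # mkstemp creates the file 0600; open it up so another user
        # (for example the one running the checks) can read it
        os.chmod(tmp_path, 0o644)
        os.rename(tmp_path, path)
    except Exception:
        # Clean up the temporary file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temporary file has to live in the same directory as the target, because a rename across filesystems is not atomic.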
Lesson 6: file locking, revisited
I already knew how to solve read/write race conditions in shell scripts but now was the time to learn how to flock in python.
While I was at it I added do_update() to handle the mine.get '*' call and switched from /tmp/cmk.json to the RAM filesystem at /dev/shm/cmk.json. I shaved 100ms per check by switching to /dev/shm.
$ cat /etc/cron.d/cron_get_mined_agents
*/1 * * * * root /etc/check_mk/mine_agent.py --update
$ cat mine_agent.py
#!/usr/bin/env python
import sys
import json
import fcntl

DATAFILE = "/dev/shm/cmk.json"
NAG_UID = 105
NAG_GID = 107

def do_update():
    import os
    import salt.client
    caller = salt.client.Caller()
    data = caller.function('mine.get', '*', 'check_mk.agent')
    # Hold an exclusive lock while the data file is rewritten
    lockfile = open(DATAFILE + ".lock", "w")
    fcntl.flock(lockfile, fcntl.LOCK_EX)
    datafile = open(DATAFILE, "w")
    datafile.write(json.dumps(data))
    datafile.close()
    for f in (DATAFILE, DATAFILE + ".lock"):
        os.chmod(f, 0o644)
        os.chown(f, NAG_UID, NAG_GID)
    lockfile.close()  # closing the lock file releases the lock

def get_agent(minion):
    # A shared lock admits many readers but waits while do_update() writes
    lockfile = open(DATAFILE + ".lock", "w")
    fcntl.flock(lockfile, fcntl.LOCK_SH)
    data = json.load(open(DATAFILE))
    return data[minion]

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: mine_agent.py --update | <minion id>")
    elif sys.argv[1] in ['--update', '-u']:
        do_update()
    else:
        minion = sys.argv[1]
        print(get_agent(minion))
Notice that the ["local"] top-level key is missing when mine.get is called from within a python script. That was a surprise.
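If the same reader ever has to cope with both layouts, the difference can be papered over with a small helper. This is only a sketch: the name unwrap_mine_data is made up, and it assumes no minion is itself called "local".

```python
def unwrap_mine_data(data):
    """Return the {minion id: agent output} mapping from either layout.

    The salt-call JSON dump nests the mapping under a top-level "local"
    key, while the result of Caller().function('mine.get', ...) does not.
    Assumes no minion id is literally "local".
    """
    if (
        isinstance(data, dict)
        and list(data.keys()) == ["local"]
        and isinstance(data["local"], dict)
    ):
        return data["local"]
    return data
```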
check_mk now returns 20 checks for a single host in under a second:
$ time check_mk -v zinc
Check_mk version 1.1.12p7
Calling external programm /etc/check_mk/mine_agent.py zinc
CPU load OK - 15min Load 0.05 at 8 CPUs
CPU utilization OK - user: 0.3%, system: 0.0%, wait: 0.0%
Disk IO SUMMARY OK - 0.00B/sec read, 2.36KB/sec write
Kernel Context Switches OK - 176/s in last 60 secs
Kernel Major Page Faults OK - 0/s in last 60 secs
Kernel Process Creations OK - 1/s in last 60 secs
Memory used OK - 0.22 GB used (0.21 GB RAM + 0.01 GB SWAP, this is 22.3% of 0.98 GB RAM)
Mount options of / OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/html OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Mount options of /home/scponly/sofia/logs OK - mount options are barrier=0,data=writeback,errors=remount-ro,relatime,rw
Number of threads OK - 136 threads
TCP Connections OK - ESTABLISHED: 2, TIME_WAIT: 3
Uptime OK - up since Fri Apr 18 23:40:10 2014 (107d 23:16:21)
Vmalloc address space OK - total 120.0 MB, used 5.6 MB, largest chunk 114.3 MB
fs_/ OK - 30.5% used (5.90 of 19.3 GB), (levels at 92.0/95.0%), trend: +431.14KB / 24 hours
proc_MySQL OK - 1 processes
proc_Nginx OK - 1 processes
proc_beaver OK - 3 processes
proc_openvpn OK - 1 processes
proc_salt-minion OK - 2 processes
OK - Agent version 1.2.0p3, execution time 0.7 sec|execution_time=0.663
2014-07-17 - Daniel Jagszent pointed out that users should be aware that salt-mined data is available to all minions. I don't see an issue with that in my architecture, but caveat emptor.
2014-08-04 - This is now my daily driver for gathering check_mk output from all minions.