Hands-on: Monitoring Linux System Resource Usage with Nagios NSCA -- Nagios Configuration -- Monitored Host

Date: June 11, 2015

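Nagios requires the monitored host to send data to the Nagios server periodically, in an agreed format. There are two kinds of checks: host checks and service checks.

The agreed data format for host checks is: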
[<timestamp>] PROCESS_HOST_CHECK_RESULT;<host_name>;<host_status>;<plugin_output>

The format is easy to understand: the submission timestamp, the monitored host name, the host status (UP/DOWN/UNREACHABLE), and plugin-defined extra data. Each field is explained below:
1. timestamp is the time in time_t format (seconds since the UNIX epoch) that the host check was performed (or submitted). Please note the single space after the right bracket.
2. host_name is the short name of the host (as defined in the host definition)
3. host_status is the status of the host (0=UP, 1=DOWN, 2=UNREACHABLE)
4. plugin_output is the text output of the host check

The agreed data format for service checks is:
[<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<svc_description>;<return_code>;<plugin_output>

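The fields are the submission timestamp, the monitored host name, the monitored service name, the service status (OK/WARNING/CRITICAL/UNKNOWN), and plugin-defined extra data. Each field is explained below: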

1. timestamp is the time in time_t format (seconds since the UNIX epoch) that the service check was performed (or submitted). Please note the single space after the right bracket.
2. host_name is the short name of the host associated with the service in the service definition
3. svc_description is the description of the service as specified in the service definition
4. return_code is the return code of the check (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
5. plugin_output is the text output of the service check (i.e. the plugin output)
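
The two formats above can be sketched as a pair of small Python helpers. This is only an illustration: the function names are hypothetical and not part of Nagios or of our scripts, and the sketch assumes that none of the fields contain a semicolon.

```python
import time


def _check_result(command, fields, timestamp=None):
    """Join the fields of one external command line with semicolons."""
    if timestamp is None:
        timestamp = int(time.time())  # time_t: seconds since the UNIX epoch
    # Note the single space after the right bracket.
    return "[%d] %s;%s" % (timestamp, command, ";".join(str(f) for f in fields))


def host_check_result(host_name, host_status, plugin_output, timestamp=None):
    """host_status: 0=UP, 1=DOWN, 2=UNREACHABLE."""
    if host_status not in (0, 1, 2):
        raise ValueError("host_status must be 0, 1 or 2")
    return _check_result("PROCESS_HOST_CHECK_RESULT",
                         [host_name, host_status, plugin_output], timestamp)


def service_check_result(host_name, svc_description, return_code,
                         plugin_output, timestamp=None):
    """return_code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    return _check_result("PROCESS_SERVICE_CHECK_RESULT",
                         [host_name, svc_description, return_code, plugin_output],
                         timestamp)


host_line = host_check_result("192.168.0.6", 0, "PING OK", timestamp=1402018148)
svc_line = service_check_result("192.168.0.6", "CPU_Status", 0,
                                "STATISTICS OK : idle=35.02%", timestamp=1402018148)
print(host_line)
print(svc_line)
```

Passing an explicit timestamp is only for reproducible output; a real sender would normally let it default to the current time.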

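The plugin-defined extra data deserves special mention. Its purpose is to give Nagios more detail about the monitored host, including status details or the reason for a failure, as well as performance data. Its format is: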

SERVICE STATUS: First line of output | First part of performance data

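The output part can be customized to show more detailed check data; it appears in the Status Information column in Nagios. The performance data appears in the Performance Data column and has its own format requirements: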

'label'=value[UOM];[warn];[crit];[min];[max] 'label'=value[UOM];[warn];[crit];[min];[max]

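Each 'label'=value group is separated from the next by a space. In our system most performance data leaves warn, crit, min, and max empty. For the meaning of these fields and of UOM, see: https://nagios-plugins.org/doc/guidelines.html#PLUGOUTPUT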
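
As a rough illustration of the performance data format, the sketch below builds one 'label'=value[UOM];[warn];[crit];[min];[max] item and joins several items with spaces. The helper name format_perfdata is an assumption made for this example, and the two-decimal formatting simply mirrors the CPU sample data; it is not required by the format.

```python
def format_perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Build one perfdata item; empty strings leave a field unset."""
    if "=" in label or "'" in label:
        raise ValueError("label must not contain '=' or a single quote")
    if " " in label:
        # Quotes are optional in general but needed when the label has spaces.
        label = "'%s'" % label
    return "%s=%.2f%s;%s;%s;%s;%s" % (label, value, uom, warn, crit, minimum, maximum)


cpu_items = [format_perfdata("user", 52.02, "%"),
             format_perfdata("system", 9.72, "%")]
cpu_perfdata = " ".join(cpu_items)  # items are separated by a single space
mem_perfdata = format_perfdata("usedMemory", 37.2979746597, "%", warn=80, crit=90)
print(cpu_perfdata)
print(mem_perfdata)
```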

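In our system, services use passive checks and hosts use an active ping check. The rest of this article describes how we monitor CPU, memory, IO, and network usage on the monitored hosts, using CPU data collection as the main example: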

CPU

First, here is what the submitted data looks like:
[1402018148] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;CPU_Status;0;STATISTICS OK : user=52.02% system=9.72% iowait=0.20% stealed=0.20% idle=35.02% | user=52.02%;;;; system=9.72%;;;; iowait=0.20%;;;; stealed=0.20%;;;; idle=35.02%;;;;
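Using the format described above, we can read this as a service check result submitted by the monitored host at 1402018148 (UNIX time), from host 192.168.0.6. The service description is CPU_Status and the return code is 0 (OK). The output says that the statistics were collected successfully: in this check, user processes used 52.02% of the CPU, the kernel used 9.72%, and 35.02% was idle, so the WARNING threshold (70% non-idle CPU usage) was not reached. After the | symbol, the performance data is sent in the performance data format. We only care about the CPU usage values themselves and not the other perfdata fields, so those are left empty.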
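
To make the reading above concrete, here is a small sketch that splits such a line back into its fields. The function name and the returned dictionary layout are illustrative assumptions, and the parser assumes perfdata labels contain no spaces, which holds for the data in this article.

```python
import re

LINE_RE = re.compile(
    r"^\[(\d+)\] PROCESS_SERVICE_CHECK_RESULT;([^;]*);([^;]*);(\d);(.*)$")


def parse_service_check_result(line):
    """Split one service check result line into its fields."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("not a PROCESS_SERVICE_CHECK_RESULT line")
    timestamp, host_name, svc_description, return_code, plugin_output = match.groups()
    # Everything after the first | is performance data.
    text, _, perf = plugin_output.partition("|")
    perfdata = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        perfdata[label.strip("'")] = rest.split(";")[0]  # keep value[UOM] only
    return {
        "timestamp": int(timestamp),
        "host_name": host_name,
        "svc_description": svc_description,
        "return_code": int(return_code),
        "output": text.strip(),
        "perfdata": perfdata,
    }


sample = ("[1402018148] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;CPU_Status;0;"
          "STATISTICS OK : user=52.02% system=9.72% iowait=0.20% stealed=0.20% "
          "idle=35.02% | user=52.02%;;;; system=9.72%;;;; iowait=0.20%;;;; "
          "stealed=0.20%;;;; idle=35.02%;;;;")
result = parse_service_check_result(sample)
print(result["host_name"], result["svc_description"], result["return_code"])
print(result["perfdata"])
```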

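Now for the implementation.
1. Data collection. The script collects data from /proc/stat. We will not go into much detail about /proc/stat here, only outline its structure: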

nagios:/usr/local/nagios/var/rw # cat /proc/stat
cpu  3793272 14468 2247200 1691101331 60064 0 929 363857 0
cpu0 947929 2161 571867 422672541 57686 0 569 142613 0
cpu1 1012031 5207 579405 422725361 828 0 121 72264 0
cpu2 953097 4324 557950 422803715 752 0 117 75309 0
cpu3 880213 2775 537976 422899713 797 0 120 73670 0
intr 395106792 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
......

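Field: Meaning
user: CPU time spent in user mode, accumulated from system boot to the present (unit: jiffies), excluding niced processes. Typically 1 jiffy = 0.01 seconds.
nice: CPU time spent in user mode by niced (lowered-priority) processes, accumulated from system boot to the present (unit: jiffies)
system: time spent in kernel mode, accumulated from system boot to the present (unit: jiffies)
idle: idle time, not counting disk IO wait, accumulated from system boot to the present (unit: jiffies)
iowait: time spent waiting for disk IO, accumulated from system boot to the present (unit: jiffies)
irq: time spent servicing hardware interrupts, accumulated from system boot to the present (unit: jiffies)
softirq: time spent servicing software interrupts, accumulated from system boot to the present (unit: jiffies)
steal_time: Stolen time, which is the time spent in other operating systems when running in a virtualized environment
guest: Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel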

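Our approach is simple: read the first (aggregate) cpu line, sample /proc/stat twice 5 seconds apart, take the difference between the two samples, and compute the share of total CPU time for the fields we care about: user, system, iowait, idle, and steal_time.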
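
The calculation just described can be sketched roughly as follows. This is not our production script: the function names are hypothetical, and the sketch deliberately leaves the guest column out of the total, on the assumption that the kernel already counts guest time inside user. The second sample line is made up so the arithmetic is easy to follow.

```python
import time

# Columns of the aggregate "cpu" line that are summed into the total.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal_time")


def parse_cpu_line(line):
    """Turn the first line of /proc/stat into a dict of jiffy counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Share of each field in the CPU time elapsed between two samples."""
    deltas = dict((name, after[name] - before[name]) for name in before)
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no CPU time elapsed between samples")
    return dict((name, 100.0 * delta / total) for name, delta in deltas.items())


def sample_cpu(interval=5, path="/proc/stat"):
    """Read /proc/stat twice, interval seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return cpu_percentages(before, after)


# Worked example on two fixed lines; the second one is hypothetical.
first = parse_cpu_line("cpu  3793272 14468 2247200 1691101331 60064 0 929 363857 0")
second = parse_cpu_line("cpu  3793532 14468 2247250 1691101506 60074 0 929 363862 0")
cpu_stat_map = cpu_percentages(first, second)
cpu_usage_percent = 100.0 - cpu_stat_map["idle"]  # non-idle share
print(cpu_stat_map["user"], cpu_stat_map["system"], cpu_stat_map["idle"])
```

On a monitored Linux host, sample_cpu() would replace the two fixed lines.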

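2. Build the data to send according to the thresholds for the four service states

CPU usage at or above 90%: CRITICAL

CPU usage at or above 70% and below 90%: WARNING

CPU usage below 70%: OK

CPU usage could not be obtained: UNKNOWN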

# Check if CPU Usage is Critical/Warning/OK
# cpu_usage_percent is the non-idle share of CPU time computed in step 1
if cpu_usage_percent >= 90:
    return_code = pynsca.CRITICAL
    plugin_output = 'STATISTICS CRITICAL : '
elif cpu_usage_percent >= 70:
    return_code = pynsca.WARNING
    plugin_output = 'STATISTICS WARNING : '
elif cpu_usage_percent >= 0:
    return_code = pynsca.OK
    plugin_output = 'STATISTICS OK : '
else:
    # No valid value could be obtained
    return_code = pynsca.UNKNOWN
    plugin_output = 'STATISTICS UNKNOWN : '

3. Send the data to the Nagios server

plugin_output += 'user=%(user).2f%% system=%(system).2f%% iowait=%(iowait).2f%% stealed=%(steal_time).2f%% idle=%(idle).2f%% | user=%(user).2f%%;;;; system=%(system).2f%%;;;; iowait=%(iowait).2f%%;;;; stealed=%(steal_time).2f%%;;;; idle=%(idle).2f%%;;;;' % cpu_stat_map

#print plugin_output
nscaClient = pynsca.NSCANotifier(nagios_address)
nscaClient.svc_result(cmd_options.local_address, service_description, return_code, plugin_output)

4. Add a crontab job that sends the CPU data to the Nagios server once a minute.

*/1 * * * * /home/nagios/check_cpu_status.py >/dev/null 2>&1

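Note: a real script also has to consider portability, so required values such as the monitored host name and the thresholds must be parameterized. It should also allow for variety in what is monitored, for example monitoring only one core. The >/dev/null 2>&1 redirect is there to suppress crontab's mail notifications.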

Memory

[1402017111] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;Memory_Status;0;OK: Used memory percentage is 37.2979746597% (2935 MiB) | usedMemory=37.2979746597%;80;90;;

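The memory check reads /proc/meminfo once a minute, computes the memory in use as total - free - buffers - cached, and takes that as a percentage of total memory. Above 80% it reports WARNING, and above 90% it reports CRITICAL.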

nagios:/usr/local/nagios/var/rw # cat /proc/meminfo
MemTotal:        8401536 kB
MemFree:         6881104 kB
Buffers:          190732 kB
Cached:           497344 kB
*/1 * * * * /home/nagios/check_mem.py >/dev/null 2>&1
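
As a minimal sketch of the total - free - buffers - cached calculation, the snippet below parses /proc/meminfo text and computes the used percentage. The helper names are assumptions made for this example, and the input is the excerpt shown above rather than a live read of /proc/meminfo.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def used_memory(info):
    """Return (used kB, used percentage) as total - free - buffers - cached."""
    used_kb = info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]
    return used_kb, 100.0 * used_kb / info["MemTotal"]


meminfo_text = """MemTotal:        8401536 kB
MemFree:         6881104 kB
Buffers:          190732 kB
Cached:           497344 kB
"""
used_kb, used_percent = used_memory(parse_meminfo(meminfo_text))
print("used: %d kB (%.2f%%)" % (used_kb, used_percent))
```

On the monitored host the text would come from reading /proc/meminfo itself; the percentage is then compared against the 80% and 90% thresholds.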

Disk

[1383817502] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;Disk_Status;0;STATISTICS OK : "MOUNT/"Usage=40% | "MOUNT/"Usage=40%;80;90;;

The disk check runs "df -h" every ten minutes and looks at the Use% column. Above 80% it reports WARNING, and above 90% it reports CRITICAL.
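
One way the Use% column could be extracted is sketched below. The sample df output and the helper names are hypothetical, the sketch assumes mount points contain no spaces, and it treats "at or above" the threshold as exceeded, following the CPU code earlier in this article.

```python
def parse_df_usage(text):
    """Map each mount point to its Use% value from df output."""
    usage = {}
    for line in text.splitlines()[1:]:  # skip the header line
        parts = line.split()
        # Use% is the second-to-last column, the mount point is the last.
        if len(parts) < 2 or not parts[-2].endswith("%"):
            continue
        usage[parts[-1]] = int(parts[-2].rstrip("%"))
    return usage


def disk_state(percent, warn=80, crit=90):
    """Return the Nagios return code for a usage percentage."""
    if percent >= crit:
        return 2  # CRITICAL
    if percent >= warn:
        return 1  # WARNING
    return 0      # OK


# Hypothetical df -h output, for illustration only.
df_text = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   19G   29G  40% /
/dev/sda2       200G  170G   30G  85% /data
"""
disk_usage = parse_df_usage(df_text)
for mount, percent in sorted(disk_usage.items()):
    print(mount, percent, disk_state(percent))
```

In a real check the text would come from running df, for example through the subprocess module.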

IO & IOPS
[1383817420] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;IO_Status;0;STATISTICS OK : await=0.00% util=0.00% | await=0.00%;;;; util=0.00%;;;;

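The IO check runs "iostat -xkd" once a minute to get each disk's rkB/s (read rate), wkB/s (write rate), await (average response time of requests), and %util (percentage of time during which I/O requests were issued to the device, that is, device utilization).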

[1399532518] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;IOPS_Status;2;STATISTICS CRITICAL : iops=58.0 | iops=58.0;;;;

The IOPS check runs "iostat" once a minute to get the disk IO tps (transfers per second) figure.

Note: the sysstat package must be installed.

For detailed usage of the iostat command, see the iostat documentation.
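
As a rough sketch of how the tps figure could be pulled out of iostat output, the snippet below locates the tps column by its header name, since the column layout differs between sysstat versions. The sample output and the function name are assumptions made for illustration only.

```python
def parse_iostat_tps(text):
    """Map each device to its tps value from iostat device-report output."""
    tps = {}
    tps_index = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            tps_index = None  # a blank line ends the current device table
            continue
        if parts[0].startswith("Device"):
            # Find the tps column by name instead of by fixed position.
            tps_index = parts.index("tps") if "tps" in parts else None
            continue
        if tps_index is not None and len(parts) > tps_index:
            tps[parts[0]] = float(parts[tps_index])
    return tps


# Hypothetical iostat output, for illustration only.
iostat_text = """Linux 3.0.13 (nagios)   05/08/14   _x86_64_

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              58.00       120.50       310.25    1234567    7654321

"""
device_tps = parse_iostat_tps(iostat_text)
iops = sum(device_tps.values())
print("iops=%.1f" % iops)
```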

Bandwidth

[1399532517] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;Bandwidth_Status_eth1;0;STATISTICS OK : - The Traffic In is 0.8Kbps, Out is 0.53Kbps, Total is 1.33Kbps.|In=0.8Kbps;;;0;0 Out=0.53Kbps;;;0;0 Total=1.33Kbps;;;0;0

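Bandwidth works like CPU: take the difference in the tx, rx, and total values from /proc/net/dev over a period of time and divide by the length of that period to get the throughput.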
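
The calculation above might look like the following sketch. The function names and the two counter samples are hypothetical, and the sketch assumes that 1 Kbps means 1000 bits per second; our real script may use a different convention.

```python
def parse_net_dev(text):
    """Map each interface to its (rx_bytes, tx_bytes) counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 9:
            continue  # the two header lines contain no colon
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_kbps(before, after, interval_seconds):
    """Throughput between two (rx_bytes, tx_bytes) samples, in Kbps."""
    rx = (after[0] - before[0]) * 8 / 1000.0 / interval_seconds
    tx = (after[1] - before[1]) * 8 / 1000.0 / interval_seconds
    return {"In": rx, "Out": tx, "Total": rx + tx}


HEADER = ("Inter-|   Receive                     |  Transmit\n"
          " face |bytes packets errs drop fifo frame compressed multicast"
          "|bytes packets errs drop fifo colls carrier compressed\n")
# Two hypothetical samples taken 5 seconds apart.
first = parse_net_dev(HEADER + "  eth1: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n")
second = parse_net_dev(HEADER + "  eth1: 1500 15 0 0 0 0 0 0 2330 25 0 0 0 0 0 0\n")
rates = throughput_kbps(first["eth1"], second["eth1"], 5)
print("In is %.2fKbps, Out is %.2fKbps, Total is %.2fKbps"
      % (rates["In"], rates["Out"], rates["Total"]))
```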

JVM Heap

[1399533015] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;Heap_Status;0;OK: Used heap percentage is 50.3057759255% (1054988 MiB) | usedHeap=50.3057759255%;90;98;;

Our system runs on WebLogic, so JVM information can be obtained through weblogic.Admin (weblogic.Admin GET -pretty -type JRockitRuntime | egrep 'FreeHeap|UsedHeap|TotalHeap'). The information the system currently cares about most is heap usage.

[Figure: the final data as shown in Nagios, taking one host and that host's CPU service data as an example]

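Aside: many customers have their own monitoring systems. These do more than monitor the usage of hardware resources such as CPU and memory. They also scan the system logs for errors and warnings, and scan the ports and services that the system has open to check whether they comply with security policy, for example whether there is an FTP service that allows anonymous login. When designing the overall system, be very careful about how many services the operating system should expose, and design the log output with care as well.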

Part 1: Introduction to Nagios
Part 2: Nagios Configuration -- Nagios Server
