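The commonly used node_exporter does not cover every monitoring item, so here we use process-exporter to monitor processes. process-exporter is mainly used for process-level monitoring, for example how many processes a service is running and how much CPU, memory, and other resources they consume.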
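1. Using process-exporter
1.1 Download process-exporter
process-exporter GitHub repository: https://github.com/ncabatoff/process-exporter
process-exporter can be started either with command-line arguments or with a configuration file.
1.2 Configure process-exporter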
vim /home/admin/process-exporter/process_name.yaml

process_names:
  # Catch-all entry that would match every process (disabled):
  # - name: "{{.Matches}}"
  #   cmdline:
  #   - '.+'
  - name: "{{.Matches}}"
    cmdline:
    - 'nginx'
  - name: "{{.Matches}}"
    cmdline:
    - '/opt/atlassian/confluence/bin/tomcat-juli.jar'
  - name: "{{.Matches}}"
    cmdline:
    - 'vsftpd'
  - name: "{{.Matches}}"
    cmdline:
    - 'redis-server'
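cmdline: a pattern that uniquely identifies the target process; it can be found with ps -ef. If no matching process exists, no data is collected for that entry.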
redis 4287 4127 0 Oct31 ? 00:58:12 redis-server *:6379
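For the process above, the name template determines the resulting groupname label:
| Name template | Resulting groupname | Description |
| {{.Comm}} | groupname="redis-server" | name of the executable or sh file |
| {{.ExeBase}} | groupname="redis-server *:6379" | / |
| {{.ExeFull}} | groupname="/usr/bin/redis-server *:6379" | full process information as shown by ps |
| {{.Username}} | groupname="redis" | groups by the user that owns the process |
| {{.Matches}} | groupname="map[:redis]" | the text matched by the configured keyword "redis" |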
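The cmdline matching and the `map[:redis]` group name can be illustrated with a minimal Python sketch. It is not process-exporter's actual code: the function names `match_cmdline` and `render_matches` are hypothetical, and the sketch assumes that every pattern must match the command line as an unanchored regular expression and that the whole match is stored under an empty key, which is what the `map[:redis]` value suggests.

```python
import re

def match_cmdline(patterns, command_line):
    """Return the capture map if every pattern matches, else None."""
    captures = {}
    for pattern in patterns:
        m = re.search(pattern, command_line)  # unanchored regex search
        if m is None:
            return None  # one pattern failed: the process is not selected
        captures[""] = m.group(0)        # assumption: whole match lives under the empty key
        captures.update(m.groupdict())   # named groups, if the pattern defines any
    return captures

def render_matches(captures):
    """Format a dict the way Go prints a map: map[key:value ...], keys sorted."""
    return "map[" + " ".join(f"{k}:{v}" for k, v in sorted(captures.items())) + "]"

ps_command = "redis-server *:6379"  # command column of the ps line above
print(render_matches(match_cmdline(["redis"], ps_command)))
print(match_cmdline(["nginx"], ps_command))
```

Under these assumptions the keyword `redis` yields `map[:redis]`, while a pattern such as `nginx` selects nothing for this process, so no series would be produced for it.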
1.3 Write the startup script (systemd unit)
vim /usr/lib/systemd/system/process_exporter.service

[Unit]
Description=Prometheus exporter for process metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/ncabatoff/process-exporter
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/home/admin/process-exporter
ExecStart=/home/admin/process-exporter/process-exporter -config.path=/home/admin/process-exporter/process_name.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
1.4 Start process-exporter
systemctl daemon-reload
systemctl start process_exporter
systemctl enable process_exporter

Verify the monitoring data
curl http://localhost:9256/metrics
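To check the output programmatically rather than by eye, the per-group process count can be pulled out of the metrics text. The following is a minimal sketch: the helper name `num_procs_by_group` is hypothetical, the sample text is made up for illustration, and it assumes each sample line ends with the value (no trailing timestamp).

```python
import re

METRIC = "namedprocess_namegroup_num_procs"
GROUP_RE = re.compile(r'groupname="((?:[^"\\]|\\.)*)"')

def num_procs_by_group(metrics_text):
    """Map each groupname to its process count from Prometheus text output."""
    result = {}
    for line in metrics_text.splitlines():
        if not line.startswith(METRIC + "{"):
            continue  # skips "# HELP" / "# TYPE" comments and other metrics
        group = GROUP_RE.search(line)
        if group is None:
            continue
        # Assumption: the value is the last space-separated field on the line.
        result[group.group(1)] = float(line.rsplit(" ", 1)[1])
    return result

# Made-up sample, shaped like the exporter's output for two groups.
SAMPLE = """\
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="map[:nginx]"} 3
namedprocess_namegroup_num_procs{groupname="map[:redis-server]"} 1
"""
print(num_procs_by_group(SAMPLE))
```

Against a live exporter, the text could be fetched with `urllib.request.urlopen("http://localhost:9256/metrics").read().decode()` and passed to the same function.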
2. Prometheus configuration
Add or modify the scrape configuration
- job_name: 'doog_dev_prometheus'
  scrape_interval: 10s
  honor_labels: true
  metrics_path: '/metrics'
  static_configs:
  - targets: ['192.168.10.73:9090','192.168.10.73:9100']
    labels: {cluster: 'dev',type: 'basic',env: 'dev',job: 'prometheus',export: 'prometheus'}
  - targets: ['192.168.10.73:9256']
    labels: {cluster: 'dev',type: 'process',env: 'dev',job: 'prometheus',export: 'process_exporter'}
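Reload the Prometheus configuration (the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle)
curl -X POST http://127.0.0.1:9090/-/reload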
3. Grafana dashboard
The Grafana dashboard for process-exporter is: https://grafana.com/grafana/dashboards/249
4. Common alerting rules
Process count
alert: Process alert
expr: sum(namedprocess_namegroup_states) by (cluster,job,instance) > 500
for: 20s
labels:
  severity: warning
annotations:
  value: "The server currently has {{ $value }} processes, which exceeds the alert threshold"
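The expression in this rule adds up the per-state series of every monitored group and compares the total per instance with 500. The aggregation can be sketched in Python as follows; the helper names `sum_by` and `firing_groups` are hypothetical, the sample values are made up, and this only illustrates the grouping logic rather than how Prometheus evaluates queries internally.

```python
from collections import defaultdict

def sum_by(samples, keys):
    """Sum sample values, grouping only by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        group = tuple(labels.get(key, "") for key in keys)  # a missing label counts as ""
        totals[group] += value
    return dict(totals)

def firing_groups(samples, threshold=500):
    """Return the groups whose summed process count exceeds the threshold."""
    totals = sum_by(samples, ("cluster", "job", "instance"))
    return {group: total for group, total in totals.items() if total > threshold}

# Made-up samples of namedprocess_namegroup_states: (labels, value)
base = {"cluster": "dev", "job": "prometheus", "instance": "192.168.10.73:9256"}
samples = [
    ({**base, "groupname": "map[:nginx]", "state": "Sleeping"}, 300.0),
    ({**base, "groupname": "map[:nginx]", "state": "Running"}, 150.0),
    ({**base, "groupname": "map[:redis-server]", "state": "Sleeping"}, 80.0),
]
print(firing_groups(samples))
```

Because groupname and state are not in the `by (...)` list, all groups and states on one instance collapse into a single total, so the threshold applies to the instance as a whole and only to processes the exporter is configured to watch.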
Zombie process count
alert: Process alert
expr: sum by(cluster, job, instance, groupname) (namedprocess_namegroup_states{state="Zombie"}) > 0
for: 1m
labels:
  severity: warning
annotations:
  value: "There are currently {{ $value }} zombie processes"
Process restart
alert: Process restart alert
expr: ceil(time() - max by(cluster, job, instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
for: 25s
labels:
  label: alert_once
  severity: warning
annotations:
  value: "Process {{ $labels.groupname }} restarted {{ $value }} seconds ago"
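The restart expression boils down to simple arithmetic: the age of the oldest process in a group, rounded up, is compared with a 60-second window. A minimal Python sketch of that check follows; the function names are hypothetical and the timestamps are made up.

```python
import math

def seconds_since_start(now, oldest_start_time):
    """Age of the oldest process in a group, rounded up like PromQL's ceil()."""
    return math.ceil(now - oldest_start_time)

def is_recent_restart(now, oldest_start_time, window=60):
    """True while the oldest process in the group is younger than the window."""
    return seconds_since_start(now, oldest_start_time) < window

now = 1_700_000_000.0  # made-up evaluation time (Unix seconds)
print(is_recent_restart(now, now - 29.5))    # started 29.5 s ago
print(is_recent_restart(now, now - 3600.0))  # started an hour ago
```

Since the condition only holds during the first minute after the oldest process starts and the rule requires it to hold for 25s, the alert would be expected to fire at most briefly after each restart.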
Process exit
alert: Process exit alert
expr: up{export="process_exporter"} == 0 or max by(cluster, job, instance, groupname) (delta(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^map.*"}[10d])) < 0
for: 55s
labels:
  severity: warning
annotations:
  value: "Process {{ $labels.groupname }} has exited"