Building an Alerting System with Prometheus + Alertmanager
Prometheus Configuration File
prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.0.100:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/etc/prometheus/alert_rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
Alert Rules
alert_rules.yml: configured under the rule_files node of prometheus.yml.
groups:
  - name: example # rule group name
    rules:
      - alert: InstanceDown # alert name (must be a valid metric name, so no spaces)
        expr: up == 0 # PromQL expression that triggers the rule
        for: 15s # pending duration: 15 seconds
        labels: # labels marking the alert's severity and target
          name: instance
          severity: Critical
        annotations: # annotations
          summary: "Instance {{ $labels.instance }} down" # alert summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 15 seconds." # alert detail
          value: "{{ $value }}" # value of the expression when the alert fired
  - name: Host
    rules:
      - alert: HostMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host memory usage exceeds 80%."
          value: "{{ $value }}"
      - alert: HostCPUUsage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance,job) > 0.65
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host CPU usage exceeds 65%."
          value: "{{ $value }}"
      - alert: HostLoad
        expr: node_load5 > 4
        for: 1m
        labels:
          name: Load
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host 5-minute load average exceeds 4."
          value: "{{ $value }}"
      - alert: HostFilesystemUsage
        expr: 1 - (node_filesystem_free_bytes / node_filesystem_size_bytes) > 0.8
        for: 1m
        labels:
          name: Disk
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host partition [ {{ $labels.mountpoint }} ] usage exceeds 80%."
          value: "{{ $value }}%"
      - alert: HostDiskio
        expr: irate(node_disk_writes_completed_total{job=~"Host"}[1m]) > 10
        for: 1m
        labels:
          name: Diskio
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host disk [{{ $labels.device }}] 1-minute average write IO load is high."
          value: "{{ $value }}iops"
      - alert: Network_receive
        expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 3
        for: 1m
        labels:
          name: Network_receive
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host NIC [{{ $labels.device }}] 5-minute average receive rate exceeds 3MB/s."
          value: "{{ $value }}MB/s"
      - alert: Network_transmit
        expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 3
        for: 1m
        labels:
          name: Network_transmit
          severity: Warning
        annotations:
          summary: "{{ $labels.job }}"
          description: "Host NIC [{{ $labels.device }}] 5-minute average transmit rate exceeds 3MB/s."
          value: "{{ $value }}MB/s"
  - name: Container
    rules:
      - alert: ContainerCPUUsage
        expr: (sum by(name,instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.name }}"
          description: "Container CPU usage exceeds 60%."
          value: "{{ $value }}%"
      - alert: ContainerMemUsage
        # expr: (container_memory_usage_bytes - container_memory_cache) / container_spec_memory_limit_bytes * 100 > 10
        expr: container_memory_usage_bytes{name=~".+"} / 1048576 > 1024
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.name }}"
          description: "Container memory usage exceeds 1GB."
          value: "{{ $value }}MB"
Alertmanager Configuration File
alertmanager.yml: the configuration file passed to Alertmanager at startup.
global:
  resolve_timeout: 5m
  smtp_from: 'xxxxxxxx@qq.com' # sender address
  smtp_smarthost: 'smtp.qq.com:465' # the mail provider's SMTP host; for smtp.qq.com the port is 465 or 587
  smtp_auth_username: 'xxxxxxxx@qq.com' # username
  smtp_auth_password: 'xxxxxxxxxxxxxxx' # authorization code, not the QQ account password
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/etc/alertmanager/template/alert.tmpl'
route:
  group_by: ['alertname'] # group alerts by these labels
  group_wait: 5s # how long to wait before sending the first notification for a new group; identical alerts arriving within 5 seconds are batched into one group
  group_interval: 5m # if the group content has not changed, merge into one notification and send after 5 minutes
  repeat_interval: 5m # resend interval (s/m/h); if the alert is not resolved within this time, the notification is sent again
  receiver: 'wechat' # default receiver: wechat
  routes: # sub-route that sends via email
    - receiver: email
      match_re:
        severity: email
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxxxx@163.com' # separate multiple recipients with ','
        send_resolved: true
  - name: 'wechat'
    wechat_configs:
      - corp_id: 'xxxxxxxxxxxxx' # enterprise ID
        api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat Work API endpoint (fixed value)
        to_party: '2' # notification group ID
        agent_id: '1000002' # agent_id of the newly created app
        api_secret: 'xxxxxxxxxxxxxx' # generated secret
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
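Before (re)starting the service, the configuration can be checked for syntax errors with amtool, which ships with Alertmanager; the install paths below are assumptions:

# Validate the configuration file
amtool check-config /etc/alertmanager/alertmanager.yml
# Start Alertmanager with this configuration
alertmanager --config.file=/etc/alertmanager/alertmanager.yml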
Note: smtp_require_tls controls whether TLS is used; enable or disable it according to your environment. If you get the error email.loginAuth failed: 530 Must issue a STARTTLS command first, it needs to be set to true. In particular, if TLS is enabled and you get the error starttls failed: x509: certificate signed by unknown authority, you need to configure insecure_skip_verify: true under email_configs to skip TLS verification.
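For example, a minimal sketch of that receiver (assuming a current Alertmanager, where the option lives in the tls_config block of an email_config):

receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxxxx@163.com'
        send_resolved: true
        tls_config:
          insecure_skip_verify: true # skip certificate verification; only for trusted mail servers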
Validating Alert Routing
https://www.prometheus.io/webtools/alerting/routing-tree-editor
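The routing tree can also be inspected locally with amtool; the label set below is an assumption chosen to match the email sub-route above:

# Print the routing tree defined in the configuration file
amtool config routes --config.file=alertmanager.yml
# Show which receiver a given label set would be routed to (here: email)
amtool config routes test --config.file=alertmanager.yml severity=email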
Alert Template
alert.tmpl: configured under templates in alertmanager.yml.
Custom template
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
========Monitoring Alert==========
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Application: {{ $alert.Annotations.summary }}
Host: {{ $alert.Labels.instance }}
Details: {{ $alert.Annotations.description }}
Threshold value: {{ $alert.Annotations.value }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
========end=============
{{ end }}
{{ end }}
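To preview how this template renders, a synthetic alert can be pushed straight to Alertmanager's v2 API with curl; all label values below are made up for illustration:

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "InstanceDown",
          "severity": "Critical",
          "instance": "192.168.0.100:9100",
          "job": "Host"
        },
        "annotations": {
          "summary": "Instance 192.168.0.100:9100 down",
          "description": "synthetic alert for template testing",
          "value": "0"
        }
      }]'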
Default template
https://github.com/prometheus/alertmanager/blob/main/template/default.tmpl
Notes
A Prometheus alert has three states: Inactive, Pending, and Firing.
Inactive: the rule is being evaluated, but the alert condition is not met and nothing has been triggered.
Pending: the alert condition is met, but the configured for duration has not yet fully elapsed; once the condition has held for the whole duration, the alert transitions to Firing.
Firing: the alert is sent to Alertmanager, which (after grouping, inhibition, and silencing are applied) dispatches it to all configured receivers. Once the alert is resolved, the state returns to Inactive, and the cycle repeats.
Related URLs
Prometheus Alerts - http://localhost:9090/alerts
Prometheus Rules - http://localhost:9090/rules
Alertmanager - http://localhost:9093
Official Documentation
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules