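I. SkyWalking Alarms
SkyWalking's alarm feature was introduced in the 6.x releases. Its core is driven by a set of rules defined in the config/alarm-settings.yml file.
1. An alarm definition consists of two parts:
- Alarm rules: define how a metric alarm is triggered and which conditions are taken into account.
- Webhooks: define which service endpoints are notified when an alarm is triggered.
1.1 Alarm rules
Every SkyWalking release ships with a default config/alarm-settings.yml file that predefines some commonly used alarm rules.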
yaml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  # 1. Service response time rule
  service_resp_time_rule:
    # Metric name, taken from the metric names defined in the OAL script. Only long, double and int values are supported. See the official OAL script.
    metrics-name: service_resp_time
    # Operator; >, < and = are currently supported
    op: ">"
    # Threshold
    threshold: 1000
    # How long the rule is evaluated over. This is a time window that follows the clock of the backend deployment environment
    period: 10
    # Within one period window, an alarm is sent once the number of values crossing the threshold (as judged by op) reaches count
    count: 3
    # After an alarm fires at time TN, no alarm is sent during TN -> TN + period. By default it equals period, which means the same alarm (same ID under the same metrics name) fires only once within one period
    silence-period: 5
    # Alarm message: the average response time of service {name} exceeded 1 second in 3 of the last 10 minutes.
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  # 2. Service SLA rule
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    # Threshold (8000 stands for the 80% in the message below)
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times the metrics must match the condition before an alarm is triggered
    count: 2
    silence-period: 3
    # Alarm message: the success rate of service {name} was below 80% in 2 of the last 10 minutes.
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  # 3. Service response time percentile rule
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    # Alarm message: response time percentiles of service {name} exceeded 1 second in 3 of the last 10 minutes.
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  # 4. Service instance response time rule
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    # Alarm message: the average response time of service instance {name} exceeded 1 second in 2 of the last 10 minutes.
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # 5. Database access response time rule
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    # Alarm message: the average response time of database access {name} exceeded 1 second in 2 of the last 10 minutes.
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  # 6. Endpoint relation response time rule
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    # Alarm message: the average response time of endpoint relation {name} exceeded 1 second in 2 of the last 10 minutes.
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_resp_time_rule:
  #   metrics-name: endpoint_resp_time
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
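To make the percentile rule a little more concrete, here is a minimal Java sketch of how a comma-separated threshold list could be checked against the five percentile values (p50, p75, p90, p95, p99). It assumes that a time bucket counts as matched when at least one percentile exceeds the threshold at the same position, which is how the sample message reads. The class and method names are made up for illustration and are not SkyWalking's API.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of SkyWalking.
final class PercentileThresholdSketch {

    /** Parses a threshold list such as "1000,1000,1000,1000,1000". */
    static long[] parseThresholds(String spec) {
        return Arrays.stream(spec.split(","))
                .map(String::trim)
                .mapToLong(Long::parseLong)
                .toArray();
    }

    /**
     * Assumption: one time bucket matches if any percentile is above
     * the threshold at the same position (op is ">").
     */
    static boolean matches(long[] percentiles, long[] thresholds) {
        if (percentiles.length != thresholds.length) {
            throw new IllegalArgumentException("one threshold per percentile is expected");
        }
        for (int i = 0; i < percentiles.length; i++) {
            if (percentiles[i] > thresholds[i]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] thresholds = parseThresholds("1000,1000,1000,1000,1000");
        // p50, p75, p90, p95, p99 of one minute, in milliseconds
        long[] oneMinute = {200, 400, 800, 950, 1200};
        System.out.println(matches(oneMinute, thresholds));
    }
}
```

In this example only p99 is above 1000 ms, which is enough for the minute to count towards `count`.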
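Rule configuration items:
- *_rule: the rule name, which is also the unique name shown in the alarm message. It must end with _rule; the prefix is up to you.
- metrics-name: the metric name, taken from the metric names defined in the OAL script. Only long, double and int values are supported. See the official OAL script.
- include-names: the entity names the rule applies to, for example service names or endpoint names (optional, defaults to all).
- exclude-names: the entity names the rule does not apply to, for example service names or endpoint names (optional, defaults to none).
- threshold: the threshold value.
- op: the operator; >, < and = are currently supported.
- period: how long the rule is evaluated over. This is a time window that follows the clock of the backend deployment environment.
- count: within one period window, an alarm is sent once the number of values crossing the threshold (as judged by op) reaches count.
- silence-period: after an alarm fires at time TN, no alarm is sent during TN -> TN + period. By default it equals period, which means the same alarm (same ID under the same metrics name) fires only once within one period.
- message: the alarm message.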
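The interplay of period, count and silence-period is easier to see in code. The following Java sketch is a deliberately simplified model, assuming one metric value per minute and the ">" operator; it is not SkyWalking's actual implementation, and the class name AlarmWindowSketch is made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model for illustration; SkyWalking's real evaluation logic is more involved.
final class AlarmWindowSketch {
    private final long threshold;
    private final int period;        // window length, in minutes
    private final int count;         // matches needed inside the window
    private final int silencePeriod; // minutes to stay quiet after firing
    private final Deque<Long> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmWindowSketch(long threshold, int period, int count, int silencePeriod) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Feeds one per-minute metric value; returns true if an alarm would be sent now. */
    boolean onMinute(long value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` minutes
        }
        if (silenceLeft > 0) {
            silenceLeft--;        // still inside the silence period
            return false;
        }
        long matched = window.stream().filter(v -> v > threshold).count();
        if (matched >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AlarmWindowSketch rule = new AlarmWindowSketch(1000, 10, 3, 5);
        for (int minute = 1; minute <= 9; minute++) {
            System.out.println("minute " + minute + " -> " + rule.onMinute(1500));
        }
    }
}
```

With the values of service_resp_time_rule (threshold 1000, period 10, count 3, silence-period 5) and a service that stays at 1500 ms, this model fires at minute 3, stays silent during minutes 4 to 8, and fires again at minute 9.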
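1.2 Webhook
A Webhook can be thought of as a callback mechanism at the web level. It is usually triggered by some event, much like an event callback in code, except that what gets called back when the event occurs is a service endpoint rather than a method or function.
For example, in the alarm scenario the alarm is the event. When it occurs, SkyWalking actively calls a preconfigured endpoint, and that endpoint is the Webhook.
SkyWalking sends alarm messages as HTTP requests, using the POST method with Content-Type application/json. The JSON body is serialized from List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>.
Sample JSON data: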
json
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "ep-gateway",
  "id0": "ZXAtZ2F0ZXdheQ==.1",
  "id1": "",
  "ruleName": "service_sla_rule",
  "alarmMessage": "Successful rate of service ep-gateway is lower than 80% in 2 minutes of last 10 minutes",
  "startTime": 1560524171000,
  "tags": []
}]
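Field descriptions:
- scopeId, scope: all available scopes are listed in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.
- name: the entity name of the target scope.
- id0: the ID of the scope entity.
- id1: reserved, currently unused.
- ruleName: the alarm rule name.
- alarmMessage: the alarm message content.
- startTime: the alarm time, as an epoch timestamp in milliseconds.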
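Two of these fields are not human-readable as delivered. The sketch below shows one way to decode them with the JDK alone. It assumes that, for the SERVICE scope, id0 is the Base64-encoded service name followed by a dot and a flag, which is what the sample value looks like; other scopes may use a different layout. AlarmFieldDecoder is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;

// Hypothetical helper for illustration; the id0 layout is an assumption based on the sample payload.
final class AlarmFieldDecoder {

    /** Decodes the Base64 part of a SERVICE-scope id0 such as "ZXAtZ2F0ZXdheQ==.1". */
    static String decodeServiceName(String id0) {
        int dot = id0.lastIndexOf('.');
        String encoded = dot >= 0 ? id0.substring(0, dot) : id0;
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Converts the millisecond timestamp in startTime to an Instant (UTC). */
    static Instant toInstant(long startTime) {
        return Instant.ofEpochMilli(startTime);
    }

    public static void main(String[] args) {
        System.out.println(decodeServiceName("ZXAtZ2F0ZXdheQ==.1"));
        System.out.println(toInstant(1560524171000L));
    }
}
```

For the sample payload, the decoded name equals the name field (ep-gateway), and the timestamp corresponds to 2019-06-14T14:56:11Z.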
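2. Pushing alarm messages to WeCom (企业微信) through a Webhook
From the two subsections above we can see that SkyWalking does not send alarm messages directly to email, SMS or similar services.
- When an alarm occurs, SkyWalking only sends the alarm message to the configured Webhook endpoints.
- Nobody can be expected to watch that endpoint's logs to find out whether a service has raised an alarm, so the endpoint itself has to forward the alarm by email, SMS and so on in order to deliver customized notifications.
2.1 Based on the JSON data sent by SkyWalking, create a DTO, AlarmMessage.java, that the endpoint uses to receive the data: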
java
package org.jeecg.dto;

import lombok.Data;
import lombok.Getter;
import lombok.Setter;

import java.util.List;

/**
 * Alarm message represents the details of each alarm.
 */
@Setter
@Getter
public class AlarmMessage {
    /** Scope ID */
    private int scopeId;
    /** Scope name, e.g. SERVICE */
    private String scope;
    /** Entity name of the target scope */
    private String name;
    /** ID of the scope entity */
    private String id0;
    /** Reserved, currently unused */
    private String id1;
    /** Alarm rule name */
    private String ruleName;
    /** Alarm message content */
    private String alarmMessage;
    /** Tag list */
    private List<Tag> tags;
    /** Alarm time, as an epoch timestamp in milliseconds */
    private long startTime;
    /** Transient fields; they do not appear in the sample JSON payload */
    private transient int period;
    private transient boolean onlyAsCondition;

    /** Tag */
    @Data
    public static class Tag {
        private String key;
        private String value;
    }
}
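2.2 Create a Webhook callback endpoint that receives SkyWalking's alarm notifications and forwards the data to WeCom:
- Create a SwAlarmController.java class to receive the alarm notifications from SkyWalking.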
java
package org.jeecg.controller;

import com.alibaba.fastjson.JSON;
import lombok.extern.slf4j.Slf4j;
import org.jeecg.common.util.SpringContextHolder;
import org.jeecg.dto.AlarmMessage;
import org.jeecg.utils.DateUtil;
import org.jeecg.utils.RobotWebhookUtil;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Date;
import java.util.List;

/**
 * SkyWalking alarm controller
 *
 * @author Calvin
 * @date 2022/10/9
 */
@Slf4j
@RefreshScope
@RestController
@RequestMapping("/sw/alarm")
public class SwAlarmController {

    /**
     * Alarm notification
     *
     * @param alarmMessages alarm messages
     */
    @PostMapping(value = "/notify")
    public void notify(@RequestBody List<AlarmMessage> alarmMessages) {
        String content = getContent(alarmMessages);
        // Send to WeCom
        SpringContextHolder.getBean(RobotWebhookUtil.class).requestRobot(content);
        log.info("Alarm notification handed to the WeCom robot: {}", content);
    }

    /**
     * Builds the markdown content
     *
     * @param alarmMessages alarm messages
     * @return markdown text with one quoted block per alarm
     */
    private String getContent(List<AlarmMessage> alarmMessages) {
        StringBuilder sb = new StringBuilder();
        for (AlarmMessage alarm : alarmMessages) {
            appendField(sb, "scopeId", alarm.getScopeId());
            appendField(sb, "scope", alarm.getScope());
            appendField(sb, "Entity name of the target scope", alarm.getName());
            appendField(sb, "ID of the scope entity", alarm.getId0());
            appendField(sb, "id1", alarm.getId1());
            appendField(sb, "Alarm rule name", alarm.getRuleName());
            appendField(sb, "Alarm message", alarm.getAlarmMessage());
            appendField(sb, "Alarm time", DateUtil.getDateTimeFormat(new Date(alarm.getStartTime())));
            appendField(sb, "Tags", JSON.toJSONString(alarm.getTags()));
            // Separator between two alarms
            sb.append("\n---------------\n\n");
        }
        return sb.toString();
    }

    /** Appends one quoted markdown line, with the value highlighted in the "warning" color. */
    private static void appendField(StringBuilder sb, String label, Object value) {
        sb.append("> ").append(label).append(":[")
                .append("<font color=\"warning\">").append(value).append("</font>")
                .append("]\n");
    }
}
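Because getContent concatenates every alarm of a batch into one markdown message, the text can grow long. WeCom's group robot documentation gives a limit of 4096 bytes (UTF-8) for markdown content; that figure is worth re-checking against the current documentation. A guard along the following lines may help. It is only a sketch, and MarkdownLimiter and truncateUtf8 are hypothetical names that do not appear in the project.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the 4096-byte limit is taken from WeCom's
// robot documentation as remembered and should be verified.
final class MarkdownLimiter {
    static final int MAX_MARKDOWN_BYTES = 4096;

    /** Cuts text to at most maxBytes of UTF-8 without splitting a multi-byte character. */
    static String truncateUtf8(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return text;
        }
        int end = maxBytes;
        // bytes[end] is the first byte that would be dropped. If it is a UTF-8
        // continuation byte (10xxxxxx), the cut falls inside a character, so back up.
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String content = "> Alarm message:[...]\n".repeat(500);
        String safe = truncateUtf8(content, MAX_MARKDOWN_BYTES);
        System.out.println(safe.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Truncating is the bluntest option; splitting the batch into several messages would keep every alarm, at the cost of more requests to the robot.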
- Create a RobotWebhookUtil.java class to send the message to WeCom.
java
package org.jeecg.utils;

import lombok.extern.slf4j.Slf4j;
import org.json.JSONObject;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.context.config.annotation.RefreshScope;
import org.springframework.http.MediaType;
import org.springframework.stereotype.Component;

import javax.servlet.http.HttpServletRequest;

/**
 * Sends notifications to a WeCom group robot.
 * HttpClientUtil and SpringContextUtils are project helpers that are not shown here;
 * they are assumed to live in this package.
 */
@RefreshScope
@Component
@Slf4j
public class RobotWebhookUtil {
    // WeCom robot webhook URL
    @Value("${robot.webhook.url:https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxx}")
    private String robotWebhookUrl;
    // Whether robot notifications are enabled
    @Value("${robot.open:true}")
    private Boolean openRobot;
    @Value("${spring.application.name:ep-system}")
    private String applicationName;
    @Value("${robot.environment:online}")
    private String environmentValue;
    @Value("${spring.system.service.name:鹏贸通}")
    private String systemServiceName;

    // The @Value defaults above already cover unset properties, so the values are not
    // read again from Environment: env.getProperty(key) returns null for an unset key,
    // which would overwrite the defaults and switch the robot off.

    /**
     * Sends a notification to the WeCom robot
     */
    public void requestRobot(String swAlarmMsgContent) {
        // Skip when robot notifications are disabled
        if (!Boolean.TRUE.equals(openRobot)) {
            return;
        }
        String messageText = "<font color=\"warning\">" + environmentValue + "</font>" + systemServiceName
                + " is reporting a problem, please take a look.\n"
                + "> appName:[<font color=\"warning\">" + applicationName + "</font>]\n"
                + swAlarmMsgContent;
        JSONObject jsonObjectText = new JSONObject();
        jsonObjectText.put("content", messageText);
        JSONObject jsonObject = new JSONObject();
        jsonObject.put("msgtype", "markdown");
        jsonObject.put("markdown", jsonObjectText);
        log.info("Request payload: {}", jsonObject);
        try {
            HttpClientUtil.httpPost(MediaType.APPLICATION_JSON, robotWebhookUrl, jsonObject.toString());
        } catch (Exception e) {
            log.error("Request to the WeCom robot failed", e);
        }
    }

    /**
     * Returns the URI of the current request, or an empty string if there is none.
     * Not used by requestRobot.
     *
     * @return request URI
     */
    private String getRequestUri() {
        try {
            HttpServletRequest request = SpringContextUtils.getHttpServletRequest();
            return request.getRequestURI();
        } catch (Exception e) {
            return "";
        }
    }
}
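HttpClientUtil is not shown in this article. As a stand-in, here is a minimal sketch of the same call using only the JDK (java.net.http.HttpClient, available since Java 11). It builds the markdown payload by hand and posts it to the robot URL. WeComRobotClient and its methods are hypothetical names, and the five-second timeout is an arbitrary choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical JDK-only stand-in for the project's HttpClientUtil.
final class WeComRobotClient {

    /** Escapes a string so it can be embedded in a JSON string literal. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the same body as RobotWebhookUtil: msgtype "markdown" plus the content. */
    static String markdownPayload(String content) {
        return "{\"msgtype\":\"markdown\",\"markdown\":{\"content\":\""
                + escapeJson(content) + "\"}}";
    }

    /** Posts the payload and returns the HTTP status code. */
    static int send(String webhookUrl, String content) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
                .timeout(Duration.ofSeconds(5))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(markdownPayload(content)))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        System.out.println(markdownPayload("> appName:[<font color=\"warning\">ep-system</font>]\n"));
    }
}
```

In a real project a JSON library, such as the org.json classes already used above, is a safer way to build the body than manual escaping; the manual version is here only to keep the sketch free of dependencies.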
- In the config/alarm-settings.yml file, add the URL of the endpoint that receives the alarm callbacks under the webhooks option.
yaml
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Average response time of service [{name}] exceeded 1 second in 3 of the last 10 minutes
    tags:
      level: WARNING
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Success rate of service [{name}] was below 80% in 2 of the last 10 minutes
    tags:
      level: WARNING
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time percentiles of service [{name}] raised an alarm in 3 of the last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
    tags:
      level: WARNING
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Average response time of instance [{name}] exceeded 1 second in 2 of the last 10 minutes
    tags:
      level: WARNING
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Average response time of database [{name}] exceeded 1 second in 2 of the last 10 minutes
    tags:
      level: WARNING
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Average response time of endpoint relation [{name}] exceeded 1 second in 2 of the last 10 minutes
    tags:
      level: WARNING
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_resp_time_rule:
  #   metrics-name: endpoint_resp_time
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
  - http://127.0.0.1:7001/sw/alarm/notify
3. Restart SkyWalking OAP to reload the configuration.
shell
$ sh /skywalking/apache-skywalking-apm-bin/bin/startup.sh
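4. Test the feature by triggering an alarm message.
- WeCom receives the message.
- The message content is also visible under the Alarm menu of the SkyWalking console.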
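Waiting for a real alarm can take a while. To check the Webhook endpoint on its own, one option is to post the sample payload to it by hand. The sketch below does that with the JDK's java.net.http.HttpClient (Java 11+). It assumes the endpoint from the configuration above, http://127.0.0.1:7001/sw/alarm/notify, is running; AlarmWebhookSmokeTest is a hypothetical class name.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical smoke test that imitates the request SkyWalking sends to a Webhook.
final class AlarmWebhookSmokeTest {

    /** The sample alarm payload from section 1.2, as a single JSON array. */
    static final String SAMPLE =
            "[{\"scopeId\":1,\"scope\":\"SERVICE\",\"name\":\"ep-gateway\","
            + "\"id0\":\"ZXAtZ2F0ZXdheQ==.1\",\"id1\":\"\",\"ruleName\":\"service_sla_rule\","
            + "\"alarmMessage\":\"Successful rate of service ep-gateway is lower than 80% "
            + "in 2 minutes of last 10 minutes\","
            + "\"startTime\":1560524171000,\"tags\":[]}]";

    /** Builds a POST request with Content-Type application/json. */
    static HttpRequest buildRequest(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // The URL can be overridden with the first command-line argument
        String url = args.length > 0 ? args[0] : "http://127.0.0.1:7001/sw/alarm/notify";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(url, SAMPLE), HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```

If the controller and the robot are wired up correctly, a message for service_sla_rule should then show up in the WeCom group; nothing appears in the SkyWalking console, since this request bypasses the OAP.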