Monitoring deployment workflow:
- Building on the previous post: once the application services are deployed, how do we keep monitoring them and tune parameters over time? (The current deployment runs on Docker Swarm; the recommended stack is cAdvisor + Prometheus/VictoriaMetrics (preferred) + Grafana.)
Scheme 1: deploy cAdvisor directly as a global service. Simple, but it cannot capture container OOM events in time: cAdvisor detects OOM kills by watching the kernel log through `/dev/kmsg`, and a Swarm service cannot be granted `--device=/dev/kmsg` or `--privileged`.
```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    ports:
      - "8080:8080"
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
      resources:
        limits:
          memory: 1G
```
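The limitation comes from how cAdvisor learns about OOM kills: it tails the kernel log via `/dev/kmsg` and parses "out of memory" lines like the one sketched below. Without `/dev/kmsg` access, the watcher has nothing to read. (The process name and pid below are made up for illustration.)

```shell
# A made-up sample of the kernel message cAdvisor's OOM watcher parses;
# a Swarm-deployed cAdvisor never sees such lines.
sample='Memory cgroup out of memory: Killed process 1234 (java) total-vm:204800kB'
echo "$sample" | grep -c 'out of memory'   # → 1
```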
Scheme 2: wrap cAdvisor in an outer docker-client container. The host must allow access to the docker.sock file (the wrapper uses this socket to start the real cAdvisor container itself, which lets it pass the privileged flags). Since prom discovers the targets over the network, the overlay network must be created in advance; and since prom/vmagent does its auto-discovery via file_sd_configs, each newly started node must get its target file mounted into the service as a config by a script.
Step 1: create the prom/vmagent stack first
```yaml
services:
  vmagent:
    image: dockerproxy.cn/victoriametrics/vmagent:v1.96.0
    environment:
      - TZ=Asia/Shanghai
    configs:
      - source: vmagent-config
        target: /etc/prometheus/prometheus.yml
    volumes:
      - ./data/:/vmagentdata/
    command:
      - '-promscrape.config=/etc/prometheus/prometheus.yml'
      - '-remoteWrite.url=https://<remote-storage-url>'
      - '-remoteWrite.urlRelabelConfig=/etc/prometheus/relabel.yml'
      - '-remoteWrite.forceVMProto'
      - '-remoteWrite.tmpDataPath=/vmagentdata'
      - '-remoteWrite.maxDiskUsagePerURL=100GB'
      - '-promscrape.maxScrapeSize=2000000000'
      - '-promscrape.streamParse'
      - '-promscrape.configCheckInterval=5m'
    networks:
      - monitoring
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
      labels:
        "monitoring_tag": "swarm_monitor"
      placement:
        constraints:
          - node.role == manager
      resources:
        limits:
          memory: 1G

networks:
  monitoring:
    driver: overlay
    external: true
    attachable: true
```

The `prometheus.yml` mounted via the `vmagent-config` config:

```yaml
scrape_configs:
  - job_name: "cadvisor"
    scrape_interval: 15s
    file_sd_configs:
      - files:
          - '/data/sd_config/*_targets.yml'
    metric_relabel_configs:
      - source_labels: [__address__]
        regex: '(.*)_(.*)_(.*):(.*)'
        replacement: ${3}
        target_label: 'hostName'
```
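The `metric_relabel_configs` rule above derives a `hostName` label from target addresses of the form `<stack>_<service>_<hostname>:<port>`. A quick sketch of what the regex extracts, using `sed -E` on a hypothetical target `cadvisor_main_node1:8080`:

```shell
# Same pattern as the relabel rule: four capture groups, keep the third.
address="cadvisor_main_node1:8080"
hostName=$(echo "$address" | sed -E 's/(.*)_(.*)_(.*):(.*)/\3/')
echo "$hostName"   # → node1
```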
Step 2: create the cadvisor (main/auxiliary) service and set up the config holding the corresponding auto-discovery script
```yaml
services:
  # Side-car: installs jq/curl, then runs the auto-discovery script from
  # cron every 30 minutes (args: child name, project, env, workspace)
  auxiliary:
    image: dockerproxy.cn/docker:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    entrypoint: ["/bin/sh","-c"]
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          memory: 256m
    configs:
      - source: auto-discovery
        target: /opt/auto-discovery.sh
    environment:
      - CHILDNAME=cadvisor_main_{{.Node.Hostname}}
    command:
      - |
        sed -i "s/dl-cdn.alpinelinux.org/mirrors.ustc.edu.cn/g" /etc/apk/repositories && \
        apk add --no-cache jq curl && \
        echo "*/30 * * * * /bin/sh /opt/auto-discovery.sh $${CHILDNAME} 'pjcx' 'dev-bot-pjcx' 'workspace1' > /tmp/auto-discovery.log 2>&1 " >> /var/spool/cron/crontabs/root && crond && \
        tail -f /dev/null
  # Wrapper: starts the real cAdvisor via docker run so it can pass
  # --privileged and --device=/dev/kmsg (not possible for a Swarm service)
  main:
    image: dockerproxy.cn/docker:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    entrypoint: ["/bin/sh","-c"]
    networks:
      - monitoring
    deploy:
      mode: global
      resources:
        limits:
          memory: 256m
    environment:
      - CHILDNAME={{.Service.Name}}_{{.Node.Hostname}}
    command:
      - |
        exec docker run -i --rm \
          --volume=/:/rootfs:ro \
          --volume=/var/run:/var/run:ro \
          --volume=/sys:/sys:ro \
          --volume=/var/lib/docker/:/var/lib/docker:ro \
          --volume=/dev/disk/:/dev/disk:ro \
          --name=$${CHILDNAME} \
          --privileged \
          --device=/dev/kmsg \
          --network monitoring \
          -m 1g \
          gcr.io/cadvisor/cadvisor:v0.49.1 --docker_only=true

networks:
  monitoring:
    driver: overlay
    external: true
    attachable: true

configs:
  auto-discovery:
    external: true
```
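In the stack above, Swarm expands the service-template placeholders per node, and `$$` keeps Compose from interpolating `$` so that the shell inside the container sees `${CHILDNAME}`. A rough local equivalent of the name the auxiliary service passes to the discovery script (assuming a stack named `cadvisor`, and using `hostname(1)` in place of `{{.Node.Hostname}}`):

```shell
# Sketch: reproduce the per-node child-container name locally.
NODE_HOSTNAME=$(hostname)
CHILDNAME="cadvisor_main_${NODE_HOSTNAME}"
echo "$CHILDNAME"
```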
Step 3 (optional): if the Swarm is already running and restarting dockerd to enable the remote API is inconvenient, the manager's API can instead be exposed via an external nginx proxying docker.sock
```nginx
user  nginx;
worker_processes  2;
error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
    use epoll;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
    access_log  /var/log/nginx/access.log  main;
    sendfile        on;
    tcp_nopush      on;
    tcp_nodelay     on;
    keepalive_timeout  65;
    proxy_ignore_client_abort on;
    proxy_connect_timeout 600;
    proxy_send_timeout 600;
    proxy_read_timeout 600;
    proxy_buffer_size 64k;
    proxy_buffers 4 32k;
    proxy_busy_buffers_size 64k;
    proxy_temp_file_write_size 64k;
    types_hash_max_size 2048;
    types_hash_bucket_size 128;
    server_names_hash_bucket_size 128;
    server_names_hash_max_size 1024;
    client_max_body_size 300m;
    client_body_buffer_size 128k;

    server {
        listen 2375;
        server_name _;
        location / {
            proxy_pass http://unix:/var/run/docker.sock;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}
```
```yaml
networks:
  monitoring:
    driver: overlay
    external: true
    attachable: true

configs:
  nginx_moniting:
    external: true

services:
  nginx:
    image: dockerproxy.cn/nginx:alpine3.20
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    configs:
      - source: nginx_moniting
        target: /etc/nginx/nginx.conf
    ports:
      - 2375:2375
    networks:
      - monitoring
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == manager
      resources:
        limits:
          memory: 512m
```
Step 4: create the config that holds the auto-discovery script
The script needs either the Docker remote API enabled, or the nginx proxy above forwarding to the manager's docker.sock
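Whichever access path is used, the script talks to the same Docker Engine API on port 2375. For example, it locates the vmagent/prom service by the `monitoring_tag=swarm_monitor` deploy label set in step 1, passing the filter as JSON in the query string (`manager-ip` is a placeholder):

```shell
# Build the /services query the discovery script sends to a manager.
filter='{"label":["monitoring_tag=swarm_monitor"]}'
url="http://manager-ip:2375/services?filters=${filter}"
echo "$url"
```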
```shell
#!/bin/sh
# auto-discovery.sh <programName> <projectName> <envName> <workspaceName>
# Registers this node's cAdvisor container as a file_sd target of the
# vmagent/prom service (label monitoring_tag=swarm_monitor).
dirPath=$(cd "$(dirname "$0")"; pwd)
cd "${dirPath}"

programName="$1"
projectName="$2"
envName="$3"
workspaceName="$4"

programNameConf="${programName}_targets"
flag=0

# Peer IPs of the nodes attached to the "monitoring" overlay network;
# each is tried until one answers the (proxied) Docker API as a manager.
IPs=$(curl -s --unix-socket /var/run/docker.sock http://localhost/networks/monitoring | jq -r '.Peers[].IP')

for masterIp in ${IPs}; do
    curl -fs -o /dev/null \
        -H "Accept: application/json" \
        --url "http://${masterIp}:2375/nodes"
    if [ $? -ne 0 ]; then
        echo "node ${masterIp} is not a manager or its API is unreachable"
        continue
    fi

    totalConfigs=$(curl --request GET -s \
        -H "Accept: application/json" \
        --url "http://${masterIp}:2375/configs")
    destConfig=$(echo "${totalConfigs}" | jq 'map(select(.Spec.Name == "'"${programNameConf}"'")) | if length > 0 then . else null end')
    if [ "${destConfig}" = "null" ]; then
        echo "config ${programNameConf} does not exist yet, creating it"
        FILE="${programNameConf}.yml"
        echo "- targets:" > "${FILE}"
        echo "  - '${programName}:8080'" >> "${FILE}"
        echo "  labels:" >> "${FILE}"
        echo "    app_projects: \"${projectName}\"" >> "${FILE}"
        echo "    app_env: \"${envName}\"" >> "${FILE}"
        echo "    app_scope: \"docker\"" >> "${FILE}"
        echo "    app_host: \"docker\"" >> "${FILE}"
        echo -n "    app_workspace: \"${workspaceName}\"" >> "${FILE}"
        # The config payload must be base64-encoded
        data=$(base64 < "${FILE}")
        json_payload=$(jq -n \
            --arg data "$data" \
            --arg name "$programNameConf" \
            '{Data: $data, Name: $name, Labels: {}}')
        curl -X POST \
            -H "Content-Type: application/json" \
            -d "${json_payload}" \
            --url "http://${masterIp}:2375/v1.41/configs/create"

        curl --request GET -s -H "Accept: application/json" --url "http://${masterIp}:2375/configs" | jq '.[] | .Spec.Name' | grep "${programNameConf}"
        if [ $? -ne 0 ]; then
            echo "failed to create config ${programNameConf}, please investigate"
            exit 1
        fi

        totalConfigs=$(curl --request GET -s \
            -H "Accept: application/json" \
            --url "http://${masterIp}:2375/configs")
    fi
    # .ID comes back quoted, which is exactly what the JSON below needs
    configID=$(echo "${totalConfigs}" | jq 'map(select(.Spec.Name == "'"${programNameConf}"'"))' | jq '.[0] | .ID')

    # Look up the vmagent/prom service by its deploy label
    serviceMsg=$(wget --no-check-certificate --quiet -O - \
        --header="Accept: application/json" \
        "http://${masterIp}:2375/services?filters={\"label\":[\"monitoring_tag=swarm_monitor\"]}" | jq '.[0]')
    if [ -z "${serviceMsg}" ] || [ "${serviceMsg}" = "null" ]; then
        echo "vmagent/prom does not exist yet, create it first"
        exit 1
    fi
    serviceName=$(echo "${serviceMsg}" | jq '.Spec.Name' | tr -d '"')
    serviceVersion=$(echo "${serviceMsg}" | jq '.Version.Index' | tr -d '"')
    serviceConfigs=$(echo "${serviceMsg}" | jq '.Spec.TaskTemplate.ContainerSpec.Configs')
    if [ "${serviceConfigs}" = "null" ]; then
        serviceConfigs="[]"
    fi
    # If the config is already mounted there is nothing to do
    checkConfig=$(echo "${serviceConfigs}" | jq 'map(select(.ConfigName == "'"${programNameConf}"'")) | if length > 0 then . else null end')
    if [ "${checkConfig}" != "null" ]; then
        flag=1
        break
    fi
    # Mode 292 = 0444 (read-only)
    new_config=$(cat <<EOF
{
  "File": {
    "Name": "/data/sd_config/${programNameConf}.yml",
    "UID": "0",
    "GID": "0",
    "Mode": 292
  },
  "ConfigID": ${configID},
  "ConfigName": "${programNameConf}"
}
EOF
)
    update_configs=$(echo "${serviceConfigs}" | jq ". + [$new_config]")
    update_json=$(echo "${serviceMsg}" | jq '.Spec.TaskTemplate.ContainerSpec.Configs = '"${update_configs}"'' | jq '.Spec.TaskTemplate.ForceUpdate = 1')
    update_json=$(echo "${update_json}" | jq '.Spec')
    curl -X POST "http://${masterIp}:2375/services/${serviceName}/update?version=${serviceVersion}" \
        -H "Content-Type: application/json" \
        -d "${update_json}"
    if [ $? -eq 0 ]; then
        echo "add new config success!!!"
        flag=1
        break
    fi
done

if [ ${flag} -eq 1 ]; then
    echo "the config for this node was added to prom/vmagent (or already existed)"
else
    echo "add or lookup failed, please investigate"
fi
```
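For a node named `node1` (hypothetical), the script would generate, and register as a Swarm config, a target file like the following; the label values come from the arguments baked into the cron line in step 2, and the file lands under the `/data/sd_config/` path that the `file_sd_configs` glob in step 1 watches:

```yaml
# cadvisor_main_node1_targets.yml — generated, do not edit by hand
- targets:
  - 'cadvisor_main_node1:8080'
  labels:
    app_projects: "pjcx"
    app_env: "dev-bot-pjcx"
    app_scope: "docker"
    app_host: "docker"
    app_workspace: "workspace1"
```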
Step 5: set up suitable Grafana dashboards and the corresponding alert rules
A complete dashboard was assembled by consolidating many of the popular existing dashboards (see the next post for details)