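When diagnosing problems in a Kubernetes cluster, we often notice that one of the cluster nodes is flickering*, usually at random and in odd ways. That is why we have long needed a tool that tests the reachability of one node from another and exposes the results as Prometheus metrics. With such a tool we also wanted to build charts in Grafana and quickly locate the failed node (and, if necessary, reschedule all Pods from that node and carry out the required maintenance).
* By "flickering" I mean behaviour where a node randomly becomes "NotReady" and later returns to normal. For example, part of the traffic may fail to reach Pods on neighbouring nodes.
Why does this happen? One common cause is a connectivity problem at the data-center switch. For example, we once set up a vswitch in Hetzner, and one of the nodes became unavailable through that vswitch port and turned out to be completely unreachable on the local network.
Our last requirement was to run this service directly in Kubernetes, so that we could deploy everything through a Helm chart. (With Ansible, for example, we would have to define roles for each of the various environments: AWS, GCE, bare metal, and so on.) Since we did not find a ready-made solution for this case, we decided to implement our own.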
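Script and configuration
The main component of our solution is a script that watches the .status.addresses value of every node. If this value changes for some node (for example, when a new node is added), the script passes the list of nodes to the Helm chart through Helm values, in the form of a ConfigMap: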

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ping-exporter-config
      namespace: d8-system
    data:
      nodes.json: >
        {{ .Values.pingExporter.targets | toJson }}

.Values.pingExporter.targets looks like this:

    {"cluster_targets":[
       {"ipAddress":"192.168.191.11","name":"kube-a-3"},
       {"ipAddress":"192.168.191.12","name":"kube-a-2"},
       {"ipAddress":"192.168.191.22","name":"kube-a-1"},
       {"ipAddress":"192.168.191.23","name":"kube-db-1"},
       {"ipAddress":"192.168.191.9","name":"kube-db-2"},
       {"ipAddress":"51.75.130.47","name":"kube-a-4"}],
     "external_targets":[
       {"host":"8.8.8.8","name":"google-dns"},
       {"host":"youtube.com"}]}

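The watcher script that produces this structure is not shown in the article. As a rough sketch of the idea only, the function below turns Node objects (plain dicts shaped like the Kubernetes API JSON) into the `cluster_targets` list. The function name `build_cluster_targets`, the choice of the `InternalIP` address type, and the sample nodes are assumptions made for illustration; the original does not say which address type is used.

```python
import json


def build_cluster_targets(nodes, address_type="InternalIP"):
    """Map Node objects (dicts) to ping targets, one entry per node.

    Assumption: the first address of the requested type is the one to ping.
    Nodes without such an address are skipped.
    """
    targets = []
    for node in nodes:
        name = node["metadata"]["name"]
        for address in node.get("status", {}).get("addresses", []):
            if address["type"] == address_type:
                targets.append({"ipAddress": address["address"], "name": name})
                break
    # Sort by name so that an unchanged cluster always yields the same list
    return sorted(targets, key=lambda target: target["name"])


# Hypothetical sample input, shaped like items from the Node list API
sample_nodes = [
    {"metadata": {"name": "kube-a-2"},
     "status": {"addresses": [
         {"type": "Hostname", "address": "kube-a-2"},
         {"type": "InternalIP", "address": "192.168.191.12"}]}},
    {"metadata": {"name": "kube-a-1"},
     "status": {"addresses": [
         {"type": "InternalIP", "address": "192.168.191.22"}]}},
    {"metadata": {"name": "kube-new"}, "status": {}},  # no addresses yet
]

values = {
    "cluster_targets": build_cluster_targets(sample_nodes),
    "external_targets": [{"host": "8.8.8.8", "name": "google-dns"}],
}
print(json.dumps(values))
```

Sorting the list keeps the rendered ConfigMap stable, so the chart only changes when the set of nodes or their addresses actually changes.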
Here is the Python script:

    #!/usr/bin/env python3

    import subprocess
    import prometheus_client
    import re
    import statistics
    import os
    import json
    import glob
    import better_exchook
    import datetime

    better_exchook.install()

    # fping: 30 probes per target (-C 30), 1000 ms between probes (-p 1000), quiet mode (-q)
    FPING_CMDLINE = "/usr/sbin/fping -p 1000 -C 30 -B 1 -q -r 1".split(" ")
    # One output line per target: "<host> : <rtt|-> <rtt|-> ..."
    FPING_REGEX = re.compile(r"^(\S*)\s*: (.*)$", re.MULTILINE)
    CONFIG_PATH = "/config/targets.json"

    registry = prometheus_client.CollectorRegistry()

    prometheus_exceptions_counter = \
        prometheus_client.Counter('kube_node_ping_exceptions', 'Total number of exceptions', [], registry=registry)

    prom_metrics_cluster = {"sent": prometheus_client.Counter('kube_node_ping_packets_sent_total',
                                                      'ICMP packets sent',
                                                      ['destination_node', 'destination_node_ip_address'],
                                                      registry=registry),
                            "received": prometheus_client.Counter('kube_node_ping_packets_received_total',
                                                          'ICMP packets received',
                                                          ['destination_node', 'destination_node_ip_address'],
                                                          registry=registry),
                            "rtt": prometheus_client.Counter('kube_node_ping_rtt_milliseconds_total',
                                                     'round-trip time',
                                                     ['destination_node', 'destination_node_ip_address'],
                                                     registry=registry),
                            "min": prometheus_client.Gauge('kube_node_ping_rtt_min', 'minimum round-trip time',
                                                   ['destination_node', 'destination_node_ip_address'],
                                                   registry=registry),
                            "max": prometheus_client.Gauge('kube_node_ping_rtt_max', 'maximum round-trip time',
                                                   ['destination_node', 'destination_node_ip_address'],
                                                   registry=registry),
                            "mdev": prometheus_client.Gauge('kube_node_ping_rtt_mdev',
                                                    'mean deviation of round-trip times',
                                                    ['destination_node', 'destination_node_ip_address'],
                                                    registry=registry)}


    prom_metrics_external = {"sent": prometheus_client.Counter('external_ping_packets_sent_total',
                                                       'ICMP packets sent',
                                                       ['destination_name', 'destination_host'],
                                                       registry=registry),
                             "received": prometheus_client.Counter('external_ping_packets_received_total',
                                                           'ICMP packets received',
                                                           ['destination_name', 'destination_host'],
                                                           registry=registry),
                             "rtt": prometheus_client.Counter('external_ping_rtt_milliseconds_total',
                                                      'round-trip time',
                                                      ['destination_name', 'destination_host'],
                                                      registry=registry),
                             "min": prometheus_client.Gauge('external_ping_rtt_min', 'minimum round-trip time',
                                                    ['destination_name', 'destination_host'],
                                                    registry=registry),
                             "max": prometheus_client.Gauge('external_ping_rtt_max', 'maximum round-trip time',
                                                    ['destination_name', 'destination_host'],
                                                    registry=registry),
                             "mdev": prometheus_client.Gauge('external_ping_rtt_mdev',
                                                     'mean deviation of round-trip times',
                                                     ['destination_name', 'destination_host'],
                                                     registry=registry)}


    def validate_envs():
        envs = {"MY_NODE_NAME": os.getenv("MY_NODE_NAME"), "PROMETHEUS_TEXTFILE_DIR": os.getenv("PROMETHEUS_TEXTFILE_DIR"),
                "PROMETHEUS_TEXTFILE_PREFIX": os.getenv("PROMETHEUS_TEXTFILE_PREFIX")}

        for k, v in envs.items():
            if not v:
                raise ValueError("{} environment variable is empty".format(k))

        return envs


    @prometheus_exceptions_counter.count_exceptions()
    def compute_results(results):
        computed = {}

        matches = FPING_REGEX.finditer(results)
        for match in matches:
            host = match.group(1)
            ping_results = match.group(2)
            if "duplicate" in ping_results:
                continue
            splitted = ping_results.split(" ")
            if len(splitted) != 30:
                raise ValueError("ping returned wrong number of results: \"{}\"".format(splitted))

            # Lost probes are reported as "-"
            positive_results = [float(x) for x in splitted if x != "-"]
            if len(positive_results) > 0:
                computed[host] = {"sent": 30, "received": len(positive_results),
                                  "rtt": sum(positive_results),
                                  "max": max(positive_results), "min": min(positive_results),
                                  "mdev": statistics.pstdev(positive_results)}
            else:
                computed[host] = {"sent": 30, "received": len(positive_results), "rtt": 0,
                                  "max": 0, "min": 0, "mdev": 0}
        if not len(computed):
            raise ValueError("regex match \"{}\" found nothing in fping output \"{}\"".format(FPING_REGEX, results))
        return computed


    @prometheus_exceptions_counter.count_exceptions()
    def call_fping(ips):
        cmdline = FPING_CMDLINE + ips
        process = subprocess.run(cmdline, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT, universal_newlines=True)
        if process.returncode == 3:
            raise ValueError("invalid arguments: {}".format(cmdline))
        if process.returncode == 4:
            # stderr is merged into stdout above, so process.stderr would be None here
            raise OSError("fping reported syscall error: {}".format(process.stdout))

        return process.stdout


    envs = validate_envs()

    # Remove textfiles left over from a previous run
    files = glob.glob(envs["PROMETHEUS_TEXTFILE_DIR"] + "*")
    for f in files:
        os.remove(f)

    labeled_prom_metrics = {"cluster_targets": [], "external_targets": []}

    while True:
        # Re-read the target list on every cycle so that ConfigMap updates are picked up
        with open(CONFIG_PATH, "r") as f:
            config = json.loads(f.read())
            config["external_targets"] = [] if config["external_targets"] is None else config["external_targets"]
            for target in config["external_targets"]:
                target["name"] = target["host"] if "name" not in target.keys() else target["name"]

        # Drop the label sets of targets that are no longer in the config
        if labeled_prom_metrics["cluster_targets"]:
            for metric in labeled_prom_metrics["cluster_targets"]:
                if (metric["node_name"], metric["ip"]) not in [(node["name"], node["ipAddress"]) for node in config['cluster_targets']]:
                    for k, v in prom_metrics_cluster.items():
                        v.remove(metric["node_name"], metric["ip"])

        if labeled_prom_metrics["external_targets"]:
            for metric in labeled_prom_metrics["external_targets"]:
                if (metric["target_name"], metric["host"]) not in [(target["name"], target["host"]) for target in config['external_targets']]:
                    for k, v in prom_metrics_external.items():
                        v.remove(metric["target_name"], metric["host"])

        labeled_prom_metrics = {"cluster_targets": [], "external_targets": []}

        for node in config["cluster_targets"]:
            metrics = {"node_name": node["name"], "ip": node["ipAddress"], "prom_metrics": {}}

            for k, v in prom_metrics_cluster.items():
                metrics["prom_metrics"][k] = v.labels(node["name"], node["ipAddress"])

            labeled_prom_metrics["cluster_targets"].append(metrics)

        for target in config["external_targets"]:
            metrics = {"target_name": target["name"], "host": target["host"], "prom_metrics": {}}

            for k, v in prom_metrics_external.items():
                metrics["prom_metrics"][k] = v.labels(target["name"], target["host"])

            labeled_prom_metrics["external_targets"].append(metrics)

        out = call_fping([prom_metric["ip"] for prom_metric in labeled_prom_metrics["cluster_targets"]] +
                         [prom_metric["host"] for prom_metric in labeled_prom_metrics["external_targets"]])
        computed = compute_results(out)

        for dimension in labeled_prom_metrics["cluster_targets"]:
            result = computed[dimension["ip"]]
            dimension["prom_metrics"]["sent"].inc(result["sent"])
            dimension["prom_metrics"]["received"].inc(result["received"])
            dimension["prom_metrics"]["rtt"].inc(result["rtt"])
            dimension["prom_metrics"]["min"].set(result["min"])
            dimension["prom_metrics"]["max"].set(result["max"])
            dimension["prom_metrics"]["mdev"].set(result["mdev"])

        for dimension in labeled_prom_metrics["external_targets"]:
            result = computed[dimension["host"]]
            dimension["prom_metrics"]["sent"].inc(result["sent"])
            dimension["prom_metrics"]["received"].inc(result["received"])
            dimension["prom_metrics"]["rtt"].inc(result["rtt"])
            dimension["prom_metrics"]["min"].set(result["min"])
            dimension["prom_metrics"]["max"].set(result["max"])
            dimension["prom_metrics"]["mdev"].set(result["mdev"])

        prometheus_client.write_to_textfile(
            envs["PROMETHEUS_TEXTFILE_DIR"] + envs["PROMETHEUS_TEXTFILE_PREFIX"] + envs["MY_NODE_NAME"] + ".prom", registry)

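The script runs on every Kubernetes node. In each cycle it sends 30 ICMP probes to every instance in the cluster, one probe per target per second (`-C 30 -p 1000`), and stores the collected results in text files.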
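To make the parsing step easier to follow, here is a reduced sketch of what happens to one batch of fping output. It assumes the per-target line format that `fping -C` prints in quiet mode (host, a colon, then one round-trip time per probe, with `-` for a lost probe). The sample text is made up for illustration and uses four probes instead of 30; the `summarize` helper and the `loss` field are not part of the script above.

```python
import re
import statistics

# Same pattern as in the script: "<host> : <rtt|-> <rtt|-> ..."
FPING_REGEX = re.compile(r"^(\S*)\s*: (.*)$", re.MULTILINE)

# Illustrative output only, not captured from a real run
SAMPLE_OUTPUT = """192.168.191.11 : 0.50 1.50 - 1.00
8.8.8.8        : - - - -
"""


def summarize(output):
    """Return per-host probe statistics parsed from fping -C style output."""
    summary = {}
    for match in FPING_REGEX.finditer(output):
        host = match.group(1)
        samples = match.group(2).split()
        rtts = [float(x) for x in samples if x != "-"]  # "-" marks a lost probe
        summary[host] = {
            "sent": len(samples),
            "received": len(rtts),
            "loss": 1 - len(rtts) / len(samples),
            "rtt_avg": statistics.mean(rtts) if rtts else None,
        }
    return summary


for host, stats in summarize(SAMPLE_OUTPUT).items():
    print(host, stats)
```

A target that answers no probes still appears in the result with `received` equal to 0, which is what lets a chart show a node dropping out instead of the series simply disappearing.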