One-Click Deployment with Helm Charts

Overview

Helm is the package manager for Kubernetes, and a Chart is Helm's packaging format. With a Helm chart you can deploy a complete monitoring stack (Prometheus + Grafana + Alertmanager + exporters) in a single command.

Environment Setup

```bash
# Install Helm 3
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify the installation
helm version

# Add commonly used repositories
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```

kube-prometheus-stack (Recommended)

kube-prometheus-stack is a production-grade monitoring solution that deploys Prometheus, Grafana, Alertmanager, and the exporters in one step.

```bash
# Add the repository (skip if already added above)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```

```bash
# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.replicas=2 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set alertmanager.alertmanagerSpec.replicas=2 \
  --set grafana.adminPassword="YourStrongPassword" \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Verify
kubectl get pods -n monitoring
```

values.yaml Core Configuration

```yaml
# Custom values-prometheus.yaml
prometheus:
  prometheusSpec:
    # Data retention
    retention: 30d
    retentionSize: "20Gi"
    # High availability: 2 replicas
    replicas: 2
    # Resource limits
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
    # Persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi
    # Rule discovery (empty selectors match all PrometheusRule resources)
    ruleSelector: {}
    ruleNamespaceSelector: {}
    # Scrape discovery
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    # Additional raw scrape configs
    additionalScrapeConfigs:
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://example.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter:9115

alertmanager:
  alertmanagerSpec:
    replicas: 2
    # Persistence
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi
  # Notification config (note: `config` sits under alertmanager, not alertmanagerSpec)
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'severity']
      group_wait: 10s
      group_interval: 10s
      receiver: 'dingtalk'
      routes:
        - match:
            severity: critical
          receiver: 'dingtalk'
          continue: true
    receivers:
      - name: 'dingtalk'
        webhook_configs:
          - url: 'http://dingtalk-hook:5000/dingtalk'
            send_resolved: true

grafana:
  adminPassword: "ChangeMe123!"
  # Persistence
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: gp3
  # Additional data sources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://prometheus-operated:9090
          isDefault: true
  # Preloaded dashboards (imported from grafana.com by ID)
  dashboards:
    default:
      kubernetes-cluster:
        gnetId: 7249
        revision: 2
      kubernetes-pods:
        gnetId: 6336
        revision: 5
  # A dashboard provider is required for the `dashboards` block above to be mounted
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards/default
  # Auto-import dashboard ConfigMaps via the sidecar
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL
```
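With the dashboard sidecar enabled, any ConfigMap carrying the `grafana_dashboard` label is imported into Grafana automatically. A minimal sketch (the name and the dashboard JSON are placeholders for illustration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard          # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"       # the label the sidecar watches for
data:
  myapp.json: |
    {
      "title": "MyApp Overview",
      "panels": []
    }
```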

Prometheus Operator CRD Configuration

```yaml
prometheusOperator:
  admissionWebhooks:
    enabled: true
    certManager:
      enabled: false
    # TLS certificate generation via the patch job (used when cert-manager is disabled)
    patch:
      enabled: true
```

ServiceMonitor (Automatic Service Discovery)

In the Prometheus Operator architecture, monitoring targets are discovered automatically via ServiceMonitor resources:

```yaml
# myapp-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus serviceMonitorSelector (the chart selects by release label by default)
spec:
  jobLabel: myapp
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: web
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
```
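The ServiceMonitor selects Services by label, so the target Service must carry `app: myapp` and expose a named port `web`. A hypothetical matching Service (name and port number are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp             # matched by the ServiceMonitor's selector
spec:
  selector:
    app: myapp
  ports:
    - name: web            # the port name referenced by the ServiceMonitor endpoint
      port: 8080
      targetPort: 8080
```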

PodMonitor (scraping Pods directly):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pod-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:       # without this, only Pods in the PodMonitor's own namespace are discovered
    matchNames:
      - production
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
```
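A PodMonitor matches Pod labels directly and scrapes a named container port. A sketch of a matching Deployment, assuming the PodMonitor's namespace selection covers `production` (image and port number are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp             # matched by the PodMonitor's selector
    spec:
      containers:
        - name: myapp
          image: myapp:1.0.0   # placeholder image
          ports:
            - name: metrics    # the port name referenced by podMetricsEndpoints
              containerPort: 9090
```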

PrometheusRule (Custom Alerting Rules)

```yaml
# myapp-alertrules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
  labels:
    release: prometheus
spec:
  groups:
    - name: myapp.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate on {{ $labels.instance }} exceeds 5%"
            description: "Current error rate: {{ $value | humanizePercentage }}"
```
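PrometheusRule can also carry recording rules; precomputing the error-rate ratio keeps alert expressions like the one above cheap to evaluate. A sketch (the recorded metric name is an illustrative convention):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-recording-rules
  namespace: production
  labels:
    release: prometheus
spec:
  groups:
    - name: myapp.recording
      interval: 30s
      rules:
        # level:metric:operations naming convention
        - record: myapp:http_error_rate:ratio5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
```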

Upgrade and Rollback

```bash
# Upgrade to a new version
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f values-prometheus.yaml

# View release history
helm history prometheus -n monitoring

# Roll back to the previous revision
helm rollback prometheus -n monitoring

# Roll back to a specific revision
helm rollback prometheus 3 -n monitoring

# Full uninstall (Helm does not remove the operator's CRDs; delete them explicitly)
helm uninstall prometheus -n monitoring
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
# The remaining monitoring.coreos.com CRDs (servicemonitors, podmonitors,
# alertmanagers, ...) can be listed with: kubectl get crd | grep monitoring.coreos.com
```

Persistence Configuration

```yaml
# Using AWS EBS gp3 storage
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi

grafana:
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
```

High-Availability Deployment

```yaml
# values-ha.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "__replica__"
    externalLabels:
      cluster: prod
    # Enable the Thanos sidecar for long-term storage
    thanos:
      version: v0.31.0
      objectStorageConfig:    # Secret holding the object-storage config
        name: thanos-objstore-secret
        key: objstore.yml

alertmanager:
  alertmanagerSpec:
    # With replicas > 1 the operator wires up the gossip mesh between
    # replicas automatically (via --cluster.peer flags); no extra config needed
    replicas: 2
```
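The Thanos sidecar reads its object-storage configuration from a file packaged into a Secret (e.g. created with `kubectl create secret generic thanos-objstore-secret --from-file=objstore.yml -n monitoring`). A hedged sketch of an S3 config; bucket, endpoint, and region are placeholders:

```yaml
# objstore.yml — example Thanos S3 object-storage config
type: S3
config:
  bucket: "thanos-metrics"                 # placeholder bucket name
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  # With IAM roles (IRSA) or instance profiles, static
  # access_key/secret_key entries can be omitted
```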

Helm Command Cheat Sheet

```bash
# List installed releases across all namespaces
helm list -A

# Show a release's values
helm get values prometheus -n monitoring

# Download a chart locally
helm pull prometheus-community/kube-prometheus-stack --untar

# Render templates (no actual install)
helm template prometheus ./prometheus-chart -f values.yaml

# Debug templates
helm install --dry-run --debug prometheus ./prometheus-chart -f values.yaml

# Export default values
helm show values prometheus-community/kube-prometheus-stack > defaults.yaml
```

Production Best Practices

- Namespace isolation: deploy all monitoring components in a dedicated `monitoring` namespace
- Resource quotas: set a LimitRange so no single component can exhaust cluster resources
- Storage class: use production-grade EBS/GCE PD classes such as gp3/pd-standard
- High availability: 2 Prometheus replicas + 2 Alertmanager replicas
- Alert silencing: configure silences through the Alertmanager web UI
- Backups: back up the Prometheus TSDB and Alertmanager state regularly
- Upgrade order: upgrade the CRDs first, then the operator, then the Helm release
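The resource-quota point above can be sketched as a LimitRange for the monitoring namespace (the default values are illustrative; tune them per cluster):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: monitoring-limits
  namespace: monitoring
spec:
  limits:
    - type: Container
      default:             # applied when a container sets no limits
        cpu: "1"
        memory: 1Gi
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
```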