Helm Chart One-Click Deployment
Overview
Helm is the package manager for Kubernetes, and a Chart is Helm's packaging format. With a Helm Chart you can deploy a complete monitoring stack (Prometheus + Grafana + Alertmanager + exporters) in one step.
Prerequisites
# Install Helm 3
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Verify the installation
helm version
Add common repositories
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
kube-prometheus-stack (recommended)
kube-prometheus-stack is a production-grade monitoring solution that deploys Prometheus + Grafana + Alertmanager + exporters in one step.
# Add the repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.replicas=2 \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set alertmanager.alertmanagerSpec.replicas=2 \
--set grafana.adminPassword="YourStrongPassword" \
--set grafana.persistence.enabled=true \
--set grafana.persistence.size=10Gi
Verify
kubectl get pods -n monitoring
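Once the pods are Running, the UIs can be reached with port-forwarding. The service and secret names below are the defaults for a release named `prometheus`; adjust them if your release name differs.

```shell
# Fetch the generated Grafana admin password (secret name: <release>-grafana)
kubectl get secret prometheus-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo

# Forward Grafana (port 80 in-cluster) and Prometheus to localhost
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring &
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &
```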
values.yaml core configuration
# Custom values-prometheus.yaml
prometheus:
  prometheusSpec:
    # Data retention
    retention: 30d
    retentionSize: "20GB"
    # High availability: 2 replicas
    replicas: 2
    # Resource limits
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
    # Persistent storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi
    # Alerting-rule discovery (empty selectors pick up all PrometheusRule objects)
    ruleSelector: {}
    ruleNamespaceSelector: {}
    # Scrape-target discovery
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    # Additional raw scrape configs
    additionalScrapeConfigs:
      - job_name: 'blackbox-http'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://example.com
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - target_label: __address__
            replacement: blackbox-exporter:9115
alertmanager:
  alertmanagerSpec:
    replicas: 2
    # Persistence
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi
  # Notification configuration
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job', 'severity']
      group_wait: 10s
      group_interval: 10s
      receiver: 'dingtalk'
      routes:
        - match:
            severity: critical
          receiver: 'dingtalk'
          continue: true
    receivers:
      - name: 'dingtalk'
        webhook_configs:
          - url: 'http://dingtalk-hook:5000/dingtalk'
            send_resolved: true
grafana:
  adminPassword: "ChangeMe123!"
  # Persistence
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: gp3
  # Additional data sources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: http://prometheus-operated:9090
          isDefault: true
  # Pre-provisioned dashboards (fetched from grafana.com by ID)
  dashboards:
    default:
      kubernetes-cluster:
        gnetId: 7249
        revision: 2
      kubernetes-pods:
        gnetId: 6336
        revision: 5
  # File provider so the dashboards above are mounted and loaded
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards/default
  # Auto-import dashboard ConfigMaps via the sidecar
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL
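With the dashboard sidecar enabled, any ConfigMap carrying the `grafana_dashboard` label is imported automatically. A minimal sketch, with placeholder dashboard JSON:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring      # searchNamespace: ALL means any namespace works
  labels:
    grafana_dashboard: "1"   # the label the sidecar watches for
data:
  myapp.json: |
    {
      "title": "MyApp Overview",
      "panels": []
    }
```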
Prometheus Operator CRD configuration
prometheusOperator:
  admissionWebhooks:
    enabled: true
    certManager:
      enabled: false
    # TLS certificate generation via patch jobs (when cert-manager is disabled)
    patch:
      enabled: true
ServiceMonitor (automatic service discovery)
In the Prometheus Operator architecture, monitoring targets are discovered automatically through ServiceMonitor resources:
# myapp-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: prometheus  # must match the Prometheus serviceMonitorSelector
spec:
  jobLabel: myapp
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: web
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
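A ServiceMonitor selects Services, not Pods: the spec above matches Services labeled `app: myapp` in the `production` namespace that expose a port named `web`. A matching Service would look roughly like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp        # matched by spec.selector.matchLabels
spec:
  selector:
    app: myapp
  ports:
    - name: web       # matched by endpoints[].port (by name, not number)
      port: 8080
      targetPort: 8080
```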
PodMonitor (scraping Pods directly):
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pod-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 15s
PrometheusRule (custom alerting rules)
# myapp-alertrules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
  labels:
    release: prometheus
spec:
  groups:
    - name: myapp.rules
      interval: 30s
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "myapp 5xx error rate above 5%"
            description: "Current error rate: {{ $value | humanizePercentage }}"
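After applying the rule, you can confirm Prometheus actually loaded it (the service name assumes the default release name `prometheus`):

```shell
# The PrometheusRule object should exist and carry the selector label
kubectl get prometheusrules -n production -l release=prometheus

# Check the loaded rules through the Prometheus HTTP API
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &
curl -s http://localhost:9090/api/v1/rules | grep -o MyAppHighErrorRate
```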
Upgrade and rollback
# Upgrade to a new version
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
-f values-prometheus.yaml
View release history
helm history prometheus -n monitoring
Roll back to the previous revision
helm rollback prometheus -n monitoring
Roll back to a specific revision
helm rollback prometheus 3 -n monitoring
Complete uninstall
helm uninstall prometheus -n monitoring
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
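`helm uninstall` never removes CRDs, and the two deletions above cover only part of them. To remove the whole `monitoring.coreos.com` group (CRD names vary slightly across chart versions, so check `kubectl get crd` first):

```shell
for crd in alertmanagerconfigs alertmanagers podmonitors probes \
           prometheusagents prometheuses prometheusrules scrapeconfigs \
           servicemonitors thanosrulers; do
  kubectl delete crd "${crd}.monitoring.coreos.com" --ignore-not-found
done
```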
Persistence configuration
# Use AWS EBS gp3 storage
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 100Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi
grafana:
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
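The `gp3` storage class referenced above is not created automatically on EKS. A sketch of defining it, assuming the AWS EBS CSI driver is installed:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true              # grow PVCs in place as data grows
volumeBindingMode: WaitForFirstConsumer
```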
High-availability deployment
# values-ha.yaml
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "__replica__"
    externalLabels:
      cluster: prod
    # Enable the Thanos sidecar for long-term storage
    thanos:
      version: v0.31.0
      # References a Secret holding the Thanos object storage config
      objectStorageConfig:
        name: thanos-objstore
        key: objstore.yml
alertmanager:
  alertmanagerSpec:
    replicas: 2
    # With 2+ replicas the operator configures gossip-based clustering
    # between the Alertmanager pods automatically
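The Thanos sidecar reads its object storage settings in Thanos `objstore` format, typically mounted from a Secret in the monitoring namespace. A minimal S3 sketch with placeholder bucket and credentials:

```shell
cat > objstore.yml <<'EOF'
type: S3
config:
  bucket: my-thanos-bucket              # placeholder
  endpoint: s3.us-east-1.amazonaws.com
  access_key: <ACCESS_KEY>              # placeholder credentials
  secret_key: <SECRET_KEY>
EOF
kubectl create secret generic thanos-objstore \
  -n monitoring --from-file=objstore.yml
```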
Helm command quick reference
# List installed releases in all namespaces
helm list -A
View a release's values
helm get values prometheus -n monitoring
Download a chart locally
helm pull prometheus-community/kube-prometheus-stack --untar
Render templates (without installing)
helm template prometheus ./kube-prometheus-stack -f values.yaml
Debug the templates
helm install --dry-run --debug prometheus ./kube-prometheus-stack -f values.yaml
Export the default values
helm show values prometheus-community/kube-prometheus-stack > defaults.yaml
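For CI pipelines, `helm upgrade --install` makes deployment idempotent, and pinning `--version` keeps runs reproducible (the version shown is only an example):

```shell
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --version 58.0.0 \
  -f values-prometheus.yaml
```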
Production best practices
- Namespace isolation: deploy all monitoring components in a dedicated monitoring namespace
- Resource quotas: set a LimitRange to keep a single component from exhausting cluster resources
- Storage class: use gp3 (EBS) or pd-standard (GCE PD) storage classes in production
- High availability: run 2 Prometheus replicas + 2 Alertmanager replicas
- Alert silencing: configure silences through the Alertmanager web UI
- Backups: back up the Prometheus TSDB and Alertmanager state regularly
- Upgrade order: upgrade the CRDs first, then the Operator, and finally the Helm release
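For the backup item above, Prometheus can produce consistent TSDB snapshots through its admin API, which is disabled by default (set `prometheusSpec.enableAdminAPI: true` in the chart values first). A sketch assuming the default release name:

```shell
# Trigger a snapshot; it is written to <data-dir>/snapshots inside the pod
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring &
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot

# Copy the snapshot out of the Prometheus pod
kubectl cp monitoring/prometheus-prometheus-kube-prometheus-prometheus-0:/prometheus/snapshots ./snapshots
```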