Grafana Loki Full Deployment Guide (EKS + EBS gp3 + Promtail + Grafana)

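Contents

1. Architecture Overview

2. Prerequisites

3. Storage Configuration

4. Loki Deployment

5. Promtail Deployment

6. External Access

7. Grafana Deployment

8. Quick Verification

9. Dashboard

10. FAQ

11. Production Recommendations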


1) Architecture Overview


+----------------------+           +------------------------------+
| K8s Nodes            |           |  Amazon EBS (gp3)            |
|  (DaemonSet)         |           |  PVC → PV (RWO, xfs)         |
|  Promtail            |  push     +------------------------------+
|  └─ tail /var/log    | ======>   |  Loki (Single Binary)        |
|                      |           |  └─ /var/loki ←─ PVC(EBS)    |
+----------------------+           |  └─ Gateway (LoadBalancer)   |
                                   +------------------------------+
                                              ↑
                                      Grafana / curl / logcli
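  • Promtail: a DaemonSet on every node that collects container logs (/var/log/pods)
  • Loki single binary: one Pod writing to an EBS-backed PVC, using the local filesystem storage mode
  • Gateway: the single entry point; exposed externally through an NLB or an Ingress
  • Grafana: configured with Loki as a data source to visualize logs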

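2) Prerequisites

  • An existing EKS cluster
  • Nodes that can attach EBS volumes; multiple AZs are recommended
  • kubectl and helm installed
  • The EBS CSI Driver installed (the add-on also needs IAM permissions to manage EBS volumes, for example through an IAM role bound to its service account):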

aws eks create-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver
kubectl create ns loki || true
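
Before moving on, it can help to confirm that the CLIs are on PATH and that the add-on has finished installing. The following is a minimal sketch: `missing_tools` and `ebs_csi_status` are hypothetical helper names, and `<cluster>` stays a placeholder for your cluster name.

```shell
# Print the name of every required CLI that is not on PATH
# (prints nothing when all of them are found).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Show the EBS CSI add-on status for a cluster (ACTIVE once it is installed).
ebs_csi_status() {
  aws eks describe-addon --cluster-name "$1" \
    --addon-name aws-ebs-csi-driver \
    --query 'addon.status' --output text
}

# Usage:
#   missing_tools kubectl helm aws
#   ebs_csi_status <cluster>
```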

3) Storage Configuration (EBS gp3)

StorageClass example:


apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-loki
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

kubectl apply -f storageclass-gp3-loki.yaml

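> ⚠️ Common pitfall: volumeBindingMode: WaitForFirstConsumer is required in a multi-AZ cluster. With Immediate binding, the EBS volume can be created in an AZ where the Pod cannot be scheduled, and the Pod then stays unschedulable. The PVC itself remains Pending until a Pod consumes it, which is expected.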
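
The binding mode can be checked after the StorageClass is applied. This is a small sketch rather than a complete check; `sc_binding_mode` and `binding_mode_hint` are hypothetical helper names, and the hint texts are only a summary of the pitfall described here.

```shell
# Print the binding mode of a StorageClass (defaults to gp3-loki from this guide).
sc_binding_mode() {
  kubectl get storageclass "${1:-gp3-loki}" -o jsonpath='{.volumeBindingMode}'
}

# Explain what a binding mode means for a multi-AZ cluster.
binding_mode_hint() {
  case "$1" in
    WaitForFirstConsumer) echo "ok: volume is created in the AZ of the scheduled Pod" ;;
    Immediate)            echo "risk: volume may be created in an AZ the Pod cannot use" ;;
    *)                    echo "unknown binding mode: $1" ;;
  esac
}

# Usage:
#   binding_mode_hint "$(sc_binding_mode gp3-loki)"
#   kubectl get nodes -L topology.kubernetes.io/zone   # which AZ each node is in
```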


4) Loki Deployment (single binary + filesystem + PVC)

Sample values-loki.yaml:


deploymentMode: SingleBinary
read: { replicas: 0 }
write: { replicas: 0 }
backend: { replicas: 0 }

singleBinary:
  replicas: 1
  podSecurityContext:
    fsGroup: 10001
    fsGroupChangePolicy: "OnRootMismatch"
  persistence:
    storageClass: gp3-loki
    accessModes: ["ReadWriteOnce"]
    size: 300Gi

loki:
  storage:
    type: filesystem
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules
    bucketNames:
      chunks: chunks
      ruler: ruler
      admin: admin
  storage_config:
    filesystem:
      directory: /var/loki/tsdb
  commonConfig:
    replication_factor: 1
    path_prefix: /var/loki
  schemaConfig:
    configs:
      - from: "2024-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    retention_period: 168h

chunksCache: { enabled: false }
resultsCache: { enabled: false }
canary: { enabled: false }

gateway:
  service:
    type: LoadBalancer
    port: 80
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
      service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"

Install Loki:


helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki -n loki -f values-loki.yaml

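> ⚠️ Common pitfalls:

> - Versions 3.5.x and later do not support admin_api_directory

> - An unset fsGroup leads to permission errors on the PVC

> - Gateway 500 / empty ring: the single-binary and SimpleScalable settings conflict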
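
After the install, the rollout and the rendered configuration are worth a look before Promtail is pointed at the gateway. The sketch below assumes the release name `loki` from the helm command above, which would give a StatefulSet `loki`, a Deployment `loki-gateway`, and a ConfigMap `loki` with a `config.yaml` key; adjust the names if your chart version renders them differently. The function names are hypothetical.

```shell
# Wait for the single-binary StatefulSet and the gateway Deployment to roll out.
wait_for_loki() {
  ns="${1:-loki}"
  kubectl rollout status statefulset/loki -n "$ns" --timeout=5m &&
  kubectl rollout status deployment/loki-gateway -n "$ns" --timeout=5m
}

# Show the PVC bound to the single-binary Pod and the rendered Loki config.
show_loki_storage() {
  ns="${1:-loki}"
  kubectl get pvc -n "$ns"
  kubectl get configmap loki -n "$ns" -o jsonpath='{.data.config\.yaml}'
}

# Usage:
#   wait_for_loki loki && show_loki_storage loki
```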


5) Promtail Deployment

Sample values-promtail.yaml:


rbac: { create: true }
daemonset: { enabled: true }
tolerations: [ { operator: Exists } ]

config:
  server:
    http_listen_port: 3101
    grpc_listen_port: 0
  positions:
    filename: /run/promtail/positions.yaml
  clients:
    - url: http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push
  scrape_configs:
    - job_name: kubernetes-pods
      pipeline_stages:
        - cri: {}
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: replace
          source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - action: replace
          source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - action: replace
          source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
        - action: replace
          source_labels: [__meta_kubernetes_pod_node_name]
          target_label: node
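        - action: replace
          # Join pod UID and container name with "/" so the glob reaches
          # /var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
          separator: /
          source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
          target_label: __path__
          replacement: /var/log/pods/*$1/*.log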
        - action: labeldrop
          regex: __meta_kubernetes_pod_label_.+
        - action: labeldrop
          regex: __meta_kubernetes_pod_annotation_.+

helm upgrade --install promtail grafana/promtail -n loki -f values-promtail.yaml
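
The kubelet writes container logs to /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log, so the glob that ends up in __path__ has to include the container directory level. The sketch below reproduces that glob against a throwaway directory tree to show what it matches; `pod_log_glob` is a hypothetical helper and every pod, UID, and container name in it is made up.

```shell
# Build the glob Promtail should receive in __path__ for one container.
# Pod log directories are named <namespace>_<pod>_<uid>, hence the leading "*".
pod_log_glob() {
  uid="$1"; container="$2"; root="${3:-/var/log/pods}"
  printf '%s/*%s/%s/*.log\n' "$root" "$uid" "$container"
}

# Recreate the kubelet layout in a temporary directory.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/default_web-0_1234-abcd/nginx"
: > "$demo_root/default_web-0_1234-abcd/nginx/0.log"

pattern="$(pod_log_glob 1234-abcd nginx "$demo_root")"
# $pattern is left unquoted on purpose so the shell expands the glob.
ls $pattern
```

A glob without the container level (for example /var/log/pods/*<uid>/*.log) matches nothing in this layout, which is one way a Promtail target can exist and still ship no lines.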

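> ⚠️ Common pitfalls: the Promtail pod has no curl, so test DNS from a separate pod; 401 Unauthorized usually means multi-tenancy is on (auth_enabled: true) and the client sends no tenant ID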
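
Both pitfalls can be probed from outside the Promtail pod. The following sketch assumes a busybox image tag for the throwaway pod and uses hypothetical helper names; X-Scope-OrgID is the header Loki reads the tenant ID from, and `team-a` is a made-up tenant.

```shell
# Resolve the gateway Service from a short-lived pod in the same namespace.
test_gateway_dns() {
  ns="${1:-loki}"
  kubectl run dns-test --rm -i --restart=Never -n "$ns" --image=busybox:1.36 \
    -- nslookup "loki-gateway.${ns}.svc.cluster.local"
}

# Build the tenant header to send when auth_enabled is true.
tenant_header() {
  printf 'X-Scope-OrgID: %s\n' "$1"
}

# Usage:
#   test_gateway_dns loki
#   curl -H "$(tenant_header team-a)" http://127.0.0.1:3100/loki/api/v1/labels
```

On the Promtail side the equivalent setting is `tenant_id` under the matching `clients` entry.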


6) Loki Gateway / External Access

  • ClusterIP (in-cluster access): loki-gateway.loki.svc.cluster.local
  • LoadBalancer (external access): NLB + scheme=internet-facing
  • Port-forward for debugging:

kubectl port-forward svc/loki-gateway -n loki 3100:80 &
curl -XPOST http://127.0.0.1:3100/loki/api/v1/push -H "Content-Type: application/json" --data-raw '{"streams":[{"stream":{"job":"test"},"values":[["'"$(date +%s)000000000"'","fizzbuzz"]]}]}'
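
To confirm the test line actually landed, it can be read back through the same port-forward. This is a sketch with hypothetical helper names: `loki_push_payload` rebuilds the JSON body used above (values are inserted verbatim, with no JSON escaping), and `loki_query_job` calls the query_range endpoint, which covers the last hour when no time range is given.

```shell
# Build a Loki push payload with one stream and one log line.
# The timestamp is the current time in nanoseconds, passed as a string.
loki_push_payload() {
  job="$1"; line="$2"
  ts="$(date +%s)000000000"
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' \
    "$job" "$ts" "$line"
}

# Read recent lines for a job back through the local port-forward.
loki_query_job() {
  curl -sG "http://127.0.0.1:3100/loki/api/v1/query_range" \
    --data-urlencode "query={job=\"$1\"}" \
    --data-urlencode "limit=5"
}

# Usage:
#   curl -XPOST http://127.0.0.1:3100/loki/api/v1/push \
#     -H "Content-Type: application/json" --data-raw "$(loki_push_payload test fizzbuzz)"
#   loki_query_job test
```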

7) Grafana Deployment and Data Source

Sample values-grafana.yaml:


adminUser: admin
adminPassword: StrongPassw0rd!
service:
  type: LoadBalancer
  port: 80
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
persistence:
  enabled: true
  size: 10Gi
  storageClassName: gp3-loki
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        isDefault: true
        url: http://loki-gateway.loki.svc.cluster.local
        jsonData:
          maxLines: 1000

helm upgrade --install grafana grafana/grafana -n loki -f values-grafana.yaml
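
Because the service is internet-facing, you may prefer to keep the admin password out of the values file. The sketch below generates a random password, stores it in a Secret, and reads back the NLB hostname. `grafana-admin` is a hypothetical Secret name; `admin-user` and `admin-password` are assumed to be the key names the Grafana chart reads when `admin.existingSecret` is set, so check them against your chart version.

```shell
# Generate a random alphanumeric password of the requested length (default 24).
gen_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "${1:-24}"
  echo
}

# Store the admin credentials in a Secret instead of values-grafana.yaml.
create_grafana_admin_secret() {
  kubectl create secret generic grafana-admin -n loki \
    --from-literal=admin-user=admin \
    --from-literal=admin-password="$(gen_password 24)"
}

# Print the NLB hostname once AWS has provisioned the load balancer.
grafana_lb_hostname() {
  kubectl get svc grafana -n loki \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# Usage:
#   create_grafana_admin_secret
#   helm upgrade --install grafana grafana/grafana -n loki \
#     -f values-grafana.yaml --set admin.existingSecret=grafana-admin
#   grafana_lb_hostname
```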

8) Quick Verification


NS=loki
LOKI_SVC=http://loki.${NS}.svc.cluster.local:3100
GW=http://loki-gateway.${NS}.svc.cluster.local

# Run the curl checks from a pod inside the cluster;
# the *.svc.cluster.local names do not resolve elsewhere.

# Loki Ready (expect 200)
kubectl get pods -n $NS
curl -s -o /dev/null -w "%{http_code}\n" ${LOKI_SVC}/ready

# Gateway Ready
curl -s -o /dev/null -w "%{http_code}\n" ${GW}/loki/api/v1/status/buildinfo

# Promtail Logs
kubectl logs -n $NS ds/promtail --since=2m | grep -E '204|error|failed'
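
Loki can take a little while to report ready after a restart, so a single curl may fail spuriously. A small retry wrapper makes the check less flaky. This is a sketch, and both `retry_until` and `http_200` are hypothetical helper names.

```shell
# Run a probe command until it succeeds, up to a number of attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Probe: succeed only when the URL answers with HTTP 200.
http_200() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' "$1")" = "200" ]
}

# Usage (from a pod inside the cluster): 30 attempts, 5 seconds apart
#   retry_until 30 5 http_200 "http://loki.loki.svc.cluster.local:3100/ready"
```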

9) Dashboard / Logs Starter

  • Officially recommended: Grafana Dashboard IDs 15141 / 15324
  • Import steps: Grafana → Dashboards → Import → enter the Dashboard ID → select the Loki data source → Import
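
Once a dashboard is imported, the same labels that Promtail attaches (namespace, pod, container, node) can be used directly in Explore or with logcli. The sketch below builds a LogQL stream selector from key=value pairs; `logql_selector` is a hypothetical helper, and the label values are examples only.

```shell
# Build a LogQL stream selector from key=value pairs,
# e.g. namespace=loki container=loki -> {namespace="loki",container="loki"}
logql_selector() {
  out=""
  for pair in "$@"; do
    key="${pair%%=*}"; value="${pair#*=}"
    out="${out:+$out,}${key}=\"${value}\""
  done
  printf '{%s}\n' "$out"
}

# Usage (through a local port-forward to the gateway):
#   logcli query --addr=http://127.0.0.1:3100 --since=1h --limit=20 \
#     "$(logql_selector namespace=loki container=loki)"
```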

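10) FAQ and Common Pitfalls

| Problem | Cause | Fix |
|---|---|---|
| Promtail 404 / 500 / 502 | Gateway not ready, or a single-binary ring problem | Check the Loki pod logs; restart the Gateway |
| Promtail 401 Unauthorized | Multi-tenancy (auth_enabled) | Disable auth_enabled or configure tenant_id |
| PVC Pending | AZ mismatch or wrong StorageClass | Set volumeBindingMode: WaitForFirstConsumer |
| Loki fails to start | Insufficient permissions | Set fsGroup to 10001 so the volume is readable and writable |
| Dashboard shows no logs | Promtail is not pushing logs | Check that /var/log/pods is mounted into Promtail |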
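
The fixes in the table map onto a handful of kubectl commands. The sketch below only prints them, so it is safe to run anywhere; the symptom keywords are hypothetical, and the resource names assume the helm releases used in this guide.

```shell
# Print the kubectl commands worth running for a given symptom.
diag_commands() {
  symptom="$1"; ns="${2:-loki}"
  case "$symptom" in
    push-errors)
      echo "kubectl logs -n $ns statefulset/loki --tail=100"
      echo "kubectl rollout restart -n $ns deployment/loki-gateway" ;;
    unauthorized)
      echo "kubectl get configmap loki -n $ns -o yaml | grep auth_enabled" ;;
    pvc-pending)
      echo "kubectl describe pvc -n $ns"
      echo "kubectl get storageclass gp3-loki -o jsonpath='{.volumeBindingMode}'" ;;
    no-logs)
      echo "kubectl logs -n $ns ds/promtail --since=5m"
      echo "kubectl exec -n $ns ds/promtail -- ls /var/log/pods" ;;
    *)
      echo "unknown symptom: $symptom" >&2
      return 1 ;;
  esac
}

# Usage:
#   diag_commands pvc-pending
#   diag_commands no-logs loki
```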

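11) Production Recommendations

  • The single binary suits testing and small clusters; for production, consider SimpleScalable + S3
  • Adjust gp3 IOPS and throughput as needed
  • retention_period sets how long logs are kept (168h = 7 days); the compactor only deletes expired data when retention_enabled is true
  • Use labeldrop to keep high-cardinality labels out of the index
  • Check /ready and Promtail's 204 responses on a regular basis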
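
Retention and volume size together bound how much data can be stored per day. The arithmetic below is a rough sketch, not a sizing method: the 80% fill target is an assumption, the result refers to data as stored on disk, and index overhead is ignored. Both function names are hypothetical.

```shell
# Convert a retention_period given in hours (e.g. 168h) to whole days.
retention_days() {
  hours="${1%h}"
  echo $((hours / 24))
}

# Rough daily storage budget in GiB: volume size scaled by a fill target
# (percent, default 80) and divided by the number of retention days.
daily_budget_gib() {
  size_gib="$1"; days="$2"; fill_pct="${3:-80}"
  echo $((size_gib * fill_pct / 100 / days))
}

# Values from this guide: 300Gi volume, 168h retention.
days="$(retention_days 168h)"
echo "retention: ${days} days, budget: $(daily_budget_gib 300 "$days") GiB/day"
```

With the values used in this guide that comes to 7 days and roughly 34 GiB of stored data per day before the volume needs expanding.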

One-Shot Deployment Summary


# 0) EKS + CSI
aws eks create-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver
kubectl create ns loki || true

# 1) StorageClass
kubectl apply -f storageclass-gp3-loki.yaml

# 2) Loki
helm upgrade --install loki grafana/loki -n loki -f values-loki.yaml

# 3) Promtail
helm upgrade --install promtail grafana/promtail -n loki -f values-promtail.yaml

# 4) Grafana
helm upgrade --install grafana grafana/grafana -n loki -f values-grafana.yaml
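
If these steps go into a script, a dry-run switch makes it easier to review what would be executed. The sketch below wraps steps 1 to 4; `run`, `deploy_stack`, and the `DRY_RUN` variable are hypothetical names, and step 0 is left out because it needs the cluster name.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the steps in order and stop at the first failure.
deploy_stack() {
  run kubectl apply -f storageclass-gp3-loki.yaml &&
  run helm upgrade --install loki grafana/loki -n loki -f values-loki.yaml &&
  run helm upgrade --install promtail grafana/promtail -n loki -f values-promtail.yaml &&
  run helm upgrade --install grafana grafana/grafana -n loki -f values-grafana.yaml
}

# Usage:
#   DRY_RUN=1 deploy_stack   # print the commands only
#   deploy_stack             # execute them
```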

Next Steps