AWS ElastiCache Redis

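Overview

Amazon ElastiCache is a managed in-memory data store service that is compatible with Redis and Memcached. This section focuses on ElastiCache Redis: cluster configuration, replication, security settings, and performance tuning. It applies to use cases such as caching, session storage, real-time leaderboards, and message queues.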

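Core use cases:

Caching layer: speeds up database queries and reduces load on RDS

Session storage: distributed sessions for web applications (Spring Session, PHP Session)

Real-time data: leaderboards, counters, real-time analytics

Message queues: asynchronous communication through Pub/Sub and Streams

Distributed locks: mutual exclusion across processes, for example with Redlock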
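The caching-layer use case is commonly implemented with the cache-aside pattern. The following is a minimal sketch, assuming a redis-py style client that exposes `get` and `set(..., ex=...)`. The function name `get_or_load`, the loader callback, and the 300-second default TTL are illustrative choices, not part of the original guide.

```python
import json
from typing import Any, Callable


def get_or_load(client: Any, key: str, loader: Callable[[], dict],
                ttl_seconds: int = 300) -> dict:
    """Cache-aside read: try the cache first, fall back to the loader on a miss."""
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no database access
    value = loader()                       # cache miss: e.g. a query against RDS
    client.set(key, json.dumps(value), ex=ttl_seconds)
    return value
```

With a real client this might be called as `get_or_load(r, 'user:1000', lambda: fetch_user(1000))`, where `fetch_user` stands in for whatever database access the application already has.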

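Architecture and Deployment

Deployment Mode Comparison

Mode	Description	Typical use
Single node	No replica; data is lost if the node fails	Development and testing, caching read-only data
Cluster mode (Cluster Mode)	Sharded cluster; 16384 slots distributed across multiple node groups	High-capacity, high-throughput production workloads
Replication (Cluster Mode Disabled)	1 primary + 1-5 read replicas	Read-heavy workloads that need high availability
Replication (Cluster Mode Enabled)	1 primary + 1-5 replicas per shard	Write-heavy and read-heavy workloads that need to scale by sharding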
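To make the "16384 slots" in the table concrete: Redis Cluster assigns each key to a slot by taking CRC16 (XMODEM variant) of the key modulo 16384, and if the key contains a non-empty `{...}` hash tag only the tag is hashed. The sketch below reimplements that rule in pure Python for illustration; the function names are our own, and in practice the server command `CLUSTER KEYSLOT` gives the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the cluster slot (0..16383) for a key, honoring hash tags."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


print(key_slot("foo"))                                                # 12182
print(key_slot("{user:1000}:profile") == key_slot("{user:1000}:session"))  # True
```

Keys that share a hash tag land in the same slot, which is what allows multi-key operations on them in cluster mode.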

Creating a Redis Cluster (ElastiCache)

# Create a Redis replication group (cluster mode enabled, 3 shards, 1 replica per shard)
# --auth-token requires --transit-encryption-enabled; replace the sample token with your own secret
aws elasticache create-replication-group \
  --replication-group-id prod-redis-cluster \
  --replication-group-description "Production Redis Cluster" \
  --engine redis \
  --engine-version 7.1 \
  --cache-node-type cache.r6g.large \
  --num-node-groups 3 \
  --replicas-per-node-group 1 \
  --cache-parameter-group-name default.redis7.cluster.on \
  --automatic-failover-enabled \
  --multi-az-enabled \
  --network-type ipv4 \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --auth-token "YourSecureAuthToken123!" \
  --cache-subnet-group-name private-redis-subnet \
  --security-group-ids sg-0abcd1234efgh5678 \
  --preferred-maintenance-window mon:04:00-mon:05:00 \
  --notification-topic-arn arn:aws:sns:us-east-1:123456789012:redis-alerts \
  --output json
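Before passing a value to `--auth-token`, it can help to check it locally. The sketch below encodes the constraints as we understand them from the ElastiCache documentation (16 to 128 characters, with special characters limited to `! & # $ ^ < > -`); treat these rules as an assumption and confirm them against the current AWS documentation. The function name is hypothetical.

```python
import string
from typing import List

# Assumed rule set: letters, digits, and this restricted set of specials.
ALLOWED_SPECIALS = set("!&#$^<>-")
ALLOWED_CHARS = set(string.ascii_letters + string.digits) | ALLOWED_SPECIALS


def auth_token_problems(token: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the token passes."""
    problems = []
    if not 16 <= len(token) <= 128:
        problems.append("length must be between 16 and 128 characters")
    bad = sorted({ch for ch in token if ch not in ALLOWED_CHARS})
    if bad:
        problems.append("unsupported characters: " + " ".join(repr(c) for c in bad))
    return problems
```

Under these assumed rules the sample token used in this guide passes, while a token containing `@` or a space would be rejected.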

Creating a Memcached Cluster

# Create a Memcached cluster (node-based; replicas are not supported)
aws elasticache create-cache-cluster \
  --cache-cluster-id prod-memcached \
  --cache-node-type cache.t4g.medium \
  --engine memcached \
  --engine-version 1.6.22 \
  --num-cache-nodes 3 \
  --cache-parameter-group-name default.memcached1.6 \
  --cache-subnet-group-name private-memcached-subnet \
  --security-group-ids sg-0abcd1234efgh5678 \
  --az-mode cross-az \
  --output json

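Connections and Clients

Getting the Connection Endpoints

# Show the replication group's endpoints
aws elasticache describe-replication-groups \
  --replication-group-id prod-redis-cluster \
  --output json

# Example output (abridged):
# Primary EndPoint: prod-redis-cluster.xxxxx.ng.0001.use1.cache.amazonaws.com:6379
# Reader EndPoint: prod-redis-cluster-ro.xxxxx.ng.0001.use1.cache.amazonaws.com:6379
# Note: a cluster-mode-enabled replication group reports a single ConfigurationEndpoint instead.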
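The JSON returned by `describe-replication-groups` can be reduced to the endpoints a client actually needs. This is a small sketch assuming the usual response shape (`ReplicationGroups[0].ConfigurationEndpoint` for cluster mode enabled, `NodeGroups[0].PrimaryEndpoint` and `ReaderEndpoint` otherwise); the function name and the returned dictionary keys are our own.

```python
from typing import Dict, Optional


def extract_endpoints(response: dict) -> Dict[str, Optional[str]]:
    """Pick host:port strings out of a describe-replication-groups response."""
    group = response["ReplicationGroups"][0]

    def fmt(endpoint: Optional[dict]) -> Optional[str]:
        return f"{endpoint['Address']}:{endpoint['Port']}" if endpoint else None

    config = group.get("ConfigurationEndpoint")
    if config:
        # Cluster mode enabled: clients discover shards via this one endpoint.
        return {"configuration": fmt(config), "primary": None, "reader": None}

    # Cluster mode disabled: a single node group with primary and reader endpoints.
    shard = group["NodeGroups"][0]
    return {
        "configuration": None,
        "primary": fmt(shard.get("PrimaryEndpoint")),
        "reader": fmt(shard.get("ReaderEndpoint")),
    }
```

The input would typically come from `json.loads(...)` applied to the CLI output above.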

Redis Connection Example

import redis
from redis.cluster import RedisCluster

# Single-node / replication mode (cluster mode disabled)
r = redis.Redis(
    host='prod-redis-cluster.xxxxx.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    password='YourSecureAuthToken123!',
    ssl=True,
    ssl_cert_reqs='required',
    decode_responses=True,
    socket_connect_timeout=5,
    socket_timeout=5,
    retry_on_timeout=True,
    health_check_interval=30
)

# Cluster mode enabled: pass the replication group's configuration endpoint as host
rc = RedisCluster(
    host='prod-redis-cluster.xxxxx.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    password='YourSecureAuthToken123!',
    ssl=True,
    decode_responses=True,
    # redis-py option; skip_full_coverage_check belongs to the older redis-py-cluster package
    require_full_coverage=False
)

# Test the connections
r.ping()
rc.info()

Connection Pool Configuration

import redis

# Global connection pool (create once and share it, i.e. a singleton)
# On a ConnectionPool, TLS is selected with connection_class rather than ssl=True
pool = redis.ConnectionPool(
    connection_class=redis.SSLConnection,
    host='prod-redis-cluster.xxxxx.ng.0001.use1.cache.amazonaws.com',
    port=6379,
    password='YourSecureAuthToken123!',
    max_connections=50,
    decode_responses=True
)

def get_redis():
    return redis.Redis(connection_pool=pool)

# Usage
client = get_redis()
client.set('key', 'value', ex=3600)

Data Structure Operations

String

r = get_redis()

# Basic operations
r.set('user:1000:session', 'sess_data_xxx', ex=1800)  # expires after 30 minutes
r.get('user:1000:session')
r.set('lock:job:123', 'worker-1', nx=True, ex=30)  # distributed lock: NX plus an expiry so a crashed holder cannot block forever
r.incr('counter:pageviews:20240115')
r.incrbyfloat('metrics:latency:avg', 0.5)
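A lock key needs two safeguards beyond a bare SETNX: an expiry, and a release that only deletes the key if the caller still owns it. The sketch below shows one common single-instance approach (not the multi-node Redlock algorithm), assuming a redis-py style client with `set(..., nx=True, ex=...)` and `eval(script, numkeys, *keys_and_args)`. The function names and the 30-second TTL are illustrative.

```python
import uuid
from typing import Any, Optional

# Delete the key only if it still holds our token (compare-and-delete, run atomically on the server).
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def acquire_lock(client: Any, name: str, ttl_seconds: int = 30) -> Optional[str]:
    """Try to take the lock; return an ownership token on success, None otherwise."""
    token = uuid.uuid4().hex
    if client.set(name, token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client: Any, name: str, token: str) -> bool:
    """Release the lock only if this token still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, name, token) == 1
```

The random token prevents a worker whose lock has already expired from deleting a lock that another worker has since acquired.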

Hash

# Caching user information
r.hset('user:1000', mapping={
    'name': 'Zhang San',
    'email': 'zhang@example.com',
    'tier': 'premium',
    'created_at': '2023-01-15'
})
r.hgetall('user:1000')
r.hget('user:1000', 'email')
r.hincrby('user:1000', 'login_count', 1)

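List

import json
import time

# Message queue / task queue
r.lpush('queue:tasks', '{"job":"email","id":100}')
r.rpop('queue:tasks')
r.brpop('queue:tasks', timeout=5)  # blocking pop, waits up to 5 seconds

# Capped real-time feed
now = int(time.time())
r.lpush('feed:user:1000', json.dumps({'msg': 'hello', 'ts': now}))
r.ltrim('feed:user:1000', 0, 99)  # keep only the newest 100 entries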

Set / Sorted Set


# Tags / de-duplication
r.sadd('article:1000:tags', 'redis', 'cache', 'aws')
r.smembers('article:1000:tags')
r.sismember('article:1000:tags', 'redis')

# Leaderboard
r.zadd('leaderboard:2024', {'player1': 9500, 'player2': 8800, 'player3': 8200})
r.zrevrange('leaderboard:2024', 0, 9, withscores=True)  # top 10
r.zrevrank('leaderboard:2024', 'player2')  # zero-based rank, highest score first

Stream

# Event stream (a lightweight, Kafka-like log)
r.xadd('events:orders', {'order_id': '12345', 'amount': '199.00'}, maxlen=10000, approximate=True)
r.xread({'events:orders': '0'})  # read from the beginning of the stream
r.xread({'events:orders': '$'}, count=10, block=5000)  # block up to 5 seconds for new messages

# Consumer group
r.xgroup_create('events:orders', 'processors', id='0', mkstream=True)  # raises BUSYGROUP if the group already exists
r.xreadgroup('processors', 'worker-1', {'events:orders': '>'}, count=10, block=5000)
r.xack('events:orders', 'processors', '1705312345677-0')

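High Availability and Failover

Automatic Failover

ElastiCache Redis detects a failed primary automatically and typically completes failover within 95 seconds (Cluster Mode Disabled). During failover:

Writes are briefly interrupted (roughly 30-60 seconds)

Reads are served by the replicas, provided the client is configured to read from them (referred to here as PreferReplicaMode)

After failover the endpoint's DNS record is updated automatically, so clients do not need to change their connection configuration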
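Because writes can fail for tens of seconds during failover, clients usually wrap Redis calls in a retry with exponential backoff. The helper below is a generic sketch; its name, defaults, and the injectable `sleep` parameter are our own choices. Note that redis-py raises its own `redis.exceptions.ConnectionError` and `redis.exceptions.TimeoutError`, so with redis-py those classes would be passed as `retry_on` instead of the built-ins used as defaults here.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying on the given exceptions with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise                       # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads reconnects so clients do not stampede the new primary.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

With the defaults the waits grow from about 0.5 s to about 8 s, which adds up to roughly the 30-60 second write interruption described above; tune the numbers to your own latency budget.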

Manual Failover Test

# Simulate a primary node failure (use with great care in production!)
aws elasticache test-failover \
  --replication-group-id prod-redis-cluster \
  --node-group-id 0001

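Global Datastore (Cross-Region Replication)

# Create the global replication group from an existing primary (cross-region disaster recovery)
# The final ID is an AWS-generated prefix plus this suffix; at-rest encryption is set on the member replication groups, not here
aws elasticache create-global-replication-group \
  --global-replication-group-id-suffix prod-redis-global \
  --primary-replication-group-id prod-redis-cluster-use1

# Attach a secondary replication group in another region
aws elasticache create-replication-group \
  --replication-group-id prod-redis-cluster-usw2 \
  --replication-group-description "Secondary for prod-redis-global" \
  --global-replication-group-id "<GlobalReplicationGroupId returned by the previous command>" \
  --region us-west-2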

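Security Configuration

IAM Authentication (Redis 7.0+)

# IAM authentication is configured through ElastiCache users and user groups (RBAC), not through an auth-token flag
# iam-user-01 and prod-redis-users are placeholder names; for IAM users, user-id and user-name must match
aws elasticache create-user \
  --user-id iam-user-01 \
  --user-name iam-user-01 \
  --engine redis \
  --access-string "on ~* +@all" \
  --authentication-mode Type=iam

aws elasticache create-user-group \
  --user-group-id prod-redis-users \
  --engine redis \
  --user-ids default iam-user-01

aws elasticache modify-replication-group \
  --replication-group-id prod-redis-cluster \
  --user-group-ids-to-add prod-redis-users \
  --apply-immediately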

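VPC Security Group Rules

Direction	Protocol	Port	Source / Destination
Inbound	TCP	6379	Application server security group
Inbound	TCP	6379	Bastion host (operations)
Outbound	TCP	443	S3 (if using ElastiCache Valkey)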

Encryption at Rest

# Create a replication group encrypted with a KMS key
# ... other parameters omitted ...
aws elasticache create-replication-group \
  --at-rest-encryption-enabled \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/xxxxx

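Monitoring Metrics

Key CloudWatch Metrics

Metric	Description	Suggested alarm threshold
`CPUUtilization`	Host CPU utilization	Warning above 70%
`EngineCPUUtilization`	CPU used by the Redis process	Warning above 70%
`DatabaseMemoryUsagePercentage`	Memory utilization	Warning above 75%, critical above 90%
`CurrConnections`	Current number of connections	Warning on a sudden jump above 2x baseline
`ReplicationLag`	Replication lag	Warning above 1 second
`CacheHitRate`	Cache hit rate	Warning below 80%
`EvalBasedCmds`	Number of Lua script executions	Optimize if frequent
`ClusterCoordinator`	Cluster coordination status	Investigate if not 1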
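The numeric thresholds in the table can be encoded once and reused by dashboards or scripts. This is an illustrative sketch only: the `THRESHOLDS` mapping copies the suggested values from the table, while the `classify` function and its "ok / warning / critical" labels are our own naming.

```python
# metric -> (direction, warning level, critical level or None), copied from the table
THRESHOLDS = {
    "CPUUtilization": (">", 70.0, None),
    "EngineCPUUtilization": (">", 70.0, None),
    "DatabaseMemoryUsagePercentage": (">", 75.0, 90.0),
    "ReplicationLag": (">", 1.0, None),
    "CacheHitRate": ("<", 80.0, None),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for one metric reading."""
    direction, warning, critical = THRESHOLDS[metric]

    def breached(limit: float) -> bool:
        return value > limit if direction == ">" else value < limit

    if critical is not None and breached(critical):
        return "critical"
    if breached(warning):
        return "warning"
    return "ok"


print(classify("DatabaseMemoryUsagePercentage", 92.0))  # critical
print(classify("CacheHitRate", 75.0))                   # warning
```

Metrics without a fixed number in the table (such as `CurrConnections`, which is relative to a baseline) are left out on purpose.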


# Set a CPU alarm (ElastiCache metrics are published per node, so the dimension is CacheClusterId)
aws cloudwatch put-metric-alarm \
  --alarm-name prod-redis-cpu-high \
  --alarm-description "ElastiCache Redis CPU utilization above 70%" \
  --metric-name EngineCPUUtilization \
  --namespace AWS/ElastiCache \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --dimensions "Name=CacheClusterId,Value=prod-redis-cluster-0001-001" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

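Handling Memory Fragmentation

# List the member nodes via the console or CLI; the fragmentation ratio itself is reported by INFO memory (mem_fragmentation_ratio)
aws elasticache describe-replication-groups \
  --replication-group-id prod-redis-cluster \
  --output json | jq '.ReplicationGroups[0].MemberClusters'

# Reboot the node when the fragmentation ratio stays high (last resort)
aws elasticache reboot-cache-cluster \
  --cache-cluster-id prod-redis-cluster-0001-001 \
  --cache-node-ids-to-reboot "0001"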
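Before resorting to a reboot, the fragmentation ratio can be checked from the fields that `INFO memory` returns. The sketch below computes resident memory divided by logical memory from a dictionary such as the one redis-py's `info('memory')` produces. The 1.5 threshold is a commonly quoted rule of thumb and an assumption here, not an ElastiCache limit.

```python
def fragmentation_ratio(info: dict) -> float:
    """used_memory_rss / used_memory, both taken from INFO memory."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])


def needs_attention(info: dict, threshold: float = 1.5) -> bool:
    """True when the process holds noticeably more memory than the data needs."""
    return fragmentation_ratio(info) > threshold


sample = {"used_memory": 1_000_000, "used_memory_rss": 1_800_000}  # made-up numbers
print(fragmentation_ratio(sample))  # 1.8
print(needs_attention(sample))      # True
```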

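Performance Optimization

Key Design Principles

# Naming convention: purpose:entityID:attribute
user:1000:profile      # Hash, user profile
session:abc123         # String, session data
queue:tasks            # List, task queue
rate:ip:192.168.1.100  # String, rate-limit counter
lock:order:12345       # String, distributed lock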
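A small helper keeps the naming convention consistent across a codebase. This is a hypothetical utility (`make_key` is not part of any library) that joins segments with colons and rejects empty segments or whitespace, which tend to produce hard-to-spot key mismatches.

```python
def make_key(*parts: object) -> str:
    """Join key segments with ':' following the purpose:entityID:attribute convention."""
    if not parts:
        raise ValueError("at least one key segment is required")
    segments = [str(p) for p in parts]
    for segment in segments:
        if not segment or any(ch.isspace() for ch in segment):
            raise ValueError(f"invalid key segment: {segment!r}")
    return ":".join(segments)


print(make_key("user", 1000, "profile"))         # user:1000:profile
print(make_key("rate", "ip", "192.168.1.100"))   # rate:ip:192.168.1.100
```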

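Handling Large Keys

# Avoid large Strings (keep a single value under roughly 10 KB)
# Beyond that, split the data across several smaller Hash keys rather than one huge Hash
NUM_BUCKETS = 100
for i in range(10000):
    r.hset(f'bighash:{i % NUM_BUCKETS}', f'field:{i}', f'value:{i}')

# Scan for large keys (in production, run this against a replica first)
for key in r.scan_iter(match='*', count=100):
    size = r.memory_usage(key)  # bytes, via MEMORY USAGE
    if size and size > 10 * 1024:
        print(key, size)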
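When fields are not numeric, the bucket for a field has to be derived from a stable hash so that reads and writes agree. The sketch below uses `zlib.crc32` for that (Python's built-in `hash()` is randomized per process for strings, so it is unsuitable). The function name and the default of 100 buckets are illustrative.

```python
import zlib


def bucket_key(base: str, field: str, buckets: int = 100) -> str:
    """Deterministically map a field to one of `buckets` smaller Hash keys."""
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base}:{index}"
```

A write would then look like `r.hset(bucket_key('bighash', field), field, value)` and the matching read like `r.hget(bucket_key('bighash', field), field)`; in cluster mode the buckets also spread across shards.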

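Pipelines and Batch Operations

# Use a pipeline to cut network round trips (RTT)
pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC wrapper
for i in range(1000):
    pipe.set(f'key:{i}', f'value:{i}', ex=3600)  # value and TTL in a single command
pipe.execute()  # all 1000 commands are sent together instead of one round trip each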
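For very large batches it is usually better to flush the pipeline in chunks so that neither the client nor the server buffers everything at once. The sketch below assumes a redis-py style client with `pipeline(transaction=False)`; the helper names and the batch size of 500 are our own choices.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple


def chunked(items: Iterable[Tuple[str, str]], size: int) -> Iterator[List[Tuple[str, str]]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_set(client: Any, items: Iterable[Tuple[str, str]],
             ttl_seconds: int = 3600, batch_size: int = 500) -> int:
    """Write key/value pairs with a TTL, one pipeline flush per batch."""
    written = 0
    for batch in chunked(items, batch_size):
        pipe = client.pipeline(transaction=False)
        for key, value in batch:
            pipe.set(key, value, ex=ttl_seconds)
        pipe.execute()                      # one round trip per batch
        written += len(batch)
    return written
```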

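Backup and Restore

# Create a manual snapshot
aws elasticache create-snapshot \
  --replication-group-id prod-redis-cluster \
  --snapshot-name prod-redis-backup-20240115

# Automatic daily snapshots, retained for 7 days
aws elasticache modify-replication-group \
  --replication-group-id prod-redis-cluster \
  --snapshot-retention-limit 7

# Restore from a snapshot by creating a new replication group from it
aws elasticache create-replication-group \
  --replication-group-id prod-redis-restored \
  --replication-group-description "Restored from prod-redis-backup-20240115" \
  --snapshot-name prod-redis-backup-20240115 \
  --cache-node-type cache.r6g.large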

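FAQ

Q: A spike in connections leads to OOM?

A: 1) Check that clients release connections properly (connection pool leaks); 2) set the timeout parameter to close idle connections; 3) make sure clients are not issuing large numbers of expensive commands such as KEYS or MONITOR; 4) scale up to a larger node type.

Q: Cache avalanche?

A: 1) Add a random offset to expiry times (TTL = base + rand(0, 300)); 2) avoid giving every key the same expiry when writing with SETEX; 3) combine a highly available architecture with circuit breaking and graceful degradation.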
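The formula TTL = base + rand(0, 300) from the answer above can be written as a one-line helper. The function name and the optional `rng` parameter (useful for reproducible tests) are our own additions.

```python
import random
from typing import Optional


def ttl_with_jitter(base_seconds: int, max_jitter_seconds: int = 300,
                    rng: Optional[random.Random] = None) -> int:
    """Return base + a random offset in [0, max_jitter_seconds] so keys do not expire together."""
    source = rng if rng is not None else random
    return base_seconds + source.randint(0, max_jitter_seconds)
```

A write such as `r.set(key, value, ex=ttl_with_jitter(3600))` then spreads expirations over a five-minute window instead of a single instant.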

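Q: Does persistence hurt performance?

A: ElastiCache takes RDB snapshots in the background, and AOF append-only writes add very little overhead (under 5%) where the engine version supports AOF. For workloads that are extremely latency-sensitive, persistence can be turned off (pure caching scenarios), at the cost of accepting data loss on failure.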