etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction
Tại Sao Cần Hiểu etcd Internals
Đa số Kubernetes engineers chỉ biết etcd là "key-value store mà Kubernetes dùng". Tuy nhiên, hiểu cách etcd vận hành giúp:
- Debug state inconsistencies — Khi data ở etcd sai với deployed state
- Predict failure modes — Biết etcd quorum cần bao nhiêu nodes, rồi tránh single-point-of-failure
- Tune performance — Watch connection limits, compaction schedule impact latency
- Capacity planning — Biết object size limits, write throughput ceiling
Raft Consensus Algorithm — Foundation
etcd dựa trên Raft, distributed consensus algorithm. Hiểu Raft critical para understand etcd.
Core Concept: State Machine Replication
Desired: All nodes trong etcd cluster maintain IDENTICAL state
Mechanism: Replicate every write → all nodes apply trong same order
Client writes value X
↓
Write to leader
↓
Leader broadcasts "append entry X" to followers
↓
Followers receive, persist to disk, ack
↓
Leader waits for majority ack (quorum)
↓
Leader commits X to state machine
↓
Leader notifies followers: "commit X"
↓
All followers apply X to state machine
↓
Consistency achieved ✓Quorum Requirement
etcd always needs majority (>50%) nodes operational:
Cluster size Min quorum Tolerance
─────────────────────────────────────
1 1 0 nodes fail
3 2 1 node fail
5 3 2 nodes fail
7 4 3 nodes failImplication:
- 1-node cluster: any failure = downtime (避免!)
- 3-node cluster: typical HA, tolerate 1 failure
- 5-node cluster: tolerate 2 failures (expensive, rarely needed)
Why odd numbers?: Quorum untuk 4-node = 3, same sebagai 5-node. Better gunakan 3 atau 5.
Leader Election
At any time, etcd cluster punya exactly 1 leader:
┌────────────────────────────────────┐
│ Leader │
│ - Receives client writes │
│ - Broadcasts to followers │
│ - Commits once majority acks │
└────────────────────────────────────┘
Followers:
- Replicate leader state
- Forward client writes to leader
- Trigger new election if leader dies
New leader elected: ~150-300ms (tunable)GKE context: etcd leader automatically elected oleh GCP, tidak perlu manual intervention.
Replication Mechanics
Log Replication
etcd maintains replicated log — ordered list of write operations:
Leader Log:
┌──────────────────────────────────────┐
│ [1] PUT /pods/pod1 → data1 │
│ [2] PUT /services/svc1 → data2 │
│ [3] DELETE /configmaps/cfg1 → - │
│ [4] PUT /pods/pod2 → data3 │
│ (uncommitted ↓) │
│ [5] PUT /services/svc2 → data4 │
│ [6] (pending replication) │
└──────────────────────────────────────┘
Followers replicate same log, apply same orderSnapshots untuk Efficiency
etcd doesn't keep infinite log. Periodically:
Log entries 1-1000 → Snapshot
├─ All entries applied to state machine
├─ Take snapshot of final state
└─ Discard old log entries
Result:
- Log stays manageable size
- New followers catchup faster via snapshot
- Faster recovery dari crashesGKE etcd compaction: ~every 6 hours (tunable)
Consistency During Replication
Important guarantee: Followers apply entries dalam exact same order sebagai leader:
# Leader has
[ PUT key1=A ]
[ PUT key1=B ]
# Followers akan ALWAYS see in this order
# Never key1=B then key1=A
# This ordering guarantee adalah essential untuk consistencyWatch Mechanism — Event Streaming
How Watch Works (High Level)
Watch adalah subscriber pattern — clients register untuk events:
Client:
kubectl get pods --watch
↓
Opens HTTP connection
↓
Sends: WATCH /pods
↓
API Server:
Forwards ke etcd
↓
Opens watch subscription
↓
etcd:
Streams ADDED/MODIFIED/DELETED events
├─ Object X added → ADDED event
├─ Object X updated → MODIFIED event
└─ Object X deleted → DELETED event
Events flow back ke client in real-timeWatch Caching dalam API Server
Tidak setiap watch langsung dari etcd. API Server maintains watch cache:
Multiple Clients watching /pods
├─ Watch 1 (client A)
├─ Watch 2 (client B)
└─ Watch 3 (client C)
↓
API Server Watch Cache
(shared subscription to etcd)
↓
Single etcd watch
(multiplexed upstream)Efficiency:
- Satu etcd watch untuk 100 API clients
- Significant reduction etcd watch connections
Latency:
- etcd → API Server cache: <10ms typically
- API Server → client: <100ms
Watch Consistency & Ordering
Critical: watch events delivered ordered:
If Pod spec updated: field1=A → field1=B
Client will see:
1. MODIFIED event (field1=A)
2. MODIFIED event (field1=B)
NEVER out-of-order or skippedWatch Connection Reconnection
When watch connection dropped (network interrupt, server restart):
Client:
Connection drops
↓
Reconnect attempt
↓
Send WATCH /pods at resourceVersion=N
↓
API Server:
"Give me events starting from version N"
↓
Streams buffered events
↓
Continue live streamBuffering: API Server buffers ~1000s events jika client temporary disconnected
Compaction — Maintaining Database Size
Problem: Unbounded Growth
Without compaction, etcd database terus grow:
Time 0: PUT key1=A (version 1)
Time 1: PUT key1=B (version 2)
Time 2: PUT key1=C (version 3)
Time 3: PUT key1=D (version 4)
...
Time 1000000: PUT key1=LATEST (version 1000000)
Database stores all versions!
Disk usage: unbounded
Recovery speed: slow (replay all versions)Compaction Operation
Compaction removes old revisions:
# Triggering compaction (GKE does automatically)
etcdctl compact 100000
# Keep revisions > 100000, discard older
# Result:
# Database size reduced
# Can't read old versions anymore
# Faster recoveryGKE automatic compaction: ~every 6 hours, retains last 1 hour of history
Side Effects of Compaction
⚠️ Important: After compaction, old revisions no longer accessible:
# This would fail post-compaction
etcdctl get key1 --rev=50000
# ERROR: revision 50000 not available
# This works
etcdctl get key1 --rev=150000 # > compaction revisionWatch implication:
- If client watches dari revision 50000 pasca-compaction
- etcdctl tidak bisa fulfill (revision deleted)
- etcdctl returns: "watch revision too old"
- Client must reconnect with current revision
Key & Value Size Constraints
Per-Key Limits
| Limit | Value | Impact |
|---|---|---|
| Max key size | ~512 KB | etcd key paths not typically hit limit |
| Max value size | ~1 MB | ConfigMaps/Secrets > 1MB = reject |
| Total db size | Configurable | Default: unlimited, limited by disk |
Implication untuk Kubernetes
# ❌ Will reject - ConfigMap too large
apiVersion: v1
kind: ConfigMap
metadata:
name: huge-config
data:
large-file: |
[... 10 MB of data ...] # REJECT
# Solution: Store dalam object storage, reference only
apiVersion: v1
kind: ConfigMap
metadata:
name: config-ref
data:
storage-url: gs://bucket/fileetcd Database Size
# Check etcd database size (GKE)
kubectl exec -n kube-system etcd-server -- \
du -sh /var/lib/etcd
# Typical values:
# Small cluster: 100 MB - 1 GB
# Medium cluster: 1 GB - 10 GB
# Large cluster: 10 GB - 100 GBPerformance Characteristics
Write Latency
Typical latency per write:
├─ Network latency to leader: 5 ms
├─ Disk fsync (persistent): 10 ms
├─ Leader commits: 5 ms
├─ Response back to client: 5 ms
└─ Total: ~25 ms (p50), ~100 ms (p99)Write Throughput
etcd bottleneck typically disk I/O:
Single leader can handle:
- Typical writes: ~1000 writes/sec
- Peak writes: ~5000 writes/sec (unsustainable)
Throughput limits:
- Storage IOPS capacity
- Network bandwidth
- Quorum replication latencyWatch Latency
Event latency:
- etcd detects change: immediate
- Broadcasts to followers: <5ms
- API Server buffers: <10ms
- Client receives: <100ms totalFailure Modes & Recovery
Node Failure Impact
3-node cluster:
├─ Node 1 fails: Cluster operational (2/3 quorum)
├─ Node 2 fails: Cluster operational (2/3 quorum)
└─ Nodes 1,2 fail: Cluster DOWN (need 2/3)
Data loss: Zero (quorum always preserved)Leader Failure
Leader crash:
Time 0: Leader crashes
↓
Time 100-200ms: Followers detect (heartbeat missing)
↓
Time 300ms: New leader elected
↓
Time 400ms: Cluster operational againClient impact: Pending writes get re-executed, consistency maintained
Network Partition
Network partition:
├─ Partition 1: 2 nodes (includes leader)
├─ Partition 2: 1 node
Partition 1: Quorum intact → OPERATIONAL
Partition 2: No quorum → READ-ONLY (or rejected)
When healed: Partition 2 catches up automaticallyProduction Best Practices
Cluster Size
Recommendation:
- 3-node: Default production (tolerate 1 failure)
- 5-node: Use if frequent updates (faster catchup dari snapshots)
- 1-node: Development only
Monitoring
# Critical metrics
etcd_server_has_leader # Should always be 1
etcd_server_leader_changes_seen_total # Should increase slowly
etcd_disk_backend_commit_duration_seconds # Should be <50ms
etcd_server_proposals_failed_total # Should be 0Backup Strategy
GKE manages backups automatically, tapi validate:
# Verify backups
gcloud container backups describe <backup> \
--format="table(name, state, create_time)"
# Test restore monthlyReference Dokumentasi
- Raft Paper - Extended Version
- etcd Clustering Guide
- etcd Performance Tuning
- GCP etcd Backup Best Practices
Summary
- Raft consensus: etcd basis, guarantees consistency across replicas
- Quorum requirement: Odd-sized clusters, 3-node typical
- Replication log: Maintains ordering of all operations
- Watch mechanism: Efficient event streaming via caching layer
- Compaction: Removes old revisions, improves performance but affects revision history
- Performance: ~1000 writes/sec ceiling, latency ~25ms median
- Failure recovery: Automatic leader election, transparent client failover