Skip to content

etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction

Tại Sao Cần Hiểu etcd Internals

Đa số Kubernetes engineers chỉ biết etcd là "key-value store mà Kubernetes dùng". Tuy nhiên, hiểu cách etcd vận hành giúp:

  1. Debug state inconsistencies — Khi data ở etcd sai với deployed state
  2. Predict failure modes — Biết etcd quorum cần bao nhiêu nodes, rồi tránh single-point-of-failure
  3. Tune performance — Watch connection limits, compaction schedule impact latency
  4. Capacity planning — Biết object size limits, write throughput ceiling

Raft Consensus Algorithm — Foundation

etcd dựa trên Raft, distributed consensus algorithm. Hiểu Raft critical para understand etcd.

Core Concept: State Machine Replication

Desired: All nodes trong etcd cluster maintain IDENTICAL state

Mechanism: Replicate every write → all nodes apply trong same order

       Client writes value X

       Write to leader

       Leader broadcasts "append entry X" to followers

       Followers receive, persist to disk, ack

       Leader waits for majority ack (quorum)

       Leader commits X to state machine

       Leader notifies followers: "commit X"

       All followers apply X to state machine

       Consistency achieved ✓

Quorum Requirement

etcd always needs majority (>50%) nodes operational:

Cluster size   Min quorum   Tolerance
─────────────────────────────────────
1              1            0 nodes fail
3              2            1 node fail
5              3            2 nodes fail
7              4            3 nodes fail

Implication:

  • 1-node cluster: any failure = downtime (避免!)
  • 3-node cluster: typical HA, tolerate 1 failure
  • 5-node cluster: tolerate 2 failures (expensive, rarely needed)

Why odd numbers?: Quorum untuk 4-node = 3, same sebagai 5-node. Better gunakan 3 atau 5.

Leader Election

At any time, etcd cluster punya exactly 1 leader:

┌────────────────────────────────────┐
│ Leader                             │
│ - Receives client writes           │
│ - Broadcasts to followers          │
│ - Commits once majority acks       │
└────────────────────────────────────┘

Followers:
- Replicate leader state
- Forward client writes to leader
- Trigger new election if leader dies

New leader elected: ~150-300ms (tunable)

GKE context: etcd leader automatically elected oleh GCP, tidak perlu manual intervention.


Replication Mechanics

Log Replication

etcd maintains replicated log — ordered list of write operations:

Leader Log:
┌──────────────────────────────────────┐
│ [1] PUT /pods/pod1 → data1          │
│ [2] PUT /services/svc1 → data2      │
│ [3] DELETE /configmaps/cfg1 → -     │
│ [4] PUT /pods/pod2 → data3          │
│ (uncommitted ↓)                      │
│ [5] PUT /services/svc2 → data4      │
│ [6] (pending replication)            │
└──────────────────────────────────────┘

Followers replicate same log, apply same order

Snapshots untuk Efficiency

etcd doesn't keep infinite log. Periodically:

Log entries 1-1000 → Snapshot
├─ All entries applied to state machine
├─ Take snapshot of final state
└─ Discard old log entries

Result:
- Log stays manageable size
- New followers catchup faster via snapshot
- Faster recovery dari crashes

GKE etcd compaction: ~every 6 hours (tunable)

Consistency During Replication

Important guarantee: Followers apply entries dalam exact same order sebagai leader:

yaml
# Leader has
[ PUT key1=A ]
[ PUT key1=B ]

# Followers akan ALWAYS see in this order
# Never key1=B then key1=A

# This ordering guarantee adalah essential untuk consistency

Watch Mechanism — Event Streaming

How Watch Works (High Level)

Watch adalah subscriber pattern — clients register untuk events:

Client:
    kubectl get pods --watch

    Opens HTTP connection

    Sends: WATCH /pods

    API Server:
         Forwards ke etcd

         Opens watch subscription

    etcd:
         Streams ADDED/MODIFIED/DELETED events
         ├─ Object X added → ADDED event
         ├─ Object X updated → MODIFIED event
         └─ Object X deleted → DELETED event
         
    Events flow back ke client in real-time

Watch Caching dalam API Server

Tidak setiap watch langsung dari etcd. API Server maintains watch cache:

Multiple Clients watching /pods
    ├─ Watch 1 (client A)
    ├─ Watch 2 (client B)
    └─ Watch 3 (client C)

        API Server Watch Cache
        (shared subscription to etcd)

        Single etcd watch
        (multiplexed upstream)

Efficiency:

  • Satu etcd watch untuk 100 API clients
  • Significant reduction etcd watch connections

Latency:

  • etcd → API Server cache: <10ms typically
  • API Server → client: <100ms

Watch Consistency & Ordering

Critical: watch events delivered ordered:

If Pod spec updated: field1=A → field1=B

Client will see:
1. MODIFIED event (field1=A)
2. MODIFIED event (field1=B)

NEVER out-of-order or skipped

Watch Connection Reconnection

When watch connection dropped (network interrupt, server restart):

Client:
    Connection drops

    Reconnect attempt

    Send WATCH /pods at resourceVersion=N

    API Server:
    "Give me events starting from version N"

    Streams buffered events

    Continue live stream

Buffering: API Server buffers ~1000s events jika client temporary disconnected


Compaction — Maintaining Database Size

Problem: Unbounded Growth

Without compaction, etcd database terus grow:

Time 0: PUT key1=A    (version 1)
Time 1: PUT key1=B    (version 2)
Time 2: PUT key1=C    (version 3)
Time 3: PUT key1=D    (version 4)
...
Time 1000000: PUT key1=LATEST  (version 1000000)

Database stores all versions!
Disk usage: unbounded
Recovery speed: slow (replay all versions)

Compaction Operation

Compaction removes old revisions:

bash
# Triggering compaction (GKE does automatically)
etcdctl compact 100000
# Keep revisions > 100000, discard older

# Result:
# Database size reduced
# Can't read old versions anymore
# Faster recovery

GKE automatic compaction: ~every 6 hours, retains last 1 hour of history

Side Effects of Compaction

⚠️ Important: After compaction, old revisions no longer accessible:

bash
# This would fail post-compaction
etcdctl get key1 --rev=50000
# ERROR: revision 50000 not available

# This works
etcdctl get key1 --rev=150000  # > compaction revision

Watch implication:

  • If client watches dari revision 50000 pasca-compaction
  • etcdctl tidak bisa fulfill (revision deleted)
  • etcdctl returns: "watch revision too old"
  • Client must reconnect with current revision

Key & Value Size Constraints

Per-Key Limits

LimitValueImpact
Max key size~512 KBetcd key paths not typically hit limit
Max value size~1 MBConfigMaps/Secrets > 1MB = reject
Total db sizeConfigurableDefault: unlimited, limited by disk

Implication untuk Kubernetes

yaml
# ❌ Will reject - ConfigMap too large
apiVersion: v1
kind: ConfigMap
metadata:
  name: huge-config
data:
  large-file: |
    [... 10 MB of data ...]  # REJECT

# Solution: Store dalam object storage, reference only
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-ref
data:
  storage-url: gs://bucket/file

etcd Database Size

bash
# Check etcd database size (GKE)
kubectl exec -n kube-system etcd-server -- \
  du -sh /var/lib/etcd

# Typical values:
# Small cluster: 100 MB - 1 GB
# Medium cluster: 1 GB - 10 GB  
# Large cluster: 10 GB - 100 GB

Performance Characteristics

Write Latency

Typical latency per write:
├─ Network latency to leader: 5 ms
├─ Disk fsync (persistent): 10 ms
├─ Leader commits: 5 ms
├─ Response back to client: 5 ms
└─ Total: ~25 ms (p50), ~100 ms (p99)

Write Throughput

etcd bottleneck typically disk I/O:

Single leader can handle:
- Typical writes: ~1000 writes/sec
- Peak writes: ~5000 writes/sec (unsustainable)

Throughput limits:
- Storage IOPS capacity
- Network bandwidth
- Quorum replication latency

Watch Latency

Event latency:
- etcd detects change: immediate
- Broadcasts to followers: <5ms
- API Server buffers: <10ms
- Client receives: <100ms total

Failure Modes & Recovery

Node Failure Impact

3-node cluster:
├─ Node 1 fails: Cluster operational (2/3 quorum)
├─ Node 2 fails: Cluster operational (2/3 quorum)
└─ Nodes 1,2 fail: Cluster DOWN (need 2/3)

Data loss: Zero (quorum always preserved)

Leader Failure

Leader crash:

Time 0: Leader crashes

Time 100-200ms: Followers detect (heartbeat missing)

Time 300ms: New leader elected

Time 400ms: Cluster operational again

Client impact: Pending writes get re-executed, consistency maintained

Network Partition

Network partition:
├─ Partition 1: 2 nodes (includes leader)
├─ Partition 2: 1 node

Partition 1: Quorum intact → OPERATIONAL
Partition 2: No quorum → READ-ONLY (or rejected)

When healed: Partition 2 catches up automatically

Production Best Practices

Cluster Size

Recommendation:

  • 3-node: Default production (tolerate 1 failure)
  • 5-node: Use if frequent updates (faster catchup dari snapshots)
  • 1-node: Development only

Monitoring

bash
# Critical metrics
etcd_server_has_leader  # Should always be 1
etcd_server_leader_changes_seen_total  # Should increase slowly
etcd_disk_backend_commit_duration_seconds  # Should be <50ms
etcd_server_proposals_failed_total  # Should be 0

Backup Strategy

GKE manages backups automatically, tapi validate:

bash
# Verify backups
gcloud container backups describe <backup> \
  --format="table(name, state, create_time)"

# Test restore monthly

Reference Dokumentasi


Summary

  • Raft consensus: etcd basis, guarantees consistency across replicas
  • Quorum requirement: Odd-sized clusters, 3-node typical
  • Replication log: Maintains ordering of all operations
  • Watch mechanism: Efficient event streaming via caching layer
  • Compaction: Removes old revisions, improves performance but affects revision history
  • Performance: ~1000 writes/sec ceiling, latency ~25ms median
  • Failure recovery: Automatic leader election, transparent client failover