etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction

Tại Sao Cần Hiểu etcd Internals

Đa số Kubernetes engineers chỉ biết etcd là "key-value store mà Kubernetes dùng". Tuy nhiên, hiểu cách etcd vận hành giúp:

Debug state inconsistencies — Khi data ở etcd sai với deployed state
Predict failure modes — Biết etcd quorum cần bao nhiêu nodes, rồi tránh single-point-of-failure
Tune performance — Watch connection limits, compaction schedule impact latency
Capacity planning — Biết object size limits, write throughput ceiling

Raft Consensus Algorithm — Foundation

etcd dựa trên Raft, distributed consensus algorithm. Hiểu Raft critical para understand etcd.

Core Concept: State Machine Replication

Desired: All nodes trong etcd cluster maintain IDENTICAL state

Mechanism: Replicate every write → all nodes apply trong same order

       Client writes value X
            ↓
       Write to leader
            ↓
       Leader broadcasts "append entry X" to followers
            ↓
       Followers receive, persist to disk, ack
            ↓
       Leader waits for majority ack (quorum)
            ↓
       Leader commits X to state machine
            ↓
       Leader notifies followers: "commit X"
            ↓
       All followers apply X to state machine
            ↓
       Consistency achieved ✓

Quorum Requirement

etcd always needs majority (>50%) nodes operational:

Cluster size   Min quorum   Tolerance
─────────────────────────────────────
1              1            0 nodes fail
3              2            1 node fail
5              3            2 nodes fail
7              4            3 nodes fail

Implication:

1-node cluster: any failure = downtime (避免!)
3-node cluster: typical HA, tolerate 1 failure
5-node cluster: tolerate 2 failures (expensive, rarely needed)

Why odd numbers?: Quorum untuk 4-node = 3, same sebagai 5-node. Better gunakan 3 atau 5.

Leader Election

At any time, etcd cluster punya exactly 1 leader:

┌────────────────────────────────────┐
│ Leader                             │
│ - Receives client writes           │
│ - Broadcasts to followers          │
│ - Commits once majority acks       │
└────────────────────────────────────┘

Followers:
- Replicate leader state
- Forward client writes to leader
- Trigger new election if leader dies

New leader elected: ~150-300ms (tunable)

GKE context: etcd leader automatically elected oleh GCP, tidak perlu manual intervention.

Replication Mechanics

Log Replication

etcd maintains replicated log — ordered list of write operations:

Leader Log:
┌──────────────────────────────────────┐
│ [1] PUT /pods/pod1 → data1          │
│ [2] PUT /services/svc1 → data2      │
│ [3] DELETE /configmaps/cfg1 → -     │
│ [4] PUT /pods/pod2 → data3          │
│ (uncommitted ↓)                      │
│ [5] PUT /services/svc2 → data4      │
│ [6] (pending replication)            │
└──────────────────────────────────────┘

Followers replicate same log, apply same order

Snapshots untuk Efficiency

etcd doesn't keep infinite log. Periodically:

Log entries 1-1000 → Snapshot
├─ All entries applied to state machine
├─ Take snapshot of final state
└─ Discard old log entries

Result:
- Log stays manageable size
- New followers catchup faster via snapshot
- Faster recovery dari crashes

GKE etcd compaction: ~every 6 hours (tunable)

Consistency During Replication

Important guarantee: Followers apply entries dalam exact same order sebagai leader:

yaml

# Leader has
[ PUT key1=A ]
[ PUT key1=B ]

# Followers akan ALWAYS see in this order
# Never key1=B then key1=A

# This ordering guarantee adalah essential untuk consistency

Watch Mechanism — Event Streaming

How Watch Works (High Level)

Watch adalah subscriber pattern — clients register untuk events:

Client:
    kubectl get pods --watch
         ↓
    Opens HTTP connection
         ↓
    Sends: WATCH /pods
         ↓
    API Server:
         Forwards ke etcd
         ↓
         Opens watch subscription
         ↓
    etcd:
         Streams ADDED/MODIFIED/DELETED events
         ├─ Object X added → ADDED event
         ├─ Object X updated → MODIFIED event
         └─ Object X deleted → DELETED event
         
    Events flow back ke client in real-time

Watch Caching dalam API Server

Tidak setiap watch langsung dari etcd. API Server maintains watch cache:

Multiple Clients watching /pods
    ├─ Watch 1 (client A)
    ├─ Watch 2 (client B)
    └─ Watch 3 (client C)
            ↓
        API Server Watch Cache
        (shared subscription to etcd)
            ↓
        Single etcd watch
        (multiplexed upstream)

Efficiency:

Satu etcd watch untuk 100 API clients
Significant reduction etcd watch connections

Latency:

etcd → API Server cache: <10ms typically
API Server → client: <100ms

Watch Consistency & Ordering

Critical: watch events delivered ordered:

If Pod spec updated: field1=A → field1=B

Client will see:
1. MODIFIED event (field1=A)
2. MODIFIED event (field1=B)

NEVER out-of-order or skipped

Watch Connection Reconnection

When watch connection dropped (network interrupt, server restart):

Client:
    Connection drops
         ↓
    Reconnect attempt
         ↓
    Send WATCH /pods at resourceVersion=N
         ↓
    API Server:
    "Give me events starting from version N"
         ↓
    Streams buffered events
         ↓
    Continue live stream

Buffering: API Server buffers ~1000s events jika client temporary disconnected

Compaction — Maintaining Database Size

Problem: Unbounded Growth

Without compaction, etcd database terus grow:

Time 0: PUT key1=A    (version 1)
Time 1: PUT key1=B    (version 2)
Time 2: PUT key1=C    (version 3)
Time 3: PUT key1=D    (version 4)
...
Time 1000000: PUT key1=LATEST  (version 1000000)

Database stores all versions!
Disk usage: unbounded
Recovery speed: slow (replay all versions)

Compaction Operation

Compaction removes old revisions:

bash

# Triggering compaction (GKE does automatically)
etcdctl compact 100000
# Keep revisions > 100000, discard older

# Result:
# Database size reduced
# Can't read old versions anymore
# Faster recovery

GKE automatic compaction: ~every 6 hours, retains last 1 hour of history

Side Effects of Compaction

⚠️ Important: After compaction, old revisions no longer accessible:

bash

# This would fail post-compaction
etcdctl get key1 --rev=50000
# ERROR: revision 50000 not available

# This works
etcdctl get key1 --rev=150000  # > compaction revision

Watch implication:

If client watches dari revision 50000 pasca-compaction
etcdctl tidak bisa fulfill (revision deleted)
etcdctl returns: "watch revision too old"
Client must reconnect with current revision

Key & Value Size Constraints

Per-Key Limits

Limit	Value	Impact
Max key size	~512 KB	etcd key paths not typically hit limit
Max value size	~1 MB	ConfigMaps/Secrets > 1MB = reject
Total db size	Configurable	Default: unlimited, limited by disk

Implication untuk Kubernetes

yaml

# ❌ Will reject - ConfigMap too large
apiVersion: v1
kind: ConfigMap
metadata:
  name: huge-config
data:
  large-file: |
    [... 10 MB of data ...]  # REJECT

# Solution: Store dalam object storage, reference only
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-ref
data:
  storage-url: gs://bucket/file

etcd Database Size

bash

# Check etcd database size (GKE)
kubectl exec -n kube-system etcd-server -- \
  du -sh /var/lib/etcd

# Typical values:
# Small cluster: 100 MB - 1 GB
# Medium cluster: 1 GB - 10 GB  
# Large cluster: 10 GB - 100 GB

Performance Characteristics

Write Latency

Typical latency per write:
├─ Network latency to leader: 5 ms
├─ Disk fsync (persistent): 10 ms
├─ Leader commits: 5 ms
├─ Response back to client: 5 ms
└─ Total: ~25 ms (p50), ~100 ms (p99)

Write Throughput

etcd bottleneck typically disk I/O:

Single leader can handle:
- Typical writes: ~1000 writes/sec
- Peak writes: ~5000 writes/sec (unsustainable)

Throughput limits:
- Storage IOPS capacity
- Network bandwidth
- Quorum replication latency

Watch Latency

Event latency:
- etcd detects change: immediate
- Broadcasts to followers: <5ms
- API Server buffers: <10ms
- Client receives: <100ms total

Failure Modes & Recovery

Node Failure Impact

3-node cluster:
├─ Node 1 fails: Cluster operational (2/3 quorum)
├─ Node 2 fails: Cluster operational (2/3 quorum)
└─ Nodes 1,2 fail: Cluster DOWN (need 2/3)

Data loss: Zero (quorum always preserved)

Leader Failure

Leader crash:

Time 0: Leader crashes
     ↓
Time 100-200ms: Followers detect (heartbeat missing)
     ↓
Time 300ms: New leader elected
     ↓
Time 400ms: Cluster operational again

Client impact: Pending writes get re-executed, consistency maintained

Network Partition

Network partition:
├─ Partition 1: 2 nodes (includes leader)
├─ Partition 2: 1 node

Partition 1: Quorum intact → OPERATIONAL
Partition 2: No quorum → READ-ONLY (or rejected)

When healed: Partition 2 catches up automatically

Production Best Practices

Cluster Size

Recommendation:

3-node: Default production (tolerate 1 failure)
5-node: Use if frequent updates (faster catchup dari snapshots)
1-node: Development only

Monitoring

bash

# Critical metrics
etcd_server_has_leader  # Should always be 1
etcd_server_leader_changes_seen_total  # Should increase slowly
etcd_disk_backend_commit_duration_seconds  # Should be <50ms
etcd_server_proposals_failed_total  # Should be 0

Backup Strategy

GKE manages backups automatically, tapi validate:

bash

# Verify backups
gcloud container backups describe <backup> \
  --format="table(name, state, create_time)"

# Test restore monthly

Reference Dokumentasi

Summary

Raft consensus: etcd basis, guarantees consistency across replicas
Quorum requirement: Odd-sized clusters, 3-node typical
Replication log: Maintains ordering of all operations
Watch mechanism: Efficient event streaming via caching layer
Compaction: Removes old revisions, improves performance but affects revision history
Performance: ~1000 writes/sec ceiling, latency ~25ms median
Failure recovery: Automatic leader election, transparent client failover

etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction ​

Tại Sao Cần Hiểu etcd Internals ​

Raft Consensus Algorithm — Foundation ​

Core Concept: State Machine Replication ​

Quorum Requirement ​

Leader Election ​

Replication Mechanics ​

Log Replication ​

Snapshots untuk Efficiency ​

Consistency During Replication ​

Watch Mechanism — Event Streaming ​

How Watch Works (High Level) ​

Watch Caching dalam API Server ​

Watch Consistency & Ordering ​

Watch Connection Reconnection ​

Compaction — Maintaining Database Size ​

Problem: Unbounded Growth ​

Compaction Operation ​

Side Effects of Compaction ​

Key & Value Size Constraints ​

Per-Key Limits ​

Implication untuk Kubernetes ​

etcd Database Size ​

Performance Characteristics ​

Write Latency ​

Write Throughput ​

Watch Latency ​

Failure Modes & Recovery ​

Node Failure Impact ​

Leader Failure ​

Network Partition ​

Production Best Practices ​

Cluster Size ​

Monitoring ​

Backup Strategy ​

Reference Dokumentasi ​

Summary ​

etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction

Tại Sao Cần Hiểu etcd Internals

Raft Consensus Algorithm — Foundation

Core Concept: State Machine Replication

Quorum Requirement

Leader Election

Replication Mechanics

Log Replication

Snapshots untuk Efficiency

Consistency During Replication

Watch Mechanism — Event Streaming

How Watch Works (High Level)

Watch Caching dalam API Server

Watch Consistency & Ordering

Watch Connection Reconnection

Compaction — Maintaining Database Size

Problem: Unbounded Growth

Compaction Operation

Side Effects of Compaction

Key & Value Size Constraints

Per-Key Limits

Implication untuk Kubernetes

etcd Database Size

Performance Characteristics

Write Latency

Write Throughput

Watch Latency

Failure Modes & Recovery

Node Failure Impact

Leader Failure

Network Partition

Production Best Practices

Cluster Size

Monitoring

Backup Strategy

Reference Dokumentasi

Summary