
Controller Reconciliation Loops — Level-Triggered vs Edge-Triggered Design

Why the Reconciliation Pattern Matters

Reconciliation loops are the heart of Kubernetes: they are the mechanism by which the cluster converges on its "desired state". Every controller (Deployment, StatefulSet, Job, custom controllers) runs a reconciliation loop.

Understanding this pattern helps you debug stuck reconciliations, design robust custom controllers, and predict recovery time after failures.


Level-Triggered Design (Kubernetes Default)

Concept

Level-triggered: the controller periodically checks whether the current state matches the desired state.

Desired State: Deployment.spec.replicas = 3

Check current: Pods running = 2

Mismatch detected: 2 ≠ 3

Action: Create 1 Pod

Wait (resync interval)

Check again: Pods = 3

Match! Continue monitoring

Pseudo-code

go
func reconciliationLoop() {
    for {
        // Level-triggered: Check state, don't care how we got here
        desired := getDesiredState()
        actual := getCurrentState()
        
        if desired != actual {
            takeAction(desired, actual)
        }
        
        // Sleep, then check again
        time.Sleep(resyncInterval)
    }
}
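
The comparison step inside the loop can be isolated as a pure function. The following is a minimal sketch, not Kubernetes code: `Action` and `reconcileOnce` are hypothetical names, and replica counts stand in for the full desired and actual state.

```go
package main

import "fmt"

// Action describes what one reconciliation pass decided to do.
type Action struct {
	Create int // number of Pods to create
	Delete int // number of Pods to delete
}

// reconcileOnce compares desired and actual replica counts and returns the
// correction needed. It looks only at the current level of each value,
// never at the history of changes that produced them.
func reconcileOnce(desired, actual int) Action {
	switch {
	case actual < desired:
		return Action{Create: desired - actual}
	case actual > desired:
		return Action{Delete: actual - desired}
	default:
		return Action{}
	}
}

func main() {
	actual := 2
	for pass := 1; pass <= 2; pass++ {
		a := reconcileOnce(3, actual)
		actual += a.Create - a.Delete
		fmt.Printf("pass %d: action=%+v actual=%d\n", pass, a, actual)
	}
}
```

The first pass creates one Pod and the second pass finds nothing to do, which is the convergence behavior the walkthrough above describes.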

Advantages

| Advantage | Impact |
| --- | --- |
| Robust | Missed an event? The next resync catches it |
| Simple | No need to track what changed |
| Idempotent | Re-running is safe |
| Self-healing | Manual deletions are reconciled automatically |

Disadvantages

| Disadvantage | Impact |
| --- | --- |
| High latency | Bounded by the resync interval (10-15 min in the examples here) |
| High overhead | Checks continuously even when nothing changed |
| etcd load | Every resync hits the backend |

Kubernetes Controllers Use Level-Triggered

Deployment Controller (simplified; the real controller manages Pods through ReplicaSets and is also woken by watch events):
    for {
        deployment := getDesiredState()
        replicas := getRunningReplicas()
        want := int(*deployment.Spec.Replicas)  // Spec.Replicas is a *int32

        if len(replicas) < want {
            createPod()
        } else if len(replicas) > want {
            deletePod()
        }

        time.Sleep(15 * time.Minute)  // Resync interval
    }

Edge-Triggered Design

Concept

Edge-triggered: React immediately when state changes

Desired: Deployment.spec.replicas = 3

CHANGE DETECTED: replicas changed 2 → 3

Immediately create Pod (no waiting for resync!)

Done

Pseudo-code

go
func edgeTriggeredLoop() {
    channel := subscribeToChanges()
    
    for change := range channel {
        // React immediately to change
        action := decideAction(change)
        executeAction(action)
    }
}

Advantages

| Advantage | Impact |
| --- | --- |
| Low latency | Reacts immediately |
| Efficient | Only processes when needed |
| Low overhead | Minimal etcd load |

Disadvantages

| Disadvantage | Impact |
| --- | --- |
| Complex | Must track what changed |
| Fragile | Missed event → stuck state |
| Non-idempotent | Re-running may cause issues |

Why Kubernetes Chose Level-Triggered

Edge-triggered problems:

Time 0: Deployment created, event sent to controller
Time 5ms: Controller processes event, creates ReplicaSet and Pods
Time 10ms: Network glitch, event stream interrupted
Time 15ms: Pod crashes, event lost, controller never notified
    → Stuck in the wrong state permanently!

Level-triggered fix:

Time 0: Desired=3, actual=2 → create Pod
Time 5ms: Pod created (actual=3)
Time 10ms: Network glitch, event lost (doesn't matter!)
Time 15ms: Pod crash (actual=2)
Time 900s: Resync happens
    → Detected mismatch: actual=2, desired=3
    → Re-create Pod
    → Fixed!

Conclusion: Level-triggered design is more robust for production systems.
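
The two timelines can be replayed in a few lines of Go. This is an illustrative sketch under simplifying assumptions: `podEvent`, `edgeView`, and `trueCount` are hypothetical names, and a single integer stands in for cluster state.

```go
package main

import "fmt"

// podEvent is a hypothetical change notification: +1 for a Pod that started,
// -1 for a Pod that crashed.
type podEvent struct {
	delta int
	lost  bool // true if the notification never reaches the controller
}

// edgeView returns what an edge-triggered controller believes the Pod count
// is. It starts from a known count and applies only the events it received.
func edgeView(start int, events []podEvent) int {
	believed := start
	for _, e := range events {
		if !e.lost {
			believed += e.delta
		}
	}
	return believed
}

// trueCount returns the real Pod count: every event happened, delivered or not.
func trueCount(start int, events []podEvent) int {
	n := start
	for _, e := range events {
		n += e.delta
	}
	return n
}

func main() {
	desired := 3
	events := []podEvent{
		{delta: +1},             // Pod created, event delivered
		{delta: -1, lost: true}, // Pod crashed, event lost
	}
	believed := edgeView(2, events)
	actual := trueCount(2, events)

	// Edge-triggered: acts on its belief, so it sees nothing to fix.
	fmt.Println("edge-triggered creates:", desired-believed)
	// Level-triggered: re-reads the actual count at resync and repairs it.
	fmt.Println("level-triggered creates:", desired-actual)
}
```

The edge-triggered view drifts from reality as soon as one notification is dropped, while the level-triggered check never depends on the event history at all.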


Failure Modes & Recovery

Failure Mode 1: Controller Crash

Before crash: Pod created, status recorded
Crash happens: Controller dies

Recovery (controller restarts):
    Level-triggered: Checks desired vs actual, re-creates missing resources
    Edge-triggered: Events might be lost permanently

Failure Mode 2: Transient Network Error

Command: kubectl scale deployment myapp --replicas=5
etcd write succeeds
Event generated
Network glitch during event transmission

Level-triggered: Resync detects mismatch, fixes
Edge-triggered: Event lost, no recovery

Failure Mode 3: Partial Update

Deployment spec updated twice in quick succession:
    First: 5 replicas
    Then: 3 replicas

Events sent (update 1, update 2)

Network glitch loses the update 2 event

Controller still believes update 1 (5 replicas) is current

Recovery:

  • Level-triggered: Resync catches, applies correct state
  • Edge-triggered: Stuck at wrong state

Hybrid Patterns in Practice

Pattern 1: Level-Triggered with Event Optimization

go
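// Simplified sketch of the pattern Kubernetes controllers follow
// (real controllers use informers and work queues)

func reconciler() {
    ticker := time.NewTicker(resyncInterval)
    defer ticker.Stop()

    for {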
        select {
        case event := <-eventChannel:
            // Edge-triggered: Process immediately
            reconcile(event.Object)
            
        case <-ticker.C:
            // Level-triggered: Periodic resync
            // Re-check all objects regardless of events
            reconcileAll()
        }
    }
}

Effect:

  • Events processed immediately (fast)
  • Resync catches anything missed (robust)
  • Best of both worlds

Pattern 2: Exponential Backoff Retry

go
// If reconciliation fails, retry with backoff
retryInterval := 100 * time.Millisecond
maxRetries := 10

for attempt := 0; attempt < maxRetries; attempt++ {
    err := reconcile()
    if err == nil {
        break
    }

    time.Sleep(retryInterval)
    retryInterval *= 2  // Exponential backoff: 100ms, 200ms, 400ms, ...
}

Custom Controller Best Practices

Anti-Pattern 1: Pure Edge-Triggered

go
// ❌ BAD - Pure edge-triggered
handler := func(obj interface{}) {
    reconcile(obj)  // Only runs when event arrives
}

// Problem: If event lost or handler crashes mid-reconciliation,
// stuck state is permanent

Anti-Pattern 2: Pure Level-Triggered with Long Resync

go
// ❌ BAD - Level-triggered with a long resync
for {
    reconcileAll()
    time.Sleep(30 * time.Minute)  // Way too long!
}

// Problem: 30 minute latency to detect failures

Pattern: Hybrid (Event + Resync)

go
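// ✅ GOOD - Hybrid approach
// Sketch: assumes queue is a client-go workqueue and store is the
// informer's cache.Store.

handler := func(obj interface{}) {
    // Event-triggered: Fast response
    if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
        queue.Add(key)
    }
}

// Periodic resync: Catch missed events
go func() {
    ticker := time.NewTicker(10 * time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        for _, obj := range store.List() {
            handler(obj)
        }
    }
}()

// Process queue
for {
    key, shutdown := queue.Get()
    if shutdown {
        break
    }
    reconcile(key)
    queue.Done(key)  // Mark the key as processed so it can be re-queued
}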
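
Because both the event handler and the resync add keys to the same queue, the queue should coalesce duplicates so an object is not reconciled twice for the same pending work. The toy queue below illustrates that property only; `dedupQueue` is a hypothetical type, not the client-go implementation, and it is not safe for concurrent use.

```go
package main

import "fmt"

// dedupQueue is a toy FIFO queue that holds each key at most once.
// It is single-goroutine only; a real work queue also needs locking.
type dedupQueue struct {
	order   []string
	pending map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newDedupQueue()
	q.Add("default/myapp") // from an update event
	q.Add("default/myapp") // from the periodic resync, coalesced
	q.Add("default/other")

	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println("reconcile", key)
	}
}
```

Queuing keys rather than objects also means the reconciler always reads the latest state when it runs, which keeps the processing step level-triggered.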

Reconciliation State Machine

Typical State Transitions

Object Created:
    ├─ Initial: ADD handler
    ├─ Reconcile: Create resources
    └─ Done: Monitor

Object Modified:
    ├─ Event: MODIFY handler
    ├─ Reconcile: Update resources
    └─ Done: Continue monitoring

Object Deleted:
    ├─ Event: DELETE handler
    ├─ Reconcile: Clean up resources
    ├─ Finalizers: Wait for cleanup
    └─ Done: Remove from etcd

Idempotency Requirement

Reconciliation must be idempotent:

go
func reconcile(pod *corev1.Pod) error {
    // Must be safe to call 1000 times in a row
    // Either use idempotent operations or detect the already-done state
    
    // Check: already has label?
    if hasLabel(pod, "reconciled") {
        return nil  // Already done
    }
    
    // Idempotent operation
    patch := createLabelPatch("reconciled", "true")
    
    return patchPod(pod, patch)
}
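
To make the distinction concrete, here is a small self-contained contrast between an idempotent update and a non-idempotent one. Both helpers, `ensureLabel` and `appendOwner`, are hypothetical and operate on plain Go values rather than API objects.

```go
package main

import "fmt"

// ensureLabel sets labels[key] = value and reports whether anything changed.
// Calling it again with the same arguments is a no-op.
func ensureLabel(labels map[string]string, key, value string) bool {
	if cur, ok := labels[key]; ok && cur == value {
		return false
	}
	labels[key] = value
	return true
}

// appendOwner is a deliberately non-idempotent counterexample: every call
// grows the slice, so repeated reconciles keep changing the object.
func appendOwner(owners []string, owner string) []string {
	return append(owners, owner)
}

func main() {
	labels := map[string]string{}
	fmt.Println(ensureLabel(labels, "reconciled", "true")) // first call changes state
	fmt.Println(ensureLabel(labels, "reconciled", "true")) // second call does nothing

	var owners []string
	owners = appendOwner(owners, "controller-a")
	owners = appendOwner(owners, "controller-a")
	fmt.Println(len(owners)) // duplicates accumulate
}
```

Returning a "changed" flag also lets a reconciler skip the API write entirely when nothing needs updating.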

Performance Tuning

Resync Frequency Impact

| Interval | Avg latency to catch a missed event | Resync cost (relative to 1 min) | Best For |
| --- | --- | --- | --- |
| 1 min | 0.5 min | 100% | High-churn |
| 5 min | 2.5 min | 20% | Balance |
| 15 min | 7.5 min | ~7% | Stable clusters |
| 60 min | 30 min | ~2% | Rare changes |

Debugging Stuck Reconciliation

Symptom: Pod not created despite Deployment spec

bash
# Check Deployment status
kubectl describe deployment myapp
# Look for: conditions, events

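# Check controller logs (the Deployment controller runs inside kube-controller-manager;
# the label selector below applies to kubeadm-style clusters)
kubectl logs -n kube-system -l component=kube-controller-manager
# Search for: errors, exponential backoff

# Manually trigger a reconcile (any metadata change generates an update event)
kubectl annotate deployment myapp force-resync="$(date +%s)" --overwrite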

Symptom: High CPU from reconciliation

bash
# Profile controller
go tool pprof http://controller:6060/debug/pprof/profile

# Check queue depth
kubectl get --raw /metrics | grep workqueue_depth


Summary

  • Level-triggered: Default Kubernetes pattern - periodic state check
  • Edge-triggered: React immediately to changes (fragile in production)
  • Hybrid is best: Events for speed + resync for robustness
  • Idempotency: Critical - reconcile must be safe to call multiple times
  • Failure recovery: Level-triggered self-heals via resync
  • Tuning: Balance latency vs overhead with resync interval