Kubernetes Informer Pattern — List-Watch Protocol, Local Cache Resync, Re-sync Intervals

Why the Informer Pattern Matters

Practically every Kubernetes controller (Deployments, Services, StatefulSets, custom controllers) uses the informer pattern. This is not optional: it is the foundation of reconciliation loop design in Kubernetes.

Understanding informers helps you:

Write efficient custom controllers
Debug controller performance issues
Understand the memory consumption of controllers
Predict reconciliation latency

The List-Watch Protocol

Concept

An informer uses a two-phase approach:

Phase 1: LIST
  ├─ Get all current objects
  ├─ Build initial cache
  └─ Determine latest resourceVersion

Phase 2: WATCH
  ├─ Stream changes starting from that resourceVersion
  ├─ Update cache with incremental changes
  └─ Trigger handlers for object changes

Why Two Phases?

If only watch (without list):
└─ Miss all changes that happened before the controller started
└─ Incomplete reconciliation

If only list (without watch):
└─ Have current state, but no change notifications
└─ Must poll continuously (inefficient)

List + Watch:
├─ List: Initial state + current resourceVersion
├─ Watch: Incremental updates from that version
└─ Complete + efficient
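
The hand-off between the two phases can be sketched with a small in-memory stand-in for the API server. This is a toy model under stated assumptions: `FakeServer`, `Event`, and `SyncCache` are hypothetical names, not client-go types, and versions are plain integers rather than opaque resourceVersion strings. It shows why WATCH must start from the version LIST returned: changes that land between the two calls are still picked up.

```go
package main

import "fmt"

// Event is a simplified watch event (hypothetical type, not client-go's watch.Event).
type Event struct {
	Type    string // "ADDED", "MODIFIED", or "DELETED"
	Key     string
	Value   string
	Version int
}

// FakeServer keeps current state plus an ordered event log, standing in for the API server.
type FakeServer struct {
	state   map[string]string
	log     []Event
	version int
}

func NewFakeServer() *FakeServer {
	return &FakeServer{state: map[string]string{}}
}

// Apply records a change and bumps the version counter.
func (s *FakeServer) Apply(typ, key, value string) {
	s.version++
	if typ == "DELETED" {
		delete(s.state, key)
	} else {
		s.state[key] = value
	}
	s.log = append(s.log, Event{typ, key, value, s.version})
}

// List returns a snapshot of the current state and the version it reflects.
func (s *FakeServer) List() (map[string]string, int) {
	snap := make(map[string]string, len(s.state))
	for k, v := range s.state {
		snap[k] = v
	}
	return snap, s.version
}

// WatchSince returns every event newer than the given version.
func (s *FakeServer) WatchSince(version int) []Event {
	var out []Event
	for _, e := range s.log {
		if e.Version > version {
			out = append(out, e)
		}
	}
	return out
}

// SyncCache performs LIST, then applies the events WATCH returns after that version.
func SyncCache(s *FakeServer, changesAfterList func()) map[string]string {
	cache, version := s.List()
	changesAfterList() // changes that land between LIST and WATCH
	for _, e := range s.WatchSince(version) {
		if e.Type == "DELETED" {
			delete(cache, e.Key)
		} else {
			cache[e.Key] = e.Value
		}
	}
	return cache
}

func main() {
	s := NewFakeServer()
	s.Apply("ADDED", "default/a", "v1") // exists before the controller starts
	cache := SyncCache(s, func() {
		s.Apply("MODIFIED", "default/a", "v2")
		s.Apply("ADDED", "default/b", "v1")
	})
	fmt.Println(cache["default/a"], cache["default/b"], len(cache)) // v2 v1 2
}
```

The pre-existing object comes from LIST, and both later changes arrive through WATCH, so the cache ends up complete without polling.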

Informer Mechanics

Cache Layer

┌──────────────────────────────┐
│  Informer (per resource)     │
│                              │
│ ┌────────────────────────┐   │
│ │ Reflector              │   │
│ │ - List initial state   │   │
│ │ - Watch changes        │   │
│ └────────────────────────┘   │
│          ↓                   │
│ ┌────────────────────────┐   │
│ │ Local Cache (Indexer)  │   │
│ │ - In-memory objects    │   │
│ │ - Keyed by ns/name     │   │
│ │ - Searchable           │   │
│ └────────────────────────┘   │
│          ↓                   │
│ ┌────────────────────────┐   │
│ │ WorkQueue              │   │
│ │ - Object keys          │   │
│ │ - Retry queue          │   │
│ └────────────────────────┘   │
│                              │
└──────────────────────────────┘
      ↓
   Handler (reconciliation logic)

Lifecycle

// Pseudo-code: how an informer works (simplified; these are not real client-go calls)

informer := NewPodInformer()

// Phase 1: LIST, get the initial state
pods := api.ListPods()
for _, pod := range pods.Items {
    informer.cache.Add(pod)
}
resourceVersion := pods.ResourceVersion // version of the list as a whole

// Phase 2: WATCH, stream updates from that version onward
watch := api.WatchPods(resourceVersion)
for event := range watch.ResultChan() {
    // Queue the "namespace/name" key; a bare name collides across namespaces
    key := event.Object.Namespace + "/" + event.Object.Name
    switch event.Type {
    case "ADDED":
        informer.cache.Add(event.Object)
    case "MODIFIED":
        informer.cache.Update(event.Object)
    case "DELETED":
        informer.cache.Delete(event.Object)
    }
    informer.queue.Add(key)
}

// Phase 3: handler execution (runs concurrently with the watch loop;
// in client-go the queue belongs to the controller, not the informer)
for {
    key := informer.queue.Get()  // Blocking get
    handler(key)  // User's reconciliation logic
}

Resync Mechanism

Why Resync?

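Watch events can be lost, or received but never acted on, in rare cases:

Problem scenarios:
├─ Network hiccup → watch connection dropped mid-stream
├─ API server watch cache expiry → requested resourceVersion too old (410 Gone)
├─ Informer crash → missed batch of events
└─ etcd compaction → old revisions pruned before the watch resumes

Solution: two level-triggered fallbacks. When a watch breaks or its resourceVersion has expired, the reflector re-LISTs and re-WATCHes on its own. Separately, a periodic resync replays every object already in the local cache to the handlers, so a reconciliation that was dropped or failed gets another chance. In client-go, resync itself does not contact the API server.

How Resync Works

Timeline:

T=0s: Informer starts (LIST, then WATCH)
T=0-300s: Watch working, incremental updates
T=300s: Resync period elapses
    └─ Walk every object in the local cache (no API call)
    └─ Deliver each one to UpdateFunc with old == new
    └─ For each object: add to queue
    └─ Handler re-processes everything

T=300-600s: Watch + incremental updates
T=600s: Next resync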
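
As far as client-go's resync path is concerned, a resync replays objects that are already in the local cache rather than fetching them again from the API server. A minimal sketch of that replay, using a hypothetical `Store` type and callback rather than the real informer internals:

```go
package main

import (
	"fmt"
	"sort"
)

// Store is a toy stand-in for the informer's local cache (hypothetical type).
type Store struct {
	objects map[string]string // key -> object
}

// Resync replays every cached key to onUpdate; it never contacts a server.
func (s *Store) Resync(onUpdate func(key string)) {
	keys := make([]string, 0, len(s.objects))
	for k := range s.objects {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example
	for _, k := range keys {
		onUpdate(k)
	}
}

func main() {
	store := &Store{objects: map[string]string{
		"default/a": "spec-a",
		"default/b": "spec-b",
	}}
	var requeued []string
	store.Resync(func(key string) { requeued = append(requeued, key) })
	fmt.Println(requeued) // [default/a default/b]
}
```

Because the replay only sees cached copies, it helps a reconciler retry work it dropped, but it cannot repair a cache that has drifted from the server; that is the reflector's job when it re-lists.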

Resync Interval Configuration

// The factory has no built-in default: the caller passes the resync period (0 disables resync)
informerFactory := informers.NewSharedInformerFactory(clientset, 15*time.Minute)

// Production might be shorter
informerFactory = informers.NewSharedInformerFactory(clientset, 5*time.Minute)

// Per-handler resync period on one informer
// (handler is a placeholder for your cache.ResourceEventHandler)
podInformer := informerFactory.Core().V1().Pods().Informer()
podInformer.AddEventHandlerWithResyncPeriod(handler, 10*time.Minute)

Resync Tradeoff

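Shorter resync interval:
├─ Pro: Faster recovery from dropped or failed reconciliations
└─ Con: Every cached object is re-queued more often (more CPU, plus more API calls if the reconciler reads or writes through the API server)

Longer resync interval:
├─ Pro: Lower steady-state load
└─ Con: Slower recovery if a reconciliation was missed

Typical: 10-15 minutes is a common balance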

Indexing & Search

Built-in Indexes

The informer cache maintains indexes for efficient lookup:

indexer := informer.GetIndexer()

// Get pod by key ("namespace/name")
obj, exists, err := indexer.GetByKey("default/my-pod")

// Get pods by namespace
objs, err := indexer.ByIndex(cache.NamespaceIndex, "default")

// Custom index by owner (must be registered first, see Index Types)
objs, err = indexer.ByIndex("owner", "deployment/my-deployment")

Index Types

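// Typical indexes. Register them before the informer starts.
err := informer.AddIndexers(cache.Indexers{
    // Informers built by the shared factory usually register this one already;
    // registering a duplicate index name returns an error.
    "namespace": func(obj interface{}) ([]string, error) {
        pod := obj.(*corev1.Pod)
        return []string{pod.Namespace}, nil
    },
    "owner": func(obj interface{}) ([]string, error) {
        pod := obj.(*corev1.Pod)
        // One value per owner reference, e.g. "deployment/my-deployment"
        owners := make([]string, 0, len(pod.OwnerReferences))
        for _, ref := range pod.OwnerReferences {
            owners = append(owners, strings.ToLower(ref.Kind)+"/"+ref.Name)
        }
        return owners, nil
    },
})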
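
To make the idea of a secondary index concrete, here is a toy indexer built on nested maps. It is a sketch under stated assumptions: `Pod`, `IndexFunc`, and `Indexer` are simplified hypothetical types, not the client-go ones, and the sketch does not remove stale index entries when an object is updated or deleted.

```go
package main

import "fmt"

// IndexFunc maps an object to the index values it should be filed under.
type IndexFunc func(obj Pod) []string

// Pod is a trimmed-down stand-in for corev1.Pod (hypothetical type).
type Pod struct {
	Namespace, Name, Owner string
}

// Indexer stores objects by key and maintains secondary indexes.
type Indexer struct {
	objects map[string]Pod
	funcs   map[string]IndexFunc
	indexes map[string]map[string]map[string]bool // index name -> value -> set of keys
}

func NewIndexer(funcs map[string]IndexFunc) *Indexer {
	ix := &Indexer{
		objects: map[string]Pod{},
		funcs:   funcs,
		indexes: map[string]map[string]map[string]bool{},
	}
	for name := range funcs {
		ix.indexes[name] = map[string]map[string]bool{}
	}
	return ix
}

// Add stores the pod and files its key under every index value.
func (ix *Indexer) Add(p Pod) {
	key := p.Namespace + "/" + p.Name
	ix.objects[key] = p
	for name, fn := range ix.funcs {
		for _, v := range fn(p) {
			if ix.indexes[name][v] == nil {
				ix.indexes[name][v] = map[string]bool{}
			}
			ix.indexes[name][v][key] = true
		}
	}
}

// ByIndex returns the objects filed under value in the named index.
func (ix *Indexer) ByIndex(name, value string) []Pod {
	var out []Pod
	for key := range ix.indexes[name][value] {
		out = append(out, ix.objects[key])
	}
	return out
}

func main() {
	ix := NewIndexer(map[string]IndexFunc{
		"namespace": func(p Pod) []string { return []string{p.Namespace} },
		"owner":     func(p Pod) []string { return []string{p.Owner} },
	})
	ix.Add(Pod{"default", "web-1", "deployment/web"})
	ix.Add(Pod{"default", "web-2", "deployment/web"})
	ix.Add(Pod{"kube-system", "dns-1", "deployment/dns"})
	fmt.Println(len(ix.ByIndex("namespace", "default")), len(ix.ByIndex("owner", "deployment/dns"))) // 2 1
}
```

The lookup cost depends on the number of matches rather than on the size of the whole cache, which is the reason to index by owner or namespace instead of scanning.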

Event Handlers

Handler Types

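// Register add, update, and delete callbacks
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key)
        }
    },

    UpdateFunc: func(oldObj, newObj interface{}) {
        oldPod := oldObj.(*corev1.Pod)
        newPod := newObj.(*corev1.Pod)

        // Only queue if spec changed (not just status).
        // PodSpec contains slices, so != does not compile; use DeepEqual.
        // Note: this filter also drops resync events, where old and new are identical.
        if !reflect.DeepEqual(oldPod.Spec, newPod.Spec) {
            if key, err := cache.MetaNamespaceKeyFunc(newPod); err == nil {
                queue.Add(key)
            }
        }
    },

    DeleteFunc: func(obj interface{}) {
        // obj may be a cache.DeletedFinalStateUnknown tombstone instead of a *Pod
        if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
            queue.Add(key)
        }
    },
})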

Handler Best Practices

// ❌ Bad: Expensive computation in handler
handler := func(obj interface{}) {
    pod := obj.(*corev1.Pod)
    // Blocking network call here!
    result := expensiveNetworkCall(pod)
    queue.Add(result)
}

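// ✅ Good: Queue immediately, process asynchronously
handler := func(obj interface{}) {
    if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
        queue.Add(key)  // Non-blocking, immediate
    }
}

// Later, in the reconciliation worker:
for {
    key, shutdown := queue.Get()
    if shutdown {
        return
    }
    obj, exists, err := informer.GetIndexer().GetByKey(key.(string))
    if err == nil && exists {
        // Now do expensive work
        expensiveNetworkCall(obj.(*corev1.Pod))
    }
    queue.Done(key)
}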

Shared Informer Factory

Problem: Multiple Controllers

If 10 controllers each create their own Pod informer:
├─ 10 LIST calls (duplicate)
├─ 10 WATCH subscriptions (extra API server load)
├─ 10 local caches (wasted memory)
└─ Inefficient!

Solution: Shared Informer Factory

// Single factory shares informers across controllers
factory := informers.NewSharedInformerFactory(clientset, 15*time.Minute)

// Multiple controllers use same informer
podInformer := factory.Core().V1().Pods().Informer()

// Add multiple handlers to single informer
podInformer.AddEventHandler(controller1Handler)
podInformer.AddEventHandler(controller2Handler)
podInformer.AddEventHandler(controller3Handler)

// Single LIST + WATCH upstream
// All handlers notified about changes
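
The sharing behaviour can be modelled without any Kubernetes dependency. The sketch below assumes a hypothetical `SharedInformer` type with a counter for upstream streams; it is not the client-go implementation, only an illustration of one subscription fanning out to many handlers.

```go
package main

import "fmt"

// SharedInformer is a toy model (hypothetical type): one upstream stream, many handlers.
type SharedInformer struct {
	handlers      []func(key string)
	upstreamOpens int // number of LIST+WATCH streams opened against the server
	started       bool
}

// AddEventHandler registers another consumer without opening a new stream.
func (s *SharedInformer) AddEventHandler(h func(key string)) {
	s.handlers = append(s.handlers, h)
}

// Start opens the single upstream stream; repeated calls are no-ops.
func (s *SharedInformer) Start() {
	if s.started {
		return
	}
	s.started = true
	s.upstreamOpens++
}

// Dispatch fans one upstream event out to every registered handler.
func (s *SharedInformer) Dispatch(key string) {
	for _, h := range s.handlers {
		h(key)
	}
}

func main() {
	inf := &SharedInformer{}
	calls := make([]int, 3)
	for i := range calls {
		i := i // capture the loop variable for the closure
		inf.AddEventHandler(func(string) { calls[i]++ })
	}
	inf.Start()
	inf.Start() // a second controller starting the same informer
	inf.Dispatch("default/my-pod")
	fmt.Println(inf.upstreamOpens, calls) // 1 [1 1 1]
}
```

Three handlers each see the event once, while the upstream counter stays at one no matter how many controllers call Start.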

Memory Efficiency

Single informer with 100 handlers (illustrative figures):
├─ 1 local cache
├─ 1 WATCH subscription
└─ Memory: ~100 MB (all objects in cache)

vs

100 separate informers:
├─ 100 local caches (duplicate data)
├─ 100 WATCH subscriptions
└─ Memory: ~10 GB (100 copies of the same ~100 MB)

Work Queue & Reconciliation

Queue Semantics

// Work queue manages reconciliation ordering
queue := workqueue.NewRateLimitingQueue(
    workqueue.DefaultControllerRateLimiter(),
)

// Add item (a "namespace/name" key)
queue.Add("default/pod-name")

// Get and process
for {
    item, shutdown := queue.Get()
    if shutdown {
        return  // Queue was shut down: stop this worker
    }
    err := reconcile(item)

    if err != nil {
        queue.AddRateLimited(item)  // Retry with backoff
    } else {
        queue.Forget(item)  // Success, reset the backoff counter
    }

    queue.Done(item)  // Always mark the item as finished
}
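
One property of the work queue worth seeing in isolation is de-duplication: a key added several times while it waits is processed once, and a key added while it is being processed is held back until Done. The following single-goroutine sketch models that bookkeeping with a hypothetical `Queue` type; it has no locking, blocking, or rate limiting, so it is an illustration rather than a replacement for the client-go queue.

```go
package main

import "fmt"

// Queue is a toy, single-goroutine model of work queue de-duplication (hypothetical type).
type Queue struct {
	order      []string
	dirty      map[string]bool // keys waiting to be processed
	processing map[string]bool // keys currently being processed
}

func NewQueue() *Queue {
	return &Queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *Queue) Add(key string) {
	if q.dirty[key] {
		return // already pending: collapse duplicates
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // Done will re-queue it once the current run finishes
	}
	q.order = append(q.order, key)
}

// Get pops the next key; ok is false when the queue is empty.
func (q *Queue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key, q.order = q.order[0], q.order[1:]
	delete(q.dirty, key)
	q.processing[key] = true
	return key, true
}

// Done marks key finished and re-queues it if it was added mid-processing.
func (q *Queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting.
func (q *Queue) Len() int { return len(q.order) }

func main() {
	q := NewQueue()
	q.Add("default/a")
	q.Add("default/a")   // duplicate while pending: collapsed
	fmt.Println(q.Len()) // 1

	key, _ := q.Get()
	q.Add(key)           // changed again while being processed
	fmt.Println(q.Len()) // 0: held back until Done
	q.Done(key)
	fmt.Println(q.Len()) // 1: re-queued for another pass
}
```

This is why handlers can enqueue freely on every event: bursts of updates to one object collapse into a single reconciliation of its latest state.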

Rate Limiting

Default: per-item exponential backoff, combined with an overall token-bucket limit (10 items/s, burst of 100)

First failure: 5ms retry
Second failure: 10ms retry
Third failure: 20ms retry
...
Max: 1000s (about 16.7 minutes)
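
The schedule above follows a doubling rule with a cap. A minimal sketch of that arithmetic, assuming the 5ms base and 1000s maximum listed above; `backoff` is a hypothetical helper name, not a client-go function:

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the retry delay after the given number of previous failures:
// base doubled once per failure, capped at maxDelay.
func backoff(base, maxDelay time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay // stop early so the value cannot overflow
		}
	}
	return d
}

func main() {
	base, maxDelay := 5*time.Millisecond, 1000*time.Second
	for _, n := range []int{0, 1, 2, 10, 30} {
		fmt.Printf("failure #%d -> retry after %v\n", n+1, backoff(base, maxDelay, n))
	}
}
```

With these inputs the delay reaches the 1000s cap after 18 doublings, so an item that keeps failing settles at one retry roughly every 16.7 minutes.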

Common Patterns

Pattern 1: Owner Reference Tracking

// Pod belongs to Deployment
// (simplified: in a real cluster a Deployment's Pods are owned by a ReplicaSet)
pod.OwnerReferences = []metav1.OwnerReference{
    {
        APIVersion: "apps/v1",
        Kind: "Deployment",
        Name: "my-deployment",
        UID: "...",
    },
}

// When Deployment handler triggered
handler := func(obj interface{}) {
    deployment := obj.(*appsv1.Deployment)
    // Find all Pods owned by this Deployment (uses the custom "owner" index)
    objs, err := podInformer.GetIndexer().ByIndex("owner", "deployment/"+deployment.Name)
    if err != nil {
        return
    }
    for _, o := range objs {
        pod := o.(*corev1.Pod)
        queue.Add(pod.Namespace + "/" + pod.Name)  // Re-reconcile owned pods
    }
}

Pattern 2: Label-Based Filtering

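// Watch only Pods with app=myapp label
selector := labels.SelectorFromSet(map[string]string{
    "app": "myapp",
})

// The selector must be applied when the factory is built;
// an informer does not filter unless its LIST and WATCH options are tweaked
factory := informers.NewSharedInformerFactoryWithOptions(
    clientset,
    15*time.Minute,
    informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
        opts.LabelSelector = selector.String()
    }),
)

podInformer := factory.Core().V1().Pods().Informer()
// The API server now filters: only matching Pods reach the cache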

Performance Tuning

Memory Usage

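Cache memory ≈ sum of object sizes (rough figures; actual size depends on the objects):

Small cluster (100 pods): ~10 MB
Medium cluster (1000 pods): ~100 MB
Large cluster (10k pods): ~1-5 GB

Optimization: Narrow what the informer lists and watches, for example with a field selector:

// Only watch Pods in the "default" namespace
factory := informers.NewSharedInformerFactoryWithOptions(clientset, 15*time.Minute,
    informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
        opts.FieldSelector = fields.OneTermEqualSelector("metadata.namespace", "default").String()
    }),
    // informers.WithNamespace("default") achieves the same scoping more directly
)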

CPU Usage

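Resync impact (illustrative figures; profile your own controller):

Resync every 15 minutes:
├─ LIST call (only when the reflector has to re-list): ~1s CPU
├─ Cache update: ~0.5s CPU
├─ Queue flush: ~1s CPU
└─ Total: ~2-3s CPU per resync

Resync frequency impact (share of one core, at 2-3s CPU per resync):
- Every 1 minute: ~3-5% CPU overhead
- Every 10 minutes: 0.3-0.5% CPU overhead
- Every 30 minutes: 0.1-0.2% CPU overhead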
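
The overhead figures come from dividing CPU time per resync by the resync interval. A small sketch of that arithmetic, taking the 2-3s per resync estimate above as an assumption; `overheadPercent` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"time"
)

// overheadPercent is the share of one CPU core spent on resync, in percent.
func overheadPercent(cpuPerResync, interval time.Duration) float64 {
	return 100 * cpuPerResync.Seconds() / interval.Seconds()
}

func main() {
	intervals := []time.Duration{time.Minute, 10 * time.Minute, 30 * time.Minute}
	for _, interval := range intervals {
		low := overheadPercent(2*time.Second, interval)
		high := overheadPercent(3*time.Second, interval)
		fmt.Printf("every %v: %.2f%% to %.2f%% of one core\n", interval, low, high)
	}
}
```

At a one-minute interval this works out to roughly 3.3 to 5 percent of a core, and the cost falls in proportion as the interval grows.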

Troubleshooting

Issue 1: Informer Cache Stale

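Symptom: Reconciler sees old object state
Cause: Watch stalled or disconnected, or workers started before the initial LIST finished

Solution:
- Wait for the initial sync before starting workers: cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
- Check watch connectivity
- Do not rely on a shorter resync here: resync replays the cached copy, it does not refresh it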

Issue 2: High Reconciliation Latency

Symptom: Object changes, but reconciliation is delayed by more than 30s
Cause: Queue backlog or slow handler

Solution:
- Profile handler performance
- Increase concurrency (multiple workers)
- Reduce work per reconciliation
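
Increasing concurrency usually means running several workers against the same queue. The sketch below uses a plain channel and a hypothetical `runWorkers` helper to show the shape of that loop. Unlike a real work queue, a channel does not de-duplicate keys or prevent two workers from handling the same key at once, so treat it as an outline of the worker pool only.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers drains keys with n concurrent workers and returns how many keys were reconciled.
func runWorkers(n int, keys []string, reconcile func(key string)) int {
	queue := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				reconcile(key)
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	for _, k := range keys {
		queue <- k // blocks until some worker is free
	}
	close(queue) // lets the workers exit their range loops
	wg.Wait()
	return processed
}

func main() {
	keys := []string{"default/a", "default/b", "default/c", "default/d"}
	n := runWorkers(2, keys, func(key string) { /* reconcile one object here */ })
	fmt.Println(n) // 4
}
```

More workers shorten the backlog only when reconciliations are independent and the bottleneck is the handler rather than the API server.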

Summary

List-Watch protocol: Initial state + incremental updates = complete reconciliation
Informer cache: Local in-memory cache, indexed for efficient lookup
Resync: Periodic replay of cached objects to handlers, to recover from dropped or failed reconciliations (the reflector re-LISTs separately when a watch breaks)
Shared factory: Multiple handlers share a single informer
Work queue: Manages reconciliation ordering with retry logic
Memory/CPU tradeoff: Shorter resync = faster recovery but higher cost
