Kubernetes Informer Pattern — List-Watch Protocol, Local Cache Resync, Re-sync Intervals
Why the Informer Pattern Matters
Every Kubernetes controller (Deployments, Services, StatefulSets, custom controllers) uses the informer pattern. It is not optional: it is the foundation of reconciliation loop design in Kubernetes.
Understanding informers helps you:
- Write efficient custom controllers
- Debug controller performance issues
- Understand the memory consumption of controllers
- Predict reconciliation latency
The List-Watch Protocol
Concept
An informer uses a two-phase approach:
Phase 1: LIST
├─ Get all current objects
├─ Build initial cache
└─ Determine latest resourceVersion
Phase 2: WATCH
├─ Stream changes starting from that resourceVersion
├─ Update cache with incremental changes
└─ Trigger handlers for object changes
Why Two Phases?
With WATCH only (no LIST):
├─ Misses every change that happened before the controller started
└─ Incomplete reconciliation
With LIST only (no WATCH):
├─ Has the current state, but no change notifications
└─ Must poll continuously (inefficient)
List + Watch:
├─ List: Initial state + current version
├─ Watch: Incremental updates from that version
└─ Complete and efficient
Informer Mechanics
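The list-then-watch sequence can be modelled in a few lines. The sketch below is a toy, in-memory stand-in and not the client-go API: `fakeAPI`, `Event`, and `syncCache` are hypothetical names, and the watch is simplified to a slice of events instead of a stream.

```go
package main

import "fmt"

// Event is a simplified watch event; the type is illustrative only.
type Event struct {
	Type            string // "ADDED", "MODIFIED" or "DELETED"
	Key             string // namespace/name
	Value           string // stand-in for the object body
	ResourceVersion int
}

// fakeAPI stands in for the API server: current state plus an event log.
type fakeAPI struct {
	objects map[string]string
	version int
	log     []Event
}

// List returns a snapshot and the resourceVersion it corresponds to.
func (a *fakeAPI) List() (map[string]string, int) {
	snap := make(map[string]string, len(a.objects))
	for k, v := range a.objects {
		snap[k] = v
	}
	return snap, a.version
}

// Watch returns every event newer than the given resourceVersion.
func (a *fakeAPI) Watch(since int) []Event {
	var out []Event
	for _, e := range a.log {
		if e.ResourceVersion > since {
			out = append(out, e)
		}
	}
	return out
}

// apply records a change on the server side and bumps the version.
func (a *fakeAPI) apply(typ, key, value string) {
	a.version++
	if typ == "DELETED" {
		delete(a.objects, key)
	} else {
		a.objects[key] = value
	}
	a.log = append(a.log, Event{typ, key, value, a.version})
}

// syncCache performs LIST, then applies WATCH events from that version on.
func syncCache(api *fakeAPI, changesAfterList func()) map[string]string {
	cache, rv := api.List()
	changesAfterList() // simulate changes that arrive after the LIST
	for _, e := range api.Watch(rv) {
		if e.Type == "DELETED" {
			delete(cache, e.Key)
		} else {
			cache[e.Key] = e.Value
		}
	}
	return cache
}

func main() {
	api := &fakeAPI{objects: map[string]string{}}
	api.apply("ADDED", "default/a", "v1") // exists before the controller starts
	cache := syncCache(api, func() {
		api.apply("MODIFIED", "default/a", "v2")
		api.apply("ADDED", "default/b", "v1")
	})
	fmt.Println(len(cache), cache["default/a"], cache["default/b"]) // 2 v2 v1
}
```

The object created before startup is picked up by LIST, and the later changes arrive through WATCH because it resumes from the listed resourceVersion; neither phase alone would produce the full cache.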
Cache Layer
┌──────────────────────────────┐
│ Informer (per resource) │
│ │
│ ┌────────────────────────┐ │
│ │ Reflector │ │
│ │ - List initial state │ │
│ │ - Watch changes │ │
│ └────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────┐ │
│ │ Local Cache (Indexer) │ │
│ │ - In-memory objects │ │
│ │ - Keyed by ns/name │ │
│ │ - Searchable │ │
│ └────────────────────────┘ │
│ ↓ │
│ ┌────────────────────────┐ │
│ │ WorkQueue │ │
│ │ - Object keys │ │
│ │ - Retry queue │ │
│ └────────────────────────┘ │
│ │
└──────────────────────────────┘
↓
Handler (reconciliation logic)
Lifecycle
go
// Pseudo-code: how an informer works (names are illustrative, not the client-go API)
informer := NewPodInformer()

// Phase 1: LIST, get the initial state
podList := api.ListPods()
for _, pod := range podList.Items {
	informer.cache.Add(pod)
}
resourceVersion := podList.ResourceVersion

// Phase 2: WATCH, stream updates from that resourceVersion.
// Runs in its own goroutine so the worker loop below can run concurrently.
go func() {
	watch := api.WatchPods(resourceVersion)
	for event := range watch.ResultChan() {
		key := event.Object.Namespace + "/" + event.Object.Name
		switch event.Type {
		case "ADDED":
			informer.cache.Add(event.Object)
		case "MODIFIED":
			informer.cache.Update(event.Object)
		case "DELETED":
			informer.cache.Delete(event.Object)
		}
		informer.queue.Add(key) // every event type enqueues the object's key
	}
}()

// Phase 3: handler execution
for {
	key := informer.queue.Get() // blocks until a key is available
	handler(key)                // user's reconciliation logic
}
Resync Mechanism
Why Resync?
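The Watch API is not 100% reliable; events can be lost in rare cases:
Problem scenarios:
├─ Network hiccup → watch connection dropped, events missed
├─ API server watch cache expiry → old events discarded
├─ Informer crash or restart → events during the downtime missed
└─ etcd compaction → requested resourceVersion pruned (410 Gone)
Solution: the reflector re-LISTs and re-WATCHes when the stream breaks, and periodic resync adds a level-triggered fallback by re-delivering cached state to handlers
How Resync Works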
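As a rough model of a resync: in client-go's shared informers, a resync replays the objects already held in the local cache to the handlers and does not issue a new LIST against the API server. The sketch below illustrates only that replay; `store` and `resync` are hypothetical names, and the sorting exists purely to make the example deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a stand-in for the informer's local cache (key -> object).
type store map[string]string

// resync replays every cached key to the handler without contacting any
// server, and returns how many keys were replayed.
func resync(s store, handle func(key string)) int {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, for the example only
	for _, k := range keys {
		handle(k)
	}
	return len(keys)
}

func main() {
	cache := store{"default/a": "v2", "default/b": "v1"}
	var seen []string
	n := resync(cache, func(key string) { seen = append(seen, key) })
	fmt.Println(n, seen) // 2 [default/a default/b]
}
```

Because every object is handed to the handler again regardless of whether it changed, a reconciler that compares desired and actual state gets a second chance at anything it mishandled earlier.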
Timeline (5-minute resync period):
T=0s: Informer starts
T=0-300s: Watch working, incremental updates
T=300s: Resync triggers
├─ No API call: the objects already in the local cache are replayed
├─ UpdateFunc is called for each object (old and new are identical)
├─ For each object: add its key to the queue
└─ Handler re-processes everything
T=300-600s: Watch + incremental updates
T=600s: Next resync
Resync Interval Configuration
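go
// The factory has no built-in default: the resync period is whatever you pass
// (0 disables resync). 15 minutes is a common choice:
informerFactory := informers.NewSharedInformerFactory(clientset, 15*time.Minute)
// Production might use a shorter period:
// informerFactory := informers.NewSharedInformerFactory(clientset, 5*time.Minute)
// Per-handler resync period on a single informer
// (handler is your own cache.ResourceEventHandler):
podInformer := informerFactory.Core().V1().Pods().Informer()
podInformer.AddEventHandlerWithResyncPeriod(handler, 10*time.Minute)
Resync Tradeoff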
Shorter resync interval:
├─ Pro: Faster recovery from missed or mishandled events
└─ Con: More handler invocations and queue churn, plus API load if reconciliation calls the API server
Longer resync interval:
├─ Pro: Lower steady-state load
└─ Con: Longer recovery if events were missed
Typical: 10-15 minutes is a common balance
Indexing & Search
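To make the idea of an indexed cache concrete, here is a minimal sketch of a store with secondary indexes. It is a simplified model and not client-go's `cache.Indexer`: `Pod`, `IndexFunc`, `Indexer`, and `NewIndexer` are hypothetical stand-ins, and updates or deletes (which would need stale index entries removed) are left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Pod is a trimmed-down stand-in for corev1.Pod.
type Pod struct {
	Namespace, Name, Owner string
}

// IndexFunc maps an object to the index values it is filed under.
type IndexFunc func(p Pod) []string

// Indexer keeps objects by key plus secondary indexes:
// index name -> index value -> set of object keys.
type Indexer struct {
	items   map[string]Pod
	funcs   map[string]IndexFunc
	indexes map[string]map[string]map[string]bool
}

func NewIndexer(funcs map[string]IndexFunc) *Indexer {
	return &Indexer{
		items:   map[string]Pod{},
		funcs:   funcs,
		indexes: map[string]map[string]map[string]bool{},
	}
}

// Add stores the pod under namespace/name and files it in every index.
func (ix *Indexer) Add(p Pod) {
	key := p.Namespace + "/" + p.Name
	ix.items[key] = p
	for name, fn := range ix.funcs {
		if ix.indexes[name] == nil {
			ix.indexes[name] = map[string]map[string]bool{}
		}
		for _, v := range fn(p) {
			if ix.indexes[name][v] == nil {
				ix.indexes[name][v] = map[string]bool{}
			}
			ix.indexes[name][v][key] = true
		}
	}
}

// ByIndex returns the sorted keys filed under value in the named index.
func (ix *Indexer) ByIndex(name, value string) []string {
	var keys []string
	for k := range ix.indexes[name][value] {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ix := NewIndexer(map[string]IndexFunc{
		"namespace": func(p Pod) []string { return []string{p.Namespace} },
		"owner":     func(p Pod) []string { return []string{p.Owner} },
	})
	ix.Add(Pod{"default", "web-1", "Deployment/web"})
	ix.Add(Pod{"default", "web-2", "Deployment/web"})
	ix.Add(Pod{"kube-system", "dns-1", "Deployment/dns"})
	fmt.Println(strings.Join(ix.ByIndex("owner", "Deployment/web"), ","))
	// default/web-1,default/web-2
}
```

The lookup touches only the keys filed under one index value, which is why an index beats scanning the whole cache once the object count grows.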
Built-in Indexes
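The informer cache maintains indexes for efficient lookup:
go
indexer := informer.GetIndexer()
// Get pod by key (namespace/name)
obj, exists, err := indexer.GetByKey("default/my-pod")
// Get pods by namespace (index registered by factory-created informers)
pods, err := indexer.ByIndex(cache.NamespaceIndex, "default")
// Custom index by owner (requires an "owner" IndexFunc to be registered)
owned, err := indexer.ByIndex("owner", "Deployment/my-deployment")
Index Types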
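go
// Typical custom index. AddIndexers returns an error if an index name is
// already taken, and some client-go versions reject it once the informer has
// started. Factory-built informers already register the "namespace" index
// (cache.NamespaceIndex with cache.MetaNamespaceIndexFunc), so only the
// custom one is added here.
err := informer.AddIndexers(cache.Indexers{
	"owner": func(obj interface{}) ([]string, error) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			return nil, fmt.Errorf("unexpected type %T", obj)
		}
		// One index value per owner reference, formatted as Kind/Name
		keys := make([]string, 0, len(pod.OwnerReferences))
		for _, ref := range pod.OwnerReferences {
			keys = append(keys, ref.Kind+"/"+ref.Name)
		}
		return keys, nil
	},
})
Event Handlers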
Handler Types
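go
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	// AddFunc: called when an object is added (including every object from the initial LIST)
	AddFunc: func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key) // namespace/name
		}
	},
	// UpdateFunc: called on changes and on every resync
	UpdateFunc: func(oldObj, newObj interface{}) {
		oldPod := oldObj.(*corev1.Pod)
		newPod := newObj.(*corev1.Pod)
		// Only queue if spec changed (not just status). PodSpec contains
		// slices, so != does not compile; use reflect.DeepEqual instead.
		// Note: this filter also drops resync events, where old and new are identical.
		if !reflect.DeepEqual(oldPod.Spec, newPod.Spec) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		}
	},
	// DeleteFunc: obj may be a cache.DeletedFinalStateUnknown tombstone,
	// so use the key function that unwraps it instead of a type assertion.
	DeleteFunc: func(obj interface{}) {
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
})
Handler Best Practices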
go
// ❌ Bad: expensive computation in the handler
badHandler := func(obj interface{}) {
	pod := obj.(*corev1.Pod)
	// A blocking network call here stalls event delivery for this handler!
	result := expensiveNetworkCall(pod)
	queue.Add(result)
}
// ✅ Good: queue the key immediately, process asynchronously
goodHandler := func(obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		queue.Add(key) // Non-blocking, immediate
	}
}
// Later, in the reconciliation worker:
for {
	item, shutdown := queue.Get()
	if shutdown {
		return
	}
	obj, exists, err := informer.GetIndexer().GetByKey(item.(string))
	if err == nil && exists {
		// Now do the expensive work
		expensiveNetworkCall(obj.(*corev1.Pod))
	}
	queue.Done(item)
}
Shared Informer Factory
Problem: Multiple Controllers
If 10 controllers each create their own Pod informer:
├─ 10 LIST calls (duplicate)
├─ 10 WATCH subscriptions (extra API server load)
├─ 10 local caches (wasted memory)
└─ Inefficient!
Solution: Shared Informer Factory
go
// Single factory shares informers across controllers
factory := informers.NewSharedInformerFactory(clientset, 15*time.Minute)
// Multiple controllers use the same informer
podInformer := factory.Core().V1().Pods().Informer()
// Add multiple handlers to the single informer
podInformer.AddEventHandler(controller1Handler)
podInformer.AddEventHandler(controller2Handler)
podInformer.AddEventHandler(controller3Handler)
// Start the requested informers and wait for the initial LIST to fill the cache
factory.Start(stopCh)
factory.WaitForCacheSync(stopCh)
// Single LIST + WATCH upstream
// All handlers are notified about changes
Memory Efficiency
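The saving comes from storing each object once and fanning every event out to all registered handlers. A toy model of that fan-out follows; `sharedInformer` and `onEvent` are hypothetical names for illustration, not client-go types.

```go
package main

import "fmt"

// sharedInformer is a toy model: one cache, one upstream feed, many handlers.
type sharedInformer struct {
	cache    map[string]string
	handlers []func(key string)
}

// AddEventHandler registers another consumer on the same informer.
func (s *sharedInformer) AddEventHandler(h func(key string)) {
	s.handlers = append(s.handlers, h)
}

// onEvent stores the object once, then notifies every registered handler.
func (s *sharedInformer) onEvent(key, value string) {
	s.cache[key] = value
	for _, h := range s.handlers {
		h(key)
	}
}

func main() {
	inf := &sharedInformer{cache: map[string]string{}}
	counts := make([]int, 3) // one counter per simulated controller
	for i := range counts {
		i := i
		inf.AddEventHandler(func(string) { counts[i]++ })
	}
	inf.onEvent("default/a", "v1")
	inf.onEvent("default/b", "v1")
	fmt.Println(len(inf.cache), counts) // 2 [2 2 2]
}
```

Three handlers each see both events, yet the cache holds only two objects; adding handlers grows the notification work but not the stored data.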
Single informer with 100 handlers:
├─ 1 local cache
├─ 1 WATCH subscription
└─ Memory: ~100 MB (all objects in cache)
vs
100 separate informers:
├─ 100 local caches (duplicate data)
├─ 100 WATCH subscriptions
└─ Memory: ~10 GB (100x overhead!)
Work Queue & Reconciliation
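A controller work queue holds object keys and collapses duplicates, so a burst of events for one object leads to a single reconciliation. The sketch below is a simplified, single-goroutine model of that behaviour using "dirty" and "processing" sets; the `queue` type and its methods are assumptions made for illustration and omit the locking, blocking `Get`, and shutdown handling a real queue needs.

```go
package main

import "fmt"

// queue is a toy, single-goroutine model of a deduplicating work queue.
type queue struct {
	order      []string        // keys in the order they will be handed out
	dirty      map[string]bool // keys that need (re)processing
	processing map[string]bool // keys currently handed to a worker
}

func newQueue() *queue {
	return &queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues a key unless it is already waiting.
func (q *queue) Add(key string) {
	if q.dirty[key] {
		return // already waiting: collapse the duplicate
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // Done will re-queue it once the current run finishes
	}
	q.order = append(q.order, key)
}

// Get hands out the next key; ok is false when the queue is empty.
func (q *queue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key, q.order = q.order[0], q.order[1:]
	delete(q.dirty, key)
	q.processing[key] = true
	return key, true
}

// Done marks processing finished and re-queues the key if it was added
// again while it was being processed.
func (q *queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

// Len reports how many keys are waiting to be handed out.
func (q *queue) Len() int { return len(q.order) }

func main() {
	q := newQueue()
	q.Add("default/a")
	q.Add("default/a") // duplicate while waiting: collapsed
	q.Add("default/b")
	fmt.Println(q.Len()) // 2

	key, _ := q.Get()    // default/a
	q.Add(key)           // changed again while being processed
	fmt.Println(q.Len()) // 1: only default/b is waiting
	q.Done(key)
	fmt.Println(q.Len()) // 2: default/b, then default/a again
}
```

Deferring the re-add until `Done` means the same key is never handed to two workers at once, which keeps reconciliation of a single object serialized.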
Queue Semantics
go
// Work queue manages reconciliation ordering
queue := workqueue.NewRateLimitingQueue(
	workqueue.DefaultControllerRateLimiter(),
)
// Add item (use the namespace/name key)
queue.Add("default/pod-name")
// Get and process
for {
	item, shutdown := queue.Get()
	if shutdown {
		return // queue.ShutDown() was called
	}
	err := reconcile(item)
	if err != nil {
		queue.AddRateLimited(item) // Retry with backoff
	} else {
		queue.Forget(item) // Success, reset the backoff counter
	}
	queue.Done(item) // Always mark processing as finished
}
Rate Limiting
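Default: per-item exponential backoff (DefaultControllerRateLimiter also applies an overall token bucket of 10 items/s with a burst of 100)
First failure: 5ms retry
Second failure: 10ms retry
Third failure: 20ms retry
...
Max: 1000s (about 16.7 minutes)
Common Patterns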
Pattern 1: Owner Reference Tracking
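go
// Pod belongs to Deployment
// (simplified: in a real cluster a Deployment owns a ReplicaSet, which owns the Pods)
pod.OwnerReferences = []metav1.OwnerReference{
	{
		APIVersion: "apps/v1",
		Kind:       "Deployment",
		Name:       "my-deployment",
		UID:        "...",
	},
}
// When the Deployment handler is triggered
handler := func(obj interface{}) {
	deployment := obj.(*appsv1.Deployment)
	// Find all Pods owned by this Deployment via the custom "owner" index
	// (assumes index values of the form Kind/Name)
	pods, err := podInformer.GetIndexer().ByIndex("owner", "Deployment/"+deployment.Name)
	if err != nil {
		return
	}
	for _, p := range pods {
		if key, err := cache.MetaNamespaceKeyFunc(p); err == nil {
			queue.Add(key) // Re-reconcile owned pods
		}
	}
}
Pattern 2: Label-Based Filtering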
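go
// Watch only Pods with the app=myapp label
selector := labels.SelectorFromSet(map[string]string{
	"app": "myapp",
})
// The selector must be handed to the factory; an informer does not filter on
// its own. Note that the tweak applies to every informer built by this factory.
factory := informers.NewSharedInformerFactoryWithOptions(clientset, 15*time.Minute,
	informers.WithTweakListOptions(func(options *metav1.ListOptions) {
		options.LabelSelector = selector.String()
	}),
)
podInformer := factory.Core().V1().Pods().Informer()
// LIST and WATCH are now filtered server-side
Performance Tuning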
Memory Usage
Cache memory = sum of object sizes:
Small cluster (100 pods): ~10 MB
Medium cluster (1000 pods): ~100 MB
Large cluster (10k pods): ~1-5 GB
Optimization: Use a field selector when available:
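go
// Only watch Pods in the "default" namespace
factory := informers.NewSharedInformerFactoryWithOptions(clientset, 15*time.Minute,
	informers.WithTweakListOptions(func(options *metav1.ListOptions) {
		options.FieldSelector = fields.OneTermEqualSelector("metadata.namespace", "default").String()
	}),
)
// informers.WithNamespace("default") achieves the same namespace scoping
CPU Usage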
Resync impact (illustrative figures; the real cost depends on object count and handler work):
Resync every 15 minutes:
├─ Cache replay to handlers: ~1s CPU
├─ Handler callbacks: ~0.5s CPU
├─ Queue flush: ~1s CPU
└─ Total: ~2-3s CPU per resync
Resync frequency impact (2-3s divided by the interval):
- Every 1 minute: ~3-5% CPU overhead
- Every 10 minutes: ~0.3-0.5% CPU overhead
- Every 30 minutes: ~0.1-0.2% CPU overhead
Troubleshooting
Issue 1: Informer Cache Stale
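Symptom: Reconciler sees old object state
Cause: Watch stalled or disconnected, or the cache was read before its initial sync finished
Solution:
- Wait for the initial sync before starting workers: factory.WaitForCacheSync(stopCh)
- Check watch connectivity
- Shorter resync: factory := informers.NewSharedInformerFactory(..., 5*time.Minute) re-delivers cached state sooner, but does not refresh it from the API server
Issue 2: High Reconciliation Latency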
Symptom: Object changes but reconciliation is delayed >30s
Cause: Queue backlog, handler slow
Solution:
- Profile handler performance
- Increase concurrency (multiple workers)
- Reduce work per reconciliation
Summary
- List-Watch protocol: Initial state + incremental updates = complete reconciliation
- Informer cache: Local in-memory cache, indexed for efficient lookup
- Resync: Periodic replay of cached objects to handlers as a level-triggered safety net; recovery from a broken watch comes from the reflector's re-LIST
- Shared factory: Multiple handlers share single informer
- Work queue: Manages reconciliation ordering with retry logic
- Memory/CPU tradeoff: A shorter resync gives faster recovery at a higher cost