
Watch Caching & API Server Local Cache — Stale Reads, Reconnection Behavior

Why the Watch Cache Matters

Without a watch cache, every client that tracks pods, services, or deployments would have to go to etcd for each change, either by polling or through its own etcd watch. At a scale of thousands of pods, that means thousands of API calls per second, more than etcd can handle.

The watch cache is an optimization layer between etcd and API clients. Understanding how it works helps you debug staleness issues, reason about API latency, and optimize client connection patterns.


Watch Cache Architecture

Without Cache (Naive)

Pod watches from 100 clients
     ├─ Watch 1 → etcd subscription
     ├─ Watch 2 → etcd subscription
     ├─ ...
     └─ Watch 100 → etcd subscription
     
Result: 100 etcd watch connections!
etcd overhead: massive
Performance: degraded

With Cache (Actual Implementation)

API Server
├─ Local Watch Cache (in-memory)
│  └─ Stores the current objects plus a window of recent events

├─ Single etcd watch subscription per resource type
│  (multiplexed for all clients)

└─ Event distribution
   ├─ Client 1 ← cached events
   ├─ Client 2 ← cached events
   └─ Client 100 ← cached events
   
Result: Single etcd connection, shared cost
Efficiency: 100 client watches served by 1 etcd watch

Cache Mechanics

python
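# Pseudo-code: API server watch cache (simplified sketch, not the real implementation)

class WatchCache:
    def __init__(self, etcd):
        self.etcd = etcd            # hypothetical etcd client object
        self.resource_version = 0   # latest etcd revision seen by the cache
        self.objects = {}           # key → latest object
        self.etcd_watch = None      # subscription to etcd
        self.subscribers = []       # clients watching

    def start(self):
        # Subscribe to etcd changes (one upstream watch for this resource type)
        self.etcd_watch = self.etcd.watch(
            resource_type="pods",
            on_event=self.on_etcd_event,
        )

    def on_etcd_event(self, event):
        # Apply the etcd event to the local store
        if event.type in ("ADDED", "MODIFIED"):
            self.objects[event.key] = event.value
        elif event.type == "DELETED":
            self.objects.pop(event.key, None)
        # The version comes from etcd's revision; it is not a per-event counter
        self.resource_version = event.revision

        # Broadcast to all subscribers
        for subscriber in self.subscribers:
            subscriber.send(event.type, event.value)

    def add_subscriber(self, client):
        # New watch client connected: send initial state, then go live
        for obj in self.objects.values():
            client.send("ADDED", obj)
        self.subscribers.append(client)

    def remove_subscriber(self, client):
        self.subscribers.remove(client)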

Resource Version & Consistency

What is Resource Version?

resourceVersion is the version identifier of an object, derived from the etcd revision of its last write:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  resourceVersion: "12345"  # etcd version
  generation: 1
spec:
  containers:
  - name: app

Every update gives the object a new, higher resourceVersion.

Resource Version Ordering

Pod created:       resourceVersion=1000
Pod updated:       resourceVersion=1001
Pod spec changed:  resourceVersion=1002

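Guarantee: 1000 < 1001 < 1002 (monotonically increasing; real values are not necessarily consecutive, because the underlying etcd revision is shared by all objects)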

Watch Client Synchronization

Client connects with resourceVersion=1000

Client → API Server:
  "Send me events starting from 1000"

API Server sends:
  - All events with version > 1000
  - Followed by the live stream

Result: Client stays synchronized with server state from version 1000 onward
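
The replay step can be sketched with a small in-memory event log. This is only an illustration of the idea, not the API server's actual data structure; the `EventHistory` class, its `capacity` parameter, and the tuple layout are assumptions made for the example.

```python
from collections import deque


class EventHistory:
    """Bounded log of (resource_version, event_type, name) tuples (hypothetical)."""

    def __init__(self, capacity=100):
        # Oldest events fall off the front once capacity is reached.
        self.events = deque(maxlen=capacity)

    def record(self, resource_version, event_type, name):
        self.events.append((resource_version, event_type, name))

    def events_after(self, resource_version):
        # Replay only what the client has not seen yet.
        return [e for e in self.events if e[0] > resource_version]


history = EventHistory()
history.record(1000, "ADDED", "my-pod")
history.record(1001, "MODIFIED", "my-pod")
history.record(1002, "MODIFIED", "my-pod")

# A client that already saw version 1000 receives 1001 and 1002.
replayed = history.events_after(1000)
print(replayed)
```

After the replay, the same client would simply keep receiving new events as they are recorded.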

Stale Reads Scenario

When Does Staleness Happen?

The watch cache can hold stale data for a narrow window:

Timeline:

T1: etcd pushes a change to the API server cache
    └─ Cache now reflects the change
    └─ resourceVersion bumped

T2: API server notifies watch subscribers
    └─ Events sent

T3: Client receives events

T4: A client that reads from the cache before the change has fully propagated
    └─ Might see the state from just before the change
    └─ Stale for a few milliseconds

Real Scenario: Lost Update Race

# Deployment controller watching Pod status

Time 0: Pod created, watch cache sees it
Time 5ms: Pod starts, status.phase = Running (etcd updated)
Time 10ms: Deployment controller reads Pod (not watching)
           Watch cache has not processed the update event yet
           Gets stale Pod status

Duration: Usually under 100 ms, but under load it can reach seconds.
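
The race can be reproduced in a toy model where etcd queues events and the cache applies them later. Everything here is hypothetical scaffolding (`FakeEtcd`, `Cache`, and the "Pending" starting phase are assumptions for the example); it only shows why a read between the write and the event delivery returns old data.

```python
class FakeEtcd:
    """Stand-in for etcd: stores values and queues change events."""

    def __init__(self):
        self.data = {}
        self.pending = []   # events not yet delivered to the cache

    def put(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Cache:
    """Stand-in for the watch cache: only changes when events are processed."""

    def __init__(self, etcd):
        self.etcd = etcd
        self.objects = {}

    def process_pending(self):
        # Apply every event etcd has queued for this cache.
        while self.etcd.pending:
            key, value = self.etcd.pending.pop(0)
            self.objects[key] = value

    def get(self, key):
        return self.objects.get(key)


etcd = FakeEtcd()
cache = Cache(etcd)

etcd.put("my-pod", "Pending")   # Time 0: Pod created
cache.process_pending()         # watch cache sees it
etcd.put("my-pod", "Running")   # Time 5ms: status updated in etcd

stale = cache.get("my-pod")     # Time 10ms: read before the event is processed
cache.process_pending()
fresh = cache.get("my-pod")     # read after the cache catches up
print(stale, fresh)
```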

Dealing with Stale Reads

Patterns in Kubernetes controllers:

go
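// Fragments: assume an existing clientset and ctx; the "default" namespace is an example.

// Strategy 1: Consistent read (resourceVersion left unset), not a stale cache read
pod, err := clientset.CoreV1().Pods("default").Get(ctx, "my-pod", metav1.GetOptions{})

// Strategy 2: Use watch API (eventual consistency, events arrive in order)
watcher, err := clientset.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{})

// Strategy 3: Poll with exponential backoff
delay := 100 * time.Millisecond
for i := 0; i < 10; i++ {
    pod, err := clientset.CoreV1().Pods("default").Get(ctx, "my-pod", metav1.GetOptions{})
    if err == nil && pod.Status.Phase == corev1.PodRunning {
        break
    }
    time.Sleep(delay) // Retry after staleness window
    delay *= 2        // Double the wait on every attempt
}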
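
The polling strategy is easy to get subtly wrong, so here is a minimal, library-free sketch of it. The `poll_until` helper and its parameters are hypothetical names chosen for the example; the `sleep` argument exists only so the delays can be observed without actually waiting.

```python
import time


def poll_until(get_phase, want="Running", attempts=10, base_delay=0.1, sleep=time.sleep):
    """Call get_phase() until it returns `want`, doubling the delay after each miss."""
    delay = base_delay
    for attempt in range(attempts):
        if get_phase() == want:
            return attempt + 1   # number of reads it took
        sleep(delay)
        delay *= 2
    return None                  # gave up after `attempts` reads


# Hypothetical reader that returns stale data twice before catching up.
responses = iter(["Pending", "Pending", "Running"])
delays = []
reads = poll_until(lambda: next(responses), sleep=delays.append)
print(reads, delays)
```

Doubling the delay keeps the first retries fast, which matches the usual short staleness window, while avoiding a tight loop when the cache is seconds behind.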

Watch Connection Behavior

Connection Lifecycle

1. Client connects
   kubectl get pods --watch


2. TCP handshake

3. HTTP request
   GET /api/v1/pods?watch=true&resourceVersion=12345

4. API Server creates watch subscription

5. Streaming HTTP response
   Server sends: event1, event2, event3...
   (long-lived stream; it stays open until the server-side timeout or a disconnect)

6. Connection held open
   └─ Client receives events as they occur
   └─ No polling needed

Reconnection on Network Failure

Watch stream running normally

Network interruption (5 seconds)

Connection dropped

Client library (kubectl, client-go) detects

Reconnect attempt

client.Watch(resource, resourceVersion=last_seen)

API Server:
  "Client last saw version 5000, send every event after 5000"

Buffered events replayed

Continue live stream
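
On the client side, the essential piece of this flow is remembering the last version seen. The sketch below simulates one dropped connection; `FlakyServer` and `watch_with_resume` are hypothetical names, and the generator stands in for a real streaming connection.

```python
class FlakyServer:
    """Hypothetical server: replays events after a version and drops the connection once."""

    def __init__(self, events, drop_after):
        self.events = events          # list of (resource_version, payload)
        self.drop_after = drop_after  # version after which the first stream breaks
        self.dropped = False

    def watch(self, resource_version):
        for rv, payload in self.events:
            if rv <= resource_version:
                continue              # client already has this event
            yield rv, payload
            if rv == self.drop_after and not self.dropped:
                self.dropped = True
                raise ConnectionError("network interruption")


def watch_with_resume(server, start_version):
    last_seen = start_version
    received = []
    while True:
        try:
            for rv, payload in server.watch(last_seen):
                received.append(payload)
                last_seen = rv        # remember progress for the next reconnect
            return received           # stream ended normally
        except ConnectionError:
            continue                  # reconnect from last_seen


server = FlakyServer([(5000, "a"), (5001, "b"), (5002, "c")], drop_after=5001)
result = watch_with_resume(server, start_version=4999)
print(result)
```

Because the client resumes from `last_seen`, no event is processed twice and none is skipped, as long as the server still holds the events after that version.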

Event Buffering

The API server buffers events if a client is temporarily slow:

Buffer size: ~1000-5000 events (tunable)

If client can't keep up:
├─ Buffer fills
├─ Server closes the client's watch
├─ Client reconnects
└─ Replay from the buffered window

Implication: Slow watch clients get disconnected automatically.
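
A bounded per-subscriber buffer is enough to model this behavior. The sketch below uses a tiny buffer of 3 events so the effect is visible; the `Subscriber` class and `broadcast` function are assumptions for illustration, not the API server's internals.

```python
from collections import deque


class Subscriber:
    def __init__(self, name, buffer_size):
        self.name = name
        self.buffer = deque()
        self.buffer_size = buffer_size

    def offer(self, event):
        """Queue an event; return False if the buffer is already full."""
        if len(self.buffer) >= self.buffer_size:
            return False
        self.buffer.append(event)
        return True

    def drain(self):
        # Simulates the client reading everything queued so far.
        self.buffer.clear()


def broadcast(subscribers, event):
    """Deliver to every subscriber and drop those whose buffer is full."""
    kept = []
    for sub in subscribers:
        if sub.offer(event):
            kept.append(sub)
    return kept


fast = Subscriber("fast", buffer_size=3)
slow = Subscriber("slow", buffer_size=3)
subscribers = [fast, slow]

for version in range(1, 6):
    subscribers = broadcast(subscribers, version)
    fast.drain()   # the fast client keeps up; the slow one never reads

remaining = [s.name for s in subscribers]
print(remaining)
```

Dropping the slow subscriber, rather than blocking on it, keeps one lagging client from delaying event delivery to everyone else.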


Cache Invalidation

When Cache Invalidates

The watch cache is invalidated in the following scenarios:

| Scenario | Impact | Recovery |
|---|---|---|
| etcd compaction | Old revisions purged | Client must reconnect with a new revision |
| API Server restart | Cache cleared | Repopulate from etcd |
| Network partition | Subscribers disconnected | Automatic reconnect + replay |
| Resource quota exceeded | Subscription rejected | Client error |

etcd Compaction Effect

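Compaction discards old revisions (this example assumes roughly the last hour is retained; the kube-apiserver default for --etcd-compaction-interval is 5 minutes):

If client watched at revision 50000
Compaction happens (discards < 100000)

Client tries: Watch pods at revision=50000

API Server:
  "Revision 50000 no longer available"
  Error: 410 Gone ("too old resource version")

Client must re-list to get the current state, then watch from the list's revision (100000, current)

The individual events between 50000 and 100000 are not replayed; the client only sees their net effect in the new list

Recommendation: Keep compaction enabled on large clusters, and make sure clients handle the 410 error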
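
A common way for a client to recover from a compacted revision is to list again and restart the watch from the revision the list returns. The sketch below models that with a fake server; `TooOldError`, `FakeServer`, and `resume_watch` are hypothetical names, and the string returned by `watch` merely stands in for a real event stream.

```python
class TooOldError(Exception):
    """Stands in for the API server's 410 Gone response."""


class FakeServer:
    def __init__(self, current_revision, compacted_before, objects):
        self.current_revision = current_revision
        self.compacted_before = compacted_before
        self.objects = objects

    def list(self):
        # Returns the full state and the revision it corresponds to.
        return dict(self.objects), self.current_revision

    def watch(self, revision):
        if revision < self.compacted_before:
            raise TooOldError(f"revision {revision} has been compacted")
        return f"watching from {revision}"


def resume_watch(server, last_seen):
    try:
        return server.watch(last_seen), None
    except TooOldError:
        state, revision = server.list()        # rebuild local state first
        return server.watch(revision), state   # then watch from the fresh revision


server = FakeServer(current_revision=100000, compacted_before=100000,
                    objects={"my-pod": "Running"})
stream, relisted = resume_watch(server, last_seen=50000)
print(stream, relisted)
```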

Watch Cache Limits

Connection Limits

Each API Server has max watch connections:

Typical limits:
├─ Single API Server: 10,000-50,000 watch connections
├─ GKE typical: 4-8 API Server replicas
└─ Total cluster: ~100,000+ watches

If exceeded:
├─ New watch rejected
├─ Error: "watch limit exceeded"
└─ Client must retry or wait

Memory Usage

Watch cache memory grows with cluster size:

Small cluster (100 pods):  ~10 MB cache
Medium cluster (1000 pods): ~100 MB cache
Large cluster (10k pods):  ~1-5 GB cache
Huge cluster (100k pods):  ~10-50 GB cache

GKE mitigation: Automatic cache eviction based on memory pressure.

Event Queue Backpressure

If thousands of pods change at the same time:
├─ etcd sends all changes
├─ Cache processes events
├─ Backlog builds
├─ Subscribers receive delayed events
│  (stale for seconds, not milliseconds)
└─ Eventually catch up
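
The backlog effect is simple arithmetic: if a burst arrives faster than the cache can process it, the queue drains at the processing rate. The numbers below (5000 events, 2000 processed per tick) are made up for the example, and `simulate_backlog` is a hypothetical helper.

```python
def simulate_backlog(burst, process_rate, ticks):
    """Return the queue depth after each tick when `burst` events arrive at tick 0."""
    depth = burst
    history = []
    for _ in range(ticks):
        depth = max(0, depth - process_rate)
        history.append(depth)
    return history


# Assumed numbers: 5000 events arrive at once, the cache drains 2000 per tick.
depths = simulate_backlog(burst=5000, process_rate=2000, ticks=4)
print(depths)
```

While the depth is above zero, subscribers are reading events that are at least that many events behind etcd, which is where the seconds of staleness come from.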

Production Patterns

Pattern 1: Efficient Client Implementation

go
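// Good: Use watch API, minimal reconnects
import (
    "context"
    "log"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

clientset, err := kubernetes.NewForConfig(config) // returns (clientset, error)
if err != nil {
    log.Fatalf("Creating clientset failed: %v", err)
}
watcher, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
if err != nil {
    log.Fatalf("Watch failed: %v", err)
}

for event := range watcher.ResultChan() {
    pod, ok := event.Object.(*corev1.Pod)
    if !ok {
        continue // e.g. an ERROR event carrying a *metav1.Status
    }
    log.Printf("%s: %s", event.Type, pod.Name) // Handle pod change
}

// Bad: Polling every second (unnecessary load)
for {
    pods, err := clientset.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{})
    // Process pods
    time.Sleep(1 * time.Second)  // Anti-pattern!
}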

Pattern 2: Handling Watch Reconnections

go
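// Robust watch with auto-reconnect, resuming from the last seen resourceVersion.
// Assumes an existing clientset; "watch" is the package k8s.io/apimachinery/pkg/watch.

ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()

lastRV := "" // empty: start from the most recent state

for {
    watcher, err := clientset.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{ResourceVersion: lastRV})
    if err != nil {
        log.Printf("Watch failed: %v", err)
        <-ticker.C
        continue
    }

    for event := range watcher.ResultChan() {
        if event.Type == watch.Error {
            // e.g. 410 Gone: the saved version is too old; a full client would re-list here
            lastRV = ""
            break
        }
        if pod, ok := event.Object.(*corev1.Pod); ok {
            lastRV = pod.ResourceVersion
            // Process event
        }
    }
    watcher.Stop()

    // If loop exits (watch dropped), retry
    log.Println("Watch disconnected, retrying...")
    <-ticker.C
}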

Pattern 3: ListWatch Protocol (Framework Pattern)

All Kubernetes controllers use the ListWatch pattern:

go
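// Fragment: assumes clientset, ctx, namespace, selector, and a hypothetical
// reconcile(obj runtime.Object) function; error handling is omitted.

// List: Get current state
pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
for i := range pods.Items {
    reconcile(&pods.Items[i])
}

// Watch: Get updates, starting from the resourceVersion the list returned
watcher, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
    LabelSelector:   selector,
    ResourceVersion: pods.ResourceVersion,
})

// Combine: Full reconciliation + incremental updates
for event := range watcher.ResultChan() {
    if event.Type == watch.Added || event.Type == watch.Modified {
        reconcile(event.Object)
    }
}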
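
The same list-then-watch flow can be exercised without a cluster. The sketch below keeps a local store in sync from an initial list plus later events; `list_then_watch`, `fake_list`, and `fake_watch` are hypothetical in-memory stand-ins, not client library calls.

```python
def list_then_watch(list_fn, watch_fn, reconcile):
    """List once, then apply watch events that come after the list's version."""
    items, resource_version = list_fn()
    store = dict(items)
    for name, obj in items.items():
        reconcile(name, obj)

    for event_type, name, obj in watch_fn(resource_version):
        if event_type in ("ADDED", "MODIFIED"):
            store[name] = obj
            reconcile(name, obj)
        elif event_type == "DELETED":
            store.pop(name, None)
    return store


# Hypothetical in-memory stand-ins for the API server.
def fake_list():
    return {"pod-a": "Pending", "pod-b": "Running"}, 1000


def fake_watch(resource_version):
    events = [
        (999, "MODIFIED", "pod-b", "Running"),   # already reflected in the list
        (1001, "MODIFIED", "pod-a", "Running"),
        (1002, "ADDED", "pod-c", "Pending"),
        (1003, "DELETED", "pod-b", None),
    ]
    for rv, event_type, name, obj in events:
        if rv > resource_version:
            yield event_type, name, obj


seen = []
store = list_then_watch(fake_list, fake_watch, lambda name, obj: seen.append(name))
print(store)
```

Starting the watch at the list's version is what closes the gap between the two calls: nothing that happened after the list is missed, and nothing already in the list is replayed.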

Troubleshooting Stale Cache

Symptom 1: Pod shows Pending, but node has resources

Cause: Cache stale, controller hasn't seen resource yet
Solution: Trigger a watch refresh with kubectl delete pod (forces a reschedule)

Symptom 2: Service endpoints not updated

Cause: Service/Endpoint cache desync
Diagnosis:
  kubectl get endpoints <service>
  kubectl describe service <service>
  kubectl get pods -o wide
Solution: Restart endpoint controller (not recommended, last resort)

Symptom 3: Watch client reconnecting frequently

Cause: Slow client, buffer fills, disconnected
Solution: 
  1. Check client performance
  2. Reduce watch scope (only needed selectors)
  3. Increase buffer size (GKE setting)

Summary

  • Watch cache: Single upstream etcd subscription, multiplexed to many clients
  • Resource version: Monotonically increasing, used to track state
  • Stale reads: Possible in a narrow window, usually <100ms
  • Reconnection: Automatic, buffered events replayed
  • Compaction impact: Old revisions discarded, client must re-list and reconnect
  • Connection limits: ~10k per API Server, overflow rejected
  • Best practice: Use the watch API instead of polling, implement robust reconnect logic
  • Performance: Order-of-magnitude improvement: 1 etcd watch vs N client watches