Watch Caching & API Server Local Cache — Stale Reads, Reconnection Behavior
Why the Watch Cache Matters
Without a watch cache, every pod, service, or deployment change would trigger an API call from each interested client. At a scale of thousands of pods, that means thousands of API calls per second, a load etcd cannot sustain.
The watch cache is an optimization layer between etcd and API clients. Understanding how it works helps you debug staleness issues, reason about API latency, and tune client connection patterns.
Architecture of the Watch Cache
Without Cache (Naive)
Pod watches from 100 clients
├─ Watch 1 → etcd subscription
├─ Watch 2 → etcd subscription
├─ ...
└─ Watch 100 → etcd subscription
Result: 100 etcd watch connections!
etcd overhead: massive
Performance: degraded
With Cache (Actual Implementation)
API Server
├─ Local Watch Cache (in-memory)
│  └─ Stores current objects plus a window of recent events
│
├─ Single etcd watch subscription per resource type
│  (multiplexed for all clients)
│
└─ Event distribution
   ├─ Client 1 ← cached events
   ├─ Client 2 ← cached events
   └─ Client 100 ← cached events
Result: Single etcd connection, shared cost
Efficiency: 100 etcd watches reduced to 1
Cache Mechanics
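# Pseudo-code: API Server Watch Cache (simplified)
from collections import namedtuple

# resource_version is assigned by etcd, not by the cache
Event = namedtuple("Event", "type key value resource_version")

class WatchCache:
    def __init__(self, etcd):
        self.etcd = etcd             # etcd client exposing watch(); hypothetical interface
        self.resourceVersion = 0     # latest etcd revision applied to the cache
        self.objects = {}            # key → object
        self.etcd_watch = None       # subscription to etcd
        self.subscribers = []        # clients watching

    def start(self):
        # Subscribe to etcd changes once, on behalf of all clients
        self.etcd_watch = self.etcd.watch(
            resource_type="pods",
            onEvent=self.onEtcdEvent
        )

    def onEtcdEvent(self, event):
        # Apply the event from etcd to the local store
        if event.type in ("ADDED", "MODIFIED"):
            self.objects[event.key] = event.value
        elif event.type == "DELETED":
            self.objects.pop(event.key, None)
        # Track the revision reported by etcd instead of counting locally
        self.resourceVersion = event.resource_version
        # Broadcast to all subscribers
        for subscriber in self.subscribers:
            subscriber.send(event)

    def addSubscriber(self, client):
        # New watch client connected
        self.subscribers.append(client)
        # Send initial state as synthetic ADDED events
        for key, obj in self.objects.items():
            client.send(Event("ADDED", key, obj, self.resourceVersion))

    def removeSubscriber(self, client):
        self.subscribers.remove(client)
Resource Version & Consistency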
What is Resource Version?
resourceVersion is the version of an object as stored in etcd:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  resourceVersion: "12345" # derived from the etcd revision
  generation: 1
spec:
  containers:
  - name: app
Every update gives the object a higher resourceVersion.
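As a minimal sketch of why the values increase but are not necessarily consecutive for a single object: Kubernetes derives resourceVersion from etcd's store-wide revision, which advances on every write to any key. `TinyStore` below is a hypothetical in-memory stand-in for etcd, not a real client.

```python
class TinyStore:
    """Hypothetical in-memory stand-in for etcd: one revision counter for all keys."""

    def __init__(self):
        self.revision = 0
        self.data = {}  # key -> (resource_version, value)

    def put(self, key, value):
        # Every write, to any key, advances the store-wide revision.
        self.revision += 1
        self.data[key] = (self.revision, value)
        return self.revision


store = TinyStore()
rv_created = store.put("pods/default/my-pod", {"phase": "Pending"})
store.put("pods/default/other-pod", {"phase": "Pending"})  # unrelated write
rv_updated = store.put("pods/default/my-pod", {"phase": "Running"})

# my-pod's versions increase, but skip the revision consumed by other-pod.
print(rv_created, rv_updated)  # 1 3
```

This is one reason clients are advised to treat the value as an opaque token rather than doing arithmetic on it.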
Resource Version Ordering
Pod created: resourceVersion=1000
Pod updated: resourceVersion=1001
Pod spec changed: resourceVersion=1002
Guarantee: 1000 < 1001 < 1002 (monotonically increasing)
Watch Client Synchronization
Client connects with resourceVersion=1000
↓
Client request:
"Send me events starting from 1000"
↓
API Server sends:
- All events with version > 1000
- Followed by the live stream
Result: Client is synchronized with server state from version 1000 onward
Stale Reads Scenario
When Does Staleness Happen?
The watch cache can hold stale data within a narrow window:
Timeline:
T1: etcd pushes a change to the API Server cache
    └─ Cache now reflects change
    └─ resourceVersion bumped
T2: API Server notifies watch subscribers
    └─ Events sent
T3: Client receives events
T4: But if a client reads from the cache BEFORE events have fully propagated
    └─ It might see state from the T1-T2 boundary
    └─ Stale for a few milliseconds
Real Scenario: Lost Update Race
# Deployment controller watching Pod status
Time 0: Pod created, watch cache sees it
Time 5ms: Pod starts, status.phase = Running (etcd updated)
Time 10ms: Deployment controller reads Pod (not watching)
    Watch cache has not processed the update event yet
    Gets stale Pod status
Duration: usually <100 ms, but under load it can reach seconds.
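The race above can be reproduced with a toy model in which the cache applies etcd events only when it gets around to processing them. `Store` and `LaggingCache` are hypothetical names used for illustration; the real API server is considerably more involved.

```python
from collections import deque


class Store:
    """Stand-in for etcd: the source of truth."""

    def __init__(self):
        self.revision = 0
        self.objects = {}
        self.pending = deque()  # events not yet applied by the cache

    def put(self, key, value):
        self.revision += 1
        self.objects[key] = value
        self.pending.append((self.revision, key, value))


class LaggingCache:
    """Stand-in for the watch cache: applies events only when process() runs."""

    def __init__(self, store):
        self.store = store
        self.objects = {}
        self.revision = 0

    def process(self):
        while self.store.pending:
            self.revision, key, value = self.store.pending.popleft()
            self.objects[key] = value

    def get(self, key):
        return self.objects.get(key)


store = Store()
cache = LaggingCache(store)

store.put("my-pod", "Pending")
cache.process()                  # Time 0: cache has seen the Pod
store.put("my-pod", "Running")   # Time 5ms: etcd updated
stale = cache.get("my-pod")      # Time 10ms: read served before the event is applied
cache.process()
fresh = cache.get("my-pod")

print(stale, fresh)  # Pending Running
```

The length of the stale window in this model is simply the delay between `put` and the next `process` call, which is why load on the cache stretches it.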
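Dealing with Stale Reads
Patterns in Kubernetes controllers:
// Strategy 1: Quorum read. Leaving resourceVersion unset requests the most
// recent data rather than whatever the cache currently holds.
pod, err := clientset.CoreV1().Pods("default").Get(context.Background(), "my-pod", metav1.GetOptions{})
// Strategy 2: Use watch API (eventual consistency)
watcher, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{}) // ordered events
// Strategy 3: Poll with exponential backoff
delay := 100 * time.Millisecond
for i := 0; i < 10; i++ {
    pod, err := clientset.CoreV1().Pods("default").Get(context.Background(), "my-pod", metav1.GetOptions{})
    if err == nil && pod.Status.Phase == corev1.PodRunning {
        break
    }
    time.Sleep(delay) // Retry after the staleness window
    delay *= 2        // Double the wait on each attempt
}
Watch Connection Behavior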
Connection Lifecycle
1. Client connects
kubectl get pods --watch
↓
2. TCP handshake
3. HTTP request
GET /api/v1/pods?watch=true&resourceVersion=12345
4. API Server creates watch subscription
5. Streaming HTTP response
Server sends: event1, event2, event3...
(long-lived stream; it stays open until the server times it out or the connection drops)
6. Connection held open
└─ Client receives events as they occur
└─ No polling needed
Reconnection on Network Failure
Watch stream running normally
↓
Network interruption (5 seconds)
↓
Connection dropped
↓
Client library (kubectl, client-go) detects the drop
↓
Reconnect attempt
↓
client.Watch(resource, resourceVersion=last_seen)
↓
API Server:
"Client last saw version 5000, give them the events after 5000"
↓
Buffered events replayed
↓
Continue live stream
Event Buffering
The API Server buffers events if a client is temporarily slow:
Buffer size: ~1000-5000 events (tunable)
If client can't keep up:
├─ Buffer fills
├─ Client gets dropped (the server closes its watch stream)
├─ Client reconnects
└─ Replay from the buffered window
Implication: Slow watch clients get disconnected automatically.
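A minimal sketch of the buffering behaviour described above, assuming a fixed-size queue per subscriber. The buffer size of 3 and the `Subscriber` and `broadcast` names are illustrative assumptions, not API server internals.

```python
import queue


class Subscriber:
    def __init__(self, name, buffer_size):
        self.name = name
        self.events = queue.Queue(maxsize=buffer_size)


def broadcast(subscribers, event):
    """Deliver event to every subscriber; return those dropped for being too slow."""
    dropped = []
    for sub in subscribers:
        try:
            sub.events.put_nowait(event)
        except queue.Full:
            dropped.append(sub)  # buffer full: disconnect instead of blocking the others
    for sub in dropped:
        subscribers.remove(sub)
    return dropped


fast = Subscriber("fast", buffer_size=3)
slow = Subscriber("slow", buffer_size=3)
subscribers = [fast, slow]

dropped_names = []
for rv in range(1, 6):
    dropped_names += [s.name for s in broadcast(subscribers, rv)]
    fast.events.get_nowait()  # the fast client drains its buffer; the slow one never does

print(dropped_names)                  # ['slow']
print([s.name for s in subscribers])  # ['fast']
```

Dropping the slow subscriber keeps one lagging client from delaying event delivery to everyone else.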
Cache Invalidation
When Cache Invalidates
The watch cache is invalidated in the following scenarios:
| Scenario | Impact | Recovery |
|---|---|---|
| etcd compaction | Old revisions purged | Client must reconnect with a new revision |
| API Server restart | Cache cleared | Repopulate from etcd |
| Network partition | Subscribers disconnected | Automatic reconnect + replay |
| Resource quota exceeded | Subscription rejected | Client error |
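The compaction row in the table above is the case clients most often have to handle in code. The sketch below models a server that retains only a sliding window of recent events, and a client that falls back to a fresh list when its resume point has been discarded. `EventWindow`, `TooOldError`, and `resume` are hypothetical names; a real client would see an HTTP 410 Gone response rather than a Python exception.

```python
from collections import deque


class TooOldError(Exception):
    """Stands in for HTTP 410 Gone: the requested revision is no longer retained."""


class EventWindow:
    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)  # (revision, key, value); oldest dropped first
        self.state = {}
        self.revision = 0

    def record(self, key, value):
        self.revision += 1
        self.state[key] = value
        self.events.append((self.revision, key, value))

    def events_since(self, revision):
        oldest = self.events[0][0] if self.events else self.revision + 1
        if revision + 1 < oldest:
            raise TooOldError(f"revision {revision} is older than {oldest}")
        return [e for e in self.events if e[0] > revision]

    def list_all(self):
        return dict(self.state), self.revision


def resume(server, local_state, last_seen):
    """Replay missed events if possible, otherwise re-list. Returns the new revision."""
    try:
        for revision, key, value in server.events_since(last_seen):
            local_state[key] = value
            last_seen = revision
    except TooOldError:
        fresh, last_seen = server.list_all()
        local_state.clear()
        local_state.update(fresh)
    return last_seen


server = EventWindow(capacity=3)
for i in range(1, 6):
    server.record(f"pod-{i}", "Running")  # revisions 1..5; the window keeps 3..5

state = {"pod-1": "Running"}
new_revision = resume(server, state, last_seen=1)  # revision 2 was discarded
print(new_revision, sorted(state))
# 5 ['pod-1', 'pod-2', 'pod-3', 'pod-4', 'pod-5']
```

After the fallback the client holds the correct current state, but it never observes the individual intermediate events.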
etcd Compaction Effect
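Suppose compaction retains only the last hour of revisions:
If client watched at revision 50000
Compaction happens (discards < 100000)
↓
Client tries: Watch pods at revision=50000
↓
API Server:
"Revision 50000 no longer available"
Error: 410 Gone ("too old resource version")
↓
Client must re-list and start a new watch from the revision the list returns (100000 or later)
↓
Client never sees the individual events between 50000 and 100000, only their net effect
↓
Recommendation: keep compaction enabled on large clusters, and make clients handle this error by re-listing
Watch Cache Limits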
Connection Limits
Each API Server can sustain a limited number of watch connections:
Typical limits:
├─ Single API Server: 10,000-50,000 watch connections
├─ GKE typical: 4-8 API Server replicas
└─ Total cluster: ~100,000+ watches
If exceeded:
├─ New watch rejected
├─ Error: "watch limit exceeded"
└─ Client must retry or wait
Memory Usage
Watch cache memory grows with cluster size:
Small cluster (100 pods): ~10 MB cache
Medium cluster (1000 pods): ~100 MB cache
Large cluster (10k pods): ~1-5 GB cache
Huge cluster (100k pods): ~10-50 GB cache
GKE mitigation: Automatic cache eviction based on memory pressure.
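The figures above work out to roughly 100 KB of cache per pod at the low end (10 MB / 100 pods). A back-of-the-envelope estimator under that assumption is sketched below; the per-pod cost is derived from the figures in this section, not measured, and `estimate_cache_bytes` is a hypothetical helper.

```python
BYTES_PER_POD = 100 * 1000  # assumption: ~10 MB / 100 pods, from the figures above


def estimate_cache_bytes(pod_count, bytes_per_pod=BYTES_PER_POD):
    """Rough low-end estimate of watch cache size, counting pods only."""
    return pod_count * bytes_per_pod


for pods in (100, 1_000, 10_000, 100_000):
    print(f"{pods:>7} pods -> {estimate_cache_bytes(pods) / 1e6:,.0f} MB")
```

The upper ends of the ranges quoted for large clusters imply a higher per-pod cost, so treat the result as a floor rather than a sizing guide.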
Event Queue Backpressure
If thousands of pods change at the same time:
├─ etcd sends all changes
├─ Cache processes events
├─ Backlog builds
├─ Subscribers receive delayed events
│  (stale for seconds, not milliseconds)
└─ Eventually catch up
Production Patterns
Pattern 1: Efficient Client Implementation
// Good: Use watch API, minimal reconnects
import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)
clientset, err := kubernetes.NewForConfig(config) // returns (clientset, error)
if err != nil {
    log.Fatal(err)
}
watcher, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
if err != nil {
    log.Fatal(err)
}
for event := range watcher.ResultChan() {
    pod, ok := event.Object.(*corev1.Pod)
    if !ok {
        continue // error events carry a *metav1.Status, not a Pod
    }
    // Handle pod change
}
// Bad: Polling every second (unnecessary load)
for {
    pods, err := clientset.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{})
    // Process pods
    time.Sleep(1 * time.Second) // Anti-pattern!
}
Pattern 2: Handling Watch Reconnections
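// Robust watch with auto-reconnect, resuming from the last seen resourceVersion
// (watch.Error comes from "k8s.io/apimachinery/pkg/watch")
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
lastRV := "" // empty: start from the most recent state
for {
    watcher, err := clientset.CoreV1().Pods("").Watch(context.Background(),
        metav1.ListOptions{ResourceVersion: lastRV})
    if err != nil {
        log.Printf("Watch failed: %v", err)
        <-ticker.C
        continue
    }
    for event := range watcher.ResultChan() {
        if event.Type == watch.Error {
            lastRV = "" // resume point may have expired (410 Gone): start over
            break
        }
        if pod, ok := event.Object.(*corev1.Pod); ok {
            lastRV = pod.ResourceVersion
            // Process event
        }
    }
    watcher.Stop()
    // If loop exits (watch dropped), retry
    log.Println("Watch disconnected, retrying...")
    <-ticker.C
}
Pattern 3: ListWatch Protocol (Framework Pattern)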
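All Kubernetes controllers use the ListWatch pattern:
// List: Get current state and the resourceVersion it corresponds to
pods, err := clientset.CoreV1().Pods(namespace).List(context.Background(),
    metav1.ListOptions{LabelSelector: selector})
for i := range pods.Items {
    reconcile(&pods.Items[i])
}
// Watch: Get updates that happened after the list
watcher, err := clientset.CoreV1().Pods(namespace).Watch(context.Background(), metav1.ListOptions{
    LabelSelector:   selector,
    ResourceVersion: pods.ResourceVersion,
})
// Combine: Full reconciliation + incremental updates
for event := range watcher.ResultChan() {
    if event.Type == "ADDED" || event.Type == "MODIFIED" {
        reconcile(event.Object)
    }
}
Troubleshooting Stale Cache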
Symptom 1: Pod shows Pending, but node has resources
Cause: Cache stale, controller hasn't seen resource yet
Solution: Trigger a watch refresh with kubectl delete pod (forces reschedule)
Symptom 2: Service endpoints not updated
Cause: Service/Endpoint cache desync
Diagnosis:
kubectl get endpoints <service>
kubectl describe service <service>
kubectl get pods -L kubernetes.io/hostname
Solution: Restart endpoint controller (not recommended, last resort)
Symptom 3: Watch client reconnecting frequently
Cause: Slow client; its buffer fills and the server disconnects it
Solution:
1. Check client performance
2. Reduce watch scope (only needed selectors)
3. Increase buffer size (GKE setting)
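For step 2, narrowing the scope means sending selectors with the watch request so the server filters events before they reach the client. A small sketch of building such a request path follows; `watch_url` is a hypothetical helper, while `watch`, `labelSelector`, `fieldSelector`, and `resourceVersion` are standard Kubernetes API query parameters.

```python
from urllib.parse import urlencode


def watch_url(namespace, label_selector=None, field_selector=None, resource_version=None):
    """Build the request path for a namespaced Pod watch with optional selectors."""
    params = {"watch": "true"}
    if label_selector:
        params["labelSelector"] = label_selector
    if field_selector:
        params["fieldSelector"] = field_selector
    if resource_version:
        params["resourceVersion"] = resource_version
    return f"/api/v1/namespaces/{namespace}/pods?{urlencode(params)}"


print(watch_url("default", label_selector="app=web", resource_version="12345"))
# /api/v1/namespaces/default/pods?watch=true&labelSelector=app%3Dweb&resourceVersion=12345
```

Watching one namespace with a label selector instead of all pods cluster-wide reduces the number of events the client has to drain, which makes it less likely to fall behind.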
Summary
- Watch cache: Single upstream etcd subscription, multiplexed to many clients
- Resource version: Monotonically increasing, used to track state
- Stale reads: Possible within a narrow window, usually <100 ms
- Reconnection: Automatic, buffered events replayed
- Compaction impact: Old revisions discarded, client must reconnect at a newer revision
- Connection limits: ~10k-50k per API Server, overflow rejected
- Best practice: Use the watch API instead of polling, implement robust reconnect logic
- Performance: Order-of-magnitude improvement: 1 etcd watch vs N client watches