TTL Tuning: High-Churn Environments & Consistency
Why This Matters
TTL (Time-To-Live) is the cache expiration time. A badly chosen TTL has consequences:
- Too high (3600s): Stale data for up to 1 hour after a change
- Too low (0s): Every query hits DNS (load spike)
- Not environment-aware: Prod and dev need different TTLs
TTL Fundamentals
How TTL Works
T+0: Record cached (TTL=300)
api.default → 10.4.0.50
T+50: Query same record
→ Cache hit (TTL=250 remaining)
→ Return 10.4.0.50 (no DNS query)
T+150: Query again
→ Cache hit (TTL=150 remaining)
→ Return 10.4.0.50
T+300: Query after expiration
→ Cache expired (TTL=0)
→ DNS query sent
→ New record retrieved
TTL Layers
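Each layer below runs the same countdown as the timeline in How TTL Works. A minimal sketch of that countdown, assuming whole seconds and a single cached record; `remaining_ttl` and `lookup` are hypothetical helper names, not part of any DNS tool:

```bash
# remaining_ttl TTL CACHED_AT NOW
# Prints the seconds left before a cached record expires (0 once expired).
remaining_ttl() {
  left=$(( $1 - ($3 - $2) ))
  if [ "$left" -gt 0 ]; then echo "$left"; else echo 0; fi
}

# lookup TTL CACHED_AT NOW
# Prints "hit" while the record is still valid, "miss" once it has expired.
lookup() {
  if [ "$(remaining_ttl "$1" "$2" "$3")" -gt 0 ]; then echo hit; else echo miss; fi
}

# Replay the timeline: record cached at T+0 with TTL=300.
for now in 50 150 300; do
  echo "T+$now: $(lookup 300 0 "$now") (remaining $(remaining_ttl 300 0 "$now")s)"
done
```

Under these assumptions the loop reports a hit at T+50 and T+150 and a miss at T+300, which is the point where a real resolver would send a new DNS query.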
Layer 1: Pod cache (browsers, client libraries)
TTL: Varies (often 60s)
Layer 2: NodeLocal DNSCache
TTL: Configurable (default 30s for Kubernetes records)
Layer 3: CoreDNS
TTL: Configured in Corefile (default 30s)
Layer 4: Cloud DNS
TTL: Set on record (default 300s)
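Effective TTL ≈ Minimum across all layers (when each cache honors the TTL it receives from upstream; a layer that applies its own fixed TTL can add to worst-case staleness)
TTL for Different Environments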
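When picking the per-environment values below, the number a client actually experiences follows the layer rule above. A minimal sketch, assuming every layer honors the upstream TTL and caps it at its own limit; `effective_ttl` is a hypothetical helper, and the four values are the example figures from the layer list:

```bash
# effective_ttl TTL...
# Prints the smallest TTL among the arguments (one argument per cache layer).
effective_ttl() {
  min=$1
  shift
  for ttl in "$@"; do
    if [ "$ttl" -lt "$min" ]; then min=$ttl; fi
  done
  echo "$min"
}

# Pod cache 60s, NodeLocal DNSCache 30s, CoreDNS 30s, Cloud DNS 300s.
effective_ttl 60 30 30 300   # prints 30
```

In this sketch, raising the Cloud DNS TTL alone would not change the result, because the 30s layers still bound how long a record is held.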
Development Environment
TTL: 60-120 seconds (data refreshed frequently)
Rationale:
- Services change frequently
- Need quick iteration
- Cannot afford stale IPs
- Load on DNS acceptable
Example:
kubectl apply -f deployment.yaml (new image)
→ Pod IP changes
→ New service endpoint
→ Within 60s, services discover new IP
Staging Environment
TTL: 300 seconds (5 minutes, a balance between freshness and load)
Rationale:
- Regular deployments (but not constant)
- Balance between freshness and load
- Similar to production (test conditions)
- Production-like service patterns
Example:
Staging deployment every 2 hours
→ New endpoint roughly every 120 min
→ 300s TTL → up to 5 minutes (2-3 minutes on average) before all clients see new IP
Production Environment
TTL: 300-600 seconds (5-10 minutes)
Rationale:
- Blue-green deployments (planned changes)
- Minimize DNS query load
- Balances consistency and performance
- Gradual rollout (not immediate)
Example:
Blue (old, 10.4.0.50) running
Green (new, 10.4.0.51) deployed
T+0: Switch DNS: api.prod → 10.4.0.51
T+0-300: Old clients still cached to 10.4.0.50
New clients resolve to 10.4.0.51
Both running simultaneously
T+300: Old cache expires, all clients on new
High-Churn Scenarios
Scenario 1: Continuous Deployment (Every 5 min)
Problem:
Deployments every 5 minutes
TTL=300 (5 min default)
→ Every 5 min, cache expires
→ DNS load spike
Solution:
1. Reduce TTL to 60s (more queries, but smoother)
Tradeoff: 5x more DNS queries, but faster convergence
2. Use sticky load balancing (keep connection to same pod)
Clients don't need new DNS immediately
3. Increase pool size (fewer queries per pod)
Scenario 2: Canary Deployments (1% traffic)
Canary pattern:
1% traffic → Canary pods (10.4.0.100)
99% traffic → Stable pods (10.4.0.50)
Challenge:
Need fine-grained control, but DNS returns single IP or round-robin
Solution:
Don't use DNS for canary routing
Use Service Mesh (Istio) or load balancer weight controls
DNS is too coarse for canary routing (DNS-level load balancing is not precise)
Scenario 3: Auto-Scaling Events
Autoscaler trigger: 100 pods → 500 pods added
Problem:
New pods get new IPs
TTL=300 means existing clients keep the old set of IPs cached for up to 5 min
→ Requests keep hitting old pods (possible failures)
Solution:
1. Connection pooling/reuse (client doesn't reconnect immediately)
2. Graceful scale-down (old pods drain, not terminate abruptly)
3. Lower TTL during scale events (60s)
TTL Configuration
Cloud DNS Record
bash
# Create record with 60s TTL (for dev)
gcloud dns record-sets transaction start --zone=api-zone
gcloud dns record-sets transaction add 10.4.0.50 \
  --name=api.default.svc.cluster.local \
  --type=A \
  --ttl=60 \
  --zone=api-zone
gcloud dns record-sets transaction execute --zone=api-zone

# Change the TTL of the existing record to 300s
gcloud dns record-sets update api.default.svc.cluster.local \
  --rrdatas=10.4.0.50 \
  --ttl=300 \
  --type=A \
  --zone=api-zone
Kubernetes Service TTL
yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 8080
# A Service has no TTL field of its own.
# The actual DNS TTL is controlled by Cloud DNS or CoreDNS.
CoreDNS TTL
corefile
.:53 {
    cache 30  # Cache TTL
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30  # Kubernetes record TTL
    }
    forward . /etc/resolv.conf
}
Impact Analysis
Example: Service Migration
Old service: api.old.internal → 10.0.1.5 (on-prem)
New service: api.internal → 10.0.2.5 (GCP)
Phase 1: Pre-migration
api.internal → 10.0.2.5 (new GCP service)
TTL=300
Phase 2: Migration day (T+0)
Old clients may still have 10.0.1.5 cached (if they resolved it through DNS)
New clients resolve 10.0.2.5
Phase 3: T+300 (5 min)
Old cached entries expire
All clients now on 10.0.2.5
Phase 4: T+600 (10 min)
Verify all traffic on new service
Stop old service
Monitoring TTL Impact
bash
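# Inspect NodeLocal DNSCache logs for cache-related lines
kubectl logs -n kube-system -l k8s-app=node-local-dns \
  | grep "cache" \
  | tail -50

# Monitor query latency distribution (example values)
# P50: ~2ms (cache hit)
# P99: ~5ms (cache miss, one hop)
# P99.9: ~50ms (if stalled)

# Alert if P99.9 > 100ms
# The policy also needs a condition that encodes this threshold; the file name
# below is a placeholder for a policy file that defines it.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High DNS latency (P99.9 > 100ms)" \
  --policy-from-file=dns-latency-policy.json
TTL vs Eventual Consistency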
Consistency Model
Eventual consistency window ≈ TTL seconds
Example:
Update record: api.internal → 10.4.0.100 (new IP)
TTL=300 (5 min)
Client A (nothing cached) queries at T+1: Resolves immediately to new IP
Client B (old record cached just before the update) queries at T+50: Gets cached old IP (consistency gap)
Client C queries at T+300: Cache expired, gets new IP (consistency achieved)
Window: 0-300 seconds (full consistency after TTL expires)
Trade-off Matrix
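The matrix below trades this consistency window against DNS load. A minimal sketch of the window for the clients in the example above, assuming times are whole seconds relative to the record update at T+0 and that a negative cache time means the old record was cached before the update; `sees_stale` and the cache times are hypothetical:

```bash
# sees_stale TTL CACHED_AT QUERY_AT
# A client is stale if it cached the old record before the update (CACHED_AT < 0)
# and that cache entry has not yet expired when it queries.
sees_stale() {
  if [ "$2" -lt 0 ] && [ "$3" -lt $(( $2 + $1 )) ]; then echo stale; else echo fresh; fi
}

sees_stale 300 -1 50      # cached 1s before the update, queries at T+50:  stale
sees_stale 300 -1 300     # same cache entry, queries at T+300:            fresh
sees_stale 300 -200 150   # cached 200s before the update, queries at T+150: fresh
```

The third call shows why the window is an upper bound: a client whose entry was already old at the time of the update converges well before the full TTL has passed.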
| TTL | Consistency | Load | Latency |
|---|---|---|---|
| 30s | High (fast consistency) | High (more queries) | Slightly higher |
| 300s | Medium | Medium | Lower (more cache hits) |
| 3600s | Low (slow consistency) | Low (few queries) | Lowest (high cache hit rate) |
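The Load column can be estimated with simple arithmetic. A rough sketch, assuming every client re-resolves one name exactly once per TTL interval and no other cache sits in between; `queries_per_hour` and the client count of 1000 are illustrative assumptions, not measurements:

```bash
# queries_per_hour CLIENTS TTL
# Upper-bound estimate: each client sends one query per TTL interval.
queries_per_hour() {
  echo $(( $1 * 3600 / $2 ))
}

for ttl in 30 60 300 3600; do
  echo "TTL=${ttl}s: $(queries_per_hour 1000 "$ttl") queries/hour"
done
```

Under these assumptions, dropping from 300s to 60s raises the estimate from 12000 to 60000 queries per hour, the same 5x factor noted in Scenario 1.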
Best Practices
- Dev environment: TTL=60-120s (refresh frequently)
- Staging: TTL=300s (mirror prod)
- Production: TTL=300-600s (balance)
- High-churn services: TTL=60s (accept load)
- Stable services: TTL=3600s (minimize queries)
- Lower TTL before planned changes (ensure fast rollout)
- Monitor cache hit rates (tune based on metrics)
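Lowering the TTL before a planned change only helps if it is done early enough, because caches that already hold the record keep counting down the old TTL. A minimal sketch of that timing, assuming whole seconds and a cutover at T+0; `plan_cutover` is a hypothetical helper:

```bash
# plan_cutover OLD_TTL NEW_TTL
# Prints a schedule: when to publish the lower TTL, when to change the record,
# and the worst-case time until every cache holds the new record.
plan_cutover() {
  echo "T-$1: publish TTL=$2 (entries cached with TTL=$1 must expire first)"
  echo "T+0: change the record"
  echo "T+$2: worst case, all caches hold the new record"
}

plan_cutover 300 60
```

With a 300s record lowered to 60s, the sketch suggests publishing the lower TTL at least 5 minutes before the change, so that the rollout converges in about 1 minute instead of 5.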