
TTL Tuning: High-Churn Environments & Consistency

Why This Matters

TTL (Time-To-Live) is the cache expiration time: how long a resolved record may be served from cache before it must be looked up again. A badly chosen TTL has consequences:

  • Too high (3600s): Stale data for up to 1 hour after a change
  • Too low (0s): Every query hits DNS (load spike)
  • Not environment-aware: Prod and dev need different TTLs

TTL Fundamentals

How TTL Works

T+0: Record cached (TTL=300)
  api.default → 10.4.0.50

T+50: Query same record
  → Cache hit (TTL=250 remaining)
  → Return 10.4.0.50 (no DNS query)

T+150: Query again
  → Cache hit (TTL=150 remaining)
  → Return 10.4.0.50

T+300: Query after expiration
  → Cache expired (TTL=0)
  → DNS query sent
  → New record retrieved
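
The timeline above can be reproduced with a toy cache. This is a minimal sketch, not how any particular resolver is implemented: the `TtlCache` class, its `lookup` method, and the in-memory `records` table are all illustrative names, and time is passed in explicitly so the run is deterministic.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    address: str
    expires_at: float  # absolute time (seconds) at which the entry expires


class TtlCache:
    """Toy DNS cache keyed by record name (illustrative only)."""

    def __init__(self, resolver, ttl):
        self.resolver = resolver  # callable: name -> address (the "DNS query")
        self.ttl = ttl
        self.entries = {}
        self.upstream_queries = 0

    def lookup(self, name, now):
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            # Cache hit: no upstream query, report the remaining TTL.
            return entry.address, "hit", entry.expires_at - now
        # Missing or expired: query upstream and cache the fresh answer.
        self.upstream_queries += 1
        address = self.resolver(name)
        self.entries[name] = CacheEntry(address, now + self.ttl)
        return address, "miss", self.ttl


records = {"api.default": "10.4.0.50"}  # stand-in for the DNS server
cache = TtlCache(resolver=records.__getitem__, ttl=300)

times = (0, 50, 150, 300)
timeline = [cache.lookup("api.default", now=t) for t in times]
for t, (address, outcome, remaining) in zip(times, timeline):
    print(f"T+{t}: cache {outcome}, {address}, TTL remaining={remaining}")
```

Only the lookups at T+0 and T+300 reach the upstream resolver; the two in between are served from cache.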

TTL Layers

Layer 1: Pod cache (browsers, client libraries)
  TTL: Varies (often 60s)

Layer 2: NodeLocal DNSCache
  TTL: Configurable (default 30s for Kubernetes records)

Layer 3: CoreDNS
  TTL: Configured in Corefile (commonly ttl 30; the kubernetes plugin defaults to 5s when ttl is unset)

Layer 4: Cloud DNS
  TTL: Set on record (default 300s)

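Effective TTL = Minimum across all layers (assuming every layer honors the TTL it receives; a client library with a fixed cache lifetime can hold stale answers longer)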
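
As a rough illustration of the minimum rule, the sketch below uses the example TTLs from the layer list. The `layers` dictionary and the `effective_ttl` helper are hypothetical names, and the calculation assumes every layer honors and caps the TTL it receives.

```python
def effective_ttl(layer_ttls):
    """Longest time an answer can be reused when each cache caps the TTL it passes on."""
    return min(layer_ttls.values())


# Example values from the layer list above (seconds).
layers = {
    "Pod cache": 60,
    "NodeLocal DNSCache": 30,
    "CoreDNS": 30,
    "Cloud DNS": 300,
}

effective = effective_ttl(layers)
limiting = [name for name, ttl in layers.items() if ttl == effective]
print(f"Effective TTL: {effective}s, limited by {', '.join(limiting)}")
```

In this example, raising the Cloud DNS TTL alone would change nothing for in-cluster clients, because the 30s layers are the limiting ones.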

TTL for Different Environments

Development Environment

TTL: 60-120 seconds (fresh data frequently)

Rationale:
  - Services change frequently
  - Need quick iteration
  - Cannot afford stale IPs
  - Load on DNS acceptable

Example:
  kubectl apply -f deployment.yaml (new image)
    → Pod IP changes
    → New service endpoint
    → Within 60s, services discover new IP

Staging Environment

TTL: 300 seconds (5 minutes, a balance between freshness and load)

Rationale:
  - Regular deployments (but not constant)
  - Balance between freshness and load
  - Similar to production (test conditions)
  - Production-like service patterns

Example:
  Staging deployment every 2 hours
    → New endpoint roughly every 7200s
    → 300s TTL → up to 5 minutes before all clients see new IP

Production Environment

TTL: 300-600 seconds (5-10 minutes)

Rationale:
  - Blue-green deployments (planned changes)
  - Minimize DNS query load
  - Balances consistency and performance
  - Gradual rollout (not immediate)

Example:
  Blue (old, 10.4.0.50) running
  Green (new, 10.4.0.51) deployed
  
  T+0: Switch DNS: api.prod → 10.4.0.51
  T+0 to T+300: Clients with a cached entry still resolve to 10.4.0.50
                Clients resolving fresh get 10.4.0.51
                Both versions running simultaneously
  T+300: Last old cache entries expire, all clients on new
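
To get a feel for how traffic drains from Blue during that window, here is a simplified model. It assumes every client held a cached entry and that those entries were filled evenly over the TTL before the switch; real traffic is rarely that uniform, and `fraction_on_old_ip` is an illustrative helper, not part of any tool.

```python
def fraction_on_old_ip(seconds_since_switch, ttl_seconds):
    """Share of clients still holding the old cached address.

    Assumes cache fills were spread evenly over the TTL before the switch,
    so entries expire at a constant rate afterwards.
    """
    if seconds_since_switch >= ttl_seconds:
        return 0.0
    return 1.0 - max(0, seconds_since_switch) / ttl_seconds


for t in (0, 60, 150, 300):
    share = fraction_on_old_ip(t, ttl_seconds=300)
    print(f"T+{t}: {share:.0%} of clients still on 10.4.0.50")
```

Under these assumptions Blue has to stay healthy for the full TTL, since some clients keep sending it traffic until the last cached entry expires.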

High-Churn Scenarios

Scenario 1: Continuous Deployment (Every 5 min)

Problem:
  Deployments every 5 minutes
  TTL=300 (5 min default)
  → Every 5 min, cache expires
  → DNS load spike

Solution:
  1. Reduce TTL to 60s (more queries, but smoother)
     Tradeoff: 5x more DNS queries, but faster convergence
  
  2. Use sticky load balancing (keep connection to same pod)
     Clients don't need new DNS immediately
  
  3. Increase connection pool size (more connection reuse, so fewer DNS queries per pod)
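
The 5x figure in option 1 follows from simple arithmetic: if each client re-resolves a name once per TTL, the upstream query rate is inversely proportional to the TTL. The sketch below works through it; the client and name counts are made-up inputs, and `steady_state_qps` is a hypothetical helper.

```python
def steady_state_qps(clients, names_per_client, ttl_seconds):
    """Approximate upstream query rate when every client re-resolves each name once per TTL."""
    return clients * names_per_client / ttl_seconds


# Hypothetical cluster: 1000 client pods, each resolving 5 service names.
qps_300 = steady_state_qps(clients=1000, names_per_client=5, ttl_seconds=300)
qps_60 = steady_state_qps(clients=1000, names_per_client=5, ttl_seconds=60)
ratio = qps_60 / qps_300

print(f"TTL=300s: {qps_300:.1f} queries/s")
print(f"TTL=60s:  {qps_60:.1f} queries/s")
print(f"Increase: {ratio:.1f}x")
```

The ratio depends only on the two TTLs, so the same 5x holds for any cluster size; what changes is whether the absolute rate is one the DNS tier can absorb.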

Scenario 2: Canary Deployments (1% traffic)

Canary pattern:
  1% traffic → Canary pods (10.4.0.100)
  99% traffic → Stable pods (10.4.0.50)

Challenge:
  Need fine-grained control, but DNS returns single IP or round-robin
  
Solution:
  Don't use DNS for canary routing
  Use Service Mesh (Istio) or load balancer weight controls
  DNS too coarse for canary (DNS-level load balancing not precise)
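
A quick calculation shows why DNS is too coarse here. The sketch assumes clients pick uniformly among the A records returned, which real resolvers and client libraries do not guarantee; both helper functions are illustrative.

```python
from fractions import Fraction


def dns_canary_share(stable_records, canary_records):
    """Traffic share reaching the canary if clients pick uniformly among A records."""
    return Fraction(canary_records, stable_records + canary_records)


def stable_records_needed(target_share, canary_records=1):
    """Stable A records required so the canary receives exactly target_share."""
    return canary_records * (1 - target_share) / target_share


share = dns_canary_share(stable_records=3, canary_records=1)
needed = stable_records_needed(Fraction(1, 100))

print(f"3 stable + 1 canary record: canary gets {float(share):.0%}")
print(f"Stable records needed for a 1% canary: {needed}")
```

With a handful of records the canary share jumps in large steps, and reaching 1% would take 99 stable records per canary record, which is why weight-based routing in a mesh or load balancer is the better fit.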

Scenario 3: Auto-Scaling Events

Autoscaler trigger: 100 pods → 500 pods added

Problem:
  New pods get new IPs
  TTL=300 means clients keep the old cached set for up to 5 min
  → Requests keep hitting only the old pods (uneven load, possible failures)

Solution:
  1. Connection pooling/reuse (client doesn't reconnect immediately)
  2. Graceful scale-down (old pods drain, not terminate abruptly)
  3. Lower TTL during scale events (60s)

TTL Configuration

Cloud DNS Record

bash
# Create record with 60s TTL (for dev)
# Record names are fully qualified, so they end with a trailing dot.
gcloud dns record-sets transaction start --zone=api-zone
gcloud dns record-sets transaction add 10.4.0.50 \
  --name=api.default.svc.cluster.local. \
  --type=A \
  --ttl=60 \
  --zone=api-zone
gcloud dns record-sets transaction execute --zone=api-zone

# Change TTL of the existing record to 300s
gcloud dns record-sets update api.default.svc.cluster.local. \
  --rrdatas=10.4.0.50 \
  --ttl=300 \
  --type=A \
  --zone=api-zone

Kubernetes Service TTL

yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
  - port: 8080
  # A Service has no TTL field of its own.
  # The DNS TTL for this name is controlled by Cloud DNS or CoreDNS.

CoreDNS TTL

corefile
.:53 {
    cache 30          # Cache answers for at most 30s (caps longer upstream TTLs)
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
      ttl 30          # TTL returned for Kubernetes records
    }
    forward . /etc/resolv.conf
}

Impact Analysis

Example: Service Migration

Old service: api.old.internal → 10.0.1.5 (on-prem)
New service: api.internal → 10.0.2.5 (GCP)

Phase 1: Pre-migration
  api.internal → 10.0.2.5 (new GCP service)
  TTL=300

Phase 2: Migration day (T+0)
  Clients that resolved earlier still hold the cached 10.0.1.5 (only if the name previously pointed there)
  Clients resolving fresh get 10.0.2.5

Phase 3: T+300 (5 min)
  Old cached entries expire
  All clients now on 10.0.2.5

Phase 4: T+600 (10 min)
  Verify all traffic on new service
  Stop old service
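
The phase timings above are all multiples of the TTL, so they can be derived from the cutover time. The sketch below does that; the cutover timestamp is an arbitrary example, and the `safety_factor` of 2 simply mirrors the T+600 step (twice the 300s TTL) rather than any established rule.

```python
from datetime import datetime, timedelta


def migration_schedule(cutover, ttl_seconds, safety_factor=2):
    """Derive the phase times from the cutover time and the record TTL."""
    return {
        "cutover": cutover,
        "caches_expired": cutover + timedelta(seconds=ttl_seconds),
        "stop_old_service": cutover + timedelta(seconds=ttl_seconds * safety_factor),
    }


# Arbitrary example cutover time.
schedule = migration_schedule(datetime(2024, 1, 1, 12, 0, 0), ttl_seconds=300)
for phase, when in schedule.items():
    print(f"{phase}: {when:%H:%M:%S}")
```

If any layer caches longer than the record TTL, `ttl_seconds` should be the largest lifetime in the chain, otherwise the old service may be stopped while some clients still point at it.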

Monitoring TTL Impact

bash
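# Inspect recent cache-related log lines from NodeLocal DNSCache
# (logs only hint at cache behavior; an actual hit rate needs hit/miss metrics)
kubectl logs -n kube-system -l k8s-app=node-local-dns \
  | grep "cache" \
  | tail -50

# Typical query latency distribution
# P50: ~2ms (cache hit)
# P99: ~5ms (cache miss, one hop)
# P99.9: ~50ms (if stalled)

# Alert if P99.9 > 100ms
# The alert condition (metric filter and threshold) has to be defined as well;
# dns-latency-policy.json is a hypothetical file holding that definition.
gcloud alpha monitoring policies create \
  --policy-from-file=dns-latency-policy.json \
  --notification-channels=CHANNEL_ID \
  --display-name="High DNS latency (P99.9 > 100ms)"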
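
For reference, this is one way the percentiles above could be computed from raw latency samples. It is a sketch using the nearest-rank method on synthetic data shaped to match the figures quoted above; the `percentile` helper and the sample list are illustrative, and a monitoring backend would normally do this for you.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic latencies in ms: mostly cache hits, some misses, a few stalls.
latencies_ms = [2] * 950 + [5] * 45 + [50] * 5

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
alert = p999 > 100  # same threshold as the alert above

print(f"P50={p50}ms P99={p99}ms P99.9={p999}ms alert={alert}")
```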

TTL vs Eventual Consistency

Consistency Model

Eventual consistency window ≈ TTL seconds

Example:
  Update record: api.internal → 10.4.0.100 (new IP)
  TTL=300 (5 min)
  
  Client A (nothing cached) queries at T+1: Resolves immediately (new IP)
  Client B (cached the record before the update) queries at T+50: Gets cached old IP (consistency gap)
  Client C queries at T+300: Cache has expired, gets new IP (consistency achieved)
  
  Window: 0-300 seconds (full consistency once the TTL has expired everywhere)
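
The moment a given client converges depends on when it last filled its cache. A minimal sketch of that relationship, with the update at T+0 and an illustrative helper name:

```python
def sees_new_ip_at(last_fill, ttl):
    """Earliest time (relative to the update at T+0) a client can resolve the new IP.

    last_fill is when the client last cached the record (zero or negative,
    meaning before the update); None means the client has nothing cached.
    """
    if last_fill is None:
        return 0  # nothing cached: the first query after the update is fresh
    return max(0, last_fill + ttl)


TTL = 300
for label, last_fill in [("no cache", None), ("cached at T-250", -250), ("cached at T-1", -1)]:
    print(f"{label}: new IP visible from T+{sees_new_ip_at(last_fill, TTL)}")
```

The worst case is a client that cached the old record just before the update, which is why the window is bounded by the full TTL.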

Trade-off Matrix

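| TTL   | Consistency             | Load                | Latency                      |
|-------|-------------------------|---------------------|------------------------------|
| 30s   | High (fast consistency) | High (more queries) | Slightly higher              |
| 300s  | Medium                  | Medium              | Lower (more cache hits)      |
| 3600s | Low (slow consistency)  | Low (few queries)   | Lowest (high cache hit rate) |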

Best Practices

  1. Dev environment: TTL=60-120s (refresh frequently)
  2. Staging: TTL=300s (mirror prod)
  3. Production: TTL=300-600s (balance)
  4. High-churn services: TTL=60s (accept load)
  5. Stable services: TTL=3600s (minimize queries)
  6. Lower TTL before planned changes (ensure fast rollout)
  7. Monitor cache hit rates (tune based on metrics)
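
These guidelines could be encoded as a simple check, for example in a review script. The sketch below is one possible shape: the profile labels, the `TTL_GUIDELINES` table, and `check_ttl` are hypothetical, and the ranges are copied from the list above.

```python
# TTL ranges in seconds, taken from the best-practices list; labels are illustrative.
TTL_GUIDELINES = {
    "dev": (60, 120),
    "staging": (300, 300),
    "prod": (300, 600),
    "high-churn": (60, 60),
    "stable": (3600, 3600),
}


def check_ttl(profile, ttl_seconds):
    """Compare a configured TTL against the guideline range for a profile."""
    low, high = TTL_GUIDELINES[profile]
    if ttl_seconds < low:
        return "below guideline: expect extra DNS load"
    if ttl_seconds > high:
        return "above guideline: expect slower convergence"
    return "within guideline"


print(check_ttl("prod", 300))
print(check_ttl("dev", 300))
print(check_ttl("stable", 600))
```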
