Chapter 4: Cloud DNS Architecture & Production Patterns
Why DNS matters in production
DNS is not a "commodity service" that you can ignore. At production scale, DNS failures lead to:
- Complete service outages: when DNS does not resolve, every service behind it is unreachable
- Security breaches: DNS poisoning, data exfiltration over DNS, lateral movement vectors
- Compliance violations: audit logs not captured, data residency violations
- Latency amplification: each uncached DNS query adds 10-100 ms, and at scale that can mean thousands of queries per second
- Hidden attack surface: DNS hijacking, DDoS via DNS reflection, cache poisoning
Reality check: most outages are not compute failures; they are network or DNS issues.
Why this chapter
GCP Cloud DNS differs from traditional DNS (BIND on-prem):
- Fully managed, no operational overhead, but you still need to understand the mechanics
- Integrated with the GCP security stack: IAM, audit logs, VPC enforcement
- Split-horizon, peering, forwarding: complex topologies for hybrid/multi-cloud
- Mandatory for GKE Autopilot: it cannot be avoided there
- Production at scale: thousands of zones, millions of queries per second
Learning Path
Recommended reading order:
Foundation (Required)
- Managed Zones: Public vs Private - Core abstraction
- Split-Horizon DNS - Internal/external topology
- Cloud DNS for GKE - Mandatory reading if you use Kubernetes
Hybrid/Enterprise (Strongly Recommended)
- DNS Peering - Multi-project, multi-VPC
- DNS Forwarding - On-premises integration
- Private DNS Zones - VPC-specific resolution
Advanced (Recommended)
- Response Policy Zones (RPZ) - Security overrides
- NodeLocal DNSCache - Performance
- DNS Resolution Path - Debugging & troubleshooting
Operations & Compliance
- DNS Query Logging - Audit, compliance, exfiltration detection
- TTL Tuning - Churn scenarios
- DNSSEC - Validation & key management
- Multi-Cluster DNS - Service Directory patterns
By Role
Backend Engineer (5-10 years experience)
→ Read: 01, 02, 06, 09
You need to understand how services discover each other, why DNS matters for RPC latency, and basic split-horizon.
Platform Engineer / SRE
→ Read: 01, 02, 03, 04, 05, 06, 07, 09, 10, 11, 12
You manage DNS infrastructure, peering configs, troubleshooting, and compliance.
Kubernetes Operator
→ Read: 06, 08, 09, 10, 13
GKE DNS integration, caching, multi-cluster service discovery.
Security Engineer
→ Read: 02, 07, 10, 12, 13
Split-horizon, RPZ policies, logging, DNSSEC, multi-cluster access control.
Cloud Architect
→ Read: 01, 02, 03, 04, 05, 06, 12, 13
Design hybrid networks, peering topologies, multi-region patterns.
Key Mental Models
1. DNS Resolution Hierarchy
Pod/VM (10.0.1.5)
↓
Local /etc/resolv.conf (ndots=5, search domains)
↓
NodeLocal DNSCache (optional, cached entries)
↓
Cloud DNS (private zones → public zones → upstream)
↓
Upstream (on-prem DNS, public internet)
Critical: Each hop can fail. Understand all 5 stages.
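The `ndots=5` and search-domain step in the hierarchy above is where many extra queries originate. The following is a minimal sketch of that expansion, assuming glibc-style stub resolver behavior; the function name `candidate_queries` and the example search list (typical Kubernetes defaults for a pod in the `default` namespace) are illustrative assumptions, not taken from this chapter.

```python
from __future__ import annotations


def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the FQDNs a stub resolver would try, in order, for `name`.

    Assumes glibc-style behavior; other resolvers (for example musl) may differ.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    as_is = name + "."
    with_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the literal name first, then the search list.
        return [as_is] + with_search
    # Too few dots: walk the search list first, then the literal name.
    return with_search + [as_is]


# Hypothetical search list, as commonly seen in a pod's /etc/resolv.conf.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

With this search list, `candidate_queries("api.example.com", SEARCH)` yields four candidates and only the last one is `api.example.com.`, so an external name can cost three failed lookups before it resolves. Writing the name with a trailing dot skips the search list entirely.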
2. Zone Attachment Model
Private zones attach to VPCs, not individual resources:
Project A: VPC A, VPC B (shared VPC)
├── Private Zone "internal.example.com"
│ └── Attached to: VPC A, VPC B → BOTH can resolve
└── Private Zone "database.internal"
└── Attached to: VPC A only → VPC B cannot resolve
Project B: VPC C (peer project)
└── Cannot resolve any zones from Project A (unless DNS peering)
Implication: zone attachment defines security boundaries, so attach zones carefully.
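The attachment model above can be sketched as a small lookup: a query only ever sees the private zones attached to the VPC it originates from, and the most specific (longest-suffix) zone wins. This is a simplified model for illustration; the `PrivateZone` class, the `visible_zone` helper, and the `vpc-a`/`vpc-b`/`vpc-c` identifiers are hypothetical names, not Cloud DNS APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PrivateZone:
    dns_name: str                                    # e.g. "internal.example.com."
    networks: set[str] = field(default_factory=set)  # VPCs the zone is attached to


def visible_zone(qname: str, vpc: str, zones: list[PrivateZone]) -> PrivateZone | None:
    """Return the most specific zone attached to `vpc` that covers `qname`."""
    if not qname.endswith("."):
        qname += "."
    matches = [
        z for z in zones
        if vpc in z.networks
        and (qname == z.dns_name or qname.endswith("." + z.dns_name))
    ]
    # Longest zone name = most specific suffix; None when nothing is attached.
    return max(matches, key=lambda z: len(z.dns_name), default=None)


# The two zones from the diagram above.
ZONES = [
    PrivateZone("internal.example.com.", {"vpc-a", "vpc-b"}),
    PrivateZone("database.internal.", {"vpc-a"}),
]
```

Here `visible_zone("db1.database.internal", "vpc-b", ZONES)` returns `None`, which mirrors the "VPC B cannot resolve" case, and any lookup from `vpc-c` returns `None` because no zone is attached to it.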
3. Split-Horizon Trade-offs
Same domain, different answers:
Internal user (10.0.1.5 in VPC A):
dig api.example.com → 10.0.2.10 (private, Cloud LB)
External user (internet):
dig api.example.com → 35.201.100.50 (public IP)
Cost: you must maintain 2 zones and keep them synchronized, with a risk of inconsistency.
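A minimal sketch of the split-horizon behavior above, plus a drift check for the synchronization cost. It is a simplification: Cloud DNS selects the view by the VPC a query originates from, whereas this sketch approximates that with a client IP range. The `10.0.0.0/16` CIDR and the names `resolve` and `drift` are assumptions for illustration.

```python
import ipaddress

# Two views of the same domain, using the addresses from the example above.
PRIVATE_VIEW = {"api.example.com.": "10.0.2.10"}
PUBLIC_VIEW = {"api.example.com.": "35.201.100.50"}

# Assumed CIDR for VPC A; stands in for "query originates inside the VPC".
VPC_RANGE = ipaddress.ip_network("10.0.0.0/16")


def resolve(qname: str, client_ip: str):
    """Answer from the private view for in-VPC clients, else the public view."""
    in_vpc = ipaddress.ip_address(client_ip) in VPC_RANGE
    view = PRIVATE_VIEW if in_vpc else PUBLIC_VIEW
    return view.get(qname)


def drift(private: dict, public: dict) -> set:
    """Names defined in only one of the two views (a sign of inconsistency)."""
    return set(private) ^ set(public)
```

Running a check like `drift(PRIVATE_VIEW, PUBLIC_VIEW)` in CI is one way to catch a record that was added to one zone but not the other.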
4. Forwarding vs Peering
| Aspect | Peering | Forwarding |
|---|---|---|
| Mechanism | Queries resolved in the producer VPC's context (no zone copy) | Queries sent to target name servers |
| Latency | Lower (stays inside Cloud DNS) | Higher (depends on the network path to the target) |
| Control | Producer VPC owns the records | Upstream dependency |
| On-prem | Limited | Primary use case |
| Scale | Best for multi-VPC | Best for 1-2 zones |
Production pattern: peering for multi-project, forwarding for on-prem integration.
5. GKE DNS Stack
Pod → /etc/resolv.conf (written by kubelet)
↓
NodeLocal DNSCache (on each node, if enabled)
↓
kube-dns or CoreDNS (cluster DNS)
↓
Cloud DNS (if query not in cluster)
↓
Upstream (external resolution)
Gotcha: NodeLocal DNSCache is optional but highly recommended for production.
6. TTL as Consistency Control
Low TTL (60s) = fresh data but more queries
High TTL (3600s) = stale risk but fewer queries
Production pattern: Environment-dependent:
- Production: 300s (5min) balance
- Staging: 60s (fresh data, troubleshooting)
- Dev: 300s (doesn't matter much)
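The trade-off can be put into rough numbers. As a back-of-the-envelope sketch, assume a hot record is cached independently by N resolvers and each cache refreshes at most once per TTL; the authoritative side then sees at most about N / TTL queries per second for that record, and the worst-case staleness equals the TTL. The function name and the figure of 1000 caches are assumptions for illustration.

```python
def upstream_qps(client_qps: float, caches: int, ttl_seconds: int) -> float:
    """Rough upper bound on authoritative queries/sec for one hot record.

    Assumes `caches` independent caches, each refreshing once per TTL.
    """
    if ttl_seconds <= 0:
        # TTL = 0 disables caching: every client lookup goes upstream.
        return client_qps
    # Upstream load cannot exceed what clients actually ask for.
    return min(client_qps, caches / ttl_seconds)


# 5000 client lookups/sec spread over 1000 caches (hypothetical numbers).
for ttl in (0, 60, 300, 3600):
    print(ttl, round(upstream_qps(5000, 1000, ttl), 2))
# 0 5000
# 60 16.67
# 300 3.33
# 3600 0.28
```

Under these assumptions, moving from 60s to 300s cuts upstream load by a factor of five while raising worst-case staleness from one minute to five.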
7. Response Policy Zones for Security
RPZ rules intercept queries and apply policies:
Query: malware.example.com → RPZ rule matches → NXDOMAIN (block)
Query: internal-service.com → RPZ rule matches → 10.0.1.100 (redirect to internal)
Use case: Malware blocking, internal service redirect, compliance overrides.
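The two rules above can be modeled as a lookup that runs before normal resolution. This is a conceptual sketch with exact-match rules only; the action labels (`NXDOMAIN`, `LOCAL_DATA`, `PASSTHRU`) and the `apply_policy` function are illustrative names, not the Cloud DNS response policy API.

```python
# Rules keyed by fully qualified name: (action, answer).
RULES = {
    "malware.example.com.": ("NXDOMAIN", None),             # block
    "internal-service.com.": ("LOCAL_DATA", "10.0.1.100"),  # redirect to internal
}


def apply_policy(qname: str, rules: dict = RULES) -> tuple:
    """Return (action, answer) for `qname`; PASSTHRU when no rule matches."""
    # DNS names are case-insensitive; normalize and make fully qualified.
    key = qname.lower()
    if not key.endswith("."):
        key += "."
    return rules.get(key, ("PASSTHRU", None))
```

A query that matches no rule falls through to normal resolution, so the policy only overrides the names it lists.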
Production Patterns Summary
Pattern 1: Hub-and-Spoke DNS
Hub project hosts the zones; spoke VPCs reach them through DNS peering → centralized management.
Pattern 2: Hybrid On-Premises
Cloud DNS forwarding zone → on-prem DNS → seamless resolution.
Pattern 3: Split-Horizon API
Internal vs external service endpoints, same domain.
Pattern 4: GKE Service Discovery
Cloud DNS for external service discovery, kube-dns for internal pods.
Pattern 5: Multi-Cluster DNS
Cloud Service Directory, ServiceImport/ServiceExport cross-cluster.
Common Mistakes to Avoid
❌ All traffic using public DNS resolvers (8.8.8.8) → Breaks split-horizon, increases latency
❌ No private zones — services have public IPs → Unnecessary cost, security risk
❌ Too many forwarding zones → Latency accumulation, failure cascade
❌ TTL = 0 (always refresh) → CPU spike, slower user experience
❌ No DNS query logging → Cannot debug, security blind spot, compliance gaps
❌ DNSSEC on internal DNS without validation → False sense of security, operational complexity
Architecture Decision Framework
When designing DNS for your system, ask:
- Public or internal? → Zone type (public/private)
- Multi-VPC? → Peering vs separate zones
- On-prem integration? → Forwarding zones
- Split-horizon needed? → Public + private zones
- Performance critical? → NodeLocal DNSCache, TTL tuning
- Security compliance? → RPZ, query logging, DNSSEC
- Multi-cluster? → Cloud Service Directory
- Debugging priority? → Logging, tracing setup
What's Next After This Chapter
- Chapter 5: VPC-native networking deep dive (routing, policies)
- Chapter 6: GKE Networking at scale (service mesh, ingress)
- Chapter 7: Security architecture (identity, policies)
- Chapter 8: Observability (logging, monitoring DNS)
Quick Commands Reference
# Managed zones
gcloud dns managed-zones list
gcloud dns managed-zones create ZONE_NAME --description="Private zone" --dns-name=example.com. --visibility=private --networks=VPC_NAME
# Record sets
gcloud dns record-sets list --zone=ZONE_NAME
gcloud dns record-sets create service.example.com. --rrdatas=10.0.1.5 --ttl=300 --type=A --zone=ZONE_NAME
# DNS peering (TARGET_PROJECT and TARGET_VPC are placeholders for the producer network)
gcloud dns managed-zones create ZONE_NAME --description="Peering zone" --dns-name=example.com. --visibility=private --networks=VPC_NAME --target-project=TARGET_PROJECT --target-network=TARGET_VPC
# Query logs
gcloud logging read "resource.type=dns_query" --limit=50
# Test resolution (replace 10.0.1.1 with your internal resolver address)
nslookup service.example.com 10.0.1.1
Document Info
- Last Updated: 2026-06-01
- Language: Vietnamese (Tiếng Việt)
- Target Audience: Senior engineers, platform teams, SREs
- Total Content: 13 files, ~45,000 words, production-focused
- Prerequisite: Chapter 3 (VPC Model) understanding