Skip to content

Chapter 4: Cloud DNS Architecture & Production Patterns

Tại sao DNS quan trọng trong production

DNS không phải là một "commodity service" mà bạn có thể ignore. Ở scale production, DNS failures dẫn đến:

  • Complete service outages: Khi DNS không resolve, bất kỳ service nào phía sau nó cũng unreachable
  • Security breaches: DNS poisoning, exfiltration detection, lateral movement vectors
  • Compliance violations: Audit logs không capture, data residency violations
  • Latency amplification: Mỗi DNS query thêm 10-100ms—ở scale có thể là 1000s queries/sec
  • Hidden attack surface: DNS hijacking, DDoS via DNS reflection, cache poisoning

Reality check: Hầu hết outages không phải lỗi compute—chúng là network/DNS issues.

Tại sao Chapter này

GCP Cloud DNS khác biệt so với traditional DNS (BIND on-prem):

  1. Fully managed, no operational overhead — nhưng vẫn cần understand cơ chế
  2. Integrated với GCP security stack — IAM, audit logs, VPC enforcement
  3. Split-horizon, peering, forwarding — complex topologies cho hybrid/multi-cloud
  4. Mandatory cho GKE Autopilot — không thể avoid
  5. Production at scale — 1000s zones, millions queries/sec

Learning Path

Recommend theo thứ tự:

Foundation (Bắt buộc)

  1. Managed Zones: Public vs Private - Core abstraction
  2. Split-Horizon DNS - Internal/external topology
  3. Cloud DNS for GKE - Mandatory reading nếu dùng Kubernetes
  1. DNS Peering - Multi-project, multi-VPC
  2. DNS Forwarding - On-premises integration
  3. Private DNS Zones - VPC-specific resolution
  1. Response Policy Zones (RPZ) - Security overrides
  2. NodeLocal DNSCache - Performance
  3. DNS Resolution Path - Debugging & troubleshooting

Operations & Compliance

  1. DNS Query Logging - Audit, compliance, exfiltration detection
  2. TTL Tuning - Churn scenarios
  3. DNSSEC - Validation & key management
  4. Multi-Cluster DNS - Service Directory patterns

By Role

Backend Engineer (5-10 years experience)

→ Read: 01, 02, 06, 09

Bạn cần hiểu how services discover each other, why DNS matters for RPC latency, basic split-horizon.

Platform Engineer / SRE

→ Read: 01, 02, 03, 04, 05, 06, 07, 09, 10, 11, 12

Bạn manage DNS infrastructure, peering configs, troubleshooting, compliance.

Kubernetes Operator

→ Read: 06, 08, 09, 10, 13

GKE DNS integration, caching, multi-cluster service discovery.

Security Engineer

→ Read: 02, 07, 10, 12, 13

Split-horizon, RPZ policies, logging, DNSSEC, multi-cluster access control.

Cloud Architect

→ Read: 01, 02, 03, 04, 05, 06, 12, 13

Design hybrid networks, peering topologies, multi-region patterns.

Key Mental Models

1. DNS Resolution Hierarchy

Pod/VM (10.0.1.5)

Local /etc/resolv.conf (ndots=5, search domains)

NodeLocal DNSCache (optional, cached entries)

Cloud DNS (private zones → public zones → upstream)

Upstream (on-prem DNS, public internet)

Critical: Each hop can fail. Understand all 5 stages.

2. Zone Attachment Model

Private zones attach to VPCs, not individual resources:

Project A: VPC A, VPC B (shared VPC)
  ├── Private Zone "internal.example.com"
  │   └── Attached to: VPC A, VPC B → BOTH can resolve
  └── Private Zone "database.internal"
      └── Attached to: VPC A only → VPC B cannot resolve

Project B: VPC C (peer project)
  └── Cannot resolve any zones from Project A (unless DNS peering)

Implication: Careful with zone attachment → security boundaries.

3. Split-Horizon Trade-offs

Same domain, different answers:

Internal user (10.0.1.5 in VPC A):
  dig api.example.com → 10.0.2.10 (private, Cloud LB)

External user (internet):
  dig api.example.com → 35.201.100.50 (public IP)

Cost: Need maintain 2 zones, synchronization, potential inconsistency.

4. Forwarding vs Peering

AspectPeeringForwarding
MechanismDNS zone replicationQuery redirection
LatencyLower (caching)Higher (per-query)
ControlFull zone copyUpstream dependency
On-premLimitedPrimary use case
ScaleBest for multi-VPCBest for 1-2 zones

Production pattern: Peering untuk multi-project, forwarding untuk on-prem integration.

5. GKE DNS Stack

Pod → kubelet's /etc/resolv.conf

kube-dns atau CoreDNS (cluster DNS)

NodeLocal DNSCache (on each node, if enabled)

Cloud DNS (if query not in cluster)

Upstream (external resolution)

Gotcha: NodeLocal DNSCache is optional tapi highly recommended untuk production.

6. TTL as Consistency Control

Low TTL (60s) = fresh data tapi more queries High TTL (3600s) = stale risk tapi fewer queries

Production pattern: Environment-dependent:

  • Production: 300s (5min) balance
  • Staging: 60s (fresh data, troubleshooting)
  • Dev: 300s (doesn't matter much)

7. Response Policy Zones for Security

RPZ intercept queries, apply policies:

Query: malware.example.com → RPZ rule matches → NXDOMAIN (block)
Query: internal-service.com → RPZ rule matches → 10.0.1.100 (redirect to internal)

Use case: Malware blocking, internal service redirect, compliance overrides.

Production Patterns Summary

Pattern 1: Hub-and-Spoke DNS

Hub project DNS peering, spokes attach → centralized management.

Pattern 2: Hybrid On-Premises

Cloud DNS forwarding zone → on-prem DNS → seamless resolution.

Pattern 3: Split-Horizon API

Internal vs external service endpoints, same domain.

Pattern 4: GKE Service Discovery

Cloud DNS for external service discovery, kube-dns for internal pods.

Pattern 5: Multi-Cluster DNS

Cloud Service Directory, ServiceImport/ServiceExport cross-cluster.

Common Mistakes to Avoid

All traffic using public DNS resolvers (8.8.8.8) → Breaks split-horizon, increases latency

No private zones — services have public IPs → Unnecessary cost, security risk

Too many forwarding zones → Latency accumulation, failure cascade

TTL = 0 (always refresh) → CPU spike, slower user experience

No DNS query logging → Cannot debug, security blind spot, compliance gaps

DNSSEC on internal DNS without validation → False sense of security, operational complexity

Architecture Decision Framework

When designing DNS for your system, ask:

  1. Public or internal? → Zone type (public/private)
  2. Multi-VPC? → Peering vs separate zones
  3. On-prem integration? → Forwarding zones
  4. Split-horizon needed? → Public + private zones
  5. Performance critical? → NodeLocal DNSCache, TTL tuning
  6. Security compliance? → RPZ, query logging, DNSSEC
  7. Multi-cluster? → Cloud Service Directory
  8. Debugging priority? → Logging, tracing setup

What's Next After This Chapter

  • Chapter 5: VPC-native networking deep dive (routing, policies)
  • Chapter 6: GKE Networking at scale (service mesh, ingress)
  • Chapter 7: Security architecture (identity, policies)
  • Chapter 8: Observability (logging, monitoring DNS)

Quick Commands Reference

bash
# Managed zones
gcloud dns managed-zones list
gcloud dns managed-zones create ZONE_NAME --dns-name=example.com --visibility=private --networks=VPC_NAME

# Record sets
gcloud dns record-sets list --zone=ZONE_NAME
gcloud dns record-sets transaction add 10.0.1.5 --name=service.example.com --ttl=300 --type=A --zone=ZONE_NAME

# DNS peering
gcloud dns managed-zones update ZONE_NAME --inbound-forwarding-servers=10.0.1.5

# Query logs
gcloud logging read "resource.type=dns_query" --limit=50

# Test resolution
nslookup service.example.com 10.0.1.1  # Internal resolver

Document Info

  • Last Updated: 2026-06-01
  • Language: Vietnamese (Tiếng Việt)
  • Target Audience: Senior engineers, platform teams, SREs
  • Total Content: 13 files, ~45,000 words, production-focused
  • Prerequisite: Chapter 3 (VPC Model) understanding