Split-Horizon DNS: Internal vs External Resolution
Why This Matters
Split-horizon DNS is a fundamental architectural pattern for any production system that serves both internal and external traffic. A common misconception: "We have an internal network and an external network, so they can simply share the same domain."
Reality: if it is implemented incorrectly, the consequences include:
- Hairpin routing: internal clients resolve the external IP → connect through the internet → up to 10x latency
- Firewall breaches: internal clients bypass the firewall while trying to reach the "external" IP
- Operational complexity: the two zones must be kept consistent
- Failover chaos: if the external IP goes down, internal clients also lose access (when set up incorrectly)
- Security blind spots: internal services exposed via external DNS
- TTL synchronization: clients cache stale records across boundaries
Production reality: more than 70% of companies implement this incorrectly. The result: latency, outages, security incidents.
Split-Horizon Fundamentals
Concept
Split-horizon DNS (also called "split-view DNS") serves different DNS answers depending on the source of the query:
Internal Query (from 10.0.0.0/8):
dig api.example.com
→ Resolves to: 10.0.1.5 (internal IP)
External Query (from the internet):
dig api.example.com
→ Resolves to: 35.201.100.50 (public IP)
Same domain name, different answers.
Why Needed
Scenario 1: Without Split-Horizon
├── Public Zone: api.example.com → 35.201.100.50
└── Internal VMs resolve via public DNS
    → Internal traffic goes through the internet
    → Crosses the network boundary unnecessarily
    → Latency added (10-100ms)
    → Bandwidth waste
    → Potential firewall rule issues
Scenario 2: With Split-Horizon (Correct)
├── Public Zone: api.example.com → 35.201.100.50 (internet-facing)
├── Private Zone: api.example.com → 10.0.1.5 (internal)
└── Internal VMs resolve the private zone
    → Direct internal routing
    → Minimal latency
    → Secure (does not pass through the internet)
Implementation Architecture
The Two-Zone Model
# Step 1: Create public zone
gcloud dns managed-zones create api-public \
    --dns-name=api.example.com. \
    --description="Public zone for api.example.com" \
    --visibility=public
# Step 2: Create private zone (same domain!)
gcloud dns managed-zones create api-internal \
    --dns-name=api.example.com. \
    --description="Private zone for internal resolution" \
    --visibility=private \
    --networks=projects/PROJECT_ID/global/networks/default
# Step 3: Add public records
gcloud dns record-sets transaction start --zone=api-public
gcloud dns record-sets transaction add 35.201.100.50 \
    --name=api.example.com. \
    --type=A \
    --ttl=300 \
    --zone=api-public
gcloud dns record-sets transaction execute --zone=api-public
# Step 4: Add internal records
gcloud dns record-sets transaction start --zone=api-internal
gcloud dns record-sets transaction add 10.0.1.5 \
    --name=api.example.com. \
    --type=A \
    --ttl=300 \
    --zone=api-internal
gcloud dns record-sets transaction execute --zone=api-internal
Resolution Flow
┌────────────────────────────────────────────────┐
│               Resolver Selection               │
└────────┬─────────────────────────────┬─────────┘
         │                             │
    ┌────▼─────┐                ┌──────▼─────┐
    │ Internal │                │ External   │
    │ (10.x.x) │                │ (internet) │
    └────┬─────┘                └──────┬─────┘
         │                             │
┌────────▼──────────┐         ┌────────▼─────────┐
│ VPC Resolver      │         │ Public Recursor  │
│ (169.254.169.254) │         │ (8.8.8.8 etc.)   │
└────────┬──────────┘         └────────┬─────────┘
         │                             │
┌────────▼──────────┐         ┌────────▼─────────┐
│ Private Zone      │         │ Public Zone      │
│ api.example.com   │         │ api.example.com  │
│ → 10.0.1.5        │         │ → 35.201.100.50  │
└───────────────────┘         └──────────────────┘
     ↓ Result                      ↓ Result
 10.0.1.5 (Direct)            35.201.100.50 (Internet)
How GCP Private Zones Work
Private zones resolve only within authorized VPCs:
VPC Internal Resolver (169.254.169.254):
- Every VPC has a built-in resolver
- Handles DNS queries from resources in the VPC
- First checks private zones bound to the VPC
- Falls back to public DNS only when no bound private zone covers the queried name; a name missing from a matching private zone returns NXDOMAIN
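The lookup order described above can be sketched as a small model. This is only an illustration, not Cloud DNS code: the `Zone` class and the `best_match` and `resolve` helpers are hypothetical names, and the sketch assumes that a bound private zone covering the queried name is authoritative, so a missing record yields NXDOMAIN (`None` here) rather than a public lookup.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Zone:
    dns_name: str                                     # e.g. "api.example.com."
    records: dict[str, str]                           # fully qualified name -> A record data
    networks: set[str] = field(default_factory=set)   # bound VPC names; empty for a public zone


def best_match(zones: list[Zone], qname: str) -> Optional[Zone]:
    """Pick the zone with the longest DNS name that covers qname, if any."""
    covering = [z for z in zones
                if qname == z.dns_name or qname.endswith("." + z.dns_name)]
    return max(covering, key=lambda z: len(z.dns_name), default=None)


def resolve(qname: str, vpc: Optional[str],
            private: list[Zone], public: list[Zone]) -> Optional[str]:
    """Return the answer a client would see, or None for NXDOMAIN."""
    if vpc is not None:
        # Only private zones bound to the client's VPC are visible to it
        visible = [z for z in private if vpc in z.networks]
        zone = best_match(visible, qname)
        if zone is not None:
            # Assumption: a matching private zone is authoritative, no public fallback
            return zone.records.get(qname)
    zone = best_match(public, qname)
    return zone.records.get(qname) if zone else None


private = [Zone("api.example.com.", {"api.example.com.": "10.0.1.5"}, {"default"})]
public = [Zone("api.example.com.", {"api.example.com.": "35.201.100.50"})]

print(resolve("api.example.com.", "default", private, public))    # VM in the bound VPC: 10.0.1.5
print(resolve("api.example.com.", "secondary", private, public))  # VM in an unbound VPC: 35.201.100.50
print(resolve("api.example.com.", None, private, public))         # internet client: 35.201.100.50
```

Passing `vpc=None` stands in for a client on the internet, which never sees private zones at all.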
Configuration:
├── VPC "default" (10.0.0.0/8)
└── Private Zone "api.example.com" (bound)
    ↓
All resources in "default" VPC → resolve "api.example.com" → 10.0.1.5
VPC "secondary" (172.16.0.0/12)
└── NOT bound to "api.example.com" zone
    ↓
Resources in "secondary" VPC → do not see the private zone; "api.example.com" falls through to public DNS → 35.201.100.50
Production Pattern: Hub-Spoke Split-Horizon
The most common architecture for enterprises:
Hub Project (centralized DNS management)
├── Shared VPC
├── Public Zone: api.example.com (internet-facing)
├── Private Zone: api.example.com → 10.0.1.5 (bound to Shared VPC)
└── Private Zone: internal.example.com (bound to Shared VPC)
Service Project A
├── Resources in Shared VPC
└── Resolve:
    - api.example.com → 10.0.1.5 (private zone)
    - public users resolve → 35.201.100.50 (public zone)
Service Project B
├── Resources in Shared VPC
└── Same resolution as A
Terraform Implementation
# Public zone
resource "google_dns_managed_zone" "api_public" {
  name        = "api-public-zone"
  dns_name    = "api.example.com."
  description = "Public zone for api.example.com"
  visibility  = "public"
}
# Private zone
resource "google_dns_managed_zone" "api_private" {
  name        = "api-private-zone"
  dns_name    = "api.example.com."
  description = "Private zone for internal resolution"
  visibility  = "private"
  private_visibility_config {
    # The provider's block name is "networks" (one block per bound VPC)
    networks {
      network_url = google_compute_network.shared_vpc.id
    }
  }
}
# Public records
resource "google_dns_record_set" "api_public" {
  name         = "api.example.com."
  type         = "A"
  ttl          = 300
  managed_zone = google_dns_managed_zone.api_public.name
  rrdatas      = ["35.201.100.50"]
}
# Private records (internal)
resource "google_dns_record_set" "api_private" {
  name         = "api.example.com."
  type         = "A"
  ttl          = 300
  managed_zone = google_dns_managed_zone.api_private.name
  rrdatas      = ["10.0.1.5"]
}
Consistency Challenges
Challenge 1: Manual Synchronization
With two zones, the data must remain consistent:
Public Zone: api.example.com A 35.201.100.50
Private Zone: api.example.com A 10.0.1.5
Scenario: Update load balancer IP from 35.201.100.50 → 35.201.100.51
Manual process:
Step 1: Update public zone
Step 2: Wait for propagation (seconds)
Step 3: Update private zone internal IP? (if changed)
Problem: Race condition - external clients see new IP, internal see old IP
Solution: Use Infrastructure-as-Code (Terraform) so both zones are updated together in a single apply.
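One way to picture the single-source-of-truth idea behind that solution is to derive both zones' record sets from one definition, so the two views cannot be edited separately. The following is a minimal sketch, not a Terraform replacement: `SplitRecord` and `render` are hypothetical names, and the output dictionaries only loosely mirror the shape of a DNS record set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitRecord:
    name: str        # fully qualified, with trailing dot
    public_ip: str   # answer served from the public zone
    private_ip: str  # answer served from the private zone
    ttl: int = 300   # one TTL shared by both views


def render(records: list[SplitRecord]) -> dict[str, list[dict]]:
    """Expand each definition into one record set per zone."""
    zones: dict[str, list[dict]] = {"public": [], "private": []}
    for r in records:
        zones["public"].append(
            {"name": r.name, "type": "A", "ttl": r.ttl, "rrdatas": [r.public_ip]})
        zones["private"].append(
            {"name": r.name, "type": "A", "ttl": r.ttl, "rrdatas": [r.private_ip]})
    return zones


# The load balancer change from the scenario: one edit, both views regenerated
desired = render([SplitRecord("api.example.com.", "35.201.100.51", "10.0.1.5")])
for zone, record_sets in desired.items():
    print(zone, record_sets)
```

Because the TTL lives on the shared definition, the two zones also cannot drift apart on TTL.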
Challenge 2: TTL Coordination
TTLs in the public and private zones can differ:
Public Zone: api.example.com TTL=300
Private Zone: api.example.com TTL=60
Scenario: Change internal IP
T+0: Update private zone (TTL=60) → old cached entries expire within 60s
T+0: Internal clients still see old IP (cached, TTL not expired yet)
T+60: Cache expires, new IP resolved
But if:
Public Zone: api.example.com TTL=3600
External clients might see a stale IP for up to 1 hour after the update.
Recommendation: Same TTL for same domain (300s is a good default).
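The T+0 / T+60 timeline can be reproduced with a toy caching resolver. This is a simplified sketch: `CachingResolver` is a hypothetical class, time is passed in explicitly instead of read from a clock, and 10.0.1.6 is an invented replacement IP used only for illustration.

```python
class CachingResolver:
    """Toy resolver cache: keeps an answer until its TTL expires."""

    def __init__(self, authoritative: dict[str, tuple[str, int]]):
        self.authoritative = authoritative              # name -> (ip, ttl seconds)
        self.cache: dict[str, tuple[str, float]] = {}   # name -> (ip, expires_at)

    def lookup(self, name: str, now: float) -> str:
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]                            # unexpired entry, possibly stale
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + ttl)
        return ip


zone = {"api.example.com.": ("10.0.1.5", 60)}
resolver = CachingResolver(zone)

resolver.lookup("api.example.com.", now=0)          # caches 10.0.1.5 until t=60
zone["api.example.com."] = ("10.0.1.6", 60)         # record changed right after the lookup

print(resolver.lookup("api.example.com.", now=59))  # 10.0.1.5, still cached
print(resolver.lookup("api.example.com.", now=60))  # 10.0.1.6, cache expired
```

With a TTL of 3600 the same client would keep returning the old address for up to an hour, which is the stale window the recommendation is meant to bound.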
Challenge 3: Record Completeness
Public and private zones do not need to have the same records:
Public Zone (api.example.com):
- api.example.com A 35.201.100.50 (public endpoint)
Private Zone (api.example.com):
- api.example.com A 10.0.1.5 (internal IP)
- admin.api.example.com A 10.0.1.100 (admin only, not in public)
- debug.api.example.com A 10.0.1.101 (debug only, not in public)
Result:
Internal: All 3 names resolve
External: Only api.example.com resolves
Implication: Operational complexity increases, so documentation is critical.
Failover & Redundancy
Scenario 1: Simple Failover (Active-Standby)
Primary Load Balancer: 35.201.100.50 (active)
Standby Load Balancer: 35.201.100.51 (standby)
Setup 1 (No split-horizon):
Public Zone: api.example.com A 35.201.100.50
If primary down:
    → Update public zone
    → TTL=300 means up to 5 min before external users see the new IP
    → Problem: up to 5 min outage for external users
Setup 2 (With split-horizon + health checks):
Public Zone: api.example.com A 35.201.100.50
Private Zone: api.example.com A 10.0.1.5 (internal LB)
If public LB down:
    → Internal traffic not affected (routed internally)
    → External users see outage (can mitigate via a shorter TTL)
If internal IP down:
    → Internal users see outage
    → Can retry/failover via application logic
Scenario 2: Multi-Region Failover
Region: us-central1
Public IP: 35.201.100.50
Private IP: 10.0.1.5
Region: us-east1
Public IP: 35.201.100.51
Private IP: 10.0.2.5
Setup (GCP Cloud Load Balancer):
Public Zone:
- api.example.com (Global) → Cloud LB (anycast)
- Cloud LB automatically routes to closest healthy region
Private Zone:
- api.example.com A 10.0.1.5 (us-central1, default)
- api-us-east1.example.com A 10.0.2.5 (explicit failover)
Internal applications:
- Normal: Resolve api.example.com → 10.0.1.5
- Region-down scenario: Explicitly use api-us-east1.example.com
Anti-Patterns to Avoid
❌ Anti-Pattern 1: Only Public Zone, No Private
Zone "example.com" (public only):
api.example.com A 10.0.1.5 (PRIVATE IP exposed)
Result:
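✗ Internal addressing leaks: the name can be resolved by anyone (DNS enumeration)
✗ External clients receive an unroutable private address
✗ If a public IP is published instead, internal traffic routes through the internet and wastes bandwidth
Solution: Use a private zone for internal IPs.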
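A quick audit for this anti-pattern is to scan a public zone's address records for non-public IPs. The sketch below uses Python's standard `ipaddress` module; the function name and the input shape (name mapped to a list of addresses) are assumptions, and `www.example.com.` is a made-up record added only for contrast.

```python
import ipaddress


def private_addresses_in_public_zone(records: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return the names whose A/AAAA data includes non-public addresses."""
    leaks: dict[str, list[str]] = {}
    for name, rrdatas in records.items():
        # is_private covers RFC 1918 ranges plus other non-global ranges
        # such as loopback and link-local
        bad = [ip for ip in rrdatas if ipaddress.ip_address(ip).is_private]
        if bad:
            leaks[name] = bad
    return leaks


public_zone = {
    "api.example.com.": ["10.0.1.5"],        # the anti-pattern: private IP in a public zone
    "www.example.com.": ["35.201.100.50"],   # hypothetical record with a public IP
}
print(private_addresses_in_public_zone(public_zone))
# {'api.example.com.': ['10.0.1.5']}
```

The record data could come from an export of the public zone; how it is fetched is left out of the sketch.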
❌ Anti-Pattern 2: Inconsistent Records
Public Zone: api.example.com A 35.201.100.50
Private Zone: (empty, no api.example.com record)
Result:
✗ Internal clients cannot resolve api.example.com
✗ Must use different internal hostname (api-internal.example.com)
✗ Application code must handle both names
Solution: The private zone should have matching records.
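Checking for this gap comes down to set arithmetic on the record names of the two zones. A minimal sketch, assuming the names have already been collected from each zone; `audit_split_horizon` and its result keys are hypothetical.

```python
def audit_split_horizon(public_names: set[str], private_names: set[str]) -> dict[str, set[str]]:
    """Classify record names by which view can resolve them."""
    return {
        # Present publicly but absent privately: the anti-pattern above
        "missing_in_private": public_names - private_names,
        # Internal-only names are legitimate, but should be documented
        "internal_only": private_names - public_names,
        "both_views": public_names & private_names,
    }


report = audit_split_horizon(
    public_names={"api.example.com."},
    private_names=set(),   # empty private zone, as in the anti-pattern
)
print(report["missing_in_private"])   # {'api.example.com.'}
```

An empty `missing_in_private` set is the condition worth alerting on; `internal_only` is informational.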
❌ Anti-Pattern 3: Different TTLs
Public: api.example.com TTL=3600
Private: api.example.com TTL=0 (always refresh)
Result:
✗ Public users see a stale IP for up to 1 hour
✗ Private users hit DNS constantly (load)
Solution: Same TTL (300s recommended).
❌ Anti-Pattern 4: No Zone Versioning/Documentation
Zone "api.example.com" (which VPC is it bound to?)
Zone "api-internal.example.com" (is this the same as above?)
Zone "api.prod.example.com" (which one do prod apps use?)
Result:
✗ Chaos when onboarding engineers
✗ Difficult to maintain
Solution: Clear naming, documentation, Terraform comments.
Monitoring & Troubleshooting
Issue 1: Internal Clients See External IP
Debugging:
From internal VM: nslookup api.example.com
→ Returns 35.201.100.50 (WRONG, should be 10.0.1.5)
Causes:
1. Private zone not bound to VM's VPC
2. Private zone's DNS name does not cover api.example.com, so the query falls through to public DNS (a bound zone that covers the name but lacks the record returns NXDOMAIN instead)
3. VM using public DNS (8.8.8.8) instead of the VPC resolver
Solution:
gcloud dns managed-zones describe api-internal --format="value(privateVisibilityConfig.networks[].networkUrl)"
# Should show your VPC
gcloud dns record-sets list --zone=api-internal --filter="name:api.example.com"
# Should show 10.0.1.5
# Fix if using public DNS
# In VM's /etc/resolv.conf, ensure it uses VPC resolver (169.254.169.254)
Issue 2: External Clients Cannot Resolve
Debugging:
From external: dig api.example.com @8.8.8.8
→ NXDOMAIN (not found)
Causes:
1. Public zone doesn't have api.example.com record
2. Nameservers not updated at registrar (or NS delegation missing in the parent example.com zone)
3. Zone propagation not complete
Solution:
gcloud dns record-sets list --zone=api-public
# Verify api.example.com exists
gcloud dns managed-zones describe api-public --format="value(nameServers)"
# Get nameservers, verify at registrar (GoDaddy, etc.)
dig api.example.com @ns-cloud-a1.googledomains.com
# Test a specific nameserver (use one of the names returned by the describe command above)
Issue 3: TTL Mismatch Causing Stale IPs
Symptom: After record change, some clients still see old IP
Solution:
1. Monitor with: watch -n 1 'dig +short api.example.com'
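2. If stale: wait for the TTL to expire or have clients clear their caches (lowering the TTL only helps future lookups)
3. For future changes: lower the TTL in advance and wait for the old TTL to expire before changing the record
GKE-Specific Split-Horizon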
GKE services automatically get DNS entries:
Service "api" in namespace "default":
Internal DNS: api.default.svc.cluster.local (10.4.0.50)
Exposed via Google Cloud LB:
Public DNS: api.example.com (35.201.100.50)
Split-horizon automatically handled by kube-dns/CoreDNS:
✓ Internal pods → api.default.svc.cluster.local
✓ External users → api.example.com
Production Checklist
- [ ] Public zone created and registered at domain registrar
- [ ] Private zone created and bound to correct VPC(s)
- [ ] Both zones have matching A/AAAA records for main services
- [ ] TTL same across zones (recommended 300s)
- [ ] Documentation clarifies which records in which zone
- [ ] Terraform manages both zones atomically
- [ ] Monitoring alerts if external/internal IPs diverge
- [ ] Regular audit: run gcloud dns record-sets list for both zones
- [ ] Failover tested (manually update record, verify propagation)
- [ ] Team trained: how to troubleshoot split-horizon