Split-Horizon DNS: Internal vs External Resolution

Why This Matters

Split-horizon DNS is a fundamental architectural pattern for any production system that serves both internal and external traffic. A common misconception: "We have an internal network and an external network, so both can just use the same domain."

Reality: implemented incorrectly, the consequences include:

  • Hairpin routing: internal clients resolve the external IP → connect through the internet → latency up to 10x higher
  • Firewall bypass: internal clients sidestep internal firewall policy while trying to reach the "external" IP
  • Operational complexity: consistency must be maintained between two zones
  • Failover chaos: with a wrong setup, internal clients also lose access when the external IP goes down
  • Security blind spots: internal services exposed through external DNS
  • TTL synchronization: clients cache stale records across boundaries

Production reality: this is implemented incorrectly at more than 70% of companies. The result: latency, outages, and security incidents.

Split-Horizon Fundamentals

Concept

Split-horizon DNS (also called "split-view DNS") serves different DNS answers depending on the source of the query:

Internal Query (from 10.0.0.0/8):
  dig api.example.com
  → Resolves to: 10.0.1.5 (internal IP)

External Query (from the internet):
  dig api.example.com
  → Resolves to: 35.201.100.50 (public IP)

Same domain name, different answers.
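
The view selection described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular DNS server implements it: the `INTERNAL_VIEW` and `EXTERNAL_VIEW` tables and the `answer_for` function are hypothetical names, and the addresses are the example values used in this section.

```python
import ipaddress
from typing import Optional

# Hypothetical view tables; the addresses are this section's example values.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")
INTERNAL_VIEW = {"api.example.com": "10.0.1.5"}
EXTERNAL_VIEW = {"api.example.com": "35.201.100.50"}


def answer_for(source_ip: str, name: str) -> Optional[str]:
    """Return the A record a client at source_ip would receive for name."""
    is_internal = ipaddress.ip_address(source_ip) in INTERNAL_NET
    view = INTERNAL_VIEW if is_internal else EXTERNAL_VIEW
    return view.get(name)


print(answer_for("10.0.3.7", "api.example.com"))     # query from inside 10.0.0.0/8
print(answer_for("203.0.113.9", "api.example.com"))  # query from the internet
```

The only input that changes the answer is the source address of the query; the name being asked for is identical in both calls.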

Why Needed

Scenario 1: Without Split-Horizon
  
├── Public Zone: api.example.com → 35.201.100.50
└── Internal VMs resolve via public DNS
    → Internal traffic goes through internet
    → Crosses network boundary unnecessarily
    → Latency added (10-100ms)
    → Bandwidth waste
    → Potential firewall rules issues

Scenario 2: With Split-Horizon (Correct)

├── Public Zone: api.example.com → 35.201.100.50 (internet-facing)
├── Private Zone: api.example.com → 10.0.1.5 (internal)
└── Internal VMs resolve private zone
    → Direct internal routing
    → Minimal latency
    → Secure (traffic does not pass through the internet)

Implementation Architecture

The Two-Zone Model

bash
# Step 1: Create public zone (--description is a required flag)
gcloud dns managed-zones create api-public \
  --dns-name=api.example.com. \
  --description="Public zone for api.example.com" \
  --visibility=public

# Step 2: Create private zone (same domain!), visible only to the listed VPC
gcloud dns managed-zones create api-internal \
  --dns-name=api.example.com. \
  --description="Private zone for internal resolution" \
  --visibility=private \
  --networks=projects/PROJECT_ID/global/networks/default

# Step 3: Add public records (record names are fully qualified, with trailing dot)
gcloud dns record-sets transaction start --zone=api-public
gcloud dns record-sets transaction add 35.201.100.50 \
  --name=api.example.com. \
  --type=A \
  --ttl=300 \
  --zone=api-public
gcloud dns record-sets transaction execute --zone=api-public

# Step 4: Add internal records
gcloud dns record-sets transaction start --zone=api-internal
gcloud dns record-sets transaction add 10.0.1.5 \
  --name=api.example.com. \
  --type=A \
  --ttl=300 \
  --zone=api-internal
gcloud dns record-sets transaction execute --zone=api-internal

Resolution Flow

┌─────────────────────────────────────────┐
│         Resolver Selection              │
└────────┬────────────────────────────┬───┘
         │                            │
    ┌────▼─────┐              ┌──────▼────┐
    │ Internal │              │ External  │
    │ (10.x.x) │              │ (internet)│
    └────┬─────┘              └──────┬────┘
         │                           │
    ┌────▼──────────────┐    ┌──────▼────────┐
    │ VPC Resolver      │    │ Public Recursor│
    │ (169.254.169.254) │    │ (8.8.8.8 etc) │
    └────┬──────────────┘    └──────┬────────┘
         │                          │
    ┌────▼──────────────┐    ┌──────▼───────────┐
    │ Private Zone      │    │ Public Zone       │
    │ api.example.com   │    │ api.example.com   │
    │ → 10.0.1.5       │    │ → 35.201.100.50  │
    └───────────────────┘    └──────────────────┘
         
         ↓ Result             ↓ Result
    
    10.0.1.5 (Direct)   35.201.100.50 (Internet)

How GCP Private Zones Work

Private zones resolve only within authorized VPCs:

VPC Internal Resolver (169.254.169.254):
  - Every VPC has a built-in resolver
  - Handles DNS queries from resources in the VPC
  - First checks private zones bound to the VPC
  - A matching private zone is authoritative: a name missing from it returns NXDOMAIN
  - Falls back to public DNS only when no bound private zone matches the name

Configuration:
├── VPC "default" (10.0.0.0/8)
└── Private Zone "api.example.com" (bound)

    All resources in "default" VPC → resolve "api.example.com" → 10.0.1.5

VPC "secondary" (172.16.0.0/12)
└── NOT bound to "api.example.com" zone

    Resources in "secondary" VPC → resolve "api.example.com" via public DNS → 35.201.100.50
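
The lookup order can be modeled in Python to make the edge cases concrete. This is a simplified sketch that assumes the documented Cloud DNS behavior: a private zone bound to the querying VPC is authoritative for its suffix, and public DNS is consulted only when no bound private zone matches. The `PRIVATE_ZONES` and `PUBLIC_ZONES` tables and both function names are hypothetical.

```python
from typing import Iterable, Optional

# Hypothetical zone data mirroring the configuration shown above.
PUBLIC_ZONES = {"api.example.com.": {"api.example.com.": "35.201.100.50"}}
PRIVATE_ZONES = {
    "api.example.com.": {
        "networks": {"default"},  # VPCs this private zone is bound to
        "records": {"api.example.com.": "10.0.1.5"},
    }
}


def matching_zone(name: str, zone_names: Iterable[str]) -> Optional[str]:
    """Pick the zone whose DNS name is the longest suffix of the query name."""
    candidates = [z for z in zone_names if name == z or name.endswith("." + z)]
    return max(candidates, key=len, default=None)


def resolve(vpc: str, name: str) -> Optional[str]:
    """Return the answer a VM in `vpc` would get, or None for NXDOMAIN."""
    visible = [z for z, cfg in PRIVATE_ZONES.items() if vpc in cfg["networks"]]
    zone = matching_zone(name, visible)
    if zone is not None:
        # A bound private zone is authoritative: no fall-through to public DNS.
        return PRIVATE_ZONES[zone]["records"].get(name)
    zone = matching_zone(name, PUBLIC_ZONES)
    return PUBLIC_ZONES[zone].get(name) if zone is not None else None


print(resolve("default", "api.example.com."))        # bound VPC
print(resolve("secondary", "api.example.com."))      # unbound VPC
print(resolve("default", "admin.api.example.com."))  # missing from private zone
```

The third call is the case that surprises people: the name falls under the private zone's suffix, so the model returns no answer rather than consulting the public zone.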

Production Pattern: Hub-Spoke Split-Horizon

The most common architecture for enterprises:

Hub Project (centralized DNS management)
├── Shared VPC
├── Public Zone: api.example.com (internet-facing)
├── Private Zone: api.example.com → 10.0.1.5 (bound to Shared VPC)
└── Private Zone: internal.example.com (bound to Shared VPC)

Service Project A
├── Resources in Shared VPC
└── Resolve:
    - api.example.com → 10.0.1.5 (private zone)
    - public users resolve → 35.201.100.50 (public zone)

Service Project B
├── Resources in Shared VPC
└── Same resolution as A

Terraform Implementation

hcl
# Public zone
resource "google_dns_managed_zone" "api_public" {
  name        = "api-public-zone"
  dns_name    = "api.example.com."
  description = "Public zone for api.example.com"
  visibility  = "public"
}

# Private zone
resource "google_dns_managed_zone" "api_private" {
  name        = "api-private-zone"
  dns_name    = "api.example.com."
  description = "Private zone for internal resolution"
  visibility  = "private"
  
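  private_visibility_config {
    # The nested block is named "networks". google_compute_network.shared_vpc
    # is assumed to be defined elsewhere in the configuration.
    networks {
      network_url = google_compute_network.shared_vpc.id
    }
  }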
}

# Public records
resource "google_dns_record_set" "api_public" {
  name            = "api.example.com."
  type            = "A"
  ttl             = 300
  managed_zone    = google_dns_managed_zone.api_public.name
  rrdatas         = ["35.201.100.50"]
}

# Private records (internal)
resource "google_dns_record_set" "api_private" {
  name            = "api.example.com."
  type            = "A"
  ttl             = 300
  managed_zone    = google_dns_managed_zone.api_private.name
  rrdatas         = ["10.0.1.5"]
}

Consistency Challenges

Challenge 1: Manual Synchronization

With two zones, the data must remain consistent:

Public Zone:        api.example.com A 35.201.100.50
Private Zone:       api.example.com A 10.0.1.5

Scenario: Update load balancer IP from 35.201.100.50 → 35.201.100.51

Manual process:
  Step 1: Update public zone
  Step 2: Wait for propagation (seconds)
  Step 3: Update private zone internal IP? (if changed)
  
Problem: Race condition - external clients see the new IP while internal clients still see the old one

Solution: Infrastructure-as-Code (Terraform), so both zones are changed in one reviewed apply rather than by hand.

Challenge 2: TTL Coordination

The TTLs in the public and private zones can differ:

Public Zone:  api.example.com TTL=300
Private Zone: api.example.com TTL=60

Scenario: Change internal IP
  T+0: Update private zone (TTL=60) → old cached entries expire within 60s
  T+0: Internal clients still see old IP (cached, TTL not expired yet)
  T+60: Cache expires, new IP resolved

But if:
  Public Zone: api.example.com TTL=3600
  
External clients might see a stale IP for up to 1 hour after the update.

Recommendation: Same TTL for same domain (300s is good default).
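
The timeline arithmetic above is simple enough to check in code. The sketch below assumes resolvers honor the record TTL exactly (real caches may clamp or extend it), and the function names are hypothetical.

```python
def last_stale_answer(change_time_s: int, ttl_s: int) -> int:
    """Latest time a cache filled just before the change can still serve old data."""
    return change_time_s + ttl_s


def view_skew(public_ttl_s: int, private_ttl_s: int) -> int:
    """Seconds during which one view may already be fresh while the other is stale."""
    return abs(public_ttl_s - private_ttl_s)


# TTL values taken from the scenarios in this section; the change happens at T+0.
for label, ttl in [("private", 60), ("public", 300), ("public, long TTL", 3600)]:
    print(f"{label}: stale answers possible until T+{last_stale_answer(0, ttl)}s")

print("skew with 3600s vs 60s:", view_skew(3600, 60), "seconds")
print("skew with 300s vs 300s:", view_skew(300, 300), "seconds")
```

With equal TTLs the skew is zero, which is the practical argument for the recommendation above.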

Challenge 3: Record Completeness

Public and private zones do not need to contain the same records:

Public Zone (example.com):
  - api.example.com A 35.201.100.50 (public endpoint)

Private Zone (example.com):
  - api.example.com A 10.0.1.5 (internal IP)
  - api-admin.example.com A 10.0.1.100 (admin only, not in public)
  - api-debug.example.com A 10.0.1.101 (debug only, not in public)

Result:
  Internal: All 3 names resolve
  External: Only api.example.com resolves

Implication: Operational complexity, documentation critical.
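
One way to keep this manageable is a small drift report that compares the two record sets. The sketch below works on plain dictionaries; in practice the data might come from an export of each zone, which is not shown here. The `compare_zones` function and the TTLs given to the admin and debug records are hypothetical.

```python
from typing import Dict, List, Tuple

Records = Dict[str, Tuple[int, str]]  # record name -> (ttl, ip)

# Hypothetical snapshots of the two zones, using this section's example values.
PUBLIC: Records = {"api.example.com.": (300, "35.201.100.50")}
PRIVATE: Records = {
    "api.example.com.": (300, "10.0.1.5"),
    "api-admin.example.com.": (300, "10.0.1.100"),
    "api-debug.example.com.": (300, "10.0.1.101"),
}


def compare_zones(public: Records, private: Records) -> Dict[str, List[str]]:
    """Report names present in only one zone and names whose TTLs differ."""
    public_names, private_names = set(public), set(private)
    shared = public_names & private_names
    return {
        "private_only": sorted(private_names - public_names),
        "public_only": sorted(public_names - private_names),
        "ttl_mismatch": sorted(n for n in shared if public[n][0] != private[n][0]),
    }


for category, names in compare_zones(PUBLIC, PRIVATE).items():
    print(category, names)
```

Entries under `private_only` are expected for internal-only names and are worth documenting. Entries under `public_only` deserve a closer look, because internal clients may be unable to resolve those names.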

Failover & Redundancy

Scenario 1: Simple Failover (Active-Standby)

Primary Load Balancer: 35.201.100.50 (active)
Standby Load Balancer: 35.201.100.51 (standby)

Setup 1 (No split-horizon):
  Public Zone: api.example.com A 35.201.100.50
  
  If primary down:
    → Update public zone
    → TTL=300 means up to 5min before external users see new IP
    → Problem: up to 5min outage for external users

Setup 2 (With split-horizon + health checks):
  Public Zone: api.example.com A 35.201.100.50
  Private Zone: api.example.com A 10.0.1.5 (internal LB)
  
  If public LB down:
    → Internal traffic not affected (routed internally)
    → External users see outage (can mitigate via faster TTL)
  
  If internal IP down:
    → Internal users see outage
    → Can retry/failover via application logic

Scenario 2: Multi-Region Failover

Region: us-central1
  Public IP: 35.201.100.50
  Private IP: 10.0.1.5

Region: us-east1
  Public IP: 35.201.100.51
  Private IP: 10.0.2.5

Setup (GCP Cloud Load Balancer):
  Public Zone:
    - api.example.com (Global) → Cloud LB (anycast)
    - Cloud LB automatically routes to closest healthy region
  
  Private Zone:
    - api.example.com A 10.0.1.5 (us-central1, default)
    - api-us-east1.example.com A 10.0.2.5 (explicit failover)

Internal applications:
  - Normal: Resolve api.example.com → 10.0.1.5
  - Region-down scenario: Explicitly use api-us-east1.example.com
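
The application-side failover mentioned above can be as simple as trying hostnames in order. The following is a minimal sketch: `first_reachable` and `tcp_check` are hypothetical helpers, port 443 is an assumption, and the demo uses a stubbed health check so it runs without network access.

```python
import socket
from typing import Callable, Iterable, Optional

# Names from the private zone above: default first, explicit regional name second.
CANDIDATES = ["api.example.com", "api-us-east1.example.com"]


def tcp_check(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (port is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False


def first_reachable(
    hostnames: Iterable[str], is_healthy: Callable[[str], bool] = tcp_check
) -> Optional[str]:
    """Return the first hostname that passes the health check, or None."""
    for host in hostnames:
        if is_healthy(host):
            return host
    return None


# Stubbed check simulating a us-central1 outage.
down = {"api.example.com"}
print(first_reachable(CANDIDATES, lambda host: host not in down))
```

Passing the health check as a parameter keeps the selection logic testable and lets a real deployment swap in an HTTP probe instead of a bare TCP connect.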

Anti-Patterns to Avoid

❌ Anti-Pattern 1: Only Public Zone, No Private

Zone "example.com" (public only):
  api.example.com A 10.0.1.5 (PRIVATE IP exposed)
  
Result:
  ✗ External clients receive an unroutable RFC 1918 address
  ✗ Can be resolved by anyone, exposing internal addressing (DNS enumeration)
  ✗ Internal lookups depend on public DNS resolution

Solution: Use a private zone for internal IPs.

❌ Anti-Pattern 2: Inconsistent Records

Public Zone: api.example.com A 35.201.100.50
Private Zone: (empty, no api.example.com record)

Result:
  ✗ Internal clients cannot resolve api.example.com
  ✗ Must use different internal hostname (api-internal.example.com)
  ✗ Application code must handle both names

Solution: Private zone should have matching records.

❌ Anti-Pattern 3: Different TTLs

Public: api.example.com TTL=3600
Private: api.example.com TTL=0 (always refresh)

Result:
  ✗ Public users see stale IP for 1 hour
  ✗ Private users hit DNS constantly (load)

Solution: Same TTL (300s recommended).

❌ Anti-Pattern 4: No Zone Versioning/Documentation

Zone "api.example.com" (which VPC is it bound to?)
Zone "api-internal.example.com" (is this the same as above?)
Zone "api.prod.example.com" (which one do prod apps use?)

Result:
  ✗ Chaos when on-boarding engineers
  ✗ Difficult to maintain

Solution: Clear naming, documentation, Terraform comments.

Monitoring & Troubleshooting

Issue 1: Internal Clients See External IP

Debugging:
  From internal VM: nslookup api.example.com
  → Returns 35.201.100.50 (WRONG, should be 10.0.1.5)

Causes:
  1. Private zone not bound to the VM's VPC
  2. VM using public DNS (8.8.8.8) instead of the VPC resolver

Note: a bound private zone that lacks the api.example.com record returns NXDOMAIN, not the external IP.

Solution:
  gcloud dns managed-zones describe api-internal --format="value(privateVisibilityConfig.networks[].networkUrl)"
  # Should show your VPC

  gcloud dns record-sets list --zone=api-internal --filter="name:api.example.com"
  # Should show 10.0.1.5
  
  # Fix if using public DNS
  # In the VM's /etc/resolv.conf, ensure the nameserver is the VPC resolver (169.254.169.254)
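
The same check can be automated from an internal VM. The sketch below classifies the resolved address with the standard `ipaddress` module; `check_view` is a hypothetical helper, and the demo injects a stub resolver so it runs without DNS access.

```python
import ipaddress
import socket
from typing import Callable, Tuple


def is_internal_answer(ip: str) -> bool:
    """True when the address is in a private range such as 10.0.0.0/8."""
    return ipaddress.ip_address(ip).is_private


def check_view(
    name: str,
    expect_internal: bool,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[str, bool]:
    """Resolve name and report whether the answer matches the expected view."""
    ip = resolver(name)
    return ip, is_internal_answer(ip) == expect_internal


# Stub resolver simulating the broken case: an internal VM receiving the public IP.
ip, ok = check_view("api.example.com", True, resolver=lambda _name: "35.201.100.50")
print(ip, "OK" if ok else "WRONG VIEW")
```

Run from an internal VM without the `resolver` argument, the function uses the system resolver, so it also catches a VM that was pointed at a public DNS server.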

Issue 2: External Clients Cannot Resolve

Debugging:
  From external: dig api.example.com @8.8.8.8
  → NXDOMAIN (not found)

Causes:
  1. Public zone doesn't have api.example.com record
  2. Zone not delegated: NS records missing at the registrar (or in the parent example.com zone)
  3. Zone propagation not complete

Solution:
  gcloud dns record-sets list --zone=api-public
  # Verify api.example.com exists
  
  gcloud dns managed-zones describe api-public --format="value(nameServers)"
  # Get nameservers, verify at registrar (GoDaddy, etc.)
  
  dig api.example.com @ns-cloud-a1.googledomains.com
  # Test one nameserver directly; substitute a name from the describe output above

Issue 3: TTL Mismatch Causing Stale IPs

Symptom: After a record change, some clients still see the old IP

Solution:
  1. Monitor with: watch -n 1 'dig +short api.example.com'
  2. If stale: either lower the TTL or have clients clear their caches
  3. For future changes: lower the TTL ahead of time and wait for the old TTL to expire
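
To see how long a cached answer will persist, the remaining TTL can be read from `dig +noall +answer` output, which prints one record per line as name, TTL, class, type, and data. The parser below is a minimal sketch, and the sample line is illustrative rather than captured from a real query.

```python
from typing import Dict, Union

# Illustrative answer line in the dig answer-section layout.
SAMPLE = "api.example.com.\t287\tIN\tA\t35.201.100.50"


def parse_answer(line: str) -> Dict[str, Union[str, int]]:
    """Split one dig answer line into its five whitespace-separated fields."""
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return {"name": name, "ttl": int(ttl), "class": rclass, "type": rtype, "data": rdata}


record = parse_answer(SAMPLE)
print(f"{record['name']} -> {record['data']} (cache expires in {record['ttl']}s)")
```

When the query goes through a caching resolver, the TTL counts down between repeated queries, so a value well below the zone's configured TTL indicates a cached answer.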

GKE-Specific Split-Horizon

GKE services automatically get DNS entries:

Service "api" in namespace "default":
  Internal DNS: api.default.svc.cluster.local (10.4.0.50)
  
Exposed via Google Cloud LB:
  Public DNS: api.example.com (35.201.100.50)

Split-horizon automatically handled by kube-dns/CoreDNS:
  ✓ Internal pods → api.default.svc.cluster.local
  ✓ External users → api.example.com

Production Checklist

  • [ ] Public zone created and delegated (NS records set at the registrar or in the parent zone)
  • [ ] Private zone created and bound to correct VPC(s)
  • [ ] Both zones have matching A/AAAA records for main services
  • [ ] TTL same across zones (recommended 300s)
  • [ ] Documentation clarifies which records in which zone
  • [ ] Terraform manages both zones in the same configuration
  • [ ] Monitoring alerts if the internal or external answer deviates from its expected IP
  • [ ] Regular audit: gcloud dns record-sets list for both zones
  • [ ] Failover tested (manually update record, verify propagation)
  • [ ] Team trained: how to troubleshoot split-horizon
