Multi-Cluster DNS: Cloud Service Directory Patterns

Why This Matters

Multi-cluster deployments need cross-cluster service discovery: pods in Cluster A must be able to resolve services running in Clusters B, C, D, and so on.

Scenarios:

Scenario 1: Regional HA
  Cluster 1 (us-central1): api-service
  Cluster 2 (us-east1): api-service (standby)
  
  Clients: api.multi.svc → resolves to healthy cluster

Scenario 2: Gradual Migration
  Old Cluster: All services running
  New Cluster: New services only
  
  Need: Communicate between clusters transparently

Scenario 3: Multi-Cloud
  GKE cluster (GCP)
  EKS cluster (AWS)
  
  Need: DNS resolution across providers

Kubernetes Multi-Cluster Service Discovery

ServiceExport/ServiceImport Pattern

yaml
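# Cluster A (exporting cluster): export the existing api-service Service.
# On GKE a ServiceExport has no spec; ports come from the Service of the same name.
apiVersion: net.gke.io/v1
kind: ServiceExport
metadata:
  name: api-service
  namespace: default

---
# Cluster B (consuming cluster): the multi-cluster Services controller creates
# the matching ServiceImport automatically. It is shown for illustration only;
# do not apply it by hand.
apiVersion: net.gke.io/v1
kind: ServiceImport
metadata:
  name: api-service
  namespace: default
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP
  ips:
  - 10.0.0.50  # ClusterSet virtual IP (example value)

# Pods in Cluster B can now resolve:
# api-service.default.svc.clusterset.local
# → routes to the api-service backends exported from Cluster A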

GCP Cloud Service Directory

Concept

Cloud Service Directory is a centralized service registry for multi-cluster and multi-cloud environments.

Service Registry (Cloud Service Directory):
└── Namespace: production
    ├── Service: api-backend
    │   ├── Endpoint 1 (10.4.0.50:8080) - us-central1 cluster
    │   └── Endpoint 2 (10.5.0.50:8080) - us-east1 cluster
    ├── Service: database
    │   ├── Endpoint 1 (10.0.3.50:5432) - on-premises
    │   └── Endpoint 2 (10.1.3.50:5432) - backup
    └── Service: cache
        └── Endpoint 1 (memcached.internal:11211)

DNS names (served through a Cloud DNS private zone backed by the namespace; the zone suffix shown here is illustrative):
  api-backend.production.servicedirectory.cloud.goog
  database.production.servicedirectory.cloud.goog
  cache.production.servicedirectory.cloud.goog

Setup: Terraform

hcl
# Create namespace
resource "google_service_directory_namespace" "production" {
  provider     = google
  namespace_id = "production"
  location     = "us-central1"

  # The namespace resource takes labels rather than a description argument.
  labels = {
    purpose = "production-services-registry"
  }
}

# Create service
resource "google_service_directory_service" "api" {
  provider   = google
  service_id = "api-backend"
  namespace  = google_service_directory_namespace.production.id
}

# Add cluster endpoints
resource "google_service_directory_endpoint" "api_uc1" {
  provider      = google
  endpoint_id   = "cluster-uc1"
  service       = google_service_directory_service.api.id
  address       = "10.4.0.50"
  port          = 8080
  
  metadata = {
    region = "us-central1"
    cluster = "prod-cluster-1"
  }
}

resource "google_service_directory_endpoint" "api_ue1" {
  provider      = google
  endpoint_id   = "cluster-ue1"
  service       = google_service_directory_service.api.id
  address       = "10.5.0.50"
  port          = 8080
  
  metadata = {
    region = "us-east1"
    cluster = "prod-cluster-2"
  }
}

DNS Resolution Flow

Cloud Service Directory DNS

Pod Query: nslookup api-backend.production.servicedirectory.cloud.goog

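Step 1: Pod resolver (the cluster DNS, e.g. CoreDNS or kube-dns)
Step 2: Check local zones → Not found
Step 3: Forward to Cloud DNS
Step 4: Cloud DNS looks up the Service Directory namespace backing the zone
Step 5: Returns the registered endpoint addresses
  → 10.4.0.50 (us-central1)
  → 10.5.0.50 (us-east1)
  (A records carry only IPs; the port, 8080 here, must be known to the client)
Step 6: Client picks an address, typically rotating through the answers (round-robin)

Client: connects to 10.4.0.50 or 10.5.0.50, depending on its load-balancing policy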

Health Checks & Failover

Health Checking

bash
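# Manual health registration
# gcloud has no --health-status flag for endpoints. One convention (an
# assumption here, not a platform feature) is to record health as an
# annotation that your health checker keeps up to date:
gcloud service-directory endpoints update ep1 \
  --service=api-backend \
  --namespace=production \
  --location=us-central1 \
  --annotations=health=HEALTHY

# Automatic health checking
# Run a health checker alongside the registry. To remove a failing endpoint
# from DNS responses, the checker deletes it (or clients filter on the annotation).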

Failover Example

Initial state:
  api-backend endpoints:
    ├─ 10.4.0.50 (us-central1) - HEALTHY
    └─ 10.5.0.50 (us-east1) - HEALTHY

Query: api-backend.production.servicedirectory.cloud.goog
  → Returns both endpoints

Cluster 1 failure (T+0):
  ├─ 10.4.0.50 - UNHEALTHY (health check fails)
  └─ 10.5.0.50 - HEALTHY

Query at T+15 (once the health checker has flagged or deregistered the endpoint):
  → Returns only 10.5.0.50 (failover is transparent to clients that re-resolve)

Cluster 1 recovers (T+45):
  ├─ 10.4.0.50 - HEALTHY (health checks pass)
  └─ 10.5.0.50 - HEALTHY

Query at T+60:
  → Returns both endpoints again
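
The timeline above can be modelled in a few lines of Python. This is a sketch under one stated assumption: a health transition becomes visible to queries 15 seconds after it happens (the `propagation_delay` parameter, a hypothetical name). The event list and the function `answers_at` are illustrative, not part of any Google Cloud API.

```python
from typing import Dict, List, Tuple

# (time in seconds, address, healthy?) transitions taken from the timeline.
EVENTS: List[Tuple[int, str, bool]] = [
    (0, "10.4.0.50", False),  # Cluster 1 fails
    (45, "10.4.0.50", True),  # Cluster 1 recovers
]


def answers_at(
    t: int,
    addresses: List[str],
    events: List[Tuple[int, str, bool]],
    propagation_delay: int = 15,
) -> List[str]:
    """Return the addresses a query at time t would receive.

    A transition only takes effect propagation_delay seconds after it
    occurs (assumed: health check interval plus registry update).
    """
    healthy: Dict[str, bool] = {addr: True for addr in addresses}
    for when, addr, is_healthy in sorted(events):
        if when + propagation_delay <= t:
            healthy[addr] = is_healthy
    return [addr for addr in addresses if healthy[addr]]
```

Note that a query between T+0 and T+15 still receives the failed address, which is why client-side retry matters during that window.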

Multi-Region Pattern

Active-Active Pattern

gcloud service-directory services create api \
  --namespace=production --location=us-central1

# Add endpoints from 3 regions
gcloud service-directory endpoints create uc1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.4.0.50
gcloud service-directory endpoints create ue1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.5.0.50
gcloud service-directory endpoints create ew1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.1.0.50

# DNS returns all 3 addresses
# Each region handles roughly a third of traffic if clients pick uniformly

Active-Standby Pattern

# Active endpoint (primary)
gcloud service-directory endpoints create active \
  --service=api --namespace=production --location=us-central1 \
  --address=10.4.0.50 \
  --annotations=weight=100,active=true

# Standby endpoint (backup)
gcloud service-directory endpoints create standby \
  --service=api --namespace=production --location=us-central1 \
  --address=10.5.0.50 \
  --annotations=weight=0,active=false

# Service Directory stores these values but does not act on them.
# Client code handles the weight:
# Send 100% of traffic to the weight=100 endpoint
# If it fails, fall back to the weight=0 endpoint
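
The client-side rule described in the comments above might look like the following Python sketch. It assumes the client has already fetched the endpoints together with their `weight` values; the `Endpoint` class, `pick_active_standby`, and the injected `is_reachable` probe are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Endpoint:
    address: str
    annotations: Dict[str, str] = field(default_factory=dict)

    @property
    def weight(self) -> int:
        # Missing or malformed weights are treated as 0 (standby).
        try:
            return int(self.annotations.get("weight", "0"))
        except ValueError:
            return 0


def pick_active_standby(
    endpoints: List[Endpoint],
    is_reachable: Callable[[str], bool],
) -> str:
    """Return the reachable endpoint with the highest weight."""
    for ep in sorted(endpoints, key=lambda e: e.weight, reverse=True):
        if is_reachable(ep.address):
            return ep.address
    raise ConnectionError("no reachable endpoint")
```

Because the probe is passed in, the same function works with a TCP connect check, an HTTP health request, or a stub in tests.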

Hybrid & Multi-Cloud

Hybrid Setup

On-Premises:
  ├─ Database: 192.168.1.10:5432
  └─ Cache: 192.168.1.20:6379

GCP:
  ├─ API: 10.4.0.50:8080
  └─ Frontend: 35.201.100.50

Service Directory Registry:
  ├─ api-backend
  │  └─ 10.4.0.50 (GCP)
  ├─ database
  │  └─ 192.168.1.10 (on-prem)
  └─ cache
     └─ 192.168.1.20 (on-prem)

Result: Seamless resolution across environments

Multi-Cloud (GCP + AWS)

GCP Cluster:
  └─ Namespace: production
     └─ Service: api-backend
        └─ Endpoint: 10.4.0.50:8080 (GKE)

AWS Cluster:
  └─ Pod queries: api-backend.production.servicedirectory.cloud.goog
     Requires: VPN or peering to GCP, plus access to Cloud Service Directory
     (for example, DNS forwarding into Cloud DNS)

Result:
  AWS pods can resolve GCP services transparently
  Single service registry across clouds

Operational Patterns

Pattern 1: Canary Deployments

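api-backend endpoints:
  Stable: 10.4.0.50 (99 replicas)
  Canary: 10.4.0.51 (1 replica)

DNS query returns both IPs
Client load balancing:
  Uniform random over the two IPs would send 50% of traffic to the canary
  Weighted 99:1 (weights read by the client): about 1% of traffic to canary (1/100)
  Result: The canary share is set by client-side weights, not by DNS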
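
A 99:1 split between two addresses needs weighted selection on the client, since picking uniformly between two IPs gives an even split. The Python sketch below shows one way to do that with the standard library. The function name `pick_weighted` and the weight values are assumptions chosen to match the replica counts in the example.

```python
import random
from typing import Dict, Optional


def pick_weighted(weights: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Pick one address with probability proportional to its weight."""
    if rng is None:
        rng = random.Random()
    # Endpoints with weight 0 never receive traffic.
    addresses = [addr for addr, w in weights.items() if w > 0]
    if not addresses:
        raise ValueError("no endpoint with positive weight")
    return rng.choices(
        addresses, weights=[weights[a] for a in addresses], k=1
    )[0]


# Weights mirroring the example: 99 parts stable, 1 part canary.
CANARY_WEIGHTS = {"10.4.0.50": 99, "10.4.0.51": 1}
```

Over many requests, `pick_weighted(CANARY_WEIGHTS)` returns the canary address about 1% of the time.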

Pattern 2: Gradual Traffic Shift

Initial state:
  api-backend.production:
    ├─ Old endpoint: 10.0.0.50 (weight=80)
    └─ New endpoint: 10.1.0.50 (weight=20)

Monitor metrics for 1 hour

Shift traffic:
  ├─ Old endpoint: 10.0.0.50 (weight=50)
  └─ New endpoint: 10.1.0.50 (weight=50)

If issues appear, roll back:
  ├─ Old endpoint: 10.0.0.50 (weight=100)
  └─ New endpoint: 10.1.0.50 (weight=0)
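
The shift and rollback steps above are simple arithmetic on the two weights. Here is a small Python sketch, assuming weights are kept as a plain mapping from address to integer and written back to the registry by some other process. The helper names `shift` and `rollback` are hypothetical.

```python
from typing import Dict


def shift(weights: Dict[str, int], old: str, new: str, step: int) -> Dict[str, int]:
    """Move up to `step` weight points from old to new; the total stays constant."""
    moved = min(step, weights[old])
    return {**weights, old: weights[old] - moved, new: weights[new] + moved}


def rollback(weights: Dict[str, int], old: str, new: str) -> Dict[str, int]:
    """Send all weight back to the old endpoint."""
    total = weights[old] + weights[new]
    return {**weights, old: total, new: 0}
```

Both helpers return a new mapping instead of mutating the input, so the previous state remains available if a step needs to be undone.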

Pattern 3: Geographic Load Balancing

Service: api-backend
Endpoints by region:
  ├─ us-central1: 10.4.0.50
  ├─ us-east1: 10.5.0.50
  ├─ eu-west1: 10.1.0.50
  └─ asia-southeast1: 10.2.0.50

Client resolution (varies by location):
  From us-central1: Prefers 10.4.0.50 (local)
  From eu-west1: Prefers 10.1.0.50 (local)
  
(Requires geo-aware routing in the client; DNS does not do this automatically.)
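
Since the preference has to be implemented in the client, a sketch of that logic may help. The Python below orders endpoints so that the client's own region comes first. Treating regions that share a prefix (such as `us-`) as "nearby" is an assumption made for illustration, not a latency guarantee, and `ordered_for` is a hypothetical helper name.

```python
from typing import Dict, List

# Region-to-address mapping taken from the example above.
ENDPOINTS: Dict[str, str] = {
    "us-central1": "10.4.0.50",
    "us-east1": "10.5.0.50",
    "eu-west1": "10.1.0.50",
    "asia-southeast1": "10.2.0.50",
}


def ordered_for(client_region: str, endpoints: Dict[str, str]) -> List[str]:
    """Return addresses ordered: same region, same region prefix, then the rest."""

    def rank(region: str) -> int:
        if region == client_region:
            return 0
        if region.split("-")[0] == client_region.split("-")[0]:
            return 1
        return 2

    regions = sorted(endpoints, key=lambda r: (rank(r), r))
    return [endpoints[r] for r in regions]
```

A client would try the addresses in the returned order, which keeps traffic local while still leaving remote regions available as fallbacks.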

Troubleshooting

Issue 1: Cross-Cluster Service Not Resolving

bash
# Debug:
# 1. Check the Service Directory namespace
gcloud service-directory namespaces describe production \
  --location=us-central1

# 2. Check that the service is registered
gcloud service-directory services describe api \
  --namespace=production --location=us-central1

# 3. Check endpoints
gcloud service-directory endpoints list \
  --service=api --namespace=production --location=us-central1

# 4. Test DNS resolution
nslookup api.production.servicedirectory.cloud.goog

# 5. Check IAM permissions
gcloud projects get-iam-policy PROJECT_ID

Issue 2: Some Endpoints Not Returned

bash
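# Debug:
# 1. Check that the endpoint is still registered and inspect its annotations
gcloud service-directory endpoints describe ep1 \
  --service=api --namespace=production --location=us-central1

# 2. If it is missing or flagged unhealthy, check whether a health checker
#    is configured and what it last wrote to the endpoint annotations

# 3. Manually restore the health annotation
#    (gcloud has no --health-status flag; the annotation key is a convention)
gcloud service-directory endpoints update ep1 \
  --service=api --namespace=production --location=us-central1 \
  --annotations=health=HEALTHY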

Issue 3: Slow Endpoint Discovery

bash
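# Health check interval: ~10 seconds is a typical default; the real value depends on your health checker
# If an endpoint goes down: expect roughly 10-15 seconds before it leaves DNS answers, plus any TTL held in client caches

For faster failover:
  1. Use a shorter health check interval (if configurable)
  2. Implement client-side retry (fall back to the next endpoint immediately)
  3. Keep locally cached DNS results short-lived (refresh before the TTL expires)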
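
Item 2 in the list is the quickest of the three to apply, because it does not wait for the registry or DNS to catch up. Here is a minimal Python sketch, assuming the client already holds the list of resolved addresses. The name `connect_with_fallback` is hypothetical, and the `connect` parameter defaults to the standard library's `socket.create_connection` but can be swapped for a stub.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple


def connect_with_fallback(
    addresses: Iterable[str],
    port: int,
    timeout: float = 1.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
) -> Tuple[str, socket.socket]:
    """Try each address in order; return the first that accepts a TCP connection."""
    last_error: Optional[Exception] = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout=timeout)
        except OSError as exc:
            # Remember the failure and move on to the next endpoint at once.
            last_error = exc
    raise ConnectionError(f"all endpoints failed: {last_error}")
```

A short `timeout` keeps the fallback fast: with the 1-second value assumed here, a dead endpoint costs at most about one second before the next one is tried.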

Monitoring

bash
# Monitor Service Directory
gcloud monitoring dashboards create \
  --config='{ "displayName": "Multi-Cluster DNS", ...}'

# Alert on endpoint failures
# The filter below is a placeholder: point it at whatever metric your
# health checker emits for endpoint status.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Endpoint unhealthy" \
  --condition-display-name="Endpoint unhealthy" \
  --condition-filter='HEALTH_METRIC_FILTER' \
  --if="> 0" \
  --duration=60s

Best Practices

  1. Use Cloud Service Directory (centralized registry)
  2. Implement health checks (automatic failover)
  3. Add metadata (region, version, weight)
  4. Monitor endpoint health (alert on failures)
  5. Test failover (before production)
  6. Document service topology (which endpoints serve which service)
  7. Use weights for traffic control (canary, gradual shift)
  8. Automate endpoint registration (via Terraform/Kubernetes)
