Multi-Cluster DNS: Cloud Service Directory Patterns
Why This Matters
Multi-cluster deployments need cross-cluster service discovery: Pods in Cluster A must be able to resolve services running in Cluster B, C, D, and so on.
Scenarios:
Scenario 1: Regional HA
Cluster 1 (us-central1): api-service
Cluster 2 (us-east1): api-service (standby)
Clients: api.multi.svc → resolves to healthy cluster
Scenario 2: Gradual Migration
Old Cluster: All services running
New Cluster: New services only
Need: Communicate between clusters transparently
Scenario 3: Multi-Cloud
GKE cluster (GCP)
Managed cluster (AWS via EKS)
Need: DNS resolution across providers
Kubernetes Multi-Cluster Service Discovery
ServiceExport/ServiceImport Pattern
yaml
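# Cluster A (source): export the existing Service named api-service.
# A ServiceExport carries no port list; ports come from the Service itself.
apiVersion: net.gke.io/v1
kind: ServiceExport
metadata:
  name: api-service
  namespace: default
---
# Cluster B (consumer): on GKE the multi-cluster Services controller normally
# creates the matching ServiceImport automatically; it is shown for illustration.
apiVersion: net.gke.io/v1
kind: ServiceImport
metadata:
  name: api-service
  namespace: default
spec:
  type: ClusterSetIP
  ports:
  - port: 8080
    protocol: TCP
  ips:
  - "10.0.0.50"  # ClusterSet IP (example value)
# A Pod in Cluster B can now resolve:
#   api-service.default.svc.clusterset.local
# → Routes to the exported backends in Cluster A
GCP Cloud Service Directory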
Concept
Cloud Service Directory = centralized service registry for multi-cluster/multi-cloud.
Service Registry (Cloud Service Directory):
└── Namespace: production
    ├── Service: api-backend
    │   ├── Endpoint 1 (10.4.0.50:8080) - us-central1 cluster
    │   └── Endpoint 2 (10.5.0.50:8080) - us-east1 cluster
    ├── Service: database
    │   ├── Endpoint 1 (10.0.3.50:5432) - on-premises
    │   └── Endpoint 2 (10.1.3.50:5432) - backup
    └── Service: cache
        └── Endpoint 1 (memcached.internal:11211)
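The hierarchy above maps naturally onto a nested lookup structure. The following is a minimal in-memory sketch, not the Service Directory API: the `Endpoint` class, the `REGISTRY` dict, and the `lookup` function are hypothetical names used only to show how a namespace, a service, and its endpoints relate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One registered address for a service (illustrative model only)."""
    address: str
    port: int
    note: str = ""


# namespace -> service -> endpoints, mirroring part of the tree above
REGISTRY = {
    "production": {
        "api-backend": [
            Endpoint("10.4.0.50", 8080, "us-central1 cluster"),
            Endpoint("10.5.0.50", 8080, "us-east1 cluster"),
        ],
        "database": [
            Endpoint("10.0.3.50", 5432, "on-premises"),
            Endpoint("10.1.3.50", 5432, "backup"),
        ],
    }
}


def lookup(namespace: str, service: str) -> list[Endpoint]:
    """Return the endpoints registered for a service, or an empty list."""
    return REGISTRY.get(namespace, {}).get(service, [])


if __name__ == "__main__":
    for ep in lookup("production", "api-backend"):
        print(f"{ep.address}:{ep.port} ({ep.note})")
```

A lookup for an unknown namespace or service returns an empty list rather than raising, which is one reasonable way to represent a name that does not resolve.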
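DNS names (served by a Cloud DNS zone backed by this namespace; the names below assume the zone's DNS name is production.servicedirectory.cloud.goog):
api-backend.production.servicedirectory.cloud.goog
database.production.servicedirectory.cloud.goog
cache.production.servicedirectory.cloud.goog
Setup: Terraform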
hcl
# Create namespace (the production services registry).
# The namespace resource has no description argument, so the purpose is noted here.
resource "google_service_directory_namespace" "production" {
  provider     = google
  namespace_id = "production"
  location     = "us-central1"
}
# Create service
resource "google_service_directory_service" "api" {
  provider   = google
  service_id = "api-backend"
  namespace  = google_service_directory_namespace.production.id
}
# Add cluster endpoints, one per cluster
resource "google_service_directory_endpoint" "api_uc1" {
  provider    = google
  endpoint_id = "cluster-uc1"
  service     = google_service_directory_service.api.id
  address     = "10.4.0.50"
  port        = 8080
  metadata = {
    region  = "us-central1"
    cluster = "prod-cluster-1"
  }
}
resource "google_service_directory_endpoint" "api_ue1" {
  provider    = google
  endpoint_id = "cluster-ue1"
  service     = google_service_directory_service.api.id
  address     = "10.5.0.50"
  port        = 8080
  metadata = {
    region  = "us-east1"
    cluster = "prod-cluster-2"
  }
}
DNS Resolution Flow
Cloud Service Directory DNS
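Pod query: nslookup api-backend.production.servicedirectory.cloud.goog
Step 1: The Pod's resolver sends the query to CoreDNS
Step 2: CoreDNS checks its local zones → not found
Step 3: CoreDNS forwards the query upstream to Cloud DNS
Step 4: Cloud DNS looks the name up in the Service Directory registry
Step 5: Cloud DNS returns the addresses of the endpoints currently registered
  → 10.4.0.50 (us-central1)
  → 10.5.0.50 (us-east1)
  (DNS A records carry addresses only; port 8080 comes from client configuration)
Step 6: Result: the client spreads requests across the returned addresses, typically round-robin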
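From the client's side, the steps above reduce to "resolve the name, then rotate through the answers". A minimal sketch using only the Python standard library follows; `resolve_all` and `round_robin` are hypothetical helper names, and `localhost` stands in for the Service Directory name so the example can run outside a cluster.

```python
import itertools
import socket


def resolve_all(hostname: str, port: int) -> list[str]:
    """Return the distinct IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry ends with a sockaddr tuple whose first field is the IP address.
    return sorted({info[4][0] for info in infos})


def round_robin(addresses: list[str]):
    """Return an iterator that cycles through the addresses, one per request."""
    return itertools.cycle(addresses)


if __name__ == "__main__":
    # Inside a cluster this would be the Service Directory name from the query above.
    addresses = resolve_all("localhost", 8080)
    picker = round_robin(addresses)
    for _ in range(4):
        print(next(picker))
```

Rotation happens in the client here; many resolvers and HTTP libraries simply use the first address returned, so round-robin behavior should be verified rather than assumed.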
Client: connects to 10.4.0.50 or 10.5.0.50, depending on its load-balancing policy
Health Checks & Failover
Health Checking
bash
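# Manual health status registration.
# Service Directory is a registry and does not probe endpoints itself, so this
# assumes an external health checker that records its verdict as an annotation.
# The key name "health" is an assumption, not a built-in field; older and beta
# CLI releases call this flag --metadata.
gcloud service-directory endpoints update ep1 \
  --service=api-backend \
  --namespace=production \
  --location=us-central1 \
  --annotations=health=HEALTHY
# Automatic failover:
# Cloud DNS serves whatever is registered, so the health checker must delete a
# failing endpoint (and re-create it on recovery) to remove it from DNS responses.
Failover Example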
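The timeline in this section can be reproduced with a small filter over endpoint health. This is a minimal sketch under one assumption: some health checker keeps a per-address status up to date, and DNS answers contain only the addresses marked healthy. `dns_answer` and the `health` dict are hypothetical names, not part of any Google Cloud API.

```python
def dns_answer(health: dict[str, bool]) -> list[str]:
    """Return only the addresses currently marked healthy, in sorted order."""
    return sorted(addr for addr, ok in health.items() if ok)


if __name__ == "__main__":
    health = {"10.4.0.50": True, "10.5.0.50": True}
    print("initial:", dns_answer(health))

    # Cluster 1 fails and the health checker marks its endpoint unhealthy.
    health["10.4.0.50"] = False
    print("after failure:", dns_answer(health))

    # Cluster 1 recovers and its health checks pass again.
    health["10.4.0.50"] = True
    print("after recovery:", dns_answer(health))
```

The delay between a real failure and the changed answer is the health check interval plus any DNS caching, which the sketch does not model.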
Initial state:
  api-backend endpoints:
  ├─ 10.4.0.50 (us-central1) - HEALTHY
  └─ 10.5.0.50 (us-east1) - HEALTHY
  Query: api-backend.production.servicedirectory.cloud.goog
  → Returns both endpoints
Cluster 1 failure (T+0s):
  ├─ 10.4.0.50 - UNHEALTHY (health check fails)
  └─ 10.5.0.50 - HEALTHY
Query at T+15s (after one health check interval):
  → Returns only 10.5.0.50 (failover is transparent to clients)
Cluster 1 recovers (T+45s):
  ├─ 10.4.0.50 - HEALTHY (health checks pass)
  └─ 10.5.0.50 - HEALTHY
Query at T+60s:
  → Returns both endpoints again
Multi-Region Pattern
Active-Active Pattern
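gcloud service-directory services create api \
  --namespace=production --location=us-central1
# Add one endpoint per region (3 regions)
gcloud service-directory endpoints create uc1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.4.0.50 --port=8080
gcloud service-directory endpoints create ue1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.5.0.50 --port=8080
gcloud service-directory endpoints create ew1 \
  --service=api --namespace=production --location=us-central1 \
  --address=10.1.0.50 --port=8080
# DNS returns all 3 addresses
# If clients spread requests evenly, each region handles roughly a third of the traffic
Active-Standby Pattern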
# Active endpoint (primary)
gcloud service-directory endpoints create active \
  --service=api --namespace=production --location=us-central1 \
  --address=10.4.0.50 \
  --annotations="weight=100,active=true"
# Standby endpoint (backup)
gcloud service-directory endpoints create standby \
  --service=api --namespace=production --location=us-central1 \
  --address=10.5.0.50 \
  --annotations="weight=0,active=false"
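Service Directory stores the weight and active values as opaque key/value metadata, so the two endpoints registered above only behave as active and standby if the client interprets those values. A minimal sketch of that client-side logic follows; `pick_endpoint`, the `ENDPOINTS` list, and the `is_reachable` callback are hypothetical, and a real client would plug in its own reachability test.

```python
def pick_endpoint(endpoints: list[dict], is_reachable) -> str:
    """Return the address of the highest-weight endpoint that is reachable."""
    # Sort by weight, descending, so weight=100 (active) is tried before weight=0 (standby).
    ordered = sorted(endpoints, key=lambda e: int(e["metadata"]["weight"]), reverse=True)
    for ep in ordered:
        if is_reachable(ep["address"]):
            return ep["address"]
    raise RuntimeError("no reachable endpoint")


# Metadata values are strings, as they would be when read back from a registry.
ENDPOINTS = [
    {"address": "10.4.0.50", "metadata": {"weight": "100", "active": "true"}},
    {"address": "10.5.0.50", "metadata": {"weight": "0", "active": "false"}},
]

if __name__ == "__main__":
    # Both endpoints up: the active one is chosen.
    print(pick_endpoint(ENDPOINTS, lambda addr: True))
    # Active endpoint down: the client falls back to the standby.
    print(pick_endpoint(ENDPOINTS, lambda addr: addr != "10.4.0.50"))
```

Ordering by weight rather than checking the active flag keeps the sketch usable with more than two tiers, at the cost of ignoring the flag entirely.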
# Client code interprets the weight:
# Send 100% of traffic to the weight=100 endpoint
# If it fails, fall back to the weight=0 endpoint
Hybrid & Multi-Cloud
Hybrid Setup
On-Premises:
├─ Database: 192.168.1.10:5432
└─ Cache: 192.168.1.20:6379
GCP:
├─ API: 10.4.0.50:8080
└─ Frontend: 35.201.100.50
Service Directory Registry:
├─ api-backend
│  └─ 10.4.0.50 (GCP)
├─ database
│  └─ 192.168.1.10 (on-prem)
└─ cache
   └─ 192.168.1.20 (on-prem)
Result: Seamless resolution across environments
Multi-Cloud (GCP + AWS)
GCP Cluster:
└─ Namespace: production
   └─ Service: api-backend
      └─ Endpoint: 10.4.0.50:8080 (GKE)
AWS Cluster:
└─ Pod queries: api-backend.production.servicedirectory.cloud.goog
   (requires: VPN or peering to GCP, DNS forwarding from AWS into Cloud DNS, and Cloud Service Directory access)
Result:
  An AWS pod can resolve the GCP service transparently
  Single service registry across clouds
Operational Patterns
Pattern 1: Canary Deployments
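api-backend endpoints:
  Stable: 10.4.0.50 (99 replicas)
  Canary: 10.4.0.51 (1 replica)
DNS query returns both IPs
Client load balancing:
  A uniform random choice between the two IPs would send ~50% of traffic to the canary
  Weighting each IP by its replica count (99:1) sends ~1% of traffic to the canary (1/100)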
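Because DNS returns one address for the stable pool and one for the canary, a client that picks uniformly between them would send about half of its traffic to the canary. Reaching a 1% share requires weighting the choice, for example by replica count. A minimal sketch follows; `pick_weighted` and the `REPLICAS` mapping are hypothetical names, and the weights are assumed to be known to the client (for instance from endpoint metadata).

```python
import random


def pick_weighted(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one address at random, with probability proportional to its weight."""
    addresses = list(weights)
    return rng.choices(addresses, weights=[weights[a] for a in addresses], k=1)[0]


# Replica counts from the example: 99 stable replicas, 1 canary replica.
REPLICAS = {"10.4.0.50": 99, "10.4.0.51": 1}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same picks
    picks = [pick_weighted(REPLICAS, rng) for _ in range(100_000)]
    share = picks.count("10.4.0.51") / len(picks)
    print(f"canary share: {share:.2%}")
```

The same helper works for any weight split, such as 80/20 or 50/50, since only the ratio between the weights matters.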
Result: Automatic canary distribution
Pattern 2: Gradual Traffic Shift
Initial split (weights are endpoint metadata interpreted by the client):
  api-backend.production:
  ├─ Old endpoint: 10.0.0.50 (weight=80)
  └─ New endpoint: 10.1.0.50 (weight=20)
Monitor metrics for 1 hour
Shift traffic:
  ├─ Old endpoint: 10.0.0.50 (weight=50)
  └─ New endpoint: 10.1.0.50 (weight=50)
If issues: Rollback
  ├─ Old endpoint: 10.0.0.50 (weight=100)
  └─ New endpoint: 10.1.0.50 (weight=0)
Pattern 3: Geographic Load Balancing
Service: api-backend
Endpoints by region:
  ├─ us-central1: 10.4.0.50
  ├─ us-east1: 10.5.0.50
  ├─ eu-west1: 10.1.0.50
  └─ asia-southeast1: 10.2.0.50
Client resolution (varies by location):
  From us-central1: Prefers 10.4.0.50 (local)
  From eu-west1: Prefers 10.1.0.50 (local)
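Every client receives the same list of addresses, so the regional preference shown above has to be applied by the client. A minimal sketch follows, assuming the client knows its own region and each endpoint's region (for example from endpoint metadata); `order_by_locality` and the `ENDPOINTS` mapping are hypothetical names.

```python
def order_by_locality(endpoints: dict[str, str], client_region: str) -> list[str]:
    """Return all addresses, with the client's own region first."""
    local = [addr for region, addr in endpoints.items() if region == client_region]
    remote = [addr for region, addr in endpoints.items() if region != client_region]
    # Remote endpoints stay in the list so the client can still fail over to them.
    return local + remote


ENDPOINTS = {
    "us-central1": "10.4.0.50",
    "us-east1": "10.5.0.50",
    "eu-west1": "10.1.0.50",
    "asia-southeast1": "10.2.0.50",
}

if __name__ == "__main__":
    for region in ("us-central1", "eu-west1"):
        print(region, "->", order_by_locality(ENDPOINTS, region))
```

A client in a region with no local endpoint simply gets the full list in its original order; ranking remote regions by distance or latency is left out of this sketch.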
(Requires: Geo-aware client routing, not automatic DNS)
Troubleshooting
Issue 1: Cross-Cluster Service Not Resolving
bash
# Debug:
# 1. Check that the Service Directory namespace exists
gcloud service-directory namespaces describe production \
  --location=us-central1
# 2. Check that the service is registered
gcloud service-directory services describe api \
  --namespace=production --location=us-central1
# 3. Check the endpoints
gcloud service-directory endpoints list \
  --service=api --namespace=production --location=us-central1
# 4. Test DNS resolution
nslookup api.production.servicedirectory.cloud.goog
# 5. Check IAM permissions
gcloud projects get-iam-policy PROJECT_ID
Issue 2: Some Endpoints Not Returned
bash
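# Debug:
# 1. Check the endpoint's recorded health status
gcloud service-directory endpoints describe ep1 \
  --service=api --namespace=production --location=us-central1
# 2. If UNHEALTHY: check whether a health checker is configured,
#    and inspect the endpoint metadata
# 3. Manually update the health status (stored as an annotation; the key
#    name "health" is an assumption, not a built-in field)
gcloud service-directory endpoints update ep1 \
  --service=api --namespace=production --location=us-central1 \
  --annotations=health=HEALTHY
Issue 3: Slow Endpoint Discovery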
bash
# Health check interval: ~10 seconds (example value; set by your health checker)
# If an endpoint goes down: roughly 10-15 seconds until it is removed from DNS
# For faster failover:
# 1. Use a shorter health check interval (if configurable)
# 2. Implement client-side retry (fall back to another endpoint immediately)
# 3. Cache DNS results locally, but refresh them before the TTL expires
Monitoring
bash
# Monitor Service Directory
gcloud monitoring dashboards create \
  --config='{ "displayName": "Multi-Cluster DNS", ...}'
# Alert on endpoint failures.
# METRIC_FILTER is a placeholder: Service Directory is assumed not to export an
# endpoint-health metric, so filter on the metric your own health checker writes.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Endpoint unhealthy" \
  --condition-display-name="At least one unhealthy endpoint" \
  --condition-filter='METRIC_FILTER' \
  --if="> 0" \
  --duration=60s
Best Practices
- Use Cloud Service Directory (centralized registry)
- Implement health checks (automatic failover)
- Add metadata (region, version, weight)
- Monitor endpoint health (alert on failures)
- Test failover (before production)
- Document service topology (which endpoints serve which service)
- Use weights for traffic control (canary, gradual shift)
- Automate endpoint registration (via Terraform/Kubernetes)