Cloud DNS for GKE: Alternatives & Performance at Scale
Tại sao điều này quan trọng
Đối với GKE Autopilot, Cloud DNS không phải optional—nó mandatory cho tất cả external service discovery. Hiểu cơ chế integration là critical cho production reliability.
Real scenarios:
Scenario 1: Microservices Connecting
Frontend pod: DNS query "api-service.default.svc.cluster.local"
→ Resolved locally (kube-dns)
Frontend pod: DNS query "database.prod.internal"
→ Resolved by Cloud DNS (external to cluster)
Scenario 2: GKE Autopilot
Cannot use custom kube-dns
→ Must use Google-managed DNS
→ Automatically integrated with Cloud DNS
Scenario 3: Multi-Cluster
Frontend cluster: Resolve api.example.com
→ Routes to backend cluster's IP
→ Requires Cloud DNS coordination
Scenario 4: High Query Volume
10,000 pods × 1000s queries/pod = millions of DNS queries/sec
→ Single kube-dns becomes bottleneck
→ Solution: NodeLocal DNSCache + Cloud DNSGKE DNS Architecture (Pre-Autopilot)
Kube-DNS (Legacy, Deprecated)
Pod Resolution Flow:
Pod (10.4.1.5)
↓
/etc/resolv.conf (ndots=5)
↓
Kubelet resolver (127.0.0.1:53 via kube-dns)
↓
Kube-DNS Pod (10.4.0.10, kube-system namespace)
↓
Responds with:
- Internal: api.default.svc.cluster.local → 10.4.0.50
- External: example.com → Cloud DNS → 35.201.100.50Kube-DNS characteristics:
- Single point of failure (Central DNS pod)
- Runs in kube-system namespace (visible to all pods)
- dnsPolicy=ClusterFirst by default (internal first, then upstream)
Why deprecated: Performance issues at scale, replaced by CoreDNS.
CoreDNS (Current Default)
Pod Resolution Flow:
Pod (10.4.2.5)
↓
/etc/resolv.conf (ndots=5)
↓
Kubelet resolver (127.0.0.1:53 via CoreDNS)
↓
CoreDNS Pod (10.4.0.11, kube-system namespace)
↓
Corefile config (plugin-based):
├── kubernetes plugin (internal resolution)
├── forward plugin (upstream Cloud DNS)
└── cache plugin (caching layer)
↓
Responds with:
- Internal: api.default.svc.cluster.local → 10.4.0.50
- External: example.com → Cloud DNS → 35.201.100.50CoreDNS advantages:
- Modular (plugin architecture)
- Better performance (multi-threaded)
- Configurable Corefile
- Faster cache lookups
GKE Autopilot DNS
GKE Autopilot uses managed Cloud DNS integration:
Pod Resolution (GKE Autopilot):
Pod (10.4.3.5)
↓
/etc/resolv.conf
↓
GKE-managed CoreDNS (Google-managed, no visible pods)
↓
Cloud DNS (internal + external)
↓
For internal: Pod IP returned
For external: Cloud DNS queriedAutopilot specifics:
- Cannot customize CoreDNS (managed by Google)
- Cannot disable Cloud DNS integration
- Query routing automated
- Observability via Cloud Monitoring
Cloud DNS Integration Methods
Method 1: Private Zones for GKE Services
bash
# Create private zone for GKE services
gcloud dns managed-zones create gke-services \
--dns-name=k8s.internal.example.com \
--visibility=private \
--networks=projects/PROJECT/global/networks/gke-vpc
# Add service records (normally automated via operator)
gcloud dns record-sets transaction start --zone=gke-services
gcloud dns record-sets transaction add 10.4.0.50 \
--name=api.k8s.internal.example.com \
--type=A \
--ttl=60 \
--zone=gke-services
gcloud dns record-sets transaction add 10.4.0.51 \
--name=database.k8s.internal.example.com \
--type=A \
--ttl=60 \
--zone=gke-services
gcloud dns record-sets transaction execute --zone=gke-services
# From GKE pod:
# nslookup api.k8s.internal.example.com
# → Resolves to 10.4.0.50 via Cloud DNSMethod 2: Cloud Service Directory (Multi-Cluster)
bash
# Enable Service Directory in project
gcloud services enable servicedirectory.googleapis.com
# Create namespace
gcloud service-directory namespaces create my-services \
--location=us-central1
# Create service
gcloud service-directory services create my-api \
--namespace=my-services \
--location=us-central1
# Add endpoint
gcloud service-directory endpoints create ep1 \
--service=my-api \
--namespace=my-services \
--location=us-central1 \
--address=10.4.0.50 \
--port=8080
# Automatic DNS registration: my-api.my-services.servicedirectory.cloud.googNetworking Details
DNS Query Path in GKE
Pod Query: curl api.default.svc.cluster.local
Step 1: /etc/resolv.conf parsing
nameserver 127.0.0.1
search api.default.svc.cluster.local svc.cluster.local cluster.local
Step 2: ndots=5 (default)
Query as-is (FQDN): api.default.svc.cluster.local
→ 4 dots → Try as-is first
Step 3: CoreDNS processes query
Corefile rules:
├── .:53 (default zone)
│ ├── kubernetes cluster.local (internal)
│ ├── cache 30 (caching)
│ └── forward . /etc/resolv.conf (upstream)
Step 4: kubernetes plugin processes
Query: api.default.svc.cluster.local
→ Check Kubernetes API
→ Service "api" in namespace "default" exists
→ Return ClusterIP: 10.4.0.50
Result: Pod gets 10.4.0.50, connects directlyExternal Resolution
Pod Query: curl example.com
Step 1: /etc/resolv.conf parsing
search api.default.svc.cluster.local ...
Step 2: Try with search domains first
api.default.svc.cluster.local.example.com → NXDOMAIN
api.default.svc.cluster.local → NXDOMAIN
example.com (no search suffix) → Try as-is
Step 3: CoreDNS processes
Query: example.com
→ kubernetes plugin: Not internal
→ forward plugin: Send to upstream
→ Upstream: Cloud DNS (configured in corefile)
Step 4: Cloud DNS resolves
example.com → 35.201.100.50
Result: Pod gets 35.201.100.50Corefile Configuration
Default Corefile (GKE)
corefile
.:53 {
cache 30
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
loop
reload
loadbalance
}Plugins explained:
cache 30: Cache DNS responses for 30 secondskubernetes: Internal service discoveryforward .: Upstream resolver (Cloud DNS)max_concurrent 1000: Rate limiting
Custom Corefile (Standard GKE, Not Autopilot)
bash
# Create ConfigMap with custom Corefile
cat <<EOF > corefile.txt
.:53 {
cache 30
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
# Add custom forwarding
rewrite name regex (.*)\.internal\.company\.com {1}.internal.gcp.company.com
# Add caching for internal domain
cache 60 internal.company.com
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
loop
reload
loadbalance
}
EOF
# Apply ConfigMap
kubectl create configmap coredns-custom \
--from-file=internal.override=corefile.txt \
-n kube-systemGKE DNS Policies
DNS Policy Types
yaml
apiVersion: v1
kind: Pod
metadata:
name: example
spec:
dnsPolicy: ClusterFirst # Default (cluster DNS first, then upstream)
---
dnsPolicy: ClusterFirstWithHostNet # Use cluster DNS with hostNetwork
hostNetwork: true
---
dnsPolicy: Default # Node's DNS resolver (no cluster DNS)
---
dnsPolicy: None # Use custom dnsConfig below
dnsConfig:
nameservers:
- 8.8.8.8
- 1.1.1.1
searches:
- my-domain.comWhen to Use Each
| Policy | Use Case |
|---|---|
| ClusterFirst (Default) | Pod-to-pod communication, services |
| ClusterFirstWithHostNet | DaemonSets that need cluster DNS |
| Default | Bypass cluster DNS (rare, special cases) |
| None | Full control, custom DNS (edge cases) |
Performance at Scale
Challenge: DNS Bottleneck
GKE Cluster:
- 10,000 pods
- Each pod averages 100 DNS queries/min
- Total: 1,666 queries/sec
Single CoreDNS pod:
- Can handle ~5,000-10,000 QPS (depends on hardware)
- Becomes bottleneck
Symptoms:
- DNS timeouts
- Connection refused (cannot resolve service)
- Increased pod restart rate (liveness probes timing out)Solution 1: CoreDNS Replicas
bash
# Scale CoreDNS to 3 replicas (HA)
kubectl scale deployment -n kube-system coredns --replicas=3
# Or via kubectl patch
kubectl patch deployment coredns -n kube-system -p '{"spec":{"replicas":3}}'
# Verify
kubectl get pods -n kube-system -l k8s-app=kube-dnsSolution 2: NodeLocal DNSCache (Recommended)
Deploy node-local caching resolver (separate topic, covered in 08.nodelocal-dnscache.md).
bash
# Enable NodeLocal DNSCache
gcloud container clusters create my-cluster \
--enable-ip-alias \
--addons=HttpLoadBalancing,HorizontalPodAutoscaling,NodeLocalDNS
# Or update existing
gcloud container clusters update my-cluster \
--enable-ip-alias \
--addons=NodeLocalDNS \
--zone=us-central1-aEffect:
Before: 10,000 pods × 100 queries/min = 1,666 QPS to CoreDNS
After: Local cache hits ~95% → Only ~83 QPS to CoreDNS (95x reduction)External Service Discovery
Pattern 1: Kubernetes Ingress → Cloud DNS
yaml
# Service exposed via Google Cloud LB
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
name: backend-config
spec:
sessionAffinity:
affinityType: "CLIENT_IP"
---
# Update Cloud DNS with service IP
apiVersion: v1
kind: Service
metadata:
name: api-service
annotations:
cloud.google.com/backend-config: '{"external": "backend-config"}'
spec:
type: LoadBalancer
selector:
app: api
ports:
- port: 80
targetPort: 8080After creation, update Cloud DNS with LB IP.
Pattern 2: ExternalDNS Operator
Automatically sync Kubernetes Ingress/Service with Cloud DNS:
bash
# Install ExternalDNS
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns/
helm install external-dns external-dns/external-dns \
--set google.project=PROJECT_ID \
--set policy=sync \
--set registry=txt \
--set txtOwnerId=my-cluster
# Now any Ingress automatically updates Cloud DNS
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-ingress
annotations:
external-dns.alpha.kubernetes.io/hostname: api.example.com
spec:
ingressClassName: gce
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-service
port:
number: 80
EOF
# ExternalDNS automatically creates Cloud DNS record
# api.example.com → Load Balancer IPTroubleshooting
Issue 1: DNS Timeouts in Pods
bash
# From inside pod
nslookup kubernetes.default
# If hangs or times out
Debug:
1. Check CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
2. Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
3. Check Corefile config
kubectl get cm -n kube-system coredns -o yaml
4. Test CoreDNS directly
kubectl exec -it -n kube-system POD_NAME -- nslookup kubernetes.default
5. Check resource limits
kubectl top pods -n kube-system -l k8s-app=kube-dnsIssue 2: Pod Cannot Resolve External Service
bash
# From inside pod
nslookup example.com
# If NXDOMAIN
Debug:
1. Check node /etc/resolv.conf
kubectl debug node/NODE_NAME -it --image=ubuntu
cat /etc/resolv.conf
2. Check forward plugin config
kubectl get cm -n kube-system coredns -o yaml | grep forward
3. Test from CoreDNS pod
kubectl exec -it -n kube-system POD_NAME -- \
nslookup example.com 8.8.8.8
4. Check firewall allows egress port 53
gcloud compute firewall-rules list --filter="direction:EGRESS"Issue 3: High DNS Query Latency
bash
# Monitor DNS latency
kubectl top pods -n kube-system -l k8s-app=kube-dns
# If CPU high
1. Scale CoreDNS: kubectl scale deployment coredns -n kube-system --replicas=5
2. Enable NodeLocal DNSCache
3. Tune Corefile cache TTL
# If memory high
1. Reduce cache size in Corefile
2. Check for DNS queries to non-existent domains (search domains issue)Best Practices
- Use NodeLocal DNSCache for production (massive performance boost)
- Scale CoreDNS to at least 2-3 replicas (HA)
- Monitor DNS latency continuously
- Use private zones for internal services
- Document Corefile changes (keep comments in ConfigMap)
- Test DNS before deployment (test script for pods)
- Alert on DNS failures (integration with monitoring)
- Regular backups of Corefile ConfigMap