GKE Managed Control Plane Model: Standard vs Autopilot
Why This Matters in Production
When you create a GKE cluster, you are not just choosing machine types; you are choosing an entire operating model. What Google manages, what you manage, and which constraints you accept all depend on this decision.
A common mistake: many teams assume "Standard cluster = you manage everything" and "Autopilot = fully managed". Reality is more nuanced. For example:
- Standard cluster: you manage node pools, node scaling, and OS patching, but Google still manages the control plane, and you have no direct access to the API server binary
- Autopilot: Google manages node pools, scaling, and security policy, but you must still configure workload resources correctly, otherwise Pods are rejected
Understanding this boundary determines:
- Cost model (Spot VMs, committed use discounts)
- Upgrade timeline (what you control vs what Google controls)
- Feature availability (some advanced features are available in only one mode)
- Troubleshooting approach (the debug surface area differs)
The Control Plane Is a Managed Service: What That Actually Means
Before diving into Standard vs Autopilot, one fundamental point: in GKE, the control plane is ALWAYS a managed service. Google manages:
- Availability: regional clusters replicate the control plane across zones; zonal clusters run a single control plane replica, so the API server can be briefly unavailable during control plane upgrades
- Updates: control plane patches are applied by Google on a rolling basis, with no action required from you
- Monitoring: Google monitors API server health, etcd consistency, and scheduler performance
- Scaling: control plane components are scaled automatically, and this scaling is not visible to you
What you do not manage:
- You cannot SSH into control plane nodes
- You cannot tune etcd parameters directly
- You cannot install custom admission plugins inside the control plane (admission webhooks run as workloads on your own nodes)
- You cannot modify API server flags (only a limited set of options is exposed at cluster creation)
Standard Cluster Model
Definition
A Standard cluster is the model in which Google manages the control plane and you fully manage the node pools.
Google Manages (Control Plane)
| Component | Details |
|---|---|
| API Server | Deployed, scaled, and kept highly available by Google |
| etcd | Replicated backend, backups, disaster recovery |
| Scheduler | Runs on the control plane; no configuration required |
| Controller Manager | Runs the built-in controllers |
| Updates | Automatic patches and upgrades, following the cluster's release channel |
| Monitoring | Google monitors CPU, memory, and latency |
You Manage (Data Plane)
| Component | Details |
|---|---|
| Node Pools | Creation, scaling, machine types |
| Node OS | Container-Optimized OS (COS) versions and patches (automatic by default) |
| Security | Node-level security policies, workload permissions |
| Network | VPC configuration, firewall rules |
| Storage | PersistentVolume provisioning, volumes |
| Add-ons | DNS, logging, and monitoring agent configuration |
Production Patterns in Standard
Regional HA Cluster

# Standard is a good fit when you need flexibility,
# for example custom node pools per workload type.
# In a regional cluster, --num-nodes is per zone (3 zones by default, so 9 nodes here).
gcloud container clusters create my-cluster \
  --region us-central1 \
  --num-nodes 3 \
  --machine-type n2-standard-4
# Then add a specialized pool later.
# --min-nodes and --max-nodes are also per-zone limits.
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --region us-central1 \
  --machine-type a2-highgpu-1g \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 --max-nodes 10

Benefits:
- Node pools configured for the exact need (GPU, high-memory, etc.)
- A separate autoscaling policy per pool
- Committed use discounts can be planned per machine family
Tradeoff: you are responsible for monitoring node health, maintenance windows, and OS issues.
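In a regional cluster, `--num-nodes`, `--min-nodes`, and `--max-nodes` apply per zone, which is easy to misread when estimating capacity and cost. The Python sketch below works through that arithmetic for the two commands above. It assumes each pool spans three zones (the regional default); `NodePoolSizing` and `zone_count` are illustrative names, not part of any GKE API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodePoolSizing:
    """Per-zone sizing flags of a node pool in a regional cluster (illustrative model)."""
    num_nodes: int       # --num-nodes (per zone)
    min_nodes: int       # --min-nodes (per zone)
    max_nodes: int       # --max-nodes (per zone)
    zone_count: int = 3  # assumption: the pool spans three zones

    def initial_total(self) -> int:
        """Nodes created at pool creation time across all zones."""
        return self.num_nodes * self.zone_count

    def autoscaling_range(self) -> Tuple[int, int]:
        """Total (min, max) node count the autoscaler can reach across all zones."""
        return (self.min_nodes * self.zone_count, self.max_nodes * self.zone_count)


# Default pool from `clusters create`: 3 nodes per zone, no autoscaling configured.
default_pool = NodePoolSizing(num_nodes=3, min_nodes=3, max_nodes=3)
# GPU pool: starts empty, autoscaler may grow it to 10 nodes per zone.
gpu_pool = NodePoolSizing(num_nodes=0, min_nodes=0, max_nodes=10)

print(default_pool.initial_total())  # 9
print(gpu_pool.autoscaling_range())  # (0, 30)
```

If the GPU machine type is offered in fewer zones of the region, the totals shrink accordingly, so pass the real zone count rather than relying on the default.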
Cluster Autoscaler + HPA
In a Standard cluster, autoscaling has two layers:
- Cluster Autoscaler (CA): adds nodes when Pods are pending and removes nodes that are underutilized
- Horizontal Pod Autoscaler (HPA): scales replicas based on metrics

# Deploy an application with HPA + CA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3  # initial count; the HPA takes over once it is active
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp  # placeholder image name
        resources:
          requests:
            cpu: 500m      # HPA utilization is measured against this request
            memory: 256Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Behavior:
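- When average CPU utilization exceeds 70% of the requested CPU, the HPA adds replicas
- If no node has room for the new Pods, the CA adds nodes
- Conversely, when traffic drops, the HPA scales down and the CA removes nodes (after roughly 10 minutes of being unneeded)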
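The HPA's decision follows the formula documented for Kubernetes: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds, with no change when the ratio is within a tolerance (0.1 by default). The Python sketch below reproduces that arithmetic with the values from the manifest above. It is a simplification: the real controller also handles stabilization windows, unready Pods, and missing metrics. The function name is illustrative.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single utilization metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Values from the manifest: target 70%, minReplicas 3, maxReplicas 100.
print(desired_replicas(3, 140, 70, 3, 100))   # 6  (utilization is double the target)
print(desired_replicas(10, 75, 70, 3, 100))   # 10 (within tolerance, no change)
print(desired_replicas(4, 35, 70, 3, 100))    # 3  (would be 2, clamped to minReplicas)
```

When the new replicas do not fit on existing nodes they stay Pending, and that is the signal the Cluster Autoscaler reacts to.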
Autopilot Cluster Model
Definition
An Autopilot cluster is the model in which Google manages both the control plane AND the node infrastructure, and you manage only the workloads.
Google Manages (Control Plane + Infrastructure)
| Component | Details |
|---|---|
| Control Plane | Fully managed, same as Standard |
| Node Pools | Automated creation, scaling, optimization |
| Node Selection | Automatic machine type selection based on workload |
| OS & Patches | Fully automated node upgrades |
| Security | Pod Security Standards enforced, RBAC built in |
| Networking | VPC, firewall, DNS configuration |
| Logging & Monitoring | Built-in, opinionated stack |
You Manage (Workloads Only)
| Component | Details |
|---|---|
| Pod Definitions | spec, containers, resources |
| Deployments, Services | Application configuration |
| IAM | Who can access the cluster |
| Namespaces | Logical organization |
Constraints You Need to Understand
1. Resource Ratio Enforcement
Autopilot runs a resource validator on every Pod submission. The CPU:memory ratio of the requests must fit one of the preset profiles.

# ❌ WILL NOT BE ACCEPTED AS WRITTEN: CPU request is too small for the memory request
apiVersion: v1
kind: Pod
metadata:
  name: imbalanced
spec:
  containers:
  - name: app
    image: myapp
    resources:
      requests:
        cpu: 100m    # too small
        memory: 4Gi  # 40 GiB per vCPU, far outside the allowed ratio
---
# ✅ ACCEPTED
apiVersion: v1
kind: Pod
metadata:
  name: balanced
spec:
  containers:
  - name: app
    image: myapp
    resources:
      requests:
        cpu: 500m    # 4 GiB per vCPU, an accepted ratio
        memory: 2Gi

Ratio rules (simplified):
- Balanced: 1 vCPU : 3.5 to 4 GiB memory
- Scale-out: 1 vCPU : 8 GiB memory (for web tiers)
- Performance: 1 vCPU : 1 GiB memory (for latency-sensitive workloads)
- Memory-optimized: 1 vCPU : 16 GiB memory
These figures are approximate; the exact allowed ranges depend on the compute class and GKE version.
If a Pod spec does not fit any profile, Autopilot will:
- Try to auto-adjust the requests (mutating webhook)
- Reject the Pod if adjustment is not possible
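The ratio check comes down to simple arithmetic. The Python sketch below computes the memory-per-vCPU ratio of a request and the smallest CPU request that would bring it back into range. The allowed range is a parameter; the 3.5 to 4 GiB per vCPU used in the example is taken from the simplified Balanced rule above and is an assumption for illustration, not Autopilot's actual validation logic. All function names are hypothetical.

```python
import math


def gib_per_vcpu(cpu_millicores: int, memory_mib: int) -> float:
    """Memory (GiB) requested per requested vCPU."""
    return (memory_mib * 1000) / (cpu_millicores * 1024)


def fits_ratio(cpu_millicores: int, memory_mib: int,
               min_gib_per_vcpu: float, max_gib_per_vcpu: float) -> bool:
    """True when the request falls inside the allowed memory-per-vCPU range."""
    ratio = gib_per_vcpu(cpu_millicores, memory_mib)
    return min_gib_per_vcpu <= ratio <= max_gib_per_vcpu


def min_cpu_for_memory(memory_mib: int, max_gib_per_vcpu: float) -> int:
    """Smallest CPU request (millicores) that keeps memory within the maximum ratio."""
    return math.ceil(memory_mib * 1000 / (1024 * max_gib_per_vcpu))


# Assumed range, taken from the simplified "Balanced" rule: 3.5 to 4 GiB per vCPU.
BALANCED = (3.5, 4.0)

print(gib_per_vcpu(100, 4096))                # 40.0  ("imbalanced" Pod: 100m / 4Gi)
print(fits_ratio(100, 4096, *BALANCED))       # False
print(fits_ratio(500, 2048, *BALANCED))       # True  ("balanced" Pod: 500m / 2Gi)
print(min_cpu_for_memory(4096, BALANCED[1]))  # 1000  (millicores needed for 4Gi)
```

An auto-adjusting webhook would conceptually do what `min_cpu_for_memory` does: raise the smaller request until the ratio fits, which also raises what you are billed for.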
2. Privileged Workload Restrictions
Autopilot has an opinionated security posture:

# ❌ WILL BE REJECTED: privileged container
apiVersion: v1
kind: Pod
metadata:
  name: privileged-app
spec:
  containers:
  - name: privileged-app
    image: myapp  # placeholder image name
    securityContext:
      privileged: true  # not allowed
---
# ✅ ACCEPTED: baseline security
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp  # placeholder image name
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true

Exception: some partner workloads (database engines, service meshes) are allow-listed by Google. If you need privileged mode, you must go through Google's approval process.
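A rejected privileged Pod is cheaper to catch before it reaches the cluster. The Python sketch below is a local pre-flight check that walks a Pod manifest (already parsed into a dict, for example from YAML) and lists the containers that set `privileged: true`. It inspects only that one field and is not a reimplementation of Autopilot's admission policy; `find_privileged_containers` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def find_privileged_containers(pod: Dict[str, Any]) -> List[str]:
    """Return names of containers in a Pod manifest that set securityContext.privileged."""
    spec = pod.get("spec") or {}
    offenders: List[str] = []
    # Check init containers as well as regular containers.
    for key in ("initContainers", "containers"):
        for container in spec.get(key) or []:
            security_context = container.get("securityContext") or {}
            if security_context.get("privileged") is True:
                offenders.append(container.get("name", "<unnamed>"))
    return offenders


privileged_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {"containers": [
        {"name": "privileged-app", "securityContext": {"privileged": True}},
    ]},
}
baseline_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {"containers": [
        {"name": "app", "securityContext": {"runAsNonRoot": True,
                                            "readOnlyRootFilesystem": True}},
    ]},
}

print(find_privileged_containers(privileged_pod))  # ['privileged-app']
print(find_privileged_containers(baseline_pod))    # []
```

A check like this fits in CI next to manifest linting, so the failure surfaces in review rather than at deploy time.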
3. Node Pool Abstraction
In Autopilot, "node pools" are a virtual concept:

# In Autopilot, node pools are a managed resource
gcloud container node-pools list --cluster=my-autopilot-cluster
# Output (illustrative):
# default-pool (managed by Google)
# system-pool (for system components, managed by Google)

Many teams try to create custom node pools in Autopilot:

# ❌ NOT POSSIBLE: Autopilot controls node pool creation
gcloud container node-pools create custom-pool \
  --cluster=my-autopilot-cluster  # ERROR

Workaround: use ComputeClasses to control the hardware profile:
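
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    cloud.google.com/compute-class: Accelerator  # GPU node
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # example GPU model (assumption); pick one available in your region
  containers:
  - name: ml-job
    image: ml-framework:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # GPUs must be set under limits; the request defaults to the same value

4. Network Constraints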
Autopilot enforces certain networking rules:
- Only container-native load balancing is supported (Pod IPs serve as NEG endpoints)
- hostPort is a restricted feature (it must be enabled explicitly)
- DaemonSets run only on worker nodes, not on system nodes
Direct Comparison: Standard vs Autopilot
| Aspect | Standard | Autopilot |
|---|---|---|
| Control Plane | Managed | Managed |
| Node Pools | Manual create/configure | Automated, opinionated |
| Node Selection | You can specify the machine type | Automatic, with validated resource ratios |
| OS Updates | Configurable window | Fully automated by Google |
| Security | Flexible (configured as needed) | Hardened by default |
| Resource Constraints | Flexible | Strict ratio enforcement |
| Privileged Workloads | Full support | Limited/approved only |
| Scaling | Granular control | Simplified, automatic |
| Cost Transparency | Billed per node, clear per-node cost | Billed per Pod resource requests, aggregate view |
| Learning Curve | Steeper | Gentler |
| Operational Toil | Higher | Lower |
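The cost row is easier to reason about with numbers. In Standard you pay for provisioned nodes whether or not Pods fill them; with Autopilot's Pod-based billing you pay for the resources your Pods request. The Python sketch below compares the two for one workload. All prices are made-up placeholders, not Google Cloud list prices, and the function names are hypothetical; substitute real figures before drawing conclusions.

```python
def standard_hourly_cost(node_count: int, node_price_per_hour: float) -> float:
    """Standard: billed for every provisioned node, regardless of utilization."""
    return node_count * node_price_per_hour


def autopilot_hourly_cost(pod_count: int, vcpu_per_pod: float, gib_per_pod: float,
                          vcpu_price_per_hour: float, gib_price_per_hour: float) -> float:
    """Autopilot (Pod-based billing): billed for the resources the Pods request."""
    per_pod = vcpu_per_pod * vcpu_price_per_hour + gib_per_pod * gib_price_per_hour
    return pod_count * per_pod


# Placeholder prices for illustration only (assumptions, not real pricing).
NODE_PRICE = 0.20   # hypothetical price of one 4-vCPU node per hour
VCPU_PRICE = 0.05   # hypothetical price per requested vCPU-hour
GIB_PRICE = 0.005   # hypothetical price per requested GiB-hour

# 12 Pods requesting 0.5 vCPU / 2 GiB each; in Standard they run on 3 nodes.
standard = standard_hourly_cost(3, NODE_PRICE)
autopilot = autopilot_hourly_cost(12, 0.5, 2.0, VCPU_PRICE, GIB_PRICE)

print(f"Standard:  {standard:.2f}/hour")   # Standard:  0.60/hour
print(f"Autopilot: {autopilot:.2f}/hour")  # Autopilot: 0.42/hour
```

With these placeholder numbers the Standard nodes are half empty (6 of 12 vCPUs requested), which is why the Pod-based bill comes out lower; at high node utilization the comparison can reverse.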
Production Anti-Patterns
Anti-Pattern 1: Choosing Autopilot Under the "Fully Managed" Misconception
Mistake: "Autopilot means Google manages everything, with zero ops overhead"
Reality: Autopilot manages only the infrastructure. Workload reliability, scaling strategy, cost optimization, and disaster recovery remain your responsibility.
Solution: treat Autopilot as opinionated infrastructure, not a magic bullet. You still need:
- Load testing & capacity planning
- Cost monitoring
- Incident response practices
- Backup strategies
Anti-Pattern 2: Pushing Strict Resource Limits onto Autopilot
Mistake: "Autopilot enforces ratios, so I can run at 100% resource utilization"
Reality: Autopilot validation is an admission check, not runtime enforcement. Pods can still be OOM-killed or CPU-throttled if actual usage spikes.
Solution: set requests conservatively and keep headroom:

# Conservative approach
requests:
  cpu: 250m    # 1Gi per 250m is 4 GiB per vCPU
  memory: 1Gi  # sized above typical usage to leave room for spikes
limits:
  cpu: 500m
  memory: 2Gi

Anti-Pattern 3: Avoiding Standard "Because Autopilot Is Simpler"
Mistake: choosing Autopilot even though the workload needs flexibility
Reality:
- Standard exposes capabilities that Autopilot restricts or does not offer
- Some use cases (GPU clusters, mixed-architecture deployments) fit Standard better
- Standard gives granular control for specialized needs
Solution: choose based on workload characteristics:
- Choose Autopilot if: web/API service, standard compute, no special OS needs
- Choose Standard if: GPU/TPU, custom kernel, specialized networking, mixed architectures
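The selection rule above can be written down as a small decision helper. This Python sketch encodes only the criteria listed in this section; the class and field names are hypothetical, and a real decision will weigh more factors (cost, team skills, compliance).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadProfile:
    """Specialized needs named in this section (illustrative model)."""
    needs_gpu_or_tpu: bool = False
    needs_custom_kernel: bool = False
    needs_specialized_networking: bool = False
    needs_mixed_architectures: bool = False


def recommend_mode(profile: WorkloadProfile) -> str:
    """Return "Standard" if any specialized need is present, otherwise "Autopilot"."""
    specialized = (
        profile.needs_gpu_or_tpu
        or profile.needs_custom_kernel
        or profile.needs_specialized_networking
        or profile.needs_mixed_architectures
    )
    return "Standard" if specialized else "Autopilot"


# A plain web/API service versus a workload that needs a custom kernel.
print(recommend_mode(WorkloadProfile()))                          # Autopilot
print(recommend_mode(WorkloadProfile(needs_custom_kernel=True)))  # Standard
```

Because the choice is per cluster, the same helper can be applied workload by workload to decide which ones belong on which cluster.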
GCP Documentation Reference
All technical statements in this section are based on:
- GCP GKE Autopilot Overview
- Autopilot Compute Classes
- GKE Standard vs Autopilot Comparison
- Autopilot Pod Security
Implications for Later Chapters
The Standard vs Autopilot model affects these chapters:
- Chapter 6 (Node Lifecycle): node repairs and upgrades differ per model
- Chapter 8 (Scheduler): scheduling constraints depend on the node pool model
- Chapter 9 (Autoscaling): Autopilot autoscaling is fully automated, Standard requires setup
- Chapter 12 (Control Plane Scalability): scaling patterns depend on the node model
Summary
- The GKE control plane is ALWAYS managed by Google; it is not a trade-off between Standard and Autopilot
- Standard gives flexibility: you fully manage the node pools, which makes it better for specialized workloads
- Autopilot gives simplicity: opinionated defaults plus resource validation, which makes it better for standard web/API deployments
- The choice is not binary: a hybrid setup (some clusters Standard, some Autopilot) can match different workload needs
- Production success means choosing the model that fits the workload's characteristics, not the one that "looks easier"