GKE Managed Control Plane Model — Standard vs Autopilot

Why This Matters in Production

When you create a GKE cluster, you are not only choosing machine types; you are choosing an entire operating model. What Google manages, what you manage, and which constraints you accept all depend on this decision.

A common mistake: many teams assume "Standard cluster = you manage everything" and "Autopilot = fully managed". The reality is more nuanced. For example:

  • Standard cluster: You manage node pools, node scaling, and OS patches, but Google still manages the control plane, and you have no direct access to the API server binary
  • Autopilot: Google manages node pools, scaling, and security policy, but you still have to configure workload resources correctly, otherwise Pods can be rejected

Understanding this boundary determines:

  • Cost model (Spot VMs, reservations, committed use discounts)
  • Upgrade timeline (your control vs Google's control)
  • Feature availability (some advanced features are available in only one mode)
  • Troubleshooting approach (the debug surface area differs)

The Control Plane Is a Managed Service: What That Actually Means

Before diving into Standard vs Autopilot, one fundamental point: in GKE, the control plane is ALWAYS a managed service. Google manages:

  • Availability: Regional clusters replicate the control plane across zones. Zonal clusters run a single control plane replica in one zone, so the API server can be briefly unavailable during upgrades or a zone outage
  • Updates: Control plane patches are applied on a rolling basis, transparently
  • Monitoring: Google monitors API server health, etcd consistency, and scheduler performance
  • Scaling: Control plane components scale automatically (there is no "node-less control plane" concept; the scaling is simply not visible to you)

What you do not manage:

  • You cannot SSH into control plane nodes
  • You cannot tune etcd parameters directly
  • You cannot install custom admission webhooks inside the control plane
  • You cannot modify API server flags (only a limited set of options is exposed at cluster creation)

Standard Cluster Model

Definition

A Standard cluster is the model in which Google manages the control plane and you manage the node pools in full.

Google Manages (Control Plane)

Component | Details
API Server | Deployed, scaled, and kept highly available by Google
etcd | Replicated backend, backups, disaster recovery
Scheduler | Runs on the control plane; no configuration needed
Controller Manager | Runs the set of built-in controllers
Updates | Automatic patches, monthly release cadence
Monitoring | Google monitors CPU, memory, latency

You Manage (Data Plane)

Component | Details
Node Pools | Creation, scaling, machine types
Node OS | Container-Optimized OS (COS) versions and patches (automatic by default)
Security | Node-level security policies, workload permissions
Network | VPC configuration, firewall rules
Storage | PersistentVolume provisioning, volumes
Add-ons | DNS, logging, monitoring agent configuration

Production Patterns in Standard

Regional (Multi-Zone) HA Cluster

bash
# A Standard cluster is a good fit when you need flexibility,
# for example custom node pools per workload type.

# Regional cluster: --num-nodes is per zone, so 3 means 3 nodes in each zone.
gcloud container clusters create my-cluster \
  --region us-central1 \
  --num-nodes 3 \
  --machine-type n2-standard-4

# Add a specialized pool later. It starts empty and scales up on demand;
# in a regional cluster --min-nodes/--max-nodes also apply per zone.
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --region us-central1 \
  --machine-type a2-highgpu-1g \
  --num-nodes 0 \
  --enable-autoscaling \
  --min-nodes 0 --max-nodes 10

Benefits:

  • Node pools configured for the exact need (GPU, high-memory, etc.)
  • A separate autoscaling policy per pool
  • Committed use discount and reservation tuning

Tradeoff: You have to monitor node health, patch windows, and OS issues

Cluster Autoscaler + HPA

In a Standard cluster, autoscaling has two layers:

  • Cluster Autoscaler (CA): adds nodes when Pods are pending and removes nodes that are underutilized
  • Horizontal Pod Autoscaler (HPA): scales replicas based on metrics
yaml
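# Deploy an application with HPA + CA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3  # initial count; the HPA adjusts it afterwards
  selector:            # required: must match the Pod template labels
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp   # placeholder image name
        resources:
          requests:    # the HPA computes utilization against these requests
            cpu: 500m
            memory: 256Mi

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70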

Behavior:

  • When average CPU utilization exceeds 70% of the requested CPU, the HPA adds replicas
  • If no node has room for the new Pods, the CA adds nodes
  • Conversely, when traffic drops, the HPA scales down and the CA removes nodes (after ~10 minutes of being underutilized)
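
The two layers can be put into numbers. The following is a minimal Python sketch, not GKE or Kubernetes code: `desired_replicas` applies the replica formula documented for the HPA (ceiling of current replicas times the current-to-target metric ratio, with the default 10% tolerance), and `nodes_needed` is a deliberately rough, CPU-only estimate of what the Cluster Autoscaler would have to provide. The function names and the 3920m allocatable CPU per node are assumptions chosen for illustration; the real autoscaler simulates scheduling rather than dividing CPU.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count the HPA aims for: ceil(current * current_util / target_util),
    clamped to [min_replicas, max_replicas]."""
    ratio = current_util / target_util
    # The HPA skips scaling when the ratio is within the tolerance of 1.0.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def nodes_needed(replicas, cpu_request_m, node_allocatable_m):
    """Rough CPU-only node count (assumption: CPU is the only constraint)."""
    pods_per_node = node_allocatable_m // cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single Pod does not fit on this node size")
    return math.ceil(replicas / pods_per_node)


# 3 replicas at 90% average CPU against the 70% target from the manifest.
replicas = desired_replicas(3, 90, 70, min_replicas=3, max_replicas=100)
print(replicas)                            # 4
# 500m per Pod on nodes with an assumed 3920m of allocatable CPU.
print(nodes_needed(replicas, 500, 3920))   # 1
```

Under these assumptions, the CA only has to add nodes once the replica count grows past what the existing nodes can hold, which is why the two layers react at different speeds.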

Autopilot Cluster Model

Definition

An Autopilot cluster is the model in which Google manages both the control plane AND the node infrastructure, and you manage only the workloads.

Google Manages (Control Plane + Infrastructure)

Component | Details
Control Plane | Fully managed, same as Standard
Node Pools | Automated creation, scaling, optimization
Node Selection | Automatic machine type selection based on workload
OS & Patches | Fully automated node upgrades
Security | Pod Security Standards enforced, RBAC built-in
Networking | VPC, firewall, DNS configuration
Logging & Monitoring | Built-in, opinionated stack

You Manage (Workloads Only)

Component | Details
Pod Definitions | spec, containers, resources
Deployments, Services | Application configuration
IAM | Who can access the cluster
Namespaces | Logical organization

Constraints You Need to Understand

1. Resource Ratio Enforcement

Autopilot runs a resource validator on every Pod submission. The CPU:memory ratio must fit one of the preset profiles.

yaml
# ❌ OUT OF RATIO: CPU too small for the memory (adjusted or rejected at admission)
apiVersion: v1
kind: Pod
metadata:
  name: imbalanced
spec:
  containers:
  - name: app
    image: myapp
    resources:
      requests:
        cpu: 100m      # too small!
        memory: 4Gi    # at 1 CPU : 8 GB, 4Gi of memory needs at least 500m CPU

---
# ✅ ACCEPTED
apiVersion: v1
kind: Pod
metadata:
  name: balanced
spec:
  containers:
  - name: app
    image: myapp
    resources:
      requests:
        cpu: 500m      # 1:4 ratio, accepted
        memory: 2Gi

Ratio rules (simplified; exact ranges depend on the compute class and GKE version):

  • Balanced: 1 CPU : 3.5 - 4 GB memory
  • Scale-out: 1 CPU : 8 GB memory (for the web tier)
  • Performance: 1 CPU : 1 GB memory (for latency-sensitive workloads)
  • Memory-optimized: 1 CPU : 16 GB memory

If a Pod spec does not fit any profile, Autopilot will:

  1. Try to auto-adjust the requests (mutating webhook)
  2. Reject the Pod if it cannot adjust
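
To make the auto-adjust step concrete, here is a small Python sketch of the arithmetic only. It is not Autopilot's admission logic: the profile table copies the upper bounds from the simplified list above, and the names `MAX_GIB_PER_VCPU`, `min_cpu_millicores`, and `adjust_request` are hypothetical. It models step 1 (raising CPU to fit the ratio) and does not attempt to model when a Pod is rejected.

```python
import math

# Maximum GiB of memory per vCPU, taken from the simplified list above.
# Illustrative figures only, not the limits GKE actually enforces.
MAX_GIB_PER_VCPU = {
    "balanced": 4,
    "scale-out": 8,
    "performance": 1,
    "memory-optimized": 16,
}


def min_cpu_millicores(memory_mib, profile):
    """Smallest CPU request (millicores) that keeps memory within the profile ratio."""
    max_gib = MAX_GIB_PER_VCPU[profile]
    return math.ceil(memory_mib * 1000 / (max_gib * 1024))


def adjust_request(cpu_m, memory_mib, profile):
    """Raise CPU if it is too small for the memory; returns (cpu_m, memory_mib, changed)."""
    adjusted_cpu = max(cpu_m, min_cpu_millicores(memory_mib, profile))
    return adjusted_cpu, memory_mib, adjusted_cpu != cpu_m


# The "imbalanced" Pod: 100m CPU with 4Gi memory, checked against 1 CPU : 8 GB.
print(adjust_request(100, 4096, "scale-out"))   # (500, 4096, True)
# The "balanced" Pod: 500m CPU with 2Gi memory already fits 1 CPU : 4 GB.
print(adjust_request(500, 2048, "balanced"))    # (500, 2048, False)
```

Running the same check in CI before deploying would surface manifests whose requests are likely to be changed at admission, so that the billed resources do not come as a surprise.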

2. Privileged Workload Restrictions

Autopilot has an opinionated security posture:

yaml
# ❌ REJECTED: privileged container
apiVersion: v1
kind: Pod
metadata:
  name: privileged-app
spec:
  containers:
  - name: privileged-app
    image: myapp
    securityContext:
      privileged: true  # not allowed on Autopilot

---
# ✅ ACCEPTED: baseline security
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: myapp
    securityContext:
      runAsNonRoot: true
      readOnlyRootFilesystem: true

Exception: Some partner workloads (database engines, service meshes) are allowlisted by Google. If you need privileged mode, you have to request approval from Google.

3. Node Pool Abstraction

In Autopilot, "node pools" are a virtual concept:

bash
# In Autopilot, node pools are a Google-managed resource
gcloud container node-pools list --cluster=my-autopilot-cluster

# Output (illustrative):
# default-pool (managed by Google)
# system-pool (for system components, managed by Google)

Many teams try to create custom node pools in Autopilot:

bash
# ❌ NOT POSSIBLE: Autopilot controls node pool creation
gcloud container node-pools create custom-pool \
  --cluster=my-autopilot-cluster  # ERROR

Workaround: use ComputeClasses to control the hardware profile:

yaml
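apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    cloud.google.com/compute-class: "Accelerator"      # GPU node
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # example GPU type (an assumption); pick one available in your region
  containers:
  - name: ml-job
    image: ml-framework:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # GPUs must be set under limits; a request without a limit is invalid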

4. Network Constraints

Autopilot enforces certain networking rules:

  • Only container-native load balancing is supported (Pod IPs as NEG endpoints)
  • hostPort is a restricted feature (it must be enabled explicitly)
  • DaemonSets run only on worker nodes, not on system nodes

Direct Comparison: Standard vs Autopilot

Aspect | Standard | Autopilot
Control Plane | Managed | Managed
Node Pools | Manual create/configure | Automated, opinionated
Node Selection | You can specify the machine type | Automatic, validated ratio
OS Updates | Configurable window | Automated by Google
Security | Flexible (as needed) | Hardened by default
Resource Constraints | Flexible | Strict ratio enforcement
Privileged Workloads | Full support | Limited/approved only
Scaling | Granular control | Simplified, automatic
Cost Transparency | Clear per node | Aggregate, per Pod
Learning Curve | Steeper | Gentler
Operational Toil | Higher | Lower

Production Anti-Patterns

Anti-Pattern 1: Choosing Autopilot Because of the "Fully Managed" Misconception

Mistake: "Autopilot means Google manages everything, zero ops overhead"

Reality: Autopilot manages only the infrastructure. Workload reliability, scaling strategy, cost optimization, and disaster recovery remain your responsibility.

Solution: Treat Autopilot as opinionated infrastructure, not a magic bullet. You still need:

  • Load testing & capacity planning
  • Cost monitoring
  • Incident response practices
  • Backup strategies

Anti-Pattern 2: Pushing Strict Resource Limits onto Autopilot

Mistake: "Autopilot enforces the ratio, so I can run at 100% resource utilization"

Reality: Autopilot validation is an admission check, not runtime enforcement. Pods can still be OOM-killed or CPU-throttled if actual usage spikes.

Solution: Set requests conservatively and keep headroom:

yaml
# Conservative approach
requests:
  cpu: 250m    # 250m : 1Gi is a 1:4 ratio, within the Balanced profile
  memory: 1Gi  # room for spikes
limits:
  cpu: 500m
  memory: 2Gi

Anti-Pattern 3: Avoiding Standard "Because Autopilot Is Simpler"

Mistake: Choosing Autopilot even though the workload needs flexibility

Reality:

  • Autopilot does not offer every feature that is available in Standard
  • Some use cases (GPU clusters, mixed-architecture deployments) fit Standard better
  • Standard gives granular control for specialized needs

Solution: Choose based on workload characteristics:

  • Choose Autopilot if: web/API service, standard compute, no special OS needs
  • Choose Standard if: GPU/TPU, custom kernel, specialized networking, mixed architectures
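
The selection rule above is simple enough to encode as a checklist. This Python sketch is one possible way to do that, assuming workloads are described by free-form trait labels; the trait names and `recommend_mode` are hypothetical and are not GKE API values. It returns the reasons along with the recommendation so the decision stays reviewable.

```python
# Traits that the guidance above maps to Standard. The names are hypothetical
# labels for this sketch, not GKE API values.
STANDARD_TRAITS = {
    "gpu_or_tpu",
    "custom_kernel",
    "specialized_networking",
    "mixed_architectures",
}


def recommend_mode(workload_traits):
    """Return (mode, reasons): Standard if any Standard-only trait is present."""
    reasons = sorted(STANDARD_TRAITS & set(workload_traits))
    mode = "Standard" if reasons else "Autopilot"
    return mode, reasons


print(recommend_mode({"web_api"}))                    # ('Autopilot', [])
print(recommend_mode({"web_api", "custom_kernel"}))   # ('Standard', ['custom_kernel'])
```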

Implications for Later Chapters

This model (Standard vs Autopilot) affects the following chapters:

  • Chapter 6 (Node Lifecycle): Node repairs and upgrades differ per model
  • Chapter 8 (Scheduler): Scheduling constraints depend on the node pool model
  • Chapter 9 (Autoscaling): Autopilot autoscaling is fully automated; Standard requires setup
  • Chapter 12 (Control Plane Scalability): Scaling patterns depend on the node model

Summary

  • The GKE control plane is ALWAYS managed by Google; it is not a trade-off between Standard and Autopilot
  • Standard gives flexibility: you manage the node pools in full, which suits specialized workloads
  • Autopilot gives simplicity: opinionated defaults plus resource validation, which suits standard web/API deployments
  • The choice is not binary: you can run a hybrid (some clusters Standard, some Autopilot) according to workload needs
  • Production success = choosing the model that matches workload characteristics, not the one that "looks easier"