PRODUCTION-GRADE GCP ENGINEERING HANDBOOK

A Comprehensive System for Platform Engineers & Staff/Principal Cloud Architects

PART I: ARCHITECTURAL FOUNDATIONS & GCP OVERVIEW

Chapter 1: GCP Resource Hierarchy & Resource Organization

Why it matters: Every IAM, billing, network-boundary, and org policy decision depends on understanding the resource hierarchy. A mistake at this layer has the largest possible blast radius and can affect the entire organization.

Chapter 1 Full Index & Learning Paths

Subtopics:

Resource Hierarchy Fundamentals - Organization → Folder → Project → Resource: hierarchy levels, inheritance, override mechanics
Resource Manager API - Programmatic resource management, propagation delay, consistency model, eventual consistency handling
Project Naming & Automation - Project ID constraints, immutability, soft-delete windows, naming automation patterns
IAM Policy Propagation - Three-layer propagation, eventual consistency, caching behavior, testing strategies
Quota Management - Quota types (allocation/rate/concurrent), project vs organization-level, exhaustion scenarios
Labels, Tags & Organization - Labels vs Tags vs Network Tags: usage for billing, firewall, IAM conditions, cost allocation
Cloud Asset Inventory - Query resource state across hierarchy, drift detection, compliance auditing
Resource Protection - Locking, deletion protection, soft-delete recovery, backup strategies
Shared VPC Model - Host project vs service projects, centralized network management, cross-project connectivity
Service Account Scoping - Cross-project access patterns, keys vs tokens, workload identity, impersonation chains
Billing Hierarchy - Cost attribution, billing account structure, chargeback models, budget alerts
Organization Policies - Constraint framework, managed/custom constraints, conditional policies, CEL expressions
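The inheritance rule behind "Resource Hierarchy Fundamentals" can be made concrete with a toy model. The sketch below assumes an in-memory hierarchy; the `Node` class, resource names, roles, and members are hypothetical and this is not the Resource Manager API. It shows that the effective allow policy on a resource is the union of the bindings on the resource and all of its ancestors.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """Hypothetical hierarchy node: organization, folder, project or resource."""
    name: str
    parent: Optional["Node"] = None
    # role -> members granted that role directly on this node
    bindings: Dict[str, Set[str]] = field(default_factory=dict)


def effective_bindings(node: Node) -> Dict[str, Set[str]]:
    """Union the allow-policy bindings of a node and every ancestor.

    Allow policies are additive: a child can add grants but cannot revoke
    one inherited from an ancestor (that requires an IAM deny policy).
    """
    result: Dict[str, Set[str]] = {}
    current: Optional[Node] = node
    while current is not None:
        for role, members in current.bindings.items():
            result.setdefault(role, set()).update(members)
        current = current.parent
    return result


org = Node("organizations/123", bindings={"roles/viewer": {"group:auditors@example.com"}})
folder = Node("folders/456", parent=org, bindings={"roles/editor": {"group:team-a@example.com"}})
project = Node("projects/team-a-prod", parent=folder,
               bindings={"roles/viewer": {"user:alice@example.com"}})

for role, members in sorted(effective_bindings(project).items()):
    print(role, sorted(members))
```

The walk goes leaf to root, which is why a grant made at the organization level shows up on every project beneath it.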

Chapter 2: GCP Physical Network Architecture (Jupiter Fabric & Andromeda)

Why it matters: GCP networking is fundamentally different from on-prem and AWS. Understanding the Jupiter spine-leaf fabric, the Andromeda SDN, and global routing explains latency, failover behavior, and bandwidth allocation.

Chapter 2 Full Index & Learning Paths

Subtopics:

Andromeda: GCP Software-Defined Networking Stack - Control plane vs data plane, 5-step packet processing pipeline, VPC logical overlay, production patterns, anti-patterns
Jupiter Fabric: Spine-Leaf Topology - Physical datacenter topology, per-server bandwidth, ECMP routing, oversubscription implications, zone placement
Google Points of Presence (PoP) - Edge node hierarchy, traffic entry points, PoP failover mechanisms, DDoS scrubbing, anycast routing
GCP Global Backbone: Premium vs Standard Tier - User-centric vs region-centric routing, private fiber backbone, SLA differences, cost tradeoffs
Latency SLA & Fiber Path Engineering - Fiber infrastructure, latency components, multi-path redundancy, inter-region latencies, propagation delays
Anycast Routing with Global Load Balancer - BGP anycast mechanism, automatic geo-routing, single IP multiple locations, failover transparency
Cold Potato vs Hot Potato Routing Strategies - Egress point optimization, cold potato (backbone) vs hot potato (internet), strategic routing decisions
Network Service Tiers: Practical Datapath Implications - Premium vs Standard tier queuing, SLA achievement mechanics, health checking differences
Bandwidth Allocation & Egress Pricing Architecture - Per-zone capacity, bandwidth quotas, egress pricing model, burst allowance mechanics
Regional vs Global Services: Data Sovereignty - Data residency requirements, GDPR/CCPA/HIPAA compliance, regional constraints, multi-region architectures
Traffic Engineering & Multi-path Load Balancing - ECMP routing, capacity planning, failure scenarios, multi-path resilience, cascade failure prevention
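For "Latency SLA & Fiber Path Engineering", a back-of-the-envelope calculation shows the physical floor that no routing choice can beat. This sketch assumes a fiber group index of about 1.468 (a typical single-mode value, not a Google figure) and hypothetical path lengths. It counts propagation delay only, ignoring serialization, queuing, and processing, and real fiber routes are longer than great-circle distance.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_GROUP_INDEX = 1.468        # assumed typical single-mode fiber value


def one_way_propagation_ms(fiber_km: float, group_index: float = FIBER_GROUP_INDEX) -> float:
    """Propagation delay along a fiber path, in milliseconds."""
    speed_km_per_s = C_VACUUM_KM_PER_S / group_index  # roughly 204,000 km/s
    return fiber_km / speed_km_per_s * 1000.0


def rtt_floor_ms(fiber_km: float) -> float:
    """Lower bound on round-trip time over a symmetric fiber path."""
    return 2.0 * one_way_propagation_ms(fiber_km)


for km in (100, 1_000, 10_000):
    print(f"{km:>6} km: one-way {one_way_propagation_ms(km):6.2f} ms, "
          f"RTT floor {rtt_floor_ms(km):6.2f} ms")
```

The rule of thumb that falls out is roughly 4.9 ms one-way per 1,000 km of fiber, which is why inter-region latency is dominated by path length rather than by hop count.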

Chapter 3: GCP VPC Model (Global Virtual Network Architecture)

Why it matters: The VPC is the foundation of everything in GCP. Understanding its global/regional structure, subnet design, and routing primitives is a hard requirement.

Chapter 3 Full Index & Learning Paths

Subtopics:

VPC Is a Global Resource, Subnets Are Regional - VPC scope vs subnet scope, implications for multi-region, why GCP differs from AWS/Azure
Auto-mode vs Custom-mode VPC - Auto-mode limitations (10.128.0.0/9), custom-mode flexibility, why production always uses custom mode, migration strategies
Subnet Design & CIDR Planning - Primary vs secondary ranges, GKE Pod CIDR allocation, IP address management at scale, overlap constraints
Alias IP Ranges & GKE Pods - VPC-native pod routing (no NAT), anti-spoofing checks, container networking patterns, firewall interactions
Static Routes & Next Hops - Subnet routes, custom static routes, next hop types (VMs, ILBs, VPN), route conflict resolution
Dynamic Routes & Cloud Router - BGP sessions, route learning/advertisement, regional vs global mode, on-premises connectivity
System-generated Routes - Default route, subnet routes, special paths (GFE, IAP, Serverless), reserved ranges
Firewall Rules Fundamentals - Stateful inspection, priority 0-65535, ingress/egress asymmetry, connection tracking limits
Network Tags vs Service Accounts - Tags vs SAs for firewall targeting, decision matrix, multi-tier patterns, IAM integration
Hierarchical Firewall Policies - Organization → folder → project evaluation order, allow/deny semantics, exceptions, multi-org scenarios
Cloud NGFW & L7 Inspection - FQDN filtering, TLS interception, IDS/IPS, threat intelligence, latency overhead, throughput ceilings
VPC Peering Deep Dive - No-transitivity principle, mesh topology, hub-and-spoke routing, DNS resolution, cross-project patterns
Shared VPC & Centralized Management - Host/service projects, subnet sharing, IAM role separation, multi-tenancy isolation, cost attribution
Private Google Access - Routing for 199.36.153.8/30 (private.googleapis.com) and 199.36.153.4/30 (restricted.googleapis.com), Google API access without internet, private vs restricted endpoints
VPC Flow Logs Analysis - Sampling mechanics, metadata fields, BigQuery export, cost analysis, troubleshooting patterns
Network Intelligence Center - Topology visualization, connectivity tests, performance insights, firewall analysis
VPC Service Controls - Service perimeters, access levels, ingress/egress rules, data exfiltration prevention, compliance
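The "overlap constraints" in CIDR planning are easy to check mechanically before anything is created, since overlapping ranges break VPC Peering and hybrid routing. This sketch uses only Python's standard `ipaddress` module; the range names and CIDRs, including the on-premises block, are a hypothetical plan.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(ranges: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of named CIDR ranges that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2) if nets[a].overlaps(nets[b])]


# Hypothetical plan for one region: primary subnet plus GKE secondary ranges.
plan = {
    "subnet-primary": "10.10.0.0/20",  # node IPs
    "gke-pods": "10.20.0.0/14",        # alias IP secondary range for Pods
    "gke-services": "10.24.0.0/20",    # ClusterIP secondary range
    "onprem": "10.16.0.0/12",          # hypothetical on-prem block reached over VPN
}

for a, b in find_overlaps(plan):
    print(f"overlap: {a} <-> {b}")
```

Here both GKE secondary ranges fall inside the on-premises /12, so the plan would need new Pod and Service ranges before hybrid connectivity is wired up.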

Chapter 4: Cloud DNS Architecture & Production Patterns

Why it matters: DNS is a hidden attack surface. Misconfiguration leads to outages and data exfiltration. Cloud DNS for GKE is mandatory for Autopilot.

Chapter 4 Full Index & Learning Paths

Subtopics:

Managed Zones: Public vs Private - Public/private zone fundamentals, zone characteristics, naming conventions
Split-Horizon DNS: Internal vs External Resolution - Same domain multiple answers, internal/external topology, failover patterns
DNS Peering: Hybrid On-Premises Resolution - Multi-project, multi-VPC, on-premises integration, hub-spoke architecture
DNS Forwarding: Configuring Upstream Resolvers - Forwarding zones, external DNS, resolver chains, failure handling
Private DNS Zones: VPC Binding & Zone Discovery - VPC attachment, zone discovery, multi-VPC patterns, GKE integration
Cloud DNS for GKE: Alternatives & Performance at Scale - GKE DNS stack, kube-dns vs CoreDNS, external service discovery, multi-cluster
Response Policy Zones (RPZ): Internal Overrides & Security - RPZ mechanisms, security use cases, malware blocking, internal redirects
NodeLocal DNSCache: Latency Reduction & Caching Mechanics - Local caching, performance impact, deployment, troubleshooting
DNS Resolution Path: Pod → NodeLocal → Cloud DNS → Upstream - Complete flow, layer-by-layer troubleshooting, debugging tools
DNS Query Logging: Detection, Audit & Compliance - Query logging, exfiltration detection, BigQuery analysis, alerting
TTL Tuning: High-Churn Environments & Consistency - TTL mechanics, environment-specific tuning, eventual consistency
DNSSEC: Validation & Key Management - DNSSEC architecture, validation, key signing, operational considerations
Multi-Cluster DNS: Service Directory Patterns - ServiceImport/Export, Service Directory, cross-cluster routing, failover

PART II: GOOGLE KUBERNETES ENGINE (END-TO-END ARCHITECTURE)

Chapter 5: GKE Control Plane Internals (Stateful Systems at Scale)

Why it matters: The control plane is the cluster's "brain". Understanding reconciliation, etcd behavior, and control plane limits is a prerequisite for debugging scheduling failures, API server latency, and upgrade issues.

Chapter 5 Full Index & Learning Paths

Subtopics:

GKE Managed Control Plane Model: Standard vs Autopilot - What Google manages, what the customer manages, operational implications
Control Plane Component Architecture: API Server, Scheduler, Controller-Manager - Each component's role, dependencies, failure modes, interoperability
etcd vs Spanner Backend: GKE State Storage & Consistency Model - Storage backends, consistency guarantees, latency implications, backup strategies
etcd Architecture Deep Dive: Quorum, Replication, Watch Mechanism, Compaction - Raft consensus, replication log, watch caching, compaction schedule, performance limits
Watch Caching & API Server Local Cache: Stale Reads, Reconnection Behavior - Cache mechanics, stale reads, watch connection handling, cache invalidation
Kubernetes Informer Pattern: List-Watch Protocol, Local Cache, Resync Intervals - List-watch protocol, informer cache, resync mechanics, shared factory pattern
Controller Reconciliation Loops: Level-Triggered vs Edge-Triggered Design - Reconciliation patterns, level vs edge-triggered, failure modes, idempotency
API Priority and Fairness (APF): Flow Schemas, Priority Levels, Concurrency Limits - Request prioritization, flow classification, fair queuing with shuffle sharding, debugging rejections
Admission Control Pipeline: MutatingAdmissionWebhook, ValidatingAdmissionWebhook - Request processing pipeline, webhook execution order, failure modes, cluster stability
Mutating Admission Policies: CEL-Based Policies, Webhook Alternatives - CEL expressions, policy enforcement, webhook alternatives, performance tradeoffs
API Server Request Lifecycle: Authentication → Authorization → Admission → Storage - Full request path, latency breakdown, bottleneck analysis
Control Plane Scalability: Request Rate Limits, Watch Connection Limits, Burst Handling - Scale limits, capacity planning, failure at scale, workarounds
Control Plane Connectivity: DNS-Based vs IP-Based Endpoint, Authorized Networks - Endpoint types, authorized networks, network security implications
Private Cluster Control Plane: Private Endpoint, Cloud NAT, Node Access - Private endpoint setup, node connectivity, security benefits
Credential Rotation & Zero-Downtime Updates: SSL Certificates, CA Rotation, IP Rotation - Certificate lifecycle, CA rotation, zero-downtime strategies
Control Plane SLA, Release Channels, & Versioning Policy - Availability guarantees, release cadence, version support windows, version skew policy
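The level-triggered design from "Controller Reconciliation Loops" can be illustrated without any Kubernetes dependency. In this sketch the `Action` type, `reconcile`, `apply`, and the object names are hypothetical stand-ins, not client-go or controller-runtime APIs. The key property is that `reconcile` diffs the full desired state against the full observed state instead of reacting to individual events, so missed or duplicated events do not matter.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Action:
    kind: str  # "create" or "delete"
    name: str


def reconcile(desired: Set[str], observed: Set[str]) -> List[Action]:
    """Level-triggered: compute actions from current state only.

    Running it twice on the same state yields the same actions, and running
    it after convergence yields none, which makes the loop idempotent.
    """
    creates = [Action("create", n) for n in sorted(desired - observed)]
    deletes = [Action("delete", n) for n in sorted(observed - desired)]
    return creates + deletes


def apply(observed: Set[str], actions: List[Action]) -> Set[str]:
    """Simulate executing the actions (stand-in for real API calls)."""
    state = set(observed)
    for a in actions:
        if a.kind == "create":
            state.add(a.name)
        else:
            state.discard(a.name)
    return state


desired = {"web-0", "web-1", "web-2"}
observed = {"web-0", "web-7"}  # drifted, e.g. after events were missed
actions = reconcile(desired, observed)
print(actions)
observed = apply(observed, actions)
print(reconcile(desired, observed))  # converged: nothing left to do
```

An edge-triggered controller that only handled "Pod deleted" events would never notice the stray `web-7`; the level-triggered diff catches it on the next pass.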

Chapter 6: GKE Node Lifecycle & Pool Management

Why it matters: Node management is where most operational incidents happen. Node NotReady, OOM kills, disk pressure: understanding the lifecycle helps you design clusters that tolerate failures better.

Prerequisites: Chapter 5, Container-Optimized OS basics

Depth: 5/5

Chapter 6 Full Index & Learning Paths

Subtopics:

COS, Node Bootstrap, Node Conditions, and Auto-Repair - COS hardening/immutable filesystem, kubelet registration, startup taints, node conditions, eviction behavior, auto-repair triggers and the node replacement mechanism
Node Pool Upgrades, Draining, Maintenance Windows, and Cluster Disruption Budget - Surge vs blue-green, maxSurge/maxUnavailable, cordon vs drain, PDB, maintenance windows/exclusions, disruption frequency limits
Spot, ARM, Confidential Nodes, Reservations, and Node Labeling Strategy - Spot preemption, graceful shutdown, ARM T2A compatibility, confidential computing, reservation affinity, labeling strategy for scheduling
Max Pods, Flex Pod CIDR, Boot Disk, Local SSD, and Capacity Design - max pods per node, alias IP sizing, discontiguous Pod CIDR, boot disk performance, local SSD patterns
kubelet & containerd Configuration for Production GKE - kubelet tuning, eviction thresholds, cgroup v2 migration, registry mirrors, custom TLS CA, image pulling behavior
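The link between max pods per node and Pod CIDR capacity is plain arithmetic. This sketch assumes GKE's documented sizing rule that each node receives a Pod block with at least twice max-pods addresses, rounded up to a power of two (for example max-pods 110 gets a /24); verify against the current GKE documentation. The function names are hypothetical, and other reservations are ignored.

```python
def per_node_pod_prefix(max_pods_per_node: int) -> int:
    """Prefix length of each node's Pod block under the assumed rule:
    at least 2 x max Pods addresses, rounded up to a power of two."""
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be >= 1")
    needed = 2 * max_pods_per_node
    host_bits = (needed - 1).bit_length()  # smallest power of two >= needed
    return 32 - host_bits


def max_nodes(pod_range_prefix: int, max_pods_per_node: int) -> int:
    """How many nodes fit in a Pod secondary range of the given prefix length."""
    node_prefix = per_node_pod_prefix(max_pods_per_node)
    if node_prefix < pod_range_prefix:
        return 0  # a single node's block does not fit in the range
    return 2 ** (node_prefix - pod_range_prefix)


for pods in (8, 32, 64, 110):
    prefix = per_node_pod_prefix(pods)
    print(f"max-pods {pods:>3}: /{prefix} per node, {max_nodes(14, pods)} nodes in a /14")
```

Lowering max-pods from 110 to 32 quadruples the node capacity of the same /14, which is often cheaper than carving a larger secondary range.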

Chapter 7: GKE Networking Internals (VPC-Native, CNI, Dataplane V2 Deep Dive)

Why it matters: GKE networking is the most complex area. Understanding the packet path from pod to pod, through a Service, and out to the internet is a prerequisite for debugging latency, packet drops, and network policy violations.

Prerequisites: Chapter 3, Linux networking (namespaces, iptables, veth pairs, bridges)

Depth: 5/5

Chapter 7 Full Index & Learning Paths

Subtopics:

VPC-Native Architecture: Alias IP, Pod CIDR Sizing & Migration - VPC-native vs routes-based, alias IP ranges on the node NIC, routes-based deprecation, Pod CIDR sizing & max-pods-per-node, secondary subnet sizing, discontiguous Pod CIDR, IP migration
CNI Evolution & Dataplane V2: kubenet, Calico, eBPF/Cilium - kubenet legacy, Calico iptables ceiling, GKE Dataplane V2 (anetd DaemonSet, eBPF programs, no kube-proxy), eBPF vs iptables 260K endpoint limit, Cilium identity model
Detailed Packet Path Analysis: The 5 Paths of a Packet - Same-node & cross-node pod-to-pod, pod-to-Service (ClusterIP DNAT), pod-to-external (masquerade/Cloud NAT), external-to-pod (LoadBalancer/container-native NEG)
kube-proxy & Service Dataplane: iptables vs eBPF - iptables mode chains/DNAT/session affinity, chain explosion O(Services × Endpoints), lock contention, rule resyncing & control-plane latency, Dataplane V2 eBPF replacement
NetworkPolicy Enforcement: Calico iptables vs Dataplane V2 eBPF - Default-deny model, Calico IP-based ipsets, Dataplane V2 Cilium identity-based enforcement, FQDN egress, NetworkPolicy logging, isolation anti-patterns
Troubleshooting Toolkit: tcpdump, nsenter, Hubble, Connectivity Tests - tcpdump inside the pod network namespace, node-level nsenter, Hubble, GCP Connectivity Tests, ip route/arp/iptables, conntrack limits, eBPF tracing with bpftrace

Chapter 8: GKE Scheduler (Algorithms, Affinity, Resource Model)

Why it matters: Scheduling failures are the leading cause of Pods stuck in Pending. Understanding the filtering/scoring mechanics helps you design node pools and resource requests correctly from the start.

Prerequisites: Chapters 6, 7; Kubernetes resource model (requests/limits)

Depth: 5/5

Chapter 8 Full Index & Learning Paths

Subtopics:

Scheduler Architecture & Workflow: Scheduling Framework, Cycle & Queue - Scheduling cycle vs binding cycle, the main extension points (QueueSort, PreFilter → Filter → PostFilter → PreScore → Score → NormalizeScore → Reserve → Permit → PreBind → Bind → PostBind), optimistic locking, the three queues activeQ/backoffQ/unschedulablePods, QueueingHints, scheduler metrics
Filter & Score Plugins: Filtering and Scoring Nodes - Filter plugins (NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread, VolumeBinding), Score plugins, LeastAllocated vs MostAllocated (spread vs bin-packing), GKE optimize-utilization, percentageOfNodesToScore
Node Affinity & Inter-Pod Affinity/Anti-Affinity - nodeAffinity (required/preferred, operators, weight), inter-pod affinity/anti-affinity (topologyKey, namespaceSelector), O(pods×namespaces) cost, the required hostname anti-affinity anti-pattern, autoscaler interaction
Pod Topology Spread Constraints: Spreading Across Failure Domains - Skew formula, maxSkew, minDomains, whenUnsatisfiable (DoNotSchedule vs ScheduleAnyway), nodeAffinityPolicy/nodeTaintsPolicy, matchLabelKeys, comparison with podAntiAffinity
Taints & Tolerations: Node-Side "Repel" Constraints - The three effects NoSchedule/PreferNoSchedule/NoExecute, tolerationSeconds, operator Equal/Exists, taint-based eviction from node conditions, the 300s default toleration, GKE default taints (GPU/Spot/cordon)
Resource Model, QoS & Node-Pressure Eviction - requests vs limits, CPU CFS quota throttling, memory OOM kill, QoS (Guaranteed/Burstable/BestEffort), node-pressure eviction (soft/hard thresholds), oom_score_adj, overcommit, why node-pressure eviction does not respect PDBs
Pod Priority & Preemption - PriorityClass (value, globalDefault, preemptionPolicy Never), victim selection algorithm, nominatedNodeName, best-effort PDB handling, cross-node preemption, cascading eviction, starvation, ResourceQuota to limit priority
Extended Resources & GPU Scheduling - requests=limits for extended resources, nvidia.com/gpu, GPU taint + ExtendedResourceToleration, device plugin/driver, GPU sharing (time-sharing/MIG), stranded GPUs, TPU & Dynamic Workload Scheduler
GKE Autopilot Scheduling, Custom ComputeClasses & Scheduler Extenders - Autopilot enforcing CPU:memory ratios & rejecting/adjusting requests, compute classes, custom ComputeClasses (priorities/fallback, activeMigration, consolidation, nodePoolAutoCreation), scheduler extenders vs plugins, Kueue/Volcano
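The skew formula from "Pod Topology Spread Constraints" reduces to a one-line check. This is a simplified sketch of the DoNotSchedule filter, assuming the incoming Pod matches the constraint's selector and that every listed domain is eligible (it ignores nodeAffinityPolicy/nodeTaintsPolicy); the function name and zone names are hypothetical. A domain passes if placing one more matching Pod there keeps `(count + 1) - global_min <= maxSkew`, and when fewer eligible domains exist than `minDomains`, the global minimum is treated as 0.

```python
from typing import Dict, List


def feasible_domains(counts: Dict[str, int], max_skew: int, min_domains: int = 1) -> List[str]:
    """Domains that pass a simplified DoNotSchedule topology-spread check.

    counts: matching Pods per eligible domain (e.g. per zone) before placement.
    """
    if min_domains < 1:
        raise ValueError("min_domains must be >= 1")
    # Too few domains: treat the global minimum as 0 so Pods stay Pending,
    # which gives the autoscaler a reason to add capacity in a new domain.
    global_min = min(counts.values()) if len(counts) >= min_domains else 0
    return sorted(d for d, n in counts.items() if (n + 1) - global_min <= max_skew)


counts = {"zone-a": 3, "zone-b": 2, "zone-c": 2}
print(feasible_domains(counts, max_skew=1))                 # zone-a would reach skew 2
print(feasible_domains(counts, max_skew=1, min_domains=4))  # only 3 domains: nothing fits
```

With ScheduleAnyway the same skew value would only lower a node's score instead of filtering it out.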

Chapter 9: GKE Autoscaling (HPA, VPA, Cluster Autoscaler, KEDA)

Why it matters: Autoscaling is the heart of cost optimization and reliability. Misunderstanding autoscaling leads to slow scale-up (outages), expensive over-provisioning, or flapping that destabilizes the cluster.

Prerequisites: Chapter 8, Cloud Monitoring metrics

Depth: 5/5

Chapter 9 Full Index & Learning Paths

Subtopics:

HorizontalPodAutoscaler: Control Loop & Algorithm - 15s control loop period, the desiredReplicas formula, 0.1 tolerance, dampening for not-yet-Ready Pods and missing metrics, stabilization window, atomic vs final recommendation logs (hpa-controller), debugging via conditions
HPA: Behavior Policies, Metrics Sources & Debugging - autoscaling/v2 behavior (scaleUp/scaleDown, selectPolicy, stabilizationWindowSeconds), Resource/Custom/External metrics, Performance HPA Profile (1000/5000 objects), HPA+VPA conflicts, rolling update interaction, AbleToScale/ScalingActive/ScalingLimited
VerticalPodAutoscaler: Architecture, Recommender & Update Modes - Recommender/Updater/Admission Controller, decaying histogram with a 24h half-life, OOM bump, update modes (Off/Initial/Recreate/Auto/InPlaceOrRecreate), In-Place Pod Resize, controlledValues, VPA limitations
Multidimensional Pod Autoscaling: HPA and VPA Together - Why HPA+VPA conflict, MultidimPodAutoscaler (horizontal on CPU + vertical on memory), spec & constraints, migration, comparison with HPA on a custom metric + VPA Off, failure modes
Cluster Autoscaler: Scale-Up & Scale-Down Mechanics - Pending Pod trigger, scheduling simulation, expanders (least-waste/priority...), location_policy BALANCED/ANY, the 0.5 scale-down utilization threshold & delays, what blocks scale-down, drain sequence, autoscaling profile
Node Auto-Provisioning: Automatic Node Pool Creation - NAP creating/deleting pools, resourceLimits, machine type selection, default template (Shielded/SA/auto-upgrade), GPU/TPU/Spot, ComputeClass integration, 200-pool limit, NAP on Autopilot
CA Troubleshooting, Capacity Buffers & Provisioning Requests - Visibility events (scaleUp/scaleDown/nodePoolCreated), noScaleUp/noScaleDown reasons, Cloud Logging queries, capacity buffers with pause Pods, Provisioning Requests & Dynamic Workload Scheduler, Kueue
KEDA: Kubernetes Event-Driven Autoscaling - KEDA architecture (operator/metrics-apiserver/webhooks) that creates HPAs, ScaledObject vs ScaledJob, scale-to-zero (activation vs scaling), defaults (pollingInterval/cooldownPeriod), Pub/Sub & Prometheus scalers, Cloud Tasks/BigQuery
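The core HPA computation listed under "HorizontalPodAutoscaler: Control Loop & Algorithm" is `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`, skipped when the ratio is within the 0.1 tolerance. The sketch below implements only that step plus min/max clamping; it deliberately ignores unready Pods, missing metrics, stabilization windows, and behavior policies, and the function name and defaults are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA recommendation for a single metric."""
    ratio = current_value / target_value
    if abs(1.0 - ratio) <= tolerance:
        desired = current_replicas  # inside the tolerance band: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_value=90, target_value=60))   # CPU 90% vs 60% target
print(desired_replicas(4, current_value=63, target_value=60))   # 5% off: within tolerance
print(desired_replicas(10, current_value=20, target_value=50))  # scale down
print(desired_replicas(10, current_value=90, target_value=10, max_replicas=50))  # clamped
```

The tolerance band is what stops small metric noise from producing a new replica count every 15s loop.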

Chapter 10: GKE Admission Control & Policy Enforcement (Securing the API)

Why it matters: Admission control is the security gateway. A misconfigured webhook can take down the whole cluster. Understanding the admission pipeline is mandatory for platform engineers.

Prerequisites: Chapter 5, Kubernetes API fundamentals

Depth: 5/5

Chapter 10 Full Index & Learning Paths

Subtopics:

Admission Pipeline & Built-in Plugins - Where admission sits in the request lifecycle, the fixed two-phase order Mutating → Validating, the default-enabled plugin list, the four key plugins LimitRanger/ResourceQuota/PodSecurity/NodeRestriction, why --enable-admission-plugins cannot be changed on GKE
Mutating & Validating Webhooks: Invocation Mechanics & Dry-Run - WebhookConfiguration (rules/clientConfig), the AdmissionReview request/response round trip, JSON Patch, reinvocationPolicy IfNeeded & idempotency, matchPolicy/objectSelector/namespaceSelector, sideEffects & dry-run, audit vs enforce
Webhook Failure Modes, Performance & Stability - failurePolicy Fail vs Ignore, timeoutSeconds (10s default, 30s maximum) & p99 latency, the hot write path, anti-patterns (intercepting kube-system, self-validation, no HA), control plane stability strategies, break-glass
Webhook Certificate Management: CA Bundle & cert-manager - A webhook is an HTTPS server, SAN <service>.<ns>.svc, caBundle & verification, cert-manager + CA Injector auto-injecting caBundle, zero-downtime rotation, self-signed CA, x509 errors
PodSecurity Admission (PSA): Modes & Profiles - The three modes enforce/audit/warn via namespace labels, version pinning, the three profiles privileged/baseline/restricted with each control in detail, enforce applies to Pods while audit/warn apply to workload resources, exemptions, replacing PSP
Gatekeeper / Policy Controller (OPA): ConstraintTemplate & Constraint - Webhook + audit controller architecture, ConstraintTemplate (Rego) → Constraint, enforcementAction deny/dryrun/warn, audit loop & violation status, referential constraints, Policy Controller on GKE (Config Sync/fleet/bundles), Gatekeeper vs PSA
ResourceQuota & LimitRange: Namespace Resource Governance - ResourceQuota compute/storage/object-count, quotas scoped by PriorityClass, the rule that requests/limits must be declared, LimitRange default/min/max/maxLimitRequestRatio, ordering LimitRanger (mutating) → ResourceQuota (validating)
ValidatingAdmissionPolicy (CEL): In-Process Policy Without Webhooks - VAP (GA in 1.30) + Binding + paramRef, CEL variables object/oldObject/request/params/namespaceObject, matchConditions/variables, validationActions Deny/Warn/Audit, why CEL eliminates webhook failure modes, MutatingAdmissionPolicy, engine selection matrix
Organization Policies for GKE & Admission Debugging - Org Policy blocking at the GCP API layer (cluster config) vs Kubernetes admission (Pod config), custom CEL constraints on container.googleapis.com/Cluster & NodePool, debugging via Policy Denied audit logs, dry-run, webhook logs, and apiserver_admission metrics
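The AdmissionReview round trip and JSON Patch from "Mutating & Validating Webhooks" can be shown as a pure function, leaving out the HTTPS server and TLS. This sketch assumes the `admission.k8s.io/v1` response shape (`uid`, `allowed`, `patchType: JSONPatch`, base64-encoded `patch`); the label key and value are hypothetical. It also shows two details that bite in practice: `/` inside a label key must be escaped as `~1` in the JSON Pointer path (RFC 6901), and the mutation must be idempotent so reinvocation produces no second patch.

```python
import base64
import json

LABEL_KEY = "platform.example.com/owner"  # hypothetical label key
LABEL_VALUE = "unassigned"


def escape_json_pointer(token: str) -> str:
    # RFC 6901: escape "~" first, then "/"
    return token.replace("~", "~0").replace("/", "~1")


def mutate(review: dict) -> dict:
    """Build an AdmissionReview response that adds LABEL_KEY if it is missing."""
    request = review["request"]
    labels = request["object"].get("metadata", {}).get("labels")
    response = {"uid": request["uid"], "allowed": True}  # uid must echo the request
    patch = []
    if labels is None:
        patch.append({"op": "add", "path": "/metadata/labels",
                      "value": {LABEL_KEY: LABEL_VALUE}})
    elif LABEL_KEY not in labels:
        patch.append({"op": "add",
                      "path": "/metadata/labels/" + escape_json_pointer(LABEL_KEY),
                      "value": LABEL_VALUE})
    if patch:  # idempotent: an already-labelled object gets no patch
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}


review = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {"uid": "hypothetical-uid-1",
                "object": {"metadata": {"name": "web", "labels": {"app": "web"}}}},
}
print(json.dumps(mutate(review), indent=2))
```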

Chapter 11: GKE Storage (PV/PVC, StorageClasses, CSI Drivers)

Why it matters: Storage is where stateful workloads live. Understanding the PV/PVC lifecycle, storage classes, and volume binding prevents data loss and performance bottlenecks.

Prerequisites: Chapter 6, Kubernetes storage concepts

Depth: 5/5

Chapter 11 Full Index & Learning Paths

Subtopics:

Volume Types & Storage Taxonomy: The Big Picture - Kubernetes volume types (emptyDir, configMap, secret, projected, downwardAPI, hostPath, PVC), Block/File/Object classification, access modes RWO/ROX/RWX/RWOP and their node-vs-pod semantics, a decision framework for choosing storage
PV/PVC Lifecycle & Dynamic Provisioning - The provisioning→binding→mounting→releasing→reclaiming lifecycle, reclaimPolicy Delete vs Retain, end-to-end dynamic provisioning via StorageClass/CSI, volume binding modes Immediate vs WaitForFirstConsumer, the GKE default StorageClass
Persistent Disk CSI: Foundational Block Storage - PD types and the IOPS-to-capacity relationship, attach/detach and per-node limits, RWX limitations of block devices, Regional PD synchronous replication, snapshots/cloning/expansion, Stateful HA Operator force-attach
Hyperdisk: Next-Generation Block Storage - Decoupling IOPS/throughput from capacity, five types (balanced/extreme/throughput/ml/balanced-ha), per-VM performance limits, Hyperdisk ML ROX multi-attach, Storage Pools thin provisioning, VolumeAttributesClass
Local SSD & Ephemeral Storage: Trading Durability for Speed - Physical NVMe Local SSD, emptyDir and ephemeral storage, the data-loss rules when a node is recreated, ephemeral-storage-local-ssd provisioning, correct use cases and fatal anti-patterns
Filestore CSI: Shared NFS for ReadWriteMany - When RWX is really needed, service tiers (BASIC_HDD/SSD, Zonal, Enterprise/Regional), Multishares packing many small PVCs into one instance, volume snapshots, NFS tradeoffs in latency/consistency/locking
Cloud Storage FUSE: Object Storage with File Semantics - How FUSE emulates a filesystem on top of GCS, the gke-gcsfuse-sidecar sidecar, Workload Identity, file cache/metadata cache/parallel downloads, non-POSIX semantics, read-heavy AI/ML use cases
Parallelstore & Managed Lustre: Parallel Filesystems for AI/ML - Parallelstore built on DAOS with 2+1 erasure coding and a temporary-storage model, Managed Lustre for HPC, CSI drivers, GCS integration, a framework for choosing Parallelstore/Lustre/Filestore/GCS FUSE
StatefulSets, Volume Expansion & Backup for GKE - StatefulSet volumeClaimTemplates and stable Pod identity, PVCs retained on scale-down, online vs offline volume expansion, Backup for GKE config+volume backups, differences from PD snapshots, snapshot lifecycle and DR strategy

Chapter 12: GKE Security (Hardening, RBAC, Pod Security)

Why it matters: GKE security has many layers. A single misconfiguration can expose the entire cluster. Production hardening is mandatory, not optional.

Prerequisites: Chapters 5, 10, IAM fundamentals

Depth: 5/5

Chapter 12 Full Index & Learning Paths

Subtopics:

Security Model & Shared Responsibility: Who Protects What - The GKE shared responsibility model, how the boundary shifts between Standard and Autopilot, seven defense layers (org/project → control plane → identity → node → pod → network → supply chain), threat model and control plane hardening patterns (private cluster, authorized networks)
Authentication & Identity: The Four Gates of a Request - The four-gate request flow, the two-gate IAM ↔ RBAC model, authentication methods (Google identity/OIDC, gke-gcloud-auth-plugin, legacy X.509), legacy vs bound ServiceAccount tokens (TokenRequest, audience-bound, expiring), automountServiceAccountToken: false
RBAC Deep Dive: Role, Binding & Least Privilege - Role vs ClusterRole, binding scope rules, aggregated ClusterRoles, mapping IAM predefined roles ↔ RBAC, default roles (view/edit/admin/cluster-admin), anti-patterns (cluster-admin for SAs, wildcards, system:authenticated), verification with kubectl auth can-i
Workload Identity Federation for GKE: No More Long-Lived Keys - The dangers of JSON service account keys, the workload identity pool PROJECT_ID.svc.id.goog, principal formats, the three-step token exchange via the GKE metadata server, direct binding vs the legacy annotation, federation with external IdPs
Node Security: Shielded, Confidential, gVisor, COS - Shielded Nodes (Secure Boot, vTPM, Integrity Monitoring), gVisor (runtimeClassName: gvisor, userspace kernel), Confidential Nodes (AMD SEV, memory encryption), Container-Optimized OS (read-only rootfs, seccomp), minimal node service account, metadata concealment
Pod & Workload Security: Pod Security Standards & securityContext - The three Pod Security Standards levels (Privileged/Baseline/Restricted), Pod Security Admission (enforce/audit/warn, namespace labels), securityContext field by field (runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, drop capabilities), seccomp RuntimeDefault, AppArmor
Network Policy Security: Default-Deny & East-West Traffic - The default-deny pattern, the DNS-blocking trap, Dataplane V2 (Cilium/eBPF), FQDNNetworkPolicy for domain-based egress, Network Policy logging for investigations, lateral movement control
Admission Control Security: Enforcement at the API Gate - Admission as a security enforcement mechanism, the failurePolicy Fail/Ignore trade-off, mutating webhook risks, in-tree ValidatingAdmissionPolicy/CEL, OPA/Gatekeeper vs Kyverno, managed Policy Controller and the constraint framework
Binary Authorization: Deploy Only Trusted Images - Attestation model (digest → attestor → signed attestation → policy), Artifact Analysis notes, execution path through admission + the Binary Authorization API, policy modes (allowlist/require-attestation/dryRun), audited break-glass, Continuous Validation, Cloud Build/SLSA provenance
Audit Logging, Security Posture & Hardening Checklist - The four Cloud Audit Logs types (Admin Activity, Data Access, System Event, Policy Denied) and the cost trap, Kubernetes audit logs and forensic queries, GKE Security Posture (config scanning + workload vulnerability scanning), Security Command Center integration, a full seven-layer hardening checklist
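The least-privilege reasoning in "RBAC Deep Dive" rests on one property: Kubernetes RBAC is purely additive, with no deny rules, so access is granted if any rule from any binding matches. This sketch models only that matching step; the `PolicyRule` dataclass, `is_allowed`, and the example rules are hypothetical, and it ignores resourceNames, nonResourceURLs, binding namespace scope, and aggregation. It approximates the question `kubectl auth can-i` answers, not its implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PolicyRule:
    api_groups: Tuple[str, ...]  # "" is the core API group
    resources: Tuple[str, ...]   # subresources appear as "pods/log"
    verbs: Tuple[str, ...]


def _matches(allowed: Tuple[str, ...], value: str) -> bool:
    # "*" is the RBAC wildcard for a field
    return "*" in allowed or value in allowed


def is_allowed(rules: List[PolicyRule], api_group: str, resource: str, verb: str) -> bool:
    """Grant if ANY rule matches all three fields; there is nothing to deny with."""
    return any(
        _matches(r.api_groups, api_group)
        and _matches(r.resources, resource)
        and _matches(r.verbs, verb)
        for r in rules
    )


# Rules a hypothetical subject collects from two bindings.
rules = [
    PolicyRule(("",), ("pods", "pods/log"), ("get", "list", "watch")),
    PolicyRule(("apps",), ("deployments",), ("get",)),
]
print(is_allowed(rules, "", "pods", "list"))    # granted by the first rule
print(is_allowed(rules, "", "pods", "delete"))  # no rule grants delete
wildcard = rules + [PolicyRule(("*",), ("*",), ("*",))]
print(is_allowed(wildcard, "", "secrets", "get"))  # one wildcard rule grants everything
```

Because nothing can subtract permissions, a single over-broad binding (such as the wildcard rule above) silently overrides every carefully scoped Role.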

Chapter 13: GKE Workload Identity & Service Accounts (Modern Authentication)

Why it matters: Workload Identity is the modern mechanism that lets Pods authenticate to Google APIs without long-lived service account keys. If it is misconfigured, Pods cannot call Google APIs. Understanding the token exchange flow is therefore key to deploying, operating, and troubleshooting it effectively.

Prerequisites: Chapter 12, IAM service accounts, OIDC basics

Depth: 5/5

Chapter 13 Full Index & Learning Paths

Subtopics:

Workload Identity Architecture: The Cluster as an OIDC Provider - Each cluster is an independent OIDC issuer, the Workload Identity Pool PROJECT_ID.svc.id.goog as the bridge that lets IAM understand Kubernetes identities, four principal/principalSet identifier forms (by KSA name, by UID, namespace-level, cluster-level), identity sameness across clusters in the same project, Fleet Workload Identity
ServiceAccount Token & Projection Mechanics: Signed Identity - TokenRequest API and bound tokens replacing legacy secret-based tokens, projected volumes with audience/expirationSeconds/path, JWT structure (iss = cluster issuer, aud = sts.googleapis.com, exp, kubernetes.io claim), OIDC issuer endpoint and JWKS for offline verification by STS, automatic refresh lifecycle
Metadata Server & Token Exchange Path: The Heart of the Mechanism - gke-metadata-server DaemonSet (one Pod per node) intercepting requests to 169.254.169.254, the node-level trust boundary and the hostNetwork bypass risk, five-step token exchange via Security Token Service, caching/refresh with a 1-hour token lifetime, scale bottlenecks (500 conn/node, 3000 SA/cluster, 6000 req/min quota), egress network policy
IAM Binding Models: Granting Permissions to Workload Identities - The direct model binding roles straight to the KSA principal vs the impersonation model via the iam.gke.io/gcp-service-account annotation and roles/iam.workloadIdentityUser, namespace/cluster-level principalSet, cross-project access with credential-quota-project, always enabled on Autopilot, return-principal-id-as-email
Workload Identity Federation for External IdPs: Multi-Cloud Identity Federation - Workload Identity Pool + Provider for external IdPs, RFC 8693 token exchange via sts.googleapis.com, supported IdPs (AWS, Entra ID, GitHub Actions, GitLab, Kubernetes, Okta, AD FS, OIDC/SAML), CEL attribute mapping google.subject/attribute.NAME, attribute conditions against the confused deputy problem, direct vs impersonation
Service Access & Application Default Credentials Patterns - ADC behavior and credential lookup order, why client libraries work without code changes, Secret Manager via Workload Identity, the KSA-per-workload pattern for Cloud Storage/Pub-Sub/BigQuery, the Artifact Registry credential helper, the GOOGLE_APPLICATION_CREDENTIALS anti-pattern, local-vs-cluster ADC differences
Debugging Workload Identity: When Token Exchange Fails - A four-layer procedure (source token, metadata server, STS, IAM binding), debugging from inside the Pod with curl against the metadata server, verifying GKE_METADATA on every node pool, verifying IAM bindings and principal strings, token validity checks, an error decision tree (unable to detect environment, 403, 404, hangs, intermittent errors at scale)
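"Verifying IAM bindings and principal strings" is easier when the strings are generated rather than hand-typed. This sketch assumes the identifier formats shown in GKE's Workload Identity documentation at the time of writing (verify before relying on them); the function names, project number, project ID, namespace, and KSA name are hypothetical. It also guards against a frequent mistake: the pool path needs the numeric project number, not the project ID.

```python
def _check_project_number(project_number: str) -> None:
    # A common binding mistake: passing the project ID where the number belongs.
    if not project_number.isdigit():
        raise ValueError(f"expected a numeric project number, got {project_number!r}")


def ksa_principal(project_number: str, project_id: str, namespace: str, ksa: str) -> str:
    """Principal for one Kubernetes ServiceAccount (direct resource access model)."""
    _check_project_number(project_number)
    return (f"principal://iam.googleapis.com/projects/{project_number}/locations/global/"
            f"workloadIdentityPools/{project_id}.svc.id.goog/subject/ns/{namespace}/sa/{ksa}")


def namespace_principal_set(project_number: str, project_id: str, namespace: str) -> str:
    """principalSet covering every ServiceAccount in one namespace."""
    _check_project_number(project_number)
    return (f"principalSet://iam.googleapis.com/projects/{project_number}/locations/global/"
            f"workloadIdentityPools/{project_id}.svc.id.goog/namespace/{namespace}")


def legacy_wi_user_member(project_id: str, namespace: str, ksa: str) -> str:
    """Member granted roles/iam.workloadIdentityUser on a GSA (impersonation model)."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"


print(ksa_principal("123456789012", "my-project", "payments", "api"))
print(namespace_principal_set("123456789012", "my-project", "payments"))
print(legacy_wi_user_member("my-project", "payments", "api"))
```

Comparing the generated string character by character with the member in the IAM policy is a quick way to rule out the "403 despite a binding" branch of the decision tree.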

Chapter 14: GKE Observability (Metrics, Logs, Traces)

Why it matters: GKE emits a huge, layered volume of telemetry. Knowing which metric lives at which layer, and correlating telemetry to go from symptom to root cause, is a core production skill.

Prerequisites: Chapters 5 to 13, Cloud Monitoring/Logging basics

Depth: 5/5

Chapter 14 Full Index & Learning Paths

Các chủ đề con:

Observability Stack — Layered Telemetry & Mental Model - Three telemetry layers (control plane/system/workload), three signal types (metric/log/trace) each with its own cost model, GKE integration with Cloud Monitoring/Logging/Trace and Managed Prometheus, consistent resource labels as the foundation for correlation
Control Plane Metrics — Observing the Cluster's Brain - API server (request rate/errors/latency percentiles, etcd operation latency, inflight requests, admission webhooks), scheduler (pending_pods, scheduling attempt duration, preemption), controller-manager (workqueue depth, reconciliation, node eviction), enabling via --monitoring, cost model
System & Workload Metrics — kube-state-metrics, cAdvisor, DCGM GPU - Node system metrics, kube-state-metrics (kube_* object state), cAdvisor (container_*, CPU CFS throttling, memory working set), DCGM GPU metrics (utilization, framebuffer, power, profiling, XID), cardinality
Application Metrics, Startup Latency & Cost Allocation - Golden signals (rate/error/duration/saturation), auto-instrumentation vs custom metrics, startup latency breakdown (image pull/init/readiness), GKE cost allocation by namespace/label (requested vs consumed), FinOps loop
GKE Logs — System, Workload, Audit & Log Control - fluent-bit logging agent, log packages (SYSTEM/WORKLOAD/API_SERVER/...), system component logs, workload stdout/stderr and structured logging, four audit log types (Admin Activity/Data Access/System Event/Policy Denied), Log Router/sinks, exclusion/sampling/retention
Managed Service for Prometheus — PodMonitoring, Rules, PromQL - Managed collection (gmp-operator, collector DaemonSet that scrapes targets on its own node, rule-evaluator, alertmanager) and the push model, PodMonitoring/ClusterPodMonitoring CRDs, Rules/ClusterRules/AlertmanagerConfig, PromQL in Cloud Monitoring, high cardinality and metricRelabeling
Managed OpenTelemetry & Custom Metrics for HPA - Managed OpenTelemetry for GKE (in-cluster OTLP collector, Instrumentation CRD, signal routing), Google-Built OpenTelemetry Collector, custom metrics for HPA (Custom Metrics Stackdriver Adapter vs Prometheus Adapter, which cannot run at the same time), ServiceMonitor/PodMonitor, KEDA integration
Self-Managed Observability — Elastic Stack on GKE - When to run it yourself (data sovereignty, multi-cloud, advanced log analytics, avoiding lock-in), Elastic Cloud on Kubernetes (ECK), performance tuning (Hyperdisk, JVM heap at 50% of memory and at most 31 GB, shard sizing, hot-warm-cold ILM), managed vs self-managed decision framework, hybrid patterns
Troubleshooting & Dashboards — Integrating Metrics, Logs, Traces - GKE dashboards in Cloud Console, correlation workflow dashboard → metric → log → trace via shared resource labels, runbooks (Pod Pending, OOMKill, latency spike, API server overload, node NotReady), SLO/burn-rate alerting to avoid alert fatigue
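Cardinality, which appears above under cAdvisor metrics and metricRelabeling, has a simple upper bound. The number of active series for one metric name is at most the product of the distinct-value counts of its labels. The sketch below assumes hypothetical label names and counts. Real cardinality is usually lower, because not every combination of label values occurs.

```python
from math import prod

def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active time series for one metric name.

    Every distinct combination of label values is its own series, so the worst
    case is the product of the per-label distinct-value counts.
    """
    return prod(label_cardinalities.values())

def after_dropping(label_cardinalities: dict[str, int], dropped: set[str]) -> int:
    """Upper bound after a relabeling rule drops the given labels."""
    kept = {k: v for k, v in label_cardinalities.items() if k not in dropped}
    return max_series(kept)

# Hypothetical labels on a request-latency histogram ("le" is the bucket label).
labels = {"pod": 500, "route": 50, "status_code": 6, "le": 12}
```

With these numbers, max_series(labels) is 1,800,000. Dropping the per-Pod label with after_dropping(labels, {"pod"}) brings the bound down to 3,600. This is why high-churn labels such as pod are the first candidates for metricRelabeling.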

Chapter 15: GKE Upgrade Mechanics & Disruption Management

Why it matters: The wrong upgrade strategy causes production outages. Understanding upgrade mechanics, release channels, and node draining is the foundation for zero-downtime upgrades.

Prerequisites: Chapters 5, 6, 7, 13

Depth: 5/5

Chapter 15 Full Index & Learning Paths

Subtopics:

Release Channels, Versioning & Version Skew Policy - GKE release channels (Rapid/Regular/Stable/Extended) cadence, auto-upgrade triggers, capping behavior, Kubernetes version skew policy control plane ↔ kubelet, n-2 support model, patch version advance notice
GKE Cluster Upgrade Mechanics: Control Plane, Node Pools & Autopilot - Zonal vs regional control plane upgrades, node pool sequencing, auto-upgrade vs manual upgrade, Autopilot managed upgrade mechanics, rollout sequencing across a fleet, upgrade notifications via Pub/Sub
Node Upgrade Strategies: Surge vs Blue-Green - Surge upgrade (maxSurge/maxUnavailable mechanics, pod scheduling, quota implications), blue-green upgrade (5 phases, parallel pool creation, pod migration, rollback), autoscaled blue-green, choosing a strategy by workload type, concurrent node pool upgrades
Maintenance Windows, Exclusions & Cluster Disruption Budget - Maintenance windows (UTC timezone, RRULE recurrence, at least 48h of availability in any 32-day period), three maintenance exclusion scopes (no-upgrades/no-minor/no-minor-node), precedence rules, cluster disruption budget for fleets, rollout sequencing patterns
Workload Disruption Readiness: PDB, Annotations & Upgrade Notifications - PodDisruptionBudget semantics (minAvailable/maxUnavailable, 1-hour hard limit, PDB + topology spread), dynamic pod-deletion-cost annotation, safe-to-evict, terminationGracePeriodSeconds + preStop hooks, upgrade notification automation, workload checklist
Troubleshooting Stuck Upgrades & Testing Upgrade Strategy - Diagnose stuck upgrade (PDB blocking, quota exhaustion, node affinity, webhook failures), manual intervention (force drain, rollback blue-green), staging cluster validation, kubectl drain testing, API deprecation checks, post-upgrade validation checklist
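The surge mechanics above come down to simple arithmetic. Up to maxSurge + maxUnavailable nodes are upgraded at once, and maxSurge extra VMs must fit in quota while the upgrade runs. This sketch estimates the batch count and the temporary quota. It assumes the settings apply per zone in a multi-zone node pool (verify this for your GKE version) and uses hypothetical names. It ignores PDB waits and any GKE-side concurrency caps, so treat the batch count as a lower bound.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class SurgePlan:
    batches_per_zone: int          # rounds of node replacement in each zone
    extra_vms_during_upgrade: int  # temporary VM quota needed across all zones

def plan_surge_upgrade(nodes_per_zone: int, zones: int,
                       max_surge: int, max_unavailable: int) -> SurgePlan:
    """Estimate a surge upgrade, assuming the settings apply per zone."""
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("maxSurge and maxUnavailable must be non-negative")
    parallelism = max_surge + max_unavailable  # nodes upgraded at once per zone
    if parallelism == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return SurgePlan(
        batches_per_zone=math.ceil(nodes_per_zone / parallelism),
        extra_vms_during_upgrade=max_surge * zones,
    )
```

The two settings trade quota against capacity. Consider 10 nodes per zone across 3 zones. With maxSurge=2 and maxUnavailable=1, the upgrade takes 4 batches per zone and needs 6 extra VMs of quota. With maxSurge=0 and maxUnavailable=1, it needs no extra quota, but takes 10 batches and removes one node's capacity during each batch.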

Chapter 16: GKE Autopilot Mode — Managed Infrastructure

Why it matters: Autopilot changes how you think about infrastructure. Understanding Autopilot mechanics, resource enforcement, and compute classes helps avoid wasted resources and rejected Pods.

Prerequisites: Chapters 5, 8, 9

Depth: 4/5

Chapter 16 Full Index & Learning Paths

Subtopics:

Autopilot vs Standard — Managed Node Model, Billing, Feature Gaps - Responsibility boundaries, per-Pod vs per-node billing, full feature comparison
Resource Enforcement — Min/Max Requests, CPU:Memory Ratio - What happens when a Pod is submitted, automatic adjustment, minimums/maximums per compute class, CPU:memory ratio enforcement
Compute Classes — Balanced, Scale-Out, Performance, Accelerator - Mapping to VM families, resource limits per class, when to use each class, Custom ComputeClasses
Security Hardening — Pod Security, Privileged Workloads, Org Policy - Default Pod Security Standards, dropped Linux capabilities, allowlists for privileged workloads, org policy constraints
Spot Pods & Extended Duration Pods - Preemption behavior (25s grace period), design patterns for batch jobs, Extended Duration Pods protected from node upgrades (up to 7 days)
Cluster Upgrades — Zero-Downtime, Surge, Maintenance Windows - Zero-downtime control plane upgrades, surge upgrade strategy, maintenance windows/exclusions, interaction with PDBs and Extended Duration Pods
Networking — IP Allocation, VPC-Native, hostPort - Fixed 32 Pods per node, Pod CIDR sizing, Cloud DNS requirement, hostPort limitations, Dataplane V2/Cilium
Observability — Metrics, Logs, Monitoring - System metrics available in Autopilot, Managed Prometheus, structured logging, debugging without SSH access
Migration from Standard to Autopilot - Pre-flight checks, full incompatibility checklist, blue-green and MCS migration strategies, running Autopilot Pods in Standard clusters
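Pod CIDR sizing for VPC-native clusters follows from how GKE reserves a per-node Pod range. Each node gets twice the maximum Pods per node, rounded up to a power of two. So 32 Pods per node, as in Autopilot, maps to a /26, and 110 maps to a /24. This sketch derives the per-node prefix and the number of nodes that fit in a Pod secondary range. It assumes IPv4, a max-Pods value between 8 and 256, and that no other consumers use the range. The function names are hypothetical.

```python
def per_node_prefix(max_pods_per_node: int) -> int:
    """IPv4 prefix length of the Pod range reserved for each node.

    Reserved addresses = 2 * max Pods, rounded up to a power of two.
    """
    if not 8 <= max_pods_per_node <= 256:
        raise ValueError("expected 8..256 max Pods per node")
    # (2n - 1).bit_length() is log2 of the smallest power of two >= 2n.
    return 32 - (2 * max_pods_per_node - 1).bit_length()

def max_nodes(pod_range_prefix: int, max_pods_per_node: int) -> int:
    """Number of per-node blocks that fit in a Pod secondary range."""
    node_prefix = per_node_prefix(max_pods_per_node)
    if pod_range_prefix > node_prefix:
        return 0
    return 2 ** (node_prefix - pod_range_prefix)
```

For example, a /17 Pod range holds 512 nodes at 32 Pods per node but only 128 nodes at 110 Pods per node. Node count, not Pod count, is usually what exhausts the range.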

Chapter 17: GKE Multi-Tenancy & Workload Isolation

Why it matters: Multi-tenant GKE saves cost but requires careful isolation. Understand the boundaries of namespace isolation, resource quotas, and network policies to design it correctly.

Prerequisites: Chapters 8, 10, 12

Depth: 4/5

Subtopics:

Multi-tenancy models: soft (RBAC + NetworkPolicy) vs hard (separate clusters)
Namespace isolation: shared vs isolated resources
RBAC for multi-tenancy: ClusterRole vs Role, impersonation risks
NetworkPolicy for namespace isolation: ingress/egress rules
ResourceQuota per namespace: fair allocation
LimitRange: default requests/limits, per-container min/max (lets Pods satisfy quotas that require requests)
GKE Sandbox: kernel interception, use cases, overhead
Workload separation: dedicated node pools per team
Node isolation: sole-tenant nodes, HIPAA/PCI use cases
Multi-tenant logging: per-namespace routing
Hierarchical Namespace Controller: templates, policy propagation
Pod Security Standards per namespace: privilege restriction
Cost attribution: per-namespace billing

Chapter 18: GKE Fleet Management & Multi-Cluster Architecture

Why it matters: Production GKE deployments are usually multi-cluster. Fleet management reduces toil for platform teams operating tens to thousands of clusters.

Prerequisites: Chapters 5–17

Depth: 4/5

Subtopics:

Fleet concept: logical grouping clusters, hub membership
Fleet workload identity: unified identity across clusters
Config Sync: GitOps for Kubernetes config, syncing from Git/OCI
Config Sync architecture: RootSync, RepoSync, reconciler Pods
Config Sync sources: Git, OCI, Helm chart
Hierarchical repository structure: cluster/namespace/app configs
Policy Controller: OPA constraints, audit/enforce modes
Multi-Cluster Services (MCS): ServiceImport/ServiceExport, cross-cluster DNS
Multi-Cluster Ingress: global load balancing, cross-cluster backends
Multi-Cluster Gateway: Gateway API multi-cluster
Fleet-based RBAC: member clusters inherit policies
Config Controller: manage Google Cloud resources via Kubernetes CRDs
Fleet Observability: cross-cluster monitoring dashboards
Cloud Service Mesh (formerly Anthos Service Mesh) multi-cluster: cross-cluster traffic, trust federation
Network Connectivity Center: hub-and-spoke topology

PART III: NETWORKING & TRAFFIC MANAGEMENT

Chapter 19: VPC Architecture Deep Dive — Subnets, Routes, Firewall

Why it matters: The VPC is the foundation. Misunderstanding the VPC model leads to security gaps, unexpected traffic paths, and routing failures.

Prerequisites: Chapter 3, network fundamentals

Depth: 5/5

Subtopics:

VPC as a global resource; subnets as regional resources
Subnet primary range vs secondary ranges: sizing strategy
Routes: system-generated, static, dynamic via Cloud Router
Route selection: most specific destination prefix first, then priority (lower value wins)
Cloud Router architecture: regional, BGP sessions
BGP configuration: ASN, session establishment
Route propagation: advertisement, import, filtering
Firewall rules evaluation order: ingress/egress, priority
Firewall rule matching: Network Tags vs Service Accounts
VPC Peering: connectivity, firewall implications
Shared VPC: host vs service projects, subnet sharing
Private Google Access: routing for Google APIs
VPC Flow Logs: sampling, cost, export destinations
Network Intelligence Center: topology, connectivity tests
VPC Service Controls: perimeter security, access policies
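Firewall evaluation order can be modeled in three rules. Among the rules that match a packet, the lowest priority number wins. If an allow rule and a deny rule tie on priority, deny wins. If nothing matches, the implied rules apply: deny all ingress, allow all egress. In this minimal sketch the Rule type is hypothetical. Matching is reduced to a caller-supplied predicate in place of real network tag, service account, CIDR, and port matching.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Rule:
    name: str
    direction: str                   # "INGRESS" or "EGRESS"
    priority: int                    # 0..65535, lower number = higher precedence
    action: str                      # "allow" or "deny"
    matches: Callable[[dict], bool]  # stand-in for tag/SA/CIDR/port matching

def evaluate(rules: Iterable[Rule], packet: dict, direction: str) -> str:
    candidates = [r for r in rules if r.direction == direction and r.matches(packet)]
    if not candidates:
        # Implied rules (priority 65535): deny all ingress, allow all egress.
        return "deny" if direction == "INGRESS" else "allow"
    best = min(r.priority for r in candidates)
    top = [r for r in candidates if r.priority == best]
    # At equal priority, a deny rule takes precedence over an allow rule.
    return "deny" if any(r.action == "deny" for r in top) else "allow"
```

Because firewall rules are stateful, this decision applies to the first packet of a connection. Return traffic for an allowed connection is not evaluated again.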

Chapter 20: Cloud Load Balancing — Architecture & Mechanics

Why it matters: The load balancer is the traffic entry point. Misconfiguration leads to uneven distribution, health check failures, and SSL issues.

Prerequisites: Chapter 19, HTTP/HTTPS and TCP fundamentals

Depth: 5/5

Subtopics:

GCP LB taxonomy: L4 vs L7, internal vs external, regional vs global
Global External Application LB:
- Anycast, Maglev backend selection
- Google Front End (GFE)
- URL Maps: host-based, path-based routing
- Backend services, health checks, session affinity
- Cloud CDN, Cloud Armor integration
Regional External Application LB: Envoy-based, regional scope
Internal Application LB: Envoy inside the VPC, proxy-only subnet
Passthrough Network LBs: DSR mode, connection tracking
Network Endpoint Groups (NEGs):
- Zonal NEGs: VM endpoints
- Serverless NEGs: Cloud Run, App Engine
- Container-native LB: Pod IP NEGs
- Health checks: protocol-specific, interval/timeout/threshold
GKE Services:
- LoadBalancer type: External vs Internal
- NEG-based vs legacy
- SessionAffinity: ClientIP mode
- ExternalTrafficPolicy: Local vs Cluster
Connection draining: timeout mechanics, graceful shutdown
SSL policies: TLS versions, cipher suites
Cloud Armor integration: WAF, DDoS protection

Chapter 21: GKE Ingress & Gateway API — Exposing Applications

Why it matters: Ingress and Gateway are how applications are exposed externally. Misconfiguration leads to SSL issues, 502 errors, and security vulnerabilities.

Prerequisites: Chapters 7 and 20, Kubernetes Services

Depth: 5/5

Subtopics:

GKE LB overview: Gateway vs Ingress vs LoadBalancer Service
GKE Ingress (Legacy):
- Controller reconciliation: Ingress resources → GCP Application LB
- External vs Internal: annotation differences
- BackendConfig CRD: health check tuning, Cloud Armor, session affinity, CDN
- FrontendConfig CRD: SSL policies, HTTPS redirect
- Multi-cluster Ingress: cross-cluster routing
- Packet traversal: client → Google edge → pod
Gateway API (Recommended):
- GatewayClass, Gateway, HTTPRoute, TCPRoute, TLSRoute
- GKE Gateway controller implementation
- Path matching, header matching, traffic splitting
- TLS termination: Certificate Manager, managed certs
- Multi-cluster Gateway: global LB with multi-cluster backends
Container-native LB internals: NEGs with Pod IPs, Pod-level health checks
Standalone NEGs: manual management, use cases

Chapter 22: Cloud DNS & Service Discovery

Why it matters: DNS failure is a common cause of microservice outages. Understanding the resolution path helps debug name-resolution failures (NXDOMAIN, lookup timeouts) that often surface as generic connection errors.

Prerequisites: Chapter 4, DNS fundamentals

Depth: 5/5

Subtopics:

DNS resolution in a Pod: /etc/resolv.conf, ndots:5, search domains
ndots:5 impact: FQDN lookup path, negative caching, latency
Cluster DNS in GKE: kube-dns (GKE Standard default) vs CoreDNS (upstream default), plugin chain, behavior
Kubernetes DNS spec: <service>.<namespace>.svc.cluster.local discovery
Headless Services: DNS per Pod
ExternalName Services: CNAME resolution
NodeLocal DNSCache:
- DaemonSet, link-local IP 169.254.20.10
- Cache, fallback, latency reduction
Cloud DNS for GKE:
- Private zones, peering zones
- Split-horizon DNS
DNS debugging: nslookup, dig, kube-dns/CoreDNS logs
DNS performance tuning: cache sizing, TTL
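The ndots:5 cost comes from the stub resolver's search-list logic. A name with fewer dots than ndots is tried with each search domain appended before it is tried as-is. A trailing dot makes the name absolute. This minimal sketch follows glibc-style ordering. The namespace shop and the three-entry search list are illustrative. Real Pods often inherit extra node-level search domains, which add more queries.

```python
def query_order(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Order in which a stub resolver tries fully qualified names for `name`."""
    if name.endswith("."):
        return [name]  # absolute name: no search-list expansion
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        return [as_is] + expanded  # enough dots: try the name as-is first
    return expanded + [as_is]

# Search list of a Pod in the (hypothetical) "shop" namespace.
SEARCH = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

query_order("storage.googleapis.com", SEARCH) tries three cluster-local names before the real one, so each lookup costs four queries. With parallel A/AAAA queries, that doubles to eight. Writing external names with a trailing dot, or lowering ndots through the Pod's dnsConfig, avoids the extra queries.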

Chapter 23: Cloud NAT — Port Allocation & Exhaustion Prevention

Why it matters: Cloud NAT port exhaustion is a silent failure: connections drop without a clear error message. Understand the allocation mechanics for capacity planning.

Prerequisites: Chapter 19, NAT/SNAT fundamentals

Depth: 5/5

Subtopics:

Cloud NAT architecture: distributed NAT, Andromeda integration
NAT translation: SNAT flow, source IP/port replacement
Port allocation modes: static vs dynamic
Port math: 64,512 ports per NAT IP / ports-per-VM = max VMs
5-tuple constraint: reuse delay, TCP TIME_WAIT
Port exhaustion symptoms: NAT_ALLOCATION_FAILED, connection drops
Mitigation: more NAT IPs, connection pooling, keep-alives
Cloud NAT with GKE: node VM egress, private cluster setup
NAT metrics: port usage, dropped connections
NAT rules: custom IP ranges, logging
Timeouts: TCP, UDP, ICMP
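The port math above can be worked through in code. Each NAT IP offers 64,512 usable source ports, which is 65,536 minus the 1,024 well-known ports. With static allocation, every VM reserves minPortsPerVm of those ports. This minimal sketch covers static allocation only and uses hypothetical function names. Dynamic allocation lets per-VM usage grow up to maxPortsPerVm.

```python
PORTS_PER_NAT_IP = 65_536 - 1_024  # 64,512 usable source ports per NAT IP

def max_vms(nat_ips: int, min_ports_per_vm: int) -> int:
    """Maximum VMs a NAT gateway can serve with static port allocation."""
    return (nat_ips * PORTS_PER_NAT_IP) // min_ports_per_vm

def nat_ips_needed(vms: int, min_ports_per_vm: int) -> int:
    """Smallest number of NAT IPs that gives every VM its full port reservation."""
    total = vms * min_ports_per_vm
    return -(-total // PORTS_PER_NAT_IP)  # ceiling division
```

One IP at the default of 64 ports per VM covers 1,008 VMs. Raising the reservation to 1,024 ports per VM drops that to 63 VMs per IP. The per-VM port count also caps concurrent connections from one VM to a single destination IP:port, which is where GKE nodes running many Pods usually hit exhaustion first.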

Chapter 24: Private Service Connect — Modern Service Exposure

Why it matters: PSC is the modern way to expose services without VPC peering. Understanding PSC helps you design multi-tenant service architectures.

Prerequisites: Chapter 19, VPC Peering concepts

Depth: 5/5

Subtopics:

PSC components: Service Attachment (producer), PSC Endpoint (consumer)
Service producer → PSC endpoint → consumer VPC connectivity
PSC vs VPC Peering: routing, security differences, use cases
PSC for Google APIs: private endpoints
PSC for GKE control plane: private cluster access
PSC for managed services: Cloud SQL, Memorystore, AlloyDB
PSC NAT: overlapping IP ranges handling
PSC consumer vs producer: IAM, approval workflow
PSC global access: cross-region consumers
PSC DNS: A record creation
Troubleshoot PSC: connectivity tests, flow logs

Chapter 25: Cloud Router & BGP Internals

Why it matters: Cloud Router is the control plane for dynamic routing. A BGP misconfiguration means routes are not advertised or incorrect routes propagate.

Prerequisites: Chapter 19, BGP fundamentals

Depth: 4/5

Subtopics:

Cloud Router architecture: regional, BGP sessions
eBGP vs iBGP: routing dynamics
ASN configuration: private ranges, conflicts
BGP session establishment: OPEN, KEEPALIVE, UPDATE
Route advertisement: default mode (VPC subnets) vs custom mode
Custom route advertisement: advertising additional IP ranges
Route import: from on-premises
BGP communities: filtering, tagging
BFD: fast failover
Cloud Router with Cloud VPN: dynamic routing
Cloud Router with Cloud Interconnect: VLAN attachments
Multi-regional routing: global vs regional modes
Route filtering: import/export policies
Monitoring BGP sessions: status, routes

Chapter 26: Cloud Interconnect & Cloud VPN — Hybrid Connectivity

Why it matters: Hybrid connectivity is the foundation of enterprise GCP deployments. Design decisions affect latency, cost, and security posture.

Prerequisites: Chapter 25, basic MPLS/WAN networking

Depth: 4/5

Subtopics:

Cloud VPN:
- Classic vs HA: redundancy, SLA
- VPN tunnel mechanics: IKE, ESP
- Dynamic routing via Cloud Router
- MTU considerations, TCP MSS clamping
Cloud Interconnect:
- Dedicated vs Partner: bandwidth, latency
- VLAN attachments: logical connections
- BGP sessions over Interconnect
- Redundancy: 99.9% vs 99.99% SLA topologies
- MACsec: L2 encryption
Network Connectivity Center: hub-and-spoke
Production patterns: active-passive, active-active failover
Monitoring: interface metrics, BGP state, packet loss

Chapter 27: Network Security — Firewall Policies, Cloud NGFW, Cloud Armor

Why it matters: Network security is the outer perimeter. Misconfigured firewalls expose sensitive services or block legitimate traffic.

Prerequisites: Chapter 19, security fundamentals

Depth: 4/5

Subtopics:

VPC firewall rules: stateful, ingress/egress, priorities
Hierarchical firewall policies: organization-level enforcement
Cloud NGFW:
- L7 inspection, FQDN rules
- IDS integration
VPC Service Controls: perimeter design, data exfiltration prevention
Cloud Armor:
- WAF rules, OWASP ruleset
- Adaptive Protection
- Rate limiting, security policies
Cloud IDS: intrusion detection
Network Intelligence Center: firewall insights
Secure Web Proxy: egress filtering
Private NAT: secure egress patterns

PART IV: STORAGE & DATA SYSTEMS

Chapter 28: Cloud Storage — Architecture, Consistency, Performance

Why it matters: GCS is the universal data store. Understanding its consistency model and performance characteristics prevents data races and slow reads.

Prerequisites: Object storage concepts, HTTP/S basics

Depth: 4/5

Subtopics:

GCS object model: buckets, objects, generations
Strong consistency: post-2021 guarantee
Storage classes: Standard, Nearline, Coldline, Archive
Location types: multi-region, dual-region, regional
Lifecycle management: tiering, deletion
Uniform bucket-level access: IAM vs ACLs
Signed URLs: V4 signing, expiry
Requester Pays
Cloud Storage FUSE: POSIX interface
Transfer Service: bulk migration
VPC Service Controls integration
Performance: throughput scaling, parallel uploads
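Parallel composite uploads split a file into parts, upload the parts concurrently, and then combine them with a compose request. A single compose request accepts at most 32 source objects. This minimal planning sketch computes byte ranges for the parts. The helper name and the 32 MiB minimum part size are assumptions. The actual upload and compose calls, for example through the google-cloud-storage client, are out of scope.

```python
MAX_COMPOSE_SOURCES = 32  # source objects allowed in one compose request

def plan_parts(size: int, min_part: int = 32 * 1024 * 1024) -> list[tuple[int, int]]:
    """Return (start, end_exclusive) byte ranges for a single-level composite upload."""
    if size <= 0:
        return []
    # Each part is at least min_part (unless the file is smaller), capped at 32 parts.
    parts = max(1, min(MAX_COMPOSE_SOURCES, size // min_part))
    base, extra = divmod(size, parts)
    ranges, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + length))
        start += length
    return ranges
```

Capping the plan at 32 parts keeps it to one compose call. Larger fan-outs need composing in several levels. The temporary part objects count toward storage and operation costs until they are deleted.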

Chapter 29: Persistent Disk & Hyperdisk — Block Storage

Why it matters: Disk type and sizing directly affect application performance. IOPS/throughput limits are often misunderstood.

Prerequisites: Compute Engine basics

Depth: 4/5

Subtopics:

PD types: standard, balanced, ssd, extreme — IOPS/throughput
Performance caps: formula, VM-level limits
Multi-writer disks: limitations
Hyperdisk:
- Types: Balanced, Extreme, ML, Throughput
- Provisioned performance: capacity + IOPS/throughput
Snapshots: incremental, cross-region copies
Regional PD: replication, failover
Encryption: Google-managed, CMEK
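The "performance caps" formula for Persistent Disk takes the minimum of a size-based disk limit and the per-VM limit. In this minimal sketch, the baseline and per-GB coefficients are illustrative placeholders with the same shape as pd-balanced: a fixed baseline plus a per-GB rate. They are not authoritative figures. Take real values from the current Persistent Disk performance tables for your disk type and machine type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiskPerfModel:
    baseline_iops: float
    iops_per_gb: float
    baseline_mbps: float
    mbps_per_gb: float

def effective_perf(model: DiskPerfModel, size_gb: int,
                   vm_max_iops: float, vm_max_mbps: float) -> tuple[float, float]:
    """(IOPS, MB/s) actually achievable: the size-based disk limit capped by the VM limit."""
    disk_iops = model.baseline_iops + model.iops_per_gb * size_gb
    disk_mbps = model.baseline_mbps + model.mbps_per_gb * size_gb
    return min(disk_iops, vm_max_iops), min(disk_mbps, vm_max_mbps)

# Placeholder coefficients (assumption): verify against the PD docs before use.
ILLUSTRATIVE_BALANCED = DiskPerfModel(3000, 6, 140, 0.28)
```

The takeaway is that growing a disk stops helping once the VM cap binds. Past that point, a larger machine type, or Hyperdisk with provisioned performance, is the lever.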

Chapter 30: Filestore & Advanced Storage Options

Why it matters: Filestore provides shared NFS for multi-reader workloads. Misunderstanding the performance tiers leads to I/O bottlenecks.

Prerequisites: Chapter 11, NFS basics

Depth: 3/5

Subtopics:

Filestore tiers: Basic HDD, Basic SSD, Zonal, Regional, Enterprise
Performance: IOPS and throughput per tier
Filestore CSI: dynamic provisioning
Multishares: one instance → multiple PVCs
Backup: snapshots, recovery
Cross-zone: Regional tier, HA

PART V: IAM, SECURITY & COMPLIANCE

Chapter 31: IAM Deep Dive — Model, Propagation, Conditions

Why it matters: IAM is the primary access-control layer in GCP. Understanding its propagation model and conditions prevents privilege escalation.

Prerequisites: Chapter 1, resource hierarchy

Depth: 5/5

Subtopics:

IAM policy model: allow vs deny policies
Role types: basic, predefined, custom
Resource-level vs project-level vs org-level: hierarchy
IAM propagation: eventual consistency, caching
Condition expressions (CEL): time-based, resource-based
IAM Deny: deny before allow evaluation
Service accounts: SA key management, impersonation
Default service accounts: dangers
Audit logging: Admin Activity, Data Access
VPC Service Controls: perimeter concept
Organization Policies: constraints, inheritance
IAM Recommender: least-privilege suggestions
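Deny-before-allow evaluation can be modeled as a two-pass check. Deny policies are evaluated first, and a matching deny without a matching exception principal ends evaluation. Only then do allow policies grant access, and the default is no access. This is a simplified sketch with hypothetical data structures and simplified permission strings (real deny policies use fully qualified permission names). It ignores conditions, hierarchy inheritance, principal sets, and wildcards.

```python
from dataclasses import dataclass, field

@dataclass
class DenyRule:
    denied_principals: set[str]
    denied_permissions: set[str]
    exception_principals: set[str] = field(default_factory=set)

@dataclass
class AllowBinding:
    role_permissions: set[str]  # permissions contained in the bound role
    members: set[str]

def check(principal: str, permission: str,
          deny_rules: list[DenyRule], bindings: list[AllowBinding]) -> bool:
    # Pass 1: deny policies are checked before any allow policy.
    for rule in deny_rules:
        if (principal in rule.denied_principals
                and permission in rule.denied_permissions
                and principal not in rule.exception_principals):
            return False
    # Pass 2: allow policies; with no matching binding, access is denied.
    return any(principal in b.members and permission in b.role_permissions
               for b in bindings)
```

The model shows why a deny policy is the safer guardrail. A broad role granted later in an allow policy cannot override it.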

Chapter 32: Secret Manager & Cloud KMS — Secrets & Encryption

Why it matters: Secrets management is a critical security control. Understand Secret Manager storage mechanics and the KMS key hierarchy to design correct encryption strategies.

Prerequisites: Chapter 31, encryption fundamentals

Depth: 4/5

Subtopics:

Secret Manager versioning: states, aliases
Replication: automatic vs manual
Secret rotation: scheduling, Pub/Sub notifications
Accessing secrets in GKE: CSI driver, sidecar, init container
Secret Manager vs environment variables: trade-offs
Cloud KMS:
- Key hierarchy: Key Ring → CryptoKey → CryptoKeyVersion
- Key purposes: ENCRYPT_DECRYPT, ASYMMETRIC_SIGN, etc.
- Protection levels: SOFTWARE, HSM, EXTERNAL
- Key rotation: automatic, manual
- Envelope encryption: DEK encrypted by KEK
- CMEK: customer-managed encryption
- Cloud EKM: keys managed outside Google
- Key version destruction: scheduled-for-destruction period (default 30 days, configurable, minimum 24h)

Chapter 33: VPC Service Controls & Organization Policies

Why it matters: VPC SC is the primary control for preventing data exfiltration. Org Policies provide guardrails at scale.

Prerequisites: Chapters 31, 32

Depth: 4/5

Subtopics:

VPC SC architecture: perimeter, protected resources, access levels
Ingress/Egress rules: fine-grained cross-perimeter access
Dry-run mode: test before enforce
Org Policy constraints: compute.restrictCloudNATUsage, etc.
Custom constraints: CEL expressions
Policy inheritance: exceptions
Policy Troubleshooter: debug denials

Chapter 34: Binary Authorization — Secure Container Deployment

Why it matters: Binary Authorization ensures that only trusted images are deployed. Bypass mechanisms and misconfiguration are real security risks.

Prerequisites: Chapter 32, container basics, GKE admission

Depth: 4/5

Subtopics:

Model: policy, attestors, attestations, deployment decision
Attestor types: Note resources
Attestation: cryptographic signatures, PGP/PKIX signing
Cloud Build integration: automated attestation
BinAuthz enforcement path: admission → policy evaluation
Image digest pinning: why digests matter
Continuous validation: re-evaluate, evict non-compliant
Dry-run vs enforcement mode: gradual rollout
Break-glass override: emergency bypass
Policy exceptions: allowlisted images
Artifact Analysis integration: CVE scanning

PART VI: MESSAGING & DISTRIBUTED SYSTEMS

Chapter 35: Cloud Pub/Sub — Architecture & Delivery Semantics

Why it matters: Pub/Sub is the messaging backbone. Understanding delivery semantics, ordering, and failure modes prevents duplicate processing and message loss.

Prerequisites: Distributed systems fundamentals, messaging patterns

Depth: 5/5

Subtopics:

Pub/Sub distributed log: sharding, replication
Message lifecycle: publish → store → deliver → ack
Delivery semantics:
- At-least-once (default)
- Exactly-once (opt-in, regional constraint)
Ack deadline: extension, max 600s
Push vs Pull:
- Pull API: unary vs StreamingPull
- Push subscriptions: HTTP endpoint, retry mechanics
Ordering keys:
- Per-key ordering guarantee, regional scope
- Single-region endpoint requirement
Dead Letter Topics:
- Trigger conditions: max delivery attempts
- DLT subscription: processing dead letters
Backpressure patterns:
- Flow control: maxOutstandingMessages
- Subscriber scaling with backlog
- Metrics: undelivered messages, oldest ack age
Message schemas: Avro, Protocol Buffers
Consumer scaling patterns

Chapter 36: Pub/Sub Regional Failure Behavior

Why it matters: Pub/Sub has a global SLA, but a regional failure can still affect message delivery. Understand this behavior to design resilient consumers.

Prerequisites: Chapter 35, GCP regions/zones

Depth: 5/5

Subtopics:

Pub/Sub storage model: multi-region messages
Regional endpoint publishing: key to ordering
Regional failure impact: ordering resumption, redelivery
Pub/Sub + Dataflow: exactly-once processing
Message deduplication: Pub/Sub role vs application-level
Subscriber failover: multiple instances, lease competition
Monitoring: error rates, latency spikes
Recovery patterns: reprocessing, timestamp seeking
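Application-level deduplication under at-least-once delivery usually means making the handler idempotent on a stable key. This covers the redeliveries that follow a regional failure or a seek. The minimal in-memory sketch below keys on a message identifier. A publisher-supplied business key is sturdier than the Pub/Sub message ID, because a publisher retry can yield a new message ID for the same content. In production, the seen-set would live in a shared store with a TTL longer than the redelivery window. The class name is hypothetical.

```python
from collections import OrderedDict
from typing import Callable

class IdempotentHandler:
    """Skip messages whose key was already processed (bounded LRU memory)."""

    def __init__(self, handler: Callable[[bytes], None], capacity: int = 100_000):
        self._handler = handler
        self._seen: OrderedDict = OrderedDict()
        self._capacity = capacity

    def handle(self, key: str, data: bytes) -> bool:
        """Return True if processed now, False if skipped as a duplicate."""
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency of a hot duplicate
            return False
        self._handler(data)              # if this raises, the key is not recorded
        self._seen[key] = None           # mark only after successful processing
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently seen key
        return True
```

Recording the key only after the handler succeeds means a crash between the two steps can still produce a duplicate. Making the two steps atomic requires a transactional store shared with the handler's side effects.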

Chapter 37: Eventarc — Event Routing & CloudEvents

Why it matters: Eventarc is a managed event bus. Understanding event routing and the CloudEvents standard is needed to design event-driven architectures correctly.

Prerequisites: Chapter 35, CloudEvents spec basics

Depth: 3/5

Subtopics:

Event sources: Audit Logs, Pub/Sub, Cloud Storage
Triggers: event filtering, service account requirements
Destinations: Cloud Run, GKE, Workflows, Cloud Functions
CloudEvents format: context attributes, data
Delivery guarantees: at-least-once, retry
Dead letter handling
Eventarc Advanced: channels, buses, pipelines
IAM integration

Chapter 38: Cloud Tasks — Asynchronous Task Execution

Why it matters: Cloud Tasks is a managed task queue for asynchronous work. Understanding retries and rate limiting prevents thundering herds and duplicate execution.

Prerequisites: Distributed systems, HTTP fundamentals

Depth: 3/5

Subtopics:

Cloud Tasks vs Pub/Sub: when to use each
Task queue model: explicit execution
Rate limiting: dispatch rate, max burst
Retry: exponential backoff, configurable
Task deduplication: ID-based, 1 hour window
HTTP targets: authentication
Dead letter tasks: logging, alerting
Pause/Resume: operational patterns
Integration with GKE: HTTP target to service
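Cloud Tasks retry timing is controlled by minBackoff, maxBackoff, and maxDoublings. The interval starts at minBackoff and doubles maxDoublings times. It then grows linearly by minBackoff × 2^maxDoublings, and finally stays at maxBackoff. This minimal sketch of that schedule uses a hypothetical function name. With minBackoff=10s, maxBackoff=300s, and maxDoublings=3, it produces 10, 20, 40, 80, 160, 240, 300, 300, the example sequence in the Cloud Tasks RetryConfig documentation.

```python
def retry_intervals(min_backoff: float, max_backoff: float,
                    max_doublings: int, attempts: int) -> list[float]:
    """Delay before each of the first `attempts` retries of a Cloud Tasks task."""
    intervals = []
    interval = min_backoff
    linear_step = min_backoff * 2 ** max_doublings  # increment after doubling ends
    for i in range(attempts):
        intervals.append(min(interval, max_backoff))
        if i < max_doublings:
            interval *= 2              # exponential phase
        else:
            interval += linear_step    # linear phase, later capped by max_backoff
    return intervals
```

Computing the schedule up front makes it easy to check that maxAttempts × intervals fits within the downstream service's recovery time, instead of retrying into an outage.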

PART VII: OBSERVABILITY & RELIABILITY ENGINEERING

Chapter 39: Cloud Monitoring — Metrics, Alerting, SLOs

Why it matters: Cloud Monitoring is the single pane of glass for GCP. Understand the metrics model, alerting mechanics, and SLO framework to build reliable services and respond quickly.

Prerequisites: Chapter 14, SRE fundamentals

Depth: 5/5

Subtopics:

Metric kinds: GAUGE, DELTA, CUMULATIVE
Value types: BOOL, INT64, DOUBLE, STRING, DISTRIBUTION
Monitored resource types
Free vs chargeable metrics
Managed Service for Prometheus: PromQL, Rule Evaluator
Alerting architecture:
- Policies: conditions, notification channels
- Condition types: metric threshold, log-based, uptime checks
- Auto-close, repeat intervals
- Notification channels: reliability considerations
SLO framework:
- SLI types: availability, latency, quality
- Compliance periods: rolling vs calendar
- Error budget: consumption tracking
- Burn rate: the select_slo_burn_rate time-series selector
- Multi-window alerting: fast + slow burn
Dashboards as code with Terraform
USE/RED/Golden Signals methods
Alert best practices: false positive reduction
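Burn rate is the observed error ratio divided by the error budget (1 − SLO target). A burn rate of 1 spends exactly the budget over the compliance period. This minimal sketch implements multi-window alerting, assuming you already have error ratios per window. The 14.4 threshold with 1h/5m windows is the common fast-burn example from the SRE Workbook: 2% of a 30-day budget spent in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget

def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours

def should_page(long_ratio: float, short_ratio: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Fire only if both the long (e.g. 1h) and short (e.g. 5m) windows burn fast."""
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)
```

The short window is what makes the alert resolve quickly once errors stop. A burst that has already ended keeps the 1h burn rate high, but the 5m window drops below the threshold and suppresses the page.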

Chapter 40: Cloud Logging — Architecture, Routing, Cost Management

Why it matters: Cloud Logging ingestion can become very expensive if it is not managed. Understand the routing architecture to send the right logs to the right destination at the right cost.

Prerequisites: Chapter 39

Depth: 4/5

Subtopics:

Log types: Platform, User-written, Security logs
Audit logs: Admin Activity (free), Data Access (paid), System Event
Log router: _Default, _Required buckets, exclusion filters, sinks
Sinks: destinations, inclusion/exclusion filters
Log-based metrics: counters, distributions
Retention: default 30 days, custom up to 3650 days
Field exclusions: reduce ingestion cost
Advanced queries: Logging query language, Log Analytics (SQL)
Log alerting: log-based metric + policy
Debug patterns: correlate logs

Chapter 41: Cloud Trace, Profiler, Error Reporting

Why it matters: Distributed tracing and profiling are the tools for debugging latency in microservices. Error Reporting prioritizes bugs by frequency.

Prerequisites: Chapters 39, 40, OpenTelemetry basics

Depth: 3/5

Subtopics:

Cloud Trace: trace collection, sampling, storage
Trace propagation: HTTP headers, W3C Trace Context standard
OpenTelemetry integration
Trace → Logs → Metrics correlation
Cloud Profiler: CPU, Heap, Goroutine profiles
Continuous profiling overhead
Error Reporting: grouping, affected users
Error notifications
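W3C Trace Context propagates a trace in a traceparent header of the form version-traceid-parentid-flags: 2, 32, 16, and 2 lowercase hex characters. This minimal parser sketch handles version 00 only and uses a hypothetical function name. In practice, the OpenTelemetry propagators do this parsing for you.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    m = _TRACEPARENT.match(header.strip())
    if not m or m["version"] != "00":
        return None
    # All-zero trace or parent IDs are invalid per the spec.
    if set(m["trace_id"]) == {"0"} or set(m["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(m["flags"], 16) & 0x01)  # bit 0 = sampled flag
    return TraceParent(m["trace_id"], m["parent_id"], sampled)
```

The trace_id parsed here is the value that links logs to traces. Writing it into structured log entries lets Cloud Logging show related log lines next to the trace.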

Chapter 42: SRE Practices on GCP — SLOs, Incident Response, Chaos

Why it matters: Applying SRE principles on GCP requires understanding both people and systems. Error budgets, incident response, and toil reduction are practical skills.

Prerequisites: Chapters 39–41, SRE Book concepts

Depth: 5/5

Subtopics:

SLI/SLO/Error Budget: crafting meaningful SLIs
Error budget policy: release freeze on exhaustion
Incident response: levels, roles (IC, SME, Comms)
Runbooks: machine-readable, regularly tested
Blameless postmortems: 5 Whys, contributing factors, action items
Blast radius reduction: canaries, circuit breakers, feature flags
Graceful degradation: fallback responses, cached data
Failure injection (Chaos Engineering):
- Fault injection via Istio
- Node/Pod disruption testing
- Chaos Mesh on GKE
Timeout hierarchies: prevent cascading
Retry budgets: prevent storms
Load shedding: server-side rejection
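A retry budget caps retries as a fraction of first-attempt traffic, so a struggling backend does not see its load multiplied. This minimal sketch assumes a 10% ratio, a small floor for low-traffic clients, and simple per-process counters with no time decay. The class name is hypothetical. Production implementations typically use sliding windows or token buckets.

```python
class RetryBudget:
    """Allow retries only while retries <= ratio * requests (with a small floor)."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 3):
        self.ratio = ratio
        self.min_retries = min_retries  # lets low-traffic clients still retry a little
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first attempt."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and consume budget if another retry is allowed."""
        allowed = max(self.min_retries, self.ratio * self.requests)
        if self.retries + 1 > allowed:
            return False  # budget exhausted: fail fast instead of retrying
        self.retries += 1
        return True
```

With a 10% budget, a full backend outage raises load by at most 1.1× instead of the (1 + max_attempts)× caused by naive per-request retries. This is the property that prevents retry storms.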

Chapter 43: GKE Production Debugging Methodology

Why it matters: Debugging production issues requires a systematic approach; "random kubectl exec" is an antipattern. Build a mental model for structured debugging.

Prerequisites: All GKE chapters

Depth: 5/5

Subtopics:

Pod lifecycle debugging:
- Pending: node affinity, resources, PVC, scheduler events
- CrashLoopBackOff: exit codes, logs
- OOMKilled: container vs cgroup, memory leak detection
- Init container failures
Service connectivity debugging:
- kubectl exec + curl pattern
- DNS resolution: nslookup, dig
- NetworkPolicy violations: Hubble
- ClusterIP routing: iptables/eBPF verification
- port-forward to bypass the LB
Node debugging:
- NotReady: logs, kubelet, containerd status
- Disk pressure: df, du, describe
- CPU throttling: metrics, cgroup limits
- Network issues: routes, conntrack
Control plane debugging:
- API latency: metrics
- etcd performance: slow ops
- Webhook timeouts
- Scheduler failures: logs, events
Cross-cutting debugging:
- Request tracing: LB → node → pod with Cloud Trace
- Correlate: access logs + app logs + traces
- gcloud container operations list
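When reading exit codes for CrashLoopBackOff and OOMKilled containers, codes above 128 conventionally mean the process was killed by signal (code − 128). Code 137 is SIGKILL, which the kernel OOM killer sends and which the kubelet sends after the grace period expires. Code 143 is SIGTERM. This minimal decoding sketch uses Python's signal module. The hint strings are a hypothetical triage aid. The container's reason field in kubectl describe (for example OOMKilled) remains the authoritative signal.

```python
import signal

def decode_exit_code(code: int) -> str:
    """Map a container exit code to a short triage hint (Linux signal numbering)."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        hints = {
            "SIGKILL": "killed: OOM kill or grace period exceeded; check reason/OOMKilled",
            "SIGTERM": "graceful termination requested (eviction, rollout, scale-down)",
            "SIGSEGV": "segmentation fault in the process",
        }
        return f"{name}: {hints.get(name, 'killed by signal')}"
    return f"application error (exit {code}); check container logs"
```

An exit code of 137 without reason OOMKilled usually means the kubelet killed the container after terminationGracePeriodSeconds. In that case, look at shutdown handling rather than memory limits.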

PART VIII: PLATFORM AUTOMATION & CI/CD

Chapter 44: Terraform on GCP — State Management, Modules, IaC Patterns

Why it matters: Infrastructure as Code is non-negotiable in production. Understanding GCP-specific Terraform patterns and state management prevents drift and destructive applies.

Prerequisites: Terraform fundamentals, Chapters 1–7

Depth: 4/5

Subtopics:

Google provider: authentication (ADC, impersonation)
GCS backend: remote state, locking
State file security: encryption, access, versioning
Module design: reusable modules for GKE, VPC, IAM
Resource dependencies: explicit vs implicit
lifecycle blocks: prevent_destroy, ignore_changes
terraform import: existing resource management
State manipulation: move, rm, show
Workspaces: isolated state per environment
Drift detection: scheduled terraform plan
Testing: terraform validate, terratest, conftest
CI/CD integration: Cloud Build pipeline
Cost estimation: Infracost
Google Cloud Foundation Fabric: reference modules

Chapter 45: Cloud Build & Artifact Registry — CI/CD Pipeline

Why it matters: A secure CI/CD pipeline is a critical security control. Understanding the Cloud Build execution model and Artifact Registry security helps prevent supply-chain attacks.

Prerequisites: Chapter 32, Docker/containers, CI/CD fundamentals

Depth: 4/5

Subtopics:

Cloud Build architecture: build steps, cloudbuild.yaml, workers
Default vs custom service account: least privilege
Private worker pools: VPC, peering, no public IP
Triggers: Cloud Source Repos, GitHub, GitLab, schedule, webhook
Build caching: layer caching, custom (GCS), speed optimization
Substitution variables: built-in, custom interpolation
Artifact Registry:
- Docker, Maven, npm, Python, generic repositories
- Regional, cost-efficient artifact storage
- Container vulnerability scanning: on-push, continuous
- SBOM generation
- Cleanup policies: deletion, protection
Build provenance: SLSA provenance generated by Cloud Build
Binary Authorization integration
Cloud Deploy: managed delivery, canary, blue-green
allowedIntegrations org policy

Chapter 46: Cloud Deploy & GitOps — Progressive Delivery

Why it matters: Cloud Deploy provides managed CD with built-in approvals, rollback, and tracking. Understanding its mechanics is critical to designing safe deployment pipelines.

Prerequisites: Chapter 45, Kubernetes Deployments

Depth: 4/5

Subtopics:

Cloud Deploy model: pipelines, targets, releases, rollouts
Pipeline definition: serial target sequence, promotion flow
Target types: GKE, Cloud Run, custom
Approval flows: manual gates
Rollback mechanics: one-click, automatic
Canary deployments: traffic splitting
Blue-green deployments: parallel, cutover
Deployment verification: post-deploy checks
Hooks: pre-deploy, post-deploy, verify
Cloud Deploy IAM: deployer, approver roles
Notifications: Pub/Sub, Slack
Deploy history: audit trail
GitOps patterns:
- Config Sync: GitOps for GKE
- Syncing from Git/OCI sources with continuous reconciliation
- Multi-cluster Config Sync
- Policy Controller with GitOps
- Fleet management integration
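Config Sync keeps a cluster converging on the state declared in Git or an OCI image, and any such reconciler is built around a diff between desired and live objects. Below is a minimal sketch of that diff. It assumes objects are keyed by a hypothetical `(kind, namespace, name)` tuple and compared as whole dicts. Config Sync itself applies objects through the API server and tracks which ones it manages; the `managed` set only imitates that tracking.

```python
from typing import Any

Key = tuple[str, str, str]  # hypothetical key: (kind, namespace, name)


def plan_reconcile(
    desired: dict[Key, dict[str, Any]],
    live: dict[Key, dict[str, Any]],
    managed: set[Key],
) -> dict[str, list[Key]]:
    """Compute create/update/delete actions that move `live` toward `desired`.

    Only objects previously applied by the reconciler (`managed`) are pruned,
    so resources created by other tools are left alone.
    """
    create = sorted(k for k in desired if k not in live)
    update = sorted(k for k in desired if k in live and live[k] != desired[k])
    delete = sorted(k for k in live if k not in desired and k in managed)
    return {"create": create, "update": update, "delete": delete}
```

Restricting deletes to `managed` keys keeps a GitOps reconciler from pruning objects that were created by hand or by other tools.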

PART IX: ADVANCED PRODUCTION PATTERNS

Chap 47: GKE Service Mesh — Cloud Service Mesh (Managed Istio)

Why it matters: A service mesh provides mTLS, observability, and traffic management at the infrastructure level. Understanding Istio/Envoy mechanics is necessary to debug connection failures and tune performance.

Prerequisites: Chap 7, 21, microservices patterns

Depth level: 4/5

Sub-topics:

Cloud Service Mesh (CSM): managed Istio/Envoy
Data plane vs control plane: Envoy sidecars vs Istiod
Sidecar injection: automatic, init container iptables rules
mTLS: PERMISSIVE vs STRICT, certificate lifecycle, SPIFFE/SVID
Traffic management:
- VirtualService, DestinationRule, Gateway (Istio)
- Load balancing algorithms, circuit breaking
- Retries, timeouts
Envoy xDS API: CDS, EDS, LDS, RDS, SDS
Distributed tracing: trace propagation, sampling
Observability: Envoy metrics, access logs (Hubble belongs to GKE Dataplane V2, not the mesh)
CSM dashboard: SLOs, topology
Sidecar resource consumption
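Istio configures circuit breaking on a DestinationRule. Its `outlierDetection` block ejects an endpoint after `consecutive5xxErrors` failures, and Envoy keeps the endpoint ejected for `baseEjectionTime` multiplied by the number of times it has been ejected. Below is a simplified single-host model with an injected clock, useful for reasoning about why an endpoint disappears and for how long. It omits `maxEjectionPercent` and the periodic detection interval, and it is not Envoy code.

```python
from dataclasses import dataclass


@dataclass
class HostOutlierState:
    """Simplified per-host outlier detection, loosely modeled on Envoy/Istio."""

    consecutive_5xx_threshold: int = 5   # cf. outlierDetection.consecutive5xxErrors
    base_ejection_seconds: float = 30.0  # cf. outlierDetection.baseEjectionTime
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0

    def is_available(self, now: float) -> bool:
        return now >= self.ejected_until

    def record(self, status: int, now: float) -> None:
        """Record a response status observed at time `now` (seconds)."""
        if status >= 500:
            self.consecutive_5xx += 1
            if self.consecutive_5xx >= self.consecutive_5xx_threshold:
                # Each repeated ejection lasts longer: base time x ejection count.
                self.times_ejected += 1
                self.ejected_until = now + self.base_ejection_seconds * self.times_ejected
                self.consecutive_5xx = 0
        else:
            self.consecutive_5xx = 0  # any success resets the streak
```

The growing ejection time explains a common debugging surprise: a flapping backend stays out of rotation longer each time, even after it recovers.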

Chap 48: Multi-Cluster Architecture & Networking

Why it matters: Multi-cluster is a standard pattern for production GKE deployments. Networking across clusters adds complexity that calls for specific patterns.

Prerequisites: Chap 18 (Fleet), 21, 47

Depth level: 4/5

Sub-topics:

Multi-cluster use cases: HA, DR, data residency, scale
Multi-Cluster Services (MCS): ServiceImport/ServiceExport, DNS
Multi-Cluster Ingress: global LB, cross-cluster backends
Gateway API multi-cluster
Workload identity across clusters
Cross-cluster service mesh: trust federation
Network isolation between clusters: VPC peering
DNS peering across clusters
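With GKE Multi-Cluster Services, you export a Service by creating a ServiceExport with the same name and namespace. Clients in other fleet member clusters then reach it at `SERVICE.NAMESPACE.svc.clusterset.local`. The small sketch below builds both. `net.gke.io/v1` is the API group used by GKE's MCS implementation; the upstream MCS API uses `multicluster.x-k8s.io`, so check the CRDs installed in your fleet. The names `checkout` and `shop` are placeholders.

```python
import json


def service_export(name: str, namespace: str) -> dict:
    """ServiceExport manifest; name/namespace must match the exported Service."""
    return {
        "apiVersion": "net.gke.io/v1",
        "kind": "ServiceExport",
        "metadata": {"name": name, "namespace": namespace},
    }


def clusterset_dns_name(name: str, namespace: str) -> str:
    """DNS name consumers use to reach the exported Service across the fleet."""
    return f"{name}.{namespace}.svc.clusterset.local"


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(service_export("checkout", "shop"), indent=2))
```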

Chap 49: GKE AI/ML Infrastructure — GPU, TPU, Large-Scale Workloads

Why it matters: AI/ML is a dominant workload pattern. GPU/TPU infrastructure has distinctive characteristics that must be understood to maximize utilization and minimize cost.

Prerequisites: Chap 6, 9, GPU/accelerator fundamentals

Depth level: 4/5

Sub-topics:

GPU node pools: machine types, driver installation
NVIDIA device plugin: resource scheduling
Multi-Instance GPU (MIG): hardware partitioning for GPU sharing
GPU Time-Slicing: oversubscription
TPU types: topologies, slice configuration
Dynamic Workload Scheduler: gang scheduling
ProvisioningRequest: batch job reservations
NCCL Fast Socket: inter-GPU communication
Multi-NIC Pods: GPUDirect, RDMA
RDMA networking (RoCE, not InfiniBand): A3 Ultra and A4 clusters
HPC clusters: H4D, compact placement
Data loading: Hyperdisk ML, Parallelstore
LLM serving patterns: vLLM, TGI, Triton
DRA (Dynamic Resource Allocation): next-gen scheduling
Cost optimization: Spot VMs, preemption handling
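On GKE, a MIG partition size is set per node pool through the `gpu-partition-size` accelerator option, and each partition is scheduled as one `nvidia.com/gpu` unit. Below is a quick capacity sketch for NVIDIA A100 40GB GPUs. The per-GPU maximums follow NVIDIA's MIG profile table for that model only. Other GPUs, such as A100 80GB and H100, use different profiles.

```python
# Max MIG instances per NVIDIA A100 40GB GPU, by profile (A100 40GB only).
A100_40GB_MIG_MAX = {"1g.5gb": 7, "2g.10gb": 3, "3g.20gb": 2, "7g.40gb": 1}


def schedulable_gpu_slices(nodes: int, gpus_per_node: int, partition: str) -> int:
    """Total nvidia.com/gpu units a node pool exposes with uniform MIG partitioning."""
    try:
        per_gpu = A100_40GB_MIG_MAX[partition]
    except KeyError:
        raise ValueError(f"unsupported A100 40GB partition size: {partition}") from None
    return nodes * gpus_per_node * per_gpu
```

For example, two a2-highgpu-8g nodes (8 × A100 40GB each) partitioned at `1g.5gb` expose 112 schedulable units. Compare that number against the replica count of small inference Pods.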

Chap 50: GKE Large-Scale Design — 1000+ Nodes

Why it matters: GKE clusters with more than 1,000 nodes have different operational characteristics. Architecture decisions made at cluster creation time determine the scalability ceiling.

Prerequisites: Chap 5, 6, 8, 9

Depth level: 5/5

Sub-topics:

GKE scalability limits: max nodes, Pods, Services
Planning: node pool sizing, Pod density
API server scalability: request rate, watch connections
etcd scalability: object count, size, compaction
Controller manager reconciliation: queue, workers
Scheduler performance: latency at scale
IP planning: CIDR sizing, expansion
Service mesh scalability: xDS updates, memory overhead
NodeLocal DNSCache necessity
Node pool strategy: multiple small vs fewer large
Workload distribution: topology spread, bin packing
Large-scale upgrade: surge sizing, concurrency
Network policy scalability: eBPF necessity
Metrics cardinality: limits, aggregation
Log volume: sampling, exclusion strategies
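In VPC-native clusters, GKE gives each node a Pod CIDR holding at least twice the node's maximum Pods (110 Pods → /24, 32 Pods → /26). As a result, the Pod secondary range caps cluster size, not just the node subnet. Below is a sketch of that arithmetic using the standard `ipaddress` module. It considers a single Pod range and ignores additional Pod ranges and the node subnet's own limit.

```python
import ipaddress


def per_node_pod_prefix(max_pods_per_node: int) -> int:
    """Smallest IPv4 prefix holding at least 2x max Pods (e.g. 110 -> /24)."""
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    host_bits = (2 * max_pods_per_node - 1).bit_length()  # ceil(log2(2 * max_pods))
    return 32 - host_bits


def max_nodes_for_pod_range(pod_range: str, max_pods_per_node: int) -> int:
    """Upper bound on nodes that a single Pod secondary range can serve."""
    network = ipaddress.ip_network(pod_range)
    node_prefix = per_node_pod_prefix(max_pods_per_node)
    if node_prefix < network.prefixlen:
        return 0  # range too small for even one node's Pod CIDR
    return 2 ** (node_prefix - network.prefixlen)
```

A /14 Pod range at the default 110 Pods per node tops out at 1,024 nodes. Lowering max Pods to 32 lets a /16 serve the same 1,024 nodes.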

Chap 51: Cost Optimization Engineering — Systematic Approach

Why it matters: Cloud cost is a significant operational concern. Systematic cost optimization requires understanding billing mechanics and the available optimization levers.

Prerequisites: All service chapters

Depth level: 3/5

Sub-topics:

Committed Use Discounts (CUDs): 1-year/3-year, resource-based vs flexible
Sustained Use Discounts: automatic for eligible Compute Engine machine series
Spot VMs: interrupt frequency, cost savings 60–91%
GKE cost allocation: namespace-level breakdown
Rightsizing: VPA recommendations, insights
Idle resource detection: Recommender API
Egress costs: inter-region, internet egress optimization
Storage tier automation: lifecycle policies
Cloud Billing exports: BigQuery analysis
Budget alerts: programmatic controls
Cost monitoring dashboard patterns
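A resource-based CUD is paid in full whether or not the committed resources are used, and usage above the commitment is billed at on-demand rates. Below is a sketch of the resulting monthly blend. Prices and the discount rate are inputs, and the values used in the example are placeholders, not GCP list prices.

```python
def monthly_compute_cost(
    usage_units: float,
    committed_units: float,
    on_demand_price: float,
    cud_discount: float,
) -> float:
    """Cost of one billing month when part of the usage is covered by a commitment.

    usage_units / committed_units: e.g. vCPU-months; cud_discount: fraction in [0, 1).
    The commitment is paid in full even if usage falls below it.
    """
    committed_cost = committed_units * on_demand_price * (1 - cud_discount)
    overflow_cost = max(0.0, usage_units - committed_units) * on_demand_price
    return committed_cost + overflow_cost
```

With a placeholder 37% discount, committing 80 units against 100 units of usage costs 70.4. If usage drops to 50, the bill is still 50.4 because the unused commitment is charged, which is more than paying on demand. This is why commitments are usually sized to the steady baseline rather than the peak.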

Chap 52: GKE Disaster Recovery & High Availability

Why it matters: DR planning is critical for production systems. GCP offers many options with different cost/complexity trade-offs.

Prerequisites: Chap 9, storage fundamentals

Depth level: 4/5

Sub-topics:

RTO vs RPO: definitions, trade-offs
Multi-region architecture: active-active, active-passive
Backup for GKE: backup plans, restore procedures
PD snapshots: cross-region copies
GCS geo-redundancy: dual-region, multi-region buckets
Database DR: Cloud SQL cross-region replicas, Spanner multi-region configurations
Config backup: GitOps, Terraform state
Multi-region failover testing: chaos at region level
DNS failover: health checks, weighted routing
DR runbooks: step-by-step procedures
RTO validation, data integrity checks
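With periodic backups (for example, a Backup for GKE plan on a cron schedule), the worst case is a failure just before the next backup completes. That failure loses everything written since the previous backup. Below is a simplified RPO check. It assumes each backup captures state as of its start time and that copying it to the DR region adds a fixed lag; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta,
    copy_lag: timedelta = timedelta(0),
) -> timedelta:
    """Worst-case RPO for periodic backups.

    A failure just before the next backup finishes loses everything written since
    the previous backup's point in time; copying backups to another region adds lag.
    """
    return backup_interval + backup_duration + copy_lag


def meets_rpo(
    target: timedelta,
    backup_interval: timedelta,
    backup_duration: timedelta,
    copy_lag: timedelta = timedelta(0),
) -> bool:
    """True if the schedule's worst-case data loss fits within the RPO target."""
    return worst_case_data_loss(backup_interval, backup_duration, copy_lag) <= target
```

The practical consequence: an hourly schedule cannot satisfy a one-hour RPO once backup duration or cross-region copy time is non-zero.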

PART X: ADVANCED DEBUGGING & INCIDENT MANAGEMENT

Chap 53: Production GKE Debugging Framework

Why it matters: A structured debugging methodology is essential for resolving production incidents quickly. Understanding telemetry sources and correlation methods is a core skill.

Prerequisites: Chap 39–43

Depth level: 5/5

Sub-topics:

Systematic debugging approach: hypothesis → test → validate
Information gathering: logs, metrics, events, traces
Pod debugging: pending, crashed, hung states
Service connectivity: DNS, routing, network policies
Node debugging: capacity, health, pressure signals
Control plane: API latency, etcd performance
Cross-layer correlation: trace → logs → metrics
GKE dashboard analysis: cluster health signals
Incident timeline reconstruction: event correlation
Root cause analysis techniques: Five Whys, fishbone
Action items: immediate vs long-term fixes
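The first pass at Pod debugging is reading `status` from `kubectl get pod POD -o json`. The sketch below turns well-known Kubernetes reason strings (such as `ImagePullBackOff`, `CrashLoopBackOff`, and `OOMKilled`) into first hypotheses. The hint wording is ours, and the list is deliberately incomplete; hung Pods, for example, are not covered.

```python
def triage_pod(pod: dict) -> list[str]:
    """Return first-line hypotheses for a Pod object (kubectl get pod -o json)."""
    status = pod.get("status", {})
    hints: list[str] = []
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            detail = cond.get("message") or cond.get("reason", "")
            hints.append(f"unschedulable: {detail}")
    for cs in status.get("containerStatuses", []):
        name = cs.get("name", "?")
        waiting = cs.get("state", {}).get("waiting", {}).get("reason")
        last_exit = cs.get("lastState", {}).get("terminated", {}).get("reason")
        if waiting in ("ErrImagePull", "ImagePullBackOff"):
            hints.append(f"{name}: image pull failing (tag, registry auth, egress path)")
        elif waiting == "CreateContainerConfigError":
            hints.append(f"{name}: missing ConfigMap/Secret or invalid env reference")
        elif waiting == "CrashLoopBackOff":
            hints.append(f"{name}: restarting repeatedly; read `kubectl logs --previous`")
        if last_exit == "OOMKilled":
            hints.append(f"{name}: last run OOMKilled; compare memory limit with usage")
    return hints
```

Pass it the result of `json.load(...)` on the kubectl output. An empty list means the status fields alone do not explain the problem, so events and logs are the next step.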

Chap 54: Incident Response & Post-Mortems

Why it matters: Incident response is a skill that separates good SREs from great ones. A structured approach reduces MTTR and blast radius.

Prerequisites: Chap 39–43, Chap 53

Depth level: 4/5

Sub-topics:

Incident classification: severity levels
Incident command structure: IC, SME, Comms Lead
Detection → Triage → Mitigation → Resolution flow
GCP debugging tools during incidents: Logging, Monitoring, Trace
Mitigation patterns: rollback, feature flags, circuit breakers
Communication: status page, stakeholder updates
Blameless postmortem culture
Timeline reconstruction, root cause analysis
Action items: tracking, follow-up
Knowledge sharing: incident readout, runbook updates

PART XI: SPECIAL TOPICS & ADVANCED CONCEPTS

Chap 55: Kubernetes API Machinery Deep Dive

Why it matters: Understanding API server internals, the informer pattern, and the watch mechanism is essential for debugging complex control plane behavior.

Prerequisites: Chap 5

Depth level: 5/5

Sub-topics:

API server request pipeline: auth → authz → admission → storage
Watch mechanism: efficient state propagation without polling
Informer pattern: list-watch, local cache, resync intervals
Controller runtime framework: reconciliation loop patterns
API Priority and Fairness (APF): FlowSchemas, priority levels
etcd consistency: linearizability guarantees, watch cache
Resource versioning: optimistic locking, conflict resolution
Custom Resource Definitions (CRDs): extensibility mechanism
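Every Kubernetes object carries `metadata.resourceVersion`. An update that sends a stale value is rejected with 409 Conflict, and well-behaved clients re-read and retry; client-go wraps this pattern as `retry.RetryOnConflict`. The toy in-memory store below reproduces the pattern. Real resourceVersions are opaque strings that clients may only compare for equality, whereas the sketch uses integers. `ToyStore` and `retry_on_conflict` are hypothetical names.

```python
import threading


class Conflict(Exception):
    """Raised when the caller's resourceVersion is stale (HTTP 409 in Kubernetes)."""


class ToyStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._objects: dict[str, tuple[int, dict]] = {}

    def create(self, name: str, obj: dict) -> None:
        with self._lock:
            self._objects[name] = (1, dict(obj))

    def get(self, name: str) -> tuple[int, dict]:
        with self._lock:
            version, obj = self._objects[name]
            return version, dict(obj)

    def update(self, name: str, expected_version: int, obj: dict) -> int:
        """Compare-and-swap: succeed only if nobody wrote since `expected_version`."""
        with self._lock:
            current, _ = self._objects[name]
            if current != expected_version:
                raise Conflict(f"{name}: have {expected_version}, server has {current}")
            self._objects[name] = (current + 1, dict(obj))
            return current + 1


def retry_on_conflict(store: ToyStore, name: str, mutate, attempts: int = 5) -> int:
    """Re-read, re-apply `mutate`, and retry when the write loses a race."""
    for _ in range(attempts):
        version, obj = store.get(name)
        try:
            return store.update(name, version, mutate(obj))
        except Conflict:
            continue
    raise Conflict(f"{name}: gave up after {attempts} attempts")
```

The key design point is that `mutate` is re-applied to a fresh read on every attempt. Retrying the same stale object would simply conflict again.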

Chap 56: Kubernetes Advanced RBAC & Authorization Patterns

Why it matters: RBAC design at production scale requires careful planning. Understanding aggregation, impersonation, and conditional access prevents privilege creep.

Prerequisites: Chap 12, Chap 31

Depth level: 4/5

Sub-topics:

ClusterRole aggregation: composing roles from multiple roles
ClusterRoleBinding vs RoleBinding: scoping semantics
Service account impersonation: delegation chains, risks
Group binding strategies: LDAP, Google Groups integration
Least privilege RBAC: role reduction, time-bound roles
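Attribute-based conditions: not supported by Kubernetes RBAC itself (allow-only rules); typically layered on through IAM Conditions or admission policies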
RBAC for multi-tenancy: namespace isolation
Audit: logging RBAC decisions for compliance
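The control plane fills in the `rules` of a ClusterRole that has an `aggregationRule`. It copies them from every ClusterRole whose labels match one of the `clusterRoleSelectors`. The built-in `view`, `edit`, and `admin` roles are extended this way, through labels such as `rbac.authorization.k8s.io/aggregate-to-view: "true"`. Below is a sketch of the selection logic over plain dicts. It handles `matchLabels` only, and the `rbac.example.com/aggregate-to-monitoring` label is illustrative.

```python
def matches(labels: dict[str, str], match_labels: dict[str, str]) -> bool:
    """True if every matchLabels pair is present in `labels`."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def aggregate_rules(aggregator: dict, cluster_roles: list[dict]) -> list[dict]:
    """Union of rules from ClusterRoles selected by the aggregator's aggregationRule."""
    selectors = aggregator["aggregationRule"]["clusterRoleSelectors"]
    rules: list[dict] = []
    for role in cluster_roles:
        if role["metadata"]["name"] == aggregator["metadata"]["name"]:
            continue  # never aggregate a role into itself
        labels = role["metadata"].get("labels", {})
        if any(matches(labels, s.get("matchLabels", {})) for s in selectors):
            for rule in role.get("rules", []):
                if rule not in rules:  # this sketch skips exact duplicates
                    rules.append(rule)
    return rules
```

The privilege-creep risk follows directly from this logic. Any ClusterRole that gains a matching label silently widens every binding to the aggregating role.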

Chap 57: GKE with Windows Server Containers

Why it matters: Windows containers on GKE are a niche but important option for enterprise .NET workloads.

Prerequisites: Chap 5, 6, Windows fundamentals

Depth level: 3/5

Sub-topics:

GKE Windows node pool creation
Windows CNI considerations: different networking model
Image pulling: Windows image registry optimization
Resource requests: CPU/memory on Windows
Pod disruption: graceful termination handling
Monitoring: Windows-specific metrics

Chap 58: Confidential Compute on GKE — AMD SEV & Intel TDX

Why it matters: Confidential computing is an emerging pattern for sensitive workloads. Understanding attestation and performance overhead is critical.

Prerequisites: Chap 6, Chap 32

Depth level: 3/5

Sub-topics:

AMD SEV: memory encryption, attestation
Intel TDX: Trust Domain Extensions
Performance overhead: latency, throughput
Use cases: regulated industries, financial
Attestation verification: remote attestation
Key management in Confidential VMs

Chap 59: Managed Prometheus at Scale — Optimization & Troubleshooting

Why it matters: Managed Service for Prometheus is a scalable Prometheus solution, but high cardinality can spike costs and query latency.

Prerequisites: Chap 39, Prometheus concepts

Depth level: 4/5

Sub-topics:

GMP architecture: globally managed backend
PodMonitoring CRDs: configuration patterns
Recording rules: pre-compute expensive queries
Alert evaluation: Ruler component
High cardinality antipatterns: label explosion
Metric ingestion costs: billed per sample ingested
PromQL performance: query optimization
Thanos integration: federation, retention
Troubleshooting: query timeout, high cardinality detection
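For a single metric, the number of series is bounded by the product of the distinct values of each label. Managed Service for Prometheus bills per sample ingested, so every extra series adds one sample per scrape. Below is a back-of-envelope sketch. The label counts are hypothetical, and real cardinality is usually below the bound because not every combination occurs.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def samples_per_month(series: int, scrape_interval_s: int, days: int = 30) -> int:
    """Samples ingested if every series is scraped at a fixed interval all month."""
    return series * (days * 86_400 // scrape_interval_s)
```

5 methods × 10 status codes × 100 Pods gives at most 5,000 series, or 432 million samples a month at a 30-second scrape interval. Adding a `user_id` label with 10,000 values would multiply both numbers by 10,000. That multiplication is the label explosion antipattern.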

Chap 60: Advanced Cloud Armor WAF Configuration

Why it matters: Cloud Armor is GCP's Web Application Firewall. Tuning rules properly keeps false positives and false negatives low and mitigates DDoS attacks.

Prerequisites: Chap 27, security fundamentals

Depth level: 3/5

Sub-topics:

WAF rule types: OWASP ruleset, custom rules
Adaptive Protection: ML-based DDoS detection
Rate limiting: threshold configuration
Rule evaluation order: priority (lowest number first), first match wins
Signed Cookies: custom domain patterns
URL field masking: protecting sensitive data in logs
Google-managed rules: automatic updates
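Cloud Armor evaluates a security policy's rules in priority order, lowest number first, and applies the action of the first rule that matches. Every policy ends with a default rule at priority 2147483647 that matches all traffic. The sketch below shows that ordering, with Python predicates standing in for CEL match expressions. The example rules and IP addresses (from documentation ranges) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Request = dict[str, str]
DEFAULT_RULE_PRIORITY = 2_147_483_647  # priority of every policy's default rule


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number = evaluated first
    action: str  # e.g. "allow", "deny(403)"
    matches: Callable[[Request], bool]  # stands in for a CEL match expression


def evaluate(rules: list[Rule], request: Request, default_action: str = "allow") -> str:
    """Return the action of the first matching rule in priority order."""
    default_rule = Rule(DEFAULT_RULE_PRIORITY, default_action, lambda _req: True)
    ordered = sorted([*rules, default_rule], key=lambda r: r.priority)
    # The default rule matches everything, so next() always finds a rule.
    return next(rule.action for rule in ordered if rule.matches(request))
```

For the trusted IP, an allow at priority 500 wins over a deny at priority 1000. Only priority decides evaluation order, not the order in which rules were created.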

PART XII: SPECIAL PRODUCTION RUNBOOKS & TROUBLESHOOTING

Chap 61: GKE Troubleshooting Runbook — Common Issues & Solutions

Why it matters: Common GKE issues require specific debugging steps. Pre-written runbooks reduce MTTR.

Prerequisites: Chap 39–43, 53–54

Depth level: 4/5

Sub-topics:

Pod creation failures: troubleshooting checklist
Scheduling failures: pending pods resolution
Networking issues: connectivity test procedures
Storage issues: volume attachment failures
Control plane issues: API server latency, etcd health
Node issues: NotReady diagnosis
Workload Identity failures: token exchange debugging
Autoscaling issues: HPA/CA troubleshooting
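When an HPA scales unexpectedly, recomputing its target helps. The controller uses desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue) and skips scaling when the ratio is within a tolerance (0.1 by default). Below is a sketch of that core rule, clamped to min/max replicas. It ignores stabilization windows, scaling policies, multiple metrics, and not-ready Pods, and the `max_replicas` default is arbitrary.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 100,
    tolerance: float = 0.1,
) -> int:
    """Core HPA scaling rule, without stabilization windows or scaling policies."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

If the live HPA disagrees with this number, the gap usually comes from one of the omitted factors. Check stabilization windows and missing metrics first.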

Chap 62: GKE Cluster Upgrade Runbook — Zero-Downtime Procedures

Why it matters: Cluster upgrades can be disruptive if they are not done carefully. A proven runbook is essential for production.

Prerequisites: Chap 15, Chap 54

Depth level: 4/5

Sub-topics:

Pre-upgrade validation: compatibility checks
Node surge strategy: sizing for a stable rollout
PDB configuration: ensuring disruption budget
Control plane upgrade window: monitoring, rollback triggers
Node pool upgrade execution: monitoring, health checks
Post-upgrade validation: functionality verification
Rollback procedures: emergency rollback steps
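During a surge upgrade, GKE upgrades up to maxSurge + maxUnavailable nodes of a pool at once. The number of upgrade waves is therefore roughly the node count divided by that sum. PodDisruptionBudgets then gate each drain: a PDB with minAvailable allows (healthy Pods − minAvailable) evictions. Below is a sketch of both calculations. It ignores time spent waiting on blocked drains and on node provisioning.

```python
import math


def upgrade_waves(node_count: int, max_surge: int, max_unavailable: int) -> int:
    """Approximate number of batches for a surge upgrade of one node pool."""
    parallel = max_surge + max_unavailable
    if parallel < 1:
        raise ValueError("maxSurge + maxUnavailable must be at least 1")
    return math.ceil(node_count / parallel)


def pdb_disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Evictions a PodDisruptionBudget with minAvailable permits right now."""
    return max(0, healthy_pods - min_available)
```

A 100-node pool with maxSurge=5 and maxUnavailable=0 needs about 20 waves. A PDB whose minAvailable equals the replica count allows zero evictions, which stalls every drain that touches those Pods.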

Chap 63: GCP Network Troubleshooting Methodology

Why it matters: Network issues are hard to debug. A systematic approach and knowledge of the tools are essential for resolving them quickly.

Prerequisites: Chap 3, 19–27

Depth level: 4/5

Sub-topics:

Connectivity Tests: GCP native tool
VPC Flow Logs analysis: sampled flow-level debugging (Packet Mirroring for packet-level)
Firewall rule debugging: evaluation order, matching
Route troubleshooting: destination matching, recursive lookup
DNS debugging: resolution path, TTL issues
NAT issues: port exhaustion diagnosis
Load Balancer debugging: backend health, traffic distribution
Service Mesh networking: traffic flow through Envoy
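Each Cloud NAT IP provides 64,512 usable source ports (65,536 minus the 1,024 well-known ports). With static port allocation, every VM gets at least `minPortsPerVm` ports, 64 by default. A NAT port can be reused for different destinations, so a VM's allocation caps its concurrent connections to any single destination IP, port, and protocol. Below is a sketch of that arithmetic. It ignores dynamic port allocation and connection timeouts.

```python
PORTS_PER_NAT_IP = 65_536 - 1_024  # 64,512 usable source ports per NAT IP


def max_vms_per_nat_ip(min_ports_per_vm: int = 64) -> int:
    """VMs (GKE nodes) one NAT IP can serve under static port allocation."""
    return PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count: int, min_ports_per_vm: int = 64) -> int:
    """NAT IPs required so every VM gets its minimum port allocation."""
    per_ip = max_vms_per_nat_ip(min_ports_per_vm)
    return -(-vm_count // per_ip)  # ceiling division


def exhausts_ports(concurrent_conns_to_one_dest: int, min_ports_per_vm: int = 64) -> bool:
    """True if one VM's connections to a single dest IP:port:protocol exceed its ports."""
    return concurrent_conns_to_one_dest > min_ports_per_vm
```

On GKE, the allocation belongs to the node and is shared by all its Pods. A node whose Pods open hundreds of parallel connections to one API endpoint therefore exhausts 64 ports long before the NAT IP runs out.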

B. SUMMARY & COVERAGE VALIDATION

Coverage Statistics:

Total chapters: 63 main chapters
Total sub-topics: 800+ detailed sub-topics
Total parts: 12 major sections
Estimated pages: 1,500–2,000 pages (if printed)
Estimated study time: 8–12 months for deep mastery

Covered Domains (100% Coverage):

✅ GKE Architecture & Internals (Chapters 5–18)
✅ GCP Networking Foundation & Services (Chapters 2–4, 19–27)
✅ Storage & Persistence (Chapters 28–30)
✅ IAM, Security, Compliance (Chapters 31–34)
✅ Messaging & Distributed Systems (Chapters 35–38)
✅ Observability & SRE (Chapters 39–43)
✅ CI/CD & Automation (Chapters 44–46)
✅ Service Mesh & Multi-Cluster (Chapters 47–48)
✅ Advanced Workloads (Chapters 49–50)
✅ Cost Optimization & DR (Chapters 51–52)
✅ Debugging & Incident Response (Chapters 53–54)
✅ Advanced Deep-Dives (Chapters 55–60)
✅ Production Runbooks (Chapters 61–63)

Advanced "Kill Content" - Staff/Principal Level Topics:

"GKE Packet Path Anatomy: From Container veth Pair to the Internet" — Complete packet trace through every layer
"Cluster Autoscaler Decision Engine: Why Scale-Up Lags by 45 Seconds" — Latency breakdown, node provisioning mechanics
"etcd vs Spanner: GKE Control Plane State Storage" — Backend comparison, consistency implications
"Cloud NAT Port Exhaustion: The Silent Production Killer" — 5-tuple exhaustion, monitoring strategy
"Workload Identity Token Exchange: Every Step in Detail" — JWT validation, STS exchange, security boundaries
"Private Cluster Leak Assumptions: 5 Ways Traffic Gets Exposed" — metadata server, DNS leaks, edge cases
"Pub/Sub Regional Failure: Ordering & Failover Mechanics" — Region-scoped constraints, failover detection
"Binary Authorization in Production: Gaps & Vectors" — Break-glass abuse, attestation replay, workarounds
"SLO Error Budget Burn Rate: Multi-Window Alerting Math" — 2% of a 30-day budget in 1h = 14.4x burn rate
"Production GKE Upgrade Runbook: Zero-Downtime Playbook" — Proven procedures, validation gates, rollback

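Burn rate is the fraction of the error budget consumed divided by the fraction of the SLO window that has elapsed. For a 30-day (720-hour) window, spending 2% of the budget in one hour gives 0.02 × 720 / 1 = 14.4. That value is the fast-burn threshold in the Google SRE Workbook's multi-window alerting example. Below is a sketch of the arithmetic, assuming a 30-day window by default.

```python
def burn_rate(budget_fraction_consumed: float, elapsed_hours: float, window_hours: float = 720.0) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return budget_fraction_consumed / (elapsed_hours / window_hours)


def budget_exhaustion_hours(rate: float, window_hours: float = 720.0) -> float:
    """Hours until the full budget is gone if the current burn rate persists."""
    return window_hours / rate
```

At 14.4x, the whole 30-day budget is gone in 50 hours. The slower 6x tier (5% in 6 hours) leaves 120 hours.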
C. RECOMMENDED READING SEQUENCE

Phase 1: Foundation (Weeks 1–4)

Chap 1: Resource Hierarchy
Chap 2: Jupiter Fabric & Andromeda
Chap 3: VPC Model
Chap 4: Cloud DNS
Chap 31: IAM Deep Dive

Phase 2: GKE Essentials (Weeks 5–12)

Chap 5: Control Plane Internals
Chap 6: Node Lifecycle
Chap 7: Networking Internals
Chap 8: Scheduler
Chap 9: Autoscaling
Chap 10: Admission Control

Phase 3: Production Operations (Weeks 13–24)

Chap 39: Cloud Monitoring
Chap 40: Cloud Logging
Chap 42: SRE Practices
Chap 43: Debugging Methodology
Chap 54: Incident Response

Phase 4: Advanced Topics (Weeks 25–52)

Chap 32–34: Security & Secrets
Chap 35–38: Messaging Systems
Chap 44–46: Automation & CI/CD
Chap 47–50: Advanced Workloads
Chap 55–63: Deep-Dives & Runbooks
