
PRODUCTION-GRADE GCP ENGINEERING HANDBOOK

A Comprehensive System for Platform Engineers & Staff/Principal Cloud Architects


PART I: ARCHITECTURAL FOUNDATIONS & GCP OVERVIEW


Chapter 1: GCP Resource Hierarchy & Resource Organization

Why it matters: Every IAM, billing, network boundary, and org policy decision depends on understanding the resource hierarchy. A mistake at this layer has the maximum blast radius: it affects the entire organization.

Chapter 1 Full Index & Learning Paths

Subtopics:

  1. Resource Hierarchy Fundamentals - Organization → Folder → Project → Resource: hierarchy levels, inheritance, override mechanics

  2. Resource Manager API - Programmatic resource management, propagation delay, consistency model, eventual consistency handling

  3. Project Naming & Automation - Project ID constraints, immutability, soft-delete windows, naming automation patterns

  4. IAM Policy Propagation - Three-layer propagation, eventual consistency, caching behavior, testing strategies

  5. Quota Management - Quota types (allocation/rate/concurrent), project vs organization-level, exhaustion scenarios

  6. Labels, Tags & Organization - Labels vs Tags vs Network Tags: usage for billing, firewall, IAM conditions, cost allocation

  7. Cloud Asset Inventory - Query resource state across hierarchy, drift detection, compliance auditing

  8. Resource Protection - Locking, deletion protection, soft-delete recovery, backup strategies

  9. Shared VPC Model - Host project vs service projects, centralized network management, cross-project connectivity

  10. Service Account Scoping - Cross-project access patterns, keys vs tokens, workload identity, impersonation chains

  11. Billing Hierarchy - Cost attribution, billing account structure, chargeback models, budget alerts

  12. Organization Policies - Constraint framework, managed/custom constraints, conditional policies, CEL expressions
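
The inheritance mechanics named in subtopic 1 can be sketched in a few lines. This is a simplified model, assuming only additive allow policies (deny policies and IAM conditions are ignored); the node names, roles, and members are hypothetical.

```python
def effective_roles(member, ancestry, policies):
    """Union of roles granted to member on a resource and all of its ancestors.

    ancestry: node IDs ordered from the organization down to the resource.
    policies: node ID -> {role: set of members} (allow bindings only).
    """
    roles = set()
    for node in ancestry:
        for role, members in policies.get(node, {}).items():
            if member in members:
                roles.add(role)
    return roles


# Hypothetical hierarchy and bindings, for illustration only.
policies = {
    "organizations/example": {"roles/viewer": {"user:alice@example.com"}},
    "folders/platform": {"roles/compute.networkAdmin": {"user:alice@example.com"}},
    "projects/payments-prod": {"roles/storage.admin": {"user:bob@example.com"}},
}
ancestry = ["organizations/example", "folders/platform", "projects/payments-prod"]

print(sorted(effective_roles("user:alice@example.com", ancestry, policies)))
# -> ['roles/compute.networkAdmin', 'roles/viewer']
```

A role granted high in the hierarchy reaches every descendant, which is why a binding at the organization level carries the largest blast radius.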


Chapter 2: GCP Physical Network Architecture: Jupiter Fabric & Andromeda

Why it matters: GCP networking differs fundamentally from on-prem and AWS. Understanding the Jupiter spine-leaf fabric, the Andromeda SDN, and global routing explains latency, failover behavior, and bandwidth allocation.

Chapter 2 Full Index & Learning Paths

Subtopics:

  1. Andromeda: GCP Software-Defined Networking Stack - Control plane vs data plane, 5-step packet processing pipeline, VPC logical overlay, production patterns, anti-patterns

  2. Jupiter Fabric: Spine-Leaf Topology - Physical datacenter topology, per-server bandwidth, ECMP routing, oversubscription implications, zone placement

  3. Google Points of Presence (PoP) - Edge node hierarchy, traffic entry points, PoP failover mechanisms, DDoS scrubbing, anycast routing

  4. GCP Global Backbone: Premium vs Standard Tier - User-centric vs region-centric routing, private fiber backbone, SLA differences, cost tradeoffs

  5. Latency SLA & Fiber Path Engineering - Fiber infrastructure, latency components, multi-path redundancy, inter-region latencies, propagation delays

  6. Anycast Routing with Global Load Balancer - BGP anycast mechanism, automatic geo-routing, single IP multiple locations, failover transparency

  7. Cold Potato vs Hot Potato Routing Strategies - Egress point optimization, cold potato (backbone) vs hot potato (internet), strategic routing decisions

  8. Network Service Tiers: Practical Datapath Implications - Premium vs Standard tier queuing, SLA achievement mechanics, health checking differences

  9. Bandwidth Allocation & Egress Pricing Architecture - Per-zone capacity, bandwidth quotas, egress pricing model, burst allowance mechanics

  10. Regional vs Global Services: Data Sovereignty - Data residency requirements, GDPR/CCPA/HIPAA compliance, regional constraints, multi-region architectures

  11. Traffic Engineering & Multi-path Load Balancing - ECMP routing, capacity planning, failure scenarios, multi-path resilience, cascade failure prevention


Chapter 3: GCP VPC Model: Global Virtual Network Architecture

Why it matters: The VPC is the foundation of everything in GCP. Understanding the global-regional structure, subnet design, and routing primitives is a hard prerequisite.

Chapter 3 Full Index & Learning Paths

Subtopics:

  1. VPC Is a Global Resource, Subnets Are Regional - VPC scope vs subnet scope, implications for multi-region, why GCP differs from AWS/Azure

  2. Auto-mode vs Custom-mode VPC - Auto-mode limitations (10.128.0.0/9), custom-mode flexibility, why production always uses custom mode, migration strategies

  3. Subnet Design & CIDR Planning - Primary vs secondary ranges, GKE Pod CIDR allocation, IP address management at scale, overlap constraints

  4. Alias IP Ranges & GKE Pods - VPC-native pod routing (no NAT), anti-spoofing checks, container networking patterns, firewall interactions

  5. Static Routes & Next Hops - Subnet routes, custom static routes, next hop types (VMs, ILBs, VPN), route conflict resolution

  6. Dynamic Routes & Cloud Router - BGP sessions, route learning/advertisement, regional vs global mode, on-premises connectivity

  7. System-generated Routes - Default route, subnet routes, special paths (GFE, IAP, Serverless), reserved ranges

  8. Firewall Rules Fundamentals - Stateful inspection, priority 0-65535, ingress/egress asymmetry, connection tracking limits

  9. Network Tags vs Service Accounts - Tags vs SAs for firewall targeting, decision matrix, multi-tier patterns, IAM integration

  10. Hierarchical Firewall Policies - Organization → folder → project evaluation order, allow/deny semantics, exceptions, multi-org scenarios

  11. Cloud NGFW & L7 Inspection - FQDN filtering, TLS interception, IDS/IPS, threat intelligence, latency overhead, throughput ceilings

  12. VPC Peering Deep Dive - No-transitivity principle, mesh topology, hub-and-spoke routing, DNS resolution, cross-project patterns

  13. Shared VPC & Centralized Management - Host/service projects, subnet sharing, IAM role separation, multi-tenancy isolation, cost attribution

  14. Private Google Access - Routing to 199.36.153.8/30 (private.googleapis.com) and 199.36.153.4/30 (restricted.googleapis.com), Google APIs access without internet, private vs restricted endpoints

  15. VPC Flow Logs Analysis - Sampling mechanics, metadata fields, BigQuery export, cost analysis, troubleshooting patterns

  16. Network Intelligence Center - Topology visualization, connectivity tests, performance insights, firewall analysis

  17. VPC Service Controls - Service perimeters, access levels, ingress/egress rules, data exfiltration prevention, compliance
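
The overlap constraints from subtopic 3 and the auto-mode range from subtopic 2 can be checked mechanically before any subnet is created. The following is a minimal sketch using Python's standard `ipaddress` module; the subnet names and ranges in `plan` are hypothetical.

```python
import ipaddress

# Range that auto-mode VPC networks allocate their subnets from.
AUTO_MODE_RANGE = ipaddress.ip_network("10.128.0.0/9")


def find_overlaps(planned):
    """Return pairs of planned CIDRs that overlap each other or the auto-mode range."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in planned.items()}
    names = sorted(nets)
    conflicts = []
    for i, a in enumerate(names):
        if nets[a].overlaps(AUTO_MODE_RANGE):
            conflicts.append((a, "auto-mode"))
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                conflicts.append((a, b))
    return conflicts


# Hypothetical plan: names and ranges are illustrative only.
plan = {
    "prod-nodes": "10.0.0.0/20",
    "prod-pods": "10.4.0.0/14",
    "prod-services": "10.8.0.0/20",
    "legacy": "10.130.0.0/20",
}
print(find_overlaps(plan))
# -> [('legacy', 'auto-mode')]
```

Flagging overlap with 10.128.0.0/9 matters mainly when a custom-mode network may later be peered with an auto-mode network.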


Chapter 4: Cloud DNS Architecture & Production Patterns

Why it matters: DNS is a hidden attack surface. Misconfiguration leads to outages and data exfiltration. Cloud DNS for GKE is mandatory for Autopilot.

Chapter 4 Full Index & Learning Paths

Subtopics:

  1. Managed Zones: Public vs Private - Public/private zone fundamentals, zone characteristics, naming conventions

  2. Split-Horizon DNS: Internal vs External Resolution - Same domain multiple answers, internal/external topology, failover patterns

  3. DNS Peering: Hybrid On-Premises Resolution - Multi-project, multi-VPC, on-premises integration, hub-spoke architecture

  4. DNS Forwarding: Configuring Upstream Resolvers - Forwarding zones, external DNS, resolver chains, failure handling

  5. Private DNS Zones: VPC Binding & Zone Discovery - VPC attachment, zone discovery, multi-VPC patterns, GKE integration

  6. Cloud DNS for GKE: Alternatives & Performance at Scale - GKE DNS stack, kube-dns vs CoreDNS, external service discovery, multi-cluster

  7. Response Policy Zones (RPZ): Internal Overrides & Security - RPZ mechanisms, security use cases, malware blocking, internal redirects

  8. NodeLocal DNSCache: Latency Reduction & Caching Mechanics - Local caching, performance impact, deployment, troubleshooting

  9. DNS Resolution Path: Pod → NodeLocal → Cloud DNS → Upstream - Complete flow, layer-by-layer troubleshooting, debugging tools

  10. DNS Query Logging: Detection, Audit & Compliance - Query logging, exfiltration detection, BigQuery analysis, alerting

  11. TTL Tuning: High-Churn Environments & Consistency - TTL mechanics, environment-specific tuning, eventual consistency

  12. DNSSEC: Validation & Key Management - DNSSEC architecture, validation, key signing, operational considerations

  13. Multi-Cluster DNS: Cloud Service Directory Patterns - ServiceImport/Export, Service Directory, cross-cluster routing, failover


PART II: GOOGLE KUBERNETES ENGINE: COMPREHENSIVE ARCHITECTURE


Chapter 5: GKE Control Plane Internals: Stateful Systems at Scale

Why it matters: The control plane is the cluster's "brain". Understanding reconciliation mechanics, etcd behavior, and control plane limitations is a prerequisite for debugging scheduling failures, API server latency, and upgrade issues.

Chapter 5 Full Index & Learning Paths

Subtopics:

  1. GKE Managed Control Plane Model: Standard vs Autopilot - What Google manages, what the customer manages, implications for operations

  2. Control Plane Component Architecture: API Server, Scheduler, Controller-Manager - Each component's role, dependencies, failure modes, interoperability

  3. etcd vs Spanner Backend — GKE State Storage & Consistency Model - Storage backends, consistency guarantees, latency implications, backup strategies

  4. etcd Architecture Deep Dive — Quorum, Replication, Watch Mechanism, Compaction - Raft consensus, replication log, watch caching, compaction schedule, performance limits

  5. Watch Caching & API Server Local Cache — Stale Reads, Reconnection Behavior - Cache mechanics, stale reads, watch connection handling, cache invalidation

  6. Kubernetes Informer Pattern — List-Watch Protocol, Local Cache, Resync Intervals - List-watch protocol, informer cache, resync mechanics, shared factory pattern

  7. Controller Reconciliation Loops — Level-Triggered vs Edge-Triggered Design - Reconciliation patterns, level vs edge-triggered, failure modes, idempotency

  8. API Priority and Fairness (APF) — Flow Schemas, Priority Levels, Rate Limiting - Request prioritization, flow classification, token bucket algorithm, debugging rejections

  9. Admission Control Pipeline — MutatingAdmissionWebhook, ValidatingAdmissionWebhook - Request processing pipeline, webhook execution order, failure modes, cluster stability

  10. Mutating Admission Policies — CEL-Based Policies, Webhook Alternatives - CEL expressions, policy enforcement, webhook alternatives, performance tradeoffs

  11. API Server Request Lifecycle — Authentication → Authorization → Admission → Storage - Full request path, latency breakdown, bottleneck analysis

  12. Control Plane Scalability — Request Rate Limits, Watch Connection Limits, Burst Handling - Scale limits, capacity planning, failure at scale, workarounds

  13. Control Plane Connectivity — DNS-Based vs IP-Based Endpoint, Authorized Networks - Endpoint types, authorized networks, network security implications

  14. Private Cluster Control Plane — Private Endpoint, Cloud NAT, Node Access - Private endpoint setup, node connectivity, security benefits

  15. Credential Rotation & Zero-Downtime Updates — SSL Certificates, CA Rotation, IP Rotation - Certificate lifecycle, CA rotation, zero-downtime strategies

  16. Control Plane SLA, Release Channels, & Versioning Policy - Availability guarantees, release cadence, version support windows, version skew policy
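
The level-triggered design from subtopic 7 can be illustrated with a toy reconciler. This is a sketch, not controller-runtime code: state is modeled as plain dictionaries mapping a hypothetical workload name to a replica count, and every pass compares the full desired state with the full observed state instead of reacting to individual events.

```python
def reconcile(desired, observed):
    """Return the actions that move observed state toward desired state."""
    actions = []
    for name in sorted(desired.keys() - observed.keys()):
        actions.append(("create", name))
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(("delete", name))
    for name in sorted(desired.keys() & observed.keys()):
        if desired[name] != observed[name]:
            actions.append(("update", name))
    return actions


def apply_actions(actions, desired, observed):
    """Apply actions to a copy of the observed state and return the new state."""
    state = dict(observed)
    for verb, name in actions:
        if verb == "delete":
            state.pop(name, None)
        else:  # create or update
            state[name] = desired[name]
    return state


# Hypothetical workloads, for illustration only.
desired = {"web": 3, "worker": 2}
observed = {"web": 1, "cron": 1}

actions = reconcile(desired, observed)
print(actions)
# -> [('create', 'worker'), ('delete', 'cron'), ('update', 'web')]

observed = apply_actions(actions, desired, observed)
print(reconcile(desired, observed))
# -> []
```

Because each pass starts from current state, a missed or duplicated event does not matter: running the loop again after convergence yields no actions, which is the idempotency property the subtopic refers to.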


Chapter 6: GKE Node Lifecycle & Pool Management

Why it matters: Node management is where most operational incidents happen. Node not ready, OOM kills, disk pressure: understanding the lifecycle helps you design more fault-tolerant clusters.

Prerequisites: Chapter 5, Container-Optimized OS basics

Depth level: 5/5

Chapter 6 Full Index & Learning Paths

Subtopics:

  1. COS, Node Bootstrap, Node Conditions, and Auto-Repair - COS hardening/immutable filesystem, kubelet registration, startup taints, node conditions, eviction behavior, auto-repair triggers and the node replacement mechanism

  2. Node Pool Upgrades, Draining, Maintenance Windows, and Cluster Disruption Budget - Surge vs blue-green, maxSurge/maxUnavailable, cordon vs drain, PDB, maintenance windows/exclusions, disruption frequency limits

  3. Spot, ARM, Confidential Nodes, Reservations, and Node Labeling Strategy - Spot preemption, graceful shutdown, ARM T2A compatibility, confidential computing, reservation affinity, label strategy for scheduling

  4. Max Pods, Flex Pod CIDR, Boot Disk, Local SSD, and Capacity Design - max pods per node, alias IP sizing, discontiguous Pod CIDR, boot disk performance, local SSD patterns

  5. kubelet & containerd Configuration for Production GKE - kubelet tuning, eviction thresholds, cgroup v2 migration, registry mirrors, custom TLS CA, image pulling behavior


Chapter 7: GKE Networking Internals: VPC-Native, CNI, Dataplane V2 Deep Dive

Why it matters: GKE networking is the most complex area. Understanding the packet path from pod to pod, through a Service, and out to the internet is a prerequisite for debugging latency, packet drops, and network policy violations.

Prerequisites: Chapter 3, Linux networking (namespaces, iptables, veth pairs, bridge)

Depth level: 5/5

Chapter 7 Full Index & Learning Paths

Subtopics:

  1. VPC-Native Architecture: Alias IP, Pod CIDR Sizing & Migration - VPC-native vs routes-based, alias IP ranges on the node NIC, routes-based deprecation, Pod CIDR sizing & max-pods-per-node, secondary subnet sizing, discontiguous Pod CIDR, IP migration

  2. CNI Evolution & Dataplane V2: kubenet, Calico, eBPF/Cilium - kubenet legacy, Calico iptables ceiling, GKE Dataplane V2 (anetd DaemonSet, eBPF programs, no kube-proxy), eBPF vs iptables 260K endpoint limit, Cilium identity model

  3. Detailed Packet Path Analysis: The 5 Packet Paths - Same-node & cross-node pod-to-pod, pod-to-Service (ClusterIP DNAT), pod-to-external (masquerade/Cloud NAT), external-to-pod (LoadBalancer/NEG container-native)

  4. kube-proxy & Service Dataplane: iptables vs eBPF - iptables mode chains/DNAT/session affinity, chain explosion O(Services × Endpoints), lock contention, rule resyncing & control-plane latency, Dataplane V2 eBPF replacement

  5. NetworkPolicy Enforcement: Calico iptables vs Dataplane V2 eBPF - Default-deny model, Calico IP-based ipsets, Dataplane V2 based on Cilium identity, FQDN egress, NetworkPolicy logging, isolation anti-patterns

  6. Troubleshooting Toolkit: tcpdump, nsenter, Hubble, Connectivity Tests - tcpdump inside the pod network namespace, node-level nsenter, Hubble, GCP Connectivity Tests, ip route/arp/iptables, conntrack limits, eBPF tracing with bpftrace
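
The Pod CIDR sizing in subtopic 1 comes down to a small piece of arithmetic. The sketch below assumes the rule documented for GKE that each node receives the smallest alias range holding at least twice its maximum Pod count; the function names are hypothetical, and the result should be checked against current GKE documentation before sizing a real subnet.

```python
def per_node_pod_prefix(max_pods_per_node):
    """Prefix length of the smallest IPv4 range with >= 2 * max_pods addresses."""
    needed = 2 * max_pods_per_node
    host_bits = (needed - 1).bit_length()  # ceil(log2(needed)) for needed >= 1
    return 32 - host_bits


def max_nodes(pod_range_prefix, max_pods_per_node):
    """How many per-node ranges fit into a Pod secondary range of this prefix."""
    node_prefix = per_node_pod_prefix(max_pods_per_node)
    if node_prefix < pod_range_prefix:
        return 0  # a single node's range would not even fit
    return 2 ** (node_prefix - pod_range_prefix)


print(per_node_pod_prefix(110))  # -> 24
print(max_nodes(14, 110))        # -> 1024
print(max_nodes(14, 32))         # -> 4096
```

Under this assumption, lowering max Pods per node from 110 to 32 lets the same /14 Pod range serve four times as many nodes.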


Chapter 8: GKE Scheduler: Algorithms, Affinity, Resource Model

Why it matters: Scheduling failures are the leading cause of Pods stuck in Pending. Understanding the filtering and scoring mechanics helps you design node pools and resource requests correctly from the start.

Prerequisites: Chapters 6 and 7; Kubernetes resource model (requests/limits)

Depth level: 5/5

Chapter 8 Full Index & Learning Paths

Subtopics:

  1. Scheduler Architecture & Workflow: Scheduling Framework, Cycle & Queue - Scheduling cycle vs binding cycle, all extension points (PreFilter→Filter→PostFilter→Score→Reserve→Permit→Bind), optimistic locking, the three queues activeQ/backoffQ/unschedulablePods, QueueingHints, scheduler metrics

  2. Filter & Score Plugins: Node Filtering & Scoring - Filter plugins (NodeResourcesFit, NodeAffinity, TaintToleration, PodTopologySpread, VolumeBinding), Score plugins, LeastAllocated vs MostAllocated (spread vs bin-packing), GKE optimize-utilization, percentageOfNodesToScore

  3. Node Affinity & Inter-Pod Affinity/Anti-Affinity - nodeAffinity (required/preferred, operators, weight), inter-pod affinity/anti-affinity (topologyKey, namespaceSelector), O(pods×namespaces) cost, the required hostname anti-affinity anti-pattern, autoscaler interaction

  4. Pod Topology Spread Constraints: Distribution Across Failure Domains - Skew formula, maxSkew, minDomains, whenUnsatisfiable (DoNotSchedule vs ScheduleAnyway), nodeAffinityPolicy/nodeTaintsPolicy, matchLabelKeys, comparison with podAntiAffinity

  5. Taints & Tolerations: Node "Repel" Constraints - The three effects NoSchedule/PreferNoSchedule/NoExecute, tolerationSeconds, operator Equal/Exists, taint-based eviction by node condition, default 300s toleration, default GKE taints (GPU/Spot/cordon)

  6. Resource Model, QoS & Node-Pressure Eviction - requests vs limits, CPU CFS quota throttling, memory OOM kill, QoS (Guaranteed/Burstable/BestEffort), node-pressure eviction (soft/hard threshold), oom_score_adj, overcommit, why node-pressure eviction does not respect PDBs

  7. Pod Priority & Preemption - PriorityClass (value, globalDefault, preemptionPolicy Never), victim selection algorithm, nominatedNodeName, best-effort PDB handling, cross-node preemption, cascading eviction, starvation, ResourceQuota limits on priority

  8. Extended Resources & GPU Scheduling - requests=limits for extended resources, nvidia.com/gpu, GPU taints + ExtendedResourceToleration, device plugin/driver, GPU sharing (time-sharing/MIG), stranded GPUs, TPU & Dynamic Workload Scheduler

  9. GKE Autopilot Scheduling, Custom ComputeClasses & Scheduler Extenders - Autopilot enforces CPU:memory ratios and rejects/adjusts requests, compute classes, custom ComputeClasses (priorities/fallback, activeMigration, consolidation, nodePoolAutoCreation), scheduler extenders vs plugins, Kueue/Volcano
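
The skew formula from subtopic 4 can be made concrete with a simplified model. The sketch assumes a single constraint with whenUnsatisfiable: DoNotSchedule and takes the skew of a candidate domain as its matching Pod count after placement minus the current global minimum; minDomains, node affinity policies, and taints are ignored, and the zone names are hypothetical.

```python
def allowed_domains(counts, max_skew):
    """Domains where one more matching Pod keeps skew within max_skew.

    counts: topology domain -> number of matching Pods currently in it.
    """
    global_min = min(counts.values())
    return sorted(
        domain
        for domain, matching in counts.items()
        if (matching + 1) - global_min <= max_skew
    )


# Hypothetical distribution of matching Pods across three zones.
counts = {"zone-a": 3, "zone-b": 1, "zone-c": 2}

print(allowed_domains(counts, max_skew=1))  # -> ['zone-b']
print(allowed_domains(counts, max_skew=2))  # -> ['zone-b', 'zone-c']
```

With maxSkew 1 only the emptiest zone is eligible, which is how a tight constraint combined with a full zone can leave a Pod Pending.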


Chapter 9: GKE Autoscaling: HPA, VPA, Cluster Autoscaler, KEDA

Why it matters: Autoscaling is at the heart of cost optimization and reliability. Misunderstanding it leads to slow scale-up (outages), expensive over-provisioning, or flapping that destabilizes the cluster.

Prerequisites: Chapter 8, Cloud Monitoring metrics

Depth level: 5/5

Chapter 9 Full Index & Learning Paths

Subtopics:

  1. HorizontalPodAutoscaler: Control Loop & Algorithm - 15s control loop period, desiredReplicas formula, tolerance 0.1, dampening for not-yet-Ready Pods and missing metrics, stabilization window, atomic vs final recommendation logs (hpa-controller), debugging via conditions

  2. HPA: Behavior Policies, Metrics Sources & Debugging - autoscaling/v2 behavior (scaleUp/scaleDown, selectPolicy, stabilizationWindowSeconds), Resource/Custom/External metrics, Performance HPA Profile (1000/5000 objects), HPA+VPA conflicts, rolling update interaction, AbleToScale/ScalingActive/ScalingLimited

  3. VerticalPodAutoscaler: Architecture, Recommender & Update Modes - Recommender/Updater/Admission Controller, decaying histogram with a 24h half-life, OOM bump, update modes (Off/Initial/Recreate/Auto/InPlaceOrRecreate), In-Place Pod Resize, controlledValues, VPA limitations

  4. Multidimensional Pod Autoscaling: HPA and VPA Together - Why HPA+VPA conflict, MultidimPodAutoscaler (horizontal on CPU + vertical on memory), spec & constraints, migration, comparison with HPA custom metric + VPA Off, failure modes

  5. Cluster Autoscaler: Scale-Up & Scale-Down Mechanics - Pending Pod trigger, fake scheduling simulation, expander (least-waste/priority...), location_policy BALANCED/ANY, 0.5 scale-down threshold and related delays, scale-down blockers, drain sequence, autoscaling profile

  6. Node Auto-Provisioning: Automatic Node Pool Creation - NAP auto-creates/deletes pools, resourceLimits, machine type selection, default template (Shielded/SA/auto-upgrade), GPU/TPU/Spot, ComputeClass integration, 200-pool threshold, NAP on Autopilot

  7. CA Troubleshooting, Capacity Buffers & Provisioning Requests - Visibility events (scaleUp/scaleDown/nodePoolCreated), noScaleUp/noScaleDown reasons, Cloud Logging queries, capacity buffers with pause Pods, Provisioning Requests & Dynamic Workload Scheduler, Kueue

  8. KEDA: Kubernetes Event-Driven Autoscaling - KEDA architecture (operator/metrics-apiserver/webhooks) that creates HPAs, ScaledObject vs ScaledJob, scale-to-zero (activation/scaling), defaults (pollingInterval/cooldownPeriod), Pub/Sub & Prometheus scalers, Cloud Tasks/BigQuery
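
The desiredReplicas formula and the 0.1 tolerance from subtopic 1 fit in one function. This is a minimal sketch of the core calculation only: it assumes a single metric and leaves out the handling of not-yet-Ready Pods, missing metrics, min/max replica bounds, and the stabilization window.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA calculation: ceil(currentReplicas * currentMetric / targetMetric).

    No change is recommended while the ratio stays within the tolerance band.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 60))   # -> 6   (ratio 1.5, scale up)
print(desired_replicas(10, 52, 50))  # -> 10  (ratio 1.04, inside tolerance)
print(desired_replicas(10, 30, 60))  # -> 5   (ratio 0.5, scale down)
```

The tolerance band is what prevents a workload hovering near its target from adding and removing a replica on every 15s loop.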


Chapter 10: GKE Admission Control & Policy Enforcement: Securing the API

Why it matters: Admission control is the security gateway. Misconfigured webhooks can take down the entire cluster. Understanding the admission pipeline is mandatory for platform engineers.

Prerequisites: Chapter 5, Kubernetes API fundamentals

Depth level: 5/5

Chapter 10 Full Index & Learning Paths

Subtopics:

  1. Admission Pipeline & Built-in Plugins - Where admission sits in the request lifecycle, the two fixed phases Mutating → Validating, the list of plugins enabled by default, the four key plugins LimitRanger/ResourceQuota/PodSecurity/NodeRestriction, why --enable-admission-plugins cannot be changed on GKE

  2. Mutating & Validating Webhooks: Invocation Mechanics & Dry-Run - WebhookConfiguration (rules/clientConfig), the AdmissionReview request/response round trip, JSON Patch, reinvocationPolicy IfNeeded & idempotency, matchPolicy/objectSelector/namespaceSelector, sideEffects & dry-run, audit vs enforce

  3. Webhook Failure Modes, Performance & Stability - failurePolicy Fail vs Ignore, timeoutSeconds (10s default, 30s maximum) & p99 latency, the hot write path, anti-patterns (intercepting kube-system, self-validation, missing HA), control plane stability strategies, break-glass

  4. Webhook Certificate Management: CA Bundle & cert-manager - Webhooks as HTTPS servers, SAN <service>.<ns>.svc, caBundle & verification, cert-manager + CA Injector auto-injecting caBundle, zero-downtime rotation, self-signed CA, x509 errors

  5. PodSecurity Admission (PSA): Modes & Profiles - The three modes enforce/audit/warn via namespace labels, version pinning, the three profiles privileged/baseline/restricted with each control in detail, enforce applies to Pods vs audit/warn apply to workload objects, exemptions, replacing PSP

  6. Gatekeeper / Policy Controller (OPA): ConstraintTemplate & Constraint - Webhook + audit controller architecture, ConstraintTemplate (Rego) → Constraint, enforcementAction deny/dryrun/warn, audit loop & status violations, referential constraints, Policy Controller on GKE (Config Sync/fleet/bundles), Gatekeeper vs PSA

  7. ResourceQuota & LimitRange: Namespace Resource Governance - ResourceQuota compute/storage/object-count, quota scoped by PriorityClass, the rule that requests/limits must be declared, LimitRange default/min/max/maxLimitRequestRatio, ordering LimitRanger (mutating) → ResourceQuota (validating)

  8. ValidatingAdmissionPolicy (CEL): In-Process Policy Without Webhooks - VAP (GA 1.30) + Binding + paramRef, CEL variables object/oldObject/request/params/namespaceObject, matchConditions/variables, validationActions Deny/Warn/Audit, why CEL eliminates webhook failure modes, MutatingAdmissionPolicy, engine selection matrix

  9. Organization Policies for GKE & Admission Debugging - Org Policy blocks at the GCP API layer (cluster config) vs Kubernetes admission (Pod config), CEL custom constraints on container.googleapis.com/Cluster & NodePool, debugging via Policy Denied audit logs, dry-run, webhook logs, and apiserver_admission metrics


Chapter 11: GKE Storage: PV/PVC, StorageClasses, CSI Drivers

Why it matters: Storage is where stateful workloads live. Understanding the PV/PVC lifecycle, storage classes, and volume binding prevents data loss and performance bottlenecks.

Prerequisites: Chapter 6, Kubernetes storage concepts

Depth level: 5/5

Chapter 11 Full Index & Learning Paths

Subtopics:

  1. Volume Types & Storage Taxonomy: The Big Picture - Kubernetes volume types (emptyDir, configMap, secret, projected, downwardAPI, hostPath, PVC), Block/File/Object classification, access modes RWO/ROX/RWX/RWOP and their node-vs-pod semantics, storage selection decision framework

  2. PV/PVC Lifecycle & Dynamic Provisioning - The lifecycle provisioning→binding→mounting→releasing→reclaiming, reclaimPolicy Delete vs Retain, end-to-end dynamic provisioning through StorageClass/CSI, volume binding modes Immediate vs WaitForFirstConsumer, default GKE StorageClasses

  3. Persistent Disk CSI: Foundational Block Storage - PD types and the IOPS-capacity relationship, attach/detach and per-node limits, RWX limitations of block devices, synchronous Regional PD replication, snapshots/cloning/expansion, Stateful HA Operator force-attach

  4. Hyperdisk: Next-Generation Block Storage - Decoupling IOPS/throughput from capacity, five types (balanced/extreme/throughput/ml/balanced-ha), per-VM performance limits, Hyperdisk ML multi-attach ROX, Storage Pools thin provisioning, VolumeAttributesClass

  5. Local SSD & Ephemeral Storage: Trading Durability for Speed - Physical NVMe Local SSD, emptyDir and ephemeral storage, data loss rules when a node is recreated, provisioning with ephemeral-storage-local-ssd, appropriate use cases and fatal anti-patterns

  6. Filestore CSI: Shared NFS for ReadWriteMany - When RWX is truly needed, service tiers (BASIC_HDD/SSD, Zonal, Enterprise/Regional), Multishares packing many small PVCs, volume snapshots, NFS tradeoffs in latency/consistency/locking

  7. Cloud Storage FUSE: Object Storage with File Semantics - How FUSE emulates a filesystem on top of GCS, the gke-gcsfuse-sidecar sidecar, Workload Identity, file cache/metadata cache/parallel downloads, non-POSIX semantics, read-heavy AI/ML use cases

  8. Parallelstore & Managed Lustre: Parallel Filesystems for AI/ML - DAOS-based Parallelstore with 2+1 erasure coding and a temporary-storage model, Managed Lustre for HPC, CSI driver, GCS integration, selection framework for Parallelstore/Lustre/Filestore/GCS FUSE

  9. StatefulSets, Volume Expansion & Backup for GKE - StatefulSet volumeClaimTemplates and stable Pod identity, PVCs retained on scale-down, online vs cold volume expansion, Backup for GKE (config + volume backup), differences from PD snapshots, snapshot lifecycle and DR strategy


Chapter 12: GKE Security: Hardening, RBAC, Pod Security

Why it matters: GKE security has many layers. A single misconfiguration can expose the entire cluster. Production hardening is mandatory, not optional.

Prerequisites: Chapters 5 and 10, IAM fundamentals

Depth level: 5/5

Chapter 12 Full Index & Learning Paths

Subtopics:

  1. Security Model & Shared Responsibility: Who Protects What - The GKE shared responsibility model, how the boundary shifts between Standard and Autopilot, seven defense layers (org/project → control plane → identity → node → pod → network → supply chain), threat model and control plane hardening patterns (private cluster, authorized networks)

  2. Authentication & Identity: The Four Gates of a Request - The four-gate request flow, the two-gate IAM ↔ RBAC model, authentication methods (Google identity/OIDC, gke-gcloud-auth-plugin, legacy X.509), legacy vs bound ServiceAccount tokens (TokenRequest, audience-bound, expiring), automountServiceAccountToken: false

  3. RBAC Deep Dive: Role, Binding & Least Privilege - Role vs ClusterRole, binding scope rules, aggregated ClusterRole, mapping IAM predefined roles ↔ RBAC, default roles (view/edit/admin/cluster-admin), anti-patterns (cluster-admin for SAs, wildcards, system:authenticated), verification with kubectl auth can-i

  4. Workload Identity Federation for GKE: No More Long-Lived Keys - The dangers of JSON service account keys, workload identity pool PROJECT_ID.svc.id.goog, principal format, three-step token exchange through the GKE metadata server, direct binding vs legacy annotation, federation with external IdPs

  5. Node Security: Shielded, Confidential, gVisor, COS - Shielded Nodes (Secure Boot, vTPM, Integrity Monitoring), gVisor (runtimeClassName: gvisor, userspace kernel), Confidential Nodes (AMD SEV, memory encryption), Container-Optimized OS (read-only rootfs, seccomp), minimal node service account, metadata concealment

  6. Pod & Workload Security: Pod Security Standards & securityContext - The three Pod Security Standards levels (Privileged/Baseline/Restricted), Pod Security Admission (enforce/audit/warn, namespace label), securityContext field by field (runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, drop capabilities), seccomp RuntimeDefault, AppArmor

  7. Network Policy Security: Default-Deny & East-West - The default-deny pattern, the DNS-blocking trap, Dataplane V2 (Cilium/eBPF), FQDNNetworkPolicy for domain-based egress, Network Policy logging for investigations, lateral movement control

  8. Admission Control Security: Enforcement at the API Gate - Admission as a security enforcement mechanism, the failurePolicy Fail/Ignore trade-off, risks of mutating webhooks, in-tree ValidatingAdmissionPolicy/CEL, OPA/Gatekeeper vs Kyverno, managed Policy Controller and the constraint framework

  9. Binary Authorization: Deploy Only Trusted Images - Attestation model (digest → attestor → signed attestation → policy), Artifact Analysis note, execution path through admission + Binary Authorization API, policy modes (allowlist/require-attestation/dryRun), audited break-glass, Continuous Validation, Cloud Build/SLSA provenance

  10. Audit Logging, Security Posture & Hardening Checklist - The four Cloud Audit Log types (Admin Activity, Data Access, System Event, Policy Denied) and their cost traps, Kubernetes audit logs and forensic queries, GKE Security Posture (config scanning + workload vulnerability scanning), Security Command Center integration, a full seven-layer hardening checklist


Chapter 13: GKE Workload Identity & Service Accounts: Modern Authentication

Why it matters: Workload Identity is the modern mechanism that lets Pods authenticate to Google APIs without long-lived service account keys. If it is misconfigured, Pods cannot access or call Google APIs. Understanding the token exchange flow is therefore key to deploying, operating, and troubleshooting it effectively.

Prerequisites: Chapter 12, IAM service accounts, OIDC basics

Depth level: 5/5

Chapter 13 Full Index & Learning Paths

Subtopics:

  1. Workload Identity Architecture: The Cluster as an OIDC Provider - Each cluster is an independent OIDC issuer, the Workload Identity Pool PROJECT_ID.svc.id.goog as the bridge that lets IAM understand Kubernetes identities, four principal/principalSet identifier forms (by KSA name, by UID, namespace-level, cluster-level), identity sameness across clusters in the same project, Fleet Workload Identity

  2. ServiceAccount Token & Projection Mechanics: Signed Identity - TokenRequest API and bound tokens replacing legacy secret-based tokens, projected volumes with audience/expirationSeconds/path, JWT structure (iss = cluster issuer, aud = sts.googleapis.com, exp, kubernetes.io claim), OIDC issuer endpoint and JWKS so that STS can verify offline, automatic refresh lifecycle

  3. Metadata Server & Token Exchange Path: The Heart of the Mechanism - gke-metadata-server DaemonSet (one Pod per node) intercepting requests to 169.254.169.254, node-level trust boundary and hostNetwork bypass risk, five-step token exchange through Security Token Service, caching/refresh with a 1-hour lifetime, scale bottlenecks (500 conn/node, 3000 SA/cluster, quota of 6000 req/min), network policy egress

  4. IAM Binding Models: Granting Permissions to Workload Identities - Direct model binding roles straight to the KSA principal vs impersonation model via the iam.gke.io/gcp-service-account annotation and roles/iam.workloadIdentityUser, namespace/cluster-level principalSet, cross-project with credential-quota-project, always enabled on Autopilot, return-principal-id-as-email

  5. Workload Identity Federation for External IdPs: Multi-Cloud Identity Federation - Workload Identity Pool + Provider for external IdPs, RFC 8693 token exchange through sts.googleapis.com, supported IdPs (AWS, Entra ID, GitHub Actions, GitLab, Kubernetes, Okta, AD FS, OIDC/SAML), CEL attribute mapping google.subject/attribute.NAME, attribute conditions against confused deputy attacks, direct vs impersonation

  6. Service Access & Application Default Credentials Patterns - ADC behavior and credential lookup order, why client libraries work without code changes, Secret Manager via Workload Identity, KSA-per-workload patterns for Cloud Storage/Pub-Sub/BigQuery, Artifact Registry credential helper, the GOOGLE_APPLICATION_CREDENTIALS anti-pattern, local-vs-cluster ADC differences

  7. Debugging Workload Identity: When Token Exchange Fails - Four-layer procedure (source token, metadata server, STS, IAM binding), debugging from inside the Pod with curl against the metadata server, verifying GKE_METADATA on every node pool, verifying the IAM binding and principal string, token validity check, error decision tree (unable to detect environment, 403, 404, hangs, intermittent errors at scale)


Chapter 14: GKE Observability: Metrics, Logs, Traces

Why it matters: GKE produces a huge, layered volume of telemetry. Knowing which metric lives at which layer, and correlating telemetry to get from symptom to root cause, is a core production skill.

Prerequisites: Chapters 5–13, Cloud Monitoring/Logging basics

Depth level: 5/5

Chapter 14 Full Index & Learning Paths

Subtopics:

  1. Observability Stack (Layered Telemetry & Mental Model) - Three telemetry layers (control plane/system/workload), three signal types (metric/log/trace) each with its own cost model, GKE integration with Cloud Monitoring/Logging/Trace and Managed Prometheus, consistent resource labels as the basis for correlation

  2. Control Plane Metrics (Observing the Cluster's Brain) - API server (request rate/error/latency percentiles, etcd op latency, inflight, admission webhook), scheduler (pending_pods, scheduling attempt duration, preemption), controller-manager (workqueue depth, reconciliation, node eviction), enabling via --monitoring, cost model

  3. System & Workload Metrics (kube-state-metrics, cAdvisor, DCGM GPU) - Node system metrics, kube-state-metrics (kube_* object state), cAdvisor (container_*, CPU CFS throttling, memory working set), DCGM GPU metrics (utilization, framebuffer, power, profiling, XID), cardinality

  4. Application Metrics, Startup Latency & Cost Allocation - Golden signals (rate/error/duration/saturation), auto-instrumentation vs custom metrics, startup latency breakdown (image pull/init/readiness), GKE cost allocation by namespace/label (requested vs consumed), FinOps loop

  5. GKE Logs (System, Workload, Audit & Log Control) - fluent-bit logging agent, log bundles (SYSTEM/WORKLOAD/API_SERVER/...), system component logs, workload stdout/stderr and structured logging, four audit log types (Admin Activity/Data Access/System Event/Policy Denied), Log Router/sinks, exclusion/sampling/retention

  6. Managed Service for Prometheus (PodMonitoring, Rules, PromQL) - Managed collection (gmp-operator, collector DaemonSet scraping targets colocated on the node, rule-evaluator, alertmanager) and push model, PodMonitoring/ClusterPodMonitoring CRDs, Rules/ClusterRules/AlertmanagerConfig, PromQL in Cloud Monitoring, high cardinality and metricRelabeling

  7. Managed OpenTelemetry & Custom Metrics for HPA - Managed OpenTelemetry for GKE (in-cluster OTLP collector, Instrumentation CRD, signal routing), Google-Built OpenTelemetry Collector, custom metrics for HPA (Custom Metrics Stackdriver Adapter vs Prometheus Adapter, not run concurrently), ServiceMonitor/PodMonitor, KEDA integration

  8. Self-Managed Observability (Elastic Stack on GKE) - When to self-operate (data sovereignty, multi-cloud, advanced log analytics, anti-lock-in), Elastic Cloud on Kubernetes (ECK), performance tuning (Hyperdisk, JVM heap 50%/≤31GB, shard sizing, ILM hot-warm-cold), managed vs self-managed decision framework, hybrid pattern

  9. Troubleshooting & Dashboards (Integrating Metrics, Logs, Traces) - GKE dashboard in Cloud Console, workflow correlating dashboard → metric → log → trace through shared resource labels, runbooks (Pod Pending, OOMKill, latency spike, API server overload, node NotReady), SLO/burn-rate alerting to avoid alert fatigue
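
The JVM heap rule of thumb named in subtopic 8 (50% of available memory, never more than 31 GB) can be sketched as a small sizing helper. The function name and the use of GiB as the unit are illustrative assumptions; this is not part of any Elastic or GKE API.

```python
def recommended_heap_gib(memory_gib: float, cap_gib: float = 31.0) -> float:
    """Return half of the available memory, capped at cap_gib.

    Hypothetical helper: it only encodes the "50% / <=31GB" rule of thumb.
    """
    if memory_gib <= 0:
        raise ValueError("memory_gib must be positive")
    return min(memory_gib / 2, cap_gib)


if __name__ == "__main__":
    for mem in (16, 64, 128):
        print(f"{mem} GiB memory -> {recommended_heap_gib(mem):g} GiB heap")
```

Past roughly 62 GiB of memory the cap becomes the binding constraint, so larger nodes mostly add filesystem cache rather than heap.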


Chapter 15: GKE Upgrade Mechanics & Disruption Management

Why it matters: The wrong upgrade strategy leads to a production outage. Understanding upgrade mechanics, release channels, and node draining is the foundation for zero-downtime upgrades.

Prerequisites: Chapters 5, 6, 7, 13

Depth: 5/5

Chapter 15 Full Index & Learning Paths

Subtopics:

  1. Release Channels, Versioning & Version Skew Policy - GKE release channels (Rapid/Regular/Stable/Extended) cadence, auto-upgrade triggers, capping behavior, Kubernetes version skew policy control plane ↔ kubelet, n-2 support model, patch version advance notice

  2. GKE Cluster Upgrade Mechanics: Control Plane, Node Pool & Autopilot - Control plane upgrade zonal vs regional, node pool sequencing, auto-upgrade vs manual upgrade, Autopilot managed upgrade mechanics, rollout sequencing within a fleet, upgrade notifications via Pub/Sub

  3. Node Upgrade Strategies: Surge vs Blue-Green - Surge upgrade (maxSurge/maxUnavailable mechanics, pod scheduling, quota implications), blue-green upgrade (5 phases, parallel pool creation, pod migration, rollback), autoscaled blue-green, choosing a strategy by workload type, concurrent node pool upgrades

  4. Maintenance Windows, Exclusions & Cluster Disruption Budget - Maintenance windows (UTC timezone, RRULE recurrence, 48h/32d requirement), three maintenance exclusion types (no-upgrades/no-minor/no-minor-node), precedence rules, cluster disruption budget for fleets, rollout sequencing patterns

  5. Workload Disruption Readiness: PDB, Annotations & Upgrade Notifications - PodDisruptionBudget semantics (minAvailable/maxUnavailable, 1-hour hard limit, PDB + topology spread), pod-deletion-cost dynamic annotation, safe-to-evict, terminationGracePeriodSeconds + preStop hooks, upgrade notification automation, workload checklist

  6. Troubleshooting Stuck Upgrades & Testing Upgrade Strategy - Diagnose stuck upgrade (PDB blocking, quota exhaustion, node affinity, webhook failures), manual intervention (force drain, rollback blue-green), staging cluster validation, kubectl drain testing, API deprecation checks, post-upgrade validation checklist


Chapter 16: GKE Autopilot Mode - Managed Infrastructure

Why it matters: Autopilot changes how you think about infrastructure. Understanding Autopilot mechanics, resource enforcement, and compute classes helps avoid wasted resources and rejected Pods.

Prerequisites: Chapters 5, 8, 9

Depth: 4/5

Chapter 16 Full Index & Learning Paths

Subtopics:

  1. Autopilot vs Standard (Managed Node Model, Billing, Feature Gaps) - Responsibility boundary, per-Pod vs per-node billing, full feature comparison

  2. Resource Enforcement (Min/Max Requests, CPU:Memory Ratio) - Processing flow when a Pod is submitted, automatic adjustment, minimum/maximum per compute class, CPU:memory ratio enforcement

  3. Compute Classes (Balanced, Scale-Out, Performance, Accelerator) - Mapping VM families, resource limits per class, when to use each class, Custom ComputeClasses

  4. Security Hardening (Pod Security, Privileged Workloads, Org Policy) - Default Pod Security Standards, dropped Linux capabilities, allowlist for privileged workloads, org policy constraints

  5. Spot Pods & Extended Duration Pods - Preemption behavior (25s grace period), design patterns for batch jobs, Extended Duration protection from node upgrades (7 days)

  6. Cluster Upgrades (Zero-Downtime, Surge, Maintenance Windows) - Control plane zero-downtime, surge upgrade strategy, maintenance windows/exclusions, interaction with PDB and Extended Duration Pods

  7. Networking (IP Allocation, VPC-Native, hostPort) - Fixed 32 Pods per node, Pod CIDR sizing, Cloud DNS requirement, hostPort limitations, Dataplane V2/Cilium

  8. Observability (Metrics, Logs, Monitoring) - System metrics available in Autopilot, Managed Prometheus, structured logging, debugging without SSH access

  9. Migration from Standard to Autopilot - Pre-flight check, full incompatibility checklist, blue-green and MCS migration strategies, running Autopilot Pods in Standard clusters


Chapter 17: GKE Multi-Tenancy & Workload Isolation

Why it matters: Multi-tenant GKE saves cost but requires careful isolation. Understanding the boundaries of namespace isolation, resource quotas, and network policies is what lets you design it correctly.

Prerequisites: Chapters 8, 10, 12

Depth: 4/5

Subtopics:

  • Multi-tenancy models: soft (RBAC + NetworkPolicy) vs hard (separate clusters)
  • Namespace isolation: shared vs isolated resources
  • RBAC for multi-tenancy: ClusterRole vs Role, impersonation risks
  • NetworkPolicy for namespace isolation: ingress/egress rules
  • ResourceQuota per namespace: fair allocation
  • LimitRange: default constraints, quota enforcement
  • GKE Sandbox: kernel interception, use cases, overhead
  • Workload separation: dedicated node pools per team
  • Node isolation: sole-tenant nodes, HIPAA/PCI use cases
  • Multi-tenant logging: per-namespace routing
  • Hierarchical Namespace Controller: templates, policy propagation
  • Pod Security Standards per namespace: privilege restriction
  • Cost attribution: per-namespace billing

Chapter 18: GKE Fleet Management & Multi-Cluster Architecture

Why it matters: Production GKE deployments are usually multi-cluster. Fleet management reduces toil for platform teams operating tens to thousands of clusters.

Prerequisites: Chapters 5–17

Depth: 4/5

Subtopics:

  • Fleet concept: logical grouping clusters, hub membership
  • Fleet workload identity: unified identity across clusters
  • Config Sync: GitOps for Kubernetes config, sync from Git/OCI
  • Config Sync architecture: RootSync, RepoSync, reconciler Pods
  • Config Sync sources: Git, OCI, Helm chart
  • Hierarchical repository structure: cluster/namespace/app configs
  • Policy Controller: OPA constraints, audit/enforce modes
  • Multi-Cluster Services (MCS): ServiceImport/ServiceExport, cross-cluster DNS
  • Multi-Cluster Ingress: global load balancing, cross-cluster backends
  • Multi-Cluster Gateway: Gateway API multi-cluster
  • Fleet-based RBAC: member clusters inherit policies
  • Config Controller: manage Google Cloud resources via Kubernetes CRDs
  • Fleet Observability: cross-cluster monitoring dashboards
  • Anthos Service Mesh multi-cluster: cross-cluster traffic, trust federation
  • Network Connectivity Center: hub-and-spoke topology

PART III: NETWORKING & TRAFFIC MANAGEMENT


Chapter 19: VPC Architecture Deep Dive - Subnets, Routes, Firewall

Why it matters: The VPC is the foundation. Misunderstanding the VPC model leads to security gaps, unexpected traffic paths, and routing failures.

Prerequisites: Chapter 3, network fundamentals

Depth: 5/5

Subtopics:

  • VPC as a global resource, subnets as regional resources
  • Subnet primary range vs secondary ranges: sizing strategy
  • Routes: system-generated, static, dynamic via Cloud Router
  • Route priority mechanism: metric evaluation
  • Cloud Router architecture: regional, BGP sessions
  • BGP configuration: ASN, session establishment
  • Route propagation: advertisement, import, filtering
  • Firewall rules evaluation order: ingress/egress, priority
  • Firewall rule matching: Network Tags vs Service Accounts
  • VPC Peering: connectivity, firewall implications
  • Shared VPC: host vs service projects, subnet sharing
  • Private Google Access: routing for Google APIs
  • VPC Flow Logs: sampling, cost, export destinations
  • Network Intelligence Center: topology, connectivity tests
  • VPC Service Controls: perimeter security, access policies

Chapter 20: Cloud Load Balancing - Architecture & Mechanics

Why it matters: The load balancer is the traffic entry point. Misconfiguration leads to uneven distribution, health check failures, and SSL issues.

Prerequisites: Chapter 19, HTTP/HTTPS, TCP fundamentals

Depth: 5/5

Subtopics:

  • GCP LB taxonomy: L4 vs L7, internal vs external, regional vs global
  • Global External Application LB:
    • Anycast, Maglev backend selection
    • Google Front End (GFE)
    • URL Maps: host-based, path-based routing
    • Backend services, health checks, session affinity
    • Cloud CDN, Cloud Armor integration
  • Regional External Application LB: Envoy-based, regional scope
  • Internal Application LB: Envoy inside the VPC, proxy-only subnet
  • Passthrough Network LBs: DSR mode, connection tracking
  • Network Endpoint Groups (NEGs):
    • Zonal NEGs: VM endpoints
    • Serverless NEGs: Cloud Run, App Engine
    • Container-native LB: Pod IP NEGs
    • Health checks: protocol-specific, interval/timeout/threshold
  • GKE Services:
    • LoadBalancer type: External vs Internal
    • NEG-based vs legacy
    • SessionAffinity: ClientIP mode
    • ExternalTrafficPolicy: Local vs Cluster
  • Connection draining: timeout mechanics, graceful shutdown
  • SSL policies: TLS versions, cipher suites
  • Cloud Armor integration: WAF, DDoS protection

Chapter 21: GKE Ingress & Gateway API - Exposing Applications

Why it matters: Ingress and Gateway are how applications are exposed externally. Misconfiguration leads to SSL issues, 502 errors, and security vulnerabilities.

Prerequisites: Chapters 7, 20, Kubernetes Services

Depth: 5/5

Subtopics:

  • GKE LB overview: Gateway vs Ingress vs LoadBalancer Service
  • GKE Ingress (Legacy):
    • Controller reconciliation: Ingress resources → GCP Application LB
    • External vs Internal: annotation differences
    • BackendConfig CRD: health check tuning, Cloud Armor, session affinity, CDN
    • FrontendConfig CRD: SSL policies, HTTPS redirect
    • Multi-cluster Ingress: cross-cluster routing
    • Packet traversal: client → Google edge → pod
  • Gateway API (Recommended):
    • GatewayClass, Gateway, HTTPRoute, TCPRoute, TLSRoute
    • GKE Gateway controller implementation
    • Path matching, header matching, traffic splitting
    • TLS termination: Certificate Manager, managed certs
    • Multi-cluster Gateway: global LB with multi-cluster backends
  • Container-native LB internals: NEGs with Pod IPs, Pod-level health
  • Standalone NEGs: manual management, use cases

Chapter 22: Cloud DNS & Service Discovery

Why it matters: DNS failure is a common cause of microservice outages. Understanding the resolution path helps debug errors such as "connection refused".

Prerequisites: Chapter 4, DNS fundamentals

Depth: 5/5

Subtopics:

  • DNS resolution inside a Pod: /etc/resolv.conf, ndots:5, search domains
  • ndots:5 impact: FQDN lookup path, negative caching, latency
  • CoreDNS in GKE: plugin chain, behavior
  • Kubernetes DNS spec: <service>.<namespace>.svc.cluster.local discovery
  • Headless Services: DNS per Pod
  • ExternalName Services: CNAME resolution
  • NodeLocal DNSCache:
    • DaemonSet, link-local IP 169.254.20.10
    • Cache, fallback, latency reduction
  • Cloud DNS for GKE:
    • Private zones, peering zones
    • Split-horizon DNS
  • DNS debugging: nslookup, dig, CoreDNS logs
  • DNS performance tuning: cache sizing, TTL
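
The ndots:5 lookup path listed above can be made concrete with a short sketch that enumerates the names a stub resolver would try, in order. It models glibc-style behavior as an assumption (other resolvers, musl for example, handle the fallback differently), and the function name and the sample search list are illustrative only.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search expansion.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then fall back to the name as-is.
    return expanded + [absolute]


if __name__ == "__main__":
    # Assumed search list for a Pod in the "default" namespace.
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for query in candidate_queries("api.example.com", search):
        print(query)
```

With two dots and ndots:5, an external name such as api.example.com is only tried as-is after every search domain has been tried, which is why appending a trailing dot (or lowering ndots) removes the extra lookups.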

Chapter 23: Cloud NAT - Port Allocation & Exhaustion Prevention

Why it matters: Cloud NAT port exhaustion is a silent failure: connections drop without a clear error message. Understanding the allocation mechanics is essential for capacity planning.

Prerequisites: Chapter 19, NAT/SNAT fundamentals

Depth: 5/5

Subtopics:

  • Cloud NAT architecture: distributed NAT, Andromeda integration
  • NAT translation: SNAT flow, source IP/port replacement
  • Port allocation modes: static vs dynamic
  • Port math: 64,512 ports per NAT IP / ports-per-VM = max VMs
  • 5-tuple constraint: reuse delay, TCP TIME_WAIT
  • Port exhaustion symptoms: NAT_ALLOCATION_FAILED, connection drops
  • Mitigation: more NAT IPs, connection pooling, keep-alives
  • Cloud NAT with GKE: node VM egress, private cluster setup
  • NAT metrics: port usage, dropped connections
  • NAT rules: custom IP ranges, logging
  • Timeouts: TCP, UDP, ICMP
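
The port math in the list above (64,512 ports per NAT IP divided by ports-per-VM) can be sketched as a capacity helper. It assumes static port allocation and counts ports for a single protocol; the function and constant names are hypothetical.

```python
USABLE_PORTS_PER_NAT_IP = 64_512  # 65,536 ports minus the first 1,024


def max_vms(nat_ips: int, ports_per_vm: int) -> int:
    """Upper bound on VMs served when every VM reserves ports_per_vm ports."""
    if nat_ips < 1 or ports_per_vm < 1:
        raise ValueError("nat_ips and ports_per_vm must be at least 1")
    return (nat_ips * USABLE_PORTS_PER_NAT_IP) // ports_per_vm


if __name__ == "__main__":
    for ips, ports in ((1, 64), (1, 1024), (2, 1024)):
        print(f"{ips} NAT IP(s), {ports} ports/VM -> {max_vms(ips, ports)} VMs")
```

Raising ports-per-VM to protect connection-heavy nodes shrinks the number of VMs each NAT IP can serve by the same factor, so the two settings have to be planned together.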

Chapter 24: Private Service Connect - Modern Service Exposure

Why it matters: PSC is the modern way to expose services without VPC peering. Understanding PSC helps in designing multi-tenant service architectures.

Prerequisites: Chapter 19, VPC Peering concepts

Depth: 5/5

Subtopics:

  • PSC components: Service Attachment (producer), PSC Endpoint (consumer)
  • Service producer → PSC endpoint → consumer VPC connectivity
  • PSC vs VPC Peering: routing, security differences, use cases
  • PSC for Google APIs: private endpoints
  • PSC for GKE control plane: private cluster access
  • PSC for managed services: Cloud SQL, Memorystore, AlloyDB
  • PSC NAT: overlapping IP ranges handling
  • PSC consumer vs producer: IAM, approval workflow
  • PSC global access: cross-region consumers
  • PSC DNS: A record creation
  • Troubleshoot PSC: connectivity tests, flow logs

Chapter 25: Cloud Router & BGP Internals

Why it matters: Cloud Router is the control plane for dynamic routing. A BGP misconfiguration means routes are not advertised or incorrect routes propagate.

Prerequisites: Chapter 19, BGP fundamentals

Depth: 4/5

Subtopics:

  • Cloud Router architecture: regional, BGP sessions
  • eBGP vs iBGP: routing dynamics
  • ASN configuration: private ranges, conflicts
  • BGP session establishment: OPEN, KEEPALIVE, UPDATE
  • Route advertisement: VPC subnets, custom routes
  • Custom route advertisement
  • Route import: from on-premise
  • BGP communities: filtering, tagging
  • BFD: fast failover
  • Cloud Router with Cloud VPN: dynamic routing
  • Cloud Router with Cloud Interconnect: VLAN attachments
  • Multi-regional routing: global vs regional modes
  • Route filtering: import/export policies
  • Monitoring BGP sessions: status, routes

Chapter 26: Cloud Interconnect & Cloud VPN - Hybrid Connectivity

Why it matters: Hybrid connectivity is the foundation of enterprise GCP deployments. Design decisions here affect latency, cost, and security posture.

Prerequisites: Chapter 25, MPLS/WAN networking basics

Depth: 4/5

Subtopics:

  • Cloud VPN:
    • Classic vs HA: redundancy, SLA
    • VPN tunnel mechanics: IKE, ESP
    • Dynamic routing via Cloud Router
    • MTU considerations, TCP MSS clamping
  • Cloud Interconnect:
    • Dedicated vs Partner: bandwidth, latency
    • VLAN attachments: logical connections
    • BGP sessions over Interconnect
    • Redundancy: 99.99% SLA
    • MACsec: L2 encryption
  • Network Connectivity Center: hub-and-spoke
  • Production patterns: active-passive, active-active failover
  • Monitoring: interface metrics, BGP state, packet loss

Chapter 27: Network Security - Firewall Policies, Cloud NGFW, Cloud Armor

Why it matters: Network security is the outer perimeter. A misconfigured firewall exposes sensitive services or blocks legitimate traffic.

Prerequisites: Chapter 19, security fundamentals

Depth: 4/5

Subtopics:

  • VPC firewall rules: stateful, ingress/egress, priorities
  • Hierarchical firewall policies: organization-level enforcement
  • Cloud NGFW:
    • L7 inspection, FQDN rules
    • IDS integration
  • VPC Service Controls: perimeter design, data exfiltration prevention
  • Cloud Armor:
    • WAF rules, OWASP ruleset
    • Adaptive Protection
    • Rate limiting, security policies
  • Cloud IDS: intrusion detection
  • Network Intelligence Center: firewall insights
  • Secure Web Proxy: egress filtering
  • Private NAT: secure egress patterns

PART IV: STORAGE & DATA SYSTEMS


Chapter 28: Cloud Storage - Architecture, Consistency, Performance

Why it matters: GCS is the universal data store. Understanding its consistency model and performance characteristics prevents data races and slow reads.

Prerequisites: Object storage concepts, HTTP/S basics

Depth: 4/5

Subtopics:

  • GCS object model: buckets, objects, generations
  • Strong consistency: post-2021 guarantee
  • Storage classes: Standard, Nearline, Coldline, Archive
  • Location types: multi-region, dual-region, regional
  • Lifecycle management: tiering, deletion
  • Uniform bucket-level access: IAM vs ACLs
  • Signed URLs: V4 signing, expiry
  • Requester Pays
  • Cloud Storage FUSE: POSIX interface
  • Transfer Service: bulk migration
  • VPC Service Controls integration
  • Performance: throughput scaling, parallel uploads

Chapter 29: Persistent Disk & Hyperdisk - Block Storage

Why it matters: Disk type and sizing directly affect application performance. IOPS and throughput limits are often misunderstood.

Prerequisites: Compute Engine basics

Depth: 4/5

Subtopics:

  • PD types: standard, balanced, ssd, extreme — IOPS/throughput
  • Performance caps: formula, VM-level limits
  • Multi-writer disks: limitations
  • Hyperdisk:
    • Types: Balanced, Extreme, ML, Throughput
    • Provisioned performance: capacity + IOPS/throughput
  • Snapshots: incremental, cross-region copies
  • Regional PD: replication, failover
  • Encryption: Google-managed, CMEK

Chapter 30: Filestore & Advanced Storage Options

Why it matters: Filestore provides shared NFS for multi-reader workloads. Misunderstanding the performance tiers leads to I/O bottlenecks.

Prerequisites: Chapter 11, NFS basics

Depth: 3/5

Subtopics:

  • Filestore tiers: Basic HDD, Basic SSD, Enterprise
  • Performance: IOPS and throughput per tier
  • Filestore CSI: dynamic provisioning
  • Multishares: one instance → multiple PVCs
  • Backup: snapshots, recovery
  • Cross-zone: Regional tier, HA

PART V: IAM, SECURITY & COMPLIANCE


Chapter 31: IAM Deep Dive - Model, Propagation, Conditions

Why it matters: IAM is the sole access control mechanism in GCP. Understanding the propagation model and conditions helps prevent privilege escalation.

Prerequisites: Chapter 1, resource hierarchy

Depth: 5/5

Subtopics:

  • IAM policy model: allow vs deny policies
  • Role types: basic, predefined, custom
  • Resource-level vs project-level vs org-level: hierarchy
  • IAM propagation: eventual consistency, caching
  • Condition expressions (CEL): time-based, resource-based
  • IAM Deny: deny before allow evaluation
  • Service accounts: SA key management, impersonation
  • Default service accounts: dangers
  • Audit logging: Admin Activity, Data Access
  • VPC Service Controls: perimeter concept
  • Organization Policies: constraints, inheritance
  • IAM Recommender: least-privilege suggestions

Chapter 32: Secret Manager & Cloud KMS - Secrets & Encryption

Why it matters: Secrets management is a critical security control. Understanding secret storage mechanics and the KMS key hierarchy is needed to design correct encryption strategies.

Prerequisites: Chapter 31, encryption fundamentals

Depth: 4/5

Subtopics:

  • Secret Manager versioning: states, aliases
  • Replication: automatic vs manual
  • Secret rotation: scheduling, Pub/Sub notifications
  • Accessing secrets in GKE: CSI driver, sidecar, init container
  • Secret Manager vs environment variables: trade-offs
  • Cloud KMS:
    • Key hierarchy: Key Ring → CryptoKey → CryptoKeyVersion
    • Key purposes: ENCRYPT_DECRYPT, ASYMMETRIC_SIGN, etc.
    • Protection levels: SOFTWARE, HSM, EXTERNAL
    • Key rotation: automatic, manual
    • Envelope encryption: DEK encrypted by KEK
    • CMEK: customer-managed encryption
    • Cloud EKM: keys managed outside Google
    • Key deletion: 24h soft-delete

Chapter 33: VPC Service Controls & Organization Policies

Why it matters: VPC SC is the primary control for preventing data exfiltration. Org Policies provide guardrails at scale.

Prerequisites: Chapters 31, 32

Depth: 4/5

Subtopics:

  • VPC SC architecture: perimeter, protected resources, access levels
  • Ingress/Egress rules: fine-grained cross-perimeter access
  • Dry-run mode: test before enforce
  • Org Policy constraints: compute.restrictCloudNATUsage, etc.
  • Custom constraints: CEL expressions
  • Policy inheritance: exceptions
  • Policy Troubleshooter: debug denials

Chapter 34: Binary Authorization - Secure Container Deployment

Why it matters: Binary Authorization ensures that only trusted images are deployed. Bypass mechanisms and misconfiguration are real security risks.

Prerequisites: Chapter 32, container basics, GKE admission

Depth: 4/5

Subtopics:

  • Model: policy, attestors, attestations, deployment decision
  • Attestor types: Note resources
  • Attestation: cryptographic signatures, PGP/PKIX signing
  • Cloud Build integration: automated attestation
  • BinAuthz enforcement path: admission → policy evaluation
  • Image digest pinning: why digests matter
  • Continuous validation: re-evaluate, evict non-compliant
  • Dry-run vs enforcement mode: gradual rollout
  • Break-glass override: emergency bypass
  • Policy exceptions: allowlisted images
  • Artifact Analysis integration: CVE scanning

PART VI: MESSAGING & DISTRIBUTED SYSTEMS


Chapter 35: Cloud Pub/Sub - Architecture & Delivery Semantics

Why it matters: Pub/Sub is the messaging backbone. Understanding delivery semantics, ordering, and failure modes prevents duplicate processing and message loss.

Prerequisites: Distributed systems fundamentals, messaging patterns

Depth: 5/5

Subtopics:

  • Pub/Sub distributed log: sharding, replication
  • Message lifecycle: publish → store → deliver → ack
  • Delivery semantics:
    • At-least-once (default)
    • Exactly-once (opt-in, regional constraint)
  • Ack deadline: extension, max 600s
  • Push vs Pull:
    • Pull API: unary vs StreamingPull
    • Push subscriptions: HTTP endpoint, retry mechanics
  • Ordering keys:
    • Per-key ordering guarantee, regional scope
    • Single-region endpoint requirement
  • Dead Letter Topics:
    • Trigger conditions: max delivery attempts
    • DLT subscription: processing dead letters
  • Backpressure patterns:
    • Flow control: maxOutstandingMessages
    • Subscriber scaling with backlog
    • Metrics: undelivered messages, oldest ack age
  • Message schemas: Avro, Protocol Buffers
  • Consumer scaling patterns

Chapter 36: Pub/Sub Regional Failure Behavior

Why it matters: Pub/Sub has a global SLA, but a regional failure can still affect message delivery. Understanding this behavior is needed to design resilient consumers.

Prerequisites: Chapter 35, GCP regions/zones

Depth: 5/5

Subtopics:

  • Pub/Sub storage model: multi-region messages
  • Regional endpoint publishing: the key to ordering
  • Regional failure impact: ordering resumption, redelivery
  • Pub/Sub + Dataflow: exactly-once processing
  • Message deduplication: Pub/Sub role vs application-level
  • Subscriber failover: multiple instances, lease competition
  • Monitoring: error rates, latency spikes
  • Recovery patterns: reprocessing, timestamp seeking

Chapter 37: Eventarc - Event Routing & CloudEvents

Why it matters: Eventarc is a managed event bus. Understanding event routing and the CloudEvents standard is needed to design event-driven architectures correctly.

Prerequisites: Chapter 35, CloudEvents spec basics

Depth: 3/5

Subtopics:

  • Event sources: Audit Logs, Pub/Sub, Cloud Storage
  • Triggers: event filtering, service account requirements
  • Destinations: Cloud Run, GKE, Workflows, Cloud Functions
  • CloudEvents format: context attributes, data
  • Delivery guarantees: at-least-once, retry
  • Dead letter handling
  • Eventarc Advanced: channels, buses, pipelines
  • IAM integration

Chapter 38: Cloud Tasks - Asynchronous Task Execution

Why it matters: Cloud Tasks is a managed task queue for asynchronous work. Understanding retry and rate limiting prevents thundering herds and duplicate execution.

Prerequisites: Distributed systems, HTTP fundamentals

Depth: 3/5

Subtopics:

  • Cloud Tasks vs Pub/Sub: when to use each
  • Task queue model: explicit execution
  • Rate limiting: dispatch rate, max burst
  • Retry: exponential backoff, configurable
  • Task deduplication: ID-based, 1 hour window
  • HTTP targets: authentication
  • Dead letter tasks: logging, alerting
  • Pause/Resume: operational patterns
  • Integration with GKE: HTTP target to service
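
The exponential backoff item in the list above can be illustrated with a schedule calculator. This is a sketch that assumes the doubling-then-linear schedule associated with the queue retry parameters min_backoff, max_backoff, and max_doublings; jitter, max_attempts, and max_retry_duration are deliberately left out, and the function name is hypothetical.

```python
def retry_intervals(min_backoff: float, max_backoff: float,
                    max_doublings: int, attempts: int) -> list[float]:
    """Return the wait, in seconds, before each of the first `attempts` retries."""
    if min_backoff <= 0 or max_backoff < min_backoff or max_doublings < 0:
        raise ValueError("invalid retry parameters")
    intervals = []
    interval = min_backoff
    # After the doublings are used up, the interval grows linearly by this step.
    linear_step = min_backoff * (2 ** max_doublings)
    for attempt in range(attempts):
        intervals.append(min(interval, max_backoff))
        if attempt < max_doublings:
            interval *= 2
        else:
            interval += linear_step
    return intervals


if __name__ == "__main__":
    print(retry_intervals(min_backoff=10, max_backoff=300,
                          max_doublings=3, attempts=8))
```

Capping the interval at max_backoff bounds the delay of late retries, while limiting the number of doublings keeps the early retries from spacing out too quickly.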

PART VII: OBSERVABILITY & RELIABILITY ENGINEERING


Chapter 39: Cloud Monitoring - Metrics, Alerting, SLOs

Why it matters: Cloud Monitoring is the single pane of glass for GCP. Understanding the metrics model, alerting mechanics, and SLO framework is needed to build reliable services and respond quickly.

Prerequisites: Chapter 14, SRE fundamentals

Depth: 5/5

Subtopics:

  • Metrics model: GAUGE, DELTA, CUMULATIVE
  • Metric kinds and value types
  • Monitored resource types
  • Free vs chargeable metrics
  • Managed Service for Prometheus: PromQL, Rule Evaluator
  • Alerting architecture:
    • Policies: conditions, notification channels
    • Condition types: metric threshold, log-based, uptime checks
    • Auto-close, repeat intervals
    • Notification channels: reliability considerations
  • SLO framework:
    • SLI types: availability, latency, quality
    • Compliance periods: rolling vs calendar
    • Error budget: consumption tracking
    • Burn rate: select_slo_burn_rate MQL
    • Multi-window alerting: fast + slow burn
  • Dashboards as code with Terraform
  • USE/RED/Golden Signals methods
  • Alert best practices: false positive reduction
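
The burn-rate and multi-window alerting items above reduce to a small calculation: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed the threshold. The sketch below assumes a threshold of 14.4, a commonly cited fast-burn value that is not taken from this handbook; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only if both windows burn faster than the (assumed) threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 99.9% SLO, 2% errors over the long window and still 2% in the short one.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Same long window, but the short window has already recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Requiring the short window to agree is what lets the alert stop firing soon after the incident ends, which cuts down on stale pages.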

Chapter 40: Cloud Logging - Architecture, Routing, Cost Management

Why it matters: Cloud Logging ingestion can become very expensive if left unmanaged. Understanding the routing architecture lets you send the right logs to the right destination at the right cost.

Prerequisites: Chapter 39

Depth: 4/5

Subtopics:

  • Log types: Platform, User-written, Security logs
  • Audit logs: Admin Activity (free), Data Access (paid), System Event
  • Log router: _Default, _Required buckets, exclusion filters, sinks
  • Sinks: destinations, inclusion/exclusion filters
  • Log-based metrics: counters, distributions
  • Retention: default 30 days, custom up to 3650 days
  • Field exclusions: reduce ingestion cost
  • Advanced queries: MQL-like syntax
  • Log alerting: log-based metric + policy
  • Debug patterns: correlate logs

Chapter 41: Cloud Trace, Profiler, Error Reporting

Why it matters: Distributed tracing and profiling are the tools for debugging latency in microservices. Error Reporting prioritizes bugs by frequency.

Prerequisites: Chapters 39, 40, OpenTelemetry basics

Depth: 3/5

Subtopics:

  • Cloud Trace: trace collection, sampling, storage
  • Trace propagation: HTTP headers, W3C standard
  • OpenTelemetry integration
  • Trace → Logs → Metrics correlation
  • Cloud Profiler: CPU, Heap, Goroutine profiles
  • Continuous profiling overhead
  • Error Reporting: grouping, affected users
  • Error notifications

Chapter 42: SRE Practices on GCP - SLO, Incident Response, Chaos

Why it matters: Applying SRE principles on GCP requires understanding both people and systems. Error budgets, incident response, and toil reduction are practical skills.

Prerequisites: Chapters 39–41, SRE Book concepts

Depth: 5/5

Subtopics:

  • SLI/SLO/Error Budget: crafting meaningful SLIs
  • Error budget policy: freeze on exhaustion
  • Incident response: levels, roles (IC, SME, Comms)
  • Runbooks: machine-readable, regularly tested
  • Blameless postmortems: 5 Whys, contributing factors, action items
  • Blast radius reduction: canaries, circuit breakers, feature flags
  • Graceful degradation: fallback responses, cached data
  • Failure injection (Chaos Engineering):
    • Fault injection via Istio
    • Node/Pod disruption testing
    • Chaos Mesh on GKE
  • Timeout hierarchies: prevent cascading
  • Retry budgets: prevent storms
  • Load shedding: server-side rejection

Chapter 43: GKE Production Debugging Methodology

Why it matters: Debugging production issues requires a systematic approach. "Random kubectl exec" is an antipattern. Build a mental model for structured debugging.

Prerequisites: All GKE chapters

Depth: 5/5

Subtopics:

  • Pod lifecycle debugging:
    • Pending: node affinity, resources, PVC, scheduler events
    • CrashLoopBackOff: exit codes, logs
    • OOMKilled: container vs cgroup, memory leak detection
    • Init container failures
  • Service connectivity debugging:
    • kubectl exec + curl pattern
    • DNS resolution: nslookup, dig
    • NetworkPolicy violations: Hubble
    • ClusterIP routing: iptables/eBPF verification
    • port-forward to bypass the LB
  • Node debugging:
    • NotReady: logs, kubelet, containerd status
    • Disk pressure: df, du, describe
    • CPU throttling: metrics, cgroup limits
    • Network issues: routes, conntrack
  • Control plane debugging:
    • API latency: metrics
    • etcd performance: slow ops
    • Webhook timeouts
    • Scheduler failures: logs, events
  • Cross-cutting debugging:
    • Request tracing: LB → node → pod with Cloud Trace
    • Correlate: access logs + app logs + traces
    • gcloud container operations list
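
For the CrashLoopBackOff and OOMKilled topics above, a first triage step is decoding the container exit code. The sketch below relies on the common shell convention that an exit code above 128 means "terminated by signal (code - 128)"; the helper name is hypothetical, and signal names are resolved for the platform the script runs on (Linux assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short, human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform; report it numerically.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"application error (exit code {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Exit code 137 (SIGKILL) is frequently seen alongside OOMKilled, but SIGKILL has other causes too, so confirm against the container's last-state reason in `kubectl describe pod` before concluding that memory is the problem.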

PART VIII: PLATFORM AUTOMATION & CI/CD


Chap 44: Terraform on GCP — State Management, Modules, IaC Patterns

Why it matters: Infrastructure as Code is non-negotiable for production. Understanding GCP-specific Terraform patterns and state management prevents drift and destructive applies.

Prerequisites: Terraform fundamentals, Chap 1–7

Depth level: 4/5

Sub-topics:

  • Google provider: authentication (ADC, impersonation)
  • GCS backend: remote state, locking
  • State file security: encryption, access, versioning
  • Module design: reusable modules for GKE, VPC, IAM
  • Resource dependencies: explicit vs implicit
  • lifecycle blocks: prevent_destroy, ignore_changes
  • terraform import: existing resource management
  • State manipulation: terraform state mv, rm, show
  • Workspaces: isolated state per environment
  • Drift detection: scheduled terraform plan
  • Testing: terraform validate, terratest, conftest
  • CI/CD integration: Cloud Build pipeline
  • Cost estimation: Infracost
  • Google Cloud Foundation Fabric: reference modules

Chap 45: Cloud Build & Artifact Registry — CI/CD Pipeline

Why it matters: A secure CI/CD pipeline is a critical security control. Understanding the Cloud Build execution model and Artifact Registry security helps prevent supply chain attacks.

Prerequisites: Chap 32, Docker/containers, CI/CD fundamentals

Depth level: 4/5

Sub-topics:

  • Cloud Build architecture: build steps, cloudbuild.yaml, workers
  • Default vs custom service account: least privilege
  • Private worker pools: VPC, peering, no public IP
  • Triggers: Cloud Source Repos, GitHub, GitLab, schedule, webhook
  • Build caching: layer caching, custom (GCS), speed optimization
  • Substitution variables: built-in, custom interpolation
  • Artifact Registry:
    • Docker, Maven, npm, Python, generic repositories
    • Regional, cost-efficient artifact storage
    • Container vulnerability scanning: on-push, continuous
    • SBOM generation
    • Cleanup policies: deletion, protection
  • Remote build provenance: SLSA attestation from Cloud Build
  • Binary Authorization integration
  • Cloud Deploy: managed delivery, canary, blue-green
  • allowedIntegrations org policy

Chap 46: Cloud Deploy & GitOps — Progressive Delivery

Why it matters: Cloud Deploy provides managed CD with built-in approval, rollback, and tracking. Understanding its mechanics is critical for designing safe deployment pipelines.

Prerequisites: Chap 45, Kubernetes Deployments

Depth level: 4/5

Sub-topics:

  • Cloud Deploy model: pipelines, targets, releases, rollouts
  • Pipeline definition: series of targets, promotion flow
  • Target types: GKE, Cloud Run, custom
  • Approval flows: manual gates
  • Rollback mechanics: one-click, automatic
  • Canary deployments: traffic splitting
  • Blue-green deployments: parallel, cutover
  • Deployment verification: post-deploy checks
  • Hooks: pre-deploy, post-deploy, verify
  • Cloud Deploy IAM: deployer, approver roles
  • Notifications: Pub/Sub, Slack
  • Deploy history: audit trail
  • GitOps patterns:
    • Config Sync: GitOps for GKE
    • Syncing from Git/OCI with reconciliation
    • Multi-cluster Config Sync
    • Policy Controller with GitOps
    • Fleet management integration

PART IX: ADVANCED PRODUCTION PATTERNS


Chap 47: GKE Service Mesh — Cloud Service Mesh (Managed Istio)

Why it matters: A service mesh provides mTLS, observability, and traffic management at the infrastructure level. Understanding Istio/Envoy mechanics helps debug connection failures and tune performance.

Prerequisites: Chap 7, 21, microservices patterns

Depth level: 4/5

Sub-topics:

  • Cloud Service Mesh (CSM): managed Istio/Envoy
  • Data plane vs control plane: Envoy sidecars vs Istiod
  • Sidecar injection: automatic, init container iptables rules
  • mTLS: PERMISSIVE vs STRICT, certificate lifecycle, SPIFFE/SVID
  • Traffic management:
    • VirtualService, DestinationRule, Gateway (Istio)
    • Load balancing algorithms, circuit breaking
    • Retries, timeouts
  • Envoy xDS API: CDS, EDS, LDS, RDS, SDS
  • Distributed tracing: trace propagation, sampling
  • Observability: metrics, access logs, Hubble
  • CSM dashboard: SLOs, topology
  • Sidecar resource consumption

Chap 48: Multi-Cluster Architecture & Networking

Why it matters: Multi-cluster is a standard pattern for production GKE deployments. Networking across clusters adds complexity that requires specific patterns.

Prerequisites: Chap 21, 47, Chap 18 Fleet

Depth level: 4/5

Sub-topics:

  • Multi-cluster use cases: HA, DR, data residency, scale
  • Multi-Cluster Services (MCS): ServiceImport/ServiceExport, DNS
  • Multi-Cluster Ingress: global LB, cross-cluster backends
  • Gateway API multi-cluster
  • Workload identity across clusters
  • Cross-cluster service mesh: trust federation
  • Network isolation between clusters: VPC peering
  • DNS peering across clusters

Chap 49: GKE AI/ML Infrastructure — GPU, TPU, Large-Scale Workloads

Why it matters: AI/ML is a dominant workload pattern. GPU/TPU infrastructure has unique characteristics that must be understood to maximize utilization and minimize cost.

Prerequisites: Chap 6, 9, GPU/accelerator fundamentals

Depth level: 4/5

Sub-topics:

  • GPU node pools: machine types, driver installation
  • NVIDIA device plugin: resource scheduling
  • Multi-Instance GPU (MIG): sharing
  • GPU Time-Slicing: oversubscription
  • TPU types: topologies, slice configuration
  • Dynamic Workload Scheduler: gang scheduling
  • ProvisioningRequest: batch job reservations
  • NCCL Fast Socket: inter-GPU communication
  • Multi-NIC Pods: GPUDirect, RDMA
  • InfiniBand networking: A3 clusters
  • HPC clusters: H4D, compact placement
  • Data loading: Hyperdisk ML, Parallelstore
  • LLM serving patterns: vLLM, TGI, Triton
  • DRA (Dynamic Resource Allocation): next-gen scheduling
  • Cost optimization: Spot VMs, preemption handling

Chap 50: GKE Large-Scale Design — 1000+ Nodes

Why it matters: GKE clusters with more than 1,000 nodes have different operational characteristics. Architecture decisions made at creation time affect the scalability ceiling.

Prerequisites: Chap 5, 6, 8, 9

Depth level: 5/5

Sub-topics:

  • GKE scalability limits: max nodes, Pods, Services
  • Planning: node pool sizing, Pod density
  • API server scalability: request rate, watch connections
  • etcd scalability: object count, size, compaction
  • Controller manager reconciliation: queue, workers
  • Scheduler performance: latency at scale
  • IP planning: CIDR sizing, expansion
  • Service mesh scalability: xDS updates, memory overhead
  • NodeLocal DNSCache necessity
  • Node pool strategy: multiple small vs fewer large
  • Workload distribution: topology spread, bin packing
  • Large-scale upgrade: surge sizing, concurrency
  • Network policy scalability: eBPF necessity
  • Metrics cardinality: limits, aggregation
  • Log volume: sampling, exclusion strategies
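
The IP planning topic above comes down to simple CIDR arithmetic, sketched below. The helper names are hypothetical, and the /24-per-node default is an assumption: the actual per-node block depends on the cluster's maximum-Pods-per-node setting, so verify it for your cluster before sizing ranges.

```python
import ipaddress


def max_nodes_for_pod_range(pod_cidr: str, per_node_prefix: int = 24) -> int:
    """Number of per-node blocks (and therefore nodes) a Pod range can hold."""
    network = ipaddress.ip_network(pod_cidr)
    if per_node_prefix < network.prefixlen:
        raise ValueError("per-node block is larger than the Pod range")
    return 2 ** (per_node_prefix - network.prefixlen)


def smallest_pod_prefix(target_nodes: int, per_node_prefix: int = 24) -> int:
    """Longest prefix (smallest Pod range) that still fits target_nodes."""
    if target_nodes < 1:
        raise ValueError("target_nodes must be at least 1")
    bits = (target_nodes - 1).bit_length()  # ceil(log2(target_nodes))
    return per_node_prefix - bits


if __name__ == "__main__":
    print(max_nodes_for_pod_range("10.0.0.0/14"))  # node capacity of a /14
    print(smallest_pod_prefix(1000))               # prefix needed for 1,000 nodes
```

Under the /24-per-node assumption, a /14 Pod range holds 1,024 node blocks, which is why a 1,000+ node design leaves almost no headroom unless the range is wider or the per-node block is smaller.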

Chap 51: Cost Optimization Engineering — Systematic Approach

Why it matters: Cloud costs are a significant operational concern. Systematic cost optimization requires understanding billing mechanics and optimization levers.

Prerequisites: All service chapters

Depth level: 3/5

Sub-topics:

  • Committed Use Discounts (CUDs): 1-year/3-year, resource-based vs flexible
  • Sustained Use Discounts: automatic for Compute Engine
  • Spot VMs: interrupt frequency, cost savings 60–91%
  • GKE cost allocation: namespace-level breakdown
  • Rightsizing: VPA recommendations, insights
  • Idle resource detection: Recommender API
  • Egress costs: inter-region, internet egress optimization
  • Storage tier automation: lifecycle policies
  • Cloud Billing exports: BigQuery analysis
  • Budget alerts: programmatic controls
  • Cost monitoring dashboard patterns

Chap 52: GKE Disaster Recovery & High Availability

Why it matters: DR planning for production systems is critical. GCP provides many options with different cost/complexity trade-offs.

Prerequisites: Chap 9, storage fundamentals

Depth level: 4/5

Sub-topics:

  • RTO vs RPO: definitions, trade-offs
  • Multi-region architecture: active-active, active-passive
  • Backup for GKE: backup plans, restore procedures
  • PD snapshots: cross-region copies
  • GCS geo-redundancy: dual-region, multi-region buckets
  • Database DR: Cloud SQL replicas, Spanner global
  • Config backup: GitOps, Terraform state
  • Multi-region failover testing: chaos at region level
  • DNS failover: health checks, weighted routing
  • DR runbooks: step-by-step procedures
  • RTO validation, data integrity checks

PART X: ADVANCED DEBUGGING & INCIDENT MANAGEMENT


Chap 53: Production GKE Debugging Framework

Why it matters: A structured debugging methodology is essential for resolving production incidents quickly. Understanding telemetry sources and correlation methods is a core skill.

Prerequisites: Chap 39–43

Depth level: 5/5

Sub-topics:

  • Systematic debugging approach: hypothesis → test → validate
  • Information gathering: logs, metrics, events, traces
  • Pod debugging: pending, crashed, hung states
  • Service connectivity: DNS, routing, network policies
  • Node debugging: capacity, health, pressure signals
  • Control plane: API latency, etcd performance
  • Cross-layer correlation: trace → logs → metrics
  • GKE dashboard analysis: cluster health signals
  • Incident timeline reconstruction: event correlation
  • Root cause analysis techniques: Five Whys, fishbone
  • Action items: immediate vs long-term fixes

Chap 54: Incident Response & Post-Mortems

Why it matters: Incident response is a skill that separates good SREs from great ones. A structured approach reduces MTTR and blast radius.

Prerequisites: Chap 39–43, Chap 53

Depth level: 4/5

Sub-topics:

  • Incident classification: severity levels
  • Incident command structure: IC, SME, Comms Lead
  • Detection → Triage → Mitigation → Resolution flow
  • GCP debugging tools during an incident: Logging, Monitoring, Trace
  • Mitigation patterns: rollback, feature flags, circuit breakers
  • Communication: status page, stakeholder updates
  • Blameless postmortem culture
  • Timeline reconstruction, root cause analysis
  • Action items: tracking, follow-up
  • Knowledge sharing: incident readout, runbook updates

PART XI: SPECIAL TOPICS & ADVANCED CONCEPTS


Chap 55: Kubernetes API Machinery Deep Dive

Why it matters: Understanding API server internals, the informer pattern, and the watch mechanism is essential for debugging complex control plane behaviors.

Prerequisites: Chap 5

Depth level: 5/5

Sub-topics:

  • API server request pipeline: auth → authz → admission → storage
  • Watch mechanism: efficient state propagation without polling
  • Informer pattern: list-watch, local cache, resync intervals
  • Controller runtime framework: reconciliation loop patterns
  • API Priority and Fairness (APF): flow schemas, priority levels
  • Etcd consistency: linearizability guarantees, watch caching
  • Resource versioning: optimistic locking, conflict resolution
  • Custom Resource Definitions (CRDs): extensibility mechanism
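
The resource versioning topic above (optimistic locking with conflict resolution) can be illustrated with a toy in-memory store. This is a sketch of the general compare-then-write pattern only: `ToyStore`, `Conflict`, and `update_with_retry` are hypothetical names, and the integer version is a simplification, since a real Kubernetes resourceVersion is an opaque string that clients should not interpret.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class Conflict(Exception):
    """Raised when a write carries a stale resource version."""


@dataclass
class ToyStore:
    """In-memory stand-in for a versioned object store (not a real API client)."""
    _objects: Dict[str, Tuple[int, dict]] = field(default_factory=dict)

    def get(self, name: str) -> Tuple[int, dict]:
        version, obj = self._objects.get(name, (0, {}))
        return version, dict(obj)  # copy so callers cannot mutate stored state

    def update(self, name: str, expected_version: int, obj: dict) -> int:
        current_version, _ = self._objects.get(name, (0, {}))
        if current_version != expected_version:
            raise Conflict(
                f"{name}: expected v{expected_version}, found v{current_version}")
        self._objects[name] = (current_version + 1, dict(obj))
        return current_version + 1


def update_with_retry(store: ToyStore, name: str,
                      mutate: Callable[[dict], None],
                      max_attempts: int = 5) -> int:
    """Read, modify, write; on conflict, re-read the latest state and retry."""
    for _ in range(max_attempts):
        version, obj = store.get(name)
        mutate(obj)
        try:
            return store.update(name, version, obj)
        except Conflict:
            continue
    raise Conflict(f"{name}: gave up after {max_attempts} attempts")


if __name__ == "__main__":
    store = ToyStore()
    update_with_retry(store, "deploy/web", lambda o: o.update(replicas=3))
    print(store.get("deploy/web"))
```

The key design point is that the mutation is re-applied to freshly read state on every attempt, rather than resending the stale object, which is what lets concurrent writers converge without locks.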

Chap 56: Kubernetes Advanced RBAC & Authorization Patterns

Why it matters: RBAC design at production scale requires careful planning. Understanding aggregation, impersonation, and conditions prevents privilege creep.

Prerequisites: Chap 12, Chap 31

Depth level: 4/5

Sub-topics:

  • ClusterRole aggregation: composing roles from multiple roles
  • ClusterRoleBinding vs RoleBinding: scoping semantics
  • Service account impersonation: delegation chains, risks
  • Group binding strategies: LDAP, Google Groups integration
  • Least privilege RBAC: role reduction, time-bound roles
  • Conditions in RBAC: attribute-based access control
  • RBAC for multi-tenancy: namespace isolation
  • Audit: logging RBAC decisions for compliance

Chap 57: GKE with Windows Server Containers

Why it matters: Windows containers on GKE are niche but important for enterprise .NET workloads.

Prerequisites: Chap 5, 6, Windows fundamentals

Depth level: 3/5

Sub-topics:

  • GKE Windows node pool creation
  • Windows CNI considerations: different networking model
  • Image pulling: Windows image registry optimization
  • Resource requests: CPU/memory on Windows
  • Pod disruption: graceful termination handling
  • Monitoring: Windows-specific metrics

Chap 58: Confidential Compute on GKE — AMD SEV & Intel TDX

Why it matters: Confidential computing is an emerging pattern for sensitive workloads. Understanding attestation and performance overhead is critical.

Prerequisites: Chap 6, Chap 32

Depth level: 3/5

Sub-topics:

  • AMD SEV: memory encryption, attestation
  • Intel TDX: Trust Domain Extensions
  • Performance overhead: latency, throughput
  • Use cases: regulated industries, financial
  • Attestation verification: remote attestation
  • Key management in confidential VMs

Chap 59: Managed Prometheus at Scale — Optimization & Troubleshooting

Why it matters: Managed Service for Prometheus is a scalable Prometheus solution, but high cardinality can spike costs and latency.

Prerequisites: Chap 39, Prometheus concepts

Depth level: 4/5

Sub-topics:

  • GMP architecture: globally managed backend
  • PodMonitoring CRDs: configuration patterns
  • Recording rules: pre-compute expensive queries
  • Alert evaluation: Ruler component
  • High cardinality antipatterns: label explosion
  • Metric ingestion costs: active time series billing
  • PromQL performance: query optimization
  • Thanos integration: federation, retention
  • Troubleshooting: query timeout, high cardinality detection
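
The label explosion antipattern listed above follows from multiplication: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses hypothetical function names and made-up label counts purely to show the arithmetic; real series counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def worst_case_series(label_values: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_values.values())


def dominant_label(label_values: Dict[str, int]) -> str:
    """Label with the most distinct values, usually the first one to drop."""
    return max(label_values, key=label_values.get)


if __name__ == "__main__":
    labels = {"pod": 200, "path": 50, "status": 5}  # illustrative counts
    print(worst_case_series(labels))

    labels["user_id"] = 10_000  # one unbounded label multiplies everything
    print(worst_case_series(labels), dominant_label(labels))
```

Adding a single unbounded label such as a user or request ID multiplies the bound by its value count, which is why recording rules that aggregate such labels away are a common mitigation.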

Chap 60: Advanced Cloud Armor WAF Configuration

Why it matters: Cloud Armor is GCP's Web Application Firewall. Tuning rules properly reduces both false positives and false negatives and helps mitigate DDoS attacks.

Prerequisites: Chap 27, security fundamentals

Depth level: 3/5

Sub-topics:

  • WAF rule types: OWASP ruleset, custom rules
  • Adaptive Protection: ML-based DDoS detection
  • Rate limiting: threshold configuration
  • Rule evaluation order: deny/allow decision
  • Signed Cookies: custom domain patterns
  • URL field masking: protecting sensitive data in logs
  • Google-managed rules: automatic updates

PART XII: SPECIAL PRODUCTION RUNBOOKS & TROUBLESHOOTING


Chap 61: GKE Troubleshooting Runbook — Common Issues & Solutions

Why it matters: Common GKE issues require specific debugging steps. Pre-written runbooks reduce MTTR.

Prerequisites: Chap 39–43, 53–54

Depth level: 4/5

Sub-topics:

  • Pod creation failures: troubleshooting checklist
  • Scheduling failures: pending pods resolution
  • Networking issues: connectivity test procedures
  • Storage issues: volume attachment failures
  • Control plane issues: API server latency, etcd health
  • Node issues: NotReady diagnosis
  • Workload Identity failures: token exchange debugging
  • Autoscaling issues: HPA/CA troubleshooting

Chap 62: GKE Cluster Upgrade Runbook — Zero-Downtime Procedures

Why it matters: Cluster upgrades can be disruptive if not done carefully. A proven runbook is essential for production.

Prerequisites: Chap 15, Chap 54

Depth level: 4/5

Sub-topics:

  • Pre-upgrade validation: compatibility checks
  • Node surge strategy: sizing for a stable rollout
  • PDB configuration: ensuring disruption budget
  • Control plane upgrade window: monitoring, rollback triggers
  • Node pool upgrade execution: monitoring, health checks
  • Post-upgrade validation: functionality verification
  • Rollback procedures: emergency rollback steps

Chap 63: GCP Network Troubleshooting Methodology

Why it matters: Network issues are hard to debug. A systematic approach and tool knowledge are essential for resolving them quickly.

Prerequisites: Chap 3, 19–27

Depth level: 4/5

Sub-topics:

  • Connectivity Tests: GCP native tool
  • VPC Flow Logs analysis: flow-level debugging from sampled flow records
  • Firewall rule debugging: evaluation order, matching
  • Route troubleshooting: destination matching, recursive lookup
  • DNS debugging: resolution path, TTL issues
  • NAT issues: port exhaustion diagnosis
  • Load Balancer debugging: backend health, traffic distribution
  • Service Mesh networking: traffic flow through Envoy
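
The NAT port exhaustion diagnosis listed above starts with capacity arithmetic. In the sketch below, the 64,512 usable source ports per NAT IP (65,536 minus the 1,024 well-known ports) and the default of 64 minimum ports per VM are assumptions to verify against the current Cloud NAT documentation and your gateway configuration; the function names are hypothetical.

```python
# Assumption: usable source ports per NAT IP address (65,536 - 1,024).
USABLE_PORTS_PER_NAT_IP = 64_512


def vms_per_nat_ip(min_ports_per_vm: int = 64) -> int:
    """How many VMs one NAT IP can serve at a given per-VM port allocation."""
    if min_ports_per_vm < 1:
        raise ValueError("min_ports_per_vm must be positive")
    return USABLE_PORTS_PER_NAT_IP // min_ports_per_vm


def nat_ips_needed(vm_count: int, min_ports_per_vm: int = 64) -> int:
    """NAT IPs required so every VM receives its minimum port allocation."""
    per_ip = vms_per_nat_ip(min_ports_per_vm)
    return -(-vm_count // per_ip)  # ceiling division


if __name__ == "__main__":
    for ports in (64, 1024):
        print(ports, vms_per_nat_ip(ports), nat_ips_needed(2000, ports))
```

The trade-off is visible in the numbers: raising the per-VM allocation gives each VM more concurrent connections to a single destination, but sharply increases how many NAT IPs a fleet of the same size requires.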

B. SUMMARY & COVERAGE VALIDATION

Coverage Statistics:

  • Total chapters: 63 main chapters
  • Total sub-topics: 800+ detailed sub-topics
  • Total parts: 12 major sections
  • Estimated pages: 1,500–2,000 pages (if printed)
  • Estimated study time: 8–12 months for deep mastery

Covered Domains (100% Coverage):

  ✅ GKE Architecture & Internals (Chapters 5–18)
  ✅ GCP Networking Foundation & Services (Chapters 2–4, 19–27)
  ✅ Storage & Persistence (Chapters 28–30)
  ✅ IAM, Security, Compliance (Chapters 31–34)
  ✅ Messaging & Distributed Systems (Chapters 35–38)
  ✅ Observability & SRE (Chapters 39–43)
  ✅ CI/CD & Automation (Chapters 44–46)
  ✅ Service Mesh & Multi-Cluster (Chapters 47–48)
  ✅ Advanced Workloads (Chapters 49–50)
  ✅ Cost Optimization & DR (Chapters 51–52)
  ✅ Debugging & Incident Response (Chapters 53–54)
  ✅ Advanced Deep-Dives (Chapters 55–60)
  ✅ Production Runbooks (Chapters 61–63)

Advanced "Kill Content" - Staff/Principal Level Topics:

  1. "GKE Packet Path Anatomy: From Container veth Pair to the Internet" - Complete packet trace through every layer
  2. "Cluster Autoscaler Decision Engine: Why Scale-Up Lags by 45 Seconds" - Latency breakdown, node provisioning mechanics
  3. "etcd vs Spanner: GKE Control Plane State Storage" - Backend comparison, consistency implications
  4. "Cloud NAT Port Exhaustion: The Silent Production Killer" - 5-tuple exhaustion, monitoring strategy
  5. "Workload Identity Token Exchange: Every Step in Detail" - JWT validation, STS exchange, security boundaries
  6. "Private Cluster Leak Assumptions: 5 Ways Traffic Gets Exposed" - metadata server, DNS leaks, edge cases
  7. "Pub/Sub Regional Failure: Ordering & Failover Mechanics" - Region-scoped constraints, failover detection
  8. "Binary Authorization in Production: Gaps & Vectors" - Break-glass abuse, attestation replay, workarounds
  9. "SLO Error Budget Burn Rate: Multi-Window Alerting Math" - 2% of the budget in 1h = 14.4x burn rate (for a 30-day SLO period)
  10. "Production GKE Upgrade Runbook: Zero-Downtime Playbook" - Proven procedures, validation gates, rollback

C. RECOMMENDED READING SEQUENCE

Phase 1: Foundation (Weeks 1–4)

  • Chap 1: Resource Hierarchy
  • Chap 2: Jupiter Fabric & Andromeda
  • Chap 3: VPC Model
  • Chap 4: Cloud DNS
  • Chap 31: IAM Deep Dive

Phase 2: GKE Essentials (Weeks 5–12)

  • Chap 5: Control Plane Internals
  • Chap 6: Node Lifecycle
  • Chap 7: Networking Internals
  • Chap 8: Scheduler
  • Chap 9: Autoscaling
  • Chap 10: Admission Control

Phase 3: Production Operations (Weeks 13–24)

  • Chap 39: Cloud Monitoring
  • Chap 40: Cloud Logging
  • Chap 42: SRE Practices
  • Chap 43: Debugging Methodology
  • Chap 54: Incident Response

Phase 4: Advanced Topics (Weeks 25–52)

  • Chap 32–34: Security & Secrets
  • Chap 35–38: Messaging Systems
  • Chap 44–46: Automation & CI/CD
  • Chap 47–50: Advanced Workloads
  • Chap 55–63: Deep-Dives & Runbooks