Autopilot Observability

Key observability differences in Autopilot

Autopilot limits what you can observe at the node level because nodes are managed infrastructure. This is an important trade-off to understand:

No direct access to node metrics:

No SSH into nodes to run top, iostat, or sar
No custom node-level exporters
No view of the node's /proc or /sys filesystems

Full workload observability is still available:

Container and Pod metrics (CPU, memory, network, disk)
Application logs from stdout/stderr
Custom metrics from the application
Distributed tracing
Kubernetes events and audit logs

The philosophy: node health is Google's responsibility, while workload health is yours.

Metrics in Autopilot

System metrics (automatically available)

GKE automatically collects the following metrics and sends them to Cloud Monitoring:

Pod/Container metrics (from cAdvisor):

Metric	Description
`kubernetes.io/container/cpu/core_usage_time`	Cumulative CPU time used by the container
`kubernetes.io/container/memory/used_bytes`	Memory used by the container
`kubernetes.io/container/memory/limit_bytes`	Memory limit of the container
`kubernetes.io/container/restart_count`	Number of container restarts
`kubernetes.io/container/uptime`	Time the container has been running
`kubernetes.io/pod/network/received_bytes_count`	Network bytes received by the Pod
`kubernetes.io/pod/network/sent_bytes_count`	Network bytes sent by the Pod
`kubernetes.io/pod/volume/total_bytes`	Volume capacity
`kubernetes.io/pod/volume/used_bytes`	Volume space used

Cluster-level metrics:

Metric	Description
`kubernetes.io/node/cpu/allocatable_cores`	Allocatable CPU cores on the node
`kubernetes.io/node/memory/allocatable_bytes`	Allocatable memory on the node
`kubernetes.io/node_daemon/cpu/core_usage_time`	CPU time used by system daemons

Metrics not available in Autopilot

The following metrics are available in Standard but not in Autopilot:

Node-level CPU utilization (node stats cannot be accessed directly)
Disk I/O throughput per node
Network interface stats per node
Process list per node (/proc)

Workaround: For autoscaling decisions, use container-level metrics (container_cpu_usage_seconds_total, container_memory_usage_bytes) instead of node-level metrics.
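Container CPU metrics such as container_cpu_usage_seconds_total are cumulative counters, so a utilization figure has to be derived from two samples. The following is a minimal sketch of that arithmetic, not an official client; the `CpuSample` type and `cpu_utilization` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One reading of a cumulative CPU counter (hypothetical helper type)."""
    timestamp_s: float  # when the sample was taken, in seconds
    cpu_seconds: float  # cumulative CPU time consumed so far, in seconds


def cpu_utilization(prev: CpuSample, curr: CpuSample, requested_cores: float) -> float:
    """Return average CPU usage over the interval as a fraction of the request."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0 or requested_cores <= 0:
        raise ValueError("samples must be ordered and the request must be positive")
    used = curr.cpu_seconds - prev.cpu_seconds
    if used < 0:
        # Counter reset (for example after a container restart): the counter
        # started again from zero, so count only what accumulated since then.
        used = curr.cpu_seconds
    cores_used = used / elapsed
    return cores_used / requested_cores
```

For example, a container that requested 0.5 cores and consumed 15 CPU-seconds in a 60-second window averaged 0.25 cores, which is 50% of its request.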

Managed Service for Prometheus (MSP)

Autopilot fully supports Managed Service for Prometheus, which is the preferred way to collect custom application metrics:

yaml

# PodMonitoring to scrape metrics from Pods
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app-monitoring
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

The MSP collector agent runs as a managed DaemonSet (Google-managed, with no billable resource usage). It collects metrics and sends them to Cloud Monitoring.

bash

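# Check that the PodMonitoring resource exists
kubectl get podmonitoring -n my-app

# List Managed Prometheus metric descriptors through the Cloud Monitoring API
curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type = starts_with("prometheus.googleapis.com/")' \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors"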
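A PodMonitoring resource only helps if the Pod actually serves metrics in the Prometheus text exposition format on the scraped path. As a rough illustration of what that payload looks like, here is a sketch that renders gauges by hand. The `render_gauges` function and the `queue_depth` metric are assumptions for illustration; a real application would normally use a Prometheus client library instead.

```python
from typing import Dict, Tuple


def render_gauges(gauges: Dict[str, Tuple[str, float]]) -> str:
    """Render gauges in the Prometheus text exposition format.

    `gauges` maps a metric name to (help text, current value).
    """
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format is line-oriented, so the payload ends with a newline.
    return "\n".join(lines) + "\n"
```

Serving the returned string from the `/metrics` path on the port named `metrics` would match the PodMonitoring endpoint definition.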

Metrics for HPA with Autopilot

Autopilot supports HPA on custom metrics through Managed Prometheus:

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|queue_depth|gauge  # Custom metric from MSP
      target:
        type: AverageValue
        averageValue: 100

Important: Do not run the Custom Metrics Stackdriver Adapter and the Prometheus Adapter at the same time. In Autopilot, use one or the other, not both.
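To reason about how an HPA with an AverageValue target reacts, it helps to work through the basic formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies it with min/max clamping; the function name is hypothetical, and the tolerance band and stabilization window of the real controller are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count from the basic HPA formula, clamped to [min, max]."""
    if target_avg <= 0:
        raise ValueError("target_avg must be positive")
    raw = math.ceil(current_replicas * current_avg / target_avg)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas, a per-Pod average of 150, and a target of 100, the result is 6 replicas. A per-Pod average of 10 would compute to 1 replica, which is raised to the minimum of 2.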

Logs in Autopilot

Default log collection

Autopilot automatically collects logs from:

Container stdout/stderr → kubernetes.io/k8s-pod log type
Kubernetes system components → GKE system logs
Audit logs → Cloud Audit Logs

You do not need to install a logging agent (Fluentd/Fluent Bit) because Google already runs a managed logging agent on every node.

Structured logging

Autopilot workloads benefit greatly from structured (JSON) logging. Structured logs are parsed automatically and can be filtered by field:

python

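# Python structured logging
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            # RFC 3339 timestamp in the "time" field so Cloud Logging can parse it
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "component": record.name,
        }
        # Add trace context if present (set via extra={"trace": ...})
        trace = getattr(record, "trace", None)
        if trace:
            log_entry["logging.googleapis.com/trace"] = trace
        return json.dumps(log_entry)

logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
# Without an explicit level the logger inherits WARNING and drops INFO logs
logger.setLevel(logging.INFO)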

go

// Go structured logging with zerolog
import "github.com/rs/zerolog/log"

log.Info().
    Str("pod_name", podName).
    Str("namespace", namespace).
    Int("request_id", requestID).
    Msg("Processing request")

When you log JSON, Cloud Logging automatically:

Promotes the severity field to the log severity
Promotes the message field to the log message
Parses the timestamp for time-based queries
Preserves all fields for BigQuery export
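Cloud Logging can also link a log entry to a trace when the JSON payload carries the special fields `logging.googleapis.com/trace`, `logging.googleapis.com/spanId`, and `logging.googleapis.com/trace_sampled`. The following sketch derives those fields from a W3C `traceparent` header. The function name and the `project_id` parameter are assumptions for illustration, and only the common four-part header layout is handled.

```python
import re
from typing import Optional

# version - trace id (32 hex) - parent/span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def cloud_logging_trace(traceparent: str, project_id: str) -> Optional[dict]:
    """Map a W3C traceparent header to Cloud Logging trace fields.

    Returns None when the header is malformed or the trace id is all zeros.
    """
    m = _TRACEPARENT.match(traceparent.strip().lower())
    if m is None or set(m.group(1)) == {"0"}:
        return None
    trace_id, span_id, flags = m.groups()
    return {
        "logging.googleapis.com/trace": f"projects/{project_id}/traces/{trace_id}",
        "logging.googleapis.com/spanId": span_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "logging.googleapis.com/trace_sampled": bool(int(flags, 16) & 0x01),
    }
```

Merging the returned dictionary into each JSON log line lets the Logs Explorer group entries by trace.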

Querying logs in Autopilot

bash

# View logs for a specific Pod
kubectl logs pod/POD_NAME -c CONTAINER_NAME

# View logs in Cloud Logging
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.pod_name="my-pod"' \
  --limit 100

# Filter logs by severity
gcloud logging read \
  'resource.type="k8s_container" AND severity>=ERROR AND resource.labels.namespace_name="production"' \
  --limit 50

# View logs for a Deployment (matched by Pod label)
gcloud logging read \
  'resource.type="k8s_container" AND labels."k8s-pod/app"="my-app"' \
  --format=json
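The filters above all follow the same pattern: a fixed resource type plus optional clauses joined with AND. When filters are generated by tooling, a small builder keeps them consistent. This is a sketch under that assumption; `build_log_filter` is a hypothetical helper, and it does no escaping of the values it receives.

```python
from typing import Optional

# Severity levels defined by Cloud Logging, lowest to highest.
SEVERITIES = ("DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
              "ERROR", "CRITICAL", "ALERT", "EMERGENCY")


def build_log_filter(namespace: Optional[str] = None,
                     min_severity: Optional[str] = None,
                     app_label: Optional[str] = None) -> str:
    """Compose a Cloud Logging filter for GKE container logs.

    Values are inserted as-is, so callers must not pass strings that
    contain double quotes.
    """
    clauses = ['resource.type="k8s_container"']
    if namespace:
        clauses.append(f'resource.labels.namespace_name="{namespace}"')
    if min_severity:
        if min_severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {min_severity}")
        clauses.append(f"severity>={min_severity}")
    if app_label:
        clauses.append(f'labels."k8s-pod/app"="{app_label}"')
    return " AND ".join(clauses)
```

The resulting string can be passed as the filter argument of `gcloud logging read`.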

Log-based metrics for alerting

bash

# Create a metric that counts errors
gcloud logging metrics create error-count \
  --description="Count of ERROR logs in production" \
  --log-filter='resource.type="k8s_container" AND severity="ERROR" AND resource.labels.namespace_name="production"'

# Alerting policy based on the metric
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High error rate" \
  --combiner=OR \
  --condition-display-name="Error count > threshold" \
  --condition-filter='metric.type="logging.googleapis.com/user/error-count" AND resource.type="k8s_container"' \
  --if="> 10" \
  --duration=60s
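A threshold condition with a duration only fires when the value stays above the threshold for the whole window, which filters out short spikes. The sketch below models that behavior on a plain list of samples. It is a simplified illustration with hypothetical names, and it does not reproduce the alignment and aggregation that Cloud Monitoring applies before evaluating a condition.

```python
from typing import Optional, Sequence, Tuple


def threshold_breached(points: Sequence[Tuple[float, float]],
                       threshold: float, duration_s: float) -> bool:
    """Return True if the value stayed above `threshold` for `duration_s`.

    `points` is a time-ordered sequence of (timestamp_s, value) samples.
    """
    breach_start: Optional[float] = None
    for ts, value in points:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach streak
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # streak broken, start over
    return False
```

With a threshold of 10 and a duration of 60 seconds, samples of 12, 15, and 11 taken 30 seconds apart trigger the condition, while a single spike between two low samples does not.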

Kubernetes Events

Events are an important observability signal in Autopilot because they report on scheduling, resource adjustments, and errors:

bash

# View all events in a namespace
kubectl get events -n my-namespace --sort-by='.lastTimestamp'

# Filter events by type
kubectl get events -n my-namespace --field-selector type=Warning

# Watch events in real time
kubectl get events -n my-namespace -w

Events that matter most in Autopilot:

# Resource adjustment event (Autopilot changed the resources)
Type: Normal
Reason: ScaleUpResources
Message: "Scaled up resources for container my-app from cpu=100m to cpu=250m"

# Spot Pod eviction event
Type: Warning
Reason: SpotEviction
Message: "Spot Pod evicted due to Spot VM preemption"

# Extended Duration Pod deferral
Type: Normal
Reason: EvictionDeferred
Message: "Eviction deferred for extended duration Pod"
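When a namespace produces many events, a quick summary by reason is often more useful than the raw list. The following sketch counts Warning events from the JSON that `kubectl get events -o json` prints. The `warning_reasons` function is a hypothetical helper that relies only on the `items`, `type`, and `reason` fields of the event list.

```python
import json
from collections import Counter


def warning_reasons(events_json: str) -> Counter:
    """Count Warning events by reason from `kubectl get events -o json` output."""
    items = json.loads(events_json).get("items", [])
    return Counter(
        event.get("reason", "Unknown")
        for event in items
        if event.get("type") == "Warning"
    )
```

One way to use it is to pipe the kubectl output into a script that calls `warning_reasons(sys.stdin.read()).most_common()`.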

Distributed Tracing with Autopilot

Autopilot fully supports Cloud Trace:

yaml

# Instrumentation resource (OpenTelemetry Operator CRD) for Autopilot workloads
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-app-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
  - tracecontext
  - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"  # Sample 10% traces
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

With Managed OpenTelemetry, you do not need to manage collector infrastructure yourself, because Google provides a managed OTLP endpoint.
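The `parentbased_traceidratio` sampler with argument `0.1` combines two rules: a span with a parent follows the parent's sampling decision, and a root span is sampled based on its trace ID so that about 10% of traces are kept. The sketch below shows one common way to implement this, comparing the low 64 bits of the trace ID against a bound. It is an illustration with hypothetical function names, not the code of any particular OpenTelemetry SDK, and real SDKs may differ in detail.

```python
from typing import Optional

_MASK_64 = (1 << 64) - 1


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & _MASK_64) < bound


def parent_based_sampled(trace_id: int, ratio: float,
                         parent_sampled: Optional[bool] = None) -> bool:
    """Follow the parent's decision if there is one, else sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)
```

Because the decision depends only on the trace ID, every service that sees the same root trace reaches the same conclusion without coordination.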

GKE Dashboard and Cost Allocation

Cloud Console provides a GKE Dashboard with views of:

Cluster workload health
Resource utilization
Pod status distribution

Cost allocation is especially useful in Autopilot: because billing is per Pod, you can see accurate costs by namespace, label, or workload:

bash

# Create a budget with an alert threshold at 80% of current spend
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Autopilot Budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.8,basis=current-spend

GKE Cost Allocation (a separate feature) breaks down costs by:

Namespace → team/project cost attribution
Label → per-service cost
Workload type → Job vs Deployment cost breakdown
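Because Autopilot bills on Pod resource requests, a per-namespace breakdown is essentially a sum over Pods. The sketch below shows that arithmetic. The `PodUsage` type, the function name, and the prices passed in are all hypothetical; actual rates vary by region, compute class, and commitment, so real figures should come from the billing export.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PodUsage(NamedTuple):
    """Requested resources and runtime of one Pod (hypothetical record)."""
    namespace: str
    vcpu_request: float
    memory_gib_request: float
    hours: float


def cost_by_namespace(pods: Iterable[PodUsage],
                      vcpu_hour_price: float,
                      gib_hour_price: float) -> Dict[str, float]:
    """Sum request-based cost per namespace."""
    totals: Dict[str, float] = defaultdict(float)
    for pod in pods:
        hourly = (pod.vcpu_request * vcpu_hour_price
                  + pod.memory_gib_request * gib_hour_price)
        totals[pod.namespace] += hourly * pod.hours
    return dict(totals)
```

The same grouping works for labels or workload types by swapping the key used in `totals`.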

Debugging workloads in Autopilot

Because there is no SSH access to nodes, debugging has to go through Kubernetes abstractions:

Ephemeral debug containers

bash

# Attach a debug container to a running Pod
kubectl debug -it pod/MY_POD --image=busybox --target=MY_CONTAINER

# Debug a crashing Pod by copying it and overriding the container command
kubectl debug pod/CRASHED_POD -it --copy-to=debug-pod --container=my-container -- /bin/sh

# Debug a node (limited in Autopilot)
kubectl debug node/NODE_NAME -it --image=ubuntu

Troubleshooting common Autopilot issues

Pod Pending because of resource adjustment:

bash

kubectl describe pod POD_NAME
# Look for Events with reason "ScaleUpResources" or "ComputeClassRejected"

Pod rejected for exceeding the maximum:

bash

# The error message appears at apply time
kubectl apply -f pod.yaml
# ERROR: ...resource requests exceeded the maximum allowed...

Network connectivity issue:

bash

# Test connectivity from a Pod
kubectl exec -it pod/MY_POD -- wget -O- http://target-service:port

# Check the NetworkPolicy
kubectl describe networkpolicy MY_POLICY -n MY_NAMESPACE

# View Hubble flows
kubectl exec -it -n kube-system ds/hubble-relay -- hubble observe --namespace MY_NS
