Managed Service for Prometheus — PodMonitoring, Rules, PromQL

Why this matters in production

Prometheus is the de facto standard for application metrics in the Kubernetes world, but running Prometheus yourself at scale is one of the hardest observability problems. A central Prometheus server that scrapes the whole cluster becomes a single point of failure. It also hits its memory ceiling as the number of time series grows, and it needs complex sharding or federation (Thanos, Cortex, Mimir) to scale and to retain data long term. Google Cloud Managed Service for Prometheus (GMP) solves this by keeping the Prometheus interface (PromQL, the exposition format, prometheus-operator-style CRDs) while replacing the backend with Google's global monitoring infrastructure (Monarch). This separates collection, which runs in the cluster, from storage and querying, which are managed and not capped by cardinality the way a single Prometheus server is.

This is the central file of the metric-pipeline chapter, because almost every metric that is not a default system metric flows through GMP. Control plane metrics (file 02), kube-state-metrics/cAdvisor/DCGM (file 03), and application metrics (file 04) all go through it. Understanding its architecture and CRDs is a prerequisite for collecting metrics correctly. Just as important, it is what keeps a cardinality explosion from burning through your budget.

According to the Managed Service for Prometheus documentation, GMP is enabled by default on new Autopilot clusters (1.25+) and new Standard clusters (1.27+). One core architectural principle is worth remembering from the start: "Google Cloud never directly accesses your cluster to pull or scrape metric data; your collectors push data to Google Cloud." This is a push model, and it is why GMP scales better than a centralized, pull-based Prometheus.

Internal model: managed collection and the push model

The four components of managed collection

When managed collection is enabled, GMP installs four components (per the setup documentation):

gmp-operator (Deployment): manages the lifecycle of the other components and reads the CRDs (PodMonitoring, Rules, ...) to configure the collectors.
collector (DaemonSet): the scraping component. This is the key design point: "Managed collection runs Prometheus-based collectors as a Daemonset and ensures scalability by only scraping targets on colocated nodes." Each collector scrapes only the targets on its own node and then pushes the data to Google Cloud.
rule-evaluator (Deployment): evaluates recording rules and alerting rules.
alertmanager (StatefulSet): routes fired alerts to notification channels.

Why a DaemonSet plus push scales better than centralized Prometheus

This is the most important architectural difference, and it is worth understanding in depth:

Centralized Prometheus (pull): one server, or a few, pulls metrics from every target in the cluster. That server must hold all active time series in memory and run its own discovery. It also has to scrape every Pod over the network, which makes it both a bottleneck and a SPOF. As the cluster grows, the server has to scale vertically (huge amounts of memory) and eventually be sharded by hand.
GMP DaemonSet (push): every node runs one collector, which handles only the targets local to that node. Scrape load is spread across nodes in proportion to the targets each one hosts. Adding a node adds a collector, so collection scales horizontally by construction. No collector holds the full set of series. Instead, the collectors push to the Monarch backend, which handles storage and querying at global scale. There is no SPOF, no vertical scaling, and no single-server cardinality ceiling of the kind a standalone Prometheus has.
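The node-local sharding described above can be modeled in a few lines. The sketch below is a toy model, not GMP's implementation. `Target`, `targets_for_collector`, and `assign_all` are hypothetical names. It assumes each target carries the name of the node its Pod runs on, and shows that one collector per node partitions the targets without any coordination between collectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    pod: str
    node: str  # node the Pod is scheduled on (hypothetical field)


def targets_for_collector(node: str, targets: list[Target]) -> list[Target]:
    """A collector running on `node` keeps only the targets colocated on that node."""
    return [t for t in targets if t.node == node]


def assign_all(nodes: list[str], targets: list[Target]) -> dict[str, list[Target]]:
    """One collector per node (the DaemonSet); each sees only its local targets."""
    return {n: targets_for_collector(n, targets) for n in nodes}
```

Every target lands on exactly one collector, so adding a node adds scrape capacity exactly where the new targets appear. No central server has to see everything.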

The operational consequence is that you mostly stop worrying about "Prometheus running out of memory" or "needing Thanos for long-term storage", because the managed backend handles both. In exchange, cost shifts to a pay-per-ingested-sample model (see the cardinality section below).

GMP has two modes:

Managed collection: Google manages the collectors. This is the recommended mode.
Self-deployed collection: you run the collectors or prometheus-operator yourself but push to the GMP backend. Use this mode when you already have a complex prometheus-operator setup that you want to keep.

Internal model: the PodMonitoring and ClusterPodMonitoring CRDs

GMP uses its own CRDs, which are similar to prometheus-operator's ServiceMonitor/PodMonitor but not identical. PodMonitoring is namespace-scoped and ClusterPodMonitoring is cluster-wide. Both declare which targets to scrape, on which port and path, and how often.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: app-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout-service
  endpoints:
  - port: metrics          # port name (or number) that exposes /metrics
    interval: 30s          # scrape frequency; directly affects cost
    path: /metrics
    metricRelabeling:      # filter/transform metrics BEFORE ingestion
    - sourceLabels: [__name__]
      regex: "go_gc_.*"    # example: drop low-value Go GC runtime metrics
      action: drop
```

Key fields:

selector: a label selector that picks the target Pods. For PodMonitoring, only Pods in its own namespace are selected.
endpoints[]: each endpoint is one scrape configuration. It sets the port, interval, and path, plus optional scheme/tls/basicAuth/authorization/oauth2 settings for targets that require authentication.
interval: the scrape frequency, and a direct cost lever. Scraping every 10s produces three times as many samples as scraping every 30s. For metrics that don't need high resolution, a longer interval saves a significant amount.
metricRelabeling: a set of rules applied before ingestion that drop or keep metrics or rewrite labels. It is the most powerful cardinality and cost control available at the scrape layer.
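To make the `keep`/`drop`/`labeldrop` behavior concrete, here is a minimal Python model of how these rules act on one series. It is a sketch assuming Prometheus relabeling semantics: source label values are joined with the separator (default `;`), and the regex must match the whole string. It is not GMP's code. `apply_relabeling` is a hypothetical name, and only three actions are modeled.

```python
import re
from typing import Optional

Series = dict[str, str]  # label name -> value; the metric name lives in "__name__"


def apply_relabeling(series: Series, rules: list[dict]) -> Optional[Series]:
    """Apply keep/drop/labeldrop rules in order; return None if the series is dropped."""
    for rule in rules:
        pattern = re.compile(rule.get("regex", "(.*)"))
        action = rule["action"]
        if action in ("keep", "drop"):
            sep = rule.get("separator", ";")
            value = sep.join(series.get(name, "") for name in rule["sourceLabels"])
            matched = pattern.fullmatch(value) is not None  # regex is anchored
            if (action == "drop" and matched) or (action == "keep" and not matched):
                return None
        elif action == "labeldrop":
            # Remove every label whose *name* matches the regex.
            series = {k: v for k, v in series.items() if not pattern.fullmatch(k)}
        else:
            raise ValueError(f"action not modeled in this sketch: {action}")
    return series
```

Keep in mind the caveat from the Prometheus documentation: after `labeldrop`, series must still be uniquely labeled. Dropping a label such as `request_path` can collapse many series into one label set unless the metric is also aggregated or dropped.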

ClusterPodMonitoring has the same interface but discovers Pods across all namespaces. Use it for infrastructure metrics that span namespaces; for example, the DCGM exporter is deployed with a ClusterPodMonitoring.

Reserved labels and the `exported_` prefix

Every metric automatically receives the reserved labels project_id, location, cluster, namespace, job, and instance. If an application metric defines a label with the same name, such as cluster, GMP prefixes it with exported_ (so it becomes exported_cluster) to preserve the system label. Keep this in mind when writing PromQL. If a label seems to have "disappeared", it may have been renamed to exported_*.
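The renaming rule can be sketched as a pure function over label names. This is a simplified model of the behavior described above, and `rename_conflicting` is a hypothetical helper. Prometheus keeps adding prefixes if `exported_<name>` is itself already taken, and this sketch ignores that edge case.

```python
RESERVED = ("project_id", "location", "cluster", "namespace", "job", "instance")


def rename_conflicting(app_labels: dict[str, str]) -> dict[str, str]:
    """Prefix application labels that collide with reserved labels with 'exported_'."""
    return {
        (f"exported_{name}" if name in RESERVED else name): value
        for name, value in app_labels.items()
    }
```

For example, an app that emits `cluster="blue"` should be queried with `exported_cluster="blue"`. A plain `cluster` matcher filters on the GKE cluster name instead.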

Internal model: Rules and Alertmanager

GMP supports Prometheus-style rules through CRDs:

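Rules (namespace-scoped): recording rules, which precompute an expensive PromQL expression into a new metric, and alerting rules, which define the conditions that fire an alert. Both are scoped to one namespace.
ClusterRules / GlobalRules: rules scoped to the whole cluster (across namespaces) or globally (across clusters), for cross-namespace metrics.
rule-evaluator evaluates these rules, and alertmanager routes alerts to notification channels (Slack, PagerDuty, email). With managed collection, Alertmanager is configured through a Kubernetes Secret named alertmanager in the gmp-public namespace, not through prometheus-operator's AlertmanagerConfig CRD.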

Recording rules are especially valuable for frequently used dashboards and alerts. A heavy histogram_quantile(...) expression would otherwise be evaluated every time a dashboard loads. Precomputing it into a metric such as job:http_request_duration_seconds:p99 makes queries cheap and fast.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: latency-rules
  namespace: production
spec:
  groups:
  - name: slo
    interval: 30s
    rules:
    - record: job:http_request_duration_seconds:p99
      expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
    - alert: HighLatencyP99
      expr: job:http_request_duration_seconds:p99 > 0.5
      for: 5m
      labels:
        severity: page
```
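The recording rule above precomputes `histogram_quantile`, so it helps to see roughly what that function does with the cumulative `le` buckets. The Python sketch below imitates PromQL's linear interpolation under stated assumptions: all bucket bounds are positive, the last bucket is `le="+Inf"`, and the first bucket's lower bound is taken as 0. It is an illustration and not the Prometheus source.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable when counts are cumulative
```

With buckets `(0.1, 50), (0.5, 90), (1.0, 100), (+Inf, 100)`, the p99 falls in the `(0.5, 1.0]` bucket and interpolates to about 0.95s. The estimate can only be as precise as the bucket layout allows, which is also why very dense bucket sets are a cardinality cost worth questioning.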

Internal model: PromQL in Cloud Monitoring and high cardinality

Querying with PromQL

GMP lets you query metrics with PromQL directly in Cloud Monitoring, where Metrics Explorer has a PromQL mode. You can also query from Grafana by using Cloud Monitoring as a Prometheus-compatible data source through the Prometheus HTTP API frontend. As a result, existing PromQL-based Grafana dashboards run with few or no changes, which is a major reason to choose GMP over Cloud Monitoring's native MQL.
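Because the frontend speaks the Prometheus HTTP API, any client that understands the standard `/api/v1/query` JSON response can consume GMP data. The sketch below builds an instant-query URL and parses a `vector` result. The base URL is an assumption based on our reading of the GMP docs, so verify it against the current Query API reference. The sketch does not send the request; a real call also needs an OAuth2 bearer token. `instant_query_url` and `parse_vector` are hypothetical names.

```python
import json
from urllib.parse import urlencode

# Assumption: GMP's Prometheus-compatible endpoint; verify against current docs.
BASE = "https://monitoring.googleapis.com/v1/projects/{project}/location/global/prometheus"


def instant_query_url(project: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    return BASE.format(project=project) + "/api/v1/query?" + urlencode({"query": promql})


def parse_vector(body: str) -> dict[tuple, float]:
    """Parse a Prometheus API 'vector' response into {sorted label items: value}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected vector, got {data['resultType']}")
    # Each result item is {"metric": {labels}, "value": [timestamp, "string value"]}.
    return {
        tuple(sorted(item["metric"].items())): float(item["value"][1])
        for item in data["result"]
    }
```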

High cardinality: the biggest cost trap

This is the most important operational topic. GMP charges by the number of samples ingested, where:

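samples/minute ≈ (number of time series) × (60 / interval_seconds)
number of time series = number of distinct label-value combinations actually observed (at most the product of each label's distinct-value count)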

A cardinality explosion happens when a label has a large or unbounded number of values. The classic example is an http_requests_total metric with a user_id label. With 1 million users, that metric creates 1 million time series, multiplied further by the other labels. The consequences are threefold:

Cost: millions of series × scrape frequency = millions of samples per minute = an enormous bill.
Query performance: PromQL has to scan many series, so queries are slow.
No value: nobody queries latency per user_id, because that label answers no operational question.
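The two formulas can be turned into a quick calculator. The label counts below are illustrative, not measurements. With `method`, `route`, and `status_code`, the worst case is a few hundred series. Adding `user_id` multiplies that bound by a million. The real number of observed combinations is usually lower, because not every user hits every route, but it grows without limit.

```python
from math import prod


def series_upper_bound(distinct_values: dict[str, int]) -> int:
    """Worst case: every combination of label values is observed."""
    return prod(distinct_values.values())


def samples_per_minute(series: int, interval_s: float) -> float:
    """samples/min ≈ series × (60 / interval)."""
    return series * 60 / interval_s


bounded = {"method": 5, "route": 20, "status_code": 6}  # illustrative counts
print(series_upper_bound(bounded))                           # 600
print(samples_per_minute(600, 30))                           # 1200.0
print(series_upper_bound({**bounded, "user_id": 1_000_000}))  # 600000000
```

The same function shows the interval lever: at 600 series, a 10s interval yields 3× the samples of 30s, and a 5s interval yields 6×.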

How to control it:

Design metrics with bounded labels from the code up (file 04): only method, the normalized route, and status_code. Per-request detail belongs in traces and logs.
Use metricRelabeling drop/keep in PodMonitoring to drop high-cardinality metrics or labels at the collector, before ingestion. This is the line of defense when you can't change the app.
Increase the interval for metrics that don't need high resolution.
Drop unused metrics. Many exporters (kube-state-metrics, runtime exporters) expose hundreds of metrics, so keep only the ones you actually use.

Production architecture patterns

PodMonitoring per team, ClusterPodMonitoring for infrastructure

Multi-tenancy pattern: each team declares its own PodMonitoring in its namespace, and RBAC restricts each team to its own namespace. The platform team manages ClusterPodMonitoring for shared infrastructure metrics. This spreads the authority to declare scrapes across teams while keeping cross-namespace scraping under central control.

metricRelabeling as a "cost firewall"

FinOps pattern: the platform team applies a standard set of metricRelabeling rules to every PodMonitoring through a policy or template. These rules drop low-value runtime/debug metrics such as go_gc_*, process_*, and dense histogram buckets. The result is a "cost firewall" at the scrape layer that keeps junk cardinality out of the backend.
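Here is a minimal sketch of the template step, assuming manifests are handled as plain dicts before they are applied. In practice the same logic could live in a policy engine's mutation rule or a templating pipeline. `STANDARD_DROPS` and `with_cost_firewall` are hypothetical, and the regexes are only examples of what a platform team might choose.

```python
import copy

# Hypothetical standard rule set maintained by the platform team.
STANDARD_DROPS = [
    {"sourceLabels": ["__name__"], "regex": "go_gc_.*", "action": "drop"},
    {"sourceLabels": ["__name__"], "regex": "process_.*", "action": "drop"},
]


def with_cost_firewall(podmonitoring: dict) -> dict:
    """Return a copy of a PodMonitoring manifest with the standard drops appended
    to every endpoint's metricRelabeling. Idempotent: rules are never duplicated."""
    pm = copy.deepcopy(podmonitoring)  # never mutate the caller's manifest
    for endpoint in pm["spec"]["endpoints"]:
        rules = endpoint.setdefault("metricRelabeling", [])
        for rule in STANDARD_DROPS:
            if rule not in rules:
                rules.append(copy.deepcopy(rule))
    return pm
```

Because the function is idempotent, it can run on every apply without piling up duplicate rules. Team-specific rules already present on an endpoint are preserved.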

Recording rules for SLOs and hot dashboards

Reliability pattern: precompute every SLO expression (error rate, latency percentiles) with recording rules evaluated every 30s. Alerts and dashboards then query precomputed metrics instead of heavy expressions, which reduces query load and speeds up dashboards.
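As an illustration of what an error-rate recording rule precomputes, here is a simplified Python version. It assumes raw counter samples over one window. Over the same window, a ratio of `rate()`s equals a ratio of increases, because both divide by the window length. Unlike PromQL's `increase()`, this sketch does not extrapolate to the window edges, and it returns 0.0 when there is no traffic, whereas PromQL would yield NaN. Both function names are hypothetical.

```python
def counter_increase(samples: list[float]) -> float:
    """Increase of a counter over a window, treating any decrease as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so `cur` is the new increase.
        total += cur - prev if cur >= prev else cur
    return total


def error_ratio(errors: list[float], requests: list[float]) -> float:
    """Roughly what sum(rate(errors[5m])) / sum(rate(requests[5m])) precomputes."""
    req = counter_increase(requests)
    return counter_increase(errors) / req if req else 0.0
```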

Real-world scenarios

Scenario 1: the GMP bill grows 10× after a deploy. A team adds a request_path label, holding the full URL with dynamic IDs, to an HTTP metric. Within a week, that metric's time series count grows from 50K to 5 million and the GMP bill grows 10×. Cardinality analysis, which counts active series per metric and then per label, points straight at request_path. The fix has two steps. First, a metricRelabeling rule drops the request_path label immediately to stop the bleeding. Second, the app is changed to normalize paths to route templates (/users/{id} instead of /users/12345).
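The permanent fix in this scenario is normalizing paths in the app. The sketch below shows the idea for a framework that does not expose its matched route template. It assumes numeric IDs and UUIDs are the only dynamic segments, so `normalize_path` and its patterns are hypothetical and would need extending for slugs, hashes, and similar values. When the web framework can report the matched route template, prefer that.

```python
import re

# Hypothetical patterns: numeric IDs and UUIDs become a route placeholder.
_UUID = re.compile(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")
_NUM = re.compile(r"^\d+$")


def normalize_path(path: str) -> str:
    """Collapse dynamic path segments so the label has bounded cardinality."""
    path = path.split("?", 1)[0]  # query strings are unbounded too
    segments = []
    for seg in path.split("/"):
        segments.append("{id}" if _NUM.match(seg) or _UUID.match(seg) else seg)
    return "/".join(segments)
```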

Scenario 2: migrating from self-managed Prometheus. An organization runs self-managed Prometheus + Thanos. The servers frequently OOM as the cluster grows, and maintaining Thanos takes a full-time engineer. The organization moves to GMP managed collection. It keeps its PromQL dashboards and alerting rules, moving the rules into the Rules CRD, and stops operating Prometheus and Thanos entirely. The collector DaemonSet scales horizontally with the nodes, the OOMs stop, and long-term retention requires no operational effort.

Common mistakes / anti-patterns

Unbounded labels in metrics. This is the number-one source of cardinality explosions. Do instead: use bounded labels, and drop high-cardinality labels with metricRelabeling.
Scrape intervals that are too short for everything. A 5s interval on metrics that don't need that resolution produces 6× the samples of a 30s interval for no benefit. Do instead: use 30s as the baseline and shorten it only for metrics that need high resolution.
Keeping every metric an exporter exposes. kube-state-metrics and runtime exporters expose hundreds of metrics, and most go unused. Do instead: keep only the metrics you actually query.
Running two metric adapters at once. The Custom Metrics Stackdriver Adapter and the Prometheus Adapter have overlapping resource definitions, so running both causes errors (file 07). Do instead: pick one.
Not monitoring cardinality itself. You don't find out why the bill went up until it arrives. Do instead: track the number of active series and alert on sudden spikes.
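The spike check in the last item can be expressed as a simple rule: compare the current active-series count with a baseline built from recent history. The sketch below uses the median as the baseline and a 2× threshold. Both choices are assumptions, and `series_spike` is a hypothetical name. In production you would express the same comparison as an alerting rule over whatever series-count metric you track.

```python
from statistics import median


def series_spike(history: list[int], current: int, factor: float = 2.0) -> bool:
    """Flag when active series exceed `factor` times the median of recent history."""
    if not history:
        return False  # no baseline yet
    return current > factor * median(history)
```

The median keeps a single earlier spike from inflating the baseline, which a mean would not.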

GCP-native implementation guidance

Check whether managed collection is enabled:

```bash
gcloud container clusters describe CLUSTER_NAME \
  --location=LOCATION \
  --format="value(monitoringConfig.managedPrometheusConfig.enabled)"
```

Enable managed collection (on an older cluster where it isn't enabled yet):

```bash
gcloud container clusters update CLUSTER_NAME \
  --location=LOCATION \
  --enable-managed-prometheus
```

Apply the PodMonitoring shown earlier, then check what is being ingested by running PromQL in Cloud Monitoring:

```promql
# Count active series per metric name to spot high-cardinality metrics
# (this scans every series, so it can be expensive on large projects)
topk(10, count by (__name__)({__name__=~".+"}))

# p99 latency from the precomputed recording rule
job:http_request_duration_seconds:p99
```
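The `topk` query finds the heaviest metric. The next step, as in Scenario 1, is finding which label inflates it. In PromQL, that is `count(count by (LABEL) (METRIC))` per candidate label. The Python sketch below does both steps over an in-memory list of label sets, for example one exported from the API. `top_metrics` and `label_cardinality` are hypothetical helpers.

```python
from collections import Counter, defaultdict


def top_metrics(series: list[dict[str, str]], k: int = 10) -> list[tuple[str, int]]:
    """Equivalent of topk(k, count by (__name__)(...)) over a list of series."""
    return Counter(s["__name__"] for s in series).most_common(k)


def label_cardinality(series: list[dict[str, str]], metric: str) -> dict[str, int]:
    """Distinct values per label for one metric; the largest count is the suspect."""
    values: dict[str, set[str]] = defaultdict(set)
    for s in series:
        if s["__name__"] == metric:
            for name, value in s.items():
                if name != "__name__":
                    values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}
```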

Official references

Managed Service for Prometheus overview: architecture, data model
Set up managed collection: the four components, the push model
PodMonitoring API reference: the full CRD spec
Rules and alerting: Rules/ClusterRules/GlobalRules, managed Alertmanager
Query with PromQL: PromQL in Cloud Monitoring
Control metric costs / cardinality: metricRelabeling, cardinality

In short, Managed Service for Prometheus keeps the Prometheus interface (PromQL, CRDs) but replaces the backend with Google's global infrastructure. The collector DaemonSet scrapes only colocated targets and pushes them, which removes the SPOF and the vertical scaling of a centralized Prometheus. PodMonitoring and ClusterPodMonitoring declare scrape targets, and Rules together with the managed Alertmanager handle recording and alerting. The biggest trap is high cardinality, because cost is proportional to the number of samples ingested. Bounded labels, metricRelabeling drops, and sensible intervals are therefore mandatory discipline to keep observability from burning through the budget.
