Andromeda: GCP Software-Defined Networking Stack
Why it matters in production
Andromeda is the foundation of all networking in GCP. Every packet your VM sends, every connection to another service, and every load-balancer decision is handled by Andromeda. Understanding how it works lets you:
- Predict network behavior instead of relying on trial and error
- Debug network issues when the application code is not at fault
- Tune network performance instead of leaving the system on its defaults
- Design a security architecture based on a clear understanding of the packet-processing pipeline
A platform engineer or cloud architect does not need to implement Andromeda, but MUST know how it handles traffic, because every decision you make (firewall rules, routing, NAT) depends on this model.
Internal model: How Andromeda processes packets
Problem: Why does Google need SDN?
Before Andromeda (before ~2012), Google had to manually configure each network device (switch, router) to implement network policy. With millions of VMs, network policy changes constantly (VMs created and deleted, firewall rules updated, routes changed), and manual configuration does not scale.
Solution: Software-Defined Networking, which separates the control plane (policy logic) from the data plane (packet forwarding). Google wrote a "network operating system" (Andromeda) that runs on top of the physical hardware and allows programmatic management of the entire network.
Architecture Overview
Andromeda has two main components:
┌─────────────────────────────────────────────┐
│ CONTROL PLANE (Central) │
│ - Policy Management (Firewall, Routes) │
│ - Service Discovery & Load Balancing │
│ - Monitoring & Telemetry │
└────────────┬────────────────────────────────┘
│
gRPC / Protocol Buffers
│
┌────────────▼────────────────────────────────┐
│ DATA PLANE (Node-Local) │
│ - Packet Forwarding │
│ - Connection Tracking │
│ - NAT/IP Translation │
│ - Firewall Enforcement │
└─────────────────────────────────────────────┘
This model enables:
- Centralized policy: "traffic from A to B is allowed" is defined in one place, and the control plane pushes that policy down to every node
- Local enforcement: each node (the physical host running your VMs) enforces policy independently, without depending on the central system at packet time (high availability)
- Stateful connection tracking: each node tracks connections at the endpoint rather than at a middlebox on the path, which tolerates asymmetric routing (outgoing traffic can leave through a different point than incoming traffic arrived)
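The split between centralized policy and local enforcement can be sketched as follows. This is a minimal illustrative model, not Andromeda's actual implementation: the class names `ControlPlane` and `NodeDataPlane`, the `(src, dst)` policy format, and the `online` flag are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeDataPlane:
    """Hypothetical per-host agent that keeps a local copy of the policy."""
    name: str
    allowed: set = field(default_factory=set)  # cached (src, dst) pairs
    version: int = 0

    def apply(self, version, allowed):
        # Replace the local cache with the snapshot pushed by the control plane.
        self.version = version
        self.allowed = set(allowed)

    def permits(self, src, dst):
        # The decision uses only local state; the control plane is not consulted.
        return (src, dst) in self.allowed


class ControlPlane:
    """Hypothetical central policy store that pushes snapshots to every node."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.allowed = set()
        self.version = 0
        self.online = True

    def allow(self, src, dst):
        if not self.online:
            raise RuntimeError("control plane unavailable")
        self.allowed.add((src, dst))
        self.version += 1
        for node in self.nodes:
            node.apply(self.version, self.allowed)


nodes = [NodeDataPlane("host-a"), NodeDataPlane("host-b")]
cp = ControlPlane(nodes)
cp.allow("A", "B")
cp.online = False  # simulate a control plane outage

print(nodes[0].permits("A", "B"))  # True: served from the cached policy
print(nodes[1].permits("B", "A"))  # False: never allowed
```

The point of the sketch is that `permits` never touches `cp`, so an outage stops policy changes but not enforcement of the last pushed policy.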
Data Plane Execution
The Andromeda data plane processes a packet on its way to or from your VM in roughly this order:
1. INGRESS FIREWALL RULES
- Layer 4 match (protocol, port)
- Stateful: if an outbound connection already exists → the inbound response is allowed
- Default: DENY (implied deny rule)
2. ROUTING DECISION
- Subnet routes (local)
- Custom routes (static)
- Dynamic routes (Cloud Router via BGP)
3. NAT (if configured)
- Source NAT (SNAT) for outgoing traffic
- Destination NAT (DNAT) for load balancers
4. FORWARDING
- Hand the packet to the egress interface (physical NIC)
- Tunnel encapsulation (when the destination is on another host)
5. EGRESS FIREWALL RULES
- Layer 4 match
- Implied allow (unlike ingress)
Critical detail: firewall rules are applied AT THE INSTANCE (on the host node), not at the network edge. This means:
- When two VMs in the same subnet communicate over internal IPs, the firewall is still enforced (same-subnet traffic does not bypass it)
- Monitoring is centralized (connection logs from all nodes)
- Every VM gets a full stateful firewall, with no dedicated firewall appliance required
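The rule evaluation described above can be modeled in a few lines. This is a simplified sketch under stated assumptions: the `Rule` fields and the `evaluate` function are hypothetical names, only IP range, protocol, and port are matched (real rules can also match tags and service accounts), and lower priority numbers win with deny beating allow on a tie. The two allow rules are the ones from the Pattern 1 diagram later in this page.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    priority: int                   # lower number is evaluated first
    direction: str                  # "INGRESS" or "EGRESS"
    action: str                     # "allow" or "deny"
    cidr: str                       # source range (ingress) or destination range (egress)
    protocol: str
    ports: frozenset = frozenset()  # empty set means all ports


def evaluate(rules, direction, peer_ip, protocol, port):
    """Return the action of the first matching rule, else the implied rule."""
    peer = ipaddress.ip_address(peer_ip)
    ordered = sorted(
        (r for r in rules if r.direction == direction),
        key=lambda r: (r.priority, r.action != "deny"),
    )
    for rule in ordered:
        in_range = peer in ipaddress.ip_network(rule.cidr)
        port_ok = not rule.ports or port in rule.ports
        if in_range and rule.protocol == protocol and port_ok:
            return rule.action
    # Implied rules: deny all ingress, allow all egress.
    return "deny" if direction == "INGRESS" else "allow"


rules = [
    Rule(1000, "INGRESS", "allow", "0.0.0.0/0", "tcp", frozenset({80, 443})),
    Rule(1000, "INGRESS", "allow", "10.1.0.0/24", "tcp", frozenset({3306})),
]

print(evaluate(rules, "INGRESS", "10.1.0.3", "tcp", 22))    # deny (same subnet, no rule)
print(evaluate(rules, "INGRESS", "10.1.0.3", "tcp", 3306))  # allow
print(evaluate(rules, "INGRESS", "10.2.0.2", "tcp", 3306))  # deny (outside 10.1.0.0/24)
print(evaluate(rules, "EGRESS", "8.8.8.8", "udp", 53))      # allow (implied egress)
```

Note that the first call is denied even though the peer sits in the same subnet, which is exactly the "no same-subnet bypass" behavior.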
VPC Implementation in Andromeda
A VPC network in Andromeda is a logical overlay on top of the physical network:
Physical GCP Datacenter
├── Node A (Compute Host)
│ └── VM1 (10.1.0.10/VPC-A)
│ └── VM2 (10.1.0.11/VPC-B)
│
├── Node B (Compute Host)
│ └── VM3 (10.1.0.12/VPC-A)
│
└── Network Switch Fabric
└── Forwards based on physical MAC/IP
Andromeda decouples the logical network from the physical network:
- The physical network sees: packets with physical MAC addresses and physical IP addresses (on the GCP backbone)
- VMs see: packets with VPC IP addresses and VPC MAC addresses (generated by Andromeda)
How this works:
1. VM1 sends packet: src=10.1.0.10, dst=10.1.0.12
2. Andromeda data plane intercepts
3. Lookup routing table: 10.1.0.12 is on Node B
4. Encapsulate: Add GCP backbone header (physical MAC/IP)
5. Forward via physical network
6. Node B receives, decapsulates, delivers to VM3
Result: the VM is unaware that it lives on a shared physical network. Each VPC is a completely isolated logical network, even when VMs run on the same physical host.
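The six steps above can be sketched as a lookup plus a wrapper. This is an illustrative model only: the `LOCATION` table, the `vpc_id` field in the outer header, and the function names are assumptions for the example, not Andromeda's real wire format. The VM addresses and node names come from the diagram above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InnerPacket:
    src: str          # VPC IP of the sending VM
    dst: str          # VPC IP of the receiving VM
    payload: bytes


@dataclass(frozen=True)
class OuterPacket:
    src_host: str     # physical host that encapsulated the packet
    dst_host: str     # physical host that will decapsulate it
    vpc_id: str       # hypothetical tenant identifier carried in the outer header
    inner: InnerPacket


# (vpc_id, VM IP) -> physical host, as programmed by the control plane.
LOCATION = {
    ("vpc-a", "10.1.0.10"): "node-a",
    ("vpc-b", "10.1.0.11"): "node-a",
    ("vpc-a", "10.1.0.12"): "node-b",
}


def encapsulate(vpc_id, local_host, pkt):
    """Steps 2-4: intercept, look up the destination host, add the outer header."""
    dst_host = LOCATION.get((vpc_id, pkt.dst))
    if dst_host is None:
        raise LookupError(f"{pkt.dst} is not reachable inside {vpc_id}")
    return OuterPacket(local_host, dst_host, vpc_id, pkt)


def decapsulate(host, outer):
    """Step 6: strip the outer header and deliver only within the same VPC."""
    if LOCATION.get((outer.vpc_id, outer.inner.dst)) != host:
        raise LookupError("destination VM is not on this host for that VPC")
    return outer.inner


outer = encapsulate("vpc-a", "node-a", InnerPacket("10.1.0.10", "10.1.0.12", b"hi"))
print(outer.dst_host)                        # node-b
print(decapsulate("node-b", outer).payload)  # b'hi'
```

Because the lookup key includes the VPC, VM2 in VPC-B cannot reach 10.1.0.12 even though it shares Node A with VM1: `encapsulate("vpc-b", ...)` for that address raises `LookupError`.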
Production Architecture Patterns
Pattern 1: Single-Region VPC (Typical for Stateful Apps)
┌─────────────────────────────────────────────┐
│ Region us-central1 (Iowa) │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ VPC Network (10.0.0.0/8) │ │
│ │ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Subnet 1 │ (10.1.0.0/24) │ │
│ │ │ Zone us-c1-a │ │ │
│ │ │ - VM-1 │ (10.1.0.2) │ │
│ │ │ - VM-2 │ (10.1.0.3) │ │
│ │ └────────────────┘ │ │
│ │ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Subnet 2 │ (10.2.0.0/24) │ │
│ │ │ Zone us-c1-b │ │ │
│ │ │ - VM-3 │ (10.2.0.2) │ │
│ │ └────────────────┘ │ │
│ │ │ │
│ │ Firewall Rules: │ │
│ │ - allow: tcp:80,443 from 0.0.0.0/0 │ │
│ │ - allow: tcp:3306 from 10.1.0.0/24 │ │
│ │ - deny: all (implicit) │ │
│ │ │ │
│ │ Routes: │ │
│ │ - 10.0.0.0/8 → local (subnet rts) │ │
│ │ - 0.0.0.0/0 → internet gateway │ │
│ └──────────────────────────────────────┘ │
│ │
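└─────────────────────────────────────────────┘
How Andromeda enforces this:
- Subnet routes (10.1.0.0/24, 10.2.0.0/24) are programmed into the data plane of each node
- Firewall rules are compiled into match/action entries in each host's data plane. (Andromeda uses its own data path rather than iptables; eBPF is what GKE Dataplane V2 uses for Kubernetes network policy, which is a separate layer.)
- Each node enforces rules independently: if the control plane goes down, traffic is still forwarded using cached rules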
- New VMs → control plane sends configuration → data plane starts enforcing
Pattern 2: Multi-VPC with VPC Peering
┌──────────────────┐ ┌──────────────────┐
│ VPC Network A │ │ VPC Network B │
│ 10.1.0.0/16 │ │ 10.2.0.0/16 │
│ │ │ │
│ ┌────────────┐ │ │ ┌────────────┐ │
│ │ VM-A1 │ │ │ │ VM-B1 │ │
│ │ 10.1.1.0 │ │ │ │ 10.2.1.0 │ │
│ └────────────┘ │ │ └────────────┘ │
│ │ │ │
│ Firewall Rules: │ │ Firewall Rules: │
│ allow: 10.2/16 │ │ allow: 10.1/16 │
└──────────┬───────┘ └────────┬─────────┘
│ │
└────────VPC Peering─────┘
(subnet routes exchanged)
How Andromeda implements peering:
- Control plane: "VPC-A and VPC-B are peered" → compute routes
- Data plane push to all nodes:
- Nodes in VPC-A: 10.2.0.0/16 → peering next-hop → reach VPC-B
- Nodes in VPC-B: 10.1.0.0/16 → peering next-hop → reach VPC-A
- Firewall rules: Each VPC independently controls ingress/egress
- Key: No transit traffic (A→B→C not allowed, even if B peered with both A and C)
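The non-transitive behavior follows from what gets exchanged: each VPC imports only its direct peer's own subnet routes, never the routes that peer learned from someone else. A minimal sketch, assuming a hypothetical `effective_routes` helper and a third network C (10.3.0.0/16) added purely for illustration:

```python
def effective_routes(vpc, subnets, peerings):
    """Return {cidr: next_hop} for one VPC: local subnets plus direct peers' subnets."""
    routes = {cidr: "local" for cidr in subnets[vpc]}
    for a, b in peerings:
        peer = b if a == vpc else a if b == vpc else None
        if peer is None:
            continue  # this peering does not involve the VPC
        # Only the peer's own subnets are imported, not its peering routes.
        for cidr in subnets[peer]:
            routes[cidr] = f"peering:{peer}"
    return routes


subnets = {
    "A": ["10.1.0.0/16"],
    "B": ["10.2.0.0/16"],
    "C": ["10.3.0.0/16"],  # hypothetical third network
}
peerings = [("A", "B"), ("B", "C")]

print(effective_routes("A", subnets, peerings))
# {'10.1.0.0/16': 'local', '10.2.0.0/16': 'peering:B'}
print("10.3.0.0/16" in effective_routes("A", subnets, peerings))  # False
```

B sees both A and C, but A has no route to C, so A→B→C traffic has nowhere to go.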
Pattern 3: Shared VPC (Enterprise Multi-Project)
┌──────────────────────────────────────┐
│ Organization │
├──────────────────────────────────────┤
│ HOST PROJECT │
│ └─ VPC Network (10.0.0.0/8) │
│ └─ Subnet (10.1.0.0/24) │
│ ├─ Firewall Rules │
│ └─ Routes │
└──────────────────────────────────────┘
│
Shared VPC attachment (not peering)
│
┌───────┴──────────┬────────────────────┐
│ │ │
▼ ▼ ▼
SERVICE-PROJECT-1 SERVICE-PROJECT-2 SERVICE-PROJECT-3
└─ Can create VMs in Shared VPC
└─ Must have IAM role: roles/compute.networkUser on the host project or subnet
└─ Firewall rules from host project apply
Andromeda's role:
- Host project defines firewall/routes (central policy)
- Service projects attach VMs to the shared subnets
- All data plane enforcement happens uniformly across projects
- Billing: tracked per-project (data transfer), but routing is shared
Real-world Scenarios & Trade-offs
Scenario 1: E-commerce Platform (Global, Multi-Tenant)
├─ Frontend Layer (Global)
│ └─ Anycast IP (via Global Load Balancer)
│ └─ Route user→nearest region (using anycast + BGP)
│
├─ Region: us-east1
│ └─ VPC-us-east
│ └─ App servers (internal IPs)
│ └─ Firewall: allow from GLB only
│
├─ Region: eu-west1
│ └─ VPC-eu-west
│ └─ App servers (internal IPs)
│ └─ Firewall: allow from GLB only
│
└─ Database Layer (Regional, but replicated)
└─ Cloud SQL (managed, not in VPC)
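└─ Private IP endpoint (uses private services access)
Andromeda decisions:
- GLB terminates SSL at nearest PoP
- Interior app-server traffic stays within region (no cross-region latency)
- Database connections use private services access: the Cloud SQL private IP is reached over a peering to a Google-managed service producer network. (Private Google Access is a different feature: it lets VMs without external IPs reach Google APIs, for example through restricted.googleapis.com on 199.36.153.4/30.)
- Firewall rules: allow the documented load balancer health check and proxy source ranges (35.191.0.0/16 and 130.211.0.0/22)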
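When auditing an "allow from GLB only" rule, it helps to check whether a logged source address actually falls in the load balancer ranges. A small sketch, assuming the commonly documented health check and proxy ranges 35.191.0.0/16 and 130.211.0.0/22 (verify against the current GCP documentation for your load balancer type); the function name is hypothetical:

```python
import ipaddress

# Assumed source ranges for Google Cloud load balancer health checks and proxies.
HEALTH_CHECK_RANGES = [
    ipaddress.ip_network("35.191.0.0/16"),
    ipaddress.ip_network("130.211.0.0/22"),
]


def from_load_balancer(source_ip):
    """Return True if the source address falls inside one of the assumed ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in net for net in HEALTH_CHECK_RANGES)


print(from_load_balancer("35.191.10.20"))   # True
print(from_load_balancer("130.211.3.255"))  # True (last address of the /22)
print(from_load_balancer("130.211.4.1"))    # False (just outside the /22)
print(from_load_balancer("203.0.113.9"))    # False
```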
Scenario 2: Microservices with Strict Egress Control
namespace: payment
├─ Pod: payment-processor (internal IP: 10.10.0.50)
│ └─ Allowed outbound: only to Stripe API (egress rule: stripe.com:443)
│ └─ Denied: internet, other services
│
└─ Pod: payment-audit (internal IP: 10.10.0.51)
└─ Allowed outbound: only to Cloud Logging, BigQuery
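└─ Denied: everything else
Andromeda Data Plane:
- Each pod's traffic passes a stateful policy check. On GKE, per-pod rules are enforced by the cluster's network policy layer (for example Dataplane V2), while VPC firewall rules are enforced by Andromeda at the node VM's interface
- Connection tracking: an allowed egress connection permits its response; unsolicited ingress is blocked
- Egress rule for "stripe.com:443": standard VPC firewall rules match IP ranges, not hostnames, so the name has to be resolved to IPs first (or expressed through an FQDN-capable policy feature); the match then happens on the resolved address
- Unexpected traffic is denied immediately at the pod's network interface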
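The "hostname → resolved IP → match" step can be sketched as below. This is a toy model under explicit assumptions: `DNS_SNAPSHOT` is a hypothetical static resolver result using documentation-range addresses (198.51.100.0/24), not real Stripe addresses, and both function names are invented for the example.

```python
import ipaddress

# Hypothetical resolver snapshot. Real addresses differ and change over time.
DNS_SNAPSHOT = {"stripe.com": ["198.51.100.10", "198.51.100.11"]}


def compile_egress_allowlist(hostnames, port, resolver=DNS_SNAPSHOT):
    """Turn hostname rules into (ip, port) entries that a packet filter can match."""
    return {
        (ipaddress.ip_address(ip), port)
        for host in hostnames
        for ip in resolver.get(host, [])
    }


def egress_allowed(allowlist, dst_ip, dst_port):
    """The filter sees only addresses and ports, never the hostname."""
    return (ipaddress.ip_address(dst_ip), dst_port) in allowlist


allowlist = compile_egress_allowlist(["stripe.com"], 443)
print(egress_allowed(allowlist, "198.51.100.10", 443))  # True
print(egress_allowed(allowlist, "198.51.100.10", 80))   # False (wrong port)
print(egress_allowed(allowlist, "203.0.113.5", 443))    # False (not a resolved address)
```

The design consequence is that the allowlist is only as fresh as the last resolution: if the provider's addresses change, the entries must be recompiled or legitimate traffic is dropped.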
Scenario 3: Hybrid Network (On-Prem + Cloud)
┌─────────────┐ Cloud Interconnect ┌──────────────┐
│On-Prem DC │━━━━━━━━━━━━━━━━━━━━━━━│ GCP VPC │
│172.16.0/12 │ (Dedicated VLAN) │10.0.0.0/8 │
└─────────────┘ └──────────────┘
│ │
└─ On-Prem: 172.16.0.0/12 │
└─ GCP: 10.0.0.0/8 │
└─ Cloud Router: BGP advertisement │
└─ Firewall: allow both CIDR blocks │
Andromeda's role:
- Cloud Router (GCP side) receives routes via BGP from on-prem
- Routes: "172.16.0.0/12 via Cloud Interconnect attachment"
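- Data plane: packets whose destination matches 172.16.0.0/12 are forwarded toward the Interconnect VLAN attachment
- Physical network (Jupiter) carries the packets between hosts and the Interconnect edge; encapsulation belongs to the virtualization layer on top of it
- Firewall: rules for hybrid traffic are enforced the same way as for intra-VPC traffic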
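Choosing between the BGP-learned route, the subnet route, and the default route comes down to picking the most specific matching prefix. A minimal longest-prefix-match sketch, assuming a hypothetical `select_route` helper and illustrative next-hop labels (route priority tie-breaking is left out):

```python
import ipaddress


def select_route(routes, dst_ip):
    """Return the next hop of the most specific route that contains dst_ip."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(cidr), next_hop)
        for cidr, next_hop in routes
        if dst in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


routes = [
    ("10.1.0.0/24", "subnet"),
    ("172.16.0.0/12", "interconnect-attachment"),  # learned by Cloud Router via BGP
    ("0.0.0.0/0", "default-internet-gateway"),
]

print(select_route(routes, "172.20.5.9"))  # interconnect-attachment
print(select_route(routes, "10.1.0.7"))    # subnet
print(select_route(routes, "8.8.8.8"))     # default-internet-gateway
```

Without the on-prem route, 172.20.5.9 would fall through to the default route and head for the internet gateway, which is a common symptom of a BGP session that is down.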
Common Mistakes & Anti-Patterns
Mistake 1: Assuming Same-Subnet = Auto-Allow
❌ Wrong thinking:
"VMs in same subnet are automatically reachable - firewall rules don't matter"
✅ Correct understanding:
- Firewall rules ALWAYS apply, even within same subnet
- Default: implied deny for ingress (security by default)
- Ingress must be explicitly allowed on the receiving VM; egress is allowed by default unless a deny rule overrides it
- Andromeda enforces at packet level, regardless of subnet
Impact: Dev deploys application, tries curl vm-b from vm-a, fails. "Same subnet should work!" — No, firewall rule needed.
Prevention: Always review firewall rules even for same-subnet communication. Use gcloud compute firewall-rules list to verify.
Mistake 2: Not Understanding Stateful Connection Tracking
❌ Wrong thinking:
"I need a separate ingress rule to let responses to my outbound connections back in"
✅ Correct understanding:
- Firewall is stateful: if egress to TCP:443 is allowed, the response packets of that connection are automatically allowed back in
- Connection state is tracked in a connection-tracking (conntrack) table in the data plane
- Return traffic for a tracked connection is allowed without needing a rule of its own
Impact: Unnecessary firewall rules bloat, performance confusion
Prevention: Understand connection states (NEW, ESTABLISHED, RELATED). To see which rule matched a given connection, enable Firewall Rules Logging on the rule.
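Connection tracking can be pictured as a table of 5-tuples. The sketch below is a deliberately minimal model: the `ConnTracker` class is hypothetical, and it ignores timeouts, TCP state, and RELATED flows. It shows only why a response needs no rule of its own.

```python
class ConnTracker:
    """Toy connection-tracking table keyed by 5-tuple."""

    def __init__(self):
        self._table = set()

    def record_outbound(self, proto, src, sport, dst, dport):
        # Called when an outbound packet has been allowed by the egress rules.
        self._table.add((proto, src, sport, dst, dport))

    def is_response(self, proto, src, sport, dst, dport):
        # An inbound packet is a response if its reversed 5-tuple was recorded.
        return (proto, dst, dport, src, sport) in self._table


ct = ConnTracker()
ct.record_outbound("tcp", "10.1.0.2", 51000, "203.0.113.5", 443)

print(ct.is_response("tcp", "203.0.113.5", 443, "10.1.0.2", 51000))   # True
print(ct.is_response("tcp", "203.0.113.5", 443, "10.1.0.2", 51001))   # False (different port)
print(ct.is_response("tcp", "198.51.100.9", 443, "10.1.0.2", 51000))  # False (different peer)
```

Only the exact reverse flow is admitted, so the stateful shortcut does not open the port to other senders.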
Mistake 3: Forgetting Egress Internet Gateway
❌ Wrong thinking:
"VM has external IP, can reach internet automatically"
✅ Correct understanding:
- External IP alone isn't enough
- Need: egress allowed by the firewall + a default route (0.0.0.0/0 → default internet gateway)
- Every new VPC network starts with both (an implied allow-egress rule and a system-generated default route), but the route can be deleted and egress can be blocked by a deny rule
Impact: VM has external IP but can't reach internet. Debugging frustration.
Prevention: In a custom VPC, confirm the default route still exists and that no egress deny rule matches the traffic. Check: gcloud compute routes list.
Mistake 4: Misunderstanding VPC Isolation
❌ Wrong thinking:
"VPC-A and VPC-B are completely isolated - no data leakage possible"
✅ Correct understanding:
- VPC-to-VPC isolation depends on peering/sharing configuration
- Firewall rules provide defense-in-depth
- Shared VPC grants cross-project access (intentional architectural decision)
Impact: Security design flaw. Dev enables Shared VPC without understanding implications.
Prevention: Understand IAM boundaries + VPC boundaries. Shared VPC should have explicit network admin role separation.
GCP-native Implementation Guidance
Creating VPC with Andromeda Control
# Create custom VPC (Andromeda will manage)
gcloud compute networks create my-vpc \
--subnet-mode=custom \
--bgp-routing-mode=regional
# Create subnet (Andromeda allocates in data plane)
gcloud compute networks subnets create my-subnet \
--network=my-vpc \
--region=us-central1 \
--range=10.1.0.0/24 \
--enable-flow-logs \
--logging-aggregation-interval=interval-5-sec
# Firewall rule (Andromeda compiles into data plane rules)
gcloud compute firewall-rules create allow-http \
--network=my-vpc \
--direction=INGRESS \
--priority=1000 \
--source-ranges=0.0.0.0/0 \
--target-tags=http-server \
--allow=tcp:80,tcp:443
# Create VM (Andromeda assigns VPC IP, configures firewall)
gcloud compute instances create my-vm \
--zone=us-central1-a \
--network-interface=network=my-vpc,subnet=my-subnet \
--tags=http-server
What Andromeda does behind the scenes:
- Control plane receives "create subnet" → records the subnet's CIDR allocation
- Push to all nodes in region: "10.1.0.0/24 is local to us-central1" (subnets are regional, not zonal)
- Control plane receives "create firewall rule" → compiles it into data plane rules
- Push to target nodes: "VM with tag http-server allows tcp:80 and tcp:443 from 0.0.0.0/0"
- VM starts → assigned 10.1.0.2 → Andromeda data plane attaches the VM's virtual NIC → VM can send traffic
Verifying Andromeda Configuration
# Check VPC network properties
gcloud compute networks describe my-vpc
# List firewall rules (control plane state)
gcloud compute firewall-rules list --filter="network:my-vpc"
# Check VM network interface (data plane state)
gcloud compute instances describe my-vm --zone=us-central1-a \
--format='value(networkInterfaces[0])'
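# Monitor VPC Flow Logs (capture from data plane)
gcloud logging read 'resource.type="gce_subnetwork" AND logName:"compute.googleapis.com%2Fvpc_flows" AND ip_in_net(jsonPayload.connection.src_ip, "10.1.0.0/24")' \
--limit=10
# Trace packet path using Connectivity Tests
# (my-test, PROJECT_ID and DESTINATION_IP are placeholders)
gcloud network-management connectivity-tests create my-test \
--source-instance=projects/PROJECT_ID/zones/us-central1-a/instances/my-vm \
--destination-ip-address=DESTINATION_IP \
--destination-port=80 \
--protocol=TCP
References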
- Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization (USENIX NSDI '18): the original academic paper describing the Andromeda architecture
- GCP VPC Networks Documentation — VPC as implemented by Andromeda
- Creating and Managing VPC Networks — Practical guide to Andromeda configuration
- Firewall Rules Documentation — How Andromeda enforces policy
- VPC Flow Logs — Monitor Andromeda data plane activity
Next: Jupiter Fabric: Spine-Leaf Topology & Oversubscription (the physical infrastructure that Andromeda runs on top of)