Jupiter Fabric: Spine-Leaf Topology & Bandwidth Oversubscription
Why it matters in production
Andromeda is the logical networking layer, but it runs on top of a physical network infrastructure: the Jupiter Fabric. Understanding Jupiter helps you:
- Predict performance bottlenecks: know the maximum throughput a single VM can reach
- Design multi-zone deployments: understand latency and throughput between zones
- Troubleshoot tail latency: when a performance drop is not caused by application code
- Optimize bandwidth usage: know when to group VMs within a zone and when to spread them across zones
Jupiter Fabric determines the physical packet path, and Andromeda determines the logical routing path. Only by combining the two do you get a complete picture of network behavior.
Internal Model: Spine-Leaf Datacenter Architecture
Traditional Three-Tier Network (On-Prem / Legacy)
   ┌─────────────────────────┐
   │       Core Layer        │
   │   (High-End Switches)   │
   │   10-100Gbps capacity   │
   └────────────┬────────────┘
                │
       ┌────────┴────────┐
       │                 │
    ┌──▼──┐           ┌──▼──┐
    │Agg1 │           │Agg2 │   (Aggregation Switches)
    └──┬──┘           └──┬──┘
       │                 │
   ┌───┴───┐         ┌───┴───┐
   │       │         │       │
 ┌─▼─┐   ┌─▼─┐     ┌─▼─┐   ┌─▼─┐
 │LS1│   │LS2│     │LS3│   │LS4│   (Leaf Switches)
 └─┬─┘   └─┬─┘     └─┬─┘   └─┬─┘
   │       │         │       │
   [Servers]         [Servers]
Problems:
- Oversubscription: Aggregation → Core bottleneck
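- 1Gbps to each server, but the core carries only 10Gbps in total, so more than ten busy servers already oversubscribe it
- Cross-pod traffic is throttled because every cross-pod flow must cross the limited core
Jupiter: Modern Spine-Leaf Architecture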
┌────────────────────────────────────────────────────────────┐
│                        SPINE LAYER                         │
│  Spine1      Spine2      Spine3      Spine4      Spine5    │
│  200Gbps     200Gbps     200Gbps     200Gbps     200Gbps   │
└────┬───────────┬───────────┬───────────┬───────────┬───────┘
     │           │           │           │           │
     │           │           │           │           │   (25Gbps per link)
     │           │           │           │           │
┌────┴───────────┴───────────┴───────────┴───────────┴───────┐
│                  LEAF LAYER (top-of-rack)                  │
│  Leaf1       Leaf2       Leaf3       Leaf4       Leaf5     │
│  48×25G      48×25G      48×25G      48×25G      48×25G    │
│  48 Svrs     48 Svrs     48 Svrs     48 Svrs     48 Svrs   │
│  (50Gbps)    (50Gbps)    (50Gbps)    (50Gbps)    (50Gbps)  │
└────────────────────────────────────────────────────────────┘
Every leaf connects to every spine; the vertical links are drawn one-to-one only for readability.
Key Characteristics:
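- Non-blocking spine layer: any server can reach any other at its full 25Gbps link rate, as long as the leaf uplinks are not saturated
- Many equal paths: Leaf-Spine-Leaf provides ECMP (Equal-Cost Multi-Path)
- Oversubscription within a leaf: the design target is ~2:1 (server-facing capacity to uplink capacity). The link speeds in the diagram are illustrative and do not multiply out to that ratio: 48 × 50G = 2400Gbps down against 5 × 25G = 125Gbps up
- Oversubscription is justified because most traffic is short-lived and servers rarely saturate their links simultaneously
Why Spine-Leaf (Instead of 3-Tier)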
Problem with 3-tier: bottleneck at the aggregation/core tiers
- Server → Leaf: 50Gbps (full bisection)
- Leaf → Aggregation: 10Gbps (oversubscribed 5:1)
- Cross-pod traffic gets one-fifth of the bandwidth
Solution with Spine-Leaf: equal bandwidth at every tier
- Server → Leaf: 25Gbps
- Leaf → Spine → Leaf: 25Gbps (via ECMP, load-balanced)
- Any-to-any: same bandwidth regardless of destination
Cost trade-off:
- 3-tier: fewer switches overall, but the core switches are expensive (high-end)
- Spine-Leaf: more spine switches, but commodity switches at every tier (cheaper per port)
At Google's scale, Spine-Leaf is both cheaper and faster.
Jupiter Fabric Details
Per-Server Connectivity
Server (Compute Host)
├─ 2x 25Gbps NICs (redundancy)
│ ├─ NIC1: connected to Leaf1 (25Gbps)
│ ├─ NIC2: connected to Leaf2 (25Gbps)
│ └─ Both on the same subnet (LAG / bonding)
│
└─ Sustained throughput: 50Gbps (25Gbps per leaf, aggregated)
Why 2 NICs?
- Redundancy: If Leaf1 fails, traffic still flows via Leaf2
- Capacity: Combined 50Gbps throughput per server
- Network diversity: Traffic not concentrated on single leaf
Routing Inside the Fabric (Spine-Leaf Forwarding)
When a server sends a packet:
1. Packet arrives at Leaf switch
   - ECMP: Multiple equal-cost paths to destination exist
   - Load balancing: Hash on (src_IP, dst_IP, src_port, dst_port)
   - Selects one of N spine switches
2. Spine switch receives
   - Examines dest MAC → determine destination leaf
   - Forwards to destination leaf (via next hop)
3. Destination leaf receives
   - Examines dest MAC → determine destination server port
   - Forwards to server
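The path selection in step 1 can be sketched as follows. This is a minimal illustration, not Jupiter's actual hash function: the name `pick_spine`, the use of SHA-256, and the five-entry `SPINES` list are assumptions chosen to keep the example deterministic and self-contained.

```python
import hashlib

# Assumed fabric size for illustration (matches the five spines in the diagram).
SPINES = ["Spine1", "Spine2", "Spine3", "Spine4", "Spine5"]


def pick_spine(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
               spines: list[str] = SPINES) -> str:
    """Map a flow's 4-tuple to one spine, the way an ECMP hash would."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an unsigned integer bucket index.
    index = int.from_bytes(digest[:4], "big") % len(spines)
    return spines[index]


# Every packet of one flow hashes to the same spine, so packets stay in order.
first = pick_spine("10.1.0.2", "10.1.0.3", 40000, 443)
again = pick_spine("10.1.0.2", "10.1.0.3", 40000, 443)

# Many flows between the same two hosts spread across the spines.
used = {pick_spine("10.1.0.2", "10.1.0.3", port, 443)
        for port in range(40000, 40200)}
print(first == again, sorted(used))
```

Hashing per flow rather than per packet is the design choice that matters here: a single flow can never use more than one path, so one large flow is capped by a single uplink even when the other uplinks are idle.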
Example:
src=10.1.0.2, dst=10.1.0.3
Hash(10.1.0.2, 10.1.0.3, port1, port2) = Spine2
Route: Leaf1 → Spine2 → destination leaf (if both servers sit under the same leaf, that leaf forwards the packet locally and no spine is involved)
Bandwidth Oversubscription Implication
Jupiter is designed around a nominal 2:1 oversubscription at the leaf level, but the raw port arithmetic looks worse:
Leaf Switch Capacity:
├─ Downlinks (to servers): 48 × 50Gbps = 2400Gbps
├─ Uplinks (to spines): 5 × 100Gbps = 500Gbps
└─ Raw ratio: 2400:500 = 4.8:1 (worse than the nominal 2:1)
Why this is tolerable in practice:
├─ Traffic that stays within the leaf never touches the uplinks → up to 2400Gbps
├─ Traffic between leaves is limited by the uplinks (500Gbps shared)
├─ Realistically, all 48 servers do not saturate their links simultaneously
└─ Design assumes ~25% average utilization → 600Gbps per leaf sustained, of which at most 500Gbps can be cross-leaf
Implications for production:
- ✅ Single VM: Can achieve 25-50Gbps if isolated
- ✅ Many VMs on same leaf, local traffic: each VM gets a share, and the aggregate is bounded by the leaf's local switching capacity rather than by the uplinks
- ⚠️ All VMs sending cross-leaf simultaneously: Congestion, packet drops, tail latency
- Mitigations: Traffic scheduling, bandwidth reservation, burst allowance
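The leaf capacity figures above can be reproduced with a short calculation. This is a sketch under simplifying assumptions: the function names are hypothetical, and `cross_leaf_share_gbps` assumes uplink capacity is split evenly among senders, which real ECMP hashing only approximates.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Server-facing capacity divided by spine-facing capacity."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def cross_leaf_share_gbps(nic_gbps: float, uplinks: int, uplink_gbps: float,
                          active_senders: int) -> float:
    """Even split of uplink capacity among senders, capped at the NIC rate."""
    fair_share = (uplinks * uplink_gbps) / active_senders
    return min(nic_gbps, fair_share)


ratio = oversubscription_ratio(48, 50, 5, 100)       # 2400 / 500
one_sender = cross_leaf_share_gbps(50, 5, 100, 1)    # capped by the NIC
all_senders = cross_leaf_share_gbps(50, 5, 100, 48)  # 500 / 48
print(f"ratio={ratio:.1f}:1 one={one_sender:.1f}Gbps all={all_senders:.1f}Gbps")
```

With one active sender the NIC is the limit; with all 48 servers sending cross-leaf, each gets roughly a fifth of its NIC rate, which is the congestion case flagged in the list above.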
Physical vs Logical Topology
Logical (What Andromeda sees):
└─ VPC A: 10.1.0.0/24
├─ Subnet: 10.1.0.0/24
└─ All VMs connected logically (flat)
Physical (Jupiter sees):
└─ Datacenter topology
├─ Leaf switches (rack-level)
├─ Spine switches (fabric-level)
└─ Servers in racks
├─ Server1 (10.1.0.2) @ Rack5, connected to Leaf5
└─ Server2 (10.1.0.3) @ Rack7, connected to Leaf7
Packet from Server1 to Server2:
- Andromeda: Route 10.1.0.2 → 10.1.0.3 (logical, local subnet)
- Jupiter: Forward Leaf5 → Spine → Leaf7 (physical path)
- Result: the packet takes the physical path, but the VMs are unaware of leaf/spine details
Production Architecture Patterns
Pattern 1: High-Throughput Workload (Batch Processing)
Requirement: Move 100GB dataset in 1 minute
┌───────────────────────────┐
│ GCS Bucket (multi-region) │
└─────────────┬─────────────┘
              │ 25Gbps per source
              │
┌─────────────▼────────────────────┐
│ Compute VMs (4 instances)        │
├─ VM1: 10.1.0.10 @ Leaf1 (25Gbps) │
├─ VM2: 10.1.0.11 @ Leaf2 (25Gbps) │
├─ VM3: 10.1.0.12 @ Leaf1 (25Gbps) │   (Note: VMs distributed across leaves)
└─ VM4: 10.1.0.13 @ Leaf3 (25Gbps) │
Throughput breakdown:
- Ideal: 4 × 25Gbps = 100Gbps total
- Reality: I/O contention and spine limits → ~80Gbps sustained
- 100GB = 800Gb, and 800Gb ÷ 80Gbps = 10 seconds (well within the 1-minute requirement)
Jupiter impact: VMs distributed across leaves to avoid leaf-level oversubscription bottleneck.
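The transfer-time arithmetic for this pattern can be checked with a small helper. The function name `transfer_seconds` is hypothetical, and the sketch assumes decimal units (1GB = 8Gb) and ignores protocol overhead.

```python
def transfer_seconds(size_gigabytes: float, throughput_gbps: float) -> float:
    """Time to move a dataset, converting bytes to bits first."""
    size_gigabits = size_gigabytes * 8
    return size_gigabits / throughput_gbps


ideal = transfer_seconds(100, 4 * 25)   # four VMs at 25Gbps each
realistic = transfer_seconds(100, 80)   # ~80Gbps sustained
print(f"ideal={ideal:.1f}s realistic={realistic:.1f}s")
```

The step that is easy to miss is the byte-to-bit conversion: dataset sizes are quoted in bytes while link rates are quoted in bits per second, so the result is off by a factor of eight if the two are divided directly.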
Pattern 2: Database Replication (Low Throughput, Latency-Sensitive)
┌───────────────────────┐
│ Primary DB (Zone A)   │
│ IP: 10.1.0.100        │
│ @ Leaf1               │
├───────────────────────┤
│ Replication Channel   │
│ Throughput: 10Mbps    │
│ Latency SLA: <5ms     │
└───────────┬───────────┘
            │
            │ (Leaf1 → Spine2 → Leaf2: ~50μs physical latency)
            │
┌───────────▼───────────┐
│ Replica DB (Zone B)   │
│ IP: 10.2.0.100        │
│ @ Leaf2               │
└───────────────────────┘
Jupiter impact: low throughput is fine, but latency is critical.
- Spines designed for <50μs latency
- ECMP routing provides consistent low latency (no long queues)
- Replication doesn't cause fabric congestion (small data volume)
Pattern 3: Multi-Zone Failover Architecture
Primary (Zone A):
├─ Application @ Leaf1
├─ Database @ Leaf3
└─ Replication link: 10Mbps
Standby (Zone B):
├─ Application @ Leaf2
├─ Database @ Leaf4
└─ Ready to take over
Failover: Primary data center down
├─ Standby Application: DNS updated → connects to Standby Database
├─ Jupiter fabric: Routes traffic from Zone B to Zone B internally
├─ Latency: <5ms (intra-zone via spines)
├─ Throughput: Full 25Gbps available
└─ Recovery: No cross-zone congestion
Jupiter benefit: each zone has its own isolated fabric, so a failover does not impact other zones.
Common Mistakes & Anti-Patterns
Mistake 1: Assuming All Servers Equal Throughput to All Others
❌ Wrong thinking:
"All VMs in same zone have equal 25Gbps to all other VMs"
✅ Correct understanding:
- Single VM: can achieve ~25Gbps egress to a single destination
- 2 VMs on the SAME leaf sending cross-leaf: each limited to ~12.5Gbps when their flows share one 25Gbps uplink
- 2 VMs on DIFFERENT leaves: can achieve near 25Gbps each (separate uplinks)
- Many-to-many: aggregate spine bandwidth becomes the bottleneck
Impact: the application is assumed to scale linearly, but network throughput flattens out at higher concurrency.
Prevention: Benchmark actual throughput in your zone. Use iperf3 between VMs, measure across topology.
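A benchmark run can be checked against the expected rate with a small script. This is a sketch, not a complete harness: the helper names are hypothetical, the 80% tolerance is an arbitrary assumption, and the field path `end.sum_received.bits_per_second` reflects the JSON report that `iperf3 --json` produces for TCP tests, so verify it against your iperf3 version. The `sample` string is a trimmed, made-up report.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbps from an iperf3 JSON report."""
    report = json.loads(iperf3_json)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9


def below_expected(measured_gbps: float, expected_gbps: float = 25.0,
                   tolerance: float = 0.8) -> bool:
    """Flag a result that falls under `tolerance` of the expected rate."""
    return measured_gbps < expected_gbps * tolerance


# Made-up report fragment: two VMs sharing an uplink might land near 12.4Gbps.
sample = '{"end": {"sum_received": {"bits_per_second": 12400000000.0}}}'
measured = received_gbps(sample)
print(f"{measured:.1f}Gbps, below expected: {below_expected(measured)}")
```

Running the same check for same-leaf, cross-leaf, and cross-zone VM pairs gives a rough map of where throughput actually drops in your deployment.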
Mistake 2: Expecting Zero Packet Loss at High Utilization
❌ Wrong thinking:
"Google infrastructure never drops packets, so I can run links at 99.9% utilization"
✅ Correct understanding:
- Spine switches: non-blocking, rarely drop packets
- Leaf switches: oversubscribed, may drop under extreme load
- TCP backoff: lost packets cause retransmissions and throughput degradation
- Target: <0.01% packet loss across all region pairs (per Google SLA)
Impact: sustained 90%+ leaf utilization causes visible packet drops and retransmits.
Prevention: keep sustained utilization below 70% and allow only short bursts above that. Monitor switch fabric metrics.
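The difference between a short burst and sustained high utilization can be made concrete with a simple check over utilization samples. This is a minimal sketch: the function name, the five-sample window, and the sample series are all assumptions for illustration, and the samples would come from whatever egress metric you already collect.

```python
def sustained_over_threshold(samples: list[float], threshold: float = 0.70,
                             window: int = 5) -> bool:
    """True if `window` consecutive samples all exceed `threshold`."""
    run = 0
    for utilization in samples:
        # Extend the current run on a hot sample, reset it otherwise.
        run = run + 1 if utilization > threshold else 0
        if run >= window:
            return True
    return False


# Made-up series: short spikes versus a long hot stretch.
bursty = [0.40, 0.90, 0.95, 0.50, 0.60, 0.92, 0.30]
hot = [0.75, 0.80, 0.85, 0.90, 0.91, 0.60]
print(sustained_over_threshold(bursty), sustained_over_threshold(hot))
```

Alerting on consecutive samples rather than on a single peak avoids paging on bursts that the fabric is designed to absorb.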
Mistake 3: Not Considering Physical Rack Placement
❌ Wrong thinking:
"VMs are placed automatically, no need to think about rack topology"
✅ Correct understanding:
- Andromeda: Logical VPC placement
- Jupiter: Physical rack/leaf placement (you typically don't control)
- But: Zones/racks matter for performance
- VMs in same rack/leaf: Lowest latency, but shared fabric
- VMs in different racks: Slightly higher latency, but better isolation
Impact: Application performance unpredictable if 2 critical VMs co-locate on same leaf → bottleneck.
Prevention: Use Pod Affinity/Anti-Affinity (for GKE). Use placement policies for predictable topology.
Mistake 4: Ignoring Oversubscription in Capacity Planning
❌ Wrong thinking:
"1000 servers × 25Gbps = 25Tbps total capacity"
✅ Correct understanding:
- Raw capacity: 25Tbps (if all servers sent simultaneously)
- Cross-leaf ceiling: 12.5Tbps (25Tbps ÷ 2:1 oversubscription)
- Real sustained: ~5Tbps (12.5Tbps at ~40% average utilization)
- Burst: up to 10Tbps for short periods
- Realistic: plan for ~40% average utilization
Impact: compute is over-provisioned while the network remains the bottleneck.
Prevention: Include network throughput in capacity planning. Monitor egress bottleneck metrics.
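The capacity figures above follow from two divisions, shown here as a sketch. The function name `plan_capacity_tbps` is hypothetical, and the inputs are the example's own numbers (1000 servers, 25Gbps NICs, 2:1 oversubscription, 40% average utilization).

```python
def plan_capacity_tbps(servers: int, nic_gbps: float,
                       oversubscription: float,
                       avg_utilization: float) -> tuple[float, float, float]:
    """Return (raw, cross-leaf ceiling, planned sustained) in Tbps."""
    raw = servers * nic_gbps / 1000                # sum of all NIC rates
    ceiling = raw / oversubscription               # what the uplinks can carry
    sustained = ceiling * avg_utilization          # planning target
    return raw, ceiling, sustained


raw, ceiling, sustained = plan_capacity_tbps(1000, 25, 2.0, 0.40)
print(f"raw={raw:.1f} ceiling={ceiling:.1f} sustained={sustained:.1f} Tbps")
```

Planning against the sustained figure rather than the raw one is what keeps a compute purchase from outrunning the network it sits on.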
GCP-native Implementation Guidance
Understanding Zone Placement
# List all zones in a region
gcloud compute zones list --filter="region:us-central1"
# Create VMs; rack/leaf placement is not exposed, so the zone is the only available proxy
gcloud compute instances create vm1 --zone=us-central1-a
gcloud compute instances create vm2 --zone=us-central1-b
# A VM in a different zone is likely on a different leaf → different fabric path
# Verify zone assignment:
gcloud compute instances describe vm1 --zone=us-central1-a \
  --format='value(zone)'
Monitoring Fabric Congestion
# VPC Flow Logs record sampled flow telemetry (from Andromeda)
# Create a test VM on the subnet you want to observe
gcloud compute instances create test-vm \
  --zone=us-central1-a \
  --subnet=my-subnet
# Enable flow logs on subnet
gcloud compute networks subnets update my-subnet \
  --enable-flow-logs \
  --region=us-central1
# Query large flows and show their RTT (high RTT can indicate congestion; reported for TCP flows only)
gcloud logging read 'resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows") AND jsonPayload.bytes_sent>1000000' \
  --limit=50 \
  --format='table(jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip,jsonPayload.bytes_sent,jsonPayload.rtt_msec)'
Multi-Zone Load Distribution
# Create instance groups across zones (to distribute across leaves)
gcloud compute instance-groups managed create ig-zone-a \
--base-instance-name=vm \
--template=my-template \
--size=10 \
--zone=us-central1-a
gcloud compute instance-groups managed create ig-zone-b \
--base-instance-name=vm \
--template=my-template \
--size=10 \
--zone=us-central1-b
# Load balancer distributes traffic across zones
# Result: VMs in separate zones = separate leaves = better isolation
References
- Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network (ACM SIGCOMM 2015): detailed Jupiter architecture
- Google Datacenter Networking Architecture: high-level overview
- Network Performance Monitoring: monitor Jupiter fabric congestion
- VPC Flow Logs Documentation: understand traffic patterns on the fabric
Next: GCP Edge Network & Point of Presence (PoP): how Internet traffic enters the GCP fabric