Jupiter Fabric: Spine-Leaf Topology & Bandwidth Oversubscription

Why It Matters in Production

Andromeda is the logical networking layer, but it runs on top of a physical network infrastructure: the Jupiter fabric. Understanding Jupiter helps you:

Predict performance bottlenecks: know the maximum throughput a single VM can reach
Design multi-zone deployments: understand latency and throughput between zones
Handle tail latency: recognize when a performance drop is not caused by application code
Optimize bandwidth usage: know when to group VMs within a zone and when to spread them across zones

Jupiter Fabric determines the physical packet path, and Andromeda determines the logical routing path. Only by combining the two do you get a complete picture of network behavior.

Internal Model: Spine-Leaf Datacenter Architecture

Traditional Three-Tier Network (On-Prem / Legacy)

┌─────────────────────────┐
│    Core Layer           │
│  (High-End Switches)    │
│   10-100Gbps capacity   │
└────────────┬────────────┘
             │
    ┌────────┴────────┐
    │                 │
┌───▼──┐         ┌───▼──┐
│Agg1  │         │Agg2  │
│Aggr  │         │Aggr  │
└───┬──┘         └───┬──┘
    │                │
  ┌─┴────┐      ┌────┴─┐
  │      │      │      │
┌─▼─┐  ┌─▼─┐  ┌─▼─┐  ┌─▼─┐
│LS1│  │LS2│  │LS3│  │LS4│  (Leaf Switches)
└─┬─┘  └─┬─┘  └─┬─┘  └─┬─┘
  │      │      │      │
[Servers]      [Servers]

Problems:
- Oversubscription: the Aggregation → Core links are the bottleneck
- 1Gbps to each server, but only 10Gbps toward the core in total, so more than 10 busy servers already exceed it
- Cross-pod traffic is throttled by the limited core capacity

Jupiter: Modern Spine-Leaf Architecture

┌─────────────────────────────────────────────────────┐
│              SPINE LAYER (fabric-level)             │
│  Spine1    Spine2    Spine3    Spine4    Spine5     │
│  200Gbps   200Gbps   200Gbps   200Gbps   200Gbps    │
│   ▲         ▲         ▲         ▲         ▲         │
└───┼─────────┼─────────┼─────────┼─────────┼─────────┘
    │         │         │         │         │
    │  (25Gbps per link)                   │
    │         │         │         │         │
┌───┼─────────┼─────────┼─────────┼─────────┼─────────┐
│   │         │         │         │         │         │
│ Leaf1     Leaf2     Leaf3     Leaf4     Leaf5       │
│ 48×25G    48×25G    48×25G    48×25G    48×25G      │
│   │         │         │         │         │         │
│ 48 Svrs   48 Svrs   48 Svrs   48 Svrs   48 Svrs     │
│ (50Gbps)  (50Gbps)  (50Gbps)  (50Gbps)  (50Gbps)    │
└─────────────────────────────────────────────────────┘

Key Characteristics:
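- Non-blocking spine layer: any server can reach any other at the full 25Gbps link rate as long as the leaf uplinks are not saturated
- Many equal paths: Leaf-Spine-Leaf provides ECMP (Equal Cost Multi-Path)
- Oversubscription at the leaf: server-facing downlink capacity exceeds spine-facing uplink capacity (4.8:1 in the capacity breakdown under Bandwidth Oversubscription Implication)
- The oversubscription is a deliberate trade-off: most traffic is short-lived, and servers rarely saturate their links simultaneously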

Why Spine-Leaf (Instead of 3-Tier)

Problem with 3-tier: bottleneck at the aggregation/core tier

Server → Leaf: 50Gbps (full line rate)
Leaf → Aggregation: 10Gbps (oversubscribed 5:1)
Cross-pod traffic gets only one fifth of the server-facing bandwidth

Solution with Spine-Leaf: equal bandwidth at every tier

Server → Leaf: 25Gbps
Leaf → Spine → Leaf: 25Gbps (via ECMP, load-balanced)
Any-to-any: same bandwidth regardless of destination, as long as the leaf uplinks are not saturated

Cost trade-off:

3-tier: fewer switches overall, but the core switches are expensive (high-end)
Spine-Leaf: more spine switches, but commodity switches at every tier (cheaper per port)

At Google's scale, Spine-Leaf is both cheaper and better-performing.

Jupiter Fabric Details

Per-Server Connectivity

Server (Compute Host)
├─ 2x 25Gbps NICs (redundancy)
│  ├─ NIC1: connected to Leaf1 (25Gbps)
│  ├─ NIC2: connected to Leaf2 (25Gbps)
│  └─ Both same subnet (LAG / bonding)
│
└─ Sustained throughput: 50Gbps (25Gbps per leaf, aggregated)

Why 2 NICs?

Redundancy: If Leaf1 fails, traffic still flows via Leaf2
Capacity: Combined 50Gbps throughput per server
Network diversity: Traffic not concentrated on single leaf

Routing Inside the Fabric (Spine-Leaf Forwarding)

When a server sends a packet:

1. Packet arrives at Leaf switch
   - ECMP: Multiple equal-cost paths to destination exist
   - Load balancing: Hash on (src_IP, dst_IP, src_port, dst_port)
   - Selects one of N spine switches

2. Spine switch receives
   - Examines dest MAC → determine destination leaf
   - Forwards to destination leaf (via next hop)

3. Destination leaf receives
   - Examines dest MAC → determine destination server port
   - Forwards to server

Example:
src=10.1.0.2, dst=10.1.0.3
Hash(10.1.0.2, 10.1.0.3, port1, port2) = Spine2
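Route: Leaf5 → Spine2 → Leaf7 (assuming the source sits on Leaf5 and the destination on Leaf7; traffic between two servers on the same leaf is switched locally and never reaches a spine)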
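
The hash-based spine selection in step 1 can be sketched in a few lines. This is a minimal illustration, not the fabric's actual algorithm: CRC32 stands in for whatever hash the switch hardware uses, the port numbers are hypothetical, and `pick_spine` is a name invented for this example.

```python
import zlib

# The five spines from the diagram above.
SPINES = ["Spine1", "Spine2", "Spine3", "Spine4", "Spine5"]


def pick_spine(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    """Map a flow's 4-tuple to one spine; every packet of the flow gets the same answer."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    # CRC32 is a stand-in for the switch's hash function (assumption for illustration).
    return SPINES[zlib.crc32(key) % len(SPINES)]


# Hypothetical ports for the example flow 10.1.0.2 -> 10.1.0.3.
flow = ("10.1.0.2", "10.1.0.3", 40000, 5432)
print(pick_spine(*flow))
```

Because the choice depends only on the tuple, all packets of one flow follow one path and stay in order, while different flows (for example, different source ports) spread across the spines.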

Bandwidth Oversubscription Implication

In this model, each leaf is oversubscribed about 4.8:1 toward the spines:

Leaf Switch Capacity:
├─ Downlinks (to servers): 48 × 50Gbps = 2400Gbps
├─ Uplinks (to spines): 5 × 100Gbps = 500Gbps
└─ Ratio: 2400:500 = 4.8:1

What the ratio means in practice:
├─ If all 48 servers send traffic locally (within leaf) → up to 2400Gbps
├─ If traffic goes between leaves → limited by uplinks (500Gbps shared)
├─ Realistic: all 48 servers rarely saturate their links simultaneously
└─ Design assumes avg 25% utilization → 600Gbps offered per leaf, which fits the 500Gbps of uplink only if part of it stays leaf-local
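
The capacity arithmetic above can be reproduced with a short sketch. It uses only the figures from the breakdown (48 × 50Gbps down, 5 × 100Gbps up, 25% average utilization); the function and variable names are made up for this example.

```python
def leaf_capacity(downlinks: int, downlink_gbps: float,
                  uplinks: int, uplink_gbps: float):
    """Return (downlink total, uplink total, oversubscription ratio) for one leaf."""
    down = downlinks * downlink_gbps
    up = uplinks * uplink_gbps
    return down, up, down / up


down_gbps, up_gbps, ratio = leaf_capacity(48, 50, 5, 100)

# Offered load at the assumed 25% average utilization of the downlinks.
offered_gbps = 0.25 * down_gbps

# Largest share of that load that may leave the leaf before the uplinks saturate.
max_cross_leaf_fraction = min(1.0, up_gbps / offered_gbps)

print(f"ratio {ratio:.1f}:1, offered {offered_gbps:.0f}Gbps, "
      f"cross-leaf share up to {max_cross_leaf_fraction:.0%}")
```

At 25% utilization the offered load (600Gbps) is already above the uplink total (500Gbps), so the design only holds if some of the traffic stays inside the leaf.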

Implications for production:

✅ Single VM: Can achieve 25-50Gbps if isolated
✅ Many VMs on same leaf, local traffic: stays inside the leaf, so each VM is limited by its own NIC and the leaf's switching capacity rather than by the uplinks
⚠️ All VMs sending cross-leaf simultaneously: Congestion, packet drops, tail latency
Mitigations: Traffic scheduling, bandwidth reservation, burst allowance
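
To see where cross-leaf congestion starts, the sketch below estimates each sender's share of the leaf uplinks. It assumes perfectly even sharing of the 500Gbps uplink total and a 50Gbps per-server cap, both taken from the capacity breakdown above; real sharing depends on ECMP hashing and congestion control, and `cross_leaf_share_gbps` is a hypothetical helper.

```python
def cross_leaf_share_gbps(active_senders: int,
                          uplink_total_gbps: float = 500.0,
                          nic_gbps: float = 50.0) -> float:
    """Idealized per-server cross-leaf bandwidth when senders split the uplinks evenly."""
    if active_senders <= 0:
        raise ValueError("active_senders must be positive")
    # A server can never exceed its own NIC capacity, however idle the uplinks are.
    return min(nic_gbps, uplink_total_gbps / active_senders)


for n in (1, 10, 20, 48):
    print(n, round(cross_leaf_share_gbps(n), 1))
```

Under these assumptions, up to 10 servers can send cross-leaf at full rate; beyond that the per-server share shrinks as the uplinks become the limit.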

Physical vs Logical Topology

Logical (What Andromeda sees):
└─ VPC A: 10.1.0.0/24
   ├─ Subnet: 10.1.0.0/24
   └─ All VMs connected logically (flat)

Physical (Jupiter sees):
└─ Datacenter topology
   ├─ Leaf switches (rack-level)
   ├─ Spine switches (fabric-level)
   └─ Servers in racks
       ├─ Server1 (10.1.0.2) @ Rack5, connected to Leaf5
       └─ Server2 (10.1.0.3) @ Rack7, connected to Leaf7

Packet from Server1 to Server2:
- Andromeda: Route 10.1.0.2 → 10.1.0.3 (logical, local subnet)
- Jupiter: Forward Leaf5 → Spine → Leaf7 (physical path)
- Result: Packet takes physical path, but VMs unaware of leaf/spine details
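
The split between the two views can be shown as a lookup: the logical layer only knows IP addresses, while the physical layer knows which leaf each server hangs off. A minimal sketch, using the two servers from the example; the `PLACEMENT` table and `physical_path` function are illustrative names, and the spine is passed in because it is chosen per flow by ECMP.

```python
# Physical placement known to the fabric, not to the VMs (from the example above).
PLACEMENT = {
    "10.1.0.2": "Leaf5",  # Server1, Rack5
    "10.1.0.3": "Leaf7",  # Server2, Rack7
}


def physical_path(src_ip: str, dst_ip: str, spine: str) -> list:
    """Return the switches a packet crosses between two logically adjacent IPs."""
    src_leaf = PLACEMENT[src_ip]
    dst_leaf = PLACEMENT[dst_ip]
    if src_leaf == dst_leaf:
        # Same leaf: switched locally, no spine involved.
        return [src_leaf]
    return [src_leaf, spine, dst_leaf]


print(" -> ".join(physical_path("10.1.0.2", "10.1.0.3", spine="Spine2")))
```

Both addresses sit in the same /24, so the VMs see a single logical hop even though the packet crosses three switches.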

Production Architecture Patterns

Pattern 1: High-Throughput Workload (Batch Processing)

Requirement: Move 100GB dataset in 1 minute
┌──────────────────────┐
│ GCS Bucket (multi-region)
└────────┬─────────────┘
         │ 25Gbps per source
         │
┌────────▼──────────────────────────┐
│ Compute VMs (4 instances)         │
├─ VM1: 10.1.0.10 @ Leaf1 (25Gbps)  │
├─ VM2: 10.1.0.11 @ Leaf2 (25Gbps)  │
├─ VM3: 10.1.0.12 @ Leaf1 (25Gbps)  │  (Note: VMs distributed across leaves)
└─ VM4: 10.1.0.13 @ Leaf3 (25Gbps)  │

Throughput breakdown:

Ideal: 4 × 25Gbps = 100Gbps total
Reality: I/O contention, spine limits → ~80Gbps sustained
100GB = 800Gb; 800Gb ÷ 80Gbps = 10 seconds (well within the 1-minute target)

Jupiter impact: VMs distributed across leaves to avoid leaf-level oversubscription bottleneck.
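
The transfer-time estimate depends on converting bytes to bits first: 100GB is 800 gigabits. A small sketch of that calculation, with a hypothetical helper name and decimal units (1GB = 8Gb) assumed:

```python
def transfer_seconds(size_gigabytes: float, throughput_gbps: float) -> float:
    """Time to move a dataset, converting gigabytes to gigabits first."""
    if throughput_gbps <= 0:
        raise ValueError("throughput_gbps must be positive")
    gigabits = size_gigabytes * 8
    return gigabits / throughput_gbps


ideal = transfer_seconds(100, 4 * 25)   # four VMs at 25Gbps each
realistic = transfer_seconds(100, 80)   # ~80Gbps sustained
print(f"ideal {ideal:.0f}s, realistic {realistic:.0f}s")
```

Either way the transfer finishes in seconds, so the 1-minute requirement leaves a wide margin for storage-side limits that this estimate ignores.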

Pattern 2: Database Replication (Low Throughput, Highly Latency-Sensitive)

┌───────────────────────┐
│ Primary DB (Zone A)   │
│ IP: 10.1.0.100        │
│ @ Leaf1               │
├───────────────────────┤
│ Replication Channel   │
│ Throughput: 10Mbps    │
│ Latency SLA: <5ms     │
└───────────┬───────────┘
            │
            │ (within one fabric, Leaf → Spine → Leaf: ~50μs; a cross-zone path adds inter-fabric hops)
            │
┌───────────▼───────────┐
│ Replica DB (Zone B)   │
│ IP: 10.2.0.100        │
│ @ Leaf2               │
└───────────────────────┘

Jupiter impact: throughput is low, so bandwidth is not a concern; latency is what matters.

Spines designed for <50μs latency
ECMP spreads flows across spines, which keeps queues short and latency consistent in the common case
Replication doesn't cause fabric congestion (small data volume)

Pattern 3: Multi-Zone Failover Architecture

Primary (Zone A):
├─ Application @ Leaf1
├─ Database @ Leaf3
└─ Replication link: 10Mbps

Standby (Zone B):
├─ Application @ Leaf2
├─ Database @ Leaf4
└─ Ready to take over

Failover: Primary data center down
├─ Standby Application: DNS updated → connects to Standby Database
├─ Jupiter fabric: Routes traffic from Zone B to Zone B internally
├─ Latency: <5ms (intra-zone via spines)
├─ Throughput: Full 25Gbps available
└─ Recovery: No cross-zone congestion

Jupiter benefit: Each zone isolated fabric-wise, failover doesn't impact other zones.

Common Mistakes & Anti-Patterns

Mistake 1: Assuming All Servers Equal Throughput to All Others

❌ Wrong thinking:

"All VMs in same zone have equal 25Gbps to all other VMs"

✅ Correct understanding:

Single VM: Can achieve ~25Gbps egress to single destination
2 VMs on SAME leaf sending across leaf: can drop to ~12.5Gbps each when their flows hash onto the same uplink
2 VMs on DIFFERENT leaves: Can achieve near 25Gbps each (separate uplinks)
Many-to-many: Aggregate spine bandwidth becomes bottleneck

Impact: Capacity plans that assume linear application scaling break down once the network becomes the limit at higher concurrency.

Prevention: Benchmark actual throughput in your zone. Use iperf3 between VMs, measure across topology.
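
Whether flows leaving the same leaf end up sharing an uplink depends on the ECMP hash. Assuming the hash spreads flows uniformly and independently over the 5 uplinks from the diagram (an idealization), the chance that at least two of k flows collide is a birthday-problem calculation; `collision_probability` is a name chosen for this sketch.

```python
from math import prod


def collision_probability(flows: int, uplinks: int) -> float:
    """Probability that at least two flows hash onto the same uplink."""
    if flows > uplinks:
        return 1.0  # pigeonhole: more flows than uplinks always collide
    # Probability that every flow lands on a distinct uplink.
    all_distinct = prod((uplinks - i) / uplinks for i in range(flows))
    return 1.0 - all_distinct


for k in (2, 3, 4):
    print(k, f"{collision_probability(k, 5):.0%}")
```

Even two large flows collide one time in five under this model, which is one reason a single iperf3 run between two VMs can understate or overstate what the application will see.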

Mistake 2: Expecting Zero Packet Loss at High Utilization

❌ Wrong thinking:

"Google infrastructure never drops packets, so can rely on 99.9% utilization"

✅ Correct understanding:

Spine switches: Non-blocking, rarely drop packets
Leaf switches: Oversubscribed, may drop under extreme load
TCP backoff: Lost packets trigger retransmissions and a smaller congestion window, which degrades throughput
Target: <0.01% packet loss across all region pairs (per Google SLA)

Impact: Sustained 90%+ leaf utilization causes visible packet drops and retransmits.

Prevention: Keep sustained utilization below 70% and treat anything above that as a temporary burst. Monitor switch fabric metrics.

Mistake 3: Not Considering Physical Rack Placement

❌ Wrong thinking:

"VMs are placed automatically, no need to think about rack topology"

✅ Correct understanding:

Andromeda: Logical VPC placement
Jupiter: Physical rack/leaf placement (you typically don't control)
But: Zones/racks matter for performance
VMs in same rack/leaf: Lowest latency, but shared fabric
VMs in different racks: Slightly higher latency, but better isolation

Impact: Application performance becomes unpredictable if two critical VMs land on the same leaf and contend for its uplinks.

Prevention: Use Pod Affinity/Anti-Affinity (for GKE). Use Compute Engine placement policies (compact or spread) for more predictable topology.

Mistake 4: Ignoring Oversubscription in Capacity Planning

❌ Wrong thinking:

"1000 servers × 25Gbps = 25Tbps total capacity"

✅ Correct understanding:

Raw capacity: 25Tbps (if all servers sent simultaneously)
Real sustained: ~5Tbps (25Tbps ÷ the 4.8:1 leaf oversubscription)
Burst: Up to 10Tbps for short periods
Realistic: Plan for ~40% average utilization

Impact: Compute capacity is overprovisioned and then sits idle behind a network bottleneck.

Prevention: Include network throughput in capacity planning. Monitor egress bottleneck metrics.
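
The ~5Tbps sustained figure corresponds to dividing raw capacity by the 4.8:1 leaf ratio computed under Bandwidth Oversubscription Implication. A minimal sketch of that planning step, treating all traffic as cross-leaf (a worst-case assumption) and using a hypothetical function name:

```python
def fabric_capacity_tbps(servers: int, nic_gbps: float, oversubscription: float):
    """Return (raw, sustained) cross-leaf capacity in Tbps for a fleet of servers."""
    if oversubscription <= 0:
        raise ValueError("oversubscription must be positive")
    raw = servers * nic_gbps / 1000.0
    # Worst case: every byte crosses the oversubscribed leaf uplinks.
    return raw, raw / oversubscription


raw_tbps, sustained_tbps = fabric_capacity_tbps(1000, 25, 4.8)
print(f"raw {raw_tbps:.0f}Tbps, sustained about {sustained_tbps:.1f}Tbps")
```

Feeding the measured ratio of your own deployment into the same calculation gives a first-order ceiling to compare against application throughput targets.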

GCP-native Implementation Guidance

Understanding Zone Placement

bash

# List all zones in a region
gcloud compute zones list --filter="region:us-central1"

# Create VMs in two zones. Rack/leaf placement is not exposed; the zone is the
# finest placement detail available by default.
gcloud compute instances create vm1 --zone=us-central1-a
gcloud compute instances create vm2 --zone=us-central1-b

# VMs in different zones are not expected to share a leaf, so their traffic takes different fabric paths

# Verify zone assignment:
gcloud compute instances describe vm1 --zone=us-central1-a \
  --format='value(zone)'

Monitoring Fabric Congestion

bash

# VPC Flow Logs record sampled flow-level telemetry (from Andromeda), not every packet
gcloud compute instances create test-vm \
  --zone=us-central1-a \
  --subnet=my-subnet

# Enable flow logs on subnet
gcloud compute networks subnets update my-subnet \
  --enable-flow-logs \
  --region=us-central1

# Query large flows and show their RTT (rtt_msec is reported for TCP flows only);
# elevated RTT can hint at congestion
gcloud logging read 'resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows") AND jsonPayload.bytes_sent>1000000' \
  --limit=20 \
  --format='table(jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip,jsonPayload.rtt_msec)'

Multi-Zone Load Distribution

bash

# Create instance groups across zones (to distribute across leaves)
gcloud compute instance-groups managed create ig-zone-a \
  --base-instance-name=vm \
  --template=my-template \
  --size=10 \
  --zone=us-central1-a

gcloud compute instance-groups managed create ig-zone-b \
  --base-instance-name=vm \
  --template=my-template \
  --size=10 \
  --zone=us-central1-b

# Load balancer distributes traffic across zones
# Result: VMs in separate zones = separate fabrics = better isolation

