Traffic Engineering & Multi-path Load Balancing
Why This Matters in Production
Finally, we tie everything together and look at how GCP optimizes packet paths across the Jupiter fabric, Andromeda, and the backbone:
- ECMP routing: Distribute load across multiple equal paths
- Capacity planning: Ensure headroom for failures + peaks
- Failure scenarios: How network recovers from failures
- Multi-path resilience: Automatic failover without manual intervention
Understanding the network at this level lets you architect resilient networks at scale.
Internal Model: Multi-path Routing Architecture
ECMP (Equal-Cost Multi-Path) Routing
Standard unicast routing:
        ┌──────────────┐
        │  Source VM   │
        │ 203.0.113.1  │
        └──────┬───────┘
               │
┌──────────────▼─────────────────────────────┐
│ Routing table lookup:                      │
│   Destination: 35.201.123.45               │
│   Result: 1 best path to gateway           │
└──────────────┬─────────────────────────────┘
               │
┌──────────────▼─────────────────────────────┐
│ Single gateway                             │
│   Packet: →[Router]→[Switch]→[Fiber]→...   │
└──────────────┬─────────────────────────────┘
               │  (Bottleneck if router fails!)
┌──────────────▼─────────────────────────────┐
│ Destination                                │
│   35.201.123.45 (single)                   │
└────────────────────────────────────────────┘
Problem:
├─ Single path = single point of failure
└─ If router down: All traffic blocked
ECMP multipath routing:
        ┌──────────────┐
        │  Source VM   │
        │ 203.0.113.1  │
        └──────┬───────┘
               │
┌──────────────▼──────────────────────────────────┐
│ Routing table lookup:                           │
│   Destination: 35.201.123.45                    │
│   Result: 4 equal-cost paths (same cost metric) │
│   ├─ Path A via Gateway1 (cost 100)             │
│   ├─ Path B via Gateway2 (cost 100)             │
│   ├─ Path C via Gateway3 (cost 100)             │
│   └─ Path D via Gateway4 (cost 100)             │
└──────────────┬──────────────────────────────────┘
               │
     ┌─────────┴───┬─────────────┬─────────────┐
     ▼             ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│Gateway1 │   │Gateway2 │   │Gateway3 │   │Gateway4 │
│ (25%)   │   │ (25%)   │   │ (25%)   │   │ (25%)   │
└────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘
     │             │             │             │
     └─────────────┴──────┬──────┴─────────────┘
                          ▼
                     Destination
ECMP: Hash on flow
├─ Flow1 → Path A
├─ Flow2 → Path B
├─ Flow3 → Path C
├─ Flow4 → Path D
└─ Flow5 → Path A (reuse)
Benefits:
├─ Load balanced: 25% traffic per path
├─ Resilience: If Path A goes down, Flow1 is rehashed onto one of the remaining paths
├─ Bandwidth: Up to 4x aggregate capacity vs a single path (each individual flow still uses one path)
└─ Automatic: No manual intervention needed
ECMP Hash Function
When router sees multiple equal-cost paths:
├─ Instead of using "first" path
├─ Calculate: Hash(source_IP, dest_IP, src_port, dst_port)
├─ Result: 32-bit hash value
└─ Use: hash % num_paths = selected path
Example with 4 paths:
├─ Flow (203.0.113.1:8000 → 35.201.123.45:443)
│ └─ Hash: 0x4a5f9c2e % 4 = 2 → Path C
│
├─ Flow (203.0.113.2:8001 → 35.201.123.45:443)
│ └─ Hash: 0x7b3e1d58 % 4 = 0 → Path A
│
└─ Same flows always take same path (for TCP session stability)
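The hash-then-modulo step above can be sketched in a few lines of shell. This is only an illustration: `cksum` (CRC-32) stands in for whatever hash the router really uses, and `select_path` and the `PATHS` array are hypothetical names, not part of any GCP tooling.

```bash
#!/usr/bin/env bash
# Sketch of per-flow ECMP path selection.
# Assumption: cksum (CRC-32) stands in for the router's real hash function.

PATHS=("Path A" "Path B" "Path C" "Path D")

# select_path SRC_IP SRC_PORT DST_IP DST_PORT -> prints the chosen path index
select_path() {
  local tuple="$1:$2->$3:$4"
  local hash
  hash=$(printf '%s' "$tuple" | cksum | cut -d' ' -f1)
  echo $(( hash % ${#PATHS[@]} ))
}

idx=$(select_path 203.0.113.1 8000 35.201.123.45 443)
echo "203.0.113.1:8000 -> 35.201.123.45:443 uses ${PATHS[$idx]}"

# The same tuple always yields the same index, so a flow stays on one path.
again=$(select_path 203.0.113.1 8000 35.201.123.45 443)
echo "repeat lookup uses ${PATHS[$again]}"
```

Because the index depends only on the tuple and the number of paths, every packet of a flow follows the same path until the path count changes.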
Benefit:
└─ Per-flow load balancing (not per-packet, which would reorder packets within a flow)
Production Architecture Patterns
Pattern 1: Data Center Network with ECMP
Deployment: 200 VMs processing 50Gbps traffic
Network topology (Jupiter fabric):
├─ Leaf switches: 4 (one per rack)
├─ Spine switches: 8
│
└─ Paths per VM:
├─ To spine: 2 parallel links (LAG to 2 spines)
├─ From spine: 8 possible paths (to any spine → any leaf)
└─ Total paths to any destination: 8 (ECMP)
Traffic distribution:
├─ 50Gbps = 50,000 flows (1Mbps per flow average)
├─ ECMP hash: Distribute flows across 8 paths
├─ Per path: 6.25Gbps (50Gbps ÷ 8)
├─ Per link: Can handle 100Gbps
└─ Utilization: 6% (plenty of headroom)
Failure scenario (1 spine down):
├─ Paths available: 7 (instead of 8)
├─ Per-path load: 7.1Gbps
├─ Still acceptable: 7% utilization
└─ Automatic: No config changes needed; ECMP rehashes flows across the remaining spines
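The per-path figures above are simple division, and a small helper makes it easy to try other path counts. This sketch assumes flows hash perfectly evenly, which real traffic only approximates; `per_path_load` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Per-path load and utilization for a given number of surviving ECMP paths.
# Assumption: flows hash evenly across paths.

TOTAL_GBPS=50      # aggregate traffic from the example
LINK_GBPS=100      # capacity of each link

# per_path_load PATH_COUNT -> prints "<Gbps per path> <utilization %>"
per_path_load() {
  awk -v total="$TOTAL_GBPS" -v link="$LINK_GBPS" -v paths="$1" \
    'BEGIN { load = total / paths; printf "%.2f %.2f\n", load, load / link * 100 }'
}

for count in 8 7 6; do
  read -r load util <<< "$(per_path_load "$count")"
  echo "$count paths: ${load} Gbps per path, ${util}% of a ${LINK_GBPS} Gbps link"
done
```

With 8 paths this gives 6.25 Gbps per path, and with 7 paths about 7.14 Gbps, matching the numbers in the failure scenario.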
Failure scenario (1 leaf down):
├─ 50 VMs affected (on that leaf)
├─ Traffic rerouted: Via remaining leaves
├─ Temporary: Congestion spike (~15% util for 30 seconds)
└─ Recovery: Workloads reschedule to other leaves (~5 min)
Pattern 2: Global Multi-Region with ECMP
Global application: 4 regions (us-central1, eu-west1, asia-southeast1, us-east1)
Peak traffic: 100Gbps to any region
Per-region ECMP setup:
├─ Each region: 8 spines (ECMP paths to other regions)
├─ Regional egress: 4 primary paths to backbone
├─ Plus: 2 backup paths (in case of congestion/failure)
│
└─ Per-region capacity:
├─ 100Gbps to each other region (multiples)
└─ Total: 300Gbps (to all 3 other regions)
Traffic engineering during peak:
├─ Normal: Load distributed via ECMP
├─ Congestion detected: Traffic shifted to secondary paths
├─ Result: Automatic re-balancing (milliseconds)
Failure scenario (transatlantic cable cut):
├─ Primary path (via Cable A): Down
├─ Backup path (via Cable B): Activated
├─ ECMP: Recalculates, includes backup path
├─ Result: Latency increases (longer route) but connectivity maintained
└─ Recovery: <100ms failover (ECMP convergence time)
Pattern 3: Capacity Planning for Failure Scenarios
Production SLA: 99.99% (52.6 min downtime/year)
Normal traffic: 10Gbps average
Peak traffic: 20Gbps (2x normal)
Target resilience: Survive any single failure
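The downtime budget quoted for the SLA can be derived from the availability percentage. A minimal sketch, assuming a 365-day year; `downtime_minutes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Convert an availability target into a yearly downtime budget.
# Assumption: a 365-day year (525,600 minutes).

# downtime_minutes SLA_PERCENT -> prints allowed downtime in minutes per year
downtime_minutes() {
  awk -v sla="$1" 'BEGIN { printf "%.1f\n", (100 - sla) / 100 * 365 * 24 * 60 }'
}

for sla in 99.9 99.99 99.999; do
  echo "SLA ${sla}%: $(downtime_minutes "$sla") min/year"
done
```

For 99.99% this reproduces the 52.6 minutes per year stated above.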
Capacity formula:
├─ Normal capacity needed: 10Gbps (baseline)
├─ Peak capacity needed: 20Gbps (expected peak)
├─ Failure resilience: Must absorb any 1 failure
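└─ Required total: 20 * 1.5 = 30Gbps (with 3 regions, each is sized at peak ÷ 2 so that any 2 can carry the full peak; in general the factor is n ÷ (n-1) for n regions)
Allocation strategy:
├─ Region 1: 10Gbps capacity (half of peak)
├─ Region 2: 10Gbps capacity (half of peak)
├─ Region 3: 10Gbps capacity (half of peak, the spare for n+1)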
├─ Total provisioned: 30Gbps
│
└─ Failure scenarios:
├─ Region 1 fails: Region 2+3 can handle 20Gbps ✓
├─ Region 2 fails: Region 1+3 can handle 20Gbps ✓
├─ Region 3 fails: Region 1+2 can handle 20Gbps ✓
└─ Any single failure: SLA maintained ✓
Real-world Failure Scenarios
Scenario 1: Asymmetric Path Failure (ECMP Challenge)
Setup: 4 ECMP paths from source to destination
Symptom:
├─ Throughput: 50Gbps → 25Gbps (50% drop)
├─ No packets lost
├─ No errors in logs
Root cause:
├─ Path A (ingress): Working normally
├─ Path A (egress): Congested (return path bottleneck)
├─ Path B,C,D: Not affected
│
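└─ Impact:
├─ Forward direction via Paths A,B,C,D: 12.5Gbps each
├─ Return traffic concentrated on Path A: 50Gbps offered, but the congested path delivers only about half
└─ Result: Bottleneck on the return path caps throughput at ~25Gbps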
Investigation:
├─ Measure: One-way latency A→dest vs dest→A
├─ Result: Return path 10x slower
├─ Cause: Asymmetric congestion
Resolution:
├─ Shift egress traffic to use other paths
├─ Requires: Application-level routing control
└─ Or: Manual traffic engineering (adjust BGP weights)
Scenario 2: ECMP Hash Collision (Unbalanced Load)
Symptom:
├─ VM1: 100% CPU (100Gbps traffic)
├─ VM2: 10% CPU (10Gbps traffic)
├─ VM3: 10% CPU
├─ Total traffic: 120Gbps
Expected:
└─ 40Gbps per VM (equal distribution)
Root cause:
├─ ECMP hash function: Hash(flow_tuple) % num_paths
├─ Large flows: All happen to hash to same path
├─ Example:
│ ├─ Flow: (203.0.113.1:50000 → 35.201.123.45:443) hash % 3 = 0 → Path 0 (VM1)
│ ├─ Flow: (203.0.113.1:50001 → 35.201.123.45:443) hash % 3 = 0 → Path 0 (VM1)
│ ├─ Flow: (203.0.113.1:50002 → 35.201.123.45:443) hash % 3 = 0 → Path 0 (VM1)
│ └─ Result: VM1 gets several large flows, while the others get only small flows
Detection:
├─ Monitor per-flow distribution
├─ Alert: If any single path exceeds 70% utilization or carries far more than the per-path average
└─ Requires: Network telemetry (NetFlow/sFlow)
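The detection step can be sketched as a comparison of each path's load against the per-path average. The input format, the `find_hot_paths` name, and the 1.5x threshold are all assumptions for illustration; in practice the numbers would come from telemetry such as NetFlow/sFlow.

```bash
#!/usr/bin/env bash
# Flag paths whose load is far above the per-path average.
# Assumptions: input lines are "<name> <Gbps>", and 1.5x average is the alert line.

THRESHOLD=1.5   # alert when a path carries more than 1.5x the average

# find_hot_paths reads "<name> <Gbps>" lines on stdin and prints the hot ones
find_hot_paths() {
  awk -v thr="$THRESHOLD" '
    { name[NR] = $1; load[NR] = $2; total += $2 }
    END {
      if (NR == 0) exit
      avg = total / NR
      for (i = 1; i <= NR; i++)
        if (load[i] > avg * thr)
          printf "%s %.0f Gbps (average %.0f Gbps)\n", name[i], load[i], avg
    }'
}

# Loads from the symptom above: one backend takes most of the traffic.
printf '%s\n' "VM1 100" "VM2 10" "VM3 10" | find_hot_paths
```

With the symptom's numbers the average is 40 Gbps, so only VM1 is reported; a balanced distribution prints nothing.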
Resolution:
├─ Short-term: Redistribute flows manually (drain VM1)
├─ Long-term: Improve hash function (use more tuple elements)
└─ Or: Use application-level load balancing instead of ECMP
Scenario 3: Cascade Failures (Multiple Failures)
Failure sequence:
├─ T+0:00 Path A fails (high utilization)
├─ T+0:05 Remaining paths: 7 (instead of 8)
├─ T+0:10 Congestion: Each path now carries 14.3% of total traffic (up from 12.5%)
│
├─ T+0:15 Path B fails (triggered by congestion from Path A's redistributed load)
├─ T+0:20 Remaining paths: 6
├─ T+0:25 Congestion: Each path now carries 16.7% of total traffic
│
├─ T+0:30 Path C fails (cascade continues)
│ └─ Pattern: Each failure triggers congestion, which triggers next failure
│
└─ Result: Cascading failure (failure chain reaction)
Why cascades happen:
├─ Congestion threshold: Path degrades after ~70% utilization
├─ Multiple failures: Remaining paths saturated
├─ Loss feedback: Lost packets trigger retransmits, more congestion
└─ Doom loop: Congestion → more loss → more congestion
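The doom loop can be illustrated with a toy model. This sketch assumes load spreads evenly over surviving paths and that a path drops out as soon as its utilization passes 70%; real networks degrade more gradually, and the capacities and the `simulate_cascade` name are made up for the example.

```bash
#!/usr/bin/env bash
# Toy cascade model using integer math.
# Assumptions: even load spreading, and a path fails once it passes THRESHOLD_PCT.

THRESHOLD_PCT=70
PATH_CAPACITY_GBPS=10

# simulate_cascade PATH_COUNT OFFERED_GBPS -> prints how many paths survive
# after one initial failure and any follow-on failures
simulate_cascade() {
  local remaining=$(( $1 - 1 ))   # the initial failure
  local offered=$2
  local util
  while (( remaining > 0 )); do
    util=$(( offered * 100 / (remaining * PATH_CAPACITY_GBPS) ))
    if (( util <= THRESHOLD_PCT )); then
      break                        # survivors are below the threshold: stable
    fi
    remaining=$(( remaining - 1 )) # overloaded path drops out, load shifts again
  done
  echo "$remaining"
}

echo "48 Gbps over 8 paths: $(simulate_cascade 8 48) paths survive"
echo "50 Gbps over 8 paths: $(simulate_cascade 8 50) paths survive"
```

In this model 48 Gbps survives the first failure with 7 paths, while 50 Gbps pushes the survivors just past the threshold and every remaining path fails in turn. A small difference in headroom decides whether one failure stays isolated.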
Prevention:
├─ Design: n+2 redundancy (can lose 2 paths)
├─ Monitor: Alert when single path fails
├─ Failover: Drain load from remaining paths to prevent cascade
└─ Burst allowance: Built-in safeguards in GCP
Common Mistakes & Anti-Patterns
Mistake 1: Designing for n Redundancy (Instead of n+1)
❌ Wrong thinking:
"We have multiple paths, so any one can fail and we're still OK."
✅ Correct understanding:
- n redundancy: Lose 1 path and traffic keeps flowing, but the surviving paths may be overloaded
- n+1 redundancy: Lose 1 path with no performance degradation
- Need: n+1 to maintain SLA during a single failure
Prevention: Size total capacity so the survivors can carry peak traffic. With 3 equal paths or regions: total_capacity = peak_traffic * 1.5
Mistake 2: Not Monitoring Per-Path Utilization
❌ Wrong thinking:
"Average utilization is 40%, so we're safe even if a path fails."
✅ Correct understanding:
- Average: Meaningless if one path is at 90% and the others are at 10%
- ECMP hash collision: Can cause unbalanced load
- Must monitor: Per-path utilization
Prevention: Alert when any single path exceeds 70% utilization or runs far above the per-path average.
Mistake 3: Assuming Automatic Failover "Always Works"
❌ Wrong thinking:
"ECMP is automatic, so failover is handled by the network."
✅ Correct understanding:
- ECMP failover: Automatic but takes 10-50ms
- Long connections: Might get disrupted
- TCP: Expects in-order delivery, but a change in the path set can rehash established flows onto different paths mid-connection
- Result: Possible packet loss and reordering during failover
Prevention: Design applications to handle connection resets. Use connection pooling.
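To see why a path change disturbs more than the flows on the failed path, the sketch below counts how many flows land on a different index when plain `hash % num_paths` goes from 4 paths to 3. Sequential integers stand in for flow hashes, which is an assumption for illustration, and `count_remapped` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Count how many flows change path when modulo-based ECMP loses one path.
# Assumption: sequential integers stand in for real flow hashes.

# count_remapped FLOWS OLD_PATHS NEW_PATHS -> prints number of flows that move
count_remapped() {
  local flows=$1 old=$2 new=$3 moved=0 h
  for (( h = 0; h < flows; h++ )); do
    if (( h % old != h % new )); then
      moved=$(( moved + 1 ))
    fi
  done
  echo "$moved"
}

moved=$(count_remapped 12 4 3)
echo "$moved of 12 flows moved, although only 3 were on the failed path"
```

In this toy case 9 of 12 flows move. Hardware that supports resilient or consistent hashing can reduce this churn, but applications should still tolerate a brief disruption.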
GCP-native Implementation Guidance
Monitoring ECMP Load Distribution
bash
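# Enable VPC Flow Logs on the subnet (logs are sampled, so byte counts are estimates)
gcloud compute networks subnets update my-subnet \
  --enable-flow-logs \
  --region=us-central1
# List the largest TCP flows (protocol 6). Flow logs do not record which ECMP path
# a flow took, so per-destination volume is only a proxy for path balance.
gcloud logging read \
  'resource.type="gce_subnetwork" AND logName:"compute.googleapis.com%2Fvpc_flows" AND jsonPayload.connection.protocol=6' \
  --format='value(jsonPayload.connection.src_ip, jsonPayload.connection.dest_ip, jsonPayload.bytes_sent)' \
  --limit=100 | sort -k3,3nr
# Sum bytes per destination IP (totals should be roughly equal across equivalent backends)
gcloud logging read \
  'resource.type="gce_subnetwork" AND logName:"compute.googleapis.com%2Fvpc_flows"' \
  --format='value(jsonPayload.connection.dest_ip, jsonPayload.bytes_sent)' \
  --limit=10000 | \
  awk '{sum[$1]+=$2} END {for (ip in sum) printf "%s %.3f GB\n", ip, sum[ip]/1e9}' | sort
Capacity Planning Calculation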
bash
#!/bin/bash
# Calculate required capacity for n+1 resilience across regions
PEAK_TRAFFIC_GBPS=50
REGIONS=3  # total regions; any one of them may fail
# Size each region so the surviving regions can carry the full peak
PER_REGION_GBPS=$(echo "scale=1; $PEAK_TRAFFIC_GBPS / ($REGIONS - 1)" | bc)
REQUIRED_CAPACITY=$(echo "scale=1; $PER_REGION_GBPS * $REGIONS" | bc)
RESILIENCE_FACTOR=$(echo "scale=2; $REGIONS / ($REGIONS - 1)" | bc)  # 1.50 for 3 regions
echo "Peak traffic: ${PEAK_TRAFFIC_GBPS}Gbps"
echo "Resilience factor: $RESILIENCE_FACTOR (n+1)"
echo "Required capacity: ${REQUIRED_CAPACITY}Gbps"
echo ""
echo "Deployment strategy:"
for i in $(seq 1 "$REGIONS"); do
  echo "├─ Region $i: ${PER_REGION_GBPS}Gbps"
done
echo "└─ Total provisioned: ${REQUIRED_CAPACITY}Gbps"
echo ""
echo "Failure scenario:"
echo "├─ 1 region fails: Remaining regions absorb ${PEAK_TRAFFIC_GBPS}Gbps"
echo "├─ Capacity remaining: $(echo "scale=1; $PER_REGION_GBPS * ($REGIONS - 1)" | bc)Gbps"
# Multiply before dividing so bc does not truncate the ratio to one decimal first
echo "├─ Normal peak utilization: $(echo "scale=1; $PEAK_TRAFFIC_GBPS * 100 / $REQUIRED_CAPACITY" | bc)%"
echo "└─ Result: SLA maintained ✓"
Testing Failover Scenarios
bash
# Simulate a failure by dropping this VM's traffic to one destination.
# A guest VM cannot disable a path inside Google's fabric, so this only approximates a path failure.
# 1. Identify path (via traceroute or NetFlow)
GATEWAY="192.168.1.1"
# 2. Block path using iptables (on test VM only!)
#    OUTPUT matches locally generated packets; FORWARD applies only if this VM routes for others
sudo iptables -A OUTPUT -d "$GATEWAY" -j DROP
# 3. Observe application behavior
# - Does latency increase?
# - Do connections drop?
# - How long to recover?
# 4. Monitor logs
gcloud logging read 'resource.type="gce_instance" AND severity>=WARNING' \
  --freshness=10m \
  --format='table(severity, timestamp, textPayload)'
# 5. Restore path
sudo iptables -D OUTPUT -d "$GATEWAY" -j DROP
# 6. Verify recovery
# Analysis: How long until normal operation?
Conclusion: Putting It All Together
You've now covered:
- Andromeda (SDN layer) — How VPCs implemented
- Jupiter (Physical fabric) — Hardware topology
- PoP (Edge) — How traffic enters GCP
- Global backbone — Premium vs Standard routing
- Latency SLA — Fiber path engineering
- Anycast — Global load balancing
- Traffic routing — Cold potato vs hot potato
- Network tiers — SLA implementation
- Bandwidth allocation — Capacity & costs
- Regional services — Data sovereignty
- Traffic engineering — Multi-path resilience
Together: A complete understanding of GCP's physical network architecture for production systems.
Next Steps
- Implement: Choose right tier/region for your workload
- Monitor: Set up per-path monitoring
- Design: Plan for n+1 resilience
- Audit: Verify regional constraints
- Optimize: Balance latency, cost, and compliance