Traffic Engineering & Multi-path Load Balancing

Why This Matters in Production

Finally, we tie everything together: how GCP optimizes packet paths across the Jupiter fabric, Andromeda, and the backbone:

  • ECMP routing: Distribute load across multiple equal paths
  • Capacity planning: Ensure headroom for failures + peaks
  • Failure scenarios: How network recovers from failures
  • Multi-path resilience: Automatic failover without manual intervention

Understanding the network at this level lets you architect resilient networks at scale.

Internal Model: Multi-path Routing Architecture

ECMP (Equal-Cost Multi-Path) Routing

Standard unicast routing:
┌──────────────┐
│ Source VM    │
│ 203.0.113.1  │
└──────┬───────┘

    ┌──▼──────────────────────────────────────────┐
    │ Routing table lookup:                       │
    │ Destination: 35.201.123.45                  │
    │ Result: 1 best path to gateway              │
    └──────────┬─────────────────────────────────┘

    ┌──────────▼──────────────────────────────┐
    │ Single gateway                          │
    │ Packet: →[Router]→[Switch]→[Fiber]→... │
    └──────────┬───────────────────────────────┘
               │ (Single point of failure if the router fails!)
    ┌──────────▼──────────────┐
    │ Destination            │
    │ 35.201.123.45 (single) │
    └───────────────────────┘

Problem:
├─ Single path = single point of failure
└─ If router down: All traffic blocked

ECMP multipath routing:
┌──────────────┐
│ Source VM    │
│ 203.0.113.1  │
└──────┬───────┘

    ┌──▼──────────────────────────────────────────────┐
    │ Routing table lookup:                           │
    │ Destination: 35.201.123.45                      │
    │ Result: 4 equal-cost paths (same cost metric)   │
    │ ├─ Path A via Gateway1 (cost 100)               │
    │ ├─ Path B via Gateway2 (cost 100)               │
    │ ├─ Path C via Gateway3 (cost 100)               │
    │ └─ Path D via Gateway4 (cost 100)               │
    └──────────┬─────────────────────────────────────┘

    ┌──────────┴────────────┬────────────┬─────────────┐
    │                       │            │             │
    ▼                       ▼            ▼             ▼
┌─────────┐            ┌─────────┐ ┌─────────┐ ┌─────────┐
│Gateway1 │            │Gateway2 │ │Gateway3 │ │Gateway4 │
│(25%)    │            │(25%)    │ │(25%)    │ │(25%)    │
└────┬────┘            └────┬────┘ └────┬────┘ └────┬────┘
     │                      │           │           │
     │ (ECMP: Hash on flow) │           │           │
     │ Flow1→Path A         │           │           │
     │ Flow2→Path B         │           │           │
     │ Flow3→Path C         │           │           │
     │ Flow4→Path D         │           │           │
     │ Flow5→Path A (reuse) │           │           │
     │                      ▼           ▼           ▼
     └──────────────────────┴───────────┴───────────→ Destination

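Benefits:
├─ Load balanced: ~25% of flows per path (averaged over many flows)
├─ Resilience: If Path A goes down, Flow1 is rehashed onto a surviving path
├─ Bandwidth: Up to 4x aggregate capacity vs single path (a single flow still uses one path)
└─ Automatic: No manual intervention needed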

ECMP Hash Function

When router sees multiple equal-cost paths:
├─ Instead of using "first" path
├─ Calculate: Hash(source_IP, dest_IP, src_port, dst_port)
├─ Result: 32-bit hash value
└─ Use: hash % num_paths = selected path

Example with 4 paths (illustrative hash values; index 0 = Path A):
├─ Flow (203.0.113.1:8000 → 35.201.123.45:443)
│  └─ Hash: 0x4a5f9c2e % 4 = 2 → Path C

├─ Flow (203.0.113.2:8001 → 35.201.123.45:443)
│  └─ Hash: 0x7b3e1d58 % 4 = 0 → Path A

└─ Packets of the same flow always take the same path (for TCP session stability)

Benefit:
└─ Per-flow load balancing (not per-packet, which would reorder)
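
The hash-then-modulo selection described above can be sketched in a few lines of shell. This is only an illustration: the function name `ecmp_path` is hypothetical, and CRC32 from the POSIX `cksum` utility stands in for whatever hash real routers use, so the path indexes it prints will not match the illustrative hex values in the example.

```shell
#!/usr/bin/env bash
# Sketch of per-flow ECMP path selection: hash the flow tuple, then take
# the result modulo the number of equal-cost paths.
# Assumption: CRC32 via `cksum` is a stand-in for the router's hash function.

ecmp_path() {
  local src_ip="$1" src_port="$2" dst_ip="$3" dst_port="$4" num_paths="$5"
  local hash
  # cksum prints "<crc> <byte-count>"; keep only the CRC (decimal, 32-bit).
  hash=$(printf '%s' "${src_ip}:${src_port}:${dst_ip}:${dst_port}" | cksum | cut -d' ' -f1)
  echo $(( hash % num_paths ))
}

# Two flows from the example; each prints a path index between 0 and 3.
ecmp_path 203.0.113.1 8000 35.201.123.45 443 4
ecmp_path 203.0.113.2 8001 35.201.123.45 443 4
```

Calling `ecmp_path` twice with the same tuple always returns the same index, which is the property that keeps all packets of one TCP session on one path.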

Production Architecture Patterns

Pattern 1: Data Center Network with ECMP

Deployment: 200 VMs processing 50Gbps traffic

Network topology (Jupiter fabric):
├─ Leaf switches: 4 (one per rack)
├─ Spine switches: 8

└─ Paths per VM:
   ├─ To spine: 2 parallel links (LAG to 2 spines)
   ├─ From spine: 8 possible paths (to any spine → any leaf)
   └─ Total paths to any destination: 8 (ECMP)

Traffic distribution:
├─ 50Gbps = 50,000 flows (1Mbps per flow average)
├─ ECMP hash: Distribute flows across 8 paths
├─ Per path: 6.25Gbps (50Gbps ÷ 8)
├─ Per link: Can handle 100Gbps
└─ Utilization: 6% (plenty of headroom)

Failure scenario (1 spine down):
├─ Paths available: 7 (instead of 8)
├─ Per-path load: 7.1Gbps
├─ Still acceptable: 7% utilization
├─ Automatic: No config changes needed, ECMP recalculates

Failure scenario (1 leaf down):
├─ 50 VMs affected (on that leaf)
├─ Traffic rerouted: Via remaining leaves
├─ Temporary: Congestion spike (~15% util for 30 seconds)
├─ Recovery: Workloads reschedule to other leaves (~5 min)
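
The per-path arithmetic in this pattern can be reproduced with a small helper. It is a rough sketch that assumes traffic spreads perfectly evenly across paths, which real ECMP only approximates; the function name `per_path_load` is hypothetical.

```shell
#!/usr/bin/env bash
# per_path_load TOTAL_GBPS NUM_PATHS LINK_GBPS
# Assumes a perfectly even spread of traffic across all paths.
per_path_load() {
  awk -v total="$1" -v paths="$2" -v link="$3" 'BEGIN {
    load = total / paths
    printf "%d paths: %.2f Gbps per path (%.2f%% of a %d Gbps link)\n",
           paths, load, load / link * 100, link
  }'
}

per_path_load 50 8 100   # all spines healthy -> 6.25 Gbps per path
per_path_load 50 7 100   # one spine down     -> 7.14 Gbps per path
```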

Pattern 2: Global Multi-Region with ECMP

Global application: 4 regions (us-central1, eu-west1, asia-southeast1, us-east1)
Peak traffic: 100Gbps to any region

Per-region ECMP setup:
├─ Each region: 8 spines (ECMP paths to other regions)
├─ Regional egress: 4 primary paths to backbone
├─ Plus: 2 backup paths (in case of congestion/failure)

└─ Per-region capacity:
   ├─ 100Gbps to each other region (over multiple paths)
   └─ Total: 300Gbps (to all 3 other regions)

Traffic engineering during peak:
├─ Normal: Load distributed via ECMP
├─ Congestion detected: Traffic shifted to secondary paths
├─ Result: Automatic re-balancing (milliseconds)

Failure scenario (transatlantic cable cut):
├─ Primary path (via Cable A): Down
├─ Backup path (via Cable B): Activated
├─ ECMP: Recalculates, includes backup path
├─ Result: Latency increases (longer route) but connectivity maintained
├─ Recovery: <100ms failover (ECMP convergence time)

Pattern 3: Capacity Planning for Failure Scenarios

Production SLA: 99.99% (52.6 min downtime/year)
Normal traffic: 10Gbps average
Peak traffic: 20Gbps (2x normal)
Target resilience: Survive any single failure

Capacity formula:
├─ Normal capacity needed: 10Gbps (baseline)
├─ Peak capacity needed: 20Gbps (expected peak)
├─ Failure resilience: Must absorb any 1 failure
├─ Required total: 20 * 1.5 = 30Gbps (n+1 over 3 regions: 3 / (3 - 1) = 1.5x)

Allocation strategy:
├─ Region 1: 10Gbps capacity (half of peak)
├─ Region 2: 10Gbps capacity (half of peak)
├─ Region 3: 10Gbps capacity (half of peak, spare while all regions are healthy)
├─ Total provisioned: 30Gbps

└─ Failure scenarios:
   ├─ Region 1 fails: Region 2+3 can handle 20Gbps ✓
   ├─ Region 2 fails: Region 1+3 can handle 20Gbps ✓
   ├─ Region 3 fails: Region 1+2 can handle 20Gbps ✓
   └─ Any single failure: SLA maintained ✓
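
The 1.5x factor is specific to three equal regions. A minimal sketch of the general case follows, assuming N equally sized regions where any N - 1 of them must carry the full peak; the function name `n_plus_1_capacity` is hypothetical.

```shell
#!/usr/bin/env bash
# n_plus_1_capacity PEAK_GBPS NUM_REGIONS
# Each region is sized so that the remaining N-1 regions can carry the peak.
n_plus_1_capacity() {
  awk -v peak="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "need at least 2 regions" > "/dev/stderr"; exit 1 }
    per_region = peak / (n - 1)
    printf "per-region: %.1f Gbps, total: %.1f Gbps, factor: %.2fx\n",
           per_region, per_region * n, n / (n - 1)
  }'
}

n_plus_1_capacity 20 3   # per-region: 10.0 Gbps, total: 30.0 Gbps, factor: 1.50x
n_plus_1_capacity 20 4   # more regions lower the overprovisioning factor
```

With more regions the same guarantee costs less spare capacity, because the failed region's share is spread over more survivors.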

Real-world Failure Scenarios

Scenario 1: Asymmetric Path Failure (ECMP Challenge)

Setup: 4 ECMP paths from source to destination

Symptom:
├─ Throughput: 50Gbps → 25Gbps (50% drop)
├─ No packets lost
├─ No errors in logs

Root cause:
├─ Path A (ingress): Working normally
├─ Path A (egress): Congested (return path bottleneck)
├─ Path B,C,D: Not affected

└─ Impact:
   ├─ Inbound via Paths A,B,C,D: 12.5Gbps each
   ├─ Outbound via Path A only: 50Gbps offered, ~25Gbps delivered (limited!)
   └─ Result: Bottleneck at egress

Investigation:
├─ Measure: One-way latency A→dest vs dest→A
├─ Result: Return path 10x slower
├─ Cause: Asymmetric congestion

Resolution:
├─ Shift egress traffic to use other paths
├─ Require: Application-level routing control
├─ Or: Manual traffic engineering (adjust BGP weights)

Scenario 2: ECMP Hash Collision (Unbalanced Load)

Symptom:
├─ VM1: 100% CPU (100Gbps traffic)
├─ VM2: 10% CPU (10Gbps traffic)
├─ VM3: 10% CPU
├─ Total traffic: 120Gbps

Expected:
└─ 40Gbps per VM (equal distribution)

Root cause:
├─ ECMP hash function: Hash(flow_tuple) % num_paths
├─ Large flows: All happen to hash to same path
├─ Example:
│  ├─ Flow: (203.0.113.1:50000 → 35.201.123.45:443) Hash=0 → Path 0 (VM1)
│  ├─ Flow: (203.0.113.1:50001 → 35.201.123.45:443) Hash=0 → Path 0 (VM1)
│  ├─ Flow: (203.0.113.1:50002 → 35.201.123.45:443) Hash=0 → Path 0 (VM1)
│  └─ Result: VM1 gets multiple large flows, others get small flows

Detection:
├─ Monitor per-flow distribution
├─ Alert: If any single path exceeds ~70% utilization or runs far above the per-path average
└─ Requires: Network telemetry (NetFlow/sFlow)

Resolution:
├─ Short-term: Redistribute flows manually (drain VM1)
├─ Long-term: Improve hash function (use more tuple elements)
├─ Or: Use application-level load balancing instead of ECMP
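
The detection step can be sketched as a comparison of each path's load against the per-path average. The input format (`name gbps` per line), the function name `flag_hot_paths`, and the 1.5x ratio are all assumptions for illustration; in practice the numbers would come from NetFlow/sFlow exports.

```shell
#!/usr/bin/env bash
# flag_hot_paths RATIO
# Reads "name gbps" lines on stdin and reports any entry whose load
# exceeds RATIO times the average load.
flag_hot_paths() {
  awk -v ratio="$1" '
    { load[$1] = $2; total += $2; n++ }
    END {
      if (n == 0) exit
      avg = total / n
      for (p in load)
        if (load[p] > avg * ratio)
          printf "%s: %.0f Gbps is %.1fx the average (%.0f Gbps)\n",
                 p, load[p], load[p] / avg, avg
    }'
}

# The symptom from this scenario: 100 + 10 + 10 Gbps, average 40 Gbps.
printf 'VM1 100\nVM2 10\nVM3 10\n' | flag_hot_paths 1.5
# prints: VM1: 100 Gbps is 2.5x the average (40 Gbps)
```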

Scenario 3: Cascade Failures (Multiple Failures)

Failure sequence:
├─ T+0:00 Path A fails (high utilization)
├─ T+0:05 Remaining paths: 7 (instead of 8)
├─ T+0:10 Congestion: Each path now carries 14.3% of total traffic (was 12.5%)

├─ T+0:15 Path B fails (triggered by congestion on Path A's backup)
├─ T+0:20 Remaining paths: 6
├─ T+0:25 Congestion: Each path now carries 16.7% of total traffic

├─ T+0:30 Path C fails (cascade continues)
│  └─ Pattern: Each failure triggers congestion, which triggers next failure

└─ Result: Cascading failure (failure chain reaction)

Why cascades happen:
├─ Congestion threshold: Path degrades after ~70% utilization
├─ Multiple failures: Remaining paths saturated
├─ Loss feedback: Lost packets trigger retransmits, more congestion
└─ Doom loop: Congestion → more loss → more congestion

Prevention:
├─ Design: n+2 redundancy (can lose 2 paths)
├─ Monitor: Alert when single path fails
├─ Failover: Drain load from remaining paths to prevent cascade
├─ Burst allowance: Built-in safeguards in GCP
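
To see how quickly headroom disappears as paths fail, the sketch below recomputes utilization after each loss and marks where it crosses the ~70% degradation threshold. The traffic and capacity figures (50 Gbps total, 10 Gbps per path) are hypothetical, chosen only to make the effect visible, and the function name `cascade_check` is also an assumption.

```shell
#!/usr/bin/env bash
# cascade_check TOTAL_GBPS PATH_CAPACITY_GBPS NUM_PATHS LIMIT_PCT FAILURES
# Prints utilization of the surviving paths after 0..FAILURES path losses.
cascade_check() {
  awk -v total="$1" -v cap="$2" -v paths="$3" -v limit="$4" -v fails="$5" 'BEGIN {
    for (f = 0; f <= fails && paths - f >= 1; f++) {
      p = paths - f
      util = total / (p * cap) * 100
      printf "%d paths: %.1f%%%s\n", p, util, (util > limit ? " OVER" : "")
    }
  }'
}

cascade_check 50 10 8 70 2
# 8 paths: 62.5%
# 7 paths: 71.4% OVER
# 6 paths: 83.3% OVER
```

In this example a fabric that looks comfortable at 62.5% is already past the threshold after a single failure, which is the argument for n+2 designs and for alerting on the first path loss.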

Common Mistakes & Anti-Patterns

Mistake 1: Designing for n Redundancy (Instead of n+1)

Wrong thinking:

"Have multiple paths, any one can fail and we're still ok"

Correct understanding:

  • n redundancy: Lose 1 path, keep working but with degraded performance
  • n+1 redundancy: Lose 1 path, no performance degradation
  • Need: n+1 to maintain SLA during single failure

Prevention: Design formula: total_capacity = peak_traffic * N / (N - 1) for N equal units (1.5x when N = 3).

Mistake 2: Not Monitoring Per-Path Utilization

Wrong thinking:

"Average utilization 40%, safe even if path fails"

Correct understanding:

  • Average: Meaningless if one path is 90% and others 10%
  • ECMP hash collision: Can cause unbalanced load
  • Must monitor: Per-path utilization

Prevention: Alert on any path above ~70% utilization, and on any path running far above the per-path average.

Mistake 3: Assuming Automatic Failover "Always Works"

Wrong thinking:

"ECMP automatic, failover handled by network"

Correct understanding:

  • ECMP failover: Automatic but takes 10-50ms
  • Long connections: Might get disrupted
  • TCP: Expects packet ordering; when the set of paths changes, ECMP can move a flow to a new path mid-connection
  • Result: Possible packet loss and reordering during failover

Prevention: Design applications to handle connection resets. Use connection pooling.
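
A short sketch shows why established connections can be disturbed even when their own path is healthy. It assumes naive `hash % num_paths` selection and that the failed path is the last one in the list, so surviving paths keep their indexes. Real switches often use resilient or consistent hashing to reduce this effect. The function name `count_remapped` is hypothetical.

```shell
#!/usr/bin/env bash
# count_remapped OLD_PATHS NEW_PATHS SAMPLES
# Counts how many hash values 0..SAMPLES-1 select a different path index
# once the number of paths changes under plain modulo selection.
count_remapped() {
  local old="$1" new="$2" samples="$3"
  local moved=0 h=0
  while [ "$h" -lt "$samples" ]; do
    if [ $(( h % old )) -ne $(( h % new )) ]; then
      moved=$(( moved + 1 ))
    fi
    h=$(( h + 1 ))
  done
  echo "${moved} of ${samples} flows change path"
}

count_remapped 4 3 12   # prints: 9 of 12 flows change path
```

Only 3 of those 12 flows were on the failed path. The other 6 that move were on healthy paths, which is why applications should tolerate brief reordering or resets during failover.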

GCP-native Implementation Guidance

Monitoring ECMP Load Distribution

bash
# Enable VPC Flow Logs on the subnet. Flow Logs record per-connection
# traffic, not the fabric path taken, so byte counts are only a proxy.
gcloud compute networks subnets update my-subnet \
  --enable-flow-logs \
  --region=us-central1

# List TCP flows (IP protocol 6), largest byte counts first
gcloud logging read \
  'resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows") AND jsonPayload.connection.protocol=6' \
  --format='value(jsonPayload.connection.src_ip, jsonPayload.connection.dest_ip, jsonPayload.bytes_sent)' \
  --limit=100 | sort -k3 -nr

# Aggregate bytes per minute (first 16 characters of end_time) to see volume over time
gcloud logging read \
  'resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows")' \
  --format='value(jsonPayload.end_time, jsonPayload.bytes_sent)' --limit=1000 | \
  awk '{sum[substr($1, 1, 16)] += $2} END {for (t in sum) print t, sum[t]/1000000000 " GB"}' | sort

Capacity Planning Calculation

bash
#!/bin/bash
# Calculate required capacity for failure resilience

PEAK_TRAFFIC_GBPS=50
REGIONS=3
RESILIENCE_FACTOR=1.5  # n+1 redundancy across 3 regions: 3 / (3 - 1)

REQUIRED_CAPACITY=$(echo "$PEAK_TRAFFIC_GBPS * $RESILIENCE_FACTOR" | bc)
PER_REGION=$(echo "scale=1; $REQUIRED_CAPACITY / $REGIONS" | bc)
REMAINING=$(echo "$REQUIRED_CAPACITY - $PER_REGION" | bc)

echo "Peak traffic: ${PEAK_TRAFFIC_GBPS}Gbps"
echo "Resilience factor: $RESILIENCE_FACTOR (n+1)"
echo "Required capacity: ${REQUIRED_CAPACITY}Gbps"
echo ""
echo "Deployment strategy:"
echo "├─ Regions 1-${REGIONS}: ${PER_REGION}Gbps each"
echo "└─ Total provisioned: ${REQUIRED_CAPACITY}Gbps"
echo ""
echo "Failure scenario:"
echo "├─ 1 region fails: Remaining regions absorb ${PEAK_TRAFFIC_GBPS}Gbps"
echo "├─ Capacity remaining: ${REMAINING}Gbps"
# Multiply before dividing so bc's fixed scale does not truncate the ratio
echo "├─ Utilization at peak: $(echo "scale=1; $PEAK_TRAFFIC_GBPS * 100 / $REMAINING" | bc)%"
echo "└─ Result: SLA maintained ✓"

Testing Failover Scenarios

bash
# Simulate failure by blocking a specific egress path
# 1. Identify path (via traceroute or NetFlow)
GATEWAY="192.168.1.1"

# 2. Block path using iptables (on test VM only!)
#    OUTPUT matches packets this VM sends to $GATEWAY;
#    FORWARD would only match traffic the VM routes on behalf of others.
sudo iptables -A OUTPUT -d "$GATEWAY" -j DROP

# 3. Observe application behavior
# - Does latency increase?
# - Do connections drop?
# - How long to recover?

# 4. Monitor logs
gcloud logging read 'resource.type="gce_instance" AND severity>=WARNING' \
  --freshness=10m \
  --format='table(severity, timestamp, textPayload)'

# 5. Restore path
sudo iptables -D OUTPUT -d "$GATEWAY" -j DROP

# 6. Verify recovery
# Analysis: How long until normal operation?

Conclusion: Putting It All Together

You've now covered:

  1. Andromeda (SDN layer) — How VPCs implemented
  2. Jupiter (Physical fabric) — Hardware topology
  3. PoP (Edge) — How traffic enters GCP
  4. Global backbone — Premium vs Standard routing
  5. Latency SLA — Fiber path engineering
  6. Anycast — Global load balancing
  7. Traffic routing — Cold potato vs hot potato
  8. Network tiers — SLA implementation
  9. Bandwidth allocation — Capacity & costs
  10. Regional services — Data sovereignty
  11. Traffic engineering — Multi-path resilience

Together: A complete understanding of GCP's physical network architecture for production systems.

Next Steps

  • Implement: Choose right tier/region for your workload
  • Monitor: Set up per-path monitoring
  • Design: Plan for n+1 resilience
  • Audit: Verify regional constraints
  • Optimize: Balance latency, cost, and compliance