Dynamic Routes & Cloud Router — BGP-based Routing Automation

Executive Summary

Cloud Router is a managed BGP speaker at the VPC level that enables dynamic route learning from on-premises networks.

Key points:

  • ✅ Automatic failover (no manual route updates needed)
  • ✅ Bi-directional route advertisement (GCP routes → on-prem, on-prem → GCP)
  • ✅ Multi-region hub-and-spoke topologies possible
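  • ❌ Route changes are not instantaneous (BGP convergence and route propagation take time; Cloud Router is control plane only and does not sit in the data path)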
  • ❌ Requires BGP expertise (ASN configuration, communities, filtering)

Cloud Router Architecture

Regional Scope

Cloud Router is a regional resource:

Organization
├── VPC prod-vpc (global)
│   ├── Cloud Router us-central1 (regional)
│   ├── Cloud Router europe-west1 (regional)
│   └── Cloud Router asia-southeast1 (regional)

Each region has its own BGP sessions and learns/advertises routes independently.

Implication (with the VPC's default regional dynamic routing mode):
  Route learned in us-central1 via BGP
  → NOT automatically propagated to europe-west1
  
  Must configure a separate BGP session per region
  to learn and advertise the same routes there

BGP Session Components

yaml
Cloud Router (us-central1):
  - ASN: 64514 (private ASN chosen for the Cloud Router; 16-bit private range 64512-65534)
  - Interface IP: 169.254.1.1/30 (BGP session interface)
  
On-premises:
  - ASN: 65001 (Customer ASN)
  - Interface IP: 169.254.1.2/30
  
BGP Session:
  - Neighbors: 169.254.1.1 ↔ 169.254.1.2
  - OPEN handshake
  - KEEPALIVE every 20 seconds (Cloud Router default; hold timer 60 seconds)
  - UPDATE messages: route advertisements
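
The session parameters above can be sanity-checked before any configuration is applied. The following is a minimal sketch in Python, not part of any Google tooling: `check_bgp_session` is a hypothetical helper, and the rules it encodes (both ends in the same link-local subnet, distinct ASNs, Cloud Router ASN in the 16-bit private range) are assumptions taken from the example values in this section.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")
PRIVATE_ASN_16BIT = range(64512, 65535)  # 64512-65534 inclusive


def check_bgp_session(local_if, peer_if, local_asn, peer_asn):
    """Return a list of problems found in a BGP session definition (empty if none)."""
    local = ipaddress.ip_interface(local_if)
    peer = ipaddress.ip_interface(peer_if)
    problems = []
    if local.network != peer.network:
        problems.append("interface IPs are in different subnets")
    if local.ip == peer.ip:
        problems.append("both ends use the same IP")
    if local.ip not in LINK_LOCAL or peer.ip not in LINK_LOCAL:
        problems.append("interface IPs are not link-local")
    if local_asn == peer_asn:
        problems.append("local and peer ASN must differ for eBGP")
    if local_asn not in PRIVATE_ASN_16BIT:
        problems.append("local ASN is outside the 16-bit private range")
    return problems


# Values from the session description above.
print(check_bgp_session("169.254.1.1/30", "169.254.1.2/30", 64514, 65001))
```

An empty list means the four values are at least mutually consistent; it says nothing about whether the tunnel or the on-prem router is actually configured the same way.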

BGP Configuration

Creating Cloud Router

bash
gcloud compute routers create router-us-central1 \
  --network=prod-vpc \
  --region=us-central1 \
  --asn=64514

# Verify:
gcloud compute routers describe router-us-central1 \
  --region=us-central1

Creating BGP Peer (On-Premises)

bash
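# For a Cloud VPN tunnel (the router interface must already exist;
# create it with `gcloud compute routers add-interface` and --vpn-tunnel):
gcloud compute routers add-bgp-peer router-us-central1 \
  --peer-name=bgp-site-a \
  --interface=vpn-interface-site-a \
  --peer-ip-address=169.254.1.2 \
  --peer-asn=65001 \
  --region=us-central1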

# Alternatively, for Cloud Interconnect VLAN attachment:
gcloud compute routers add-bgp-peer router-us-central1 \
  --peer-name=bgp-interconnect \
  --interface=ic-vlan-interface \
  --peer-asn=65000 \
  --region=us-central1

Route Advertisement

bash
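# Advertise only specific ranges (instead of all VPC subnets).
# Adding --set-advertisement-groups=all_subnets here would advertise
# every subnet in addition to the custom ranges:
gcloud compute routers update router-us-central1 \
  --region=us-central1 \
  --advertisement-mode=custom \
  --set-advertisement-ranges=10.0.1.0/24,10.0.2.0/24

# Or advertise all VPC subnets explicitly in custom mode.
# Cloud Router does not re-advertise routes learned from other BGP peers,
# so there is no group for learned routes:
gcloud compute routers update router-us-central1 \
  --region=us-central1 \
  --advertisement-mode=custom \
  --set-advertisement-groups=all_subnets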
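
In custom mode the advertised set is the union of the selected groups and the custom ranges, which is easy to get wrong when the intent is to advertise only a few prefixes. The sketch below models that rule in Python; `advertised_prefixes` is a hypothetical function and a simplification that ignores per-peer overrides.

```python
import ipaddress


def advertised_prefixes(mode, vpc_subnets, groups=(), custom_ranges=()):
    """Return the sorted prefixes a router would advertise under this simplified model."""
    if mode == "default":
        # Default mode: every subnet visible to the router is advertised.
        selected = set(vpc_subnets)
    elif mode == "custom":
        # Custom mode: custom ranges, plus all subnets only if the group is set.
        selected = set(custom_ranges)
        if "all_subnets" in groups:
            selected |= set(vpc_subnets)
    else:
        raise ValueError(f"unknown advertisement mode: {mode}")
    return [str(n) for n in sorted(ipaddress.ip_network(p) for p in selected)]


subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
wanted = ["10.0.1.0/24", "10.0.2.0/24"]
print(advertised_prefixes("custom", subnets, custom_ranges=wanted))
print(advertised_prefixes("custom", subnets, groups=("all_subnets",), custom_ranges=wanted))
```

The second call also returns 10.0.3.0/24, which shows why the `all_subnets` group has to be left out when only specific ranges should reach on-prem.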

Route Learning & Propagation

BGP UPDATE Messages

On-prem BGP peer sends UPDATE:

NLRI: 192.168.0.0/24
AS_PATH: 65001
NEXT_HOP: 169.254.1.2

Cloud Router receives:
  - Learns 192.168.0.0/24 reachable via AS 65001
  - Converts to GCP route:
    Destination: 192.168.0.0/24
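    Next Hop: BGP peer 169.254.1.2 (reached through the VPN tunnel or VLAN attachment)
    Type: Dynamic (BGP)
    Priority: derived from the MED the peer advertises (lower value = preferred)

Propagates within VPC:
  - Regional dynamic routing mode (default): only instances in the Cloud Router's region can reach 192.168.0.0/24
  - Global dynamic routing mode: instances in all regions can reach 192.168.0.0/24
  - Return path is not automatic: on-prem must learn the VPC ranges that Cloud Router advertises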

Route Propagation Delays

Timeline:

t=0: On-prem advertises 192.168.0.0/24 via BGP
t=1: Cloud Router receives BGP UPDATE
t=1-5: Route propagates within GCP (eventual consistency; timings are illustrative)
t=5: Instances in other regions see route

Result: a delay on the order of seconds from BGP learn to VMs using the route

Impact:
  - Not instantaneous failover
  - Connections may fail briefly during propagation
  - Use retry logic in clients to tolerate the gap
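
One way to tolerate the brief failures during propagation is to retry with exponential backoff at the application level. This is a generic sketch, not something specific to Cloud Router: `call_with_retry`, its defaults, and the choice of exceptions to retry on are all assumptions to adapt to the client library in use.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on a connection error, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

With the defaults, the total wait before giving up is 7.5 seconds (0.5 + 1 + 2 + 4), which covers a propagation gap of a few seconds; the `sleep` parameter exists only so the delay can be replaced in tests.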

Regional vs Global Modes

Regional Mode (Default)

Scenario: Separate BGP session per region

Region: us-central1
  Cloud Router: learning 192.168.0.0/16
  Routes: available to us-central1 VMs

Region: europe-west1
  Cloud Router: NOT learning anything (no BGP peer)
  Routes: NOT available to europe-west1 VMs

Solution: Set up separate BGP peer in europe-west1

Global Mode

Scenario: Global route learning (the VPC's dynamic routing mode is set to global)

Cloud Router us-central1: learns 192.168.0.0/16 via BGP
  → Propagates to ALL regions (not just us-central1)
  
Cloud Router europe-west1: NOT configured
  → Inherits routes from us-central1

Advantage: Single BGP session for multi-region
Disadvantage: Traffic from other regions crosses the VPC backbone to reach the tunnel

Example packet flow:
  europe-west1 VM → on-prem: crosses to us-central1, exits through the us-central1 tunnel
  on-prem → europe-west1 VM: enters through us-central1, crosses to europe-west1
  
Result: Cross-region hop (latency, inter-region costs)
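
The difference between the two modes comes down to which regions see a route learned by one Cloud Router. A small Python model of that rule, where `regions_with_route` is a hypothetical helper and the mode names mirror the VPC dynamic routing mode values:

```python
def regions_with_route(learning_region, vpc_regions, routing_mode):
    """Return the regions whose VMs can use a route learned in learning_region."""
    if routing_mode == "global":
        # Learned routes are shared with every region of the VPC.
        return sorted(vpc_regions)
    if routing_mode == "regional":
        # Learned routes stay in the region of the Cloud Router that learned them.
        return [learning_region] if learning_region in vpc_regions else []
    raise ValueError(f"unknown routing mode: {routing_mode}")


regions = ["us-central1", "europe-west1", "asia-southeast1"]
print(regions_with_route("us-central1", regions, "regional"))
print(regions_with_route("us-central1", regions, "global"))
```

The model deliberately leaves out route priority: in global mode, routes used from a remote region are typically less preferred than equivalent routes learned locally.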

Production Patterns

Pattern 1: Hub-and-Spoke with Cloud Router

Architecture:

                    On-Premises (Site A)
                    192.168.0.0/16
                           ↑ BGP

                        VPN Tunnel

    ┌──────────────────────┴──────────────────────┐
    │      prod-vpc (GCP)                         │
    │      10.0.0.0/16                           │
    │                                            │
    │  Cloud Router us-central1 (HUB)            │
    │  ASN: 64514                                │
    │  BGP session → Site A                      │
    │                                            │
    │  Advertises:                               │
    │    - 10.0.1.0/24 (us-central1 subnet)     │
    │    - 10.0.2.0/24 (europe-west1 subnet)    │
    └──────────────────────┬──────────────────────┘

Routing:

us-central1 VM:
  Route: 192.168.0.0/16 → Cloud Router us-central1
  Exit: VPN tunnel
  
europe-west1 VM (requires global dynamic routing mode on the VPC):
  Route: 192.168.0.0/16 → Cloud Router us-central1 (dynamic)
  Transit: Cross-region through VPC backbone
  Exit: VPN tunnel via us-central1
  
Result: All traffic → on-prem exits us-central1
        europe-west1 traffic takes a cross-region hop (extra latency and cost)
        
Fix: Set up Cloud Router europe-west1 with its own tunnel and BGP session

Pattern 2: Multi-Site BGP with Failover

Architecture:

  Site A (192.168.1.0/24)
        ↑ BGP
        │ AS_PATH: 65001

  Cloud Router us-central1

  Site B (192.168.2.0/24)
        ↑ BGP
        │ AS_PATH: 65002

  Cloud Router us-central1

GCP receives two different routes:
  - 192.168.1.0/24 via AS 65001 (Site A)
  - 192.168.2.0/24 via AS 65002 (Site B)

If Site A fails:
  - BGP session down
  - Route 192.168.1.0/24 withdrawn
  - Traffic to Site A drops
  
Automatic failover NOT possible (different CIDRs)

Better pattern: Advertise same CIDR from both sites
  - Site A: advertises 192.168.0.0/16 (AS_PATH: 65001)
  - Site B: advertises 192.168.0.0/16 (AS_PATH: 65002 65002, prepended to make it longer)
  
GCP BGP path selection: shortest AS_PATH wins (compared among routes learned by the same Cloud Router)
  - Prefers Site A (path length 1 < 2)
  - If Site A fails, falls back to Site B
  
Automatic failover happens!
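
The failover behaviour described above can be modelled as "pick the shortest AS_PATH among routes whose peer is still up". The Python sketch below does only that; `BgpRoute` and `best_route` are hypothetical names, Site B's path is assumed to be two ASNs long, and real path selection involves further steps (for example MED) that are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpRoute:
    prefix: str
    as_path: tuple
    peer: str


def best_route(routes, prefix, down_peers=frozenset()):
    """Pick the route for prefix with the shortest AS_PATH among peers that are up."""
    candidates = [r for r in routes if r.prefix == prefix and r.peer not in down_peers]
    if not candidates:
        return None  # prefix is unreachable
    return min(candidates, key=lambda r: len(r.as_path))


ROUTES = [
    BgpRoute("192.168.0.0/16", (65001,), "site-a"),
    BgpRoute("192.168.0.0/16", (65002, 65002), "site-b"),
]

print(best_route(ROUTES, "192.168.0.0/16").peer)
print(best_route(ROUTES, "192.168.0.0/16", down_peers={"site-a"}).peer)
```

The first call selects Site A and the second falls back to Site B, which is the same-CIDR failover the pattern relies on.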

Pattern 3: Route Filtering with Communities

Scenario: Different routing policies per region

On-premises advertises:
  - 192.168.1.0/24 (critical apps) + community 65001:100
  - 192.168.2.0/24 (dev apps) + community 65001:200

Cloud Router can filter on import with BGP route policies (where that feature is available for the router):

us-central1 (production):
  Import: Accept community 65001:100 only
  Effect: Only critical routes learned
  
europe-west1 (development):
  Import: Accept community 65001:200 only
  Effect: Only dev routes learned

Result: Different routing policies per region
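
The effect of a community-based import filter can be illustrated without any router at all. This Python sketch is a model of the behaviour only, not Cloud Router policy syntax; `import_routes` and the `RECEIVED` data structure are hypothetical.

```python
def import_routes(received, accepted_communities):
    """Return the prefixes whose community set intersects the accepted communities."""
    accepted = set(accepted_communities)
    return [r["prefix"] for r in received if accepted & set(r["communities"])]


# Routes as advertised by on-premises in the scenario above.
RECEIVED = [
    {"prefix": "192.168.1.0/24", "communities": ["65001:100"]},
    {"prefix": "192.168.2.0/24", "communities": ["65001:200"]},
]

print(import_routes(RECEIVED, ["65001:100"]))  # us-central1 (production)
print(import_routes(RECEIVED, ["65001:200"]))  # europe-west1 (development)
```

A route carrying neither community is dropped by both filters in this model, so untagged prefixes need an explicit decision in a real policy.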

Pattern 4: Dynamic Failover with Multiple Tunnels

Setup:

Primary VPN tunnel: us-central1 → Site A
  BGP session 1: Cloud Router us-central1 ← Site A
  ASN: 65001

Backup VPN tunnel: europe-west1 → Site B
  BGP session 2: Cloud Router europe-west1 ← Site B
  ASN: 65002
  
Routes:
  Route 192.168.0.0/16 via AS 65001 (primary)
  Route 192.168.0.0/16 via AS 65002 (backup)

Failover:

Normal:
  Packets 192.168.0.0/16 → VPN us-central1 → Site A

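Site A down:
  BGP session 1 goes down when the hold timer expires (60 seconds with Cloud Router's default 20-second keepalive)
  Route via 65001 withdrawn
  Fall back to route via 65002 (usable from other regions only with global dynamic routing mode)
  Packets → VPN europe-west1 → Site B
  
  Automatic failover (detection time is bounded by the BGP hold timer; BFD, where supported, shortens it to seconds)

Advantage over static routes:
  - No manual intervention needed
  - Health check built-in (BGP KEEPALIVE)
  - Failover does not wait for an operator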
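
How quickly a dead peer is noticed depends on the timers in use. The sketch below computes worst-case detection times under two common conventions: the BGP hold timer as three times the keepalive interval, and BFD detection as interval times multiplier. The function names and the sample values are illustrative assumptions; check the timers actually negotiated on both ends.

```python
def bgp_hold_time(keepalive_s, multiplier=3):
    """Worst-case seconds before a silent BGP peer is declared down."""
    return keepalive_s * multiplier


def bfd_detection_time(min_interval_ms, multiplier):
    """Worst-case seconds before BFD declares the path down."""
    return min_interval_ms * multiplier / 1000


print(bgp_hold_time(20))             # keepalive 20s
print(bgp_hold_time(60))             # keepalive 60s
print(bfd_detection_time(1000, 5))   # 1000 ms interval, multiplier 5
```

Comparing the results shows why keepalive-based detection is measured in tens of seconds or minutes, while BFD brings it down to single-digit seconds.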

BGP Best Practices

BGP Communities for Policy

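text
# Tag routes with a community for filtering (Cisco IOS-style syntax; adapt to your router).
# PL-CRITICAL and PL-NONCRITICAL are placeholder prefix-list names.

On-premises:
  router bgp 65001
    neighbor 169.254.1.1 remote-as 64514
    address-family ipv4
      neighbor 169.254.1.1 send-community
      neighbor 169.254.1.1 route-map ADD-COMMUNITY out

  route-map ADD-COMMUNITY permit 10
    match ip address prefix-list PL-CRITICAL
    set community 65001:100
    
  route-map ADD-COMMUNITY permit 20
    match ip address prefix-list PL-NONCRITICAL
    set community 65001:200

GCP Cloud Router import policy:
  (Configure via BGP route policies that match on these communities)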

Graceful Shutdown

bash
# Enable BFD for fast failure detection (Cloud Interconnect peers):
gcloud compute routers update-bgp-peer router-us-central1 \
  --peer-name=bgp-interconnect \
  --region=us-central1 \
  --bfd-session-initialization-mode=active

# Gracefully disable a session: the peer stays in the configuration,
# but the session is torn down and its routes are withdrawn:
gcloud compute routers update-bgp-peer router-us-central1 \
  --peer-name=bgp-site-a \
  --region=us-central1 \
  --no-enabled

# Then remove the peer:
gcloud compute routers remove-bgp-peer router-us-central1 \
  --peer-name=bgp-site-a \
  --region=us-central1

Troubleshooting BGP

Symptom: Routes Not Appearing

bash
Diagnosis:

1. Check BGP session status:
gcloud compute routers get-status router-us-central1 \
  --region=us-central1 \
  --format=json

Output (abridged): 
  "bgpPeerStatus": [{
    "name": "bgp-site-a",
    "status": "UP",  # or "DOWN"
    "state": "Established",
    "uptimeSeconds": "3600",
    "numLearnedRoutes": 5
  }]

2. If status=DOWN:
  - Check VPN tunnel: gcloud compute vpn-tunnels list
  - Check BGP peer configuration: gcloud compute routers describe
  - Check BGP ASN/interface IPs match on both ends
  - Check on-prem BGP neighbor state
  
3. If status=UP but numLearnedRoutes=0:
  - Check import policy (maybe filtering routes)
  - Check on-prem is advertising routes

4. Check advertised routes:
gcloud compute routers get-status router-us-central1 \
  --region=us-central1 \
  --format="flattened(result.bgpPeerStatus[].advertisedRoutes[].destRange)"
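
The decision steps above can be captured as a small triage function, for example inside a monitoring script. This Python sketch takes already-extracted values rather than raw `get-status` output, so it makes no assumption about field names; `diagnose_peer` is a hypothetical helper.

```python
def diagnose_peer(session_up, learned_prefixes):
    """Return the next checks to run for a BGP peer, following the steps above."""
    if not session_up:
        return [
            "check the VPN tunnel status",
            "check the BGP peer configuration on the Cloud Router",
            "check the on-prem BGP neighbor state",
        ]
    if learned_prefixes == 0:
        return [
            "check the import policy for filtered routes",
            "check that on-prem is advertising routes",
        ]
    return []  # session is up and routes are being learned


for check in diagnose_peer(session_up=True, learned_prefixes=0):
    print(check)
```

An empty result only means the control plane looks healthy; reachability problems can still come from firewall rules or from what the router advertises back to on-prem.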

Symptom: Asymmetric Routing

Problem: GCP → on-prem works, on-prem → GCP fails

Likely cause: on-prem has no usable route back to the VPC subnets (Cloud Router is not advertising them, or on-prem is not accepting them)

Solution:
1. Check what GCP is advertising:
   gcloud compute routers get-status router-us-central1 \
     --region=us-central1
   
2. Check on-prem BGP table:
   (On-prem router) show ip bgp
   
3. If on-prem shows GCP routes:
   → BGP working, problem is GCP firewall/route
   
   Check ingress firewall: gcloud compute firewall-rules list
   Check destination routing tables

4. If on-prem doesn't show GCP routes:
   → BGP not advertising properly
   
   Check export policy
   gcloud compute routers describe router-us-central1 \
     --region=us-central1
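
Steps 1 and 2 amount to comparing two prefix lists: what Cloud Router advertises and what the on-prem BGP table contains. A minimal Python sketch of that comparison, assuming both lists have already been collected by hand or by script (`missing_on_prem` is a hypothetical helper):

```python
import ipaddress


def missing_on_prem(advertised_by_gcp, seen_on_prem):
    """Return prefixes that GCP advertises but the on-prem BGP table lacks."""
    advertised = {ipaddress.ip_network(p) for p in advertised_by_gcp}
    seen = {ipaddress.ip_network(p) for p in seen_on_prem}
    return [str(n) for n in sorted(advertised - seen)]


print(missing_on_prem(["10.0.1.0/24", "10.0.2.0/24"], ["10.0.1.0/24"]))
```

A non-empty result points to step 4 (advertisement or on-prem import filtering); an empty result points to step 3 (firewall rules or routing on the destination side).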

Symptom: BGP Session Flapping

Problem: BGP session going up/down repeatedly

Symptoms:
  - Routes disappear/reappear
  - High latency
  - Packet loss

Causes:
1. Network instability: packet loss in VPN tunnel
   → BGP hold timer expires (60 seconds with Cloud Router's default 20-second keepalive; many on-prem routers default to 180 seconds)
   
2. MTU mismatch: packets fragmented, BGP UPDATE dropped
   Check: GCP VPN MTU (1460), on-prem MTU (must match)
   
3. BGP configuration mismatch: ASN/interface IP differs

Solution:
  - Enable BFD for fast failure detection (where supported, for example on Cloud Interconnect peers)
  - Check tunnel throughput/packet loss
  - Verify MTU end-to-end
  - Increase BGP timers if stable but slow
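
Flapping is easier to act on when it is measured. The following Python sketch counts UP to DOWN transitions in a recent window of polled session states; `count_flaps`, the sample format of `(timestamp_seconds, is_up)` tuples, and the window size are assumptions for illustration.

```python
def count_flaps(samples, window_s):
    """Count UP->DOWN transitions among samples within window_s of the newest sample.

    samples: list of (timestamp_seconds, is_up) tuples in chronological order.
    """
    if not samples:
        return 0
    newest = samples[-1][0]
    recent = [s for s in samples if newest - s[0] <= window_s]
    return sum(1 for (_, prev), (_, cur) in zip(recent, recent[1:]) if prev and not cur)


# One sample per minute: the session dropped twice in five minutes.
SAMPLES = [(0, True), (60, False), (120, True), (180, False), (240, True)]
print(count_flaps(SAMPLES, window_s=300))
```

Polling can miss transitions shorter than the polling interval, so a count of zero does not prove the session was stable.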

Route Limits

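The figures below are indicative and vary by project; confirm current values on the Cloud Router quotas and limits page.

Per Cloud Router:
  - Max BGP peers: 100
  - Max learned routes: 10,000
  - Max advertised routes: 10,000

Per VPC:
  - Max dynamic routes: 10,000 (combined from all routers)
  - Plus max 500 static routes

Quota increases: Request through the Quotas page or contact Google Cloud support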

If approaching limits:
  - Use route summarization (advertise 10.0.0.0/8 instead of individual subnets)
  - Split into multiple VPCs
  - Use VPC Peering instead of routing
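
Route summarization can be computed rather than done by hand. This sketch uses Python's standard `ipaddress.collapse_addresses` to merge contiguous prefixes losslessly; `summarize` is a hypothetical wrapper, and unlike advertising a broad range such as 10.0.0.0/8, it never covers addresses that were not in the input.

```python
import ipaddress


def summarize(prefixes):
    """Merge contiguous or overlapping prefixes into the smallest equivalent set."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Four adjacent /24 subnets collapse into one /22.
print(summarize(["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]))

# These two are adjacent but not aligned on a /23 boundary, so they stay separate.
print(summarize(["10.0.1.0/24", "10.0.2.0/24"]))
```

The second case shows why subnet allocation matters: summarization only helps when ranges are allocated on aligned boundaries.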

Monitoring BGP

bash
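# Monitor route changes (Cloud Router logs use the gce_router resource type):
gcloud logging read \
  'resource.type="gce_router" AND resource.labels.region="us-central1"' \
  --limit=50 \
  --format=json

# List BGP peer status:
gcloud compute routers get-status router-us-central1 \
  --region=us-central1 \
  --flatten="result.bgpPeerStatus[]" \
  --format="table(result.bgpPeerStatus.name, result.bgpPeerStatus.status, result.bgpPeerStatus.state)"

# Monitor specific peer:
watch -n 5 'gcloud compute routers get-status router-us-central1 \
  --region=us-central1 \
  --flatten="result.bgpPeerStatus[]" \
  --filter="result.bgpPeerStatus.name=bgp-site-a" \
  --format="table(result.bgpPeerStatus.name, result.bgpPeerStatus.status, result.bgpPeerStatus.uptime, result.bgpPeerStatus.numLearnedRoutes)"'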

Conclusion

Cloud Router provides enterprise-grade routing automation:

Advantages:

  • Automatic failover via BGP
  • Bi-directional routing
  • Multi-site connectivity
  • Route filtering via policies

Disadvantages:

  • Extra complexity (BGP configuration)
  • Potential for asymmetric routing (multi-region)
  • BGP flapping can destabilize network

Best for: Production environments with on-premises connectivity requiring automatic failover

Not needed: Simple single-tunnel Classic VPN scenarios with static routing (HA VPN always uses Cloud Router)

For large-scale multi-site deployments, Cloud Router is essential.