
Network Service Tiers: Practical Datapath Implications

Why it matters in production

You already know: Premium = 99.99% SLA, Standard = 99.9% SLA. But the way GCP achieves each SLA differs in the datapath:

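  • Premium: Traffic rides Google's private backbone between the edge PoP nearest the user and the region, with redundant paths
  • Standard: Traffic crosses the public internet (ISP networks) to reach the region; best-effort, shared capacity, less resilience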

Understanding how GCP implements these tiers helps you:

  • Set realistic expectations
  • Architect SLA-compliant systems
  • Avoid false assumptions about tier differences

Internal Model: Datapath Processing per Tier

Premium Tier Datapath

┌─────────────────────────────────────────────┐
│ User sends packet to Premium Tier resource  │
│ (Global Load Balancer, Premium IP)          │
└────────┬────────────────────────────────────┘
         │
         ▼
    ┌──────────────────────┐
    │ PoP (nearest user)   │
    │ DDoS scrubbing       │
    │ SSL/TLS termination  │
    └──────────┬───────────┘
               │
               ▼ (Enter GCP backbone)
         ┌──────────────────────────┐
         │ Premium Tier Backbone    │
         │ - Dedicated capacity     │
         │ - ECMP (multiple paths)  │
         │ - Reroute on congestion  │
         │ - <0.01% packet loss SLA │
         └──────────┬───────────────┘
                    │
                    ▼
         ┌──────────────────────────┐
         │ GCP Region (destination) │
         │ - Premium ingress        │
         │ - Priority queuing       │
         │ - Backup paths available │
         └──────────┬───────────────┘
                    │
                    ▼
         ┌──────────────────────────┐
         │ VM/Service               │
         │ (receives packet)        │
         └──────────────────────────┘

Key characteristics:
├─ Ingress: Enters Google's backbone at the PoP nearest the user
├─ Routing: ECMP (equal-cost multi-path)
├─ Backup: Automatic failover if path degraded
├─ Egress: Cold potato (stays on Google's backbone until the PoP nearest the user)
├─ SLA: 99.99% availability = <52.6 min downtime/year
└─ Monitoring: Constant health checks on all paths

Standard Tier Datapath

┌──────────────────────────────────────────┐
│ User sends packet to Standard Tier IP    │
│ (Regional resource only)                 │
└────────┬─────────────────────────────────┘
         │
         ▼ (via public internet)
    ┌──────────────────────────┐
    │ ISP public internet path │
    │ - Best effort            │
    │ - Single preferred path  │
    │ - No ECMP                │
    │ - Congestion possible    │
    └──────────┬───────────────┘
               │
               ▼ (Enter Google network near the region)
    ┌──────────────────────┐
    │ Regional PoP         │
    │ Basic DDoS           │
    │ (no priority queue)  │
    └──────────┬───────────┘
               │
               ▼
         ┌──────────────────────────┐
         │ GCP Region (same region) │
         │ - Standard ingress       │
         │ - Basic queuing (FIFO)   │
         │ - Limited backup paths   │
         └──────────┬───────────────┘
                    │
                    ▼
         ┌──────────────────────────┐
         │ VM/Service               │
         │ (receives packet)        │
         └──────────────────────────┘

Key characteristics:
├─ Ingress: User to regional PoP via ISP networks (not guaranteed)
├─ Routing: Single path (no ECMP)
├─ Backup: No Google-managed reroute on the internet leg; recovery depends on ISP routing
├─ Egress: Hot potato (exit early from origin region)
├─ SLA: 99.9% availability = 8.76 hours downtime/year
└─ Monitoring: Basic only, less comprehensive
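
The downtime figures quoted for both tiers follow directly from the SLA percentages. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_budget` is illustrative, not part of any GCP tooling):

```bash
# Convert an availability percentage into a yearly downtime budget.
# Assumes a 365-day year (8,760 hours); the function name is illustrative.
downtime_budget() {
  awk -v sla="$1" 'BEGIN {
    hours = (100 - sla) / 100 * 8760
    printf "%.2f hours (%.1f minutes) per year\n", hours, hours * 60
  }'
}

downtime_budget 99.99   # Premium Tier figure used above
downtime_budget 99.9    # Standard Tier figure used above
```

The first call reproduces the roughly 52.6 minutes quoted for Premium, the second the 8.76 hours quoted for Standard.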

Queue Handling & Priority

Premium Tier Queuing

Ingress queue at PoP (Premium):
┌──────────────────────────┐
│ Incoming packets         │
│ (mix of Premium/Standard)│
└──────────┬───────────────┘
           │
    ┌──────▼──────┐
    │ Classifier  │
    │ (check tier)│
    └──────┬──────┘
           │
     ┌─────┴────────┐
     │              │
  ┌──▼──┐        ┌──▼──┐
  │Prem │        │Std  │
  │High │        │Low  │
  │Pri  │        │Pri  │
  └──┬──┘        └──┬──┘
     │              │
  ┌──▼──────────────▼──┐
  │ GCP Backbone/Egress│
  │ (Premium first)    │
  └────────────────────┘

During congestion:
├─ Premium packets: <1% dropped
├─ Standard packets: 2-5% dropped
└─ Result: Premium gets priority

Standard Tier Queuing (FIFO)

Ingress queue at PoP (Standard):
┌──────────────────────────┐
│ Incoming packets (FIFO)  │
│ (Standard only)          │
└──────────┬───────────────┘
           │
  ┌────────▼────────┐
  │ Regional egress │
  │ (first-come)    │
  │ (no priority)   │
  └────────┬────────┘
           │
  ┌────────▼────────┐
  │ Public internet │
  │ (ISP routes)    │
  └─────────────────┘

During congestion:
├─ All packets: Same drop rate
├─ Loss: 0.1-5% depending on ISP
└─ Result: No differentiation
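
To make the difference between the two queuing models concrete, here is a rough sketch that compares per-class drop rates on one congested link. It assumes strict priority for the Premium case, which is a simplification (the drop percentages quoted above suggest something softer), and all packet counts are made up. Both class counts must be non-zero.

```bash
# Compare per-class drop rates for strict-priority vs FIFO queuing on one
# congested link. Inputs are packets per interval; all values are made up.
drop_rates() {
  awk -v cap="$1" -v prem="$2" -v std="$3" 'BEGIN {
    # Strict priority: Premium is served first, Standard gets what is left.
    prem_served = (prem < cap) ? prem : cap
    left        = cap - prem_served
    std_served  = (std < left) ? std : left
    printf "priority: premium %.1f%% dropped, standard %.1f%% dropped\n",
           100 * (prem - prem_served) / prem, 100 * (std - std_served) / std

    # FIFO: both classes share the same drop probability.
    total = prem + std
    fifo  = (total > cap) ? 100 * (total - cap) / total : 0
    printf "fifo:     all traffic %.1f%% dropped\n", fifo
  }'
}

# Link carries 1000 packets; 600 Premium and 500 Standard are offered.
drop_rates 1000 600 500
```

With these inputs the priority model drops only Standard packets (20.0%), while FIFO spreads a 9.1% loss across everything.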

SLA Implementation: How Google Achieves It

Premium Tier SLA Mechanics (99.99%)

To achieve 99.99% uptime (52.6 min downtime allowed/year):

1. Redundancy:
   ├─ Multiple paths from origin to destination
   ├─ Each path monitored independently
   ├─ Failure of single path: Traffic rerouted (<1 second)
   └─ Needs: At least 2 diverse paths per region pair

2. Health checking:
   ├─ Probes sent every 5 seconds
   ├─ Detect failures in <5 seconds
   ├─ Trigger reroute within 10 seconds
   └─ Result: <15 second max outage per failure

3. Capacity planning:
   ├─ Design for n+1 redundancy
   ├─ Peak utilization: 80% of total capacity (20% headroom)
   ├─ Can absorb one path failure + maintain SLA
   └─ Monitoring: Real-time utilization

4. Availability calculation:
   ├─ Downtime: Time unable to reach region
   ├─ Partial degradation: Doesn't count if <0.01% packet loss
   ├─ Measurement: Continuous synthetic tests
   └─ Reporting: Published monthly
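
The health-checking numbers above can be turned into a simple budget: how long one failure lasts in the worst case, and how many such failures a 99.99% year can absorb. A small sketch using only the timings quoted in the list (the function name is hypothetical):

```bash
# Worst-case outage per path failure and how many of them fit in a
# yearly downtime budget. Arguments: detection s, reroute s, budget s.
failure_budget() {
  detect_s="$1"; reroute_s="$2"; budget_s="$3"
  outage_s=$((detect_s + reroute_s))
  echo "worst-case outage per failure: ${outage_s}s"
  echo "failures that fit in the budget: $((budget_s / outage_s))"
}

# 5 s detection + 10 s reroute; 3153 s is 0.01% of a 365-day year.
failure_budget 5 10 3153
```

Under these assumptions, about 210 worst-case path failures per year would still fit inside the 99.99% budget.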

Standard Tier SLA Mechanics (99.9%)

To achieve 99.9% uptime (8.76 hours downtime allowed/year):

1. Single path reliance:
   ├─ Primary path: ISP-dependent
   ├─ Failure: May require manual intervention
   ├─ Recovery time: Hours possible
   └─ Trade-off: Cost savings justify less redundancy

2. Basic health checking:
   ├─ Probes less frequent
   ├─ Detection: 30-60 seconds
   ├─ Reroute: Manual or very slow automatic
   └─ Result: Possible temporary loss of connectivity

3. Capacity planning:
   ├─ Design for n redundancy (not n+1)
   ├─ Peak utilization: Can reach 90-95% of total capacity
   ├─ Risk: Congestion during peak
   └─ Mitigation: Monitor, advise scaling

4. Availability calculation:
   ├─ Downtime: Complete loss of connectivity
   ├─ Degradation: Tolerate 0.1-1% packet loss
   ├─ Measurement: Periodic tests (not continuous)
   └─ Reporting: Published quarterly
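
Whether a path failure can be absorbed depends on both the utilization figure and the number of parallel paths. The sketch below works through the capacity-planning numbers from the two lists above, assuming load is spread evenly over equal paths (a simplification of real ECMP hashing); the path counts are illustrative.

```bash
# Utilization on the surviving paths after one of N equal paths fails,
# assuming load is spread evenly (a simplification of real ECMP hashing).
after_failure() {
  awk -v paths="$1" -v util="$2" 'BEGIN {
    if (paths < 2) { print "no surviving path"; exit }
    new_util = util * paths / (paths - 1)
    printf "%d paths at %d%% -> %.1f%% after one failure (%s)\n",
           paths, util, new_util, (new_util <= 100) ? "absorbed" : "overloaded"
  }'
}

after_failure 2 80    # two paths at the 80% Premium figure
after_failure 5 80    # five paths at the same utilization
after_failure 5 90    # the low end of the 90-95% Standard figure
```

Under this model, 80% utilization only survives a single path failure once there are at least five paths, and 90% does not survive it with five.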

Common Real-world Differences

Scenario | Premium | Standard
---------|---------|---------
Region down (maintenance) | Automatic failover to other region | Manual intervention or regional outage
ISP route degradation | Reroute via backup path (seconds) | Wait for ISP fix (hours)
Peak traffic spike | Absorbed via ECMP load balancing | May cause congestion, latency increase
PoP failure | Traffic automatically uses alternate PoP | Users in that area affected (ISP dependent)
DDoS attack | Scrubbed at PoP with priority to legit traffic | Best effort, legitimate traffic may be dropped
BGP route flap | Rapid convergence, minimal impact | May experience longer outages

Production Architecture Patterns

Pattern 1: Mixed Tier Deployment (Hybrid)

Architecture: Use both tiers strategically (tiers apply only to traffic that uses external IPs; VPC-internal traffic is unaffected)
├─ Customer-facing APIs: Premium tier (SLA-critical)
├─ Internal batch jobs: Standard tier (cost-optimized)
├─ Data analytics pipeline: Standard tier
└─ Real-time notifications: Premium tier

Cost optimization:
├─ Premium: 20% of traffic (customer-facing)
├─ Standard: 80% of traffic (internal)
├─ Cost reduction: up to ~50% vs all-Premium (depends on egress volume and per-region pricing)
└─ SLA still met for critical paths
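
How much a mixed deployment saves depends on the price gap between tiers as well as the traffic split. A minimal sketch of the blended-cost arithmetic follows; the per-GB prices are placeholders, not real GCP list prices, so substitute values from the current pricing page.

```bash
# Blended egress cost for a mixed-tier deployment versus all-Premium.
# Prices are placeholders, NOT real GCP list prices; substitute your own.
blended_saving() {
  awk -v prem_share="$1" -v prem_price="$2" -v std_price="$3" 'BEGIN {
    std_share = 1 - prem_share
    blended   = prem_share * prem_price + std_share * std_price
    printf "blended %.4f per GB, saving %.1f%% vs all-Premium\n",
           blended, 100 * (1 - blended / prem_price)
  }'
}

blended_saving 0.20 0.12 0.085   # 20% Premium / 80% Standard split
```

With these placeholder prices the 20/80 split saves about 23%, so the saving you actually see is driven mainly by the Standard-to-Premium price ratio.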

Implementation:
├─ Create separate load balancers per tier
├─ Route customer traffic to Premium
├─ Route internal traffic to Standard
└─ DNS/application logic chooses tier
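
As a starting point for the implementation steps above, the sketch below reserves one external IP per tier. It prints the commands by default instead of running them (set `DRY_RUN=0` to execute). The address names and the region are hypothetical. Global addresses are always Premium, so only the regional one sets `--network-tier`.

```bash
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Premium: global address for the customer-facing load balancer.
run gcloud compute addresses create api-premium-ip --global

# Standard: regional address for the cost-optimized load balancer.
run gcloud compute addresses create batch-standard-ip \
  --region=asia-southeast1 --network-tier=STANDARD
```

Each load balancer's forwarding rule then references the address of the matching tier.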

Pattern 2: Gradual Failover (Premium Tier)

Primary region: asia-southeast1 (Premium)
Secondary region: us-central1 (Premium)
Tertiary region: europe-west1 (Premium)

Traffic distribution:
├─ Normal: 100% to asia-southeast1 (Premium SLA maintained)
├─ asia-southeast1 down: 100% to us-central1 (Premium SLA maintained)
└─ us-central1 also down: 100% to europe-west1 (Premium SLA maintained)

Result:
└─ Any single region failure: SLA maintained via automatic failover

Cost: Premium tier globally (expensive but mandatory for SLA)
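
The failover order in this pattern amounts to "first healthy region in a fixed list". A toy sketch of that selection logic is shown below. In practice a global load balancer makes this decision from its own health checks, so the function and its input format are purely illustrative. The third region is written with its GCP name, europe-west1.

```bash
# Pick the first healthy region from an ordered failover list.
# Argument: space-separated list of regions that are currently DOWN.
pick_region() {
  down=" $1 "
  for region in asia-southeast1 us-central1 europe-west1; do
    case "$down" in
      *" $region "*) continue ;;          # region is down, try the next one
      *) echo "$region"; return 0 ;;
    esac
  done
  echo "none"; return 1
}

pick_region ""                              # all healthy
pick_region "asia-southeast1"               # primary down
pick_region "asia-southeast1 us-central1"   # primary and secondary down
```

The three calls print asia-southeast1, us-central1 and europe-west1 respectively.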

Common Mistakes & Anti-Patterns

Mistake 1: Assuming Standard Tier Can Handle Everything

Wrong thinking:

"Standard Tier costs less, sufficient for any workload"

Correct understanding:

  • Standard: Regional only, 99.9% SLA, single path
  • Premium: Global, 99.99% SLA, multi-path
  • If a 99.99% SLA or a global IP is required: Premium necessary
  • Cost justification: Depends on business impact of downtime

Prevention: Calculate cost of downtime for your application. Compare to Premium tier cost.
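
One way to run that comparison is sketched below. It treats the SLA downtime budgets as a worst-case exposure, which is an assumption because real downtime may be lower, and both money figures are hypothetical.

```bash
# Compare the yearly cost of the extra downtime allowed on Standard Tier
# with the extra yearly spend on Premium Tier. Money figures are hypothetical.
tier_decision() {
  awk -v cost_per_hour="$1" -v premium_extra_per_year="$2" 'BEGIN {
    # Downtime budgets implied by the SLAs (365-day year).
    std_hours  = 8760 * (100 - 99.9)  / 100
    prem_hours = 8760 * (100 - 99.99) / 100
    exposure   = (std_hours - prem_hours) * cost_per_hour
    verdict    = (exposure > premium_extra_per_year) ? "Premium pays for itself" : "Standard is cheaper overall"
    printf "extra downtime exposure on Standard: %.0f per year\n", exposure
    print verdict
  }'
}

# Downtime costs 5000 per hour; Premium costs 20000 more per year.
tier_decision 5000 20000
```

Here the extra exposure (39420 per year) exceeds the extra Premium spend, so Premium is the cheaper choice under these assumptions.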

Mistake 2: Mixing Tier Expectations

Wrong thinking:

"Standard Tier with ECMP load balancing = Premium Tier resilience"

Correct understanding:

  • Standard Tier: Single path fundamentally
  • ECMP: Multi-path distribution
  • Combining them: No, ECMP not available on Standard Tier
  • Result: Standard is still single-path

Prevention: Verify tier characteristics in GCP documentation before designing.

Mistake 3: Not Monitoring SLA Compliance

Wrong thinking:

"Bought Premium Tier, SLA automatically met"

Correct understanding:

  • Premium: Google's SLA covers the network service only
  • Your SLA: Depends on your application design
  • Example: Premium network SLA doesn't guarantee app availability
  • Must monitor: End-to-end application health

Prevention: Implement comprehensive monitoring beyond network.
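
The reason the network SLA alone is not enough: components in series multiply their availabilities, so the end-to-end figure is always lower than the best link in the chain. A short sketch, with hypothetical application and database figures:

```bash
# End-to-end availability of components in series is the product of their
# individual availabilities. The non-network figures below are hypothetical.
serial_availability() {
  awk 'BEGIN {
    total = 1
    for (i = 1; i < ARGC; i++) total *= ARGV[i] / 100
    printf "%.3f%%\n", total * 100
  }' "$@"
}

serial_availability 99.99 99.9 99.95   # network, application, database
```

Even with a 99.99% network, this chain delivers only about 99.84% end to end.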

GCP-native Implementation Guidance

Monitoring Tier-specific Metrics

bash
# Check the tier of each reserved external IP
# (Standard Tier addresses are regional, so do not restrict to --global)
gcloud compute addresses list --format='table(name, network_tier, address)'

# List VPC routes for a destination range. These describe routing inside
# your VPC only; Google's backbone paths are not exposed through gcloud.
gcloud compute routes list --filter="dest_range~YOUR_IP" \
  --format='table(dest_range, next_hop_gateway, priority)'

# Track SLA compliance
gcloud monitoring dashboards create \
  --config='{"displayName": "Network Tier SLA Monitoring", ...}'

# Create alerting on latency increases (indicator of tier degradation).
# The 100 ms threshold and GREATER comparison are defined in the policy
# file; latency-policy.json is a placeholder name.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --policy-from-file=latency-policy.json

Verifying Tier Characteristics

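bash
# Test Premium path behavior
# 1. Start ping to Premium IP
ping 35.201.123.45 &

# 2. Enable VPC Flow Logs on the subnet to record traffic during the test.
#    This does not simulate a failure; customers cannot trigger a backbone
#    path failure. my-subnet and the region are placeholders.
gcloud compute networks subnets update my-subnet \
  --region=asia-southeast1 --enable-flow-logs

# 3. Monitor: Note how long any disruption lasts if a path issue occurs

# Test Standard path behavior
ping 35.202.123.45 &
# 1. Observe: Longer recovery times on path issues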