Network Service Tiers: Practical Datapath Implications
Why this matters in production
You already know that Premium = 99.99% SLA and Standard = 99.9% SLA. But the way GCP achieves each SLA differs in the datapath:
- Premium: Packet handling prioritized, dedicated capacity, backup paths
- Standard: Best-effort, shared capacity, single path, less resilience
Understanding how GCP implements these tiers helps you:
- Set realistic expectations
- Architect SLA-compliant systems
- Avoid false assumptions about tier differences
Internal Model: Datapath Processing per Tier
Premium Tier Datapath
┌─────────────────────────────────────────────┐
│ User sends packet to Premium Tier resource  │
│ (Global Load Balancer, Premium IP)          │
└──────────┬──────────────────────────────────┘
           │
           ▼
┌──────────────────────┐
│ PoP (any tier)       │
│ DDoS scrubbing       │
│ SSL/TLS termination  │
└──────────┬───────────┘
           │
           ▼ (Enter GCP backbone)
┌──────────────────────────┐
│ Premium Tier Backbone    │
│ - Dedicated capacity     │
│ - ECMP (multiple paths)  │
│ - Reroute on congestion  │
│ - <0.01% packet loss SLA │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│ GCP Region (destination) │
│ - Premium ingress        │
│ - Priority queuing       │
│ - Backup paths available │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│ VM/Service               │
│ (receives packet)        │
└──────────────────────────┘
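The ECMP step named in the diagram can be illustrated with a small sketch. It assumes a flow is identified by its 5-tuple and hashed onto one of several equal-cost paths. The path names and the `pick_path` helper are hypothetical, and this models the general technique rather than Google's actual implementation.

```python
import hashlib

def pick_path(flow, paths):
    """Map a flow 5-tuple onto one of the equal-cost paths (hypothetical helper)."""
    if not paths:
        raise ValueError("no healthy paths available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]

# Hypothetical backbone paths between two regions.
paths = ["path-a", "path-b", "path-c"]
flow = ("203.0.113.7", 51514, "35.201.123.45", 443, "tcp")

chosen = pick_path(flow, paths)
# If the chosen path degrades, the flow is re-hashed over the remaining paths.
healthy = [p for p in paths if p != chosen]
rerouted = pick_path(flow, healthy)
print(chosen, "->", rerouted)
```

Hashing the whole 5-tuple keeps every packet of one flow on the same path, which avoids reordering while still spreading different flows across all paths.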
Key characteristics:
├─ Ingress: PoP to GCP backbone prioritized
├─ Routing: ECMP (equal-cost multi-path)
├─ Backup: Automatic failover if path degraded
├─ Egress: Premium egress point selection (cold potato)
├─ SLA: 99.99% availability = <52.6 min downtime/year
└─ Monitoring: Constant health checks on all paths
Standard Tier Datapath
┌──────────────────────────────────────────┐
│ User sends packet to Standard Tier IP    │
│ (Regional resource only)                 │
└──────────┬───────────────────────────────┘
           │
           ▼
┌──────────────────────┐
│ Regional PoP         │
│ Basic DDoS           │
│ (no priority queue)  │
└──────────┬───────────┘
           │
           ▼ (via public internet)
┌──────────────────────────┐
│ ISP public internet path │
│ - Best effort            │
│ - Single preferred path  │
│ - No ECMP                │
│ - Congestion possible    │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│ GCP Region (same region) │
│ - Standard ingress       │
│ - Basic queuing (FIFO)   │
│ - Limited backup paths   │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│ VM/Service               │
│ (receives packet)        │
└──────────────────────────┘
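The two datapaths differ mainly in where traffic crosses between Google's network and the public internet: cold potato for Premium, hot potato for Standard. The sketch below is a minimal illustration, assuming a made-up list of PoPs with illustrative distances. None of the names or numbers come from GCP.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PoP:
    name: str
    km_to_user: int     # illustrative distance from the PoP to the end user
    km_to_region: int   # illustrative distance from the PoP to the VM's region

def egress_pop(pops, tier):
    """Pick the PoP where return traffic leaves Google's network."""
    if tier == "PREMIUM":
        # Cold potato: stay on the backbone, hand off close to the user.
        return min(pops, key=lambda p: p.km_to_user)
    if tier == "STANDARD":
        # Hot potato: hand off to an ISP close to the origin region.
        return min(pops, key=lambda p: p.km_to_region)
    raise ValueError(f"unknown tier: {tier}")

pops = [
    PoP("pop-near-user", km_to_user=50, km_to_region=9000),
    PoP("pop-near-region", km_to_user=9000, km_to_region=30),
]
print(egress_pop(pops, "PREMIUM").name)   # pop-near-user
print(egress_pop(pops, "STANDARD").name)  # pop-near-region
```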
Key characteristics:
├─ Ingress: PoP to region via ISP (not guaranteed)
├─ Routing: Single path (no ECMP)
├─ Backup: Manual intervention required if degraded
├─ Egress: Hot potato (exit early from origin region)
├─ SLA: 99.9% availability = 8.76 hours downtime/year
└─ Monitoring: Basic only, less comprehensive
Queue Handling & Priority
Premium Tier Queuing
Ingress queue at PoP (Premium):
┌───────────────────────────┐
│ Incoming packets          │
│ (mix of Premium/Standard) │
└─────────────┬─────────────┘
              │
       ┌──────▼──────┐
       │ Classifier  │
       │ (check tier)│
       └──────┬──────┘
              │
        ┌─────┴─────┐
        │           │
     ┌──▼──┐     ┌──▼──┐
     │Prem │     │Std  │
     │High │     │Low  │
     │Pri  │     │Pri  │
     └──┬──┘     └──┬──┘
        │           │
   ┌────▼───────────▼────┐
   │ GCP Backbone/Egress │
   │ (Premium first)     │
   └─────────────────────┘
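The effect of the classifier can be compared against a single shared queue with a small model. This is a sketch of the queuing model described in this section, not of Google's implementation. It assumes no buffering (anything above link capacity in an interval is dropped) and, for the shared queue, that drops are split in proportion to offered load. Both function names are illustrative.

```python
def drops_strict_priority(premium, standard, capacity):
    """Packets dropped per interval when Premium is always served first."""
    premium_sent = min(premium, capacity)
    standard_sent = min(standard, capacity - premium_sent)
    return premium - premium_sent, standard - standard_sent

def drops_fifo(premium, standard, capacity):
    """Packets dropped per interval when both classes share one queue.

    Assumes arrivals are evenly interleaved, so drops are split in
    proportion to each class's share of the offered load.
    """
    offered = premium + standard
    excess = max(0, offered - capacity)
    if offered == 0:
        return 0.0, 0.0
    return excess * premium / offered, excess * standard / offered

# 60 Premium + 60 Standard packets offered to a link that carries 100.
print(drops_strict_priority(60, 60, 100))  # (0, 20)
print(drops_fifo(60, 60, 100))             # (10.0, 10.0)
```

With strict priority the whole shortfall lands on the low-priority class, while a shared FIFO queue spreads it across both.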
During congestion:
├─ Premium packets: <1% dropped
├─ Standard packets: 2-5% dropped
└─ Result: Premium gets priority
Standard Tier Queuing (FIFO)
Ingress queue at PoP (Standard):
┌──────────────────────────┐
│ Incoming packets (FIFO)  │
│ (Standard only)          │
└────────────┬─────────────┘
             │
    ┌────────▼────────┐
    │ Regional egress │
    │ (first-come)    │
    │ (no priority)   │
    └────────┬────────┘
             │
    ┌────────▼────────┐
    │ Public internet │
    │ (ISP routes)    │
    └─────────────────┘
During congestion:
├─ All packets: Same drop rate
├─ Loss: 0.1-5% depending on ISP
└─ Result: No differentiation
SLA Implementation: How Google Achieves It
Premium Tier SLA Mechanics (99.99%)
To achieve 99.99% uptime (52.6 min downtime allowed/year):
1. Redundancy:
├─ Multiple paths from origin to destination
├─ Each path monitored independently
├─ Failure of single path: Traffic rerouted automatically (<1 second once the failure is detected)
└─ Needs: At least 2 diverse paths per region pair
2. Health checking:
├─ Probes sent every 5 seconds
├─ Detect failures in <5 seconds
├─ Trigger reroute within 10 seconds
└─ Result: <15 second max outage per failure
3. Capacity planning:
├─ Design for n+1 redundancy
├─ Peak capacity: 80% of total (20% headroom)
├─ Can absorb one path failure + maintain SLA
└─ Monitoring: Real-time utilization
4. Availability calculation:
├─ Downtime: Time unable to reach region
├─ Partial degradation: Doesn't count if <0.01% packet loss
├─ Measurement: Continuous synthetic tests
└─ Reporting: Published monthly
Standard Tier SLA Mechanics (99.9%)
To achieve 99.9% uptime (8.76 hours downtime allowed/year):
1. Single path reliance:
├─ Primary path: ISP-dependent
├─ Failure: May require manual intervention
├─ Recovery time: Hours possible
└─ Trade-off: Cost savings justify less redundancy
2. Basic health checking:
├─ Probes less frequent
├─ Detection: 30-60 seconds
├─ Reroute: Manual or very slow automatic
└─ Result: Possible temporary loss of connectivity
3. Capacity planning:
├─ Design for n capacity with no spare path (not n+1)
├─ Peak capacity: Can reach 90-95%
├─ Risk: Congestion during peak
└─ Mitigation: Monitor, advise scaling
4. Availability calculation:
├─ Downtime: Complete loss of connectivity
├─ Degradation: Tolerate 0.1-1% packet loss
├─ Measurement: Periodic tests (not continuous)
└─ Reporting: Published quarterly
Common Real-world Differences
| Scenario | Premium | Standard |
|---|---|---|
| Region down (maintenance) | Automatic failover to another region (behind a global load balancer) | Manual intervention or regional outage |
| ISP route degradation | Reroute via backup path (seconds) | Wait for ISP fix (hours) |
| Peak traffic spike | Absorbed via ECMP load balancing | May cause congestion, latency increase |
| PoP failure | Traffic automatically uses alternate PoP | Users in that area affected (ISP dependent) |
| DDoS attack | Scrubbed at PoP with priority to legit traffic | Best effort, legitimate traffic may be dropped |
| BGP route flap | Rapid convergence, minimal impact | May experience longer outages |
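The availability figures and failover timings quoted in the SLA mechanics can be checked with a little arithmetic. The sketch below assumes a 365-day year and uses the 15-second worst-case outage per failure stated for Premium. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def downtime_budget_minutes(availability):
    """Allowed downtime per year, in minutes, for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR * 60

premium = downtime_budget_minutes(0.9999)
standard = downtime_budget_minutes(0.999)
print(f"Premium:  {premium:.1f} min/year")      # 52.6
print(f"Standard: {standard / 60:.2f} h/year")  # 8.76

# How many worst-case 15 s path failures fit inside the Premium budget?
print(int(premium * 60 // 15))                  # 210
```

Under these assumptions, the Premium budget tolerates roughly 210 worst-case path failures a year, whereas a single multi-hour ISP incident can use up most of the Standard budget.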
Production Architecture Patterns
Pattern 1: Mixed Tier Deployment (Hybrid)
Architecture: Use both tiers strategically
├─ Customer-facing APIs: Premium tier (SLA-critical)
├─ Internal batch jobs: Standard tier (cost-optimized)
├─ Data analytics pipeline: Standard tier
└─ Real-time notifications: Premium tier
Cost optimization:
├─ Premium: 20% of traffic (customer-facing)
├─ Standard: 80% of traffic (internal)
├─ Cost reduction: ~50% vs all-Premium (an estimate; the real figure depends on per-GB pricing for each tier)
└─ SLA still met for critical paths
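The saving from a traffic split follows directly from how much cheaper Standard is per GB. The sketch below assumes cost is proportional to traffic volume and ignores per-region price differences. The 30% discount is a hypothetical input, not a GCP price.

```python
def blended_saving(premium_share, standard_discount):
    """Fractional cost reduction versus sending all traffic over Premium.

    premium_share: fraction of traffic kept on Premium (0..1)
    standard_discount: how much cheaper Standard is per GB (0..1)
    """
    standard_share = 1 - premium_share
    return standard_share * standard_discount

def discount_needed(premium_share, target_saving):
    """Standard discount required to reach a target overall saving."""
    return target_saving / (1 - premium_share)

# 20% Premium / 80% Standard, with a hypothetical 30% Standard discount.
print(f"{blended_saving(0.2, 0.30):.0%}")   # 24%
# Discount needed for a 50% overall saving at the same split.
print(f"{discount_needed(0.2, 0.50):.1%}")  # 62.5%
```

Plugging in the current per-GB prices for your regions shows whether the expected saving is realistic for your traffic mix.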
Implementation:
├─ Create separate load balancers per tier
├─ Route customer traffic to Premium
├─ Route internal traffic to Standard
└─ DNS/application logic chooses tier
Pattern 2: Gradual Failover (Premium Tier)
Primary datacenter: asia-southeast1 (Premium)
Secondary datacenter: us-central1 (Premium)
Tertiary datacenter: eu-west1 (Premium)
Traffic distribution:
├─ Normal: 100% to asia-southeast1 (Premium SLA maintained)
├─ asia-southeast1 down: 100% to us-central1 (Premium SLA maintained)
└─ asia-southeast1 and us-central1 down: 100% to eu-west1 (Premium SLA maintained)
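The traffic distribution above amounts to "send everything to the first healthy region in priority order". A minimal sketch of that selection logic follows. In practice a global load balancer makes this decision from its own health checks, and `active_region` is only an illustrative helper.

```python
# Failover order taken from the pattern above.
REGION_PRIORITY = ["asia-southeast1", "us-central1", "eu-west1"]

def active_region(healthy):
    """Return the highest-priority healthy region, or None if all are down."""
    for region in REGION_PRIORITY:
        if region in healthy:
            return region
    return None

print(active_region({"asia-southeast1", "us-central1", "eu-west1"}))  # asia-southeast1
print(active_region({"us-central1", "eu-west1"}))                     # us-central1
print(active_region({"eu-west1"}))                                    # eu-west1
```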
Result:
└─ Any single region failure: SLA maintained via automatic failover
Cost: Premium tier globally (expensive but mandatory for SLA)
Common Mistakes & Anti-Patterns
Mistake 1: Assuming Standard Tier Can Handle Everything
❌ Wrong thinking:
"Standard Tier costs less, sufficient for any workload"
✅ Correct understanding:
- Standard: Regional only, 99.9% SLA, single path
- Premium: Global, 99.99% SLA, multi-path
- If a 99.99% SLA or global reach is required: Premium is necessary
- Cost justification: Depends on business impact of downtime
Prevention: Calculate cost of downtime for your application. Compare to Premium tier cost.
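That comparison can be sketched as a break-even check. The example assumes each tier delivers exactly its SLA availability over a 365-day year, which is a simplification, and every dollar figure is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def expected_downtime_cost(availability, cost_per_hour):
    """Yearly cost of downtime if the service just meets its availability target."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_hour

def premium_is_justified(cost_per_hour, extra_premium_cost_per_year):
    """True when the downtime avoided by Premium is worth more than its extra cost."""
    avoided = (expected_downtime_cost(0.999, cost_per_hour)
               - expected_downtime_cost(0.9999, cost_per_hour))
    return avoided > extra_premium_cost_per_year

# Hypothetical figures: downtime costs $10,000/hour, Premium costs $20,000/year more.
print(premium_is_justified(10_000, 20_000))  # True  (about $78,840 of downtime avoided)
print(premium_is_justified(500, 20_000))     # False (about $3,942 avoided)
```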
Mistake 2: Mixing Tier Expectations
❌ Wrong thinking:
"Standard Tier with ECMP load balancing = Premium Tier resilience"
✅ Correct understanding:
- Standard Tier: Single path fundamentally
- ECMP: Multi-path distribution
- Combining them: Not possible, because ECMP is not available on Standard Tier
- Result: Standard is still single-path
Prevention: Verify tier characteristics in GCP documentation before designing.
Mistake 3: Not Monitoring SLA Compliance
❌ Wrong thinking:
"Bought Premium Tier, SLA automatically met"
✅ Correct understanding:
- Premium: The SLA is GCP's commitment for the network tier only
- Your SLA: Depends on your application design
- Example: Premium network SLA doesn't guarantee app availability
- Must monitor: End-to-end application health
Prevention: Implement comprehensive monitoring beyond network.
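One way to see why the network SLA alone is not enough: components that must all be up multiply their availabilities. The sketch below assumes independent failures, and the component figures are hypothetical.

```python
from math import prod

def end_to_end_availability(*components):
    """Availability of a request path whose components must all be up."""
    return prod(components)

# Hypothetical stack: Premium network (99.99%), load balancer (99.99%),
# and an application that is up 99.9% of the time.
total = end_to_end_availability(0.9999, 0.9999, 0.999)
print(f"{total:.4%}")  # 99.8800%
```

The end-to-end figure is always lower than the weakest component, so the application layer usually dominates.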
GCP-native Implementation Guidance
Monitoring Tier-specific Metrics
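```bash
# List reserved addresses with their network tier.
# Standard Tier addresses are regional, so do not restrict the listing to --global.
gcloud compute addresses list --format='table(name, networkTier, address)'

# List VPC routes matching a destination range.
# This shows routes in your VPC only; Google's backbone paths are not visible here.
gcloud compute routes list --filter="destRange~YOUR_IP" \
  --format='table(destRange, nextHopGateway, priority)'

# Track SLA compliance
gcloud monitoring dashboards create \
  --config='{"displayName": "Network Tier SLA Monitoring", ...}'

# Create an alert on latency increases (an indicator of tier degradation).
# latency-policy.json is a placeholder file holding the alert policy definition,
# for example a threshold condition of latency greater than 100 ms.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --policy-from-file=latency-policy.json
```
Verifying Tier Characteristics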
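```bash
# Observe Premium path behavior
# 1. Start a ping to the Premium IP
ping 35.201.123.45 &
# 2. Enable VPC flow logs on the subnet to capture traffic during the test.
#    This only adds visibility; it does not simulate a network failure.
#    my-subnet and the region are placeholders.
gcloud compute networks subnets update my-subnet \
  --region=asia-southeast1 --enable-flow-logs
# 3. Monitor: expect only a brief disruption (<1 second) if a path fails over

# Observe Standard path behavior
ping 35.202.123.45 &
# 1. Observe: longer recovery times on path issues
```
References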
- Network Service Tiers Comparison — Official comparison
- SLA Details and Availability — Formal SLA documentation
- Reliability Best Practices — Design for SLA
- Network Monitoring & Alerting — Monitor tier health
Next: Bandwidth Allocation & Egress Pricing — How GCP manages capacity per zone