Subnet Design & CIDR Planning — Masterclass at Scale
Executive Summary
IP address planning sounds boring, but it is a critical infrastructure decision that can:
- ❌ Strand your entire application (IP exhaustion = cannot scale)
- ✅ Prevent security incidents (non-overlapping ranges keep firewall rules and peering unambiguous)
- ✅ Simplify multi-region operations (global IP uniqueness)
Subnet: More Than Just CIDR
A subnet in GCP is a regional network segment with:
- Primary CIDR range: VM primary IP addresses
- Secondary ranges: GKE pods, services, future use-cases
- Region: us-west1, europe-west1, etc.
- Availability: all zones in the region (us-west1-a, us-west1-b, us-west1-c)
Subnet "production"
├── Region: us-west1 (applies to us-west1-a, b, c)
├── Primary CIDR: 10.0.0.0/20 (4096 IPs)
│ ├── 10.0.0.0: Network (reserved)
│ ├── 10.0.0.1: Gateway (reserved)
│ ├── 10.0.0.2 - 10.0.15.253: Usable (4092 IPs)
│ ├── 10.0.15.254: Reserved by Google Cloud
│ └── 10.0.15.255: Broadcast (reserved)
├── Secondary "gke-pods": 10.2.0.0/16 (65536 IPs)
└── Secondary "gke-services": 10.3.0.0/16 (65536 IPs)
Understanding CIDR Allocation
Quick Reference: IP Count by Prefix
| CIDR | Typical use | Total IPs | Usable (total - 2) | Notes |
|---|---|---|---|---|
| /32 | Single host | 1 | 1 | One IP |
| /29 | Minimum subnet | 8 | 6 | Rarely used |
| /28 | Small pod | 16 | 14 | Dev environments |
| /27 | Small cluster | 32 | 30 | ~20 pods |
| /26 | Subnet | 64 | 62 | ~50 pods |
| /25 | Medium subnet | 128 | 126 | ~100 pods |
| /24 | Standard subnet | 256 | 254 | ~200 pods |
| /23 | Large subnet | 512 | 510 | ~400 pods |
| /22 | Very large | 1024 | 1022 | ~800 pods |
| /21 | Huge subnet | 2048 | 2046 | ~1600 pods |
| /20 | Default GCP | 4096 | 4094 | ~3200 pods |
| /16 | Secondary range | 65536 | 65534 | 50K+ pods |
| /12 | Region block | 1M | ~1M | 800K+ pods |
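Formula: Usable IPs = 2^(32 - prefix) - 2 (exclude network + broadcast). In a GCP primary range, subtract 4 instead: GCP also reserves the gateway and the second-to-last address.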
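The formula is easy to check in a few lines. The sketch below is illustrative only: `ip_counts` and its `reserved` parameter are names made up here. `reserved=2` matches the classic convention used in the table, and `reserved=4` is an assumption that matches how GCP primary ranges are usually described.

```python
import ipaddress


def ip_counts(prefix: int, reserved: int = 2) -> tuple[int, int]:
    """Return (total, usable) IPv4 address counts for a prefix length.

    reserved=2 is the classic network + broadcast convention;
    pass reserved=4 for a GCP primary range (assumption of this sketch).
    """
    total = 2 ** (32 - prefix)
    usable = max(total - reserved, 0)
    return total, usable


# Cross-check the arithmetic against the standard library.
assert ipaddress.ip_network("10.0.0.0/20").num_addresses == ip_counts(20)[0]

for prefix in (29, 24, 20, 16):
    total, usable = ip_counts(prefix)
    print(f"/{prefix}: total={total}, usable={usable}")
```

Calling `ip_counts(20, reserved=4)` gives the figure to plan with for a /20 primary range, while secondary ranges can be sized from the total.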
Primary vs Secondary Ranges
Primary Range (Mandatory)
Subnet primary range = VM IP addresses ONLY
subnet "tier1":
primary: 10.0.0.0/20 (VMs get IPs from here)
VM creation:
gcloud compute instances create vm1 --subnet=tier1
→ Gets an address such as 10.0.1.5 from the primary range
Constraints:
- Must be unique within VPC
- Applies to all zones in the region
- Cannot be shrunk (only expanded)
- GCP reserves four addresses: first (network), second (gateway), second-to-last (reserved), last (broadcast)
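To see which addresses a primary range gives up, the reserved set can be listed directly. This is a sketch under one assumption: that GCP reserves four addresses per primary IPv4 range (network, default gateway, second-to-last, broadcast). The function names are hypothetical.

```python
import ipaddress


def gcp_reserved_addresses(cidr: str) -> dict[str, str]:
    """List the addresses assumed to be reserved in a GCP primary range."""
    net = ipaddress.ip_network(cidr)
    return {
        "network": str(net.network_address),
        "default_gateway": str(net.network_address + 1),
        "second_to_last": str(net.broadcast_address - 1),
        "broadcast": str(net.broadcast_address),
    }


def usable_span(cidr: str) -> tuple[str, str, int]:
    """Return (first usable, last usable, usable count) for a primary range."""
    net = ipaddress.ip_network(cidr)
    first = net.network_address + 2
    last = net.broadcast_address - 2
    return str(first), str(last), net.num_addresses - 4


print(gcp_reserved_addresses("10.0.0.0/20"))
print(usable_span("10.0.0.0/20"))
```

For 10.0.0.0/20 this yields a usable span of 10.0.0.2 through 10.0.15.253, which is 4092 addresses.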
Secondary Ranges (Optional but Recommended)
Subnet "tier1":
primary: 10.0.0.0/20 (VMs)
secondary-1: 10.2.0.0/16 (GKE pods)
secondary-2: 10.3.0.0/16 (GKE services)
Use cases:
- GKE pod CIDR (most important)
- GKE service CIDR
- Canary deployments (separate network)
- Future use cases (reserve now, use later)
Benefit: VMs and pods have separate CIDRs
- Firewall rules: allow traffic to pods (10.2.0.0/16) but not VMs
- Load balancer: routes traffic to pod endpoint (10.2.0.0/16)
- Monitoring: metrics distinguished by primary vs secondary
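Because each kind of workload has its own CIDR, an address alone tells you what it belongs to. A minimal sketch of that lookup, using the "tier1" ranges from the example above (the `RANGES` table and `classify` function are illustrative names, not part of any GCP tooling):

```python
import ipaddress

# Ranges taken from the "tier1" example; adjust to your own plan.
RANGES = {
    "vm (primary)": ipaddress.ip_network("10.0.0.0/20"),
    "gke-pods": ipaddress.ip_network("10.2.0.0/16"),
    "gke-services": ipaddress.ip_network("10.3.0.0/16"),
}


def classify(ip: str) -> str:
    """Return the name of the range an address falls into, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for name, net in RANGES.items():
        if addr in net:
            return name
    return "unknown"


for ip in ("10.0.1.5", "10.2.7.9", "10.3.0.10", "192.168.1.1"):
    print(ip, "->", classify(ip))
```

The same lookup is what lets a firewall rule or a monitoring label separate pod traffic from VM traffic without any extra metadata.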
Production Pattern: Multi-tier Network Design
Scenario: E-commerce Platform
VPC: prod (10.0.0.0/8)
Region: us-west1 (10.0.0.0/11 = 2M IPs available)
Subnet "lb-tier":
Primary: 10.0.0.0/24 (Load Balancer VMs)
Secondary-pods: 10.1.0.0/16 (GKE LB cluster)
Secondary-services: 10.11.0.0/17 (must not overlap the pod range)
Zones: all
Subnet "app-tier":
Primary: 10.0.1.0/24 (App Server VMs)
Secondary-pods: 10.2.0.0/16 (GKE App cluster)
Secondary-services: 10.11.128.0/17
Zones: all
Subnet "db-tier":
Primary: 10.0.2.0/24 (Database VMs - very few)
Secondary-pods: 10.3.0.0/16 (held in reserve; managed databases such as Cloud SQL run no pods here)
Zones: all (but limited)
Subnet "internal":
Primary: 10.0.3.0/24 (Internal tools, monitoring)
Secondary-shared: 10.10.0.0/16 (Shared services)
Zones: all
Region: europe-west1 (10.32.0.0/11)
Subnet "lb-tier":
Primary: 10.32.0.0/24
Secondary-pods: 10.33.0.0/16
Secondary-services: 10.43.0.0/17
[Same pattern for app, db, internal tiers]
Reserve: 10.64.0.0/10 and 10.128.0.0/9 (future regions, emergency expansion)
Firewall Strategy for Multi-tier
Firewall rules:
rule-100: allow ingress to lb-pods (10.1.0.0/16)
from 0.0.0.0/0, port 443
target: tag:public-lb
rule-200: allow ingress to app-pods (10.2.0.0/16)
from lb-pods (10.1.0.0/16), port 8080
target: tag:app
rule-300: allow ingress to db (10.0.2.0/24)
from app-pods (10.2.0.0/16), port 3306
target: tag:database
rule-999: deny all other (implicit)
Advantage: Clear traffic patterns, easy to audit
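One way to audit rules like these is to model them as data and ask which rule, if any, admits a given flow. The sketch below is a simplification: it matches only on source range, destination range, and port, and ignores target tags and priorities. `Rule` and `matching_rule` are hypothetical names.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    src: str   # allowed source range
    dst: str   # destination range the rule protects
    port: int


# The three allow rules from the listing above.
RULES = [
    Rule("rule-100", "0.0.0.0/0", "10.1.0.0/16", 443),
    Rule("rule-200", "10.1.0.0/16", "10.2.0.0/16", 8080),
    Rule("rule-300", "10.2.0.0/16", "10.0.2.0/24", 3306),
]


def matching_rule(src_ip: str, dst_ip: str, port: int) -> Optional[str]:
    """Return the name of the first allow rule matching the flow, else None."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in RULES:
        if (
            port == rule.port
            and src in ipaddress.ip_network(rule.src)
            and dst in ipaddress.ip_network(rule.dst)
        ):
            return rule.name
    return None  # falls through to the implicit deny


# LB pods may reach app pods, but not the database directly.
print(matching_rule("10.1.4.2", "10.2.9.9", 8080))
print(matching_rule("10.1.4.2", "10.0.2.10", 3306))
```

A table of expected allow/deny flows run through a check like this makes the "easy to audit" claim something you can test.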
GKE Pod CIDR Planning
GKE clusters consume secondary ranges at scale:
Single Cluster
GKE cluster "main-app" in subnet "app-tier":
Pod CIDR: 10.2.0.0/16 (secondary range)
Service CIDR: 10.11.128.0/17 (separate secondary range, no overlap with pods)
Pods: 1000 pods = ~1000 IPs consumed from 65536
Services: 100 services = ~100 IPs consumed from 32768
Remaining: 64536 pod IPs available for growth
Growth potential: up to ~65x by raw address count (less in practice: GKE reserves pod IPs per node, a /24 by default)
Multi-cluster in Same Region
Region "us-west1" subnet "app-tier":
Cluster "canary":
Pod CIDR: 10.2.0.0/24 (256 IPs)
Services: 10.2.1.0/24
Cluster "staging":
Pod CIDR: 10.2.2.0/24
Services: 10.2.3.0/24
Cluster "production":
Pod CIDR: 10.2.16.0/20 (4096 IPs)
Services: 10.2.32.0/20
Layout:
10.2.0.0/16 (reserved for clusters)
├── 10.2.0.0/24 (canary pods)
├── 10.2.1.0/24 (canary services)
├── 10.2.2.0/24 (staging pods)
├── 10.2.3.0/24 (staging services)
├── 10.2.16.0/20 (prod pods; a /20 must start on a multiple of 16 in the third octet)
├── 10.2.32.0/20 (prod services)
└── 10.2.4.0 - 10.2.15.255 and 10.2.48.0 - 10.2.255.255 (reserved for future clusters)
Design principle: Reserved secondary range >> sum of all clusters
- Sum of cluster pod ranges: one /20 + two /24s = 4608 IPs
- Secondary range: 10.2.0.0/16 = 65536 IPs
- Safety margin: ~14x (allows growth, mistakes, spikes)
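The margin is simple to compute, and the same pass can catch misaligned blocks. In the sketch below, `parse_aligned` and `safety_margin` are illustrative names, and the production pod range is written as 10.2.16.0/20 (an aligned /20), which is an assumption of this example. Parsing in strict mode rejects a block such as 10.2.4.0/20, whose start address is not on a /20 boundary.

```python
import ipaddress


def parse_aligned(cidr: str) -> ipaddress.IPv4Network:
    """Parse a CIDR block, raising ValueError if host bits are set."""
    return ipaddress.ip_network(cidr, strict=True)


def safety_margin(reserved: str, cluster_ranges: list[str]) -> float:
    """Ratio of the reserved block size to the sum of the cluster ranges."""
    block = parse_aligned(reserved)
    nets = [parse_aligned(c) for c in cluster_ranges]
    for net in nets:
        if not net.subnet_of(block):
            raise ValueError(f"{net} is outside {block}")
    used = sum(net.num_addresses for net in nets)
    return block.num_addresses / used


pod_ranges = ["10.2.0.0/24", "10.2.2.0/24", "10.2.16.0/20"]
print(round(safety_margin("10.2.0.0/16", pod_ranges), 1))

try:
    parse_aligned("10.2.4.0/20")
except ValueError as err:
    print("rejected:", err)
```

Running a check like this over every row of the plan catches alignment mistakes before any range is created.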
Real-world Pitfall: Secondary Range Exhaustion
Year 1:
App tier secondary: 10.2.0.0/16
Clusters: canary (50 pods), staging (100 pods), prod (500 pods)
Used: ~650 IPs
Available: 64886 IPs ← "plenty of room"
Year 2:
Product success → 5000 pods in production
New canary: 1000 pods
New staging: 2000 pods
Used: 8000 IPs
Available: 57536 IPs ← "still fine"
Year 3:
Multi-tier rollout: 20000 pods across regions
Per-region usage: 20000 / 4 regions = 5000 pods per region
US-west1 allocated: 5000 pods
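Problem: 10.2.0.0/16 insufficient! GKE reserves pod IPs per node (a /24 by default), so a /16 supports at most 256 nodes regardless of how many pods each node actually runs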
Fix required:
Option 1: Add new secondary range 10.4.0.0/16
Problem: existing clusters point to 10.2.0.0/16
Requires: new cluster + data migration
Option 2: Expand existing range
Impossible in GCP! Secondary ranges cannot be resized
Prevention: Reserve larger secondary ranges
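When judging how large is "large enough", counting nodes is often more telling than counting pods. The sketch below rests on an assumption about GKE VPC-native clusters: each node receives a pod block with at least twice as many addresses as its maximum pods per node (a /24 for the default of 110). Verify this against the current GKE documentation. The function names are hypothetical.

```python
import ipaddress


def per_node_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the per-node pod block under the 2x assumption."""
    needed = 2 * max_pods_per_node
    bits = (needed - 1).bit_length()  # ceil(log2(needed)) for needed >= 1
    return 32 - bits


def max_nodes(pod_range: str, max_pods_per_node: int = 110) -> int:
    """How many nodes a pod secondary range can serve."""
    net = ipaddress.ip_network(pod_range)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < net.prefixlen:
        return 0  # the range is smaller than a single node's block
    return 2 ** (node_prefix - net.prefixlen)


for cidr in ("10.2.0.0/18", "10.2.0.0/16", "10.16.0.0/12"):
    print(cidr, "->", max_nodes(cidr), "nodes at 110 max pods per node")
```

Under these assumptions a /16 tops out at 256 nodes, so a fleet of many lightly loaded nodes can exhaust it long before the raw pod count gets near 65536.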
✅ Correct planning:
Primary VMs: 10.0.0.0/20 (4096)
Pods: 10.16.0.0/12 (1M IPs; a /12 must start on a multiple of 16 in the second octet)
→ Multiple clusters can fit
→ Room for 100x growth
vs.
❌ Incorrect planning:
Primary VMs: 10.0.0.0/20 (4096)
Pods: 10.2.0.0/18 (16K IPs)
→ Fits 10 clusters only
→ 10x growth = exhaustion
CIDR Allocation Strategy: The Spreadsheet Method
Step 1: Determine Organization Scope
Question: How many regions will we deploy to?
Answer 1: Just US (4 regions)
VPC: 10.0.0.0/12 (1M IPs) - sufficient
Answer 2: US + EU + Asia (12 regions)
VPC: 10.0.0.0/10 (4M IPs) - necessary
Answer 3: Global + future (possible 20 regions)
VPC: 10.0.0.0/8 (16M IPs) - prudent
Step 2: Allocate Per-Region Block
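Carving equal per-region blocks out of the VPC range is mechanical, so it can be generated rather than typed. A minimal sketch using the standard `ipaddress` module follows; `region_blocks` is an illustrative helper, and the region list is just an example.

```python
import ipaddress


def region_blocks(
    vpc_cidr: str, region_prefix: int, regions: list[str]
) -> dict[str, ipaddress.IPv4Network]:
    """Assign consecutive blocks of size /region_prefix to each region."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=region_prefix))
    if len(regions) > len(blocks):
        raise ValueError(
            f"{vpc} holds only {len(blocks)} /{region_prefix} blocks, "
            f"but {len(regions)} regions were requested"
        )
    return dict(zip(regions, blocks))


plan = region_blocks("10.0.0.0/8", 11, ["us-west1", "us-central1", "us-east1"])
for region, block in plan.items():
    print(f"{region}: {block}")
```

The capacity check matters: a /8 split into /11 blocks yields exactly eight regions, so a ninth has nowhere to go.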
VPC: 10.0.0.0/8 (16M IPs)
Region us-west1: 10.0.0.0/11 (2M IPs)
Region us-central1: 10.32.0.0/11
Region us-east1: 10.64.0.0/11
Region us-south1: 10.96.0.0/11
Region europe-west1: 10.128.0.0/11
Region europe-west2: 10.160.0.0/11
Region europe-west3: 10.192.0.0/11
Region asia-northeast1: 10.224.0.0/11
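Region asia-northeast2: [does not fit: the eight /11 blocks above use all of 10.0.0.0/8]
Formula: Each region gets /11 (2M IPs) from global /8
Regions 1-8: 10.0.0.0/11 through 10.224.0.0/11
Regions 9+: no space left at /11; if more than 8 regions are likely, allocate /12 per region (16 regions) or /13 (32 regions) from the start
Step 3: Per-Subnet Allocation Within Region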
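Inside a region block, each subnet has to start on a boundary that matches its size, which is where hand-written plans tend to go wrong. The sketch below is a deliberately simple first-fit allocator (the name `allocate` and the request list are hypothetical). It places requests in order, rounds each start address up to the required alignment, and does not model tiers or reserved gaps.

```python
import ipaddress


def allocate(
    region_cidr: str, requests: list[tuple[str, int]]
) -> dict[str, str]:
    """Place (name, prefix) requests sequentially inside a region block."""
    region = ipaddress.ip_network(region_cidr)
    cursor = int(region.network_address)
    end = int(region.broadcast_address)
    plan = {}
    for name, prefix in requests:
        size = 2 ** (32 - prefix)
        # Round the cursor up to the next multiple of the block size.
        start = ((cursor + size - 1) // size) * size
        if start + size - 1 > end:
            raise ValueError(f"no room left in {region} for {name} (/{prefix})")
        plan[name] = f"{ipaddress.ip_address(start)}/{prefix}"
        cursor = start + size
    return plan


requests = [("lb-a", 21), ("lb-b", 21), ("app-a", 19), ("app-b", 19), ("gke-pods", 12)]
for name, cidr in allocate("10.0.0.0/11", requests).items():
    print(f"{name}: {cidr}")
```

Note how the /12 for pods lands at 10.16.0.0 rather than directly after the last subnet: alignment leaves a gap, and that gap is what a real plan records as reserve.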
Region us-west1: 10.0.0.0/11 (2M IPs)
Tier 1 (LB): 10.0.0.0/15 (128K)
Subnet A: 10.0.0.0/21 (2K)
Subnet B: 10.0.8.0/21 (2K)
[Reserve: rest of the /15, 10.0.16.0 - 10.1.255.255, for growth]
Tier 2 (App): 10.2.0.0/16 (64K)
Subnet A: 10.2.0.0/19 (8K)
Subnet B: 10.2.32.0/19 (8K)
[Reserve: 10.2.64.0 - 10.2.255.255 for future subnets in this tier]
Secondary ranges:
GKE pods: 10.16.0.0/12 (1M)
[Reserve rest of region, 10.3.0.0 - 10.15.255.255, for cache, spillover]
Constraint: No Overlapping CIDRs Across Peered Networks
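Overlap between two networks can be checked mechanically before a peering is attempted. A minimal sketch, assuming each VPC is represented simply as a list of its subnet CIDRs (primary and secondary); `peering_conflicts` is an illustrative name:

```python
import ipaddress
from itertools import product


def peering_conflicts(
    vpc_a: list[str], vpc_b: list[str]
) -> list[tuple[str, str]]:
    """Return every pair of ranges (one per VPC) that overlap."""
    conflicts = []
    for a, b in product(vpc_a, vpc_b):
        if ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b)):
            conflicts.append((a, b))
    return conflicts


# A /24 nested inside the other VPC's /16 is a conflict.
print(peering_conflicts(["10.0.0.0/16"], ["10.0.0.0/24"]))
# Adjacent but distinct /24s are fine.
print(peering_conflicts(["10.0.0.0/24"], ["10.0.1.0/24"]))
```

An empty list means the two sets of ranges are disjoint, which is the precondition for peering.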
Critical rule for Shared VPC + multi-VPC peering:
❌ INVALID:
VPC-A subnet: 10.0.0.0/16
VPC-B subnet: 10.0.0.0/24 (OVERLAPS!)
Peering attempt: FAILS
✅ VALID:
VPC-A subnet: 10.0.0.0/16
VPC-B subnet: 10.16.0.0/12 (NO OVERLAP)
Peering: SUCCESS
✅ VALID:
VPC-A subnet: 10.0.0.0/24
VPC-B subnet: 10.0.1.0/24 (Different)
Peering: SUCCESS
Multi-VPC Planning
Organization with 3 teams, each needs own VPC for autonomy:
VPC "team-a" (10.0.0.0/12)
- 1M IPs
- Can fit 100+ subnets
- Peering with others
VPC "team-b" (10.16.0.0/12)
- Different CIDR block
- No overlap with team-a
VPC "team-c" (10.32.0.0/12)
- Can peer with team-a, team-b
- Full mesh peering enabled
Rule: Each org must maintain CIDR allocation spreadsheet
VPCs:
├── team-a: 10.0.0.0/12
├── team-b: 10.16.0.0/12
├── team-c: 10.32.0.0/12
├── shared-services: 10.48.0.0/12
└── reserve: 10.64.0.0/10 and 10.128.0.0/9 (future acquisition, DR)
Expanding Subnets: The Limited Options
Important: Primary ranges can expand, secondary ranges cannot.
Expanding Primary Range
Current: 10.0.0.0/20 (4096 IPs)
Need: 8192 IPs (double)
Option: Expand primary to /19
gcloud compute networks subnets expand-ip-range SUBNET \
--region=REGION \
--prefix-length=19
Result:
Before: 10.0.0.0/20 (4096)
After: 10.0.0.0/19 (8192)
Constraint: CAN ONLY EXPAND, NOT SHRINK
✅ /20 → /19 OK
✅ /20 → /18 OK
❌ /19 → /20 INVALID
Gotcha: Expansion locks in new size forever
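Since an expansion cannot be undone, it can help to dry-run it first. The sketch below checks two things: that the new prefix is actually shorter than the current one, and that the widened range would not run into a neighbouring subnet. `can_expand` and the `other_subnets` list are hypothetical; the real check is performed by GCP when the command runs.

```python
import ipaddress


def can_expand(
    current: str, new_prefix: int, other_subnets: list[str]
) -> tuple[bool, str]:
    """Return (ok, detail) for a proposed primary range expansion."""
    cur = ipaddress.ip_network(current)
    if new_prefix >= cur.prefixlen:
        return False, "new prefix must be shorter than the current one"
    new = cur.supernet(new_prefix=new_prefix)
    for other in other_subnets:
        if new.overlaps(ipaddress.ip_network(other)):
            return False, f"{new} would overlap {other}"
    return True, str(new)


print(can_expand("10.0.0.0/20", 19, ["10.0.32.0/20"]))
print(can_expand("10.0.0.0/20", 19, ["10.0.16.0/24"]))
print(can_expand("10.0.0.0/19", 20, []))
```

The second call shows the practical catch: a subnet placed immediately after the current range blocks any future expansion, so leave gaps when laying out primaries.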
Plan: 10.0.0.0/20 (4096 IPs) - thought enough
Reality: After 6 months, 5000 VMs
Action: Expand to 10.0.0.0/19 (8192)
New reality: After 12 months, 10K VMs
Needed: 10.0.0.0/18 (16K)
Action: Expand again
Lesson: Plan larger upfront, avoid repeated expansions
Secondary Ranges: Immutable Approach
Cannot modify secondary range once created:
✅ Add new secondary range
✅ Delete unused secondary range
❌ Resize existing secondary range
❌ Change CIDR of existing range
Solution for growth:
Step 1: Create secondary range 2 (10.4.0.0/16)
Step 2: Create new GKE cluster with range 2
Step 3: Migrate workloads from cluster 1 to cluster 2
Step 4: Delete old cluster, optionally delete secondary range 1
Production Sizing Examples
Small Organization (1-5 regions)
VPC: 10.0.0.0/12 (1M IPs)
Per-region: /16 (65K)
├── Subnets (primary): /24 each (256 IPs)
├── Secondary GKE: /18 per cluster (16K pods potential)
└── Reserve: /17 (32K per region, for growth)
Rationale:
- Single /12 sufficient for 15+ regions if needed
- Per-region /16 holds four /18 blocks: one GKE range, one for primary subnets, two kept as the /17 reserve
- Simple to manage
Medium Organization (5-15 regions)
VPC: 10.0.0.0/10 (4M IPs)
Per-region: /14 (256K IPs; a /10 holds 16 of these)
├── Primary subnets: /19 (8K IPs)
├── GKE secondaries: /16 each (64K pod IPs per secondary)
└── Reserve significant blocks
Rationale:
- Covers all major GCP regions
- Per-region allocation flexible
- Multi-VPC peering possible if teams separated
Large Organization (15+ regions, multi-VPC)
VPC 1 "production" (10.0.0.0/9)
VPC 2 "staging" (10.128.0.0/10)
VPC 3 "sandbox" (10.192.0.0/10)
Spreadsheet:
├── Per-VPC CIDR reservation
├── Per-region allocation
├── Per-tier subnet assignment
├── GKE secondary range allocation
├── On-premises IP ranges (for Interconnect)
└── ISP allocation (for public IPs, if own AS)
Disaster Recovery: Backup CIDR Block
For critical systems, maintain second VPC:
Primary VPC: "prod" (10.0.0.0/8)
├── All active workloads
DR VPC: "prod-dr" (172.16.0.0/12)
├── Warm standby, synchronized data
├── Non-overlapping CIDR (different prefix)
├── Connected via VPC Network Peering or Cloud VPN (Cloud Interconnect links on-premises networks, not two VPCs)
Failover:
Step 1: Update DNS to point to DR VPC IPs
Step 2: Activate services in DR
Step 3: Verify 100% traffic on DR
Step 4: Investigate primary VPC
Cost: 1.5x-2x during DR preparation
Risk mitigation: Worth it for critical systems
Troubleshooting CIDR Issues
Symptom: Cannot Peer Networks
Error: "Cannot establish peering, overlapping subnet ranges"
Diagnosis:
gcloud compute networks peerings list --network=vpc-a
Identify conflicting ranges:
VPC-A subnet 1: 10.0.0.0/16
VPC-B subnet 1: 10.0.0.0/20 ← OVERLAP!
Fix:
Option 1: Recreate VPC-B with different CIDR (delete/rebuild)
Option 2: Add secondary range to VPC-B if possible
Option 3: Use Interconnect instead of peering (more complex)
Symptom: IP Address Exhaustion
Error: "Failed to allocate IP, no addresses available in subnet"
Diagnosis:
gcloud compute networks subnets describe SUBNET --region=REGION
shows the subnet's range (ipCidrRange, e.g. 10.0.0.0/24); compare it with the number of addresses in use to see how few remain (e.g. 2 of 254)
Prevention:
- Monitor IP usage regularly
- Alert when >80% utilized
- Expand proactively
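The 80% rule above is straightforward to automate once you have a count of addresses in use. In this sketch the `used` figure is an input you would obtain from your own inventory or monitoring (how you collect it is left open), and the four reserved addresses are the same GCP assumption as for primary ranges. Both function names are hypothetical.

```python
import ipaddress


def utilization(cidr: str, used: int, reserved: int = 4) -> float:
    """Fraction of assignable addresses in a primary range that are in use."""
    capacity = ipaddress.ip_network(cidr).num_addresses - reserved
    if capacity <= 0:
        raise ValueError(f"{cidr} has no assignable addresses")
    return used / capacity


def needs_expansion(cidr: str, used: int, threshold: float = 0.80) -> bool:
    """True when utilization is above the alert threshold."""
    return utilization(cidr, used) > threshold


for used in (100, 210, 250):
    pct = utilization("10.0.0.0/24", used) * 100
    print(f"{used} in use: {pct:.0f}% -> alert={needs_expansion('10.0.0.0/24', used)}")
```

Alerting at 80% leaves time to run the expansion during a planned window rather than during an outage.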
Fix:
Immediate: Expand primary range (gcloud ... expand-ip-range)
Long-term: Plan secondary range growth
Conclusion
CIDR planning is the foundation of network design:
- ✅ Get it right: scales smoothly for years
- ❌ Get it wrong: costly migration, downtime
Key takeaways:
- Primary ranges expandable, secondary ranges immutable - plan for 10x growth
- Global uniqueness required for multi-VPC peering - spreadsheet is your friend
- Reserve space for future regions - cheaper than migration
- GKE pod CIDR is not VM primary CIDR - separate planning tracks
- Test CIDR layout before rollout - peering verification prevents disasters
Investing 4 hours in a spreadsheet now beats spending 40 hours recovering from IP exhaustion later.