Subnet Design & CIDR Planning — Masterclass at Scale

Executive Summary

IP address planning sounds boring, but it is a critical infrastructure decision that can:

❌ Strand your entire application (IP exhaustion = cannot scale)
✅ Prevent security incidents (non-overlapping, documented ranges keep firewall rules and peering unambiguous)
✅ Simplify multi-region operations (global IP uniqueness)

Subnet: More Than Just CIDR

A subnet in GCP is a regional network segment with:

Primary CIDR range: VM primary IP addresses
Secondary ranges: GKE pods, services, future use-cases
Region: us-west1, europe-west1, etc. (one region per subnet)
Availability: All zones in the region (us-west1-a, us-west1-b, us-west1-c)

Subnet "production"
├── Region: us-west1 (applies to us-west1-a, b, c)
├── Primary CIDR: 10.0.0.0/20 (4096 IPs)
│   ├── 10.0.0.0: Network (reserved)
│   ├── 10.0.0.1: Gateway (reserved)
│   ├── 10.0.0.2 - 10.0.15.253: Usable (4092 IPs)
│   ├── 10.0.15.254: Reserved by GCP for future use
│   └── 10.0.15.255: Broadcast (reserved)
├── Secondary "gke-pods": 10.2.0.0/16 (65536 IPs)
└── Secondary "gke-services": 10.3.0.0/16 (65536 IPs)
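
The reserved addresses in a primary range can be derived mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the function names `gcp_reserved` and `gcp_usable_count` are illustrative, and the sketch assumes GCP's rule of four reserved addresses per primary IPv4 range (network, default gateway, second-to-last, broadcast).

```python
import ipaddress


def gcp_reserved(primary_cidr):
    """Return the four addresses assumed to be reserved in a GCP primary range."""
    net = ipaddress.ip_network(primary_cidr)
    return {
        "network": net.network_address,          # first address
        "gateway": net.network_address + 1,      # second address
        "reserved": net.broadcast_address - 1,   # second-to-last address
        "broadcast": net.broadcast_address,      # last address
    }


def gcp_usable_count(primary_cidr):
    """Addresses left for VMs once the four reserved ones are removed."""
    return ipaddress.ip_network(primary_cidr).num_addresses - 4


for role, addr in gcp_reserved("10.0.0.0/20").items():
    print(f"{role:10} {addr}")
print("usable:", gcp_usable_count("10.0.0.0/20"))
```

For 10.0.0.0/20 this reports 10.0.0.1 as the gateway, 10.0.15.254 as the extra reserved address, and 4092 usable addresses.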

Understanding CIDR Allocation

Quick Reference: IP Count by Prefix

CIDR	Typical use	Total IPs	Usable (total - 2)	Notes
/32	Single host	1	1	One IP
/29	Minimum subnet	8	6	Rarely used
/28	Small pod	16	14	Dev environments
/27	Small cluster	32	30	~20 pods
/26	Subnet	64	62	~50 pods
/25	Medium subnet	128	126	~100 pods
/24	Standard subnet	256	254	~200 pods
/23	Large subnet	512	510	~400 pods
/22	Very large	1024	1022	~800 pods
/21	Huge subnet	2048	2046	~1600 pods
/20	Default GCP	4096	4094	~3200 pods
/16	Secondary range	65536	65534	50K+ pods
/12	Region block	1M	~1M	800K+ pods

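Formula: Usable IPs = 2^(32 - prefix) - 2 (exclude network + broadcast). A /32 is the exception: it is a single host address. GCP primary ranges reserve four addresses (network, gateway, second-to-last, broadcast), so subtract 4 there; secondary ranges have no reserved addresses.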
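
The arithmetic behind the reference table can be reproduced in a few lines. This is a small sketch of the classic formula only; `ip_counts` is an illustrative name, and treating /31 and /32 as having no network or broadcast address to subtract is an assumption of the sketch.

```python
def ip_counts(prefix):
    """Return (total, usable) address counts for an IPv4 prefix length."""
    if not 0 <= prefix <= 32:
        raise ValueError("prefix must be between 0 and 32")
    total = 2 ** (32 - prefix)
    # /31 and /32 are too small to set aside network and broadcast addresses.
    usable = total if prefix >= 31 else total - 2
    return total, usable


for prefix in (29, 24, 20, 16):
    total, usable = ip_counts(prefix)
    print(f"/{prefix}: total={total} usable={usable}")
```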

Primary vs Secondary Ranges

Primary Range (Mandatory)

Subnet primary range = VM IP addresses ONLY

subnet "tier1":
  primary: 10.0.0.0/20 (VMs get IPs from here)
  
VM creation:
  gcloud compute instances create vm1 --subnet=tier1
  → Gets 10.0.1.5 from primary range

Constraints:

Must not overlap any other range in the VPC (or in peered VPCs)
Applies to all zones in the region
Cannot be shrunk (only expanded)
GCP reserves four addresses: first (network), second (gateway), second-to-last (future use), last (broadcast)

Secondary Ranges (Optional but Recommended)

Subnet "tier1":
  primary: 10.0.0.0/20 (VMs)
  secondary-1: 10.2.0.0/16 (GKE pods)
  secondary-2: 10.3.0.0/16 (GKE services)

Use cases:

GKE pod CIDR (most important)
GKE service CIDR
Canary deployments (separate network)
Future use cases (reserve now, use later)

Benefit: VMs and pods have separate CIDRs

Firewall rules: allow traffic to pods (10.2.0.0/16) but not VMs
Load balancer: routes traffic to pod endpoint (10.2.0.0/16)
Monitoring: metrics distinguished by primary vs secondary

Production Pattern: Multi-tier Network Design

Scenario: E-commerce Platform

VPC: prod (10.0.0.0/8)

Region: us-west1 (10.0.0.0/11 = 2M IPs available)

  Subnet "lb-tier":
    Primary: 10.0.0.0/24 (Load Balancer VMs)
    Secondary-pods: 10.1.0.0/16 (GKE LB cluster)
    Secondary-services: 10.8.0.0/17
    Zones: all

  Subnet "app-tier":
    Primary: 10.0.1.0/24 (App Server VMs)
    Secondary-pods: 10.2.0.0/16 (GKE App cluster)
    Secondary-services: 10.8.128.0/17
    Zones: all

  Subnet "db-tier":
    Primary: 10.0.2.0/24 (Database VMs - very few)
    Secondary-pods: 10.3.0.0/16 (reserved; databases run on VMs or Cloud SQL, so few or no pods)
    Zones: all (but limited)

  Subnet "internal":
    Primary: 10.0.3.0/24 (Internal tools, monitoring)
    Secondary-shared: 10.10.0.0/16 (Shared services)
    Zones: all

Region: europe-west1 (10.32.0.0/11)

  Subnet "lb-tier":
    Primary: 10.32.0.0/24
    Secondary-pods: 10.33.0.0/16
    Secondary-services: 10.40.0.0/17

  [Same pattern for app, db, internal tiers]

Note: each services range sits outside its pod /16, because ranges in a VPC must not overlap.

Reserve: 10.64.0.0/10 and 10.128.0.0/9 (future regions, emergency expansion)
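
A plan like this is easy to get subtly wrong, for example by placing a services range inside a pod range. The sketch below, using Python's standard `ipaddress` module, flags any pair of ranges that overlap. The function name `find_overlaps` and the sample entries are illustrative; the last entry is deliberately wrong to show what a collision looks like.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return name pairs whose CIDR blocks overlap. `ranges` maps name -> CIDR."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical excerpt of a regional plan; "bad-services" sits inside "app-pods".
plan = {
    "lb-primary": "10.0.0.0/24",
    "app-primary": "10.0.1.0/24",
    "lb-pods": "10.1.0.0/16",
    "app-pods": "10.2.0.0/16",
    "bad-services": "10.2.128.0/17",
}
print(find_overlaps(plan))
```

Running a check of this kind against the allocation spreadsheet before creating any subnet is cheaper than discovering the collision at creation time.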

Firewall Strategy for Multi-tier

Firewall rules:

rule-100: allow ingress to lb-pods (10.1.0.0/16)
  from 0.0.0.0/0, port 443
  target: tag:public-lb

rule-200: allow ingress to app-pods (10.2.0.0/16)
  from lb-pods (10.1.0.0/16), port 8080
  target: tag:app

rule-300: allow ingress to db (10.0.2.0/24)
  from app-pods (10.2.0.0/16), port 3306
  target: tag:database

rule-999: deny all other ingress (implicit: GCP's implied deny-ingress rule, priority 65535)

Advantage: Clear traffic patterns, easy to audit
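
The tier-to-tier policy above can be audited offline before any rule is created. The following is a simplified sketch, not a model of the real firewall engine: it ignores target tags and priorities, treats every listed rule as an allow rule, and falls back to deny when nothing matches. `AllowRule` and `matching_rule` are illustrative names.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class AllowRule:
    name: str
    source: str       # source CIDR
    destination: str  # destination CIDR
    port: int


RULES = [
    AllowRule("rule-100", "0.0.0.0/0", "10.1.0.0/16", 443),
    AllowRule("rule-200", "10.1.0.0/16", "10.2.0.0/16", 8080),
    AllowRule("rule-300", "10.2.0.0/16", "10.0.2.0/24", 3306),
]


def matching_rule(src_ip, dst_ip, port, rules=RULES):
    """Return the first allow rule that matches, or None for the implicit deny."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if (
            port == rule.port
            and src in ipaddress.ip_network(rule.source)
            and dst in ipaddress.ip_network(rule.destination)
        ):
            return rule.name
    return None


# An LB pod may reach an app pod, but not the database directly.
print(matching_rule("10.1.5.9", "10.2.3.4", 8080))
print(matching_rule("10.1.5.9", "10.0.2.10", 3306))
```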

GKE Pod CIDR Planning

GKE clusters consume secondary ranges at scale:

Single Cluster

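GKE cluster "main-app" in subnet "app-tier":
  Pod CIDR: 10.2.0.0/16 (secondary range)
  Service CIDR: 10.8.128.0/17 (separate secondary range; must not overlap the pod CIDR)
  
  Pods: 1000 pods = ~1000 IPs in use out of 65536
  Services: 100 services = ~100 IPs consumed from 32768
  
  Remaining: ~64500 pod IPs available for growth
  Growth potential: ~65x current size by raw IP count

  Caveat: GKE hands out pod IPs in per-node blocks (a /24 per node at the
  default of 110 max pods per node), so a /16 supports at most 256 nodes;
  node count usually exhausts the range before pod count does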
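
Raw IP counts overstate how far a pod range stretches, because VPC-native GKE clusters reserve a block of pod addresses per node. The sketch below assumes GKE's documented sizing rule (each node gets the smallest power-of-two block that holds at least twice its maximum pod count, which is a /24 at the default of 110 pods per node). `per_node_prefix` and `max_nodes` are illustrative names.

```python
import ipaddress


def per_node_prefix(max_pods_per_node=110):
    """Prefix length of the pod block assumed to be reserved for each node."""
    # Smallest power of two >= 2 * max pods, expressed as a number of host bits.
    host_bits = (2 * max_pods_per_node - 1).bit_length()
    return 32 - host_bits


def max_nodes(pod_cidr, max_pods_per_node=110):
    """Number of per-node blocks that fit in the pod range."""
    pod_range = ipaddress.ip_network(pod_cidr)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < pod_range.prefixlen:
        return 0  # the range is smaller than a single node block
    return 2 ** (node_prefix - pod_range.prefixlen)


print(max_nodes("10.2.0.0/16"))       # default 110 pods per node
print(max_nodes("10.2.0.0/16", 32))   # lower pod density, more nodes
```

Under these assumptions a /16 pod range tops out at 256 nodes with default settings; lowering the maximum pods per node to 32 raises that to 1024.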

Multi-cluster in Same Region

Region "us-west1" subnet "app-tier":

Cluster "canary":
  Pod CIDR: 10.2.0.0/24 (256 IPs)
  Services: 10.2.1.0/24

Cluster "staging":
  Pod CIDR: 10.2.2.0/24
  Services: 10.2.3.0/24

Cluster "production":
  Pod CIDR: 10.2.16.0/20 (4096 IPs)
  Services: 10.2.32.0/20

Layout:
  10.2.0.0/16 (reserved for clusters)
  ├── 10.2.0.0/24 (canary pods)
  ├── 10.2.1.0/24 (canary services)
  ├── 10.2.2.0/24 (staging pods)
  ├── 10.2.3.0/24 (staging services)
  ├── 10.2.4.0 - 10.2.15.255 (free; a /20 must start on a multiple of 16 in the third octet)
  ├── 10.2.16.0/20 (prod pods)
  ├── 10.2.32.0/20 (prod services)
  └── 10.2.48.0 - 10.2.255.255 (reserved for future clusters)

Design principle: Reserved secondary range >> sum of all clusters

Sum of pod ranges: 256 + 256 + 4096 = 4608 IPs (9216 including service ranges)
Secondary range: 10.2.0.0/16 = 65536 IPs
Safety margin: ~14x on pod ranges, ~7x including services (allows growth, mistakes, spikes)
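
Two checks are worth automating for a layout like this: that every block starts on a boundary valid for its prefix length, and that the safety margin is what the plan claims. The sketch below uses Python's standard `ipaddress` module; the function names are illustrative. A block such as 10.2.4.0/20 fails the first check, because a /20 has to start on a multiple of 16 in the third octet.

```python
import ipaddress


def is_aligned(cidr):
    """True if the address is a valid network boundary for its prefix length."""
    try:
        ipaddress.ip_network(cidr, strict=True)
        return True
    except ValueError:
        return False


def containing_block(cidr):
    """Return the properly aligned block that contains the given address."""
    return str(ipaddress.ip_network(cidr, strict=False))


def safety_margin(reserved_cidr, cluster_cidrs):
    """Ratio of reserved addresses to addresses handed to clusters."""
    reserved = ipaddress.ip_network(reserved_cidr).num_addresses
    used = sum(ipaddress.ip_network(c).num_addresses for c in cluster_cidrs)
    return reserved / used


print(is_aligned("10.2.4.0/20"), containing_block("10.2.4.0/20"))
pod_ranges = ["10.2.0.0/24", "10.2.2.0/24", "10.2.16.0/20"]
print(round(safety_margin("10.2.0.0/16", pod_ranges), 1))
```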

Real-world Pitfall: Secondary Range Exhaustion

Year 1:
  App tier secondary: 10.2.0.0/16
  Clusters: canary (50 pods), staging (100 pods), prod (500 pods)
  Used: ~650 IPs
  Available: 64886 IPs ← "plenty of room"

Year 2:
  Product success → 5000 pods in production
  New canary: 1000 pods
  New staging: 2000 pods
  Used: 8000 IPs
  Available: 57536 IPs ← "still fine"

Year 3:
  Multi-tier rollout: 20000 pods across regions
  Per-region usage: 20000 / 4 regions = 5000 pods per region
  US-west1 allocated: 5000 pods
  Problem: 10.2.0.0/16 insufficient!
    Pod IPs are consumed per node, not per pod: at the GKE default of a /24
    per node, a /16 holds only 256 nodes across all clusters in the range
  
  Fix required:
    Option 1: Add new secondary range 10.4.0.0/16
      Problem: existing clusters point to 10.2.0.0/16
      Requires: new cluster + data migration
    
    Option 2: Expand existing range
      Impossible in GCP! Secondary ranges cannot be resized in place

Prevention: Reserve larger secondary ranges

✅ Correct planning:
  Primary VMs: 10.0.0.0/20 (4096)
  Pods: 10.16.0.0/12 (1M IPs)
    → Multiple clusters can fit
    → Room for 100x growth

vs.

❌ Incorrect planning:
  Primary VMs: 10.0.0.0/20 (4096)
  Pods: 10.2.0.0/18 (16K IPs)
    → Fits 10 clusters only
    → 10x growth = exhaustion

CIDR Allocation Strategy: The Spreadsheet Method

Step 1: Determine Organization Scope

Question: How many regions will we deploy to?

Answer 1: Just US (4 regions)
  VPC: 10.0.0.0/12 (1M IPs) - sufficient

Answer 2: US + EU + Asia (12 regions)
  VPC: 10.0.0.0/10 (4M IPs) - necessary

Answer 3: Global + future (possible 20 regions)
  VPC: 10.0.0.0/8 (16M IPs) - prudent

Step 2: Allocate Per-Region Block

VPC: 10.0.0.0/8 (16M IPs)

Region us-west1: 10.0.0.0/11 (2M IPs)
Region us-central1: 10.32.0.0/11
Region us-east1: 10.64.0.0/11
Region us-south1: 10.96.0.0/11
Region europe-west1: 10.128.0.0/11
Region europe-west2: 10.160.0.0/11
Region europe-west3: 10.192.0.0/11
Region asia-northeast1: 10.224.0.0/11
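Region asia-northeast2: [No /11 left; see Regions 9+ below]

Formula: Each region gets /11 (2M IPs) from global /8
  Regions 1-8: 10.0.0.0/11 through 10.224.0.0/11
  Regions 9+: a /8 holds exactly eight /11 blocks, so nothing remains;
    if more than 8 regions are likely, allocate /12 per region (16 blocks) from the start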
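
The per-region split is a straightforward subdivision of the VPC block. A minimal sketch with Python's standard `ipaddress` module follows; `region_capacity` and `allocate_regions` are illustrative names, and the region list simply mirrors the one above.

```python
import ipaddress


def region_capacity(vpc_cidr, region_prefix):
    """How many per-region blocks of the given prefix fit in the VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return 2 ** (region_prefix - vpc.prefixlen)


def allocate_regions(vpc_cidr, regions, region_prefix=11):
    """Assign consecutive blocks of `region_prefix` to the regions, in order."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=region_prefix))
    if len(regions) > len(blocks):
        raise ValueError(
            f"{vpc_cidr} holds only {len(blocks)} /{region_prefix} blocks"
        )
    return {region: str(block) for region, block in zip(regions, blocks)}


regions = [
    "us-west1", "us-central1", "us-east1", "us-south1",
    "europe-west1", "europe-west2", "europe-west3", "asia-northeast1",
]
for region, block in allocate_regions("10.0.0.0/8", regions).items():
    print(f"{region:16} {block}")
print("capacity at /11:", region_capacity("10.0.0.0/8", 11))
print("capacity at /12:", region_capacity("10.0.0.0/8", 12))
```

A ninth region makes `allocate_regions` raise at /11, which is the capacity limit worth knowing before the first subnet is created.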

Step 3: Per-Subnet Allocation Within Region

Region us-west1: 10.0.0.0/11 (2M IPs)

  Tier 1 (LB): 10.0.0.0/15 (128K)
    Subnet A: 10.0.0.0/21 (2K)
    Subnet B: 10.0.8.0/21 (2K)
    [Reserve: 10.0.16.0 - 10.1.255.255 for growth]

  Tier 2 (App): 10.2.0.0/16 (64K)
    Subnet A: 10.2.0.0/19 (8K)
    Subnet B: 10.2.32.0/19 (8K)
    [Reserve: 10.2.64.0 - 10.2.255.255 for future tiers]

  Secondary ranges:
    GKE pods: 10.16.0.0/12 (1M)
    [Reserve rest of region for cache, spillover]
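
Picking subnet addresses by hand is where misaligned or overlapping blocks tend to creep in. One way to avoid that is a first-fit allocator that only ever hands out aligned, unused blocks. The class below is a sketch of that idea, not a GCP API; `BlockAllocator` and the subnet names are hypothetical.

```python
import ipaddress


class BlockAllocator:
    """Hands out the first free, correctly aligned block of a requested size."""

    def __init__(self, parent_cidr):
        self.parent = ipaddress.ip_network(parent_cidr)
        self.allocated = {}  # name -> ip_network

    def allocate(self, name, prefix):
        # subnets() only yields aligned blocks, so alignment is guaranteed.
        for candidate in self.parent.subnets(new_prefix=prefix):
            if not any(candidate.overlaps(used) for used in self.allocated.values()):
                self.allocated[name] = candidate
                return str(candidate)
        raise ValueError(f"no free /{prefix} left in {self.parent}")


tier = BlockAllocator("10.0.0.0/15")
print(tier.allocate("lb-a", 21))
print(tier.allocate("lb-b", 21))
print(tier.allocate("app-a", 19))
```

First-fit keeps small subnets packed at the start of the block; the /19 request skips ahead to the first /19 boundary that does not touch the two /21 blocks.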

Constraint: No Overlapping CIDRs Across Peered Networks

Critical rule for Shared VPC + multi-VPC peering:

❌ INVALID:
  VPC-A subnet: 10.0.0.0/16
  VPC-B subnet: 10.0.0.0/24 (OVERLAPS!)
  Peering attempt: FAILS

✅ VALID:
  VPC-A subnet: 10.0.0.0/16
  VPC-B subnet: 10.16.0.0/12 (NO OVERLAP)
  Peering: SUCCESS

✅ VALID:
  VPC-A subnet: 10.0.0.0/24
  VPC-B subnet: 10.0.1.0/24 (Different)
  Peering: SUCCESS
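
The three cases above can be checked before a peering request is ever submitted. This is a minimal sketch using Python's standard `ipaddress` module; `peering_conflicts` is an illustrative name, and the inputs are simply the subnet ranges of the two VPCs.

```python
import ipaddress
from itertools import product


def peering_conflicts(vpc_a_subnets, vpc_b_subnets):
    """Return every (a, b) pair of subnet ranges that overlap across two VPCs."""
    conflicts = []
    for a, b in product(vpc_a_subnets, vpc_b_subnets):
        if ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b)):
            conflicts.append((a, b))
    return conflicts


print(peering_conflicts(["10.0.0.0/16"], ["10.0.0.0/24"]))   # overlap
print(peering_conflicts(["10.0.0.0/16"], ["10.16.0.0/12"]))  # clean
print(peering_conflicts(["10.0.0.0/24"], ["10.0.1.0/24"]))   # clean
```

An empty list means the ranges themselves do not block peering; include secondary ranges in the input as well, since they count toward overlap too.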

Multi-VPC Planning

Organization with 3 teams, each needs own VPC for autonomy:

VPC "team-a" (10.0.0.0/12)
  - 1M IPs
  - Can fit 100+ subnets
  - Peering with others

VPC "team-b" (10.16.0.0/12)
  - Different CIDR block
  - No overlap with team-a

VPC "team-c" (10.32.0.0/12)
  - Can peer with team-a, team-b
  - Full mesh peering enabled

Rule: Each org must maintain CIDR allocation spreadsheet

VPCs:
├── team-a: 10.0.0.0/12
├── team-b: 10.16.0.0/12
├── team-c: 10.32.0.0/12
├── shared-services: 10.48.0.0/12
└── reserve: 10.64.0.0/10 and 10.128.0.0/9 (future acquisition, DR)
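
The reserve line of the spreadsheet does not have to be worked out by hand: it is whatever remains of the parent block after the team allocations are removed. The sketch below computes that with `address_exclude` from Python's standard `ipaddress` module. `free_blocks` is an illustrative name, and using 10.0.0.0/8 as the organization-wide block is an assumption.

```python
import ipaddress


def free_blocks(parent_cidr, allocated_cidrs):
    """Return the unallocated remainder of the parent as a list of CIDR blocks."""
    free = [ipaddress.ip_network(parent_cidr)]
    for cidr in allocated_cidrs:
        taken = ipaddress.ip_network(cidr)
        remaining = []
        for block in free:
            if taken.subnet_of(block):
                # Split the free block around the allocation.
                remaining.extend(block.address_exclude(taken))
            elif block.subnet_of(taken):
                continue  # the allocation swallows this free block entirely
            else:
                remaining.append(block)
        free = remaining
    return [str(block) for block in sorted(free)]


teams = ["10.0.0.0/12", "10.16.0.0/12", "10.32.0.0/12", "10.48.0.0/12"]
print(free_blocks("10.0.0.0/8", teams))
```

With the four /12 allocations above, the remainder of 10.0.0.0/8 comes out as two blocks, 10.64.0.0/10 and 10.128.0.0/9.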

Expanding Subnets: The Limited Options

Important: Primary ranges can expand, secondary ranges cannot.

Expanding Primary Range

Current: 10.0.0.0/20 (4096 IPs)
Need: 8192 IPs (double)

Option: Expand primary to /19

gcloud compute networks subnets expand-ip-range SUBNET \
  --region=REGION \
  --prefix-length=19

Result:
  Before: 10.0.0.0/20 (4096)
  After: 10.0.0.0/19 (8192)
  
Constraint: CAN ONLY EXPAND, NOT SHRINK
  ✅ /20 → /19 OK
  ✅ /20 → /18 OK
  ❌ /19 → /20 INVALID

Gotcha: Expansion locks in new size forever

Plan: 10.0.0.0/20 (4096 IPs) - thought enough

Reality: After 6 months, 5000 VMs
Action: Expand to 10.0.0.0/19 (8192)

New reality: After 12 months, 10K VMs
Needed: 10.0.0.0/18 (16K)
Action: Expand again

Lesson: Plan larger upfront, avoid repeated expansions

Secondary Ranges: Immutable Approach

Cannot modify secondary range once created:
  ✅ Add new secondary range
  ✅ Delete unused secondary range
  ❌ Resize existing secondary range
  ❌ Change CIDR of existing range

Solution for growth:
  Step 1: Create secondary range 2 (10.4.0.0/16)
  Step 2: Create new GKE cluster with range 2
  Step 3: Migrate workloads from cluster 1 to cluster 2
  Step 4: Delete old cluster, optionally delete secondary range 1

Production Sizing Examples

Small Organization (1-5 regions)

VPC: 10.0.0.0/12 (1M IPs)

Per-region: /16 (65K)
├── Subnets (primary): /24 each (256 IPs)
├── Secondary GKE: /18 per cluster (16K pod IPs)
└── Reserve: /17 (32K per region, for growth)

Rationale:
  - A single /12 holds 16 per-region /16 blocks, enough for 15+ regions if needed
  - Each /16 splits into four /18 blocks: one for primary subnets, one for a GKE cluster, two (the /17) in reserve
  - Simple to manage

Medium Organization (5-15 regions)

VPC: 10.0.0.0/10 (4M IPs)

Per-region: /14 (256K; sixteen /14 blocks fit in the /10)
├── Primary subnets: /19 (8K IPs each)
├── GKE secondaries: /16 each (64K pod IPs per range)
└── Reserve significant blocks

Rationale:
  - Covers all major GCP regions
  - Per-region allocation flexible
  - Multi-VPC peering possible if teams separated

Large Organization (15+ regions, multi-VPC)

VPC 1 "production" (10.0.0.0/9)
VPC 2 "staging" (10.128.0.0/10)
VPC 3 "sandbox" (10.192.0.0/10)

Spreadsheet:
  ├── Per-VPC CIDR reservation
  ├── Per-region allocation
  ├── Per-tier subnet assignment
  ├── GKE secondary range allocation
  ├── On-premises IP ranges (for Interconnect)
  └── ISP allocation (for public IPs, if own AS)

Disaster Recovery: Backup CIDR Block

For critical systems, maintain second VPC:

Primary VPC: "prod" (10.0.0.0/8)
  ├── All active workloads

DR VPC: "prod-dr" (172.16.0.0/12)
  ├── Warm standby, synchronized data
  ├── Non-overlapping CIDR (different prefix)
  ├── Connected via VPC Network Peering or HA VPN (Cloud Interconnect is for on-premises links)
  
Failover:
  Step 1: Update DNS to point to DR VPC IPs
  Step 2: Activate services in DR
  Step 3: Verify 100% traffic on DR
  Step 4: Investigate primary VPC
  
Cost: 1.5x-2x during DR preparation
Risk mitigation: Worth it for critical systems

Troubleshooting CIDR Issues

Symptom: Cannot Peer Networks

Error: "Cannot establish peering, overlapping subnet ranges"

Diagnosis:
  gcloud compute networks peerings list --network=vpc-a
  
  Identify conflicting ranges:
    VPC-A subnet 1: 10.0.0.0/16
    VPC-B subnet 1: 10.0.0.0/20 ← OVERLAP!
  
Fix:
  Option 1: Create a non-overlapping subnet in VPC-B, migrate workloads, delete the overlapping subnet
  Option 2: Recreate VPC-B with a different CIDR (delete/rebuild)
  Option 3: Skip peering and expose only the needed services, e.g. via Private Service Connect (more complex)

Symptom: IP Address Exhaustion

Error: "Failed to allocate IP, no addresses available in subnet"

Diagnosis:
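  gcloud compute networks subnets describe SUBNET --region=REGION
  
  The output gives the range (ipCidrRange), not usage; compare it with the
  number of addresses already assigned, e.g. 10.0.0.0/24 with 250 of 252 usable IPs taken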

Prevention:
  - Monitor IP usage regularly
  - Alert when >80% utilized
  - Expand proactively

Fix:
  Immediate: Expand primary range (gcloud ... expand-ip-range)
  Long-term: Plan secondary range growth
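
The 80% alert threshold listed under Prevention is simple to compute once the number of assigned addresses is known. The sketch below assumes that count comes from your own inventory or monitoring (it is passed in as `used_ips`) and that four addresses per primary range are reserved by GCP. `utilization` and `needs_alert` are illustrative names.

```python
import ipaddress

GCP_RESERVED_PER_PRIMARY = 4  # network, gateway, second-to-last, broadcast


def utilization(primary_cidr, used_ips):
    """Fraction of usable addresses in a primary range that are assigned."""
    total = ipaddress.ip_network(primary_cidr).num_addresses
    usable = total - GCP_RESERVED_PER_PRIMARY
    return used_ips / usable


def needs_alert(primary_cidr, used_ips, threshold=0.80):
    """True once utilization exceeds the alert threshold."""
    return utilization(primary_cidr, used_ips) > threshold


print(needs_alert("10.0.0.0/24", 250))  # nearly exhausted
print(needs_alert("10.0.0.0/24", 100))  # plenty of room
```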

Conclusion

CIDR planning is the foundation of network design:

✅ Get it right: scales smoothly for years
❌ Get it wrong: costly migration, downtime

Key takeaways:

Primary ranges expandable, secondary ranges immutable - plan for 10x growth
Global uniqueness required for multi-VPC peering - spreadsheet is your friend
Reserve space for future regions - cheaper than migration
GKE pod CIDR is not VM primary CIDR - separate planning tracks
Test CIDR layout before rollout - peering verification prevents disasters

Invest 4 hours in spreadsheet now >> 40 hours recovering from IP exhaustion later.
