VPC Flow Logs Analysis — Network Observability at Scale

Executive Summary

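VPC Flow Logs record a sample of the network flows sent and received by VM instances in a GCP VPC network. They capture flow metadata (addresses, ports, byte counts), not packet payloads, and the sample rate is configurable.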

Features:

  • ✅ Granular network telemetry (one record per 5-tuple flow)
  • ✅ Configurable sampling (1% → 100%)
  • ✅ Export to Cloud Logging, BigQuery, Cloud Storage
  • ✅ Troubleshooting: latency, packet loss, DoS patterns
  • ❌ Cost: ~$0.50/GB (can be high at scale)
  • ❌ Privacy: Records flow-level metadata (who talks to whom, on which ports)

What VPC Flow Logs Capture

Flow Tuple

A 5-tuple identifies a flow:
  - Source IP: 10.0.1.5
  - Source port: 54321
  - Destination IP: 10.0.2.10
  - Destination port: 3306
  - Protocol: TCP

One flow record is emitted per 5-tuple per aggregation interval (default 5s).
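
To make the "one record per tuple per interval" idea concrete, here is a minimal Python sketch of that aggregation. It is only an illustration of the concept, not how GCP implements it; the `FlowKey` type, the `aggregate_flows` helper, and the packet tuples are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP


def aggregate_flows(packets, interval_sec=5):
    """Group (timestamp, FlowKey, size_bytes) packets into one record
    per 5-tuple per aggregation interval."""
    records = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts, key, size in packets:
        # Start of the aggregation window this packet falls into
        window_start = int(ts // interval_sec) * interval_sec
        rec = records[(window_start, key)]
        rec["packets"] += 1
        rec["bytes"] += size
    return dict(records)


db_flow = FlowKey("10.0.1.5", 54321, "10.0.2.10", 3306, 6)
packets = [(0.5, db_flow, 100), (4.9, db_flow, 200), (5.1, db_flow, 300)]
flows = aggregate_flows(packets)
for (window_start, key), rec in sorted(flows.items()):
    print(window_start, key.dest_port, rec)
```

The first two packets fall into the same 5-second window and collapse into a single record; the third starts a new record for the same tuple.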

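Sample Log Entry

The entry below is simplified for readability. Actual entries nest the 5-tuple under jsonPayload.connection and encode counters such as bytes_sent as strings, which is why the queries later in this article cast them to INT64.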

json
{
  "insertId": "abcdef1234567890",
  "resource": {
    "type": "gce_subnetwork",
    "labels": {
      "subnet_name": "prod-app",
      "network_name": "prod-vpc",
      "region": "us-central1"
    }
  },
  "timestamp": "2026-05-19T12:34:56Z",
  "jsonPayload": {
    "bytes_received": 1024,
    "bytes_sent": 2048,
    "dest_ip": "10.0.2.10",
    "dest_port": 3306,
    "protocol": 6,
    "src_ip": "10.0.1.5",
    "src_port": 54321,
    "start_time": "2026-05-19T12:34:51Z",
    "end_time": "2026-05-19T12:34:56Z",
    "tcp_flags": "SYN,ACK",
    "rtt_msec": 2
  }
}
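
As a quick illustration of reading such an entry, the following Python sketch turns the payload into a one-line summary. It assumes the simplified field names used in the sample above; `summarize_flow` and `PROTOCOL_NAMES` are hypothetical helpers, not part of any Google library.

```python
# IANA protocol numbers for the protocols mentioned in this article
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def summarize_flow(entry: dict) -> str:
    """Render one flow log entry as a human-readable line."""
    p = entry["jsonPayload"]
    proto = PROTOCOL_NAMES.get(p["protocol"], str(p["protocol"]))
    return (
        f'{proto} {p["src_ip"]}:{p["src_port"]} -> '
        f'{p["dest_ip"]}:{p["dest_port"]} '
        f'sent={p["bytes_sent"]}B received={p["bytes_received"]}B '
        f'rtt={p["rtt_msec"]}ms'
    )


entry = {
    "jsonPayload": {
        "bytes_received": 1024,
        "bytes_sent": 2048,
        "dest_ip": "10.0.2.10",
        "dest_port": 3306,
        "protocol": 6,
        "src_ip": "10.0.1.5",
        "src_port": 54321,
        "rtt_msec": 2,
    }
}
summary = summarize_flow(entry)
print(summary)
```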

Sampling: Trading Detail for Cost

Sampling Rates

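Sampling error shrinks as the rate rises. To estimate totals, scale observed counts by 1/rate.

1% sampling (0.01):
  - About 1 in 100 flow records kept
  - Cost: Minimal
  - Data: Enough for aggregate trends on busy subnets
  - Accuracy: Lowest; short-lived or rare flows are easily missed

10% sampling (0.1):
  - About 1 in 10 flow records kept
  - Cost: ~10× the 1% setting
  - Data: More detailed
  - Accuracy: Better; scaled-up estimates are less noisy

50% sampling (0.5, the gcloud default):
  - About 1 in 2 flow records kept
  - Cost: ~50× the 1% setting
  - Data: Very detailed
  - Accuracy: High for all but the rarest flows

100% sampling (1.0):
  - Every generated flow record kept
  - Cost: Maximum (~100× the 1% setting)
  - Data: Complete (massive)
  - For: High-value troubleshooting only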
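
The relationship between sample rate and accuracy can be sketched numerically. The Python below assumes an idealized model in which each flow record is kept independently with probability equal to the sample rate; real sampling is more involved, so treat the numbers as intuition only. Both function names are hypothetical.

```python
import math


def estimate_total(observed: int, sample_rate: float) -> float:
    """Scale a sampled count back up to an estimate of the true count."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    return observed / sample_rate


def relative_std_error(total_flows: int, sample_rate: float) -> float:
    """Relative standard error of the scaled-up estimate, assuming each of
    total_flows records is kept independently with probability sample_rate."""
    if total_flows <= 0 or not 0.0 < sample_rate <= 1.0:
        raise ValueError("need total_flows > 0 and sample_rate in (0, 1]")
    return math.sqrt((1.0 - sample_rate) / (total_flows * sample_rate))


for rate in (0.01, 0.1, 0.5, 1.0):
    print(f"rate={rate:<4} relative error for 10,000 flows: "
          f"{relative_std_error(10_000, rate):.3f}")
```

Under this model the error falls as the sample rate rises and reaches zero at 100%, which is why higher rates buy accuracy at the price of volume.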

Calculating Cost Impact

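Caveat: flow logs contain metadata records, not payload, so real log volume depends on the number of flow records and their size (see Cost Impact of Metadata) rather than on throughput. The figures below model log volume as throughput × sample rate, which overstates cost for most workloads. Treat them as a pessimistic planning bound and measure actual log bytes before budgeting.

Typical workload:

Network throughput: 10 Gbps average
  = 10 × 10^9 bits/sec
  = 1.25 × 10^9 bytes/sec
  = 1.25 GB/sec
  = 108 TB/day

VPC Flow Logs (1% sampling):
  Size: 108 TB × 1% = 1.08 TB/day
  Cost: 1,080 GB × $0.50/GB = $540/day
       = $16,200/month (30 days)

VPC Flow Logs (100% sampling):
  Size: 108 TB/day
  Cost: 108,000 GB × $0.50/GB = $54,000/day
       = $1,620,000/month (!!)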
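
An estimate based on record counts is usually closer to reality than one based on throughput. The Python sketch below assumes a flat $0.50/GB price and ~500 bytes per record (both figures from this article); the 20,000 records/sec input is a purely hypothetical workload, and `daily_log_cost_usd` is an illustrative helper.

```python
SECONDS_PER_DAY = 86_400


def daily_log_cost_usd(records_per_sec: float,
                       bytes_per_record: int,
                       sample_rate: float,
                       usd_per_gb: float = 0.50) -> float:
    """Estimate daily flow log cost from record volume rather than throughput.

    records_per_sec is the number of flow records generated before sampling.
    """
    bytes_per_day = (records_per_sec * SECONDS_PER_DAY
                     * bytes_per_record * sample_rate)
    return bytes_per_day / 1e9 * usd_per_gb


# Hypothetical workload: 20,000 records/sec, full metadata, 50% sampling
cost = daily_log_cost_usd(20_000, 500, 0.5)
print(f"${cost:,.2f}/day")
```

Replacing the hypothetical record rate with a measured one (for example, the entry count from a day of Cloud Logging data) turns this into a usable budget check.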

Configuring VPC Flow Logs

Enable on Subnet

bash
# Enable flow logs with a 5-second aggregation interval, 50% sampling,
# and all metadata fields included
gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-aggregation-interval=interval-5-sec \
  --logging-flow-sample-rate=0.5 \
  --logging-metadata=include-all

Export Destinations

Flow logs are always written to Cloud Logging first. BigQuery and Cloud Storage are reached by routing the entries through a Log Router sink.

Option 1: Cloud Logging (default)
  Retention: 30 days (configurable)
  Query: Use gcloud logging read or the Logs Explorer
  Cost: Flow Log cost + Logging storage

Option 2: BigQuery (via sink)
  Retention: User-managed (can be years)
  Query: SQL (powerful for analysis)
  Cost: Flow Log cost + BigQuery storage + query cost

Option 3: Cloud Storage (via sink)
  Retention: User-managed
  Query: Download and analyze locally
  Cost: Flow Log cost + GCS storage

Best practice: Cloud Logging for short-term troubleshooting
               BigQuery for long-term analysis/compliance

Routing to BigQuery

bash
gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs

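# Then route the entries to a BigQuery dataset with a Log Router sink.
# PROJECT_ID, DATASET, and the sink name are placeholders; the sink's
# writer identity also needs BigQuery Data Editor on the dataset.
gcloud logging sinks create vpc-flow-logs-sink \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET \
  --log-filter='resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows")'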

Enabling flow logs with Terraform (the BigQuery sink is configured separately):
resource "google_compute_subnetwork" "prod_app" {
  name          = "prod-app"
  ip_cidr_range = "10.0.1.0/24"
  region        = "us-central1"
  network       = "prod-vpc"  # name or self_link of the VPC network

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sample_rate     = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

Analysis: Common Queries

Query 1: Top Talkers

sql
SELECT
  src_ip,
  COUNT(*) as flow_count,
  SUM(CAST(bytes_sent as INT64)) as total_bytes_sent
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp BETWEEN "2026-05-19T00:00:00Z" 
                    AND "2026-05-19T23:59:59Z"
GROUP BY src_ip
ORDER BY total_bytes_sent DESC
LIMIT 10;

Result: Which VMs are sending the most traffic?
        (Helps identify data exfiltration or misconfigured processes)

Query 2: Traffic by Port

sql
SELECT
  dest_port,
  COUNT(*) as flow_count,
  COUNTIF(tcp_flags LIKE "%SYN%") as syn_count
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
                    AND CURRENT_TIMESTAMP()
GROUP BY dest_port
ORDER BY flow_count DESC
LIMIT 20;

Result: Which ports are most active?
        (Helps identify scanning: many SYNs with little or no data)

Query 3: Latency Analysis (RTT)

sql
SELECT
  src_ip,
  dest_ip,
  dest_port,
  APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(50)] as p50_rtt,
  APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(99)] as p99_rtt,
  MAX(CAST(rtt_msec as INT64)) as max_rtt
FROM `project.dataset.vpc_flow_logs`
WHERE rtt_msec IS NOT NULL
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port
HAVING p99_rtt > 10  # Latency > 10ms
ORDER BY p99_rtt DESC;

Result: Connections with high latency
        (Identify performance bottlenecks)

Query 4: Blocked Connections (Firewall Denied)

sql
SELECT
  src_ip,
  dest_ip,
  dest_port,
  protocol,
  COUNT(*) as attempt_count
FROM `project.dataset.vpc_flow_logs`
WHERE (action = "DROPPED" OR tcp_flags = "SYN")  # Denied, or SYN with no response
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port, protocol
ORDER BY attempt_count DESC;

Result: VMs trying to connect to unreachable destinations
        (Check firewall rules and routing; Firewall Rules Logging records the allow/deny decision per rule)

Query 5: DDoS Detection Pattern

sql
SELECT
  dest_ip,
  dest_port,
  COUNT(DISTINCT src_ip) as unique_sources,
  COUNT(*) as total_flows
FROM `project.dataset.vpc_flow_logs`
WHERE bytes_received = 0  # No response data (one-way flood)
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY dest_ip, dest_port
HAVING unique_sources > 100 OR total_flows > 10000
ORDER BY total_flows DESC;

Result: Possible DDoS targets
        (Detect volumetric attacks early)
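
The same thresholds can be applied outside BigQuery, for example in an alerting job that processes recent records. This Python sketch mirrors the query's logic under the simplified field names used in this article; `find_flood_targets` is a hypothetical helper, and the default thresholds are the ones from the query.

```python
from collections import defaultdict


def find_flood_targets(records, min_sources=100, min_flows=10_000):
    """Return (target, unique_sources, total_flows) for destinations that
    receive many one-way flows, sorted by flow count descending."""
    stats = defaultdict(lambda: {"sources": set(), "flows": 0})
    for r in records:
        if r["bytes_received"] != 0:
            continue  # only consider flows with no response data
        s = stats[(r["dest_ip"], r["dest_port"])]
        s["sources"].add(r["src_ip"])
        s["flows"] += 1
    hits = [
        (target, len(s["sources"]), s["flows"])
        for target, s in stats.items()
        if len(s["sources"]) > min_sources or s["flows"] > min_flows
    ]
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Synthetic data: 150 distinct sources hitting one target, plus normal traffic
records = [
    {"src_ip": f"198.51.100.{i}", "dest_ip": "10.0.2.10",
     "dest_port": 443, "bytes_received": 0}
    for i in range(150)
]
records.append({"src_ip": "10.0.1.5", "dest_ip": "10.0.2.20",
                "dest_port": 3306, "bytes_received": 1024})
suspects = find_flood_targets(records)
print(suspects)
```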

Metadata Fields

Available Fields

Network fields:
  - src_ip, dest_ip
  - src_port, dest_port
  - protocol (TCP=6, UDP=17, ICMP=1)
  
Data fields:
  - bytes_sent, bytes_received
  - start_time, end_time
  - rtt_msec (round-trip time)
  
TCP-specific:
  - tcp_flags (SYN, ACK, FIN, RST)
  - tcp_rtt_microseconds
  
Context fields:
  - connection_state (NEW, ESTABLISHED, etc.)
  - src_instance_id, dest_instance_id (VM IDs)
  - src_vpc, dest_vpc (VPC names)
  - src_region, dest_region
  
Include with: --logging-metadata=include-all

Cost Impact of Metadata

Minimal metadata (just IPs/ports):
  ~100 bytes per flow entry
  
Full metadata (all fields):
  ~500 bytes per flow entry
  
5× cost difference!

Decision: Include metadata only for troubleshooting
          Use minimal for cost-sensitive analysis

Troubleshooting with Flow Logs

Symptom: Slow Database Connection

VM app-1 (10.0.1.5) reports slow queries to DB (10.0.2.10:3306)

Query flow logs:
SELECT
  CAST(rtt_msec as INT64) as latency,
  bytes_sent,
  bytes_received,
  timestamp
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
  AND dest_ip = "10.0.2.10"
  AND dest_port = 3306
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC
LIMIT 20;

Results:
  Latency 1-2ms: Normal (VPC-local)
  Latency 50-100ms: Abnormal (cross-region or congestion?)
  Latency 200+ms: Problem (check routing, network congestion)

Action:
  If latency jumped, check:
    1. VM CPU/memory (slow instance)
    2. Network congestion (check top talkers)
    3. DNS resolution delay
    4. Multi-region routing (wrong route?)
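
The latency bands and the "latency jumped" check above can be sketched in Python. The bands listed in the results leave gaps (2-50ms and 100-200ms); this sketch assumes the lower band applies in each gap. The factor of 5 in `latency_jumped` is an arbitrary illustrative threshold, and both function names are hypothetical.

```python
from statistics import median


def classify_rtt(rtt_msec: float) -> str:
    """Map a round-trip time onto the bands used in this article."""
    if rtt_msec >= 200:
        return "problem"
    if rtt_msec >= 50:
        return "abnormal"
    return "normal"


def latency_jumped(baseline, recent, factor=5.0) -> bool:
    """True when the recent median RTT exceeds the baseline median by
    more than `factor` (an assumed threshold, tune per environment)."""
    return median(recent) > factor * median(baseline)


print(classify_rtt(2), classify_rtt(75), classify_rtt(250))
print(latency_jumped([1, 2, 2, 1], [60, 80, 70]))
```

Comparing medians rather than single samples keeps one slow flow from triggering the check.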

Symptom: Connection Timeouts

VM app-1 cannot connect to external service (203.0.113.5:443)

Query:
SELECT
  action,
  tcp_flags,
  COUNT(*) as count
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
  AND dest_ip = "203.0.113.5"
  AND dest_port = 443
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY action, tcp_flags;

Results:
  action=DROPPED, count=1000+: Firewall is blocking the traffic
  action=ACCEPTED, tcp_flags=SYN, count=1000: No response (host down or filtered upstream)
  action=ACCEPTED, tcp_flags=SYN,ACK, count=1: Connection succeeded

Resolution:
  If DROPPED: Check firewall rules
    gcloud compute firewall-rules list --filter="..."
    
  If SYN gets no response: Check whether the destination is reachable
    traceroute 203.0.113.5

Symptom: Unexpected Data Egress

Cloud Billing shows $5000/month egress charge (unexpected)

Query:
SELECT
  dest_ip,
  REGEXP_SUBSTR(dest_ip, r'(^[0-9]+)') as dest_octet,
  SUM(CAST(bytes_sent as INT64)) as total_bytes
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND src_ip LIKE "10.%"  # VPC-local
  AND dest_ip NOT LIKE "10.%"  # External
GROUP BY dest_ip
ORDER BY total_bytes DESC
LIMIT 20;

Results:
  Identify which external IPs are receiving traffic
  
Then investigate:
  Is this expected? (backups, replication?)
  Compromised VM? (data exfiltration?)
  Misconfigured process? (logging to external API?)
  
Action:
  Block the destination with an egress firewall rule, or disable Private Google Access (PGA) on the affected subnet if the traffic targets Google APIs
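
The `LIKE "10.%"` test in the query only recognizes the 10.0.0.0/8 range. When post-processing records outside BigQuery, the standard-library `ipaddress` module can check all RFC 1918 ranges. The sketch below assumes the article's simplified field names; `is_internal` and `egress_by_destination` are hypothetical helpers.

```python
import ipaddress
from collections import Counter

# RFC 1918 private ranges; extend this list if the VPC uses other ranges
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def egress_by_destination(records):
    """Sum bytes sent from internal sources to external destinations,
    returning (dest_ip, total_bytes) pairs, largest first."""
    totals = Counter()
    for r in records:
        if is_internal(r["src_ip"]) and not is_internal(r["dest_ip"]):
            totals[r["dest_ip"]] += int(r["bytes_sent"])
    return totals.most_common()


records = [
    {"src_ip": "10.0.1.5", "dest_ip": "203.0.113.5", "bytes_sent": "5000"},
    {"src_ip": "10.0.1.5", "dest_ip": "203.0.113.5", "bytes_sent": "2500"},
    {"src_ip": "10.0.1.5", "dest_ip": "192.168.4.2", "bytes_sent": "9999"},
    {"src_ip": "10.0.1.6", "dest_ip": "198.51.100.7", "bytes_sent": "100"},
]
top_egress = egress_by_destination(records)
print(top_egress)
```

The explicit network list is used instead of `ipaddress`'s `is_private` property because that property also covers reserved documentation ranges, which would misclassify the example addresses here.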

Cost Management

Sampling Strategy

Development/Testing:
  Sample rate: 1% (or less)
  Retention: 7 days (Cloud Logging)
  Cost: $0-10/month

Production:
  Sample rate: 5-10% (for troubleshooting)
  Retention: 30 days (Cloud Logging) + 90 days (BigQuery)
  Cost: $500-2000/month

High-security environment:
  Sample rate: 50% (detect attacks)
  Retention: 1 year (BigQuery, compliance)
  Cost: $10,000-50,000/month

Cost Optimization

Tip 1: Disable VPC Flow Logs on non-critical subnets
       (dev, test networks)

Tip 2: Use 1% sampling for baseline, 100% only for incident response
       (temporary enablement for investigation)

Tip 3: Export to BigQuery with partition/clustering
       (faster queries, lower query cost)

Tip 4: Set table or partition expiration on BigQuery tables (auto-delete old logs)
       (prevent unbounded growth)

Tip 5: Schedule heavy analysis for off-peak hours
       (use reserved BigQuery slots for cost predictability)

Best Practices

Do:

  • Enable VPC Flow Logs on production subnets
  • Use BigQuery for long-term analysis
  • Set up alerts on suspicious patterns (DDoS, exfiltration)
  • Document flow log queries for team
  • Include metadata only when needed

Don't:

  • Use 100% sampling by default (too expensive)
  • Ignore flow logs (missed troubleshooting opportunities)
  • Store in Cloud Logging forever (costs accumulate)
  • Query flow logs without time bounds (slow queries)
  • Enable on every subnet unless necessary (cost adds up)

Conclusion

VPC Flow Logs provide powerful network visibility:

  • Granular: 5-tuple flows, per-flow RTT
  • Flexible: Multiple export destinations
  • Scalable: Configurable sampling
  • Actionable: SQL queries for analysis

Essential for: Production troubleshooting, security analysis, capacity planning.

Cost-effective when: Using appropriate sampling and retention policies.