VPC Flow Logs Analysis — Network Observability at Scale

Executive Summary

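VPC Flow Logs record a sample of the network flows sent and received by VM instances in a GCP VPC network. They capture connection metadata, not packet payloads, and the sample rate is configurable.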

Features:

✅ Granular network telemetry (one record per 5-tuple flow)
✅ Configurable sampling (1% → 100%)
✅ Delivered to Cloud Logging; export to BigQuery or Cloud Storage through log sinks
✅ Troubleshooting: latency, packet loss, DoS patterns
❌ Cost: ~$0.50/GB of logs generated (can be high at scale; check current pricing)
❌ Privacy: Captures connection-level metadata (IPs, ports, byte counts)

What VPC Flow Logs Capture

Flow Tuple

A 5-tuple identifies each flow:
  - Source IP: 10.0.1.5
  - Source port: 54321
  - Destination IP: 10.0.2.10
  - Destination port: 3306
  - Protocol: TCP

One flow record is emitted per 5-tuple per aggregation interval (default 5s).
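The grouping rule above can be sketched in a few lines of Python. This is only an illustration of "one record per 5-tuple per interval", not how GCP implements it; `FlowKey` and `aggregate_flows` are hypothetical names, and the packet tuples are made-up inputs.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP


def aggregate_flows(packets, interval_s=5):
    """Group (timestamp_s, FlowKey, size_bytes) packets into one record
    per 5-tuple per aggregation interval."""
    records = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts, key, size in packets:
        # Start of the aggregation window this packet falls into.
        window_start = int(ts // interval_s) * interval_s
        rec = records[(window_start, key)]
        rec["packets"] += 1
        rec["bytes"] += size
    return dict(records)


key = FlowKey("10.0.1.5", 54321, "10.0.2.10", 3306, 6)
packets = [(0.5, key, 100), (4.9, key, 200), (5.1, key, 300)]
records = aggregate_flows(packets)
```

The first two packets share the window starting at 0s and become one record; the third falls into the window starting at 5s and becomes a second record.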

Sample Log Entry

json

{
  "insertId": "abcdef1234567890",
  "resource": {
    "type": "gce_subnetwork",
    "labels": {
      "subnet_name": "prod-app",
      "network_name": "prod-vpc",
      "region": "us-central1"
    }
  },
  "timestamp": "2026-05-19T12:34:56Z",
  "jsonPayload": {
    "bytes_received": 1024,
    "bytes_sent": 2048,
    "dest_ip": "10.0.2.10",
    "dest_port": 3306,
    "protocol": 6,
    "src_ip": "10.0.1.5",
    "src_port": 54321,
    "start_time": 1779194091,
    "end_time": 1779194096,
    "tcp_flags": "SYN,ACK",
    "rtt_msec": 2
  }
}

Sampling: Trading Off Visibility Against Cost

Sampling Rates

1% sampling:
  - About 1 in 100 flows captured
  - Cost: Minimal
  - Data: Representative for high-volume aggregates
  - Accuracy: Lowest; rare or short-lived flows are often missed

10% sampling:
  - About 1 in 10 flows captured
  - Cost: 10× higher than 1%
  - Data: More detailed
  - Accuracy: Better; smaller flows start to show up reliably

50% sampling (the gcloud default, 0.5):
  - About 1 in 2 flows captured
  - Cost: 50× higher than 1%
  - Data: Very detailed
  - Accuracy: High; scaled-up estimates are close to true totals

100% sampling (all flows):
  - Every flow captured
  - Cost: Maximum (100× higher than 1%)
  - Data: Complete (massive)
  - For: High-value troubleshooting only
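To make the accuracy side of this tradeoff concrete, here is a minimal sketch. It assumes each flow is kept independently with probability equal to the sample rate, which is a simplification of how sampling works in practice. Under that assumption a sampled count k is scaled up as k / p, and the relative standard error of the estimate is sqrt((1 - p) / (p × N)) for N true flows. The function names are illustrative.

```python
import math


def scale_up(sampled_count, sample_rate):
    """Estimate the true flow count from a sampled count."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return sampled_count / sample_rate


def relative_std_error(true_flows, sample_rate):
    """Relative standard error of the scaled-up count, assuming each flow
    is sampled independently with probability sample_rate."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return math.sqrt((1 - sample_rate) / (sample_rate * true_flows))


# Error shrinks as the sample rate rises (10,000 true flows assumed).
errors = {rate: relative_std_error(10_000, rate) for rate in (0.01, 0.1, 0.5, 1.0)}
```

For 10,000 flows this gives roughly 10% error at 1% sampling, 3% at 10%, 1% at 50%, and zero at 100%. The error also falls as the number of flows grows, which is why low sample rates remain usable for high-volume aggregates.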

Calculating Cost Impact

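Typical workload (worst-case bound):

Network throughput: 10 Gbps average
  = 10 × 10^9 bits/sec
  = 1.25 × 10^9 bytes/sec
  = 1.25 GB/sec
  = 108 TB/day

The figures below assume log volume equals sampled traffic volume.
Flow records carry metadata only, not payloads, so real log volume
depends on the number of flow records and their size, and is typically
far smaller. Treat these numbers as an upper bound.

VPC Flow Logs (1% sampling):
  Size: 108 TB × 1% = 1.08 TB/day
  Cost: 1,080 GB × $0.50/GB = $540/day
       = $16,200/month (30 days)

VPC Flow Logs (100% sampling):
  Size: 108 TB/day
  Cost: 108,000 GB × $0.50/GB = $54,000/day
       = $1,620,000/month (!!)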
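The arithmetic above scales traffic bytes by the sample rate. A complementary sketch, shown here as an assumption-laden estimate, sizes the logs from the number of flow records instead. The $0.50/GB price is the figure used in this guide; the 20,000 records/sec workload and the 500-byte record size are hypothetical inputs.

```python
PRICE_PER_GB = 0.50       # USD per GB, the figure used in this guide
SECONDS_PER_DAY = 86_400


def throughput_tb_per_day(gbps):
    """Convert average throughput in Gbps to decimal TB/day."""
    return gbps * 1e9 / 8 * SECONDS_PER_DAY / 1e12


def daily_log_gb(records_per_sec, bytes_per_record, sample_rate):
    """Log volume in decimal GB/day for a given flow-record rate."""
    return records_per_sec * bytes_per_record * sample_rate * SECONDS_PER_DAY / 1e9


def daily_cost_usd(gb_per_day, price_per_gb=PRICE_PER_GB):
    """Daily cost at a flat per-GB price."""
    return gb_per_day * price_per_gb


traffic = throughput_tb_per_day(10)      # the 10 Gbps example: 108 TB/day
full = daily_log_gb(20_000, 500, 1.0)    # hypothetical: 864 GB/day
half = daily_log_gb(20_000, 500, 0.5)    # hypothetical: 432 GB/day
```

With these made-up inputs, full sampling costs about $432/day and 50% sampling about $216/day. The point is the shape of the formula: cost follows record count and record size, not payload bytes.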

Configuring VPC Flow Logs

Enable on Subnet

bash

gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-aggregation-interval=interval-5-sec \
  --logging-flow-sampling=0.5  # 50% sampling

# Metadata annotations (this does not set an export destination):
gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --logging-metadata=include-all  # Include all metadata fields
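When these settings are applied from scripts, a small helper can validate values before calling gcloud. The sketch below only builds the command string and does not run it. It assumes the gcloud flag names `--logging-aggregation-interval`, `--logging-flow-sampling`, and `--logging-metadata` with the value sets listed in the code; `flow_log_command` is a hypothetical helper name.

```python
import shlex

# Value sets assumed from the gcloud subnets update reference.
VALID_INTERVALS = {
    "interval-5-sec", "interval-30-sec", "interval-1-min",
    "interval-5-min", "interval-10-min", "interval-15-min",
}
VALID_METADATA = {"include-all", "exclude-all", "custom"}


def flow_log_command(subnet, region, interval="interval-5-sec",
                     sampling=0.5, metadata="include-all"):
    """Build (but do not run) a gcloud command that enables flow logs."""
    if interval not in VALID_INTERVALS:
        raise ValueError(f"unknown aggregation interval: {interval}")
    if not 0.0 <= sampling <= 1.0:
        raise ValueError("sampling must be between 0.0 and 1.0")
    if metadata not in VALID_METADATA:
        raise ValueError(f"unknown metadata mode: {metadata}")
    args = [
        "gcloud", "compute", "networks", "subnets", "update", subnet,
        f"--region={region}",
        "--enable-flow-logs",
        f"--logging-aggregation-interval={interval}",
        f"--logging-flow-sampling={sampling}",
        f"--logging-metadata={metadata}",
    ]
    return shlex.join(args)


cmd = flow_log_command("prod-app", "us-central1")
```

Rejecting a sample rate such as 1.5 or an interval such as "5s" in the helper surfaces the mistake before gcloud is invoked.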

Export Destinations

Option 1: Cloud Logging (default; flow logs always land here first)
  Retention: 30 days by default (configurable)
  Query: Use gcloud logging read or the Logs Explorer
  Cost: Flow Log cost + Logging storage

Option 2: BigQuery (through a Log Router sink)
  Retention: User-managed (can be years)
  Query: SQL (powerful for analysis)
  Cost: Flow Log cost + BigQuery storage + query cost

Option 3: Cloud Storage (through a Log Router sink)
  Retention: User-managed
  Query: Download and analyze locally
  Cost: Flow Log cost + GCS storage

Best practice: Cloud Logging for short-term troubleshooting,
               BigQuery for long-term analysis/compliance

Routing to BigQuery

bash

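gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs

# Then create a Log Router sink that routes flow logs to a BigQuery dataset.
# PROJECT_ID, flow_logs, and vpc-flows-to-bq are placeholders.
gcloud logging sinks create vpc-flows-to-bq \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/flow_logs \
  --log-filter='resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows")'

# Grant the sink's writer identity the BigQuery Data Editor role on the dataset.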

Alternative (Terraform; this enables flow logs on the subnet, routing to BigQuery still needs a log sink):
resource "google_compute_subnetwork" "prod_app" {
  name          = "prod-app"
  region        = "us-central1"
  network       = "prod-vpc"
  ip_cidr_range = "10.0.1.0/24"

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

Analysis: Common Queries

The queries below assume a flattened table or view, `project.dataset.vpc_flow_logs`, with one column per field. Tables created directly by a log sink nest these fields under jsonPayload (the 5-tuple under jsonPayload.connection), so adapt the column references to your schema.

Query 1: Top Talkers

sql

SELECT
  src_ip,
  COUNT(*) as flow_count,
  SUM(CAST(bytes_sent as INT64)) as total_bytes_sent
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp >= TIMESTAMP("2026-05-19T00:00:00Z")
  AND timestamp <  TIMESTAMP("2026-05-20T00:00:00Z")
GROUP BY src_ip
ORDER BY total_bytes_sent DESC
LIMIT 10;

Result: The VMs sending the most traffic
        (Helps identify data exfiltration or misconfigured processes)
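For logs exported to Cloud Storage and analyzed locally, the same top-talkers aggregation can be sketched in Python. The record shape (dicts with `src_ip` and `bytes_sent`) mirrors the flattened fields used in this guide and is an assumption about how the files were parsed; `top_talkers` is an illustrative name.

```python
from collections import Counter


def top_talkers(records, n=10):
    """Return the n source IPs with the most bytes sent, largest first."""
    totals = Counter()
    for rec in records:
        # bytes_sent may arrive as a string in exported logs, so cast it.
        totals[rec["src_ip"]] += int(rec["bytes_sent"])
    return totals.most_common(n)


records = [
    {"src_ip": "10.0.1.5", "bytes_sent": 2048},
    {"src_ip": "10.0.1.9", "bytes_sent": "512"},
    {"src_ip": "10.0.1.5", "bytes_sent": 1024},
]
result = top_talkers(records, n=2)
# result == [("10.0.1.5", 3072), ("10.0.1.9", 512)]
```

When logs are sampled, these totals describe the sampled flows only; divide by the sample rate for a rough estimate of true volume.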

Query 2: Traffic by Port

sql

SELECT
  dest_port,
  COUNT(*) as flow_count,
  COUNTIF(tcp_flags LIKE "%SYN%") as syn_count
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
                    AND CURRENT_TIMESTAMP()
GROUP BY dest_port
ORDER BY flow_count DESC
LIMIT 20;

Result: The most active destination ports
        A high SYN count relative to flow count can indicate scanning (many SYNs without data)

Query 3: Latency Analysis (RTT)

sql

SELECT
  src_ip,
  dest_ip,
  dest_port,
  APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(50)] as p50_rtt,
  APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(99)] as p99_rtt,
  MAX(CAST(rtt_msec as INT64)) as max_rtt
FROM `project.dataset.vpc_flow_logs`
WHERE rtt_msec IS NOT NULL
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port
HAVING p99_rtt > 10  # Latency > 10ms
ORDER BY p99_rtt DESC;

Result: Connections with high latency
        (Identify performance bottlenecks)
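The per-path percentile logic can also be sketched outside BigQuery. The code below uses an exact nearest-rank percentile, so its results can differ slightly from `APPROX_QUANTILES`, which is approximate. The record shape and the names `percentile` and `slow_paths` are assumptions for illustration.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slow_paths(records, threshold_ms=10):
    """Return RTT stats for (src_ip, dest_ip, dest_port) paths whose
    p99 RTT exceeds threshold_ms."""
    by_path = defaultdict(list)
    for rec in records:
        if rec.get("rtt_msec") is not None:
            path = (rec["src_ip"], rec["dest_ip"], rec["dest_port"])
            by_path[path].append(int(rec["rtt_msec"]))
    result = {}
    for path, rtts in by_path.items():
        p99 = percentile(rtts, 99)
        if p99 > threshold_ms:
            result[path] = {"p50": percentile(rtts, 50), "p99": p99, "max": max(rtts)}
    return result


records = [
    {"src_ip": "10.0.1.5", "dest_ip": "10.0.2.10", "dest_port": 3306, "rtt_msec": r}
    for r in (1, 2, 2, 50)
] + [
    {"src_ip": "10.0.1.6", "dest_ip": "10.0.2.10", "dest_port": 3306, "rtt_msec": r}
    for r in (1, 2)
]
slow = slow_paths(records)
```

Only the first path is reported, because a single 50 ms sample pushes its p99 above the 10 ms threshold while the second path stays at 2 ms.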

Query 4: Blocked Connections (Firewall Denied)

sql

SELECT
  src_ip,
  dest_ip,
  dest_port,
  protocol,
  COUNT(*) as attempt_count
FROM `project.dataset.vpc_flow_logs`
WHERE (action = "DROPPED" OR tcp_flags = "SYN")  # Dropped, or SYN without response
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port, protocol
ORDER BY attempt_count DESC;

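Result: VMs trying to connect to unreachable destinations
        (Check firewall rules, routing)
        Note: the action column assumes firewall verdicts are joined into the table,
        for example from Firewall Rules Logging.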

Query 5: DDoS Detection Pattern

sql

SELECT
  dest_ip,
  dest_port,
  COUNT(DISTINCT src_ip) as unique_sources,
  COUNT(*) as total_flows
FROM `project.dataset.vpc_flow_logs`
WHERE bytes_received = 0  # No response data (one-way flood)
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY dest_ip, dest_port
HAVING unique_sources > 100 OR total_flows > 10000
ORDER BY total_flows DESC;

Result: Possible DDoS targets
        (Detect volumetric attacks early)
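The same detection rule can be sketched as a streaming-friendly Python function, for example to run against a recent window of parsed records. Thresholds mirror the query (more than 100 unique sources or more than 10,000 flows); the record shape and the name `flood_candidates` are assumptions.

```python
from collections import defaultdict


def flood_candidates(records, min_sources=100, min_flows=10_000):
    """Flag (dest_ip, dest_port) targets hit by many one-way flows.

    Returns (target, unique_sources, total_flows) tuples, busiest first.
    """
    sources = defaultdict(set)
    flows = defaultdict(int)
    for rec in records:
        if rec["bytes_received"] != 0:
            continue  # only flows with no response data
        target = (rec["dest_ip"], rec["dest_port"])
        sources[target].add(rec["src_ip"])
        flows[target] += 1
    flagged = [
        (target, len(sources[target]), flows[target])
        for target in flows
        if len(sources[target]) > min_sources or flows[target] > min_flows
    ]
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# 150 distinct sources sending one-way traffic to one target, plus normal traffic.
attack = [
    {"src_ip": f"198.51.100.{i}", "dest_ip": "10.0.2.10", "dest_port": 443, "bytes_received": 0}
    for i in range(150)
]
normal = [
    {"src_ip": "10.0.1.5", "dest_ip": "10.0.2.20", "dest_port": 3306, "bytes_received": 1024}
    for _ in range(5)
]
flagged = flood_candidates(attack + normal)
```

With sampling enabled, observed counts are lower than true counts, so thresholds may need to be scaled by the sample rate.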

Metadata Fields

Available Fields

Network fields:
  - src_ip, dest_ip
  - src_port, dest_port
  - protocol (TCP=6, UDP=17, ICMP=1)
  
Data fields:
  - bytes_sent, bytes_received
  - start_time, end_time
  - rtt_msec (round-trip time)
  
TCP-specific:
  - tcp_flags (SYN, ACK, FIN, RST)
  - tcp_rtt_microseconds
  
Context fields:
  - connection_state (NEW, ESTABLISHED, etc.)
  - src_instance_id, dest_instance_id (VM IDs)
  - src_vpc, dest_vpc (VPC names)
  - src_region, dest_region
  
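Include with: --logging-metadata=include-all

Field names here follow this guide's flattened table. Check them against
the VPC Flow Logs record format for your export before relying on them.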

Cost Impact of Metadata

Minimal metadata (just IPs/ports):
  ~100 bytes per flow entry
  
Full metadata (all fields):
  ~500 bytes per flow entry
  
About a 5× difference in log volume, and therefore in cost.

Decision: Include full metadata only where troubleshooting needs it
          Use minimal metadata for cost-sensitive analysis

Troubleshooting with Flow Logs

Symptom: Slow Database Connection

VM app-1 (10.0.1.5) reports slow queries to DB (10.0.2.10:3306)

Query flow logs:
SELECT
  CAST(rtt_msec as INT64) as latency,
  bytes_sent,
  bytes_received,
  timestamp
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
  AND dest_ip = "10.0.2.10"
  AND dest_port = 3306
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC
LIMIT 20;

Results:
  Latency 1-2ms: Normal (VPC-local)
  Latency 50-100ms: Abnormal (cross-region or congestion?)
  Latency 200+ms: Problem (check routing, network congestion)

Action:
  If latency jumped, check:
    1. VM CPU/memory (slow instance)
    2. Network congestion (check top talkers)
    3. DNS resolution delay
    4. Multi-region routing (wrong route?)

Symptom: Connection Timeouts

VM app-1 cannot connect to external service (203.0.113.5:443)

Query:
SELECT
  action,
  tcp_flags,
  COUNT(*) as count
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
  AND dest_ip = "203.0.113.5"
  AND dest_port = 443
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY action, tcp_flags;

Results:
  Action=DROPPED, Count=1000+: Firewall blocking
  Action=ACCEPTED, tcp_flags=SYN, Count=1000: No response (host down or filtering)
  Action=ACCEPTED, tcp_flags=SYN,ACK, Count=1: Success

Resolution:
  If DROPPED: Check firewall rules
    gcloud compute firewall-rules list --filter="..."
    
  If SYN gets no response: Check whether the destination is reachable
    traceroute 203.0.113.5
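The interpretation rules above can be written down as a small decision function. This is a sketch of the guide's own reading of the results, using the `action` and `tcp_flags` values as they appear in the query output; `diagnose` and its return labels are hypothetical.

```python
def diagnose(rows):
    """Classify query output rows of (action, tcp_flags, count).

    Returns one of: "firewall_blocking", "no_response",
    "connection_ok", "no_data".
    """
    dropped = sum(c for a, _, c in rows if a == "DROPPED")
    syn_only = sum(c for a, f, c in rows if a == "ACCEPTED" and f == "SYN")
    answered = sum(c for a, f, c in rows if a == "ACCEPTED" and "ACK" in f)

    if dropped > 0:
        return "firewall_blocking"   # check firewall rules
    if syn_only > 0 and answered == 0:
        return "no_response"         # host down or filtering upstream
    if answered > 0:
        return "connection_ok"       # at least one handshake was answered
    return "no_data"                 # nothing matched in the time window


status = diagnose([("ACCEPTED", "SYN", 1000)])
# status == "no_response"
```

A "no_data" result is worth treating as a finding in itself: at low sample rates, a handful of connection attempts may simply not have been sampled.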

Symptom: Unexpected Data Egress

Cloud Billing shows $5000/month egress charge (unexpected)

Query:
SELECT
  dest_ip,
  REGEXP_SUBSTR(dest_ip, r'(^[0-9]+)') as dest_octet,
  SUM(CAST(bytes_sent as INT64)) as total_bytes
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND src_ip LIKE "10.%"  # VPC-local (assumes the VPC uses 10.0.0.0/8)
  AND dest_ip NOT LIKE "10.%"  # Everything else is treated as external
GROUP BY dest_ip
ORDER BY total_bytes DESC
LIMIT 20;

Results:
  Identifies which external IPs receive the most traffic
  
Then investigate:
  Is this expected? (backups, replication?)
  Compromised VM? (data exfiltration?)
  Misconfigured process? (logging to external API?)
  
Action:
  Block with an egress firewall rule, or disable Private Google Access (PGA) on the affected subnet
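The `LIKE "10.%"` test in the query only recognizes the 10.0.0.0/8 range. For local analysis, a sketch using Python's standard `ipaddress` module can classify all three RFC 1918 private ranges. The record shape and the names `is_internal` and `external_egress` are assumptions for illustration.

```python
import ipaddress
from collections import Counter

# RFC 1918 private IPv4 ranges.
PRIVATE_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(ip):
    """True if ip falls inside one of the RFC 1918 private ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)


def external_egress(records, n=20):
    """Total bytes sent from internal sources to each external destination."""
    totals = Counter()
    for rec in records:
        if is_internal(rec["src_ip"]) and not is_internal(rec["dest_ip"]):
            totals[rec["dest_ip"]] += int(rec["bytes_sent"])
    return totals.most_common(n)


records = [
    {"src_ip": "10.0.1.5", "dest_ip": "203.0.113.5", "bytes_sent": 5000},
    {"src_ip": "10.0.1.5", "dest_ip": "172.16.5.4", "bytes_sent": 9000},   # internal
    {"src_ip": "10.0.1.7", "dest_ip": "203.0.113.5", "bytes_sent": 2500},
]
egress = external_egress(records)
# egress == [("203.0.113.5", 7500)]
```

The 172.16.5.4 destination is correctly excluded here, whereas the SQL filter would count it as external.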

Cost Management

Sampling Strategy

Development/Testing:
  Sample rate: 1% (or less)
  Retention: 7 days (Cloud Logging)
  Cost: $0-10/month

Production:
  Sample rate: 5-10% (for troubleshooting)
  Retention: 30 days (Cloud Logging) + 90 days (BigQuery)
  Cost: $500-2000/month

High-security environment:
  Sample rate: 50% (detect attacks)
  Retention: 1 year (BigQuery, compliance)
  Cost: $10,000-50,000/month

Cost Optimization

Tip 1: Disable VPC Flow Logs on non-critical subnets
       (dev, test networks)

Tip 2: Use 1% sampling for baseline, 100% only for incident response
       (temporary enablement for investigation)

Tip 3: Export to BigQuery with partitioned and clustered tables
       (faster queries, lower query cost)

Tip 4: Set table or partition expiration on BigQuery tables (auto-delete old logs)
       (prevent unbounded growth)

Tip 5: Schedule heavy analysis for off-peak hours and consider BigQuery slot reservations
       (predictable cost instead of per-query billing)

Best Practices

✅ Do:

Enable VPC Flow Logs on production subnets
Use BigQuery for long-term analysis
Set up alerts on suspicious patterns (DDoS, exfiltration)
Document flow log queries for team
Include metadata only when needed

❌ Don't:

Use 100% sampling by default (too expensive)
Ignore flow logs (missed troubleshooting opportunities)
Store in Cloud Logging forever (costs accumulate)
Query flow logs without time bounds (slow queries)
Enable on every subnet unless necessary (cost adds up)

Conclusion

VPC Flow Logs provide powerful network visibility:

Granular: 5-tuple flows, per-flow round-trip latency
Flexible: Multiple export destinations
Scalable: Configurable sampling
Actionable: SQL queries for analysis

Essential for: Production troubleshooting, security analysis, capacity planning.

Cost-effective when: Using appropriate sampling and retention policies.
