VPC Flow Logs Analysis — Network Observability at Scale
Executive Summary
VPC Flow Logs = GCP records a sample of the network flows sent and received by VMs (sample rate configurable). It records flow metadata, not packet payloads.
Features:
- ✅ Granular network telemetry (one record per 5-tuple flow)
- ✅ Configurable sampling (1% → 100%)
- ✅ Export to Cloud Logging, BigQuery, Cloud Storage
- ✅ Troubleshooting: latency, packet loss, DoS patterns
- ❌ Cost: ~$0.50/GB (can be high at scale)
- ❌ Privacy: Captures flow-level metadata (IPs, ports, byte counts)
What VPC Flow Logs Capture
Flow Tuple
5-tuple identifies flow:
- Source IP: 10.0.1.5
- Source port: 54321
- Destination IP: 10.0.2.10
- Destination port: 3306
- Protocol: TCP
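To make the grouping concrete, here is a minimal sketch of how packets sharing a 5-tuple could be rolled up into one record per aggregation window. The `FiveTuple` type, the `aggregate_flows` helper, and the packet list are hypothetical illustrations, not GCP's actual implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP


def aggregate_flows(packets, interval_s=5):
    """Group (timestamp, five_tuple, size_bytes) packets into flow records.

    Returns a dict keyed by (five_tuple, window_start); each value holds
    the packet and byte totals for that tuple within that window.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for ts, key, size in packets:
        window_start = int(ts // interval_s) * interval_s
        record = flows[(key, window_start)]
        record["packets"] += 1
        record["bytes"] += size
    return dict(flows)


db_flow = FiveTuple("10.0.1.5", 54321, "10.0.2.10", 3306, 6)
packets = [
    (0.2, db_flow, 500),  # falls in the window starting at 0s
    (3.9, db_flow, 524),  # same window
    (6.1, db_flow, 100),  # next window, starting at 5s
]
records = aggregate_flows(packets)
print(records)
```

Three packets on the same tuple produce two records here, because the third packet arrives in the next 5-second window.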
One flow record per 5-tuple per aggregation interval (default 5s)

Sample Log Entry
json
{
  "insertId": "abcdef1234567890",
  "resource": {
    "type": "gce_subnetwork",
    "labels": {
      "subnet_name": "prod-app",
      "network_name": "prod-vpc",
      "region": "us-central1"
    }
  },
  "timestamp": "2026-05-19T12:34:56Z",
  "jsonPayload": {
    "bytes_received": 1024,
    "bytes_sent": 2048,
    "dest_ip": "10.0.2.10",
    "dest_port": 3306,
    "protocol": 6,
    "src_ip": "10.0.1.5",
    "src_port": 54321,
    "start_time": 1234567890,
    "end_time": 1234567895,
    "tcp_flags": "SYN,ACK",
    "rtt_msec": 2
  }
}

Sampling: Tradeoff Performance vs Cost
Sampling Rates
1% sampling:
- 1 in 100 flows captured
- Cost: Minimal
- Data: Statistically representative for high-volume traffic
- Accuracy: Lowest; rare or short-lived flows are easily missed
10% sampling:
- 1 in 10 flows captured
- Cost: 10× the 1% rate
- Data: More detailed
- Accuracy: Better; sampling error shrinks as the rate rises
50% sampling (the gcloud default when no rate is given):
- 1 in 2 flows captured
- Cost: 50× the 1% rate
- Data: Very detailed
- Accuracy: High; most flows are observed
100% sampling (all flows):
- Every flow captured
- Cost: Maximum (100× the 1% rate)
- Data: Complete (massive)
- For: High-value troubleshooting only

Calculating Cost Impact
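Before pricing a sampling rate, it helps to quantify what accuracy each of the rates above buys. The sketch below scales a sampled count back up to an estimated total and computes the relative standard error of that estimate. It assumes each flow is kept independently with probability equal to the sample rate, which is a simplification and not a description of GCP's actual sampling pipeline. The function names and the 10,000-flow figure are illustrative.

```python
import math


def estimate_total(sampled_count, sample_rate):
    """Scale a sampled flow count back up to an estimate of the true count."""
    return sampled_count / sample_rate


def relative_std_error(true_count, sample_rate):
    """Relative standard error of the scaled-up estimate.

    Assumes independent per-flow sampling, so the sampled count follows
    Binomial(true_count, sample_rate).
    """
    return math.sqrt((1 - sample_rate) / (true_count * sample_rate))


for rate in (0.01, 0.10, 0.50, 1.00):
    err = relative_std_error(true_count=10_000, sample_rate=rate)
    print(f"rate={rate:.2f}  relative error ~ {err:.1%}")
```

Under this model, with 10,000 true flows the error falls from roughly 10% at 1% sampling to about 3% at 10%, 1% at 50%, and zero at 100%. Error shrinks as the sampling rate rises, and also as traffic volume grows.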
Typical workload:
Network throughput: 10 Gbps average
= 10 × 10^9 bits/sec
= 1.25 × 10^9 bytes/sec
= 1.25 GB/sec
= 108 TB/day
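Caveat: flow logs are billed on the volume of log records generated, which scales with the number of flows and the record size, not with bytes transferred. The figures below simplify by treating log volume as the sampled share of traffic volume, so read them as a pessimistic illustration rather than a forecast.
VPC Flow Logs (1% sampling):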
Size: 108 TB × 1% = 1.08 TB/day
Cost: 1.08 TB × $0.50/GB = $540/day
= $16,200/month
VPC Flow Logs (100% sampling):
Size: 108 TB/day
Cost: 108 TB × $0.50/GB = $54,000/day
= $1,620,000/month (!!)

Configuring VPC Flow Logs
Enable on Subnet
bash
# Enable flow logs with a 5-second aggregation interval, 50% sampling,
# and all metadata fields included.
gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-aggregation-interval=interval-5-sec \
  --logging-flow-sample-rate=0.5 \
  --logging-metadata=include-all

Export Destinations
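Flow log entries are delivered to Cloud Logging first; BigQuery and Cloud Storage receive them through Log Router sinks.
Option 1: Cloud Logging (default)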
Retention: 30 days (configurable)
Query: Use gcloud logging read or Cloud Logging UI
Cost: Flow Log cost + Logging storage
Option 2: BigQuery
Retention: User-managed (can be years)
Query: SQL (powerful for analysis)
Cost: Flow Log cost + BigQuery storage + query cost
Option 3: Cloud Storage
Retention: User-managed
Query: Download and analyze locally
Cost: Flow Log cost + GCS storage
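For the download-and-analyze-locally route in Option 3, a small script is often enough. The sketch below assumes the exported objects are newline-delimited JSON and that each entry uses the flattened `jsonPayload` field names from the sample log entry in this document. Real exports may nest or name fields differently, so adjust the key lookups accordingly. `top_talkers` is a hypothetical helper.

```python
import json
from collections import Counter


def top_talkers(ndjson_lines, limit=10):
    """Return (src_ip, total_bytes_sent) pairs, largest sender first."""
    totals = Counter()
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        payload = json.loads(line)["jsonPayload"]
        # bytes_sent may arrive as a string in exported logs, so coerce it
        totals[payload["src_ip"]] += int(payload["bytes_sent"])
    return totals.most_common(limit)


sample = [
    '{"jsonPayload": {"src_ip": "10.0.1.5", "bytes_sent": 2048}}',
    '{"jsonPayload": {"src_ip": "10.0.1.9", "bytes_sent": "512"}}',
    '{"jsonPayload": {"src_ip": "10.0.1.5", "bytes_sent": 1024}}',
]
print(top_talkers(sample))
```

In practice `ndjson_lines` would be an open file handle over a downloaded export object, since iterating a file yields one line at a time.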
Best practice: Cloud Logging for short-term troubleshooting,
BigQuery for long-term analysis/compliance

Routing to BigQuery
bash
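# 1. Enable flow logs on the subnet
gcloud compute networks subnets update prod-app \
  --region=us-central1 \
  --enable-flow-logs
# 2. Route entries to BigQuery with a Log Router sink (PROJECT_ID, DATASET, and the sink name are placeholders)
gcloud logging sinks create vpc-flows-to-bq \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET \
  --log-filter='resource.type="gce_subnetwork" AND log_id("compute.googleapis.com/vpc_flows")'
# 3. Grant the sink's writer identity the BigQuery Data Editor role on the dataset
Subnet logging settings in Terraform (the sink is created separately):
resource "google_compute_subnetwork" "prod_app" {
  name          = "prod-app"
  region        = "us-central1"
  network       = "prod-vpc" # VPC name or self link
  ip_cidr_range = "10.0.1.0/24"

  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sample_rate     = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

Analysis: Common Queries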
Query 1: Top Talkers
sql
SELECT
src_ip,
COUNT(*) as flow_count,
SUM(CAST(bytes_sent as INT64)) as total_bytes_sent
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp BETWEEN "2026-05-19T00:00:00Z"
AND "2026-05-19T23:59:59Z"
GROUP BY src_ip
ORDER BY total_bytes_sent DESC
LIMIT 10;
Result: Which VMs are sending the most traffic?
(Helps identify data exfiltration or misconfigured processes)

Query 2: Traffic by Port
sql
SELECT
dest_port,
COUNT(*) as flow_count,
COUNTIF(tcp_flags LIKE "%SYN%") as syn_count
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND CURRENT_TIMESTAMP()
GROUP BY dest_port
ORDER BY flow_count DESC
LIMIT 20;
Result: Which ports are most active?
Identify scanning (many SYN flows without data)

Query 3: Latency Analysis (RTT)
sql
SELECT
src_ip,
dest_ip,
dest_port,
APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(50)] as p50_rtt,
APPROX_QUANTILES(CAST(rtt_msec as INT64), 100)[OFFSET(99)] as p99_rtt,
MAX(CAST(rtt_msec as INT64)) as max_rtt
FROM `project.dataset.vpc_flow_logs`
WHERE rtt_msec IS NOT NULL
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port
HAVING p99_rtt > 10 # Latency > 10ms
ORDER BY p99_rtt DESC;
Result: Connections with high latency
(Identify performance bottlenecks)

Query 4: Blocked Connections (Firewall Denied)
sql
SELECT
src_ip,
dest_ip,
dest_port,
protocol,
COUNT(*) as attempt_count
FROM `project.dataset.vpc_flow_logs`
WHERE (action = "DROPPED" OR tcp_flags = "SYN") # dropped, or SYN with no response
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY src_ip, dest_ip, dest_port, protocol
ORDER BY attempt_count DESC;
Result: VMs trying to connect to unreachable destinations
(Check firewall rules, routing)

Query 5: DDoS Detection Pattern
sql
SELECT
dest_ip,
dest_port,
COUNT(DISTINCT src_ip) as unique_sources,
COUNT(*) as total_flows
FROM `project.dataset.vpc_flow_logs`
WHERE bytes_received = 0 # No response data (one-way flood)
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY dest_ip, dest_port
HAVING unique_sources > 100 OR total_flows > 10000
ORDER BY total_flows DESC;
Result: Possible DDoS targets
(Detect volumetric attacks early)

Metadata Fields
Available Fields
Network fields:
- src_ip, dest_ip
- src_port, dest_port
- protocol (TCP=6, UDP=17, ICMP=1)
Data fields:
- bytes_sent, bytes_received
- start_time, end_time
- rtt_msec (round-trip time)
TCP-specific:
- tcp_flags (SYN, ACK, FIN, RST)
- tcp_rtt_microseconds
Context fields:
- connection_state (NEW, ESTABLISHED, etc.)
- src_instance_id, dest_instance_id (VM IDs)
- src_vpc, dest_vpc (VPC names)
- src_region, dest_region
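When these fields are consumed in code, a small typed wrapper keeps the field handling in one place. The sketch below is illustrative only: `FlowRecord` and `parse_flow` are hypothetical names, and the flat key layout follows the sample log entry in this document rather than a verified export schema.

```python
from dataclasses import dataclass
from typing import Optional

# IANA protocol numbers listed above
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


@dataclass(frozen=True)
class FlowRecord:
    src_ip: str
    src_port: int
    dest_ip: str
    dest_port: int
    protocol: str
    bytes_sent: int
    rtt_msec: Optional[int]  # None when no RTT was measured


def parse_flow(payload):
    """Build a FlowRecord from a jsonPayload-style dict."""
    number = int(payload["protocol"])
    rtt = payload.get("rtt_msec")
    return FlowRecord(
        src_ip=payload["src_ip"],
        src_port=int(payload["src_port"]),
        dest_ip=payload["dest_ip"],
        dest_port=int(payload["dest_port"]),
        # Unknown protocol numbers are kept visible instead of dropped
        protocol=PROTOCOL_NAMES.get(number, f"proto-{number}"),
        bytes_sent=int(payload["bytes_sent"]),
        rtt_msec=None if rtt is None else int(rtt),
    )


record = parse_flow({
    "src_ip": "10.0.1.5", "src_port": 54321,
    "dest_ip": "10.0.2.10", "dest_port": 3306,
    "protocol": 6, "bytes_sent": 2048, "rtt_msec": 2,
})
print(record)
```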
Include with: --logging-metadata=include-all

Cost Impact of Metadata
Minimal metadata (just IPs/ports):
~100 bytes per flow entry
Full metadata (all fields):
~500 bytes per flow entry
5× cost difference!
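The per-entry sizes above translate into a daily bill once a flow-record rate is known. The sketch below is a rough estimator under stated assumptions: the 10,000 records per second is a hypothetical workload, the price is the approximate $0.50/GB quoted in this document, and `daily_log_cost` is an illustrative helper, not a billing API.

```python
PRICE_PER_GB = 0.50  # USD, approximate rate quoted in this document
SECONDS_PER_DAY = 86_400


def daily_log_cost(flow_records_per_sec, bytes_per_record, sample_rate):
    """Estimate daily log volume (GB) and cost (USD)."""
    bytes_per_day = (
        flow_records_per_sec * SECONDS_PER_DAY * bytes_per_record * sample_rate
    )
    gb_per_day = bytes_per_day / 1e9
    return gb_per_day, gb_per_day * PRICE_PER_GB


# Hypothetical workload: 10,000 flow records/sec before sampling, 50% sampling
minimal_gb, minimal_usd = daily_log_cost(10_000, 100, 0.5)
full_gb, full_usd = daily_log_cost(10_000, 500, 0.5)
print(f"minimal metadata: {minimal_gb:.1f} GB/day, ${minimal_usd:.2f}/day")
print(f"full metadata:    {full_gb:.1f} GB/day, ${full_usd:.2f}/day")
```

Because cost is linear in record size, the 5× ratio holds at any flow rate or sampling rate. Only the absolute figures change.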
Decision: Include full metadata only for troubleshooting;
use minimal metadata for cost-sensitive analysis

Troubleshooting with Flow Logs
Symptom: Slow Database Connection
VM app-1 (10.0.1.5) reports slow queries to DB (10.0.2.10:3306)
Query flow logs:
SELECT
CAST(rtt_msec as INT64) as latency,
bytes_sent,
bytes_received,
timestamp
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
AND dest_ip = "10.0.2.10"
AND dest_port = 3306
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY timestamp DESC
LIMIT 20;
Results:
Latency 1-2ms: Normal (VPC-local)
Latency 50-100ms: Abnormal (cross-region or congestion?)
Latency 200+ms: Problem (check routing, network congestion)
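The bands above can be applied to query results mechanically. In this sketch the cutoffs at 50 ms and 200 ms come from the bands listed, while treating everything below 50 ms as normal (and 100-200 ms as abnormal) is an assumption that fills the gaps between them. `classify_rtt` and `summarize` are hypothetical helpers.

```python
from collections import Counter


def classify_rtt(rtt_msec):
    """Map a round-trip time in milliseconds to a severity label."""
    if rtt_msec >= 200:
        return "problem"
    if rtt_msec >= 50:
        return "abnormal"
    return "normal"  # assumption: anything under 50 ms is acceptable


def summarize(rtts):
    """Count how many flow records fall into each severity label."""
    return Counter(classify_rtt(r) for r in rtts)


summary = summarize([1, 2, 2, 75, 230])
print(summary)
```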
Action:
If latency jumped, check:
1. VM CPU/memory (slow instance)
2. Network congestion (check top talkers)
3. DNS resolution delay
4. Multi-region routing (wrong route?)

Symptom: Connection Timeouts
VM app-1 cannot connect to external service (203.0.113.5:443)
Query:
SELECT
action,
tcp_flags,
COUNT(*) as count
FROM `project.dataset.vpc_flow_logs`
WHERE src_ip = "10.0.1.5"
AND dest_ip = "203.0.113.5"
AND dest_port = 443
AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
GROUP BY action, tcp_flags;
Results:
Action=DROPPED, Count=1000+: Firewall blocking
Action=ACCEPTED, tcp_flags=SYN, Count=1000: No response (host down or filtering)
Action=ACCEPTED, tcp_flags=SYN,ACK, Count=1: Success
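The interpretation above can be expressed as a small decision helper over the query's output rows. This is a sketch: `diagnose` is a hypothetical function, and the `action` and `tcp_flags` values follow the conventions used in this document's examples.

```python
def diagnose(rows):
    """Summarize (action, tcp_flags, count) rows into a likely cause."""
    totals = {"dropped": 0, "unanswered": 0, "established": 0}
    for action, tcp_flags, count in rows:
        # Accept both "SYN,ACK" and "SYN+ACK" spellings
        flags = set(tcp_flags.replace("+", ",").split(","))
        if action == "DROPPED":
            totals["dropped"] += count
        elif flags == {"SYN"}:
            totals["unanswered"] += count
        elif {"SYN", "ACK"} <= flags:
            totals["established"] += count
    if not any(totals.values()):
        return "no-data", totals
    # The category with the most flows is reported as the likely cause
    return max(totals, key=totals.get), totals


rows = [("ACCEPTED", "SYN", 1000), ("ACCEPTED", "SYN,ACK", 1)]
cause, totals = diagnose(rows)
print(cause, totals)
```

A result of `dropped` points at firewall rules, `unanswered` at a destination that is down or filtering, and `established` at a connection that is succeeding at the network level.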
Resolution:
If DROPPED: Check firewall rules
gcloud compute firewall-rules list --filter="..."
If SYN with no response: Check whether the destination is alive
traceroute 203.0.113.5

Symptom: Unexpected Data Egress
Cloud Billing shows $5000/month egress charge (unexpected)
Query:
SELECT
dest_ip,
REGEXP_SUBSTR(dest_ip, r'(^[0-9]+)') as dest_octet,
SUM(CAST(bytes_sent as INT64)) as total_bytes
FROM `project.dataset.vpc_flow_logs`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND src_ip LIKE "10.%" # Internal source (assumes the VPC uses only 10.0.0.0/8)
AND dest_ip NOT LIKE "10.%" # Destination outside 10.0.0.0/8
GROUP BY dest_ip
ORDER BY total_bytes DESC
LIMIT 20;
Results:
Identify which external IPs are receiving traffic
Then investigate:
Is this expected? (backups, replication?)
Compromised VM? (data exfiltration?)
Misconfigured process? (logging to external API?)
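The `LIKE "10.%"` filter in the query covers only one private range. When triaging exported rows offline, a stricter internal/external split can be sketched with the standard `ipaddress` module. The explicit RFC 1918 list is deliberate: `ipaddress`'s built-in `is_private` also flags documentation ranges such as 203.0.113.0/24, which would hide the example destination. `is_internal` and `external_egress` are hypothetical helpers, and the sketch handles IPv4 only.

```python
import ipaddress
from collections import Counter

# RFC 1918 private ranges (IPv4 only)
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_internal(ip):
    """True when the address falls inside an RFC 1918 range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)


def external_egress(flows):
    """Total bytes per external destination, largest first.

    flows: iterable of (src_ip, dest_ip, bytes_sent) tuples.
    """
    totals = Counter()
    for src, dest, sent in flows:
        if is_internal(src) and not is_internal(dest):
            totals[dest] += sent
    return totals.most_common()


flows = [
    ("10.0.1.5", "203.0.113.5", 2048),
    ("10.0.1.5", "10.0.2.10", 999),    # internal, ignored
    ("10.0.1.5", "172.16.0.9", 500),   # internal, ignored
    ("10.0.1.9", "203.0.113.5", 100),
    ("10.0.1.5", "198.51.100.7", 50),
]
print(external_egress(flows))
```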
Action:
Block with an egress firewall rule, or disable Private Google Access (PGA) on the affected subnet

Cost Management
Sampling Strategy
Development/Testing:
Sample rate: 1% (or less)
Retention: 7 days (Cloud Logging)
Cost: $0-10/month
Production:
Sample rate: 5-10% (for troubleshooting)
Retention: 30 days (Cloud Logging) + 90 days (BigQuery)
Cost: $500-2000/month
High-security environment:
Sample rate: 50% (detect attacks)
Retention: 1 year (BigQuery, compliance)
Cost: $10,000-50,000/month

Cost Optimization
Tip 1: Disable VPC Flow Logs on non-critical subnets
(dev, test networks)
Tip 2: Use 1% sampling for baseline, 100% only for incident response
(temporary enablement for investigation)
Tip 3: Export to BigQuery with partition/clustering
(faster queries, lower query cost)
Tip 4: Use TTL on BigQuery tables (auto-delete old logs)
(prevent unbounded growth)
Tip 5: Schedule heavy analysis for off-peak hours
(with BigQuery slot reservations, this keeps query cost predictable)

Best Practices
✅ Do:
- Enable VPC Flow Logs on production subnets
- Use BigQuery for long-term analysis
- Set up alerts on suspicious patterns (DDoS, exfiltration)
- Document flow log queries for team
- Include metadata only when needed
❌ Don't:
- Use 100% sampling by default (too expensive)
- Ignore flow logs (missed troubleshooting opportunities)
- Store in Cloud Logging forever (costs accumulate)
- Query flow logs without time bounds (slow queries)
- Enable on every subnet unless necessary (cost adds up)
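The alerting practice listed above can start as a simple threshold check. The sketch below mirrors the logic of Query 5 (many distinct sources or many flows toward one target, with no response bytes) over already-fetched rows. The thresholds are the ones used in that query, while `ddos_suspects` and the row layout are hypothetical.

```python
from collections import defaultdict


def ddos_suspects(flows, min_sources=100, min_flows=10_000):
    """Return (dest_ip, dest_port) targets that exceed either threshold.

    flows: iterable of dicts with src_ip, dest_ip, dest_port, bytes_received.
    Only flows with no response bytes are counted, as in Query 5.
    """
    sources = defaultdict(set)
    counts = defaultdict(int)
    for flow in flows:
        if flow["bytes_received"] != 0:
            continue
        target = (flow["dest_ip"], flow["dest_port"])
        sources[target].add(flow["src_ip"])
        counts[target] += 1
    flagged = [
        t for t in counts
        if len(sources[t]) > min_sources or counts[t] > min_flows
    ]
    # Busiest target first
    return sorted(flagged, key=lambda t: counts[t], reverse=True)


# Synthetic data: 150 distinct sources hitting one target, plus normal traffic
attack = [
    {"src_ip": f"198.51.100.{i}", "dest_ip": "10.0.2.10",
     "dest_port": 443, "bytes_received": 0}
    for i in range(150)
]
normal = [
    {"src_ip": "10.0.1.5", "dest_ip": "10.0.2.20",
     "dest_port": 3306, "bytes_received": 1024}
]
print(ddos_suspects(attack + normal))
```

A scheduled job could run the equivalent query every few minutes and page when the returned list is non-empty. The right thresholds depend on the normal traffic profile of each service.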
Conclusion
VPC Flow Logs provide powerful network visibility:
- Granular: 5-tuple flows, per-flow RTT
- Flexible: Multiple export destinations
- Scalable: Configurable sampling
- Actionable: SQL queries for analysis
Essential for: Production troubleshooting, security analysis, capacity planning.
Cost-effective when: Using appropriate sampling and retention policies.