IAM Policy Propagation: Eventual Consistency & Testing
Why IAM Propagation Is Hard
One of the least understood aspects of GCP security is that IAM policy changes do not take effect instantly. When you grant a role to a user, the timeline can look like this:
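T+0: gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=user:alice@company.com \
    --role=roles/editor
Response: Updated policy
T+0.5s: Alice tries to access project
Result: May get "permission denied" (policy not yet visible)
T+5s: Alice tries again
Result: Likely succeeds (policy propagated)
T+60s+: Most caches updated (typical, not guaranteed)
Google's IAM documentation states that changes generally take effect within about 2 minutes but can occasionally take 7 minutes or more, so treat the figures above as typical rather than guaranteed.
Production reality: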
- Users are granted a role, try to use it immediately, and get access denied
- Scripts create service accounts, use them immediately, and fail
- IAM policies change, but dependent services still enforce the old permissions
- Monitoring shows inconsistent audit logs
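The race in the timeline above can be reproduced without touching GCP. The following is a toy model, not a description of how IAM is implemented: `SimulatedPolicyStore`, its `propagation_delay` parameter, and the 5-second value are assumptions chosen only to mirror that timeline.

```python
"""Toy model of eventual consistency; not a description of IAM internals."""


class SimulatedPolicyStore:
    """Grants are acknowledged at once but only enforced after a delay."""

    def __init__(self, propagation_delay: float) -> None:
        self.propagation_delay = propagation_delay  # assumed value, in seconds
        self._granted_at: dict[tuple[str, str], float] = {}

    def grant(self, member: str, role: str, now: float) -> str:
        # The write is acknowledged immediately (the "Updated policy" response).
        self._granted_at[(member, role)] = now
        return "Updated policy"

    def is_enforced(self, member: str, role: str, now: float) -> bool:
        # The grant is only honored once the propagation delay has elapsed.
        granted_at = self._granted_at.get((member, role))
        return granted_at is not None and now - granted_at >= self.propagation_delay


store = SimulatedPolicyStore(propagation_delay=5.0)
store.grant("user:alice@company.com", "roles/editor", now=0.0)
early = store.is_enforced("user:alice@company.com", "roles/editor", now=0.5)
later = store.is_enforced("user:alice@company.com", "roles/editor", now=5.0)
print(early, later)
```

Here `early` is `False` and `later` is `True`: the grant call succeeded at T+0, yet a check half a second later is still denied.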
IAM Propagation Architecture
IAM policy changes can be modeled as passing through a three-layer propagation system:
Layer 1: Control Plane (Instant)
T+0: setIamPolicy() API call
↓
Policy stored in the master control plane
↓
Response returned (synchronous)
At this layer, the IAM policy is updated immediately and the API call completes successfully.
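Because the control-plane write replaces the whole policy, it helps to make the modification step idempotent before calling setIamPolicy. A minimal sketch, assuming the policy is handled as the JSON-style dict used by the REST API (a `bindings` list of `role`/`members` entries plus an `etag`); `add_binding` is a hypothetical helper, the etag value is made up, and conditional bindings are deliberately left untouched.

```python
import copy


def add_binding(policy: dict, role: str, member: str) -> dict:
    """Return a copy of `policy` in which `member` holds `role`.

    Reuses an existing unconditional binding for the role instead of
    appending a duplicate, and preserves other fields such as `etag` so the
    later write can be rejected if the policy changed in the meantime.
    """
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        # Only merge into unconditional bindings for the same role.
        if binding.get("role") == role and "condition" not in binding:
            if member not in binding.setdefault("members", []):
                binding["members"].append(member)
            return updated
    bindings.append({"role": role, "members": [member]})
    return updated


policy = {"etag": "hypothetical-etag", "bindings": []}
once = add_binding(policy, "roles/editor", "user:alice@company.com")
twice = add_binding(once, "roles/editor", "user:alice@company.com")
print(twice == once)
```

Applying the helper twice yields the same policy as applying it once, so a retried grant does not accumulate duplicate members.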
Layer 2: API Server Cache (5-30 seconds)
T+0 to T+30s:
The API-serving layer of each Google Cloud service updates its local policy cache
This propagates to:
- Load balancers
- Regional deployments
- Service replicas
Example: the GKE control plane replicates IAM policies to all replicas in the cluster, and reaching consensus takes time.
Layer 3: Data Plane Services (5-60 seconds)
T+5 to T+60s:
Compute Engine, Cloud Storage, BigQuery, etc.
sync policy changes from master
Some services have local caches:
- Compute Engine caches at the node level
- Cloud Storage caches at the regional level
Worst case:
- A user gets 403 Forbidden (permission denied) when they should already have access
- There is no distinct error log entry, only the permission-denied response
- This makes debugging very frustrating
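Since a propagation-related 403 looks identical to a genuine denial, one way to narrow it down is to compare the failure time with the most recent grant. A rough sketch: `classify_denial`, its labels, and the 60-second default `grace_period` (taken from the upper bound quoted for Layer 3) are assumptions, and the result is a debugging hint rather than proof.

```python
from typing import Optional


def classify_denial(
    denied_at: float,
    last_grant_at: Optional[float],
    grace_period: float = 60.0,
) -> str:
    """Label a 403 by how soon it followed the latest relevant grant.

    Times are seconds on the same clock (for example Unix timestamps).
    """
    if last_grant_at is None or denied_at < last_grant_at:
        # No grant precedes the failure, so propagation cannot explain it.
        return "likely-genuine"
    if denied_at - last_grant_at <= grace_period:
        # The failure falls inside the assumed propagation window.
        return "possibly-propagation"
    return "likely-genuine"


print(classify_denial(denied_at=100.5, last_grant_at=100.0))
print(classify_denial(denied_at=400.0, last_grant_at=100.0))
print(classify_denial(denied_at=50.0, last_grant_at=None))
```

A denial flagged as `possibly-propagation` is a candidate for a retry; anything else is better treated as a real permissions problem and investigated.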
Practical Propagation Delays
Scenario 1: User Access Control
bash
# T+0: Grant Editor role
gcloud projects add-iam-policy-binding my-project \
--member=user:alice@company.com \
--role=roles/editor
# T+0 to T+10s: Alice tries to access Cloud Console
# Result: "You don't have permissions to access this project"
# T+15s: Alice refreshes browser
# Result: Access granted (policy propagated)
Why? The Cloud Console caches IAM policies in both the browser and the backend, and both caches need to update.
Scenario 2: Service Account Impersonation
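bash
# T+0: Create the service account and grant it a role
gcloud iam service-accounts create app-sa --project="$PROJECT_ID"
SA_EMAIL="app-sa@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role=roles/compute.admin
# T+0 to T+5s: Try to create a VM while acting as the service account
# (the caller needs roles/iam.serviceAccountTokenCreator on the service account)
gcloud compute instances create test-vm \
    --project="$PROJECT_ID" --zone=us-central1-a \
    --no-service-account --no-scopes \
    --impersonate-service-account="$SA_EMAIL"
# May fail: the service account does not yet have compute.instances.create
# T+10s: Retry
# Success: role propagated
Scenario 3: Deny Policies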
Deny policies can take even longer to propagate (up to 60s):
bash
# policy.json (hypothetical file name) holds the deny rule:
# {"rules": [{"denyRule": {
#   "deniedPrincipals": ["principalSet://goog/public:all"],
#   "deniedPermissions": ["iam.googleapis.com/serviceAccounts.actAs"]}}]}
# T+0: Create deny policy (explicit deny)
gcloud iam policies create deny-sa-iam-binding \
    --attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID \
    --kind=denypolicies \
    --policy-file=policy.json
# T+0 to T+60s: Deny policy propagates
# Policy enforcement may be inconsistent during this window
Caching Behavior
Client-Side Caching
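The Google Cloud client libraries expose no policy-cache setting (there is no `cache_policy` option), and each get_iam_policy call goes back to the API. Stale results on the client side usually come from reusing a policy object fetched before the change, or from credentials obtained before the grant:
python
import time
from google.cloud import iam_admin_v1

client = iam_admin_v1.IAMClient()
# resource is a service account name: projects/PROJECT_ID/serviceAccounts/SA_EMAIL
policy = client.get_iam_policy(request={"resource": resource})
# ❌ Problem: stale snapshot
time.sleep(2)  # A new role was just granted elsewhere
stale_bindings = policy.bindings  # Still shows the old policy
# ✅ Solution: re-fetch right before reading or modifying, and let the
# etag on the fresh policy guard the following set_iam_policy call
policy = client.get_iam_policy(request={"resource": resource})
Service-Level Caching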
Different GCP services cache policies differently. The figures below are approximate observations rather than documented guarantees:
| Service | Cache TTL | Propagation Time |
|---|---|---|
| Cloud IAM (API) | Instant | 1-5s |
| Cloud Console UI | 5 mins | 10-30s |
| Compute Engine | 10 mins per node | 5-30s per region |
| Cloud Storage | 5 mins | 10-60s |
| BigQuery | 15 mins | 5-30s |
| GKE | Varies | 5-30s |
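The table can be turned into a conservative wait budget for a workflow that touches several services. This is a sketch only: the `MAX_PROPAGATION_SECONDS` values are the upper bounds from the Propagation Time column above, while `wait_budget` and its `default` fallback are hypothetical.

```python
# Upper bounds (seconds) copied from the "Propagation Time" column above.
MAX_PROPAGATION_SECONDS = {
    "Cloud IAM (API)": 5,
    "Cloud Console UI": 30,
    "Compute Engine": 30,
    "Cloud Storage": 60,
    "BigQuery": 30,
    "GKE": 30,
}


def wait_budget(services: list[str], default: int = 60) -> int:
    """Return the longest upper bound among the services involved.

    Services missing from the table fall back to `default`, which is an
    assumed conservative value rather than a documented one.
    """
    if not services:
        return 0
    return max(MAX_PROPAGATION_SECONDS.get(name, default) for name in services)


print(wait_budget(["Compute Engine", "Cloud Storage"]))
```

The result is best used as the overall deadline for a retry loop rather than as a fixed sleep, since most changes land well before the upper bound.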
Testing IAM Propagation
Test 1: Permission Checks
python
import time

def test_iam_propagation(resource_name, member, role):
    """Grant a role, then poll until the change is visible to the member."""
    # Grant role (get_iam_policy / set_iam_policy wrap the resource's IAM API)
    policy = get_iam_policy(resource_name)
    policy.bindings.append({
        "role": role,
        "members": [member]
    })
    set_iam_policy(resource_name, policy)
    # Poll until permission visible
    max_retries = 30
    for attempt in range(max_retries):
        try:
            # Attempt operation that requires the role
            if test_permission(resource_name, role):
                print(f"✓ Role propagated after ~{attempt} seconds")
                return True
        except Exception:
            if attempt == max_retries - 1:
                print(f"✗ Role not propagated after {max_retries} attempts")
                raise
        time.sleep(1)
    return False

def test_permission(resource_name, role):
    """Check the role's permissions via testIamPermissions.

    testIamPermissions reports on the caller only, so `client` must be
    authenticated as the member under test (for a service account, use
    impersonated credentials).
    """
    # Get permissions granted by role
    role_permissions = get_permissions_for_role(role)
    # The response lists the subset of permissions the caller actually holds
    response = client.test_iam_permissions(
        request={"resource": resource_name, "permissions": role_permissions}
    )
    return len(response.permissions) > 0
Test 2: Service Account Testing
bash
#!/bin/bash
# test-iam-propagation.sh
PROJECT_ID=$1
SA_EMAIL=$2
TIMEOUT=60
# Grant role to service account
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SA_EMAIL" \
    --role=roles/compute.admin
# Poll until service account can use role
end_time=$(($(date +%s) + TIMEOUT))
while [ "$(date +%s)" -lt "$end_time" ]; do
    # Run the call as the service account via impersonation. This avoids
    # creating a new key on every iteration; the caller needs
    # roles/iam.serviceAccountTokenCreator on the service account.
    if gcloud compute instances list \
        --project="$PROJECT_ID" \
        --impersonate-service-account="$SA_EMAIL" >/dev/null 2>&1; then
        echo "✓ IAM propagated successfully"
        exit 0
    fi
    sleep 2
done
echo "✗ IAM not propagated within $TIMEOUT seconds"
exit 1
Test 3: Cross-Service Propagation
python
import time
from google.api_core import exceptions
from google.cloud import compute_v1

def test_cross_service_propagation():
    """Test if Compute Engine sees IAM changes"""
    sa_email = create_service_account("test-sa")
    # Grant Compute Instance Admin role
    grant_role(sa_email, "roles/compute.instanceAdmin.v1")
    # Wait for propagation
    time.sleep(5)
    # Test: Can service account create VM?
    try:
        credentials = impersonate_service_account(sa_email)
        compute_client = compute_v1.InstancesClient(credentials=credentials)
        # Attempt create VM (machine type left elided, as in the original)
        operation = compute_client.insert(
            project=PROJECT_ID,
            zone="us-central1-a",
            instance_resource=compute_v1.Instance(name="test-vm", machine_type="..."),
        )
        print("✓ Service account can create VMs (IAM propagated)")
        return True
    except exceptions.Forbidden as e:
        # HTTP 403 / PERMISSION_DENIED: most likely not yet propagated.
        # Any other exception propagates to the caller unchanged.
        print(f"✗ IAM not yet propagated: {e}")
        return False
Handling Propagation in Production
Pattern 1: Retry Loop with Exponential Backoff
python
from google.api_core import exceptions, retry

# Decorator handles retries automatically. The default predicate only retries
# transient errors, so permission-denied has to be listed explicitly.
@retry.Retry(
    predicate=retry.if_exception_type(exceptions.Forbidden),
    initial=1,      # Start with 1 second
    maximum=10,     # Max 10 seconds between attempts
    multiplier=2,   # Double each time
    timeout=60,     # Overall timeout: 60 seconds (`deadline` in older releases)
)
def use_service_account(sa_email):
    """Use service account (may fail initially if IAM not propagated)"""
    # A 403 raised here is retried by the decorator (likely propagation issue)
    create_resource_with_sa(sa_email)
    return True

# Usage:
use_service_account("app-sa@project.iam.gserviceaccount.com")
Pattern 2: Idempotent Operations
python
import time
from google.api_core.exceptions import Forbidden, NotFound

def create_vm_idempotent(instance_name, sa_email, max_attempts=5):
    """Create VM, handling IAM propagation gracefully"""
    for attempt in range(max_attempts):
        try:
            # Check if VM already exists
            try:
                instance = get_instance(instance_name)
                print("✓ VM already exists")
                return instance
            except NotFound:
                pass
            # Create VM (will fail if IAM not propagated)
            instance = create_vm(instance_name, service_account=sa_email)
            print(f"✓ Created VM after {attempt + 1} attempts")
            return instance
        except Forbidden:
            if attempt == max_attempts - 1:
                raise
            wait_time = 2 ** attempt  # exponential backoff
            print(f"! Permission denied, retrying in {wait_time}s...")
            time.sleep(wait_time)

create_vm_idempotent("app-vm", sa_email)
Pattern 3: Pre-warming Services
python
import time
from google.cloud import compute_v1, storage

def setup_project_with_service_account(project_id, sa_email):
    """Setup project, then probe services as the service account"""
    # Create service account
    sa = create_service_account(project_id, sa_email)
    # Grant necessary roles
    grant_role(project_id, sa_email, "roles/compute.admin")
    grant_role(project_id, sa_email, "roles/storage.admin")
    # Pre-warm: make test API calls with the service account.
    # These probes exercise each service's policy lookup; they do not
    # guarantee that every replica has picked up the change.
    credentials = impersonate_service_account(sa_email)
    print("Pre-warming Compute Engine...")
    try:
        compute_client = compute_v1.InstancesClient(credentials=credentials)
        list(compute_client.list(project=project_id, zone="us-central1-a"))
    except Exception:
        pass  # A failure is tolerated here; the call is only a probe
    print("Pre-warming Cloud Storage...")
    try:
        storage_client = storage.Client(project=project_id, credentials=credentials)
        list(storage_client.list_buckets())
    except Exception:
        pass
    # Wait for caches to settle
    time.sleep(5)
    print("✓ Project pre-warmed, ready for operations")
Monitoring IAM Propagation Issues
Detect via Audit Logs
bash
# Query audit logs for IAM changes. SetIamPolicy entries are written to the
# Admin Activity log with severity NOTICE, so do not filter on WARNING.
gcloud logging read \
    'logName:"cloudaudit.googleapis.com%2Factivity" AND
    protoPayload.methodName:"SetIamPolicy"' \
    --limit=10 \
    --format=json
# Monitor for permission denied errors (status code 7 = PERMISSION_DENIED)
gcloud logging read \
    'severity>=ERROR AND
    protoPayload.status.code=7' \
    --limit=20
Implement Custom Monitoring
python
import time
from prometheus_client import Gauge

iam_propagation_delay = Gauge(
    'iam_propagation_delay_seconds',
    'Time until IAM policy fully propagated'
)

def measure_iam_propagation(resource, member, role, timeout=60):
    """Measure actual propagation time"""
    start_time = time.monotonic()
    grant_role(resource, member, role)
    # Poll until visible
    while True:
        elapsed = time.monotonic() - start_time
        if can_member_perform_action(resource, member, role):
            # A Gauge exposes set(); observe() belongs to Histogram/Summary
            iam_propagation_delay.set(elapsed)
            print(f"IAM propagation took {elapsed:.1f}s")
            break
        if elapsed > timeout:
            print(f"⚠️ IAM propagation took > {timeout}s (potential issue)")
            break
        time.sleep(1)
Common Failure Patterns
| Pattern | Symptom | Fix |
|---|---|---|
| No retry logic | First API call fails immediately | Add exponential backoff retry |
| Assume instant | Race condition in tests | Add 5-10s delay or retry loop |
| Cross-service | Service A sees role, B doesn't | Wait longer, pre-warm services |
| Stale client state | Old policy still visible | Re-fetch the policy and refresh credentials |
| Deny policies | Can take up to 60s to propagate | Extra delay for deny policy changes |
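For the "add exponential backoff retry" fix, it can help to see how many attempts a given configuration actually allows. A small sketch that computes the un-jittered sleep schedule for the values used in Pattern 1 (initial 1s, multiplier 2, cap 10s, 60s budget); `backoff_schedule` is a hypothetical helper, and real retry libraries typically add random jitter on top.

```python
def backoff_schedule(
    initial: float, multiplier: float, maximum: float, deadline: float
) -> list[float]:
    """List the sleeps that fit inside `deadline` (no jitter applied)."""
    if initial <= 0:
        raise ValueError("initial must be positive")
    delays: list[float] = []
    delay, elapsed = initial, 0.0
    while elapsed + delay <= deadline:
        delays.append(delay)
        elapsed += delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * multiplier, maximum)
    return delays


schedule = backoff_schedule(initial=1, multiplier=2, maximum=10, deadline=60)
print(schedule, sum(schedule))
```

With these values the schedule contains eight sleeps totalling 55 seconds, which allows up to nine attempts before the 60-second budget runs out.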