IAM Policy Propagation: Eventual Consistency & Testing

Why IAM propagation is difficult

One of the least understood aspects of GCP security is that IAM policies do not propagate instantly. When you grant a role to a user:

T+0: gcloud projects add-iam-policy-binding PROJECT_ID \
       --member=user:alice@company.com \
       --role=roles/editor
     Response: Updated policy

T+0.5s: Alice tries to access project
     Result: May get "permission denied" (policy not yet visible)

T+5s: Alice tries again
     Result: Likely succeeds (policy propagated)

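T+60s+: Most caches updated (typical, not guaranteed)
     Note: Google documents that changes usually take effect within 2 minutes, but can take 7 minutes or more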

Production reality:

Users are granted a role, try to use it immediately, and get access denied
Scripts create service accounts, use them immediately, and fail
IAM policies change, but dependent services still enforce the old permissions for a while
Monitoring looks inconsistent: the audit log records the grant, yet denials for the same principal follow
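
The failures listed above share one shape: a caller acts before the grant is visible. A minimal sketch of the usual remedy, polling a check until it passes or a deadline expires, might look like the following. The names `wait_for_access` and `check`, and the default timings, are illustrative assumptions and not part of any Google Cloud library; `check` stands in for whatever real call needs the permission.

```python
import time
from typing import Callable


def wait_for_access(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns True or `timeout` seconds elapse.

    `check` is a hypothetical callable that performs the real operation
    (or a cheap probe of it) and returns True once access works.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() + interval > deadline:
            return False  # The next attempt would overshoot the deadline
        sleep(interval)


# Demo with a fake check that starts succeeding on the third call.
calls = {"n": 0}


def fake_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_for_access(fake_check, timeout=10, interval=0.01))
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting; in production code the defaults are enough.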

IAM Propagation Architecture

IAM policy changes can be pictured as moving through a three-layer propagation system (a simplified model; the timings are approximate):

Layer 1: Control Plane (Instant)

T+0: setIamPolicy() API call
     ↓
     Policy stored in the primary control plane
     ↓
     Response returned (synchronous)

At this layer, the IAM policy is updated immediately and the API call completes successfully.

Layer 2: API Server Cache (5-30 seconds)

T+0 to T+30s:
     The API-serving tier of each GCP service updates its local policy cache
     This propagates to:
     - Load balancers
     - Regional deployments
     - Service replicas

Example: the GKE control plane replicates IAM policies to all replicas in the cluster, and reaching consensus takes time.

Layer 3: Data Plane Services (5-60 seconds)

T+5 to T+60s:
     Compute Engine, Cloud Storage, BigQuery, etc.
     sync policy changes from master
     
     Some services have local caches:
     - Compute Engine caches at node level
     - Cloud Storage caches at regional level

Worst case:

User gets 403 Forbidden (permission denied) when they should have access
No error log explains why; there is only the permission denied response
Very frustrating to debug
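
To make the three-layer picture concrete, here is a toy model of it. It is only a sketch: the `Layer` class, the function names, and the fixed delays (taken from the upper ends of the ranges quoted above) are assumptions for illustration, not measured or documented values.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Layer:
    name: str
    visible_after: float  # Seconds after the write until this layer sees it


# Illustrative delays only; real propagation times vary.
LAYERS = (
    Layer("control plane", 0.0),
    Layer("API server cache", 30.0),
    Layer("data plane services", 60.0),
)


def layers_seeing_change(elapsed: float, layers: Sequence[Layer] = LAYERS) -> List[str]:
    """Names of the layers that already see a change made `elapsed` seconds ago."""
    return [layer.name for layer in layers if elapsed >= layer.visible_after]


def request_allowed(elapsed: float, layers: Sequence[Layer] = LAYERS) -> bool:
    """In this model a request succeeds only when every layer sees the grant."""
    return len(layers_seeing_change(elapsed, layers)) == len(layers)


for t in (0, 10, 45, 60):
    print(f"T+{t}s: visible in {layers_seeing_change(t)}, allowed={request_allowed(t)}")
```

The point of the model is that the `setIamPolicy` response reflects only the first layer, while a request is judged by the slowest layer on its path.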

Practical Propagation Delays

Scenario 1: User Access Control

bash

# T+0: Grant Editor role
gcloud projects add-iam-policy-binding my-project \
  --member=user:alice@company.com \
  --role=roles/editor

# T+0 to T+10s: Alice tries to access Cloud Console
# Result: "You don't have permissions to access this project"

# T+15s: Alice refreshes browser
# Result: Access granted (policy propagated)

Why? Cloud Console caches IAM policies in both the browser and the backend. Both caches need to update.

Scenario 2: Service Account Impersonation

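bash

# T+0: Create service account and grant it a role
gcloud iam service-accounts create app-sa --project="$PROJECT_ID"
SA_EMAIL="app-sa@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role=roles/compute.admin

# T+0 to T+5s: Try to create a VM while impersonating the service account
# (the caller needs roles/iam.serviceAccountTokenCreator on the service account)
gcloud compute instances create test-vm \
  --project="$PROJECT_ID" --zone=us-central1-a \
  --no-service-account --no-scopes \
  --impersonate-service-account="$SA_EMAIL"
# May fail: the service account does not yet appear to have compute.instances.create

# T+10s: Retry the same command
# Success: Role propagated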

Scenario 3: Deny Policies

Deny policies can take even longer to propagate (up to 60s in this walkthrough, sometimes more):

bash

# T+0: Create deny policy (explicit deny)
# policy.json contains one denyRule with:
#   deniedPrincipals:  ["principalSet://goog/public:all"]
#   deniedPermissions: ["iam.googleapis.com/serviceAccounts.actAs"]
gcloud iam policies create deny-sa-iam-binding \
  --attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID \
  --kind=denypolicies \
  --policy-file=policy.json

# T+0 to T+60s: Deny policy propagates
# Policy enforcement may be inconsistent during this window
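
Because enforcement can flip between old and new behavior during this window, a single observation is weak evidence that a policy has settled. One possible approach, sketched here under assumptions, is to require several consecutive consistent observations. `wait_until_stable` and `probe` are hypothetical names; for a deny policy, `probe` would return True when the denied action is actually rejected.

```python
import time
from typing import Callable


def wait_until_stable(
    probe: Callable[[], bool],
    required_consecutive: int = 3,
    max_attempts: int = 60,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `probe` has been True `required_consecutive` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            streak = 0  # A single stale answer resets the count
        if attempt < max_attempts - 1:
            sleep(interval)
    return False


# Demo: enforcement flaps once before settling.
observations = iter([True, False, True, True, True])
print(wait_until_stable(lambda: next(observations), max_attempts=5, interval=0))
```

Requiring a streak trades a few extra seconds for protection against a lucky hit on a replica that happened to update early.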

Caching Behavior

Client-Side Caching

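The Google Cloud client libraries do not cache IAM policy reads; each get_iam_policy call goes to the API. Stale results on the client side usually come from a cache in your own code:

python

from functools import lru_cache

from google.cloud import iam_admin_v1

client = iam_admin_v1.IAMClient()
# IAMClient.get_iam_policy reads the policy attached to a service account
resource = "projects/my-project/serviceAccounts/app-sa@my-project.iam.gserviceaccount.com"

@lru_cache(maxsize=128)
def get_policy_cached(resource):
    """Application-level cache: keeps returning the first result until cleared."""
    return client.get_iam_policy(request={"resource": resource})

# ❌ Problem: Stale cache
policy = get_policy_cached(resource)
# ... user is granted a new role ...
policy = get_policy_cached(resource)  # Still shows old policy

# ✅ Solution: Clear the cache (or bypass it) after any policy change
get_policy_cached.cache_clear()
policy = client.get_iam_policy(request={"resource": resource})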
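
If an application does keep its own cache of policy reads, a time-to-live plus explicit invalidation limits how long a stale answer can survive. The following is a minimal sketch; `PolicyCache`, `fetch`, and the 300-second default are assumptions for illustration, and `fetch` stands in for a real policy read.

```python
import time
from typing import Any, Callable, Dict, Tuple


class PolicyCache:
    """Tiny TTL cache for policy reads with explicit invalidation."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, resource: str) -> Any:
        """Return the cached value if it is younger than the TTL, else refetch."""
        now = self._clock()
        entry = self._entries.get(resource)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = self._fetch(resource)
        self._entries[resource] = (now, value)
        return value

    def invalidate(self, resource: str) -> None:
        """Drop the cached entry; call this right after changing a policy."""
        self._entries.pop(resource, None)


# Demo with an in-memory stand-in for the IAM backend.
backend = {"projects/demo": "policy-v1"}
cache = PolicyCache(lambda r: backend[r])
print(cache.get("projects/demo"))
backend["projects/demo"] = "policy-v2"
print(cache.get("projects/demo"))   # Still the cached value
cache.invalidate("projects/demo")
print(cache.get("projects/demo"))   # Refetched after invalidation
```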

Service-Level Caching

Different GCP services cache policies differently. The figures below are rough estimates, not published guarantees:

Service	Cache TTL	Propagation Time
Cloud IAM (API)	Instant	1-5s
Cloud Console UI	5 mins	10-30s
Compute Engine	10 mins per node	5-30s per region
Cloud Storage	5 mins	10-60s
BigQuery	15 mins	5-30s
GKE	Varies	5-30s
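
One way to use a table like this is to derive a polling deadline from the slowest service involved in an operation. The sketch below copies the upper bounds of the "Propagation Time" column; the dictionary keys, the `wait_budget` name, and the safety factor are assumptions, and the numbers are only as reliable as the estimates in the table.

```python
from typing import Iterable

# Upper bounds in seconds, copied from the "Propagation Time" column above.
PROPAGATION_UPPER_BOUND_S = {
    "iam_api": 5,
    "cloud_console": 30,
    "compute_engine": 30,
    "cloud_storage": 60,
    "bigquery": 30,
    "gke": 30,
}


def wait_budget(services: Iterable[str], safety_factor: float = 2.0) -> float:
    """Return a polling deadline (seconds) that covers the slowest listed service."""
    names = list(services)
    unknown = [name for name in names if name not in PROPAGATION_UPPER_BOUND_S]
    if unknown:
        raise KeyError(f"no propagation estimate for: {unknown}")
    slowest = max((PROPAGATION_UPPER_BOUND_S[name] for name in names), default=0)
    return slowest * safety_factor


print(wait_budget(["compute_engine", "cloud_storage"]))
```

The result is meant as a deadline for a retry loop, not as a fixed sleep.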

Testing IAM Propagation

Test 1: Permission Checks

python

import time

from google.api_core import exceptions
from google.cloud import resourcemanager_v3


def test_iam_propagation(project_id, member, role, permissions, member_credentials):
    """Grant a role, then poll until the member's own calls see it."""
    resource = f"projects/{project_id}"
    admin = resourcemanager_v3.ProjectsClient()

    # Grant role (read-modify-write; the etag in the policy guards against races)
    policy = admin.get_iam_policy(request={"resource": resource})
    policy.bindings.add(role=role, members=[member])
    admin.set_iam_policy(request={"resource": resource, "policy": policy})

    # Poll until permission visible
    max_retries = 30
    start = time.monotonic()
    for _ in range(max_retries):
        try:
            if test_permission(resource, permissions, member_credentials):
                print(f"✓ Role propagated after {time.monotonic() - start:.1f}s")
                return True
        except exceptions.PermissionDenied:
            pass  # Not visible yet; keep polling

        time.sleep(1)

    print(f"✗ Role not propagated after {time.monotonic() - start:.1f}s")
    return False


def test_permission(resource, permissions, member_credentials):
    """Verify the member's permissions via testIamPermissions.

    testIamPermissions always evaluates the caller, so the client must be
    built with the member's credentials (for example, impersonated
    service account credentials). `permissions` lists permissions that
    the granted role is expected to include.
    """
    client = resourcemanager_v3.ProjectsClient(credentials=member_credentials)
    response = client.test_iam_permissions(
        request={"resource": resource, "permissions": permissions}
    )

    # The response echoes back only the permissions the caller holds
    return set(response.permissions) == set(permissions)

Test 2: Service Account Testing

bash

#!/bin/bash
# test-iam-propagation.sh

PROJECT_ID=$1
SA_EMAIL=$2
TIMEOUT=60

# Grant role to service account
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:$SA_EMAIL" \
  --role=roles/compute.admin

# Poll until service account can use role
end_time=$(($(date +%s) + TIMEOUT))

while [ "$(date +%s)" -lt "$end_time" ]; do
    # Try an operation that requires the role. Impersonate the service
    # account rather than creating a new key on every iteration: keys would
    # pile up and leak. The caller needs
    # roles/iam.serviceAccountTokenCreator on the service account.
    if gcloud compute instances list \
           --project="$PROJECT_ID" \
           --impersonate-service-account="$SA_EMAIL" >/dev/null 2>&1; then
        echo "✓ IAM propagated successfully"
        exit 0
    fi

    sleep 2
done

echo "✗ IAM not propagated within $TIMEOUT seconds"
exit 1

Test 3: Cross-Service Propagation

python

import time

from google.api_core import exceptions
from google.cloud import compute_v1


def test_cross_service_propagation():
    """Test if Compute Engine sees IAM changes"""
    # create_service_account, grant_role and impersonate_service_account are
    # project-specific helpers defined elsewhere
    sa_email = create_service_account("test-sa")

    # Grant Compute Instance Admin role
    grant_role(sa_email, "roles/compute.instanceAdmin.v1")

    # Wait for propagation (a fixed sleep is a guess; polling is more robust)
    time.sleep(5)

    # Test: Can service account create VM?
    try:
        credentials = impersonate_service_account(sa_email)
        compute_client = compute_v1.InstancesClient(credentials=credentials)

        # Attempt to create VM
        operation = compute_client.insert(
            project=PROJECT_ID,
            zone="us-central1-a",
            instance_resource=compute_v1.Instance(
                name="test-vm", machine_type="..."
            ),
        )

        print("✓ Service account can create VMs (IAM propagated)")
        return True
    except exceptions.Forbidden as e:
        # HTTP 403: the grant is not visible to Compute Engine yet
        print(f"✗ IAM not yet propagated: {e}")
        return False
    # Any other exception propagates to the caller unchanged

Handling Propagation in Production

Pattern 1: Retry Loop with Exponential Backoff

python

from google.api_core import exceptions, retry

# Decorator handles retries automatically.
# The default predicate only retries transient errors, so permission
# errors must be listed explicitly. Forbidden (HTTP 403) is the parent
# class of PermissionDenied, so this covers REST and gRPC transports.
@retry.Retry(
    predicate=retry.if_exception_type(exceptions.Forbidden),
    initial=1,           # Start with 1 second
    maximum=10,          # Max 10 seconds
    multiplier=2,        # Double each time
    timeout=60           # Overall timeout: 60 seconds
)
def use_service_account(sa_email):
    """Use service account (may fail initially if IAM not propagated)"""
    # A permission error raised here triggers a retry (likely propagation issue)
    create_resource_with_sa(sa_email)
    return True

# Usage:
use_service_account("app-sa@project.iam.gserviceaccount.com")
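
To see what those parameters mean in practice, the sleep intervals of a capped exponential backoff can be listed directly. This is a standalone sketch, not how any retry library computes its delays: `backoff_schedule` is a hypothetical helper, and it ignores the random jitter that real retry implementations typically add.

```python
from typing import List


def backoff_schedule(
    initial: float, maximum: float, multiplier: float, deadline: float
) -> List[float]:
    """List the sleep intervals of a capped exponential backoff within `deadline`."""
    if initial <= 0 or multiplier < 1:
        raise ValueError("initial must be > 0 and multiplier must be >= 1")
    delays: List[float] = []
    elapsed = 0.0
    delay = initial
    while elapsed + delay <= deadline:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * multiplier, maximum)  # Grow, but never past the cap
    return delays


schedule = backoff_schedule(initial=1, maximum=10, multiplier=2, deadline=60)
print(schedule)       # [1, 2, 4, 8, 10, 10, 10, 10]
print(sum(schedule))  # 55 seconds of waiting fit inside the 60-second deadline
```

With these settings the operation is attempted at most nine times: once up front and once after each of the eight sleeps.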

Pattern 2: Idempotent Operations

python

import time

from google.api_core.exceptions import NotFound, PermissionDenied


def create_vm_idempotent(instance_name, sa_email, max_attempts=5):
    """Create VM, handling IAM propagation gracefully"""
    for attempt in range(max_attempts):
        try:
            # Check if VM already exists (e.g. created by an earlier attempt)
            try:
                instance = get_instance(instance_name)
                print("✓ VM already exists")
                return instance
            except NotFound:
                pass

            # Create VM (will fail if IAM not propagated)
            instance = create_vm(instance_name, service_account=sa_email)
            print(f"✓ Created VM on attempt {attempt + 1}")
            return instance

        except PermissionDenied:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error
            wait_time = 2 ** attempt  # exponential backoff: 1, 2, 4, 8s
            print(f"! Permission denied, retrying in {wait_time}s...")
            time.sleep(wait_time)

create_vm_idempotent("app-vm", sa_email)

Pattern 3: Pre-warming Services

python

import time

from google.cloud import compute_v1, storage


def setup_project_with_service_account(project_id, sa_email):
    """Setup project, pre-warming services to encourage IAM propagation"""

    # Create service account
    sa = create_service_account(project_id, sa_email)

    # Grant necessary roles
    grant_role(project_id, sa_email, "roles/compute.admin")
    grant_role(project_id, sa_email, "roles/storage.admin")

    # Pre-warm: Make test API calls with service account
    # The intent is to make each service evaluate the new policy early.
    # This is a heuristic; it does not guarantee propagation.

    print("Pre-warming Compute Engine...")
    try:
        compute_client = compute_v1.InstancesClient(
            credentials=impersonate_service_account(sa_email)
        )
        compute_client.list(project=project_id, zone="us-central1-a")
    except Exception:
        pass  # A denial here is tolerated; the call is only a probe

    print("Pre-warming Cloud Storage...")
    try:
        storage_client = storage.Client(
            project=project_id,
            credentials=impersonate_service_account(sa_email)
        )
        list(storage_client.list_buckets())
    except Exception:
        pass  # Same as above: probe only

    # Wait for caches to settle
    time.sleep(5)

    print("✓ Project pre-warmed, ready for operations")

Monitoring IAM Propagation Issues

Detect via Audit Logs

bash

# Query audit logs for IAM changes
# (no severity filter: SetIamPolicy entries are not logged at WARNING)
gcloud logging read \
  'resource.type="service_account" AND
   protoPayload.methodName:"SetIamPolicy"' \
  --limit=10 \
  --format=json

# Monitor for permission denied errors
# (audit log entries carry status code 7 = PERMISSION_DENIED)
gcloud logging read \
  'protoPayload.status.code=7' \
  --limit=20
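
Once both kinds of entries are exported, denials that follow shortly after a policy change are the likeliest propagation victims. The sketch below shows one way to correlate them, assuming the timestamps have already been parsed from the log entries; `denials_near_changes` and the 120-second window are illustrative choices, not an established rule.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Iterable, List


def denials_near_changes(
    change_times: Iterable[datetime],
    denial_times: Iterable[datetime],
    window: timedelta = timedelta(seconds=120),
) -> List[datetime]:
    """Return denials that occurred within `window` after any IAM change."""
    changes = sorted(change_times)
    suspects: List[datetime] = []
    for denied_at in sorted(denial_times):
        # Index of the most recent change at or before this denial
        i = bisect_right(changes, denied_at) - 1
        if i >= 0 and denied_at - changes[i] <= window:
            suspects.append(denied_at)
    return suspects


# Demo with hand-written timestamps.
changed = datetime(2024, 1, 1, 12, 0, 0)
denials = [
    changed + timedelta(seconds=10),   # Shortly after the change: suspect
    changed + timedelta(minutes=10),   # Long after: probably a real denial
    changed - timedelta(seconds=5),    # Before the change: unrelated
]
print(denials_near_changes([changed], denials))
```

A denial flagged this way is only a candidate; matching the principal and resource in both entries would narrow it further.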

Implement Custom Monitoring

python

import time

from prometheus_client import Gauge

iam_propagation_delay = Gauge(
    'iam_propagation_delay_seconds',
    'Time until IAM policy fully propagated'
)

def measure_iam_propagation(resource, member, role):
    """Measure actual propagation time"""
    start_time = time.monotonic()

    grant_role(resource, member, role)

    # Poll until visible
    while True:
        elapsed = time.monotonic() - start_time

        if can_member_perform_action(resource, member, role):
            # Gauge exposes set(); observe() belongs to Histogram and Summary
            iam_propagation_delay.set(elapsed)
            print(f"IAM propagation took {elapsed:.1f}s")
            break

        if elapsed > 60:
            print("⚠️  IAM propagation took > 60s (potential issue)")
            break

        time.sleep(1)
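
A single measurement says little, so it can help to summarize many of them. The following sketch computes the median, the 95th percentile, and the maximum of a list of measured delays using only the standard library; `summarize_delays` and the sample values are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize_delays(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize measured propagation delays (seconds) as p50, p95 and max."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points at 5% steps:
    # index 9 is the 50th percentile, index 18 the 95th.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return {"p50": cuts[9], "p95": cuts[18], "max": max(samples)}


# Demo with made-up measurements.
print(summarize_delays([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Tracking the 95th percentile rather than the average is a reasonable basis for choosing retry deadlines, since the slow tail is what breaks automation.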

Common Failure Patterns

Pattern	Symptom	Fix
No retry logic	First API call fails immediately	Add exponential backoff retry
Assume instant	Race condition in tests	Add 5-10s delay or retry loop
Cross-service	Service A sees role, B doesn't	Wait longer, pre-warm services
Client caching	Old policy still visible	Clear client cache or new client
Deny policies	Take 60s to propagate	Extra delay for deny policy changes
