
IAM Policy Propagation: Eventual Consistency & Testing

Why IAM propagation is hard

One of the least understood aspects of GCP security is that IAM policies do not propagate instantly. When you grant a role to a user:

T+0: gcloud projects add-iam-policy-binding PROJECT_ID \
       --member=user:alice@company.com \
       --role=roles/editor
     Response: Updated policy

T+0.5s: Alice tries to access project
     Result: May get "permission denied" (policy not yet visible)

T+5s: Alice tries again
     Result: Likely succeeds (policy propagated)

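T+60s+: Usually visible everywhere, but not guaranteed
     (Google's documentation says changes typically take effect within
     2 minutes and can take 7 minutes or more)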

Production reality:

  • Users grant a role, try to use it immediately, and get access denied
  • Scripts create a service account, use it immediately, and fail
  • IAM policies change, dependent services still enforce old permissions
  • Monitoring shows inconsistent audit logs

IAM Propagation Architecture

IAM policy changes can be modeled as passing through a three-layer propagation system:

Layer 1: Control Plane (Instant)

T+0: setIamPolicy() API call

     Policy stored in the master control plane

     Response returned (synchronous)

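At this layer the IAM policy is updated immediately and the API call completes successfully. That only confirms the write; it says nothing about when enforcement points will see it.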

Layer 2: API Server Cache (5-30 seconds)

T+0 to T+30s: 
     The API-serving tier (the GCP counterpart of a kube-apiserver) updates its local policy cache
     This propagates to:
     - Load balancers
     - Regional deployments
     - Service replicas

Example: the GKE control plane replicates IAM policies to all replicas in the cluster, and reaching consensus takes time.

Layer 3: Data Plane Services (5-60 seconds)

T+5 to T+60s:
     Compute Engine, Cloud Storage, BigQuery, etc.
     sync policy changes from master
     
     Some services have local caches:
     - Compute Engine caches at the node level
     - Cloud Storage caches at the regional level

Worst case:

  • User gets 403 Forbidden (permission denied) when they should have access
  • No error log explains why, only the permission denied response
  • Very frustrating to debug
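
To make the three-layer picture concrete, here is a toy model (not GCP code) in which each layer picks up a change after a fixed delay. The `Layer` class and the delay values are hypothetical, picked from the ranges quoted above; real propagation is neither fixed nor guaranteed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    delay_s: float  # hypothetical: seconds after the write until this layer sees it


# Illustrative delays only, chosen from the ranges quoted in this section.
LAYERS = [
    Layer("control plane", 0.0),
    Layer("API server cache", 15.0),
    Layer("data plane service", 45.0),
]


def layers_seeing_change(elapsed_s: float) -> list:
    """Return the layers that already see a policy written at T+0."""
    return [layer.name for layer in LAYERS if elapsed_s >= layer.delay_s]


def is_consistent(elapsed_s: float) -> bool:
    """True once every layer agrees, so no request can hit a stale layer."""
    return len(layers_seeing_change(elapsed_s)) == len(LAYERS)


for t in (0, 20, 60):
    print(t, layers_seeing_change(t), is_consistent(t))
```

Between T+0 and the slowest layer's delay, the write has succeeded but a request can still be evaluated against the old policy, which is the window in which the 403s above appear.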

Practical Propagation Delays

Scenario 1: User Access Control

bash
# T+0: Grant Editor role
gcloud projects add-iam-policy-binding my-project \
  --member=user:alice@company.com \
  --role=roles/editor

# T+0 to T+10s: Alice tries to access Cloud Console
# Result: "You don't have permissions to access this project"

# T+15s: Alice refreshes browser
# Result: Access granted (policy propagated)

Why? Cloud Console caches IAM policies in the browser and in its backend. Both caches need to update.

Scenario 2: Service Account Impersonation

bash
# T+0: Create service account + grant roles
SA_EMAIL=app-sa@my-project.iam.gserviceaccount.com
gcloud iam service-accounts create app-sa --project=my-project
gcloud projects add-iam-policy-binding my-project \
  --member=serviceAccount:$SA_EMAIL \
  --role=roles/compute.admin

# T+0 to T+5s: Try to create a VM with the service account
gcloud compute instances create test-vm \
  --service-account=$SA_EMAIL
# May fail: the new service account or its role binding is not yet visible

# T+10s: Retry
# Success: Role propagated

Scenario 3: Deny Policies

Deny policies have an even longer propagation time (up to 60s):

bash
# T+0: Create deny policy (explicit deny)
# deny-policy.json holds a rule denying iam.googleapis.com/serviceAccounts.actAs
# to principalSet://goog/public:all
gcloud iam policies create deny-sa-iam-binding \
  --attachment-point=cloudresourcemanager.googleapis.com/organizations/ORG_ID \
  --kind=denypolicies \
  --policy-file=deny-policy.json

# T+0 to T+60s: Deny policy propagates
# Policy enforcement may be inconsistent during this window
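
Deny policies are usually supplied to the CLI as a JSON file. Below is a minimal sketch of generating one, assuming the v2 deny policy schema (`rules[].denyRule` with `deniedPrincipals` and `deniedPermissions`) and the `SERVICE/RESOURCE.ACTION` permission format. The helper `build_deny_policy`, the display name, and the output file name are hypothetical, and it is worth confirming that a permission is supported in deny policies before relying on it.

```python
import json
from pathlib import Path


def build_deny_policy(display_name, principals, permissions):
    """Build a deny policy body as a plain dict (assumed v2 schema)."""
    return {
        "displayName": display_name,
        "rules": [
            {
                "denyRule": {
                    "deniedPrincipals": list(principals),
                    "deniedPermissions": list(permissions),
                }
            }
        ],
    }


policy = build_deny_policy(
    display_name="Deny service account actAs",  # hypothetical name
    principals=["principalSet://goog/public:all"],
    permissions=["iam.googleapis.com/serviceAccounts.actAs"],
)

# Hypothetical output path; the file is what gets handed to the CLI.
Path("deny-policy.json").write_text(json.dumps(policy, indent=2))
```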

Caching Behavior

Client-Side Caching

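The Google Cloud client libraries do not cache IAM policies: every get_iam_policy call is a fresh request. Stale policy data on the client side usually comes from caching in your own code, for example memoized lookups:

python
import time
from functools import lru_cache

from google.cloud import iam_admin_v1

client = iam_admin_v1.IAMClient()
resource = "projects/my-project/serviceAccounts/app-sa@my-project.iam.gserviceaccount.com"

# Application-level cache: entries stay until evicted or cleared
@lru_cache(maxsize=128)
def cached_policy(resource_name):
    return client.get_iam_policy(request={"resource": resource_name})

# ❌ Problem: Stale cache
policy = cached_policy(resource)
time.sleep(2)  # User just granted new role
policy = cached_policy(resource)  # Still shows old policy

# ✅ Solution: Clear (or bypass) the cache after a policy change
cached_policy.cache_clear()
policy = cached_policy(resource)  # Fresh request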

Service-Level Caching

Different GCP services cache policies differently:

Service | Cache TTL | Propagation Time
Cloud IAM (API) | Instant | 1-5s
Cloud Console UI | 5 mins | 10-30s
Compute Engine | 10 mins per node | 5-30s per region
Cloud Storage | 5 mins | 10-60s
BigQuery | 15 mins | 5-30s
GKE | Varies | 5-30s
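
When a workflow touches several services, the slowest one sets the wait budget. The sketch below uses the upper bounds from the table above; the numbers are this section's estimates rather than documented guarantees, and `wait_budget_seconds` and its `safety_factor` are hypothetical.

```python
# Upper bound of the "Propagation Time" column above, in seconds.
# These are the section's rough estimates, not guarantees from Google.
PROPAGATION_UPPER_BOUND_S = {
    "iam_api": 5,
    "cloud_console": 30,
    "compute_engine": 30,
    "cloud_storage": 60,
    "bigquery": 30,
    "gke": 30,
}


def wait_budget_seconds(services, safety_factor=2.0):
    """Return how long to keep retrying before treating a denial as real."""
    unknown = [s for s in services if s not in PROPAGATION_UPPER_BOUND_S]
    if unknown:
        raise KeyError(f"no estimate for: {unknown}")
    slowest = max(PROPAGATION_UPPER_BOUND_S[s] for s in services)
    return slowest * safety_factor


print(wait_budget_seconds(["compute_engine", "cloud_storage"]))  # 120.0
```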

Testing IAM Propagation

Test 1: Permission Checks

python
def test_iam_propagation(resource_name, member, role):
    """Test if IAM policy change propagated"""
    import time
    
    # Grant role
    policy = get_iam_policy(resource_name)
    policy.bindings.append({
        "role": role,
        "members": [member]
    })
    set_iam_policy(resource_name, policy)
    
    # Poll until permission visible
    max_retries = 30
    for attempt in range(max_retries):
        try:
            # Attempt operation that requires the role
            result = test_permission(resource_name, member, role)
            if result:
                print(f"✓ Role propagated after {attempt} seconds")
                return True
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"✗ Role not propagated after {attempt} seconds")
                raise
        
        time.sleep(1)
    
    return False

def test_permission(resource_name, member, role):
    """Verify the role is effective via testIamPermissions"""
    # testIamPermissions checks the credentials that make the call; it has no
    # parameter for testing another identity. `client` must therefore be
    # authenticated as `member` (for example through service account
    # impersonation).

    # Get permissions granted by role
    role_permissions = get_permissions_for_role(role)

    # Ask which of these permissions the caller holds on the resource
    response = client.test_iam_permissions(
        request={"resource": resource_name, "permissions": role_permissions}
    )

    return len(response.permissions) > 0

Test 2: Service Account Testing

bash
#!/bin/bash
# test-iam-propagation.sh

PROJECT_ID=$1
SA_EMAIL=$2
TIMEOUT=60

# Grant role to service account
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:$SA_EMAIL" \
  --role=roles/compute.admin

# Create service account credentials once (local for testing).
# Creating a key inside the loop would pile up keys on the account.
gcloud iam service-accounts keys create /tmp/key.json \
  --iam-account="$SA_EMAIL"
trap 'rm -f /tmp/key.json' EXIT

# Register the key with gcloud. This also makes the service account the
# active gcloud account, so switch back afterwards if needed.
gcloud auth activate-service-account "$SA_EMAIL" --key-file=/tmp/key.json

# Poll until service account can use role
end_time=$(($(date +%s) + TIMEOUT))

while [ "$(date +%s)" -lt "$end_time" ]; do
    # Try operation that requires role
    if gcloud compute instances list \
           --project="$PROJECT_ID" \
           --account="$SA_EMAIL" 2>/dev/null; then
        echo "✓ IAM propagated successfully"
        exit 0
    fi

    sleep 2
done

echo "✗ IAM not propagated within $TIMEOUT seconds"
exit 1

Test 3: Cross-Service Propagation

python
import time

from google.api_core import exceptions
from google.cloud import compute_v1

# create_service_account, grant_role, impersonate_service_account and
# PROJECT_ID are assumed to be defined elsewhere.


def test_cross_service_propagation():
    """Test if Compute Engine sees IAM changes"""
    sa_email = create_service_account("test-sa")

    # Grant Compute Instance Admin role
    grant_role(sa_email, "roles/compute.instanceAdmin.v1")

    # Wait for propagation
    time.sleep(5)

    # Test: Can service account create VM?
    try:
        credentials = impersonate_service_account(sa_email)
        compute_client = compute_v1.InstancesClient(credentials=credentials)

        # Attempt create VM
        operation = compute_client.insert(
            project=PROJECT_ID,
            zone="us-central1-a",
            instance_resource=compute_v1.Instance(
                name="test-vm", machine_type="..."
            ),
        )

        print("✓ Service account can create VMs (IAM propagated)")
        return True
    except exceptions.Forbidden as e:
        # HTTP 403; matching on the exception type is more reliable than
        # searching the message text. Other errors propagate unchanged.
        print(f"✗ IAM not yet propagated: {e}")
        return False

Handling Propagation in Production

Pattern 1: Retry Loop with Exponential Backoff

python
from google.api_core import exceptions, retry

# Decorator handles retries automatically.
# The default predicate only retries transient server errors, so permission
# errors must be listed explicitly. PermissionDenied is a subclass of Forbidden.
@retry.Retry(
    predicate=retry.if_exception_type(exceptions.Forbidden),
    initial=1,           # Start with 1 second
    maximum=10,          # Max 10 seconds
    multiplier=2,        # Double each time
    deadline=60          # Overall timeout: 60 seconds
)
def use_service_account(sa_email):
    """Use service account (may fail initially if IAM not propagated)"""
    # A 403 raised here propagates to the decorator, which retries it
    create_resource_with_sa(sa_email)
    return True

# Usage:
use_service_account("app-sa@project.iam.gserviceaccount.com")
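
To see what those retry parameters imply, the sketch below computes the nominal sleep schedule for an exponential backoff with a cap and an overall deadline. It is plain Python, not the library's implementation: `backoff_schedule` is a hypothetical helper, it ignores the time each call takes, and the real `Retry` adds random jitter, so actual sleeps will differ.

```python
def backoff_schedule(initial=1.0, maximum=10.0, multiplier=2.0, deadline=60.0):
    """Return the nominal sleeps that fit inside the deadline (no jitter)."""
    sleeps = []
    delay = initial
    elapsed = 0.0
    while elapsed + delay <= deadline:
        sleeps.append(delay)
        elapsed += delay
        delay = min(delay * multiplier, maximum)  # grow, but never past the cap
    return sleeps


schedule = backoff_schedule()
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0, 10.0, 10.0]
print(sum(schedule))  # 55.0
```

With these settings most of the budget is spent at the 10 second cap, so raising `maximum` reduces the number of attempts without extending the total wait.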

Pattern 2: Idempotent Operations

python
import time

from google.api_core.exceptions import NotFound, PermissionDenied


def create_vm_idempotent(instance_name, sa_email):
    """Create VM, handling IAM propagation gracefully"""
    
    for attempt in range(5):
        try:
            # Check if VM already exists
            try:
                instance = get_instance(instance_name)
                print(f"✓ VM already exists")
                return instance
            except NotFound:
                pass
            
            # Create VM (will fail if IAM not propagated)
            instance = create_vm(instance_name, service_account=sa_email)
            print(f"✓ Created VM after {attempt + 1} attempts")
            return instance
            
        except PermissionDenied as e:
            if attempt < 4:
                wait_time = 2 ** attempt  # exponential backoff
                print(f"! Permission denied, retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

create_vm_idempotent("app-vm", sa_email)

Pattern 3: Pre-warming Services

python
import time

from google.cloud import compute_v1, storage


def setup_project_with_service_account(project_id, sa_email):
    """Setup project, pre-warming services to ensure IAM propagation"""
    
    # Create service account
    sa = create_service_account(project_id, sa_email)
    
    # Grant necessary roles
    grant_role(project_id, sa_email, "roles/compute.admin")
    grant_role(project_id, sa_email, "roles/storage.admin")
    
    # Pre-warm: Make test API calls with service account
    # This is a heuristic: the calls exercise each service's authorization
    # path, but nothing guarantees that they speed up propagation
    
    print("Pre-warming Compute Engine...")
    try:
        compute_client = compute_v1.InstancesClient(
            credentials=impersonate_service_account(sa_email)
        )
        compute_client.list(project=project_id, zone="us-central1-a")
    except Exception:
        pass  # A 403 here means the role is not effective yet; ignored on purpose
    
    print("Pre-warming Cloud Storage...")
    try:
        storage_client = storage.Client(
            project=project_id,
            credentials=impersonate_service_account(sa_email)
        )
        list(storage_client.list_buckets())
    except Exception:
        pass
    
    # Wait for caches to settle
    time.sleep(5)
    
    print("✓ Project pre-warmed, ready for operations")

Monitoring IAM Propagation Issues

Detect via Audit Logs

bash
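# Query audit logs for IAM changes
# (no severity or resource.type filter: SetIamPolicy entries are normally
# logged below WARNING, and project-level changes are not logged against
# service_account resources)
gcloud logging read \
  'protoPayload.methodName:"SetIamPolicy"' \
  --limit=10 \
  --format=json

# Monitor for permission denied errors
# (status code 7 is PERMISSION_DENIED in audit log entries)
gcloud logging read \
  'severity>=ERROR AND
   protoPayload.status.code=7' \
  --limit=20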
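
One way to separate propagation noise from genuine misconfiguration is to check whether a denial happened shortly after a policy change. The sketch below works on already-parsed log entries; the entry shape (`timestamp`, `method`, `status`) is a simplified, hypothetical projection of the JSON that `gcloud logging read --format=json` returns, the 120 second window is an arbitrary choice, and the sample data is synthetic.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an RFC 3339 timestamp with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def denials_near_policy_change(entries, window_s=120):
    """Return 403 entries that occur within window_s after a SetIamPolicy call."""
    changes = [parse_ts(e["timestamp"]) for e in entries
               if e["method"].endswith("SetIamPolicy")]
    window = timedelta(seconds=window_s)
    suspects = []
    for e in entries:
        if e["status"] != 403:
            continue
        ts = parse_ts(e["timestamp"])
        # Only denials that FOLLOW a change, and only within the window
        if any(timedelta(0) <= ts - c <= window for c in changes):
            suspects.append(e)
    return suspects


# Synthetic sample data for illustration only.
sample = [
    {"timestamp": "2024-01-01T10:00:00Z", "method": "SetIamPolicy", "status": 200},
    {"timestamp": "2024-01-01T10:00:08Z", "method": "compute.instances.insert", "status": 403},
    {"timestamp": "2024-01-01T11:30:00Z", "method": "storage.objects.get", "status": 403},
]

for entry in denials_near_policy_change(sample):
    print(entry["method"])  # compute.instances.insert
```

Denials outside the window, like the last sample entry, are more likely to be real permission gaps than propagation delay.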

Implement Custom Monitoring

python
from prometheus_client import Gauge
import time

iam_propagation_delay = Gauge(
    'iam_propagation_delay_seconds',
    'Time until IAM policy fully propagated'
)

def measure_iam_propagation(resource, member, role):
    """Measure actual propagation time"""
    start_time = time.time()
    
    grant_role(resource, member, role)
    
    # Poll until visible
    while True:
        elapsed = time.time() - start_time
        
        if can_member_perform_action(resource, member, role):
            # Gauge exposes set(); observe() belongs to Histogram and Summary
            iam_propagation_delay.set(elapsed)
            print(f"IAM propagation took {elapsed:.1f}s")
            break
        
        if elapsed > 60:
            print("⚠️  IAM propagation took > 60s (potential issue)")
            break
        
        time.sleep(1)
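
A polling loop like this is easier to unit-test when the clock, the sleep, and the permission check are injected. The sketch below is a hypothetical refactoring, not a library API: `check` is any zero-argument callable that returns True once the permission is effective, and `FakeClock` exists only for the test.

```python
import time


def measure_propagation(check, timeout_s=60.0, interval_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True; return elapsed seconds or None on timeout."""
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


class FakeClock:
    """Deterministic clock for tests: sleeping just advances the time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
# Pretend the permission becomes effective 7 seconds after the grant.
elapsed = measure_propagation(lambda: fake.now >= 7, clock=fake, sleep=fake.sleep)
print(elapsed)  # 7.0
```

In production the defaults (`time.monotonic` and `time.sleep`) apply, and the returned value can be fed to whatever metric is in use.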

Common Failure Patterns

Pattern | Symptom | Fix
No retry logic | First API call fails immediately | Add exponential backoff retry
Assume instant | Race condition in tests | Add 5-10s delay or retry loop
Cross-service | Service A sees role, B doesn't | Wait longer, pre-warm services
Client caching | Old policy still visible | Clear client cache or new client
Deny policies | Take 60s to propagate | Extra delay for deny policy changes

References