Response Policy Zones (RPZ): Internal Overrides & Security

Why This Matters

Response Policy Zones (RPZ) provide DNS-level access control and security filtering. They let a resolver intercept and modify DNS responses at the infrastructure level, before the application sees them.

Real scenarios:

Scenario 1: Malware Blocking
├── Query: malware.badsite.com
├── RPZ rule: malware.badsite.com → NXDOMAIN
└── Result: Query blocked at DNS level, before any connection to the host is attempted

Scenario 2: Internal Service Redirect
├── Query: api.internal-test.example.com (from staging)
├── RPZ rule: api.internal-test.example.com → 10.0.2.100 (staging IP)
└── Result: Redirect to internal IP, bypass public endpoint

Scenario 3: Compliance Override
├── Query: external-service.com (not allowed in prod)
├── RPZ rule: external-service.com → NXDOMAIN
└── Result: Compliance enforced at DNS level

RPZ Fundamentals

What is RPZ?

RPZ = policy layer above DNS records, allows:

  • Block queries (return NXDOMAIN)
  • Redirect queries (return internal IP)
  • Pass through (allow normally)
  • Log queries (audit trail)

Normal Zone Flow:
  Query: api.example.com
    ↓
  Check records
    ↓
  Return A 35.201.100.50

RPZ Flow:
  Query: api.example.com
    ↓
  Check RPZ rules first
    ↓
  Rule matches: Return 10.0.1.100 (internal IP)
    ↓
  Return 10.0.1.100
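
The RPZ flow can be sketched in a few lines of Python. This is an illustrative model only, not how any resolver is implemented: the `Rule` class, the `RULES` table, and the `upstream` callback are hypothetical names, and only exact and wildcard owner-name matches are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Rule:
    action: str                    # "nxdomain", "redirect", or "passthru"
    target: Optional[str] = None   # IP address, used only by "redirect"


# Hypothetical rule table; a "*." prefix matches any subdomain.
RULES: Dict[str, Rule] = {
    "malware.badsite.com": Rule("nxdomain"),
    "*.malware.badsite.com": Rule("nxdomain"),
    "api.example.com": Rule("redirect", "10.0.1.100"),
}


def lookup(qname: str, rules: Dict[str, Rule]) -> Optional[Rule]:
    """Return the matching rule, or None so normal resolution proceeds."""
    name = qname.rstrip(".").lower()
    if name in rules:
        return rules[name]
    labels = name.split(".")
    # Try wildcard matches from the most specific parent to the least.
    for i in range(1, len(labels)):
        wildcard = "*." + ".".join(labels[i:])
        if wildcard in rules:
            return rules[wildcard]
    return None


def resolve(qname: str, rules: Dict[str, Rule], upstream: Callable[[str], str]) -> str:
    """Check policy rules first, then fall back to the upstream answer."""
    rule = lookup(qname, rules)
    if rule is None or rule.action == "passthru":
        return upstream(qname)
    if rule.action == "nxdomain":
        return "NXDOMAIN"
    return rule.target
```

Passing the upstream resolver in as a callback keeps the policy check separate from resolution, which mirrors the idea of RPZ as a layer above the normal records.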

GCP Cloud DNS RPZ Support

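Important: GCP Cloud DNS does not consume standard RPZ zones (no RPZ zone files, zone transfers, or third-party RPZ feeds) as of writing. It does offer its own response policies feature, which covers a similar set of overrides for selected VPC networks but uses Cloud DNS rules rather than RPZ records.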

Alternatives in GCP:

  1. Custom Corefile in GKE (via CoreDNS plugins)
  2. Firewall rules (layer 4, not DNS)
  3. Cloud Armor (layer 7, not DNS)
  4. Unbound / custom DNS proxy (DIY approach)

For on-premises BIND DNS with RPZ:

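bind
# Policy zone that holds the RPZ rules
zone "rpz.internal" {
    type master;
    file "/etc/bind/rpz/internal-policy.zone";
};

zone "." {
    type hint;
    file "/etc/bind/db.root";
};

# Apply RPZ to queries. response-policy is only valid inside an
# options (or view) block; merge this into your existing options block.
options {
    response-policy {
        zone "rpz.internal";
    };
};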
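
The configuration above points at a policy zone file but does not show its contents. As a rough sketch, the Python below renders RPZ records using the standard encodings (`CNAME .` for NXDOMAIN, `CNAME *.` for NODATA, `CNAME rpz-passthru.` for pass-through, and a plain `A` record for a redirect). The helper names `rpz_record` and `render_zone`, the SOA and NS values, and the 60-second TTL are assumptions for illustration.

```python
from typing import List, Tuple


def rpz_record(name: str, action: str, target: str = "") -> str:
    """Render one RPZ record; owner names are relative to the policy zone."""
    owner = name.rstrip(".")
    if action == "nxdomain":
        return f"{owner} CNAME ."
    if action == "nodata":
        return f"{owner} CNAME *."
    if action == "passthru":
        return f"{owner} CNAME rpz-passthru."
    if action == "redirect":
        return f"{owner} A {target}"
    raise ValueError(f"unknown action: {action}")


def render_zone(serial: int, entries: List[Tuple[str, str, str]]) -> str:
    """Build the text of a policy zone file from (name, action, target) tuples."""
    lines = [
        "$TTL 60",
        # Placeholder SOA/NS values; an RPZ zone still needs both records.
        f"@ IN SOA localhost. root.localhost. ({serial} 3600 600 86400 60)",
        "@ IN NS localhost.",
    ]
    lines += [rpz_record(name, action, target) for name, action, target in entries]
    return "\n".join(lines) + "\n"


zone_text = render_zone(
    2024010101,
    [
        ("malware.badsite.com", "nxdomain", ""),
        ("*.malware.badsite.com", "nxdomain", ""),
        ("api.internal-test.example.com", "redirect", "10.0.2.100"),
    ],
)
```

Writing `zone_text` to `/etc/bind/rpz/internal-policy.zone` and reloading the zone would be the next step; the serial must increase on each change.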

GKE Implementation via CoreDNS

CoreDNS Rewrite Plugin (Closest to RPZ)

corefile
.:53 {
    cache 30
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      fallthrough in-addr.arpa ip6.arpa
      ttl 30
    }
    
    # Rewrite/redirect (similar to RPZ behavior)
    rewrite name regex (.*)\.internal-test\.example\.com {1}.internal.prod.example.com
    
    # Block with a custom response (template syntax: CLASS TYPE ZONE)
    template ANY ANY malware.badsite.com {
      rcode NXDOMAIN
    }
    
    forward . /etc/resolv.conf {
      max_concurrent 1000
    }
}

Custom DNS Proxy (Full RPZ)

Deploy custom DNS proxy with RPZ support:

bash
# Option 1: Unbound with RPZ
docker run -d \
  --name=dns-proxy \
  --net=host \
  --volume=/etc/unbound:/etc/unbound:ro \
  nlnetlabs/unbound:latest

# Option 2: Bind with RPZ
docker run -d \
  --name=dns-proxy \
  --net=host \
  --volume=/etc/bind:/etc/bind:ro \
  internetsystemsconsortium/bind9:latest

Production Patterns

Pattern 1: Malware Block List

Maintain list of known malware domains:
  - ransomware.badguys.com
  - c2.attacker.io
  - phishing.site.com

RPZ Rule: Block all
  Block any query to these domains → NXDOMAIN
  
Result:
  ✓ Prevents infection propagation
  ✓ Blocks C&C communication
  ✓ Audit trail of attempts

Pattern 2: Internal Service Redirect

Scenario: Staging env isolates from external APIs

Public: api.example.com → 35.201.100.50 (production)
Internal (staging): api.example.com → 10.0.2.100 (staging mock)

RPZ Rule (applied per source network, e.g. one resolver or BIND view per VPC):
  From staging VPC: api.example.com → 10.0.2.100
  From prod VPC: api.example.com → 35.201.100.50 (normal)
  
Result:
  ✓ Staging doesn't call prod APIs (safety)
  ✓ No code changes needed
  ✓ Enforced at DNS level
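
RPZ records do not carry a "from this network" condition themselves, so the per-VPC behavior has to come from which resolver (or resolver view) a client reaches. A minimal Python sketch of that selection logic follows; the CIDR range, the `OVERRIDES_BY_NETWORK` table, and `answer_for` are hypothetical and stand in for whatever mechanism actually separates staging from prod.

```python
import ipaddress
from typing import Dict, List, Tuple

# Hypothetical subnet for the staging VPC; substitute the real range.
OVERRIDES_BY_NETWORK: List[Tuple[ipaddress.IPv4Network, Dict[str, str]]] = [
    (ipaddress.ip_network("10.0.2.0/24"), {"api.example.com": "10.0.2.100"}),
]


def answer_for(client_ip: str, qname: str, public_answer: str) -> str:
    """Return the override for the client's network, else the public answer."""
    addr = ipaddress.ip_address(client_ip)
    name = qname.rstrip(".").lower()
    for network, overrides in OVERRIDES_BY_NETWORK:
        if addr in network and name in overrides:
            return overrides[name]
    return public_answer
```

A staging client at `10.0.2.17` would receive the mock address, while a client outside that range keeps the public answer.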

Pattern 3: Compliance Override

Requirement: Certain domains not accessible from prod

RPZ Rule:
  social-media.com → NXDOMAIN (from prod)
  news-site.com → NXDOMAIN (from prod)
  
Result:
  ✓ Blocks non-business services
  ✓ Audit trail for compliance
  ✓ Hard to bypass, provided egress to other resolvers (port 53, DoT, DoH) is also blocked

Implementation Strategy

Strategy 1: Layer 7 Filtering (Using Cloud Armor)

Cloud Armor inspects HTTP(S) traffic at the load balancer and never sees DNS queries, so it complements DNS-level policy rather than replacing it.

bash
# Create security policy
gcloud compute security-policies create dns-policy \
  --description="Layer 7 policy that complements DNS filtering"

# Allow requests originating from the US
gcloud compute security-policies rules create 1000 \
  --security-policy=dns-policy \
  --action=allow \
  --expression='origin.region_code == "US"'

# Block requests matching the preconfigured XSS rule set
gcloud compute security-policies rules create 1001 \
  --security-policy=dns-policy \
  --action=deny-403 \
  --expression='evaluatePreconfiguredExpr("xss-stable")'

Strategy 2: GKE CoreDNS Rewrite

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # Assumes the cluster's Corefile imports *.override keys from this ConfigMap
  internal.override: |
    # Block external APIs from prod (template syntax: CLASS TYPE ZONE)
    template ANY ANY payment-processor.com {
      rcode NXDOMAIN
    }
    template ANY ANY third-party-service.io {
      rcode NXDOMAIN
    }
    
    # Redirect staging queries
    rewrite name regex (.*)\.staging-api\.example\.com {1}.internal-staging.example.com
    
    # Custom responses
    template ANY A compliance-check.internal.company.com {
      answer "{{ .Name }} 60 IN A 10.0.1.100"
    }

Strategy 3: Custom Proxy (Most Control)

dockerfile
# Dockerfile for custom DNS proxy
FROM ubuntu:22.04

RUN apt-get update && \
    apt-get install -y unbound && \
    rm -rf /var/lib/apt/lists/*

COPY unbound.conf /etc/unbound/
COPY rpz.zone /etc/unbound/

CMD ["unbound", "-d"]

Monitoring & Auditing

Enable Query Logging

bash
# Enable DNS query logging for a VPC network
gcloud dns policies create rpz-log \
  --description="Log DNS queries for policy auditing" \
  --enable-logging \
  --networks=default

# View queries answered with NXDOMAIN. Cloud DNS logs have no "blocked"
# flag, so this also includes names that genuinely do not exist.
gcloud logging read 'resource.type="dns_query" AND jsonPayload.responseCode="NXDOMAIN"' \
  --limit=100 \
  --format=json | jq '.[] | {domain: .jsonPayload.queryName, rcode: .jsonPayload.responseCode}'
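
For a quick offline summary of exported logs, something like the Python below could count NXDOMAIN answers per domain. It is a sketch that assumes each entry carries `queryName` and `responseCode` under `jsonPayload`; the function name `nxdomain_counts` is made up for this example, and the field names should be adjusted if your log schema differs.

```python
import json
from collections import Counter


def nxdomain_counts(log_json: str) -> Counter:
    """Count NXDOMAIN responses per queried name in a JSON array of log entries."""
    counts: Counter = Counter()
    for entry in json.loads(log_json):
        payload = entry.get("jsonPayload", {})
        if payload.get("responseCode") == "NXDOMAIN":
            counts[payload.get("queryName", "<unknown>")] += 1
    return counts
```

Calling `nxdomain_counts(text).most_common(10)` on the output of a `--format=json` export gives the most frequently refused names. NXDOMAIN alone does not prove a policy block, so treat the counts as a starting point for review.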

Alert on Policy Violations

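bash
# Create alert for high block rate.
# dns_nxdomain_count is a hypothetical user-defined log-based metric
# (counting NXDOMAIN responses) that must be created first.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High DNS block rate" \
  --condition-display-name="NXDOMAIN responses above 100" \
  --condition-filter='resource.type="dns_query" AND metric.type="logging.googleapis.com/user/dns_nxdomain_count"' \
  --if="> 100" \
  --duration=300s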

Troubleshooting

Issue 1: RPZ Rule Not Blocking

bash
# 1. Verify config and policy zone syntax
named-checkconf /etc/bind/named.conf
named-checkzone rpz.internal /etc/bind/rpz/internal-policy.zone

# 2. Check that the zone is listed in the response-policy section
grep -A 5 "response-policy" /etc/bind/named.conf

# 3. Test the query against the RPZ-enabled resolver
#    (no +trace: it makes dig walk the authoritative servers itself,
#    so the resolver's policy is never applied)
dig @192.168.1.10 malware.badsite.com
# Expect "status: NXDOMAIN" in the header

# 4. Check logs
tail -f /var/log/syslog | grep -i rpz

Issue 2: False Positives (Blocking Legitimate Domain)

bash
# Symptom: service broken after an RPZ rule was added

# 1. Identify which queries are being rewritten
grep -i rpz /var/log/syslog | tail -n 100

# 2. Find the rule that matches
grep "legitimate-domain.com" /etc/bind/rpz/internal-policy.zone

# 3. Temporarily remove the rule, reload, and retest; re-add it if it was not the cause
rndc reload rpz.internal

# 4. Add an allowlist entry
#    legitimate-domain.com CNAME rpz-passthru.
#    Put it in a separate policy zone listed BEFORE the blocklist zone in
#    response-policy: precedence follows zone order, not line order in a file.
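
To see why ordering matters for an allowlist, it can help to model policy zones as an ordered list in which the first zone with a matching rule wins. The Python below is a simplified sketch of that idea: the zone contents and the `first_match` helper are hypothetical, and real resolvers also weigh trigger types and name specificity.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical policy zones: name -> action.
ALLOW_ZONE: Dict[str, str] = {"legitimate-domain.com": "passthru"}
BLOCK_ZONE: Dict[str, str] = {
    "legitimate-domain.com": "nxdomain",   # false positive from a block feed
    "c2.attacker.io": "nxdomain",
}


def first_match(
    qname: str, zones: List[Tuple[str, Dict[str, str]]]
) -> Tuple[Optional[str], str]:
    """Return (zone name, action) from the first zone containing the name."""
    name = qname.rstrip(".").lower()
    for zone_name, rules in zones:
        if name in rules:
            return zone_name, rules[name]
    return None, "resolve-normally"


allow_first = [("allow", ALLOW_ZONE), ("block", BLOCK_ZONE)]
block_first = [("block", BLOCK_ZONE), ("allow", ALLOW_ZONE)]
```

With `allow_first`, the legitimate domain passes through even though the block feed still lists it; with `block_first`, the same query is refused.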

Security Considerations

Data Privacy

DNS query logs record every domain that internal systems look up:

Queries logged: all domains accessed by internal systems
Risk: Sensitive lookups (and the clients that made them) could be exposed

Mitigation:
  1. Restrict access to DNS logs
     Only ops/security team can view
  
  2. Anonymize/summarize
     Drop or hash client IPs; keep only the queried domain and the policy action
  
  3. Retention policy
     Auto-delete logs after 30 days

Stale RPZ Answers in Caches

Answers rewritten by RPZ are cached downstream like any other answer:
  Query: api.example.com
  RPZ Match: → Redirect to 10.0.1.100
  Cache: 3600 seconds
  
  If the RPZ rule is removed later:
    Clients and downstream caches keep the old answer until the TTL expires
    → Queries still redirect incorrectly for up to an hour
    
Mitigation:
  1. Lower the TTL on RPZ records (e.g. 60 seconds)
  2. Flush caches when rules change
  3. Monitor cache hit rates
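
The effect of the TTL on how long a removed rule lingers can be illustrated with a toy cache. This Python sketch is not a resolver cache implementation: `TtlCache` is a hypothetical class, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Optional, Tuple


class TtlCache:
    """Toy answer cache: entries expire once `now` reaches their deadline."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expires_at)

    def put(self, name: str, answer: str, ttl: float, now: float) -> None:
        self._store[name] = (answer, now + ttl)

    def get(self, name: str, now: float) -> Optional[str]:
        entry = self._store.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if now >= expires_at:
            del self._store[name]   # expired: drop it and report a miss
            return None
        return answer

    def flush(self, name: str) -> None:
        """Drop an entry immediately, e.g. right after a rule change."""
        self._store.pop(name, None)
```

With a 3600-second TTL, an answer cached at time 0 is still served at time 1800 even if the rule was removed at time 10; with a 60-second TTL the same entry is gone by time 60. An explicit `flush` removes it at once.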

Best Practices

  1. Whitelist-based approach (allow what's needed, block everything else)
  2. Separate RPZ rules by team/environment (easier audit)
  3. Regular review of RPZ rules (remove obsolete rules)
  4. Monitor block rates (a spike can indicate a misconfiguration or an active infection)
  5. Document rationale for each block rule
  6. Test before production (RPZ impact can be severe)
  7. Logging + alerting (detect policy violations)
  8. Include bypass mechanism (emergency access)
