Troubleshooting Guide

This guide helps diagnose and resolve common operational issues with Cerbos deployments.

Health Check Failures

Symptoms

Health check endpoints return errors or non-200 status codes.

HTTP Health Check Returns 500

Possible Causes:

Cerbos server not fully initialized
Storage backend unavailable
Critical service failure

Diagnosis:

# Check HTTP endpoint directly
curl -v "http://localhost:3592/_cerbos/health?service=cerbos.svc.v1.CerbosService"

# Check server logs
docker logs cerbos --tail 100

# Check if server is listening
netstat -tlnp | grep 3592

Solutions:

Check storage backend connectivity
Review server startup logs for errors
Verify configuration file is valid
Ensure adequate resources (memory, CPU)
Check for port conflicts
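
When the 500 comes from a server that is still initializing, polling the health endpoint for a bounded time can help separate a slow start from a real failure. The following is a minimal sketch, not part of Cerbos: retry_until_ok is a hypothetical helper name, and the attempt count and delay in the example are illustrative assumptions.

```shell
# retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or ATTEMPTS is used up.
retry_until_ok() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumes the default HTTP port used in this guide):
#   retry_until_ok 30 2 curl -fsS -o /dev/null "http://localhost:3592/_cerbos/health" \
#     || echo "still unhealthy after 30 attempts; check the server logs"
```

If the endpoint never becomes healthy within the window, the causes listed above (storage backend, configuration, resources) are the more likely culprits.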

gRPC Health Check Timeout

Possible Causes:

Server not responding on gRPC port
TLS configuration mismatch
Network connectivity issues
Firewall blocking gRPC port

Diagnosis:

# Test gRPC connectivity
grpcurl -plaintext localhost:3593 list

# Test with TLS
grpcurl -cacert ca.crt localhost:3593 list

# Check health specifically
grpcurl -plaintext localhost:3593 \
  grpc.health.v1.Health/Check \
  -d '{"service":"cerbos.svc.v1.CerbosService"}'

Solutions:

Verify gRPC port (3593) is accessible
Check TLS configuration matches client
Ensure gRPC server started successfully
Review firewall/security group rules
Test with cerbos healthcheck --kind=grpc

Health Check Works but Requests Fail

Possible Causes:

Policy loading errors
Validation failures
Partial service degradation

Diagnosis:

# Check policy load status
cerbosctl --server=localhost:3593 get policies

# Test actual authorization request (CheckResources expects a "resources" list)
grpcurl -plaintext -d '{
  "principal": {"id": "user1", "roles": ["user"]},
  "resources": [{
    "resource": {"kind": "document", "id": "1"},
    "actions": ["view"]
  }]
}' localhost:3593 cerbos.svc.v1.CerbosService/CheckResources

Solutions:

Validate policy syntax: cerbos compile <policy-dir>
Check policy store synchronization
Review error logs for policy evaluation failures
Verify schema validation if using schemas

Storage Backend Issues

Git Storage Problems

Git Clone/Fetch Failures

Error Messages:

failed to clone repository
authentication failed
repository not found

Diagnosis:

# Test git access manually
git clone <repository-url> /tmp/test-clone

# Check SSH key
ssh-add -l
ssh -T git@github.com

# Check HTTPS credentials
git credential fill

Solutions:

For SSH:

storage:
  driver: git
  git:
    protocol: ssh
    url: git@github.com:org/policies.git
    sshAuth:
      privateKeyFile: /path/to/key
      # Ensure file is mounted and readable

For HTTPS:

storage:
  driver: git
  git:
    protocol: https
    url: https://github.com/org/policies.git
    auth:
      username: oauth2
      password: ${GITHUB_TOKEN}

Common Fixes:

Verify repository URL is correct
Check authentication credentials are valid
Ensure SSH key has no passphrase or use ssh-agent
For private repos, verify access permissions
Check network connectivity to git server

Policies Not Updating

Symptoms:

Policy changes not reflected in Cerbos
cerbos_dev_store_last_successful_refresh metric not updating

Diagnosis:

# Check update interval
grep updatePollInterval config.yaml

# Monitor metrics
curl -s http://localhost:3592/_cerbos/metrics | \
  grep cerbos_dev_store_last_successful_refresh

# Check sync errors
curl -s http://localhost:3592/_cerbos/metrics | \
  grep cerbos_dev_store_sync_error_count
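
To compare readings over time, it can help to pull a single sample value out of the metrics output instead of reading the raw grep result. The sketch below assumes the endpoint returns the standard Prometheus text format. metric_value is a hypothetical helper, not a Cerbos command, and it returns only the first matching series.

```shell
# metric_value NAME
# Hypothetical helper: reads Prometheus text format on stdin and prints the
# value of the first sample whose metric name is exactly NAME.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP/TYPE comment lines
    {
      key = $1
      sub(/[{].*/, "", key)       # strip the label set, if any
      if (key == name) { print $NF; exit }
    }'
}

# Example (assumes the default HTTP port used in this guide):
#   curl -s http://localhost:3592/_cerbos/metrics | \
#     metric_value cerbos_dev_store_sync_error_count
```

Taking two readings a poll interval apart shows whether the refresh metric is moving and whether the error count is climbing.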

Solutions:

storage:
  driver: git
  git:
    updatePollInterval: 60s  # Reduce for faster updates
    checkoutTimeout: 30s

Reduce updatePollInterval if needed
Check for network issues to git server
Review logs for fetch errors
Verify branch name is correct
Force refresh via Admin API if available

Database Storage Problems

Connection Pool Exhaustion

Error Messages:

could not open a connection to the database
connection pool timeout
too many connections

Diagnosis:

# Check active connections (PostgreSQL)
psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name='cerbos'"

# Monitor connection pool metrics
curl -s http://localhost:3592/_cerbos/metrics | grep pool

Solutions:

storage:
  driver: postgres
  postgres:
    connPool:
      maxOpen: 25      # Increase if needed
      maxIdle: 10
      maxLifetime: 300s
      maxIdleTime: 60s

Tuning Guidelines:

Set maxOpen based on the database server's capacity: (CPU cores × 2) + effective disk spindles
Keep maxIdle at ~50% of maxOpen
Ensure database max_connections > Cerbos instances × maxOpen
Monitor for connection leaks in application
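
As a worked example of the guidelines above, the arithmetic can be scripted. This is a minimal sketch: the function names are hypothetical, and the inputs (8 cores, 1 spindle, 4 Cerbos instances, max_connections of 100) are made-up values for illustration only.

```shell
# recommended_max_open CPU_CORES SPINDLES
# Applies the guideline (CPU cores x 2) + spindles.
recommended_max_open() {
  echo $(( $1 * 2 + $2 ))
}

# pool_fits_db MAX_CONNECTIONS INSTANCES MAX_OPEN
# Succeeds only if max_connections > instances x maxOpen.
pool_fits_db() {
  [ "$1" -gt $(( $2 * $3 )) ]
}

max_open=$(recommended_max_open 8 1)   # (8 x 2) + 1 = 17
max_idle=$(( max_open / 2 ))           # roughly 50% of maxOpen = 8
echo "maxOpen=$max_open maxIdle=$max_idle"

# 4 instances x 17 connections = 68, which is below max_connections = 100.
if pool_fits_db 100 4 "$max_open"; then
  echo "max_connections is sufficient"
else
  echo "raise max_connections or lower maxOpen"
fi
```

Other clients of the same database also count against max_connections, so the real headroom is usually smaller than this check suggests.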

Database Connection Failures

Error Messages:

connection refused
authentication failed
SSL required

Diagnosis:

# Test connection manually
psql "postgresql://user:pass@localhost:5432/cerbos?sslmode=require"

# Check network connectivity
telnet db-host 5432

# Verify SSL/TLS
openssl s_client -connect db-host:5432 -starttls postgres

Solutions:

Connection String:

storage:
  driver: postgres
  postgres:
    url: "postgresql://user:${DB_PASSWORD}@host:5432/cerbos?sslmode=verify-full&sslrootcert=/path/to/ca.crt"

SSL Modes:

disable: No SSL (development only)
require: Require SSL, don’t verify cert
verify-ca: Verify certificate authority
verify-full: Verify cert and hostname (recommended)

Common Fixes:

Verify database host and port
Check username and password
Ensure database exists
Configure SSL/TLS properly
Check firewall rules
Verify connection retry settings

Performance Issues

High Latency

Consistently High P50 Latency (> 10ms)

Diagnosis:

# Check median latency
histogram_quantile(0.50, 
  rate(cerbos_dev_engine_check_latency_bucket[5m])
)

# Check cache hit rate
sum(rate(cerbos_dev_cache_access_count{result="hit"}[5m])) / 
sum(rate(cerbos_dev_cache_access_count[5m]))

# Check compilation time
histogram_quantile(0.95, 
  rate(cerbos_dev_compiler_compile_duration_bucket[5m])
)

Common Causes:

Cache misses: Low cache hit rate (< 90%)
Complex policies: Heavy condition evaluation
Storage latency: Slow policy loading
Resource constraints: CPU/memory pressure
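
Without a Prometheus server at hand, a rough hit rate can be computed from the raw counters on the metrics endpoint. A minimal sketch: cache_hit_percent is a hypothetical helper, and it relies on the metric name and result="hit" label used in the query above. The counters are cumulative since process start, so the result is a lifetime average, not the 5-minute rate.

```shell
# cache_hit_percent
# Hypothetical helper: reads Prometheus text format on stdin and prints the
# share of cache accesses labelled result="hit", as a percentage.
cache_hit_percent() {
  awk '
    /^cerbos_dev_cache_access_count[{ ]/ {
      total += $NF
      if ($0 ~ /result="hit"/) hits += $NF
    }
    END {
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hits / total
    }'
}

# Example (assumes the default HTTP port used in this guide):
#   curl -s http://localhost:3592/_cerbos/metrics | cache_hit_percent
```

A value well below 90 shortly after startup is expected while the cache warms up; a persistently low value is what points at the cache-miss cause.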

Solutions:

Warm cache on startup
Simplify policy conditions
Optimize storage backend (see Performance guide)
Increase CPU/memory allocation
Review policy design patterns

Intermittent Latency Spikes

Diagnosis:

# Monitor GC pauses
curl -s http://localhost:3592/_cerbos/metrics | \
  grep go_gc_duration_seconds

# Check memory usage
curl -s http://localhost:3592/_cerbos/metrics | \
  grep process_resident_memory_bytes

# Review logs for cache evictions
docker logs cerbos 2>&1 | grep -i evict

Common Causes:

GC pressure: Frequent garbage collection
Cache evictions: Memory pressure causing cache churn
Storage sync: Policy updates during requests
Network issues: Intermittent connectivity problems

Solutions:

Increase memory allocation
Reduce GC frequency by giving the Go runtime more heap headroom (for example, by raising the GOGC environment variable)
Stagger policy updates across instances
Investigate network stability
Monitor P99 latency trends

Storage Backend Latency

Diagnosis:

# Bundle operation latency
histogram_quantile(0.95, 
  rate(cerbos_dev_store_bundle_op_latency_bucket[5m])
)

# Store poll count
rate(cerbos_dev_store_poll_count[5m])

# Sync errors
rate(cerbos_dev_store_sync_error_count[5m])

Solutions by Storage Type:

Git:

Increase updatePollInterval to reduce fetch frequency
Use local git cache/mirror
Reduce repository size

Database:

Tune connection pool (see Database section)
Add database indexes
Use read replicas

Blob Storage:

Use regional endpoints
Enable CDN/caching layer
Increase the poll interval if the network is slow, so that fetches happen less often

Memory Issues

Out of Memory (OOM) Errors

Symptoms:

container killed by OOMKiller
runtime: out of memory
fatal error: runtime: out of memory

Diagnosis:

# Check memory usage
curl -s http://localhost:3592/_cerbos/metrics | \
  grep process_resident_memory_bytes

# Monitor cache size
curl -s http://localhost:3592/_cerbos/metrics | \
  grep cerbos_dev_cache_live_objects

# Check container limits
docker stats cerbos
kubectl top pod cerbos-xxx

Memory Estimation:

Required Memory = Base (100MB)
                + (Policies × 1MB)
                + (Request Rate × 10KB)
                + (Audit Buffer × 1KB)
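
A worked example of the estimate, as a minimal sketch. The helper name and the sample inputs (200 policies, 1000 requests per second, an audit buffer of 2048 entries) are hypothetical, and the sketch assumes 1 MB = 1024 KB and rounds up to a whole MB.

```shell
# estimate_memory_mb POLICIES REQUESTS_PER_SECOND AUDIT_BUFFER_ENTRIES
# Hypothetical helper applying the estimation formula above, in KB internally.
estimate_memory_mb() {
  base_kb=$(( 100 * 1024 ))        # Base (100MB)
  policies_kb=$(( $1 * 1024 ))     # Policies x 1MB
  requests_kb=$(( $2 * 10 ))       # Request Rate x 10KB
  audit_kb=$(( $3 * 1 ))           # Audit Buffer x 1KB
  total_kb=$(( base_kb + policies_kb + requests_kb + audit_kb ))
  echo $(( (total_kb + 1023) / 1024 ))   # round up to whole MB
}

# 102400 + 204800 + 10000 + 2048 = 319248 KB
estimate_memory_mb 200 1000 2048   # prints 312
```

The result is a starting point for the memory request; the limit is then sized above it as described in the solutions.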

Solutions:

Kubernetes:

resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"  # 2-3x request

Docker:

docker run --memory=2g --memory-reservation=1g cerbos

Optimization:

Reduce policy count if excessive
Decrease audit buffer size
Implement policy archival strategy
Monitor for memory leaks

Memory Leak Detection

Symptoms:

Memory usage grows continuously
OOM after days/weeks of operation
GC cannot reclaim memory

Diagnosis:

# Monitor memory growth over time
curl -s http://localhost:3592/_cerbos/metrics | \
  grep -E '(process_resident_memory|go_memstats_alloc)'

# Check goroutine count
curl -s http://localhost:3592/_cerbos/metrics | \
  grep go_goroutines

# Enable pprof (if compiled with debug)
go tool pprof http://localhost:6060/debug/pprof/heap

Solutions:

Update to latest Cerbos version
Report issue with metrics/logs
Implement periodic pod restarts as workaround
Monitor for specific operations causing leaks

TLS and Certificate Issues

Certificate Verification Failed

Error Messages:

x509: certificate signed by unknown authority
tls: bad certificate
certificate has expired

Diagnosis:

# Check certificate details
openssl x509 -in /path/to/cert.crt -text -noout

# Verify certificate chain
openssl verify -CAfile ca.crt cert.crt

# Test TLS connection
openssl s_client -connect localhost:3593 -showcerts

Solutions:

Expired Certificate:

Check expiration: openssl x509 -enddate -noout -in cert.crt
Renew certificate
Cerbos auto-reloads on file change
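
To catch an expiring certificate before it causes failures, openssl x509 -checkend can be wrapped in a small check. A minimal sketch: cert_expires_within is a hypothetical helper name, and the 30-day window in the example is an arbitrary threshold.

```shell
# cert_expires_within CERT_FILE DAYS
# Hypothetical helper. Exit status:
#   0  the certificate expires within DAYS days (or has already expired)
#   1  the certificate is still valid after DAYS days
#   2  the file cannot be read
cert_expires_within() {
  [ -r "$1" ] || return 2
  seconds=$(( $2 * 86400 ))
  # -checkend exits 0 when the certificate is still valid after $seconds.
  if openssl x509 -checkend "$seconds" -noout -in "$1" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example:
#   if cert_expires_within /path/to/cert.crt 30; then
#     echo "certificate expires within 30 days; renew it"
#   fi
```

A file that is readable but not a valid PEM certificate is also reported as status 0, since openssl fails on it; check the openssl x509 -text output in that case.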

Wrong CA:

server:
  tls:
    cert: /path/to/cert.crt
    key: /path/to/key.key
    caCert: /path/to/ca.crt  # Must match cert issuer

Client-Side:

# Skip verification (testing only!)
cerbos healthcheck --insecure

# Use custom CA
cerbos healthcheck --ca-cert=/path/to/ca.crt

Certificate Not Reloading

Symptoms:

Updated certificate on disk
Cerbos still uses old certificate
No reload logged

Diagnosis:

# Check file permissions
ls -la /path/to/cert.crt

# Verify file was actually updated
stat /path/to/cert.crt

# Check Cerbos logs
docker logs cerbos 2>&1 | grep -i certificate

Solutions:

Ensure Cerbos has read permissions
Verify files are not symlinks to read-only mounts
Update both cert and key atomically
Check filesystem supports inotify (for auto-reload)
Restart Cerbos if auto-reload fails

Admin API Issues

Authentication Failed

Error Messages:

401 Unauthorized
authentication failed
invalid credentials

Diagnosis:

# Test credentials
curl -u cerbos:password https://localhost:3592/admin/policy

# Verify password hash generation (tr removes the line wrap some base64 versions add)
echo "password" | htpasswd -niBC 10 cerbos | cut -d ':' -f 2 | base64 | tr -d '\n'

Solutions:

Regenerate Password Hash:

# Generate new hash (tr removes the line wrap some base64 versions add)
echo "NewPassword" | htpasswd -niBC 10 cerbos | cut -d ':' -f 2 | base64 | tr -d '\n'

# Update configuration
server:
  adminAPI:
    enabled: true
    adminCredentials:
      username: cerbos
      passwordHash: <new-hash>

Common Issues:

Password hash not base64-encoded
Using bcrypt cost < 10
Using wrong username
Credentials from environment variables not set
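
The first two issues above (missing base64 encoding and a low bcrypt cost) can be checked mechanically before editing the configuration. A minimal sketch, assuming a base64 command that accepts -d (as in GNU coreutils) and the $2y$-style bcrypt prefix that htpasswd -B produces. check_password_hash is a hypothetical helper, not a Cerbos command.

```shell
# check_password_hash BASE64_VALUE
# Hypothetical helper: verifies that the value decodes to a bcrypt hash
# with a cost of at least 10. It does not verify the password itself.
check_password_hash() {
  if ! decoded=$(printf '%s' "$1" | base64 -d 2>/dev/null); then
    echo "not valid base64"
    return 1
  fi
  case "$decoded" in
    '$2'[aby]'$'[0-9][0-9]'$'*) ;;
    *) echo "decoded value is not a bcrypt hash"; return 1 ;;
  esac
  cost=${decoded#???\$}   # drop the 4-character prefix, e.g. "$2y$"
  cost=${cost%%\$*}       # keep the two cost digits
  cost=${cost#0}          # strip a leading zero, e.g. "05" becomes "5"
  if [ "$cost" -lt 10 ]; then
    echo "bcrypt cost $cost is below 10"
    return 1
  fi
  echo "ok (bcrypt cost $cost)"
}

# Example:
#   check_password_hash "$ADMIN_PASSWORD_HASH"
```

A plain-text password passed by mistake typically fails with "not valid base64" or "decoded value is not a bcrypt hash".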

Admin API Disabled

Error Messages:

404 Not Found
unimplemented
service not available

Solution:

server:
  adminAPI:
    enabled: true  # Must be explicitly enabled
    adminCredentials:
      username: admin
      passwordHash: ${ADMIN_PASSWORD_HASH}

Verify:

# Check service status via health endpoint
grpcurl -plaintext localhost:3593 \
  grpc.health.v1.Health/Check \
  -d '{"service":"cerbos.svc.v1.CerbosAdminService"}'

Audit Logging Issues

Audit Logs Not Written

Diagnosis:

# Check audit error count
curl -s http://localhost:3592/_cerbos/metrics | \
  grep cerbos_dev_audit_error_count

# Check configuration
grep -A 10 '^audit:' config.yaml

# Verify backend connectivity (if Kafka/Hub)

Solutions:

File Backend:

audit:
  enabled: true
  accessLogsEnabled: true
  decisionLogsEnabled: true
  backend: file
  file:
    path: /var/log/cerbos/audit.log
    # Ensure directory exists and is writable

Kafka Backend:

# Test Kafka connectivity
kafkacat -b kafka:9092 -L

# Check Kafka topic
kafka-topics --bootstrap-server kafka:9092 --list

Common Fixes:

Verify backend is enabled in config
Check file permissions (file backend)
Verify Kafka broker connectivity
Ensure topic exists and is writable
Check for network/firewall issues
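
For the file backend, the directory and permission checks can be scripted. A minimal sketch: check_audit_path is a hypothetical helper, and it has to run as the same user (and inside the same container or mount namespace) as Cerbos for the result to be meaningful.

```shell
# check_audit_path LOG_FILE
# Hypothetical helper: reports whether LOG_FILE could be written by the
# current user, based on the file itself or, if absent, its directory.
check_audit_path() {
  dir=$(dirname "$1")
  if [ ! -d "$dir" ]; then
    echo "directory $dir does not exist"
    return 1
  fi
  if [ -e "$1" ]; then
    if [ ! -w "$1" ]; then
      echo "file $1 exists but is not writable"
      return 1
    fi
  elif [ ! -w "$dir" ]; then
    echo "directory $dir is not writable"
    return 1
  fi
  echo "ok"
}

# Example (path taken from the file backend configuration above):
#   check_audit_path /var/log/cerbos/audit.log
```

In Kubernetes, this could be run via kubectl exec in the Cerbos pod, provided the image includes a shell.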

Oversized Audit Entries

Symptoms:

cerbos_dev_audit_oversized_entry_count increasing
Some requests not logged

Diagnosis:

rate(cerbos_dev_audit_oversized_entry_count[5m])

Solutions:

Kafka Backend:

audit:
  backend: kafka
  kafka:
    maxBufferedRecords: 1000  # Increase if needed
    compression: ['zstd']      # Use better compression

Filter Sensitive Data:

audit:
  excludeMetadataKeys:
    - authorization
    - x-large-header
  decisionLogFilters:
    checkResources:
      ignoreAllowAll: true  # Reduce log volume

Debug Mode

Enable debug logging temporarily:

# Send SIGUSR1 signal to Cerbos process
kill -USR1 $(pidof cerbos)

# Or in Kubernetes
kubectl exec cerbos-pod -- kill -USR1 1

Debug logging automatically reverts after 10 minutes. To change the duration, set the CERBOS_TEMP_LOG_LEVEL_DURATION environment variable:

export CERBOS_TEMP_LOG_LEVEL_DURATION=5m

Getting Help

Information to Collect

When reporting issues:

Version: cerbos version
Configuration: Sanitized config file
Logs: Last 100 lines of logs
Metrics: Relevant Prometheus metrics
Environment: Deployment method (Docker, K8s, binary)
Reproduction: Steps to reproduce the issue

Log Collection

# Docker
docker logs cerbos --tail 100 > cerbos.log

# Kubernetes
kubectl logs deployment/cerbos --tail=100 > cerbos.log

# Binary
journalctl -u cerbos -n 100 > cerbos.log

Support Channels

Community Slack: Join Cerbos Slack for community help
GitHub Issues: Open issues for bugs
Documentation: Check docs.cerbos.dev
Enterprise Support: Contact Cerbos for enterprise support

Common Error Messages

Error	Cause	Solution
`failed to load config`	Invalid YAML	Validate YAML syntax
`store initialization failed`	Storage backend unreachable	Check storage connectivity
`policy compilation failed`	Invalid policy syntax	Run `cerbos compile`
`connection refused`	Service not listening	Verify server started, check ports
`certificate signed by unknown authority`	TLS cert mismatch	Verify CA certificate
`authentication failed`	Wrong credentials	Regenerate password hash
`context deadline exceeded`	Timeout	Increase timeout, check network
`resource exhausted`	Rate limiting	Reduce request rate or increase limits
