Trace Tail-Based Sampling
Overview
Tail-based sampling makes intelligent sampling decisions after collecting complete trace information, enabling organizations to reduce trace volumes by 60-90% while preserving all errors, high-latency requests, and business-critical traces.
Key benefits:
- Evaluate complete trace context before deciding
- Preserve 100% of errors and anomalies
- Reduce storage costs while maintaining observability
- Scale horizontally in Kubernetes with consistent hashing
Core Architecture
Three-Tier Caching Strategy
Tail-based sampling uses a sophisticated three-tier cache system to optimize memory usage and processing speed. Each cache serves a specific purpose in the trace decision pipeline:
- Primary Buffer (LRU Cache)
- Stores active traces awaiting decision
- Default: 50,000 traces
- Memory: ~400-500 MB
- Keep Cache (Fast-Path)
- Previously sampled trace IDs
- Default: 20,000 IDs (~640 KB)
- Immediately forwards late spans
- Drop Cache (Fast-Path)
- Previously rejected trace IDs
- Default: 100,000 IDs (~3.2 MB)
- Immediately discards late spans
The Primary Buffer holds all traces currently being evaluated. Once a decision is made, the trace ID moves to either the Keep Cache (if sampled) or Drop Cache (if rejected). This allows late-arriving spans to be processed instantly without re-evaluating policies.
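As a rough sketch of how these sizes map to configuration, using the defaults above: decision_interval, batch_cache_size, and keep_cache_ttl are the parameter names used elsewhere in this guide, while the wrapper key and the keep/drop cache size keys are placeholders; confirm the exact field names in the Tail Sample Processor reference.

```yaml
# Sketch only. decision_interval, batch_cache_size, and keep_cache_ttl are the
# parameter names used in this guide; the wrapper key and the keep/drop cache
# size keys are placeholders; confirm them in the Tail Sample Processor reference.
tail_sample:
  decision_interval: 30s     # how long traces wait in the primary buffer
  batch_cache_size: 50000    # primary buffer default: ~400-500 MB of active traces
  keep_cache_size: 20000     # placeholder key: sampled trace IDs, ~640 KB
  drop_cache_size: 100000    # placeholder key: rejected trace IDs, ~3.2 MB
  keep_cache_ttl: 1h         # illustrative: how long sampled IDs stay in the keep cache
```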
Decision Flow
Each span follows this lifecycle from arrival to the final sampling decision: it first checks the Keep and Drop caches for a fast-path decision. If its trace ID is in neither cache, the span enters the primary buffer and waits for the decision interval to expire before policy evaluation occurs.
Typically 40-80% of spans get a fast-path decision from the caches, skipping policy evaluation entirely.
Policy Types
Tail-based sampling supports 10 policy types that can be combined to create sophisticated sampling strategies. The table below summarizes each policy type with its primary use case and a concrete example:
| Policy Type | Use Case | Example |
|---|---|---|
| Probabilistic | Baseline sampling | 10% of all traces |
| Latency | Slow requests | Traces > 2 seconds |
| Status Code | Errors | All ERROR status |
| Span Count | Filter noise | Traces with 12+ spans |
| String Attribute | Service filtering | payment-service only |
| Numeric Attribute | Business metrics | cart_value > $1000 |
| Boolean Attribute | Feature flags | experimental_feature=true |
| Condition (OTTL) | Complex logic | status>=400 AND tier="enterprise" |
| AND | Combine filters | Errors AND latency > 2s |
| DROP | Explicit rejection | Health checks with status=OK |
The first eight policy types make positive decisions (sample the trace), AND combines multiple criteria into a single decision, and DROP explicitly rejects traces. Policies are evaluated sequentially in the order they are defined, and evaluation short-circuits on the first match, so policies can be layered to create multi-stage filtering logic.
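The production example later on this page covers the drop, status_code, latency, string_attribute, AND, and probabilistic types. The fragment below sketches three of the remaining types; field names beyond name and policy_type are assumptions to verify against the Tail Sample Processor reference.

```yaml
# Illustrative fragments for policy types not shown in the production example.
# Field names beyond name/policy_type are assumptions; verify them against the
# Tail Sample Processor configuration reference.
sampling_policies:
  # Keep traces with enough spans to represent a real request path
  - name: deep_traces
    policy_type: span_count
    min_spans: 12                 # assumed field name
  # Keep high-value checkouts based on a numeric span attribute
  - name: large_carts
    policy_type: numeric_attribute
    key: cart_value
    min_value: 1000               # assumed field name
  # OTTL-style condition combining status code and customer tier
  - name: enterprise_errors
    policy_type: condition
    conditions:                   # assumed field name
      - 'attributes["http.status_code"] >= 400 and attributes["tier"] == "enterprise"'
```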
Memory Management
Sizing Formula
Accurately sizing memory for tail-based sampling is critical to prevent Out-Of-Memory (OOM) errors. Use this formula to calculate required memory based on your trace volume:
Required Memory = (Traces/sec × Decision Interval × Avg Trace Size) × 1.5
For example, with 5,000 traces/sec, a 30s interval, and 10 KB average trace size:
5,000 × 30 × 10,240 bytes × 1.5 ≈ 2.3 GB
The 1.5 multiplier accounts for overhead from garbage collection, cache structures, and the Go runtime. In this example you need roughly 2.3 GB for buffered traces plus overhead, so set a 3 GB pod memory limit with GOMEMLIMIT at 2.8 GB to leave safe headroom.
Key Parameters
- decision_interval: How long to wait before deciding (default: 30s). Set to 1.3-2x P99 trace completion latency. Longer intervals mean more complete traces but higher memory usage; shorter intervals reduce memory but risk incomplete traces.
- batch_cache_size: Max traces in buffer (default: 50,000). Monitor the eviction rate (target: < 5%) and size for 2x peak traffic.
- GOMEMLIMIT: Set to 90% of the pod memory limit. This prevents OOM kills by triggering GC before reaching the limit.
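Putting the first two parameters together for a deployment peaking around 2,500 traces/sec might look like the sketch below; the wrapper key is a placeholder and the values are examples, not defaults. GOMEMLIMIT itself is set on the container, as the goMemLimit value in the Helm example later in this page shows.

```yaml
# Illustrative sizing for a deployment peaking around 2,500 traces/sec.
tail_sample:                 # placeholder wrapper key
  decision_interval: 30s     # roughly 1.3-2x the measured P99 trace completion latency
  # In-flight traces ≈ 2,500 traces/sec × 30 s = 75,000; doubled for peak headroom
  batch_cache_size: 150000
```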
Kubernetes Deployment Sizing
Three-Tier Gateway Architecture
EdgeDelta’s gateway deployment separates concerns into three specialized tiers, each with distinct scaling characteristics (a Helm-values scaling sketch follows the list):
- Processor Tier (Port 4319)
- Stateless OTLP receivers
- Scale: CPU-based (75% target)
- Handles incoming trace data and routes to compactors
- Compactor Tier (Port 9199)
- Stateful tail sampling (memory-critical)
- Scale: Memory-based (70% target)
- All spans from the same trace must route to the same pod using consistent hashing
- Rollup Tier (Port 9200)
- Metric aggregation (RED: Rate, Errors, Duration)
- Scale: CPU-based (75% target)
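As a rough Helm-values sketch of the per-tier scaling targets above: compactorProps appears in the example later in this section, while processorProps, rollupProps, and the CPU target key are assumed to follow the same naming pattern, so confirm the actual keys in the chart's values file.

```yaml
# Rough sketch of per-tier autoscaling targets. compactorProps appears later in
# this page; processorProps, rollupProps, and the CPU target key are assumed
# names that follow the same pattern; check the chart's values.yaml.
processorProps:
  autoscaling:
    enabled: true
    targetForCPUUtilizationPercentage: 75      # stateless, CPU-bound
compactorProps:
  autoscaling:
    enabled: true
    targetForMemoryUtilizationPercentage: 70   # stateful tail sampling, memory-bound
rollupProps:
  autoscaling:
    enabled: true
    targetForCPUUtilizationPercentage: 75      # RED metric aggregation, CPU-bound
```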
Resource Sizing Guide
Use this table to estimate the number of pods and memory requirements based on your peak trace ingestion rate. These recommendations are based on production deployments with a 30-second decision interval:
| Trace Volume | Processor Pods | Compactor Pods | Compactor Memory | Total Cost/Month |
|---|---|---|---|---|
| < 500/sec | 2 | 1 | 2 GB | $50-100 |
| 500-1,000 | 2-3 | 1-2 | 3 GB | $100-200 |
| 1,000-2,500 | 3-5 | 2-3 | 4 GB | $200-400 |
| 2,500-5,000 | 5-8 | 3-5 | 6 GB | $400-800 |
| 5,000-10,000 | 8-12 | 5-8 | 8 GB | $800-1,500 |
The pod counts shown are the recommended starting points for minimum replicas. The compactor memory value represents the per-pod limit. Cost estimates assume AWS EKS c5.xlarge instances at $0.17/hour and include all three tiers.
Use consistent hashing by trace_id to ensure all spans from a trace reach the same compactor pod.
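EdgeDelta's processor tier performs this trace-affinity routing internally. Purely to illustrate the concept, the fragment below shows the same idea expressed with the upstream OpenTelemetry Collector's loadbalancing exporter; the service and namespace names are placeholders and this is not part of the EdgeDelta configuration.

```yaml
# Concept illustration only: EdgeDelta's gateway handles trace-affinity routing
# internally. This shows the equivalent idea with the OpenTelemetry Collector's
# loadbalancing exporter; service and namespace names are placeholders.
exporters:
  loadbalancing:
    routing_key: "traceID"        # hash spans to backends by trace ID
    protocol:
      otlp:
        timeout: 5s
    resolver:
      k8s:
        service: edgedelta-gateway-compactor.edgedelta
```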
Example Compactor Configuration
This Helm values configuration shows recommended settings for the compactor tier in a medium-traffic deployment (1,000-2,500 traces/sec):
```yaml
compactorProps:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 3000m
      memory: 3Gi
  goMemLimit: "2800MiB"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetForMemoryUtilizationPercentage: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
      scaleDown:
        stabilizationWindowSeconds: 600 # 10 minutes
```
The goMemLimit of 2800 MiB is roughly 91% of the 3 Gi memory limit, triggering garbage collection before the pod is OOM-killed. Memory-based autoscaling targets 70% utilization. Scale-up happens quickly (60 s stabilization) to handle traffic spikes, while scale-down waits 10 minutes to avoid thrashing during temporary dips.
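The deployment checklist later on this page also calls for topology spread. A minimal sketch for spreading compactor replicas across zones is shown below; the labels are placeholders to match against what your chart actually applies, and whether the chart exposes topologySpreadConstraints directly is an assumption to verify.

```yaml
# Spread compactor replicas across zones for high availability.
# The labels are placeholders; match them to the labels your chart applies,
# and verify whether the chart exposes topologySpreadConstraints directly.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway       # prefer spreading without blocking scheduling
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: compactor
```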
Trace Completeness Handling
The Challenge
OpenTelemetry spans arrive independently, with no explicit marker that a trace is complete, so the sampler relies on time-based heuristics to decide when a trace has finished.
Solution: Decision Interval Tuning
Set decision_interval to 1.3-2x your P99 trace completion latency. For example, if P99 completion latency is 15 seconds, set decision_interval to 20-30 seconds; this ensures approximately 99% of traces are evaluated with complete span data.
You can measure trace completion latency with this PromQL query:
```promql
histogram_quantile(0.99, rate(trace_span_arrival_duration_bucket[5m]))
```
Handling Asynchronous Spans
Async operations like message queues and background jobs require special handling because spans can arrive minutes or hours apart. For synchronous HTTP requests, all spans typically arrive within milliseconds, so a 30-second decision window works well. However, asynchronous workflows fail with this default because a producer span might arrive immediately while the consumer span arrives 5 minutes later after message queue processing.
Best practices:
- Pure sync API: 10-30s decision interval
- Mixed sync/async: 60-120s
- Heavy async (queues): 300-600s
- Extend keep_cache_ttl for async workloads (1-24 hours); see the sketch below
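A minimal sketch for a queue-heavy workload, reusing the parameter names from this guide; the wrapper key and the values are illustrative.

```yaml
# Queue-heavy workload: long decision window plus an extended keep-cache TTL so
# late consumer spans still take the fast path. Wrapper key and values are illustrative.
tail_sample:
  decision_interval: 300s    # 5 minutes to cover producer-to-consumer lag
  keep_cache_ttl: 6h         # late spans of sampled traces are forwarded immediately
  batch_cache_size: 150000   # a longer window holds more in-flight traces; resize the buffer too
```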
Key Metrics to Monitor
Critical Alerts
Set up these three critical Prometheus alerts to detect operational issues before they impact trace sampling.
High eviction rate indicates insufficient buffer capacity. This alert fires when more than 10 traces per second are being removed from the buffer before decisions complete:
```promql
rate(edgedelta_tail_sampling_evictions_total[5m]) > 10
```
Memory pressure alerts trigger at 85% usage, giving time to scale before OOM:
```promql
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
```
Late span arrival indicates incomplete traces. Alert when the rate of edgedelta_tail_sampling_late_spans_total exceeds roughly 10% of total span throughput; a sustained high ratio means spans are arriving after their traces have already been decided and the decision interval is too short.
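As a sketch, the eviction and memory-pressure expressions above drop into a standard Prometheus rules file as follows; alert names, durations, and severities are suggestions, and the late-span ratio is omitted because it depends on which total-span-throughput metric your deployment exposes.

```yaml
# Sketch of a Prometheus rules file using the expressions above.
# Alert names, durations, and severities are suggestions.
groups:
  - name: tail-sampling
    rules:
      - alert: TailSamplingHighEvictionRate
        expr: rate(edgedelta_tail_sampling_evictions_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traces are evicted before sampling decisions complete"
      - alert: TailSamplingMemoryPressure
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Compactor memory above 85% of its limit"
```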
Health Indicators
Monitor these metrics to ensure your tail sampling deployment is operating efficiently:
- Buffer Utilization: < 80% (healthy)
- Cache Hit Rate (Keep): > 80% (optimal)
- Eviction Rate: < 1% (target)
- Sampling Rate: 10-30% overall (cost-effective)
Production Best Practices
Policy Design
- Always start with DROP policies - Eliminate health checks first
- Sample all errors - Errors are rare but critical
- Implement tiered latency sampling - 100% of P99+, 50% of P95+, 10% of P50+
- Establish baseline with probabilistic - Final 5-10% catch-all policy
- Avoid over-sampling - Target 10-30% overall rate
Example Production Policy
This production-ready policy configuration demonstrates best practices for tail-based sampling. Policies are ordered strategically with DROP first, critical traces next, and a probabilistic baseline last:
```yaml
sampling_policies:
  # 1. Drop known noise
  - name: drop_health_checks
    policy_type: drop
    sub_policies:
      - policy_type: string_attribute
        key: http.route
        values: ["/health", "/ready"]
  # 2. Always sample errors
  - name: all_errors
    policy_type: status_code
    status_codes: [ERROR]
  # 3. Sample high latency
  - name: slow_requests
    policy_type: latency
    lower_threshold: 2s
  # 4. Sample critical services at higher rate
  - name: critical_services
    policy_type: and
    sub_policies:
      - policy_type: string_attribute
        key: service.name
        values: [payment-service, auth-service]
      - policy_type: probabilistic
        percentage: 50
  # 5. Baseline for everything else
  - name: baseline
    policy_type: probabilistic
    percentage: 5
```
Traces first encounter the DROP policy which explicitly rejects health checks. Surviving traces are then evaluated for errors (100% sampled), high latency (100% sampled), critical services (50% sampled), and finally all remaining traces get a 5% baseline sample. This approach ensures 100% error visibility while managing overall volume.
Deployment Checklist
- Measure P99 trace completion latency
- Calculate memory requirements using formula
- Set GOMEMLIMIT to 90% of pod memory limit
- Configure consistent hashing by trace_id
- Enable HPA with memory target (70%) for compactor
- Set topology spread for high availability
- Configure ServiceMonitor for Prometheus metrics (example after this checklist)
- Create alerts for eviction rate, memory pressure, late spans
- Load test at 2x peak traffic
- Verify cache hit rates > 80%
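For the ServiceMonitor item above, a minimal sketch might look like the following; the namespace, labels, and metrics port name are placeholders to match against the Service your chart actually creates.

```yaml
# Placeholder namespace, labels, and port name; match them to the Service your
# chart creates for the gateway pods.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: edgedelta-gateway
  namespace: edgedelta
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: edgedelta-gateway
  endpoints:
    - port: metrics          # placeholder metrics port name
      interval: 30s
```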
Quick Reference
Common Commands
```bash
# View current sampling rate
kubectl logs -n edgedelta compactor-pod | grep "sampling_rate"

# Check memory usage
kubectl top pods -n edgedelta | grep compactor

# Scale compactor manually
kubectl scale deployment edgedelta-gateway-compactor --replicas=4
```
Troubleshooting
Common issues and their resolutions when operating tail-based sampling in production:
| Symptom | Solution |
|---|---|
| High eviction rate | Increase batch_cache_size or scale compactor pods |
| OOM kills | Set GOMEMLIMIT, reduce cache size, or scale horizontally |
| Incomplete traces | Increase decision_interval or keep_cache_ttl |
| Low sampling rate | Check DROP policies, verify policy ordering |
| High CPU | Optimize policy ordering, reduce regex complexity |
Each symptom indicates a specific operational issue. High eviction means traces are being removed from the buffer before decisions complete. OOM kills suggest memory limits are too low. Incomplete traces indicate late-arriving spans. Low sampling rates often result from overly aggressive DROP policies. High CPU usage typically comes from inefficient policy evaluation.
See Also
- Tail Sample Processor - Configuration reference for all sampling policy types and YAML syntax
- Consistent Probabilistic Sampling - Edge Delta blog on sampling implementation
- OpenTelemetry Trace Specification