Trace Tail-Based Sampling
Overview
Tail-based sampling makes intelligent sampling decisions after collecting complete trace information, enabling organizations to reduce trace volumes by 60-90% while preserving all errors, high-latency requests, and business-critical traces.
Key benefits:
- Evaluate complete trace context before deciding
- Preserve 100% of errors and anomalies
- Reduce storage costs while maintaining observability
- Scale horizontally in Kubernetes with consistent hashing
Core Architecture
Three-Tier Caching Strategy
Tail-based sampling uses a sophisticated three-tier cache system to optimize memory usage and processing speed. Each cache serves a specific purpose in the trace decision pipeline:
- Primary Buffer (LRU Cache)
- Stores active traces awaiting decision
- Default: 50,000 traces
- Memory: ~400-500 MB
- Keep Cache (Fast-Path)
- Previously sampled trace IDs
- Default: 20,000 IDs (~640 KB)
- Immediately forwards late spans
- Drop Cache (Fast-Path)
- Previously rejected trace IDs
- Default: 100,000 IDs (~3.2 MB)
- Immediately discards late spans
The Primary Buffer holds all traces currently being evaluated. Once a decision is made, the trace ID moves to either the Keep Cache (if sampled) or Drop Cache (if rejected). This allows late-arriving spans to be processed instantly without re-evaluating policies.
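As a rough sketch of how these sizes map to configuration, using the defaults above: decision_interval, batch_cache_size, and keep_cache_ttl are the parameter names used elsewhere in this guide, while the wrapper key and the keep/drop cache size keys are placeholders; confirm the exact field names in the Tail Sample Processor reference.

```yaml
# Sketch only. decision_interval, batch_cache_size, and keep_cache_ttl are the
# parameter names used in this guide; the wrapper key and the keep/drop cache
# size keys are placeholders; confirm them in the Tail Sample Processor reference.
tail_sample:
  decision_interval: 30s     # how long traces wait in the primary buffer
  batch_cache_size: 50000    # primary buffer default: ~400-500 MB of active traces
  keep_cache_size: 20000     # placeholder key: sampled trace IDs, ~640 KB
  drop_cache_size: 100000    # placeholder key: rejected trace IDs, ~3.2 MB
  keep_cache_ttl: 1h         # illustrative: how long sampled IDs stay in the keep cache
```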
Decision Flow
Each span follows this lifecycle from arrival to the final sampling decision: it first checks the Keep and Drop caches for a fast-path decision. If its trace ID is in neither cache, the span enters the primary buffer and waits for the decision interval to expire before policy evaluation occurs.
Typically 40-80% of spans get a fast-path decision from the caches, skipping policy evaluation entirely.
Policy Types
Tail-based sampling supports 10 policy types that can be combined to create sophisticated sampling strategies. The table below summarizes each policy type with its primary use case and a concrete example:
| Policy Type | Use Case | Example |
|---|---|---|
| Probabilistic | Baseline sampling | 10% of all traces |
| Latency | Slow requests | Traces > 2 seconds |
| Status Code | Errors | All ERROR status |
| Span Count | Filter noise | Traces with 12+ spans |
| String Attribute | Service filtering | payment-service only |
| Numeric Attribute | Business metrics | cart_value > $1000 |
| Boolean Attribute | Feature flags | experimental_feature=true |
| Condition (OTTL) | Complex logic | status>=400 AND tier="enterprise" |
| AND | Combine filters | Errors AND latency > 2s |
| DROP | Explicit rejection | Health checks with status=OK |
The first eight policy types make positive decisions (sample the trace), AND combines multiple criteria into a single decision, and DROP explicitly rejects traces. Policies are evaluated sequentially in the order they are defined, and evaluation short-circuits on the first match, so policies can be layered to create multi-stage filtering logic.
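The production example later on this page covers the drop, status_code, latency, string_attribute, AND, and probabilistic types. The fragment below sketches three of the remaining types; field names beyond name and policy_type are assumptions to verify against the Tail Sample Processor reference.

```yaml
# Illustrative fragments for policy types not shown in the production example.
# Field names beyond name/policy_type are assumptions; verify them against the
# Tail Sample Processor configuration reference.
sampling_policies:
  # Keep traces with enough spans to represent a real request path
  - name: deep_traces
    policy_type: span_count
    min_spans: 12                 # assumed field name
  # Keep high-value checkouts based on a numeric span attribute
  - name: large_carts
    policy_type: numeric_attribute
    key: cart_value
    min_value: 1000               # assumed field name
  # OTTL-style condition combining status code and customer tier
  - name: enterprise_errors
    policy_type: condition
    conditions:                   # assumed field name
      - 'attributes["http.status_code"] >= 400 and attributes["tier"] == "enterprise"'
```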
Memory Management
Sizing Formula
Accurately sizing memory for tail-based sampling is critical to prevent Out-Of-Memory (OOM) errors. Use this formula to calculate required memory based on your trace volume:
Required Memory = (Traces/sec × Decision Interval × Avg Trace Size) × 1.5
For example, with 5,000 traces/sec, a 30s interval, and 10 KB average trace size:
5,000 × 30 × 10,240 bytes × 1.5 ≈ 2.3 GB
The 1.5 multiplier accounts for overhead from garbage collection, cache structures, and the Go runtime. In this example you need roughly 2.3 GB for buffered traces plus overhead, so set a 3 GB pod memory limit with GOMEMLIMIT at 2.8 GB to leave safe headroom.
Key Parameters
- decision_interval: How long to wait before deciding (default: 30s). Set to 1.3-2x P99 trace completion latency. Longer intervals mean more complete traces but higher memory usage; shorter intervals reduce memory but risk incomplete traces.
- batch_cache_size: Max traces in buffer (default: 50,000). Monitor the eviction rate (target: < 5%) and size for 2x peak traffic.
- GOMEMLIMIT: Set to 90% of the pod memory limit. This prevents OOM kills by triggering GC before reaching the limit.
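Putting the first two parameters together for a deployment peaking around 2,500 traces/sec might look like the sketch below; the wrapper key is a placeholder and the values are examples, not defaults. GOMEMLIMIT itself is set on the container, as the goMemLimit value in the Helm example later in this page shows.

```yaml
# Illustrative sizing for a deployment peaking around 2,500 traces/sec.
tail_sample:                 # placeholder wrapper key
  decision_interval: 30s     # roughly 1.3-2x the measured P99 trace completion latency
  # In-flight traces ≈ 2,500 traces/sec × 30 s = 75,000; doubled for peak headroom
  batch_cache_size: 150000
```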
Kubernetes Deployment Sizing
Three-Tier Gateway Architecture
EdgeDelta’s gateway deployment separates concerns into three specialized tiers, each with distinct scaling characteristics (a Helm-values scaling sketch follows the list):
- Processor Tier (Port 4319)
- Stateless OTLP receivers
- Scale: CPU-based (75% target)
- Handles incoming trace data and routes to compactors
- Compactor Tier (Port 9199)
- Stateful tail sampling (memory-critical)
- Scale: Memory-based (70% target)
- All spans from the same trace must route to the same pod using consistent hashing
- Rollup Tier (Port 9200)
- Metric aggregation (RED: Rate, Errors, Duration)
- Scale: CPU-based (75% target)
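As a rough Helm-values sketch of the per-tier scaling targets above: compactorProps appears in the example later in this section, while processorProps, rollupProps, and the CPU target key are assumed to follow the same naming pattern, so confirm the actual keys in the chart's values file.

```yaml
# Rough sketch of per-tier autoscaling targets. compactorProps appears later in
# this page; processorProps, rollupProps, and the CPU target key are assumed
# names that follow the same pattern; check the chart's values.yaml.
processorProps:
  autoscaling:
    enabled: true
    targetForCPUUtilizationPercentage: 75      # stateless, CPU-bound
compactorProps:
  autoscaling:
    enabled: true
    targetForMemoryUtilizationPercentage: 70   # stateful tail sampling, memory-bound
rollupProps:
  autoscaling:
    enabled: true
    targetForCPUUtilizationPercentage: 75      # RED metric aggregation, CPU-bound
```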
Resource Sizing Guide
Use this table to estimate the number of pods and memory requirements based on your peak trace ingestion rate. These recommendations are based on production deployments with a 30-second decision interval:
| Trace Volume | Processor Pods | Compactor Pods | Compactor Memory | Total Cost/Month |
|---|---|---|---|---|
| < 500/sec | 2 | 1 | 2 GB | $50-100 |
| 500-1,000 | 2-3 | 1-2 | 3 GB | $100-200 |
| 1,000-2,500 | 3-5 | 2-3 | 4 GB | $200-400 |
| 2,500-5,000 | 5-8 | 3-5 | 6 GB | $400-800 |
| 5,000-10,000 | 8-12 | 5-8 | 8 GB | $800-1,500 |
The pod counts shown are the recommended starting points for minimum replicas. The compactor memory value represents the per-pod limit. Cost estimates assume AWS EKS c5.xlarge instances at $0.17/hour and include all three tiers.
Use consistent hashing by trace_id to ensure all spans from a trace reach the same compactor pod.
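EdgeDelta's processor tier performs this trace-affinity routing internally. Purely to illustrate the concept, the fragment below shows the same idea expressed with the upstream OpenTelemetry Collector's loadbalancing exporter; the service and namespace names are placeholders and this is not part of the EdgeDelta configuration.

```yaml
# Concept illustration only: EdgeDelta's gateway handles trace-affinity routing
# internally. This shows the equivalent idea with the OpenTelemetry Collector's
# loadbalancing exporter; service and namespace names are placeholders.
exporters:
  loadbalancing:
    routing_key: "traceID"        # hash spans to backends by trace ID
    protocol:
      otlp:
        timeout: 5s
    resolver:
      k8s:
        service: edgedelta-gateway-compactor.edgedelta
```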
Example Compactor Configuration
This Helm values configuration shows recommended settings for the compactor tier in a medium-traffic deployment (1,000-2,500 traces/sec):
```yaml
compactorProps:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 3000m
      memory: 3Gi
  goMemLimit: "2800MiB"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetForMemoryUtilizationPercentage: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
      scaleDown:
        stabilizationWindowSeconds: 600 # 10 minutes
```
The goMemLimit of 2800 MiB is roughly 91% of the 3 Gi memory limit, triggering garbage collection before the pod is OOM-killed. Memory-based autoscaling targets 70% utilization. Scale-up happens quickly (60 s stabilization) to handle traffic spikes, while scale-down waits 10 minutes to avoid thrashing during temporary dips.
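The deployment checklist later on this page also calls for topology spread. A minimal sketch for spreading compactor replicas across zones is shown below; the labels are placeholders to match against what your chart actually applies, and whether the chart exposes topologySpreadConstraints directly is an assumption to verify.

```yaml
# Spread compactor replicas across zones for high availability.
# The labels are placeholders; match them to the labels your chart applies,
# and verify whether the chart exposes topologySpreadConstraints directly.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway       # prefer spreading without blocking scheduling
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: compactor
```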
Trace Completeness Handling
The Challenge
OpenTelemetry spans arrive independently, with no explicit marker that a trace is complete, so the sampler relies on time-based heuristics to decide when a trace has finished.
Solution: Decision Interval Tuning
Set decision_interval to 1.3-2x your P99 trace completion latency. For example, if P99 completion latency is 15 seconds, set decision_interval to 20-30 seconds; this ensures approximately 99% of traces are evaluated with complete span data.
You can measure trace completion latency with this PromQL query:
```promql
histogram_quantile(0.99, rate(trace_span_arrival_duration_bucket[5m]))
```
Handling Asynchronous Spans
Async operations like message queues and background jobs require special handling because spans can arrive minutes or hours apart. For synchronous HTTP requests, all spans typically arrive within milliseconds, so a 30-second decision window works well. However, asynchronous workflows fail with this default because a producer span might arrive immediately while the consumer span arrives 5 minutes later after message queue processing.
Best practices:
- Pure sync API: 10-30s decision interval
- Mixed sync/async: 60-120s
- Heavy async (queues): 300-600s
- Extend keep_cache_ttl for async workloads (1-24 hours); see the sketch below
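A minimal sketch for a queue-heavy workload, reusing the parameter names from this guide; the wrapper key and the values are illustrative.

```yaml
# Queue-heavy workload: long decision window plus an extended keep-cache TTL so
# late consumer spans still take the fast path. Wrapper key and values are illustrative.
tail_sample:
  decision_interval: 300s    # 5 minutes to cover producer-to-consumer lag
  keep_cache_ttl: 6h         # late spans of sampled traces are forwarded immediately
  batch_cache_size: 150000   # a longer window holds more in-flight traces; resize the buffer too
```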
Key Metrics to Monitor
Critical Alerts
Set up these three critical Prometheus alerts to detect operational issues before they impact trace sampling.
High eviction rate indicates insufficient buffer capacity. This alert fires when more than 10 traces per second are being removed from the buffer before decisions complete:
```promql
rate(edgedelta_tail_sampling_evictions_total[5m]) > 10
```
Memory pressure alerts trigger at 85% usage, giving time to scale before OOM:
```promql
container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
```
Late span arrival indicates incomplete traces. Alert when the rate of edgedelta_tail_sampling_late_spans_total exceeds roughly 10% of total span throughput; a sustained high ratio means spans are arriving after their traces have already been decided and the decision interval is too short.
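As a sketch, the eviction and memory-pressure expressions above drop into a standard Prometheus rules file as follows; alert names, durations, and severities are suggestions, and the late-span ratio is omitted because it depends on which total-span-throughput metric your deployment exposes.

```yaml
# Sketch of a Prometheus rules file using the expressions above.
# Alert names, durations, and severities are suggestions.
groups:
  - name: tail-sampling
    rules:
      - alert: TailSamplingHighEvictionRate
        expr: rate(edgedelta_tail_sampling_evictions_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traces are evicted before sampling decisions complete"
      - alert: TailSamplingMemoryPressure
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Compactor memory above 85% of its limit"
```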
Health Indicators
Monitor these metrics to ensure your tail sampling deployment is operating efficiently:
- Buffer Utilization: < 80% (healthy)
- Cache Hit Rate (Keep): > 80% (optimal)
- Eviction Rate: < 1% (target)
- Sampling Rate: 10-30% overall (cost-effective)
Production Best Practices
Policy Design
- Always start with DROP policies - Eliminate health checks first
- Sample all errors - Errors are rare but critical
- Implement tiered latency sampling - 100% of P99+, 50% of P95+, 10% of P50+
- Establish baseline with probabilistic - Final 5-10% catch-all policy
- Avoid over-sampling - Target 10-30% overall rate
Example Production Policy
This production-ready policy configuration demonstrates best practices for tail-based sampling. Policies are ordered strategically with DROP first, critical traces next, and a probabilistic baseline last:
```yaml
sampling_policies:
  # 1. Drop known noise
  - name: drop_health_checks
    policy_type: drop
    sub_policies:
      - policy_type: string_attribute
        key: http.route
        values: ["/health", "/ready"]
  # 2. Always sample errors
  - name: all_errors
    policy_type: status_code
    status_codes: [ERROR]
  # 3. Sample high latency
  - name: slow_requests
    policy_type: latency
    lower_threshold: 2s
  # 4. Sample critical services at higher rate
  - name: critical_services
    policy_type: and
    sub_policies:
      - policy_type: string_attribute
        key: service.name
        values: [payment-service, auth-service]
      - policy_type: probabilistic
        percentage: 50
  # 5. Baseline for everything else
  - name: baseline
    policy_type: probabilistic
    percentage: 5
```
Traces first encounter the DROP policy which explicitly rejects health checks. Surviving traces are then evaluated for errors (100% sampled), high latency (100% sampled), critical services (50% sampled), and finally all remaining traces get a 5% baseline sample. This approach ensures 100% error visibility while managing overall volume.
Deployment Checklist
- Measure P99 trace completion latency
- Calculate memory requirements using formula
- Set GOMEMLIMIT to 90% of pod memory limit
- Configure consistent hashing by trace_id
- Enable HPA with memory target (70%) for compactor
- Set topology spread for high availability
- Configure ServiceMonitor for Prometheus metrics (example after this checklist)
- Create alerts for eviction rate, memory pressure, late spans
- Load test at 2x peak traffic
- Verify cache hit rates > 80%
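For the ServiceMonitor item above, a minimal sketch might look like the following; the namespace, labels, and metrics port name are placeholders to match against the Service your chart actually creates.

```yaml
# Placeholder namespace, labels, and port name; match them to the Service your
# chart creates for the gateway pods.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: edgedelta-gateway
  namespace: edgedelta
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: edgedelta-gateway
  endpoints:
    - port: metrics          # placeholder metrics port name
      interval: 30s
```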
Quick Reference
Common Commands
```bash
# View current sampling rate
kubectl logs -n edgedelta compactor-pod | grep "sampling_rate"

# Check memory usage
kubectl top pods -n edgedelta | grep compactor

# Scale compactor manually
kubectl scale deployment edgedelta-gateway-compactor --replicas=4
```
Troubleshooting
Common issues and their resolutions when operating tail-based sampling in production:
| Symptom | Solution |
|---|---|
| High eviction rate | Increase batch_cache_size or scale compactor pods |
| OOM kills | Set GOMEMLIMIT, reduce cache size, or scale horizontally |
| Incomplete traces | Increase decision_interval or keep_cache_ttl |
| Low sampling rate | Check DROP policies, verify policy ordering |
| High CPU | Optimize policy ordering, reduce regex complexity |
Each symptom indicates a specific operational issue. High eviction means traces are being removed from the buffer before decisions complete. OOM kills suggest memory limits are too low. Incomplete traces indicate late-arriving spans. Low sampling rates often result from overly aggressive DROP policies. High CPU usage typically comes from inefficient policy evaluation.
See Also
- Tail Sample Processor - Configuration reference for all sampling policy types and YAML syntax
- Consistent Probabilistic Sampling - Edge Delta blog on sampling implementation
- OpenTelemetry Trace Specification