Trace Tail-Based Sampling

A comprehensive guide to tail-based sampling for distributed tracing, covering memory management, scalability, and Kubernetes gateway sizing for production deployments.

Overview

Tail-based sampling makes intelligent sampling decisions after collecting complete trace information, enabling organizations to reduce trace volumes by 60-90% while preserving all errors, high-latency requests, and business-critical traces.

Key benefits:

  • Evaluate complete trace context before deciding
  • Preserve 100% of errors and anomalies
  • Reduce storage costs while maintaining observability
  • Scale horizontally in Kubernetes with consistent hashing

Core Architecture

Three-Tier Caching Strategy

Tail-based sampling uses a sophisticated three-tier cache system to optimize memory usage and processing speed. Each cache serves a specific purpose in the trace decision pipeline:

  • Primary Buffer (LRU Cache)
    • Stores active traces awaiting decision
    • Default: 50,000 traces
    • Memory: ~400-500 MB
  • Keep Cache (Fast-Path)
    • Previously sampled trace IDs
    • Default: 20,000 IDs (~640 KB)
    • Immediately forwards late spans
  • Drop Cache (Fast-Path)
    • Previously rejected trace IDs
    • Default: 100,000 IDs (~3.2 MB)
    • Immediately discards late spans

The Primary Buffer holds all traces currently being evaluated. Once a decision is made, the trace ID moves to either the Keep Cache (if sampled) or Drop Cache (if rejected). This allows late-arriving spans to be processed instantly without re-evaluating policies.
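
If you outgrow the defaults, the three caches are sized independently. The sketch below is a minimal tuning example for a higher-volume deployment; batch_cache_size and keep_cache_ttl are the fields documented in this guide, while the top-level key and the keep/drop cache size fields are assumed names to verify against your configuration reference:

tail_sample:                  # assumed top-level key
  batch_cache_size: 100000    # primary buffer: active traces awaiting a decision
  keep_cache_size: 40000      # assumed field: sampled trace IDs for fast-path keeps
  drop_cache_size: 200000     # assumed field: rejected trace IDs for fast-path drops
  keep_cache_ttl: 1h          # how long sampled IDs remain eligible for late spans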

Decision Flow

The following diagram illustrates the complete lifecycle of a span from arrival through final sampling decision:

flowchart LR
    A[Span Arrival] --> B[Cache Check]
    B --> C["Buffer (Wait 30s)"]
    C --> D[Policy Evaluation]
    D --> E[Sample/Drop]

Each span first checks the Keep/Drop caches for a fast-path decision. If not found, it enters the buffer and waits for the decision interval to expire before policy evaluation occurs.

Typically 40-80% of spans get a fast-path decision from the caches, skipping policy evaluation entirely.

Policy Types

Tail-based sampling supports 10 policy types that can be combined to create sophisticated sampling strategies. The table below summarizes each policy type with its primary use case and a concrete example:

| Policy Type       | Use Case           | Example                               |
|-------------------|--------------------|---------------------------------------|
| Probabilistic     | Baseline sampling  | 10% of all traces                     |
| Latency           | Slow requests      | Traces > 2 seconds                    |
| Status Code       | Errors             | All ERROR status                      |
| Span Count        | Filter noise       | Traces with 12+ spans                 |
| String Attribute  | Service filtering  | payment-service only                  |
| Numeric Attribute | Business metrics   | cart_value > $1000                    |
| Boolean Attribute | Feature flags      | experimental_feature=true             |
| Condition (OTTL)  | Complex logic      | status >= 400 AND tier = "enterprise" |
| AND               | Combine filters    | Errors AND latency > 2s               |
| DROP              | Explicit rejection | Health checks with status=OK          |

Each policy evaluates traces independently. The first eight policies make positive decisions (sample the trace), while AND combines multiple criteria, and DROP explicitly rejects traces. Policies can be layered to create multi-stage filtering logic.

Policy evaluation is sequential with short-circuit (first match wins).
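
Because evaluation short-circuits, ordering is part of the policy logic. As a minimal illustration using the same fields as the production example later in this guide, a DROP rule placed first guarantees that health-check traces never reach the probabilistic baseline:

sampling_policies:
  # Evaluated first: matching traces are rejected and never reach later policies
  - name: drop_health_checks
    policy_type: drop
    sub_policies:
      - policy_type: string_attribute
        key: http.route
        values: ["/health"]
  # Only traces that survived the DROP rule are eligible for the baseline sample
  - name: baseline
    policy_type: probabilistic
    percentage: 10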

Memory Management

Sizing Formula

Accurately sizing memory for tail-based sampling is critical to prevent Out-Of-Memory (OOM) errors. Use this formula to calculate required memory based on your trace volume:

Required Memory = (Traces/sec × Decision Interval × Avg Trace Size) × 1.5

For example, with 5,000 traces/sec, a 30s interval, and 10 KB average trace size:

5,000 traces/sec × 30 s × 10,240 bytes × 1.5 ≈ 2.2 GB

The 1.5 multiplier accounts for overhead from garbage collection, cache structures, and the Go runtime. In this example, raw trace buffering needs roughly 1.5 GB and the multiplier raises the requirement to about 2.2 GB, so a 3 GB pod memory limit with GOMEMLIMIT set to 2.8 GB provides safe headroom.

Key Parameters

  • decision_interval: How long to wait before deciding (default: 30s). Set to 1.3-2x P99 trace completion latency. Longer intervals mean more complete traces but higher memory usage; shorter intervals reduce memory but risk incomplete traces.

  • batch_cache_size: Max traces in buffer (default: 50,000). Monitor eviction rate (target: < 5%) and size for 2x peak traffic.

  • GOMEMLIMIT: Set to 90% of pod memory limit. This prevents OOM kills by triggering GC before reaching the limit.
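
A minimal sketch tying the three parameters above together; decision_interval and batch_cache_size are shown under an assumed tail_sample block, and goMemLimit reuses the field from the Helm example later in this guide:

tail_sample:                  # assumed placement for the sampling parameters
  decision_interval: 30s      # 1.3-2x your measured P99 trace completion latency
  batch_cache_size: 50000     # size for ~2x peak traffic; keep eviction rate < 5%

compactorProps:
  resources:
    limits:
      memory: 3Gi
  goMemLimit: "2800MiB"       # just under the 3Gi limit so GC runs before an OOM kill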

Kubernetes Deployment Sizing

Three-Tier Gateway Architecture

EdgeDelta’s gateway deployment separates concerns into three specialized tiers, each with distinct scaling characteristics:

  • Processor Tier (Port 4319)
    • Stateless OTLP receivers
    • Scale: CPU-based (75% target)
    • Handles incoming trace data and routes to compactors
  • Compactor Tier (Port 9199)
    • Stateful tail sampling (memory-critical)
    • Scale: Memory-based (70% target)
    • All spans from the same trace must route to the same pod using consistent hashing
  • Rollup Tier (Port 9200)
    • Metric aggregation (RED: Rate, Errors, Duration)
    • Scale: CPU-based (75% target)
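
The compactor tier's Helm values appear later in this guide; for the CPU-scaled tiers, the sketch below shows the same idea with a 75% CPU target. The processorProps and rollupProps keys and the autoscaling field names mirror the compactor example and are assumptions to verify against your chart's values reference:

processorProps:
  autoscaling:
    enabled: true
    minReplicas: 2
    targetCPUUtilizationPercentage: 75   # assumed field name; CPU target per the tier description above

rollupProps:
  autoscaling:
    enabled: true
    minReplicas: 2
    targetCPUUtilizationPercentage: 75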

Resource Sizing Guide

Use this table to estimate the number of pods and memory requirements based on your peak trace ingestion rate. These recommendations are based on production deployments with a 30-second decision interval:

| Trace Volume (traces/sec) | Processor Pods | Compactor Pods | Compactor Memory (per pod) | Total Cost/Month |
|---------------------------|----------------|----------------|----------------------------|------------------|
| < 500                     | 2              | 1              | 2 GB                       | $50-100          |
| 500-1,000                 | 2-3            | 1-2            | 3 GB                       | $100-200         |
| 1,000-2,500               | 3-5            | 2-3            | 4 GB                       | $200-400         |
| 2,500-5,000               | 5-8            | 3-5            | 6 GB                       | $400-800         |
| 5,000-10,000              | 8-12           | 5-8            | 8 GB                       | $800-1,500       |
The pod counts shown are the recommended starting points for minimum replicas. The compactor memory value represents the per-pod limit. Cost estimates assume AWS EKS c5.xlarge instances at $0.17/hour and include all three tiers.

Use consistent hashing by trace_id to ensure all spans from a trace reach the same compactor pod.
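
For reference, the upstream OpenTelemetry Collector expresses the same idea with its load-balancing exporter, which hashes on the trace ID so every span of a trace reaches the same backend. This is an illustrative sketch of that exporter, not EdgeDelta's gateway configuration, and the Kubernetes service name is a placeholder:

exporters:
  loadbalancing:
    routing_key: traceID      # route by trace_id so whole traces land on one pod
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: edgedelta-gateway-compactor.edgedelta   # placeholder service.namespace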

Example Compactor Configuration

This Helm values configuration shows recommended settings for the compactor tier in a medium-traffic deployment (1,000-2,500 traces/sec):

compactorProps:
  replicas: 2
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 3000m
      memory: 3Gi
  goMemLimit: "2800MiB"

  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetForMemoryUtilizationPercentage: 70
    behavior:
      scaleUp:
        stabilizationWindowSeconds: 60
      scaleDown:
        stabilizationWindowSeconds: 600  # 10 minutes

The goMemLimit is set to 93% of the memory limit (2.8 GB of 3 GB), triggering garbage collection before hitting OOM. Memory-based autoscaling targets 70% utilization. Scale-up happens quickly (60s stabilization) to handle traffic spikes, while scale-down waits 10 minutes to avoid thrashing during temporary dips.

Trace Completeness Handling

The Challenge

OpenTelemetry spans arrive independently, and there is no explicit marker that a trace is complete. Tail sampling therefore relies on time-based heuristics to decide when a trace is finished.

Solution: Decision Interval Tuning

Set decision_interval to 1.3-2x your P99 trace completion latency. For example, if P99 is 15 seconds, use 20-30 seconds; this ensures approximately 99% of traces are evaluated with complete span data.

You can measure trace completion latency with this PromQL query:

histogram_quantile(0.99, rate(trace_span_arrival_duration_bucket[5m]))

Handling Asynchronous Spans

Async operations like message queues and background jobs require special handling because spans can arrive minutes or hours apart. For synchronous HTTP requests, all spans typically arrive within milliseconds, so a 30-second decision window works well. However, asynchronous workflows fail with this default because a producer span might arrive immediately while the consumer span arrives 5 minutes later after message queue processing.

Best practices:

  • Pure sync API: 10-30s decision interval
  • Mixed sync/async: 60-120s
  • Heavy async (queues): 300-600s
  • Extend keep_cache_ttl for async workloads (1-24 hours)
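
As a concrete sketch of the heavy-async case in the list above, using the decision_interval and keep_cache_ttl fields documented in this guide (their placement under a tail_sample block is an assumption):

tail_sample:
  decision_interval: 300s     # wait 5 minutes so consumer spans from queues can arrive
  keep_cache_ttl: 6h          # keep forwarding very late spans of already-sampled traces
  batch_cache_size: 100000    # longer windows hold more traces, so grow the buffer too

Remember that the memory formula scales linearly with the decision interval, so a 300-second window needs roughly ten times the buffer memory of a 30-second window at the same trace rate.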

Key Metrics to Monitor

Critical Alerts

Set up these three critical Prometheus alerts to detect operational issues before they impact trace sampling.

High eviction rate indicates insufficient buffer capacity. This alert fires when more than 10 traces per second are being removed from the buffer before decisions complete:

rate(edgedelta_tail_sampling_evictions_total[5m]) > 10

Memory pressure alerts trigger at 85% usage, giving time to scale before OOM:

container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85

Late span arrival indicates incomplete traces. This alert fires when more than 10% of spans arrive after their trace has already been decided, suggesting the decision interval is too short:

rate(edgedelta_tail_sampling_late_spans_total[5m]) > 10% of total spans
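
Packaged as Prometheus alerting rules, the three expressions could look like the sketch below. The alert names and the edgedelta_tail_sampling_spans_total counter used for the late-span ratio are hypothetical placeholders; substitute the metrics your deployment actually exposes:

groups:
  - name: tail-sampling
    rules:
      - alert: TailSamplingHighEvictionRate
        expr: rate(edgedelta_tail_sampling_evictions_total[5m]) > 10
        for: 5m
      - alert: TailSamplingMemoryPressure
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
      - alert: TailSamplingLateSpans
        # hypothetical total-spans counter; adjust to the metric you export
        expr: |
          rate(edgedelta_tail_sampling_late_spans_total[5m])
            / rate(edgedelta_tail_sampling_spans_total[5m]) > 0.10
        for: 10m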

Health Indicators

Monitor these metrics to ensure your tail sampling deployment is operating efficiently:

  • Buffer Utilization: < 80% (healthy)
  • Cache Hit Rate (Keep): > 80% (optimal)
  • Eviction Rate: < 1% (target)
  • Sampling Rate: 10-30% overall (cost-effective)
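
Expressed as PromQL, the buffer and cache indicators can be tracked with ratios like the following. Every metric name here is a hypothetical placeholder (only edgedelta_tail_sampling_evictions_total appears elsewhere in this guide), so map them to whatever your collector exports:

# Buffer utilization: active traces vs. configured capacity (hypothetical gauges)
edgedelta_tail_sampling_buffer_size / edgedelta_tail_sampling_buffer_capacity

# Keep-cache hit rate over 5 minutes (hypothetical counters)
rate(edgedelta_tail_sampling_keep_cache_hits_total[5m])
  / rate(edgedelta_tail_sampling_cache_lookups_total[5m])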

Production Best Practices

Policy Design

  1. Always start with DROP policies - Eliminate health checks first
  2. Sample all errors - Errors are rare but critical
  3. Implement tiered latency sampling - 100% of P99+, 50% of P95+, 10% of P50+
  4. Establish baseline with probabilistic - Final 5-10% catch-all policy
  5. Avoid over-sampling - Target 10-30% overall rate

Example Production Policy

This production-ready policy configuration demonstrates best practices for tail-based sampling. Policies are ordered strategically with DROP first, critical traces next, and a probabilistic baseline last:

sampling_policies:
  # 1. Drop known noise
  - name: drop_health_checks
    policy_type: drop
    sub_policies:
      - policy_type: string_attribute
        key: http.route
        values: ["/health", "/ready"]

  # 2. Always sample errors
  - name: all_errors
    policy_type: status_code
    status_codes: [ERROR]

  # 3. Sample high latency
  - name: slow_requests
    policy_type: latency
    lower_threshold: 2s

  # 4. Sample critical services at higher rate
  - name: critical_services
    policy_type: and
    sub_policies:
      - policy_type: string_attribute
        key: service.name
        values: [payment-service, auth-service]
      - policy_type: probabilistic
        percentage: 50

  # 5. Baseline for everything else
  - name: baseline
    policy_type: probabilistic
    percentage: 5

Traces first encounter the DROP policy which explicitly rejects health checks. Surviving traces are then evaluated for errors (100% sampled), high latency (100% sampled), critical services (50% sampled), and finally all remaining traces get a 5% baseline sample. This approach ensures 100% error visibility while managing overall volume.

Deployment Checklist

  • Measure P99 trace completion latency
  • Calculate memory requirements using formula
  • Set GOMEMLIMIT to 90% of pod memory limit
  • Configure consistent hashing by trace_id
  • Enable HPA with memory target (70%) for compactor
  • Set topology spread for high availability (see the sketch after this checklist)
  • Configure ServiceMonitor for Prometheus metrics
  • Create alerts for eviction rate, memory pressure, late spans
  • Load test at 2x peak traffic
  • Verify cache hit rates > 80%
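
For the topology spread item in the checklist above, a standard Kubernetes constraint spreads compactor pods across zones; the label selector is a placeholder for whatever labels your chart applies:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway      # prefer spreading without blocking scheduling
    labelSelector:
      matchLabels:
        app: edgedelta-gateway-compactor   # placeholder label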

Quick Reference

Common Commands

# View current sampling rate
kubectl logs -n edgedelta compactor-pod | grep "sampling_rate"

# Check memory usage
kubectl top pods -n edgedelta | grep compactor

# Scale compactor manually
kubectl scale deployment edgedelta-gateway-compactor --replicas=4
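
# Check compactor autoscaler status (assumes the HPA lives in the edgedelta namespace)
kubectl get hpa -n edgedelta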

Troubleshooting

Common issues and their resolutions when operating tail-based sampling in production:

| Symptom            | Solution                                                  |
|--------------------|-----------------------------------------------------------|
| High eviction rate | Increase batch_cache_size or scale compactor pods         |
| OOM kills          | Set GOMEMLIMIT, reduce cache size, or scale horizontally  |
| Incomplete traces  | Increase decision_interval or keep_cache_ttl              |
| Low sampling rate  | Check DROP policies, verify policy ordering               |
| High CPU           | Optimize policy ordering, reduce regex complexity         |

Each symptom indicates a specific operational issue. High eviction means traces are being removed from the buffer before decisions complete. OOM kills suggest memory limits are too low. Incomplete traces indicate late-arriving spans. Low sampling rates often result from overly aggressive DROP policies. High CPU usage typically comes from inefficient policy evaluation.

See Also