Scale Edge Delta Deployments
Recap
A pipeline is a set of agents with a single configuration. There are different types of pipelines that you can deploy depending on your architecture.
- See Architecture for an overview of the pipelines.
- See Kubernetes Manifests for a description of each pipeline’s manifests.
- See Sizing and Capacity Estimations.
- See Deployment Examples for default installations of node, coordinator and gateway pipelines in clusters with different numbers of nodes.
- See Helm values that are made available.
Overview
In environments with a high volume of logs and metrics, the default replica counts for the rollup and compactor agents in node and gateway pipelines might not be sufficient.
1. Gather Information.
Check the status of the pods in the environment. Identify and record the `NAME` of any pods that are not in the `Ready` `STATUS`:
kubectl get pods -n edgedelta
View the pod description with the `kubectl describe` command:
kubectl describe pods <pod name> -n edgedelta >> describe.txt
Replace `<pod name>` with the pod's name as per the NAME column from the previous step. Also change `edgedelta` if the components were installed in a different namespace.
Then view the pod logs with the `kubectl logs` command and the `--previous` flag:
kubectl logs <pod name> -n edgedelta --previous >> pod.log
Again, replace `<pod name>` with the pod's name and adjust the namespace if needed.
See Debug the Installation of Edge Delta Components for more information.
2. Interpret Information
Examine describe.txt. In the `Containers` section, find `Last State` and the `Reason` field. If the Reason is `OOMKilled`, the pod was restarted due to memory pressure. In that case, increase the replica count, the resource limits, or both. If CPU consumption is close to the resource limits, provision more resources for the processor agents and/or the rollup and compactor agents.
There can also be resource pressure in the system even without a restart with the `OOMKilled` reason. Examine the pod.log file and check for logs indicating resource limitations. Verify that the pod isn't running out of assigned resources such as CPU and memory, as this can cause crashes and errors.
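As a quick filter, the describe output captured in step 1 can be scanned for `OOMKilled` terminations. A minimal sketch — the inline sample below stands in for your real describe.txt:

```shell
# Create a stand-in for the describe.txt captured in step 1 (sample data only).
cat > describe-sample.txt <<'EOF'
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
EOF

# Count lines mentioning OOMKilled; any hit indicates memory pressure.
HITS=$(grep -c 'OOMKilled' describe-sample.txt)
echo "OOMKilled terminations found: ${HITS}"
```

Running the same `grep` against your actual describe.txt quickly surfaces which containers were restarted for memory pressure.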
If you have the Kubernetes Metrics Server installed, you can also identify resource pressure using `kubectl top pods`:
kubectl top pods -n edgedelta
3. Estimate Resources
If there are restarts with `OOMKilled` as the reason, or there is resource pressure, you need to increase the replica count and/or resource limits for the affected component (i.e., change specs only for impacted components, not for all). Before doing so, estimate the resource requirements. See Sizing and Capacity Estimations.
Processor Agent: It is usually unnecessary to update replica counts for agents running in DaemonSet mode. For other modes, increase the replica count if you prefer many small pods. Raise resource limits instead only when overall consumption is low but occasional memory spikes cause the `OOMKilled` status. To size the limits, observe the maximum memory demanded in the metrics explorer, grouped by host.
Rollup Agent: The rollup agent can process approximately 10GB/hour of metric volume with at most 1000m CPU and 500MB memory consumption per replica. For example, if the agents send 90GB/hour, the total resource requirement is 9000m CPU and 4.5GB memory. With a 10% safety margin, this volume can be handled by 5 replicas with 2000m CPU and 1GB memory each. For high availability there should be at least 2 replicas. After rollup, the data volume per hour sent to the compactor agent is reduced by roughly 60x.
Compactor Agent: The compactor agent can process approximately 1GB/hour of logs or metrics coming from the rollup agent per 300m CPU and 200MB memory. For example, if the rollup agent sends 22GB/hour, the total resource requirement is 6600m CPU and 4.4GB memory. With a safety margin, this volume can be handled by 3 replicas with 2500m CPU and 1.8GB memory each. For high availability there should be at least 2 replicas.
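The sizing arithmetic above can be sketched in a few lines of shell. The throughput constants come from the reference figures in this section, and the 10% margin matches the examples; the 90GB/hour input is just the worked example, not your actual volume:

```shell
# Estimate total rollup-agent resources from hourly metric volume.
# Reference throughput (from this section): ~10 GB/h per 1000m CPU and 500MB memory.
VOLUME_GB_H=90

CPU_M=$(( VOLUME_GB_H * 1000 / 10 ))   # total millicores before margin
MEM_MB=$(( VOLUME_GB_H * 500 / 10 ))   # total MB before margin

# Apply a 10% safety margin (integer arithmetic).
CPU_M=$(( CPU_M * 110 / 100 ))
MEM_MB=$(( MEM_MB * 110 / 100 ))

echo "rollup total: ${CPU_M}m CPU, ${MEM_MB}MB memory"
```

Dividing the totals by a per-replica budget (for example 2000m CPU) gives the replica count; round up, and keep at least 2 replicas for high availability.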
4. Update Resource Limits
Performing these steps helps ensure that your Edge Delta agent pods have the resources they need to function effectively and reduces the likelihood of evictions due to resource constraints. Assigning too many resources can be costly and inefficient, so it's important to find the right balance based on the actual usage metrics and performance of the pods.
To change the rollup and compactor agent resource limits, use the `compactorProps.resources` and `rollUpProps.resources` fields. To increase replica counts, use the `compactorProps.replicas` and `rollUpProps.replicas` fields.
Suppose you want to change the replica count for the rollup agent from 2 to 3 and adjust the CPU and memory resource limits. Add these flags to your `helm upgrade` command:
--set rollUpProps.replicas=3 \
--set rollUpProps.resources.limits.cpu=2000m \
--set rollUpProps.resources.limits.memory=2Gi
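For context, a complete command might look like the following sketch. The release name `edgedelta`, the `edgedelta/edgedelta` chart reference, and the values file name are assumptions — substitute the names from your own installation:

```shell
# Assumed release name, chart reference, and values file; adjust to your installation.
helm upgrade edgedelta edgedelta/edgedelta -n edgedelta \
  -f values.yaml \
  --set rollUpProps.replicas=3 \
  --set rollUpProps.resources.limits.cpu=2000m \
  --set rollUpProps.resources.limits.memory=2Gi
```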
Note: Individual parameters passed in with `--set` take precedence over values passed in with values files.
- Monitor the Rollout: Observe the rollout of the updated configuration to ensure that pods are scheduled and running:
kubectl rollout status daemonset/edgedelta-agent -n edgedelta
Replace daemonset/edgedelta-agent with the relevant workload type and name generated by your Helm release, for example:
kubectl rollout status deployment/edgedelta-rollup -n edgedelta
- Verify No Evictions: After applying the changes, monitor the pods to ensure that they’re not being evicted by issuing:
kubectl get pods -n edgedelta
In the output, verify that the pods remain in a Running state and do not go into an Evicted state.
- Double-check Resource Usage: To ensure that the new resource requests and limits are appropriate, check the resource usage of the pods:
kubectl top pods -n edgedelta
This will help confirm that the pods have sufficient resources and aren’t using more than anticipated.
Scaling Gateway Components
Gateway HPAs calculate utilization from `resources.requests`, not limits. Keep requests at 60–80% of steady-state usage to allow room for bursty traffic.
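That guideline can be turned into quick arithmetic: given observed steady-state usage and a target fraction, derive the request. The usage figure below is a made-up example, not a benchmark:

```shell
# Derive a CPU request so that steady-state usage sits at ~70% of it.
STEADY_CPU_M=1400   # observed steady-state usage in millicores (example value)
TARGET_PCT=70       # where in the 60-80% band steady state should land

REQUEST_M=$(( STEADY_CPU_M * 100 / TARGET_PCT ))
echo "resources.requests.cpu=${REQUEST_M}m"
```

The same calculation applies to memory; repeat it per component using the usage observed with `kubectl top pods`.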
Gateway Component Reference Sizing
| Component (Workload) | Typical Volume Example | Benchmark Usage* | Notes |
|---|---|---|---|
| Processor (Deployment) | 10 GB/h total logs, traces, and metrics | ≈1500m CPU, ≈1500Mi memory per replica | Carries out parsing, enrichment, and fan-out |
| Rollup (Deployment) | 90 GB/h raw metrics | ≈9000m CPU, ≈4.5Gi memory in total | Reduces metric cardinality and frequency |
| Compactor (Deployment) | 22 GB/h | ≈6600m CPU, ≈4.4Gi memory in total | Compresses and encodes data before egress |

*Values include a modest safety margin but should be validated with `kubectl top pods`.
5. Enable HPA for Gateway Processor
Let Kubernetes scale the gateway automatically by attaching a Horizontal Pod Autoscaler (HPA) with the following Helm flags:
--set deployment.autoscaling.enabled=true \
--set deployment.autoscaling.minReplicas=2 \
--set deployment.autoscaling.maxReplicas=10 \
--set deployment.autoscaling.targetForCPUUtilizationPercentage=80 \
--set deployment.autoscaling.targetForMemoryUtilizationPercentage=80
Advanced Tuning (optional)
If you need finer control over how fast replicas are added or removed, override the HPA's default `behavior` values:
--set-json 'deployment.autoscaling.behavior={
"scaleUp": {
"stabilizationWindowSeconds": 60,
"policies":[{"type":"Percent","value":50,"periodSeconds":60}]
},
"scaleDown": {
"stabilizationWindowSeconds":300
}
}'
A scale‑up window of 60s keeps the fleet responsive, while a scale‑down window of 300s avoids thrashing during traffic dips.
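To make the `Percent` policy concrete: with the setting above, during each 60-second period the HPA may add at most 50% of the current replica count. A quick sketch with an example replica count (Kubernetes' own rounding may differ slightly for odd counts):

```shell
# Maximum replicas the scale-up policy above can add per 60s period.
CURRENT_REPLICAS=4   # example value
POLICY_PCT=50        # the Percent policy value from the behavior override

MAX_ADD=$(( CURRENT_REPLICAS * POLICY_PCT / 100 ))
echo "at most ${MAX_ADD} new replicas this period"
```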
6. Enable HPA for Rollup and Compactor
Now configure autoscaling for the two downstream gateway components: Rollup (handles metric reduction) and Compactor (handles compression and egress).
Each example shows the minimal Helm flags to turn on HPAs and set sensible replica limits:
For rollup agents:
--set rollUpProps.autoscaling.enabled=true \
--set rollUpProps.autoscaling.minReplicas=1 \
--set rollUpProps.autoscaling.maxReplicas=8 \
--set rollUpProps.autoscaling.targetForCPUUtilizationPercentage=75 \
--set rollUpProps.autoscaling.targetForMemoryUtilizationPercentage=80
For compactor agents:
--set compactorProps.autoscaling.enabled=true \
--set compactorProps.autoscaling.minReplicas=1 \
--set compactorProps.autoscaling.maxReplicas=6 \
--set compactorProps.autoscaling.targetForCPUUtilizationPercentage=75 \
--set compactorProps.autoscaling.targetForMemoryUtilizationPercentage=80
7. Validate Gateway Scaling
With the HPAs configured, use `kubectl` to make sure they exist, are reading metrics, and are steering replica counts the way you expect.
- Check HPA objects: This command lists every HPA in the `edgedelta` namespace and shows its current versus desired replica count.
kubectl get hpa -n edgedelta
- Inspect utilization and replica histories: Describe each HPA to see live CPU/memory percentages, scaling events, and any recent errors.
kubectl describe hpa edgedelta-gateway -n edgedelta
kubectl describe hpa edgedelta-gateway-rollup -n edgedelta
kubectl describe hpa edgedelta-gateway-compactor -n edgedelta
- Verify pod-level metrics: Finally, spot-check individual pods to ensure real CPU and memory usage line up with HPA targets.
kubectl top pods -n edgedelta
8. Gateway Scaling Best Practices
Before you switch the workload on, double-check these best practices:
- Size HPAs from requests, and keep limits roughly 2–4x higher for burst headroom.
- Begin with 70–80% CPU and about 80% memory targets; fine-tune after a few days of real traffic.
- Run at least two replicas of every Deployment so updates and node failures don’t cause downtime.
- When back-pressure builds, bump Rollup replicas first; scale the Compactor afterward if needed.
- Watch the fleet through one full peak/off-peak cycle before locking in tighter thresholds.
9. Troubleshooting Gateway HPAs
| Symptom | Possible Cause | Fix |
|---|---|---|
| HPA shows `Unknown` metrics | Metrics-server missing or RBAC denied | Deploy or repair metrics-server; confirm ClusterRole rules |
| Pods stay pinned at minReplicas | Requests sized too high, so utilization < target | Lower requests or lower target percentages |
| Rapid oscillation between replica counts | Scale-up and scale-down windows too short | Widen `stabilizationWindowSeconds`, cap max scale-up percent |
| Compactor CPU throttling | Rollup under-provisioned, sending unrolled traffic | Scale Rollup first, then Compactor |
Next Steps
- Load test, then tune: Run a controlled load test after deployment to confirm that HPAs converge at expected replica counts.
- Alerting: Add alerts on `kube_hpa_status_current_replicas` and component-specific CPU or memory saturation metrics.
- Version pinning: When upgrading, always specify the exact chart version with `--version vX.Y.Z` to keep every pod on the same build and avoid mixed agents.
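A pinned upgrade might look like this sketch; the release name `edgedelta` and chart reference `edgedelta/edgedelta` are assumptions, and `vX.Y.Z` stands for the chart version you have actually tested:

```shell
# Assumed release and chart names; vX.Y.Z is a placeholder for your tested version.
helm repo update
helm upgrade edgedelta edgedelta/edgedelta -n edgedelta \
  --version vX.Y.Z \
  -f values.yaml
```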