Scale Edge Delta Deployments
Recap
A pipeline is a set of agents with a single configuration. There are different types of pipelines that you can deploy depending on your architecture.
- See Architecture for an overview of the pipelines.
- See Kubernetes Manifests for a description of each pipeline’s manifests.
- See Sizing and Capacity Estimations.
- See Deployment Examples for default installations of node, coordinator and gateway pipelines in clusters with different numbers of nodes.
- See Helm values that are made available.
Overview
In environments with a high volume of logs and metrics, the default replica counts for the rollup and compactor agents in node and gateway pipelines might not be sufficient.
1. Gather Information.
Check the status of the pods in the environment. Identify and record the `NAME` of any pods that are not in the `Ready` `STATUS`:
kubectl get pods -n edgedelta
View the pod description with the `kubectl describe` command:
kubectl describe pods <pod name> -n edgedelta >> describe.txt
Replace `<pod name>` with the pod's name as per the NAME column from the previous step. Also change `edgedelta` if the components were installed in a different namespace.
Then view the pod logs with the `kubectl logs` command and the `--previous` flag:
kubectl logs <pod name> -n edgedelta --previous >> pod.log
Again, replace `<pod name>` with the pod's name and adjust the namespace if needed.
See Debug the Installation of Edge Delta Components for more information.
2. Interpret Information
Examine describe.txt. In the `Containers` section, find `Last State` and the `Reason` field. If the Reason is `OOMKilled`, the pod was restarted due to memory pressure. In that case, increase the replica count, the resource limits, or both. If CPU consumption is close to the resource limits, provision more resources for the processor agents and/or the rollup and compactor agents.
There can also be resource pressure in the system even without a restart with the `OOMKilled` reason. Examine the pod.log file and check for logs indicating resource limitations. Verify that the pod isn't running out of assigned resources such as CPU and memory, as this can cause crashes and errors.
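As a quick filter, the describe output captured in step 1 can be scanned for `OOMKilled` terminations. A minimal sketch — the inline sample below stands in for your real describe.txt:

```shell
# Create a stand-in for the describe.txt captured in step 1 (sample data only).
cat > describe-sample.txt <<'EOF'
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
EOF

# Count lines mentioning OOMKilled; any hit indicates memory pressure.
HITS=$(grep -c 'OOMKilled' describe-sample.txt)
echo "OOMKilled terminations found: ${HITS}"
```

Running the same `grep` against your actual describe.txt quickly surfaces which containers were restarted for memory pressure.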
If you have the Kubernetes Metrics Server installed, you can also identify resource pressure using `kubectl top pods`:
kubectl top pods -n edgedelta
3. Estimate Resources
If there are restarts with `OOMKilled` as the reason, or there is resource pressure, you need to increase the replica count and/or resource limits for the affected component (i.e., change specs only for impacted components, not for all). Before doing so, estimate the resource requirements. See Sizing and Capacity Estimations.
Processor Agent: It is usually unnecessary to update replica counts for agents running in DaemonSet mode. For other modes, increase the replica count if you prefer many small pods. Raise resource limits instead only when overall consumption is low but occasional memory spikes cause the `OOMKilled` status. To size the limits, observe the maximum memory demanded in the metrics explorer, grouped by host.
Rollup Agent: The rollup agent can process approximately 10GB/hour of metric volume with at most 1000m CPU and 500MB memory consumption per replica. For example, if the agents send 90GB/hour, the total resource requirement is 9000m CPU and 4.5GB memory. With a 10% safety margin, this volume can be handled by 5 replicas with 2000m CPU and 1GB memory each. For high availability there should be at least 2 replicas. After rollup, the data volume per hour sent to the compactor agent is reduced by roughly 60x.
Compactor Agent: The compactor agent can process approximately 1GB/hour of logs or metrics coming from the rollup agent per 300m CPU and 200MB memory. For example, if the rollup agent sends 22GB/hour, the total resource requirement is 6600m CPU and 4.4GB memory. With a safety margin, this volume can be handled by 3 replicas with 2500m CPU and 1.8GB memory each. For high availability there should be at least 2 replicas.
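The sizing arithmetic above can be sketched in a few lines of shell. The throughput constants come from the reference figures in this section, and the 10% margin matches the examples; the 90GB/hour input is just the worked example, not your actual volume:

```shell
# Estimate total rollup-agent resources from hourly metric volume.
# Reference throughput (from this section): ~10 GB/h per 1000m CPU and 500MB memory.
VOLUME_GB_H=90

CPU_M=$(( VOLUME_GB_H * 1000 / 10 ))   # total millicores before margin
MEM_MB=$(( VOLUME_GB_H * 500 / 10 ))   # total MB before margin

# Apply a 10% safety margin (integer arithmetic).
CPU_M=$(( CPU_M * 110 / 100 ))
MEM_MB=$(( MEM_MB * 110 / 100 ))

echo "rollup total: ${CPU_M}m CPU, ${MEM_MB}MB memory"
```

Dividing the totals by a per-replica budget (for example 2000m CPU) gives the replica count; round up, and keep at least 2 replicas for high availability.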
4. Update Resource Limits
Performing these steps helps ensure that your Edge Delta agent pods have the resources they need to function effectively and reduces the likelihood of evictions due to resource constraints. Assigning too many resources can be costly and inefficient, so it's important to find the right balance based on the actual usage metrics and performance of the pods.
To change the rollup and compactor agent resource limits, use the `compactorProps.resources` and `rollUpProps.resources` fields. To increase replica counts, use the `compactorProps.replicas` and `rollUpProps.replicas` fields.
Suppose you want to change the replica count for the rollup agent from 2 to 3 and adjust the CPU and memory resource limits. Add these flags to your `helm upgrade` command:
--set rollUpProps.replicas=3 \
--set rollUpProps.resources.limits.cpu=2000m \
--set rollUpProps.resources.limits.memory=2Gi
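For context, a complete command might look like the following sketch. The release name `edgedelta`, the `edgedelta/edgedelta` chart reference, and the values file name are assumptions — substitute the names from your own installation:

```shell
# Assumed release name, chart reference, and values file; adjust to your installation.
helm upgrade edgedelta edgedelta/edgedelta -n edgedelta \
  -f values.yaml \
  --set rollUpProps.replicas=3 \
  --set rollUpProps.resources.limits.cpu=2000m \
  --set rollUpProps.resources.limits.memory=2Gi
```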
Note: Individual parameters passed in with `--set` take precedence over values passed in with values files.
- Monitor the Rollout: Observe the rollout of the updated configuration to ensure that pods are scheduled and running:
kubectl rollout status daemonset/edgedelta-agent -n edgedelta
Replace daemonset/edgedelta-agent with the relevant workload type and name generated by your Helm release, for example:
kubectl rollout status deployment/edgedelta-rollup -n edgedelta
- Verify No Evictions: After applying the changes, monitor the pods to ensure that they’re not being evicted by issuing:
kubectl get pods -n edgedelta
In the output, verify that the pods remain in a Running state and do not go into an Evicted state.
- Double-check Resource Usage: To ensure that the new resource requests and limits are appropriate, check the resource usage of the pods:
kubectl top pods -n edgedelta
This will help confirm that the pods have sufficient resources and aren’t using more than anticipated.
Scaling Gateway Components
Gateway HPAs calculate utilization from `resources.requests`, not limits. Keep requests at 60–80% of steady-state usage to allow room for bursty traffic.
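That guideline can be turned into quick arithmetic: given observed steady-state usage and a target fraction, derive the request. The usage figure below is a made-up example, not a benchmark:

```shell
# Derive a CPU request so that steady-state usage sits at ~70% of it.
STEADY_CPU_M=1400   # observed steady-state usage in millicores (example value)
TARGET_PCT=70       # where in the 60-80% band steady state should land

REQUEST_M=$(( STEADY_CPU_M * 100 / TARGET_PCT ))
echo "resources.requests.cpu=${REQUEST_M}m"
```

The same calculation applies to memory; repeat it per component using the usage observed with `kubectl top pods`.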
Gateway Component Reference Sizing
| Component (Workload) | Typical Volume Example | Benchmark Usage* | Notes |
|---|---|---|---|
| Processor (Deployment) | 10 GB/h total logs, traces, and metrics | ≈1500m CPU, ≈1500Mi memory per replica | Carries out parsing, enrichment, and fan-out |
| Rollup (Deployment) | 90 GB/h raw metrics | ≈9000m CPU, ≈4.5Gi memory in total | Reduces metric cardinality and frequency |
| Compactor (Deployment) | 22 GB/h | ≈6600m CPU, ≈4.4Gi memory in total | Compresses and encodes data before egress |

*Values include a modest safety margin but should be validated with `kubectl top pods`.
5. Enable HPA for Gateway Processor
Let Kubernetes scale the gateway automatically by attaching a Horizontal Pod Autoscaler (HPA) with the following Helm flags:
--set deployment.autoscaling.enabled=true \
--set deployment.autoscaling.minReplicas=2 \
--set deployment.autoscaling.maxReplicas=10 \
--set deployment.autoscaling.targetForCPUUtilizationPercentage=80 \
--set deployment.autoscaling.targetForMemoryUtilizationPercentage=80
Advanced Tuning (optional)
If you need finer control over how fast replicas are added or removed, override the HPA's default `behavior` values:
--set-json 'deployment.autoscaling.behavior={
"scaleUp": {
"stabilizationWindowSeconds": 60,
"policies":[{"type":"Percent","value":50,"periodSeconds":60}]
},
"scaleDown": {
"stabilizationWindowSeconds":300
}
}'
A scale‑up window of 60s keeps the fleet responsive, while a scale‑down window of 300s avoids thrashing during traffic dips.
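To make the `Percent` policy concrete: with the setting above, during each 60-second period the HPA may add at most 50% of the current replica count. A quick sketch with an example replica count (Kubernetes' own rounding may differ slightly for odd counts):

```shell
# Maximum replicas the scale-up policy above can add per 60s period.
CURRENT_REPLICAS=4   # example value
POLICY_PCT=50        # the Percent policy value from the behavior override

MAX_ADD=$(( CURRENT_REPLICAS * POLICY_PCT / 100 ))
echo "at most ${MAX_ADD} new replicas this period"
```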
6. Enable HPA for Rollup and Compactor
Now configure autoscaling for the two downstream gateway components: Rollup (handles metric reduction) and Compactor (handles compression and egress).
Each example shows the minimal Helm flags to turn on HPAs and set sensible replica limits:
For rollup agents:
--set rollUpProps.autoscaling.enabled=true \
--set rollUpProps.autoscaling.minReplicas=1 \
--set rollUpProps.autoscaling.maxReplicas=8 \
--set rollUpProps.autoscaling.targetForCPUUtilizationPercentage=75 \
--set rollUpProps.autoscaling.targetForMemoryUtilizationPercentage=80
For compactor agents:
--set compactorProps.autoscaling.enabled=true \
--set compactorProps.autoscaling.minReplicas=1 \
--set compactorProps.autoscaling.maxReplicas=6 \
--set compactorProps.autoscaling.targetForCPUUtilizationPercentage=75 \
--set compactorProps.autoscaling.targetForMemoryUtilizationPercentage=80
7. Validate Gateway Scaling
With the HPAs configured, use `kubectl` to make sure they exist, are reading metrics, and are steering replica counts the way you expect.
- Check HPA objects: This command lists every HPA in the `edgedelta` namespace and shows its current versus desired replica count.
kubectl get hpa -n edgedelta
- Inspect utilization and replica histories: Describe each HPA to see live CPU/memory percentages, scaling events, and any recent errors.
kubectl describe hpa edgedelta-gateway -n edgedelta
kubectl describe hpa edgedelta-gateway-rollup -n edgedelta
kubectl describe hpa edgedelta-gateway-compactor -n edgedelta
- Verify pod-level metrics: Finally, spot-check individual pods to ensure real CPU and memory usage line up with HPA targets.
kubectl top pods -n edgedelta
8. Gateway Scaling Best Practices
Before you switch the workload on, double-check these best practices:
- Size HPAs from requests, and keep limits roughly 2–4x higher for burst headroom.
- Begin with 70–80% CPU and about 80% memory targets; fine-tune after a few days of real traffic.
- Run at least two replicas of every Deployment so updates and node failures don’t cause downtime.
- When back-pressure builds, bump Rollup replicas first; scale the Compactor afterward if needed.
- Watch the fleet through one full peak/off-peak cycle before locking in tighter thresholds.
9. Troubleshooting Gateway HPAs
| Symptom | Possible Cause | Fix |
|---|---|---|
| HPA shows `Unknown` metrics | Metrics-server missing or RBAC denied | Deploy or repair metrics-server; confirm ClusterRole rules |
| Pods stay pinned at minReplicas | Requests sized too high, so utilization < target | Lower requests or lower target percentages |
| Rapid oscillation between replica counts | Scale-up and scale-down windows too short | Widen `stabilizationWindowSeconds`, cap max scale-up percent |
| Compactor CPU throttling | Rollup under-provisioned, sending unrolled traffic | Scale Rollup first, then Compactor |
Next Steps
- Load test, then tune: Run a controlled load test after deployment to confirm that HPAs converge at expected replica counts.
- Alerting: Add alerts on `kube_hpa_status_current_replicas` and component-specific CPU or memory saturation metrics.
- Version pinning: When upgrading, always specify the exact chart version with `--version vX.Y.Z` to keep every pod on the same build and avoid mixed agents.
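A pinned upgrade might look like this sketch; the release name `edgedelta` and chart reference `edgedelta/edgedelta` are assumptions, and `vX.Y.Z` stands for the chart version you have actually tested:

```shell
# Assumed release and chart names; vX.Y.Z is a placeholder for your tested version.
helm repo update
helm upgrade edgedelta edgedelta/edgedelta -n edgedelta \
  --version vX.Y.Z \
  -f values.yaml
```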