Scale Edge Delta Deployments
Recap
The Edge Delta fleet includes the Processor Agent, a Compactor Agent, and a Rollup Agent.
- The Processor Agent pre-processes data: extracting insights, generating alerts, creating summarized datasets, and more. Depending on the configuration, it delivers logs to the Rollup Agent or directly to the Compactor Agent (skipping rollup).
- The Rollup Agent reduces metric data volume by optimizing data point frequency and cardinality, which significantly reduces storage requirements and can speed up data queries. It delivers the rolled-up data to the Compactor Agent. When the Rollup Agent is not installed, the Processor Agent sends data directly to the Compactor Agent.
- The Compactor Agent compresses and encodes data before it is sent to the Edge Delta backend, ensuring efficient bandwidth use and data processing.
Overview
In environments with a high volume of logs and metrics, the default replica counts for the Rollup and Compactor Agents might not be sufficient. The following steps walk through diagnosing resource pressure and scaling each component.
1. Gather Information.
Check the status of the pods in the environment. Identify and record the NAME of any pods that are not in the Ready STATUS.
kubectl get pods -n edgedelta
View the pod description with the kubectl describe command.
kubectl describe pods <pod name> -n edgedelta >> describe.txt
Replace <pod name> with the pod's name as per the NAME column from the previous step. Also change edgedelta if the components were installed in a different namespace.
Then view the pod logs with the kubectl logs command and the --previous flag:
kubectl logs <pod name> -n edgedelta --previous >> pod.log
As before, replace <pod name> with the pod's name and change edgedelta if the components were installed in a different namespace.
See Debug the Installation of Edge Delta Components for more information.
In addition, there is an Edge Delta component called the Symptom Collector that enables Edge Delta support to help you diagnose issues. It collects information from the cluster and sends it to the Edge Delta S3 bucket. To enable it, include this flag in the Helm upgrade command:
--agentTroubleshooter.symptomCollector.enabled=true
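For example, combined with the upgrade command used later in this guide (the release name, chart version, and API key placeholder below are carried over from those examples, not fixed values):

```shell
# Enable the Symptom Collector as part of a Helm upgrade
helm upgrade edgedelta edgedelta/edgedelta -i --version v0.1.96 \
  --set secretApiKey.value=123456789 -n edgedelta \
  --set agentTroubleshooter.symptomCollector.enabled=true
```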
2. Interpret Information.
Examine describe.txt. In the Containers section, find Last State and the Reason field. If the Reason is OOMKilled, the pod was restarted due to memory pressure. In that case, increase the replica count, the resource limits, or both. If CPU consumption is close to the resource limits, provision more resources for the agents and/or the Rollup/Compactor Agents.
There can also be resource pressure in the system even if there is no restart associated with OOMKilled reason. Examine the pod.log file and check for logs indicating resource limitations. Verify that the pod isn’t running out of assigned resources such as CPU and memory, as this could cause crashes and errors.
If you have the Kubernetes Metrics Server installed, you can also identify resource pressure using kubectl top pods:
kubectl top pods -n edgedelta
3. Estimate Resources
If there are restarts with OOMKilled as the reason, or other signs of resource pressure, you need to increase the replica count and/or resource limits for the affected component (i.e., change the specs only for impacted components, not for all of them). Before you can do this, you need to estimate resource requirements:
Processor Agent
It is not possible to update replica counts for agents running in DaemonSet mode. For other modes, increase the replica count. Raise resource limits instead only when overall consumption is low but occasional memory spikes cause the OOMKilled status. To size the limits, observe the maximum memory demanded in the Metric Explorer using the agent_mem_virtual metric grouped by host.
Rollup Agent
The Rollup Agent can process approximately 10GB/hour of metric volume with at most 1000m CPU and 500MB memory consumption per replica. For example, if the agents are sending 90GB/hour, the total resource requirement would be 9000m CPU and 4.5GB memory. With a 10% safety margin, this volume could be processed by 5 replicas with 2000m CPU and 1GB memory each. For high availability there should be at least 2 replicas. After rollup, the data volume per hour sent to the Compactor Agent should decrease by about 60x.
Compactor Agent
The Compactor Agent can process approximately 1GB/hour of logs or metrics coming from the Rollup Agent per 300m CPU and 200MB memory. For example, if the Rollup Agent is sending 22GB/hour, the total resource requirement would be 6600m CPU and 4.4GB memory. With a 10% safety margin, this volume could be processed by 3 replicas with 2500m CPU and 1.8GB memory each. For high availability there should be at least 2 replicas.
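The sizing arithmetic above can be sketched as a small shell calculation. The 90GB/hour input volume and the 2000m CPU / 1GB per-replica sizing are the illustrative numbers from the Rollup example, not measured values:

```shell
# Ceiling-divide the safety-adjusted volume by one replica's throughput.
# Baseline: ~10 GB/h per 1000m CPU / 500 MB, so a 2000m/1GB replica handles ~20 GB/h.
volume_gbh=90        # example ingest volume from the Rollup Agent section
per_replica_gbh=20   # throughput of one 2000m CPU / 1 GB memory replica
safety_pct=10        # safety margin, percent

needed=$(( volume_gbh * (100 + safety_pct) ))   # volume x100, including margin
replicas=$(( (needed + per_replica_gbh * 100 - 1) / (per_replica_gbh * 100) ))
echo "rollup replicas: $replicas"   # prints 5, matching the example above
```

The same calculation applies to the Compactor Agent with its own baseline (1GB/hour per 300m CPU / 200MB memory).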
4. Update Resource Limits
These steps help ensure that your Edge Delta pods have the resources they need to function effectively and reduce the likelihood of evictions due to resource constraints. Assigning too many resources can be costly and inefficient, so it's important to find the right balance based on the actual usage metrics and performance of the pods.
Follow either the Helm or Kubernetes steps:
Helm
The Helm chart includes a source values.yaml file for configuring resource limits for the respective components. The current version can be seen here. To change the Rollup and Compactor Agent resource limits, use the compactorProps.resources and rollUpProps.resources fields. To increase replica counts, use the compactorProps.replicas and rollUpProps.replicas fields.
compactorProps:
  enabled: true
  port: 9199
  usePVC: true
  storageClass: ""
  diskSize: 30Gi
  replicas: 1
  serviceDNSSuffix: svc.cluster.local
  traceFiles: ""
  updateStrategy:
    type: RollingUpdate
  goMemLimit: ""
  resources:
    limits:
      cpu: 2000m
      memory: 2000Mi
    requests:
      cpu: 200m
      memory: 300Mi
rollUpProps:
  enabled: true
  port: 9200
  replicas: 2
  serviceDNSSuffix: svc.cluster.local
  updateStrategy:
    type: RollingUpdate
  goMemLimit: "900MiB"
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 256Mi
Agent resource limits are in the resources field.
# Resource constraints
annotations: {}
resources:
  limits:
    memory: 2048Mi
  requests:
    cpu: 200m
    memory: 256Mi
- Apply the changes:
In the current directory, create a values.yaml file containing the updated resource limits and pass it in with -f values.yaml during the Helm upgrade:
helm upgrade edgedelta edgedelta/edgedelta -i --version v0.1.96 --set secretApiKey.value=123456789 -n edgedelta --create-namespace -f values.yaml
OR
Pass in individual parameters with --set. For example, suppose you want to change the replica count for the Rollup Agent from 2 to 3 and change the CPU and memory resource limits:
helm upgrade edgedelta edgedelta/edgedelta -i --version v0.1.96 --set secretApiKey.value=123456789 -n edgedelta --create-namespace --set rollUpProps.replicas=3 --set rollUpProps.resources.limits.cpu=2000m --set rollUpProps.resources.limits.memory=2Gi
Note: Individual parameters passed in with --set take precedence over values files passed in with -f.
- Monitor the Rollout: Observe the rollout of the updated configuration to ensure that pods are scheduled and running:
kubectl rollout status daemonset/edgedelta-agent -n edgedelta
Replace daemonset/edgedelta-agent with the workload type and name generated by your Helm release as needed, for example:
kubectl rollout status deployment/edgedelta-rollup -n edgedelta
- Verify No Evictions: After applying the changes, monitor the pods to ensure that they’re not being evicted by issuing:
kubectl get pods -n edgedelta
In the output, verify that the pods remain in a Running state and do not go into an Evicted state.
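To surface only problem pods, a quick filter over the same output (assuming the edgedelta namespace) is:

```shell
# List only pods that are not Running or Completed, e.g. Evicted or CrashLoopBackOff
kubectl get pods -n edgedelta | grep -Ev 'Running|Completed'
```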
- Double-check Resource Usage: To ensure that the new resource requests and limits are appropriate, check the resource usage of the pods:
kubectl top pods -n edgedelta
This will help confirm that the pods have sufficient resources and aren’t using more than anticipated.
Kubernetes
If you are not using Helm, you need to download the deployment spec file and edit it manually:
Compactor Agent
Search for name: edgedelta-compactor under metadata for the StatefulSet. Change spec.replicas for the replica count and spec.template.spec.containers[0].resources (the first container) for the resource limits.
# Source: edgedelta/templates/compactor.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: edgedelta-compactor
...
spec:
  replicas: 1
...
  template:
...
    spec:
...
      containers:
...
        resources:
          limits:
            cpu: 2000m
            memory: 2000Mi
          requests:
            cpu: 200m
            memory: 300Mi
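As an aside, the replica count of the StatefulSet can also be changed without editing the file, via kubectl scale. Note that the next helm upgrade or kubectl apply of the original spec will revert this change:

```shell
# Scale the Compactor StatefulSet directly; reverted by the next apply/upgrade
kubectl scale statefulset/edgedelta-compactor --replicas=2 -n edgedelta
```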
Rollup Agent
Search for name: edgedelta-rollup under metadata for the Deployment. Change spec.replicas for the replica count and spec.template.spec.containers[0].resources (the first container) for the resource limits.
# Source: edgedelta/templates/rollup.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edgedelta-rollup
...
spec:
  replicas: 2
...
  template:
...
    spec:
...
      containers:
...
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
          requests:
            cpu: 200m
            memory: 256Mi
Edge Delta Agent
Search for name: edgedelta under metadata for the DaemonSet. Change spec.template.spec.containers[0].resources (the first container) for the resource limits. The replica count cannot be changed for a DaemonSet.
# Source: edgedelta/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edgedelta
...
spec:
...
  template:
...
    spec:
...
      containers:
        # agent container
...
        resources:
          limits:
            memory: 2048Mi
          requests:
            cpu: 200m
            memory: 256Mi
In the current directory, create an edgedelta.yaml file containing the updated deployment spec and pass it in with -f edgedelta.yaml to kubectl apply:
kubectl apply -f edgedelta.yaml
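After applying, you can watch the rollout the same way as in the Helm flow, using the workload names from the specs above:

```shell
# Monitor the rollout of the edited workloads
kubectl rollout status statefulset/edgedelta-compactor -n edgedelta
kubectl rollout status deployment/edgedelta-rollup -n edgedelta
```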