Scale Edge Delta Deployments

Debugging the Edge Delta Installed Components in Kubernetes.

Recap

The Edge Delta Fleet includes a Processor Agent, a Compactor Agent, and a Rollup Agent.

The Processor Agent pre-processes data: extracting insights, generating alerts, creating summarized datasets, and more. It delivers logs to the Rollup Agent or directly to the Compactor Agent (skipping Rollup), depending on the configuration.
The Rollup Agent reduces metric data volume by optimizing data point frequency and cardinality, which significantly reduces storage requirements and can speed up data queries. It delivers logs to the Compactor Agent. When the Rollup Agent is not installed, the Processor Agent can send logs directly to the Compactor Agent.
The Compactor Agent compresses and encodes data before it is sent to the Edge Delta backend, ensuring effective bandwidth use and data processing.

Overview

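In environments with a high volume of logs and metrics, the default replica counts and resource limits for the Rollup and Compactor Agents might not be sufficient. The steps below show how to detect resource pressure, estimate the resources required, and apply new replica counts and limits.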

1. Gather Information

Check the status of the pods in the environment. Identify and record the NAME of any pod whose STATUS is not Running, or whose READY column shows fewer ready containers than expected (for example, 0/1).

kubectl get pods -n edgedelta
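When there are many pods, the listing can be filtered rather than read by eye. The following is a minimal sketch, assuming the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE). The function name not_ready_pods is hypothetical and not part of Edge Delta or kubectl.

```shell
# Print the NAME of each pod that is not fully ready.
# Reads `kubectl get pods` output on stdin and assumes the default
# column order: NAME READY STATUS RESTARTS AGE.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY looks like "1/2" (ready/total)
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -n edgedelta | not_ready_pods
```

Pods in any state other than Running, including Evicted or CrashLoopBackOff, are listed, so the same filter can be reused after a rollout.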

View the pod description with the kubectl describe command.

kubectl describe pods <pod name> -n edgedelta >> describe.txt

Replace <pod name> with the pod’s name as per the NAME column from the previous step. Also change edgedelta if the components were installed in a different namespace.

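Then view the pod logs with the kubectl logs command. The --previous flag returns the logs of the previous container instance, that is, the one that ran before the most recent restart: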

kubectl logs <pod name> -n edgedelta --previous >> pod.log

As with the describe command, replace <pod name> with the pod’s name and change edgedelta if the components were installed in a different namespace.
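If several pods are affected, the two commands can be run in a loop. This is a sketch only: the function name collect_diagnostics and the per-pod file names (describe-<pod>.txt, previous-<pod>.log) are hypothetical choices made here to keep each pod's output separate, whereas the commands above append everything to describe.txt and pod.log.

```shell
# Save `describe` output and previous-container logs for each named pod.
# Usage: collect_diagnostics NAMESPACE POD [POD...]
collect_diagnostics() {
  ns=$1
  shift
  for pod in "$@"; do
    kubectl describe pod "$pod" -n "$ns" > "describe-$pod.txt"
    # --previous fails when the container has never restarted; keep going.
    kubectl logs "$pod" -n "$ns" --previous > "previous-$pod.log" 2>&1 ||
      echo "no previous logs for $pod" >&2
  done
}

# Usage (requires cluster access):
#   collect_diagnostics edgedelta edgedelta-rollup-0 edgedelta-compactor-0
```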

See Debug the Installation of Edge Delta Components for more information.

In addition, there is an Edge Delta component called the Symptom Collector that enables Edge Delta support to help you diagnose issues. It collects information from the cluster and sends it to the Edge Delta S3 bucket. To enable it, set this value in the Helm upgrade command:

--set agentTroubleshooter.symptomCollector.enabled=true

2. Interpret Information

Examine describe.txt. In the Containers section, find Last State and its Reason field. If the Reason is OOMKilled, the pod was restarted due to memory pressure. In that case, increase the replica count, the resource limits, or both. If CPU consumption is close to the resource limits, provision more resources for the affected Processor, Rollup, or Compactor Agents.
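The Last State check can be scripted. The sketch below assumes the usual kubectl describe pod layout, in which container names are indented two spaces, fields such as Last State four spaces, and their sub-fields such as Reason six spaces. The function name oomkilled_containers is hypothetical.

```shell
# Print the name of every container whose Last State reason is OOMKilled.
# Reads `kubectl describe pod` output on stdin.
oomkilled_containers() {
  awk '
    # A 2-space-indented line ending in ":" starts a new container block.
    /^  [^ ].*:$/      { container = $1; sub(/:$/, "", container) }
    # "Last State:" is a 4-space field; its details follow at 6 spaces.
    /^    Last State:/ { in_last = 1; next }
    # Any other 4-space field ends the Last State details.
    /^    [^ ]/        { in_last = 0 }
    in_last && /Reason:/ && /OOMKilled/ { print container }
  '
}

# Usage:
#   oomkilled_containers < describe.txt
```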

There can also be resource pressure in the system even when no restart has OOMKilled as its reason. Examine the pod.log file and check for logs indicating resource limitations. Verify that the pod isn’t running out of assigned resources such as CPU and memory, as this could cause crashes and errors.

If you have the Kubernetes Metrics Server installed, you can also identify resource pressure using kubectl top pods:

kubectl top pods -n edgedelta
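To compare usage against a limit, the output can be piped through a small filter. This is a sketch under stated assumptions: CPU is reported in millicores (for example 950m), memory in Mi (for example 480Mi), and every pod fed to the filter shares the same limits, so filter by component first. The function name near_limit and the default 80% threshold are hypothetical.

```shell
# Print pods whose CPU or memory usage is at or above a percentage of the limit.
# Reads `kubectl top pods` output on stdin.
# Usage: near_limit CPU_LIMIT_M MEM_LIMIT_MI [THRESHOLD_PCT]
near_limit() {
  awk -v cpu_limit="$1" -v mem_limit="$2" -v pct="${3:-80}" '
    NR > 1 {
      cpu = $2; sub(/m$/, "", cpu)     # "950m"  -> 950
      mem = $3; sub(/Mi$/, "", mem)    # "480Mi" -> 480
      if (cpu * 100 >= cpu_limit * pct || mem * 100 >= mem_limit * pct)
        printf "%s cpu=%d%% mem=%d%%\n", $1, cpu * 100 / cpu_limit, mem * 100 / mem_limit
    }'
}

# Usage (requires cluster access and the Metrics Server), checking
# against limits of 1000m CPU and 1Gi (1024Mi) memory:
#   kubectl top pods -n edgedelta | grep -e NAME -e rollup | near_limit 1000 1024
```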

3. Estimate Resources

If there are restarts in the system with OOMKilled as the reason, or other signs of resource pressure, you need to increase the replica count and/or resource limits for the related component (i.e., change specs only for the impacted components, not for all). Before you can do this, you need to estimate resource requirements:

Processor Agent: Replica counts cannot be changed for agents running in DaemonSet mode. For other modes, increase the replica count. Raise resource limits only when overall consumption is low but occasional memory spikes cause the OOMKilled status. To choose the new limit, observe the maximum memory demanded in the metric explorer using agent_mem_virtual grouped by host.

Rollup Agent: The Rollup Agent can process approximately 10GB/hour of metric volume with a maximum of 1000m CPU and 500MB memory consumption per replica. For example, if the Processor Agent is sending 90GB/hour, the total resource requirement would be 9000m CPU and 4.5GB memory. With a 10% safety margin (9900m CPU and 4.95GB memory), this volume could be processed by 5 replicas, each with 2000m CPU and 1GB memory. For high availability there should be at least 2 replicas. After rollup, the hourly data volume sent to the Compactor Agent is expected to drop by a factor of 60.

Compactor Agent: The Compactor Agent can process approximately 1GB/hour of logs or metrics coming from the Rollup Agent per 300m CPU and 200MB memory. For example, if the Rollup Agent is sending 22GB/hour, the total resource requirement would be 6600m CPU and 4.4GB memory. With a 10% safety margin (7260m CPU and 4.84GB memory), this volume could be processed by 3 replicas, each with 2500m CPU and 1.8GB memory. For high availability there should be at least 2 replicas.
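The arithmetic in the Rollup and Compactor examples can be reproduced with a short helper. This is an illustrative sketch, not an Edge Delta tool: the function name estimate_replicas and its argument order are hypothetical, 1GB is treated as 1000MB, and integer arithmetic is used throughout.

```shell
# Estimate total CPU/memory (with a safety margin) and the replica count.
# Usage: estimate_replicas VOLUME_GB_H UNIT_GB_H CPU_M_PER_UNIT MEM_MB_PER_UNIT \
#                          REPLICA_CPU_M REPLICA_MEM_MB [MARGIN_PCT]
estimate_replicas() {
  volume=$1 unit=$2 cpu_per_unit=$3 mem_per_unit=$4
  replica_cpu=$5 replica_mem=$6 margin=${7:-10}
  # Scale the per-unit cost by the volume, then add the safety margin.
  cpu_total=$(( volume * cpu_per_unit / unit * (100 + margin) / 100 ))
  mem_total=$(( volume * mem_per_unit / unit * (100 + margin) / 100 ))
  # Ceiling division: replicas needed to cover each resource.
  by_cpu=$(( (cpu_total + replica_cpu - 1) / replica_cpu ))
  by_mem=$(( (mem_total + replica_mem - 1) / replica_mem ))
  replicas=$by_cpu
  if [ "$by_mem" -gt "$replicas" ]; then replicas=$by_mem; fi
  # Keep at least 2 replicas for high availability.
  if [ "$replicas" -lt 2 ]; then replicas=2; fi
  echo "cpu_total=${cpu_total}m mem_total=${mem_total}MB replicas=${replicas}"
}

# Rollup example: 90GB/hour, 10GB/hour per 1000m CPU and 500MB,
# replicas sized at 2000m CPU and 1000MB.
estimate_replicas 90 10 1000 500 2000 1000
# Compactor example: 22GB/hour, 1GB/hour per 300m CPU and 200MB,
# replicas sized at 2500m CPU and 1800MB.
estimate_replicas 22 1 300 200 2500 1800
```

The two calls print replicas=5 and replicas=3 respectively, matching the worked examples above.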

4. Update Resource Limits

Performing these steps will help ensure that your Edge Delta agent pods have the resources they need to function effectively, and will reduce the likelihood of evictions due to resource constraints. Because assigning too many resources can be costly and inefficient, it’s important to find the right balance based on the actual usage metrics and performance of the pods.

Follow either the Helm or Kubernetes steps:

Helm

The Helm chart includes a source values.yaml file for configuring resource limits for the respective components. See the chart's values.yaml for the current version. To change the Rollup and Compactor resource limits, use the compactorProps.resources and rollUpProps.resources fields. To increase replica counts, use the compactorProps.replicas and rollUpProps.replicas fields.

compactorProps:
  enabled: true
  port: 9199
  usePVC: true
  storageClass: ""
  diskSize: 30Gi
  replicas: 1
  serviceDNSSuffix: svc.cluster.local
  traceFiles: ""
  updateStrategy:
    type: RollingUpdate
  goMemLimit: ""
  resources:
    limits:
      cpu: 2000m
      memory: 2000Mi
    requests:
      cpu: 200m
      memory: 300Mi

rollUpProps:
  enabled: true
  port: 9200
  replicas: 2
  serviceDNSSuffix: svc.cluster.local
  updateStrategy:
    type: RollingUpdate
  goMemLimit: "900MiB"
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 256Mi

Processor Agent resource limits are in the top-level resources field.

# Resource constraints

annotations: {}
resources:
  limits:
    memory: 2048Mi
  requests:
    cpu: 200m
    memory: 256Mi

Apply the changes:

In the current directory, create a values.yaml file containing the updated resource limits and pass it in with -f values.yaml during the Helm upgrade:

helm upgrade edgedelta edgedelta/edgedelta -i --version v0.1.96 --set secretApiKey.value=123456789 -n edgedelta --create-namespace -f values.yaml
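The values file only needs to contain the fields being changed, because Helm merges it over the chart defaults. As a hedged sketch, the helper below writes an override that applies the Rollup sizing from the estimation example (5 replicas, 2000m CPU, and 1Gi standing in for 1GB memory). The function name write_rollup_override is hypothetical, and the numbers should be replaced with your own estimates.

```shell
# Write a minimal override file; fields not listed keep the chart defaults.
# Usage: write_rollup_override [PATH]   (defaults to values.yaml)
write_rollup_override() {
  cat > "${1:-values.yaml}" <<'EOF'
rollUpProps:
  replicas: 5
  resources:
    limits:
      cpu: 2000m
      memory: 1Gi
EOF
}

# Usage:
#   write_rollup_override values.yaml
```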

Alternatively, pass in individual parameters with --set. For example, to change the replica count for the Rollup Agent from 2 to 3 and raise its CPU and memory limits:

helm upgrade edgedelta edgedelta/edgedelta -i --version v0.1.96 --set secretApiKey.value=123456789 -n edgedelta --create-namespace --set rollUpProps.replicas=3 --set rollUpProps.resources.limits.cpu=2000m --set rollUpProps.resources.limits.memory=2Gi

Note: Individual parameters passed in with --set take precedence over values passed in with -f values files.

Monitor the Rollout: Observe the rollout of the updated configuration to ensure that pods are scheduled and running:

kubectl rollout status daemonset/edgedelta-agent -n edgedelta

Replace daemonset/edgedelta-agent with the relevant workload type and name generated by your Helm release. The Rollup and Compactor Agents run as StatefulSets, so use the statefulset/ prefix for them.
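To check several workloads in one pass, the rollout command can be wrapped in a loop. This is a sketch: the function name wait_for_rollouts and the 5 minute timeout are hypothetical, and the workload names in the usage line are taken from the manifests shown in the Kubernetes section, so they may differ in your Helm release.

```shell
# Wait for each workload to finish rolling out; stop at the first failure.
# Usage: wait_for_rollouts NAMESPACE WORKLOAD [WORKLOAD...]
wait_for_rollouts() {
  ns=$1
  shift
  for workload in "$@"; do
    # --timeout makes the command fail instead of waiting indefinitely.
    kubectl rollout status "$workload" -n "$ns" --timeout=5m || {
      echo "rollout did not complete: $workload" >&2
      return 1
    }
  done
}

# Usage (requires cluster access):
#   wait_for_rollouts edgedelta statefulset/edgedelta-rollup statefulset/edgedelta-compactor
```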

Verify No Evictions: After applying the changes, monitor the pods to ensure that they’re not being evicted by issuing:

kubectl get pods -n edgedelta

In the output, verify that the pods remain in a Running state and do not go into an Evicted state.

Double-check Resource Usage: To ensure that the new resource requests and limits are appropriate, check the resource usage of the pods:

kubectl top pods -n edgedelta

This will help confirm that the pods have sufficient resources and aren’t using more than anticipated.

Kubernetes

If you are not using Helm, you need to download the deployment spec file and edit it manually:

Compactor Agent

Search for name: edgedelta-compactor under metadata for the StatefulSet. Change spec.replicas for replica count and spec.template.spec.containers[0].resources (the first container) for resource limits.

# Source: edgedelta/templates/compactor.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: edgedelta-compactor
...
spec:
  replicas: 1
...
  template:
...
    spec:
...
      containers:
...
          resources:
            limits:
              cpu: 2000m
              memory: 2000Mi
            requests:
              cpu: 200m
              memory: 300Mi

Rollup Agent

Search for name: edgedelta-rollup under metadata for the StatefulSet. Change spec.replicas for replica count and spec.template.spec.containers[0].resources (the first container) for resource limits.

# Source: edgedelta/templates/rollup.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: edgedelta-rollup
...
spec:
  replicas: 2
...
  template:
...
    spec:
...
      containers:
...
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 200m
              memory: 256Mi

Edge Delta Agent

Search for name: edgedelta under metadata for the DaemonSet. Change spec.template.spec.containers[0].resources (the first container) for resource limits. The replica count cannot be changed because a DaemonSet runs one pod on each eligible node.

# Source: edgedelta/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edgedelta
...
spec:
...
  template:
...
    spec:
...
      containers:
        # agent container
...
          resources:
            limits:
              memory: 2048Mi
            requests:
              cpu: 200m
              memory: 256Mi

Save the updated deployment spec as edgedelta.yaml in the current directory and apply it with kubectl apply:

kubectl apply -f edgedelta.yaml
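Before applying a hand-edited spec, it can help to preview what will change. The sketch below uses kubectl diff, which exits with 0 when there are no differences, 1 when differences exist, and a higher code on error. The function name apply_with_preview is hypothetical.

```shell
# Show the pending changes, then apply the manifest only if something differs.
# Usage: apply_with_preview MANIFEST
apply_with_preview() {
  manifest=$1
  rc=0
  kubectl diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed (exit $rc)" >&2
    return "$rc"
  fi
  if [ "$rc" -eq 0 ]; then
    echo "no changes to apply"
    return 0
  fi
  kubectl apply -f "$manifest"
}

# Usage (requires cluster access):
#   apply_with_preview edgedelta.yaml
```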