Troubleshoot the Edge Delta Agent with Helm

Troubleshooting the Edge Delta Fleet using Helm.

Overview

You can diagnose and resolve common issues with Edge Delta Fleets deployed using Helm.

Note: Ensure that your helm and kubectl contexts point to the correct cluster and namespace, and that you use the correct release name. Verify them with helm list and kubectl config current-context. The commands in this guide assume a release named edgedelta in the edgedelta namespace; adjust them to match your environment.
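As a minimal sketch of the check described in the note, the helper below prints the active kubectl context and looks up a single release. The function name show_target is hypothetical, and the defaults simply mirror the edgedelta release and namespace names used in this guide.

```shell
# Sketch: print the active kubectl context and look up one Helm release.
# "edgedelta" is the release and namespace name used throughout this guide;
# pass your own names as arguments if they differ.
show_target() {
  local release="${1:-edgedelta}"
  local namespace="${2:-edgedelta}"
  echo "kubectl context: $(kubectl config current-context)"
  # --filter takes a regular expression, so anchor it to match the name exactly.
  helm list -n "$namespace" --filter "^${release}\$"
}
```

Call it as show_target, or as show_target my-release my-namespace when your names differ. If the release does not appear in the output, the context or namespace is probably not the one you intended.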

1. Run the Agent Troubleshooter

Run the agent troubleshooter, replacing 123456789 with your Pipeline ID.

kubectl run agent-troubleshooter -i --rm --image=gcr.io/edgedelta/agent-troubleshooter:latest -- /edgedelta/agent_troubleshooter --mode=post-install-checks --api_key=123456789

2. Check the Release Status

Determine if the Edge Delta Helm release is deployed correctly.

helm status edgedelta -n edgedelta

Check that the components are ready and verify the release’s revision number, update history, and deployed resources.

  • If the Status is “deployed,” the release has been deployed successfully.
  • The Revision can indicate if the version you’re inspecting is the initial installation or an upgrade.
  • The Updated timestamp helps you determine when the last change took place.
  • Notes frequently contain commands for further interaction with your release, such as how to access a deployed web application or tips for troubleshooting.
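When the status check needs to run in a script, the header fields can be pulled out of the text output. This is a rough sketch that assumes the usual "KEY: value" header lines printed by helm status; status_field is a hypothetical helper name.

```shell
# Sketch: read the text output of `helm status` on stdin and print one header field.
# Usage: helm status edgedelta -n edgedelta | status_field STATUS
status_field() {
  # Header lines look like "STATUS: deployed"; split on ": " and print the
  # second field of the first line whose key matches.
  awk -v key="$1" -F': ' '$1 == key { print $2; exit }'
}
```

For example, piping the status output into status_field REVISION prints the current revision number, which is useful later when choosing a rollback target.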

3. View the Helm Values

Inspect the configuration of the Edge Delta release.

helm get values edgedelta -n edgedelta -o yaml

Verify that the values returned match the intended configuration, focusing on resource limits, environment variables, and any custom settings.
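One way to make this comparison less error-prone is to diff the release values against the values file you intended to apply. The sketch below assumes you keep such a file locally; values_drift and its argument order are hypothetical. Note that helm get values returns only user-supplied values unless --all is added, and key ordering in its output may differ from a hand-written file, so some reported differences can be cosmetic.

```shell
# Sketch: compare the release's user-supplied values with a local values file.
# values_drift <expected-values-file> [release] [namespace]
# Exit status follows diff: 0 when identical, 1 when the two differ.
values_drift() {
  local expected_file="$1"
  local release="${2:-edgedelta}"
  local namespace="${3:-edgedelta}"
  helm get values "$release" -n "$namespace" -o yaml | diff -u "$expected_file" -
}
```

Lines prefixed with "-" in the output come from the expected file, and lines prefixed with "+" come from the deployed release.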

4. Inspect the Pod’s Details via Helm

Review detailed information about agent pods deployed by Helm.

kubectl get pods -n edgedelta  

Look for pods that don’t have a status of Running or have a high number of restarts. Pods in other states, such as Pending, CrashLoopBackOff, or Error, can indicate configuration errors, scheduling issues, or runtime problems. Once you have identified a pod that seems problematic, get more detailed information about that specific pod:

kubectl describe pod <pod-name> -n edgedelta

Replace <pod-name> with the name of the pod you need to investigate. The output of this command includes:

  • Metadata: Includes labels, annotations, and other identifiers.
  • Spec: The desired state as defined in the pod’s manifest.
  • Status: Current status of the pod, including phase, conditions, and events which may point to issues during scheduling or running the pod.
  • Conditions: These include PodScheduled, ContainersReady, Initialized, and Ready. They provide insight into the lifecycle state of the pod.
  • Events: This is usually the most useful section for diagnosing problems. Events describe actions taken by the Kubernetes system (like attempting to start a pod), as well as any issues it has encountered. Error messages here can point you toward problems with volume mounts, image pulls, resource shortages, and health check failures.
  • Resources: Requests and limits, which help you spot CPU and memory constraints.
  • Volumes: Volumes and mount points, which confirm data persistence and access to required files.
  • Environment: Environment variables configured for the Edge Delta agent.
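To shortlist the pods worth describing, the status and restart check from this step can be sketched as a small filter. It assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the name flag_pods and the default threshold of 5 restarts are arbitrary choices for illustration.

```shell
# Sketch: read `kubectl get pods --no-headers` output on stdin and print pods
# that are not Running or whose restart count exceeds a threshold (default 5).
# Usage: kubectl get pods -n edgedelta --no-headers | flag_pods 5
flag_pods() {
  local max_restarts="${1:-5}"
  # Columns: NAME READY STATUS RESTARTS AGE. The restart count is the fourth
  # field even when kubectl appends "(3m ago)"; "+ 0" forces a numeric compare.
  awk -v max="$max_restarts" '$3 != "Running" || $4 + 0 > max { print $1, $3, $4 }'
}
```

Each output line gives the pod name, its status, and its restart count, ready to feed into kubectl describe pod.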

See Debug the Installation of Edge Delta Components for more detailed steps.

5. Restart the Fleet using Helm

Restart the Edge Delta pods managed by the Helm release.

helm upgrade --reuse-values edgedelta edgedelta/edgedelta -n edgedelta

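The --reuse-values flag tells Helm to reuse the values from the last release, so no configuration changes are introduced. Optionally add --set flags to modify specific values. Note that Helm only patches resources whose rendered manifests differ, so if neither the chart nor the values have changed, the pods may not restart; in that case, kubectl rollout restart on the agent workload forces a restart.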
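A possible way to confirm that the pods actually came back is to follow the upgrade with a rollout check. This sketch assumes the agent runs as a DaemonSet (as the event selector used in this guide suggests); restart_release is a hypothetical helper, the DaemonSet name must be supplied by you, and the 300-second timeout is an arbitrary choice.

```shell
# Sketch: re-apply the release with its current values, then wait for the rollout.
# restart_release <daemonset-name> [release] [namespace]
# <daemonset-name> is a placeholder; find yours with `kubectl get daemonset -n edgedelta`.
restart_release() {
  local daemonset="$1"
  local release="${2:-edgedelta}"
  local namespace="${3:-edgedelta}"
  helm upgrade --reuse-values "$release" edgedelta/edgedelta -n "$namespace" || return 1
  # Blocks until every pod of the DaemonSet is updated and ready, or the timeout expires.
  kubectl rollout status "daemonset/${daemonset}" -n "$namespace" --timeout=300s
}
```

A non-zero exit status means either the upgrade failed or the rollout did not finish within the timeout.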

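6. Review Events for the Release

Investigate events related to the Helm release for additional context.

kubectl get events -n edgedelta --field-selector involvedObject.kind=DaemonSet

This selector returns only events recorded against DaemonSets. Many of the events listed here are recorded against Pods, so change the selector to involvedObject.kind=Pod, or omit it, to see them. Look for the following:

  • Pod Scheduling Issues: FailedScheduling events indicate that a pod cannot be scheduled, often due to insufficient resources or configured affinity/anti-affinity rules.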
  • Pod Lifecycle Events: These events document the creation, starting, killing, and scheduling of pods, which may reveal transient or recurring problems.
  • Image Pull Problems: ErrImagePull or ImagePullBackOff events suggest that there is a problem with pulling the container image from the registry.
  • Resource Evictions: Evicted events signal that a pod has been terminated due to resource scarcity, often tied to the node’s available CPU or memory.
  • Node Issues: Events indicating that a node is in a NotReady state or other problems affecting node health that could impact pod scheduling and operation.
  • Volume Attachment Issues: FailedMount or FailedAttachVolume events are critical if the pod relies on persistent volumes, indicating issues with volume binding or access rights.
  • Probe Failures: Unhealthy events related to liveness and readiness probes reveal that a container may not be running as expected.
  • Excessive Restarts or Failures: Frequent Restarting events or BackOff statuses point to issues with pod stability.
  • ConfigMap or Secret Issues: Problems with accessing ConfigMaps or Secrets required by the Edge Delta agent may cause the pod to fail to start or operate correctly.
  • Warnings: Any event of type Warning should be investigated promptly. Kubernetes events are typed either Normal or Warning, and Warning events typically denote issues that may affect service operations.
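With many events in the namespace, a per-reason count of warnings can show where to look first. The following is a sketch only: warning_summary is a hypothetical helper, and it expects three whitespace-separated columns (type, reason, object name) such as those produced by the custom-columns output shown in its comment.

```shell
# Sketch: count Warning events per reason, most frequent first.
# Reads "TYPE REASON OBJECT" lines on stdin, e.g. from:
#   kubectl get events -n edgedelta --no-headers \
#     -o custom-columns=TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name
warning_summary() {
  awk '$1 == "Warning" { count[$2]++ }
       END { for (reason in count) print count[reason], reason }' |
    sort -k1,1nr -k2,2
}
```

Reasons such as FailedScheduling, BackOff, or Unhealthy at the top of the list map directly onto the categories described above.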

7. Update Agent Image

Roll out an updated version of the Edge Delta agent if needed, following the deployment steps in the UI.

To view the deployment commands for an existing v3 pipeline configuration:

  1. Click Pipelines and select the Fleet.
  2. Click Add Agents.
  3. Select the deployment type.

8. Utilize Helm Rollbacks

If an upgrade introduces an issue, use Helm to roll back to a previous working revision.

helm rollback edgedelta [REVISION] -n edgedelta

Replace [REVISION] with the number of the release revision you want to restore. Running helm history edgedelta -n edgedelta lists the available revisions.
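The rollback can be wrapped so that the revision history is shown before anything changes. This is a minimal sketch, assuming the edgedelta release and namespace names used in this guide; rollback_release is a hypothetical helper, and the limit of 10 history entries is arbitrary.

```shell
# Sketch: show recent revisions, then roll back to the chosen one.
# rollback_release <revision> [release] [namespace]
rollback_release() {
  local revision="$1"
  local release="${2:-edgedelta}"
  local namespace="${3:-edgedelta}"
  # Refuse anything that is not a whole number before touching the release.
  case "$revision" in
    ''|*[!0-9]*) echo "revision must be a number" >&2; return 2 ;;
  esac
  helm history "$release" -n "$namespace" --max 10 || return 1
  # --wait makes Helm block until the rolled-back resources report ready.
  helm rollback "$release" "$revision" -n "$namespace" --wait
}
```

A rollback is itself recorded as a new revision, so the history grows by one entry after the command succeeds.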

9. Troubleshoot Helm Release Resource Limits

See Scale Edge Delta Deployments.