Level 1 Metrics Maturity with Edge Delta

Basic Kubernetes Object State Monitoring (KSM)

Overview

Check what the Kubernetes API reports about the desired and actual state of your workloads, and detect mismatches between the spec and the runtime state as observed by Kubernetes itself. At this level, KSM metrics are used. They are enabled by default in the Metrics Source node configuration.

Check if a Pod is stuck or unhealthy at the phase level

What is the pod’s lifecycle phase?

The k8s.ksm.pod.status_phase.value metric is used to detect pods stuck in Pending, pods that have failed, or pods that completed normally. If a pod is stuck in Pending, it may be unschedulable (for example, due to insufficient node resources) or still pulling images.
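
As a rough illustration of this check, the following Python sketch classifies pods by their reported phase. The datapoint shape, label keys, and the assumption that the current phase reports a value of 1 are all illustrative and not Edge Delta's API.

```python
# Minimal sketch, not Edge Delta's API: classify pods by lifecycle phase.
# Datapoints are assumed to be dicts with "name", "value", and "labels",
# and the current phase series is assumed to report a value of 1.
datapoints = [
    {"name": "k8s.ksm.pod.status_phase.value", "value": 1,
     "labels": {"pod": "checkout-7d9f", "phase": "Pending"}},
    {"name": "k8s.ksm.pod.status_phase.value", "value": 1,
     "labels": {"pod": "worker-a1b2", "phase": "Failed"}},
]

for dp in datapoints:
    if dp["name"] != "k8s.ksm.pod.status_phase.value" or dp["value"] != 1:
        continue
    pod, phase = dp["labels"]["pod"], dp["labels"]["phase"]
    if phase == "Pending":
        print(f"{pod}: stuck in Pending (unschedulable or still pulling images?)")
    elif phase == "Failed":
        print(f"{pod}: failed and needs investigation")
```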

Verify if Containers inside a Pod are running

Are containers in the pod running?

The k8s.ksm.pod.container_status_running.value metric is used to confirm if containers inside a pod are running. If the value is 1, the container is running; if 0, the container is not running and could be waiting, failed, or initializing. This helps detect containers that haven’t started or are stuck due to issues like image pull errors or resource constraints.
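
A minimal sketch of this check, assuming datapoints arrive as simple Python dicts; the shape and label keys are assumptions for the example only.

```python
# Minimal sketch: flag containers whose running gauge reports 0.
# The datapoint shape and label keys are assumptions for illustration.
def not_running(datapoints):
    """Yield (pod, container) pairs that are not currently running."""
    for dp in datapoints:
        if dp["name"] == "k8s.ksm.pod.container_status_running.value" and dp["value"] == 0:
            yield dp["labels"].get("pod"), dp["labels"].get("container")

sample = [{"name": "k8s.ksm.pod.container_status_running.value", "value": 0,
           "labels": {"pod": "api-5c6d", "container": "api"}}]
for pod, container in not_running(sample):
    print(f"{container} in {pod} is not running (waiting, failed, or initializing)")
```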

Identify Containers that have Terminated

Have any containers terminated unexpectedly?

The k8s.ksm.pod.container_status_terminated.value metric is used to detect containers that have exited, either normally or due to errors. A value of 1 indicates the container has terminated, which can help identify crashes or containers that have completed their tasks.
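
The same idea can be sketched for terminated containers; the datapoint shape below is assumed, and exit reasons are not modeled.

```python
# Minimal sketch: collect containers whose terminated gauge reports 1.
# Datapoint shape is assumed; exit reasons are not modeled here.
def terminated_containers(datapoints):
    return [dp["labels"] for dp in datapoints
            if dp["name"] == "k8s.ksm.pod.container_status_terminated.value"
            and dp["value"] == 1]

sample = [{"name": "k8s.ksm.pod.container_status_terminated.value", "value": 1,
           "labels": {"pod": "batch-import-xk2", "container": "worker"}}]
for labels in terminated_containers(sample):
    print(f'{labels["container"]} in {labels["pod"]} has terminated')
```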

Detect Containers stuck in Waiting state

Are any containers stuck waiting instead of running?

The k8s.ksm.pod.container_status_waiting.value metric is used to identify containers that are still in the waiting state. A value of 1 means the container is waiting, which may indicate issues such as image pull errors or startup delays.
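
A small sketch that counts waiting containers per pod, so persistent waits stand out; the datapoint format is an assumption for the example.

```python
# Minimal sketch: count waiting containers per pod so persistent waits stand out.
# Datapoint shape is assumed for the example.
from collections import Counter

def waiting_by_pod(datapoints):
    return Counter(dp["labels"]["pod"] for dp in datapoints
                   if dp["name"] == "k8s.ksm.pod.container_status_waiting.value"
                   and dp["value"] == 1)

sample = [{"name": "k8s.ksm.pod.container_status_waiting.value", "value": 1,
           "labels": {"pod": "frontend-9k1", "container": "nginx"}}]
print(waiting_by_pod(sample))  # Counter({'frontend-9k1': 1})
```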

Monitor Container Restart Frequency

Are containers restarting frequently (potential crash loops)?

The k8s.ksm.pod.container_status_restarts.value metric is used to monitor how often containers restart. A restart count that keeps climbing over time may indicate instability or CrashLoopBackOff behavior, and helps you pinpoint containers that are repeatedly crashing.
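
Because this metric is a cumulative count, the useful signal is how fast it grows between scrapes. The sketch below compares two snapshots; the snapshot format and the threshold of 3 restarts are illustrative only.

```python
# Minimal sketch: compare restart counts between two scrapes and flag rapid growth.
# The snapshot format and the threshold of 3 restarts are illustrative only.
def crash_loop_suspects(previous, current, threshold=3):
    """previous/current map (pod, container) -> cumulative restart count."""
    return [key for key, count in current.items()
            if count - previous.get(key, 0) >= threshold]

prev = {("api-5c6d", "api"): 2}
curr = {("api-5c6d", "api"): 7}
print(crash_loop_suspects(prev, curr))  # [('api-5c6d', 'api')]
```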

Validate Resource Configuration of Containers

Are CPU and Memory requests/limits properly configured?

The k8s.ksm.pod.container_resource_limits_cpu.value, k8s.ksm.pod.container_resource_limits_memory.value, k8s.ksm.pod.container_resource_requests_cpu.value, and k8s.ksm.pod.container_resource_requests_memory.value metrics are used to check whether resource limits and requests are set for each container. Monitoring these metrics helps detect missing configurations, which can lead to OOMKills, CPU throttling, or resource contention, and helps ensure pods have appropriate resource guarantees.
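
One way to surface missing configuration is to check which of the four series each container actually reports. The metric names below follow this page; the datapoint shape is assumed, and containers reporting none of the four would need a separate inventory to catch.

```python
# Minimal sketch: find containers that report only some of the four
# request/limit metrics. Datapoint shape is an assumption for the example.
REQUIRED = {
    "k8s.ksm.pod.container_resource_limits_cpu.value",
    "k8s.ksm.pod.container_resource_limits_memory.value",
    "k8s.ksm.pod.container_resource_requests_cpu.value",
    "k8s.ksm.pod.container_resource_requests_memory.value",
}

def missing_resource_config(datapoints):
    seen = {}  # (pod, container) -> set of resource metrics reported
    for dp in datapoints:
        if dp["name"] in REQUIRED:
            key = (dp["labels"]["pod"], dp["labels"]["container"])
            seen.setdefault(key, set()).add(dp["name"])
    return {key: sorted(REQUIRED - names) for key, names in seen.items()
            if names != REQUIRED}

sample = [{"name": "k8s.ksm.pod.container_resource_requests_cpu.value", "value": 0.1,
           "labels": {"pod": "api-5c6d", "container": "api"}}]
print(missing_resource_config(sample))  # missing memory request plus both limits
```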

Verify Deployment Health and Availability

Are Deployments maintaining the expected number of available pods?

The k8s.ksm.deployment_metadata_generation.value and k8s.ksm.deployment.status_replicas_available.value metrics are used to compare the desired and available replicas for a deployment. These metrics help verify that deployments have the intended number of available pods and detect if rollouts are stuck, delayed, or degraded.
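
A minimal sketch of the availability half of this check: compare available replicas against the count you expect. The expected counts are hard-coded here for illustration (in practice they come from the Deployment spec), and the datapoint shape is assumed.

```python
# Minimal sketch: compare available replicas against an expected count.
# Expected counts and datapoint shape are assumptions for the example.
def degraded_deployments(datapoints, expected):
    available = {dp["labels"]["deployment"]: dp["value"] for dp in datapoints
                 if dp["name"] == "k8s.ksm.deployment.status_replicas_available.value"}
    return {name: (available.get(name, 0), want)
            for name, want in expected.items() if available.get(name, 0) < want}

sample = [{"name": "k8s.ksm.deployment.status_replicas_available.value", "value": 1,
           "labels": {"deployment": "checkout"}}]
print(degraded_deployments(sample, {"checkout": 3}))  # {'checkout': (1, 3)}
```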

Detect Updates to DaemonSets and StatefulSets

Have DaemonSets or StatefulSets been updated recently?

The k8s.ksm.daemonset_metadata_generation.value and k8s.ksm.statefulset_metadata_generation.value metrics are used to track updates to DaemonSets and StatefulSets. Increments in the generation number indicate configuration changes, helping you detect updates that pods have not yet reconciled.
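
Since the signal is a change in generation rather than its absolute value, a simple comparison of two scrapes is enough. The snapshot format below is an assumption for the example.

```python
# Minimal sketch: detect generation bumps between two scrapes of the DaemonSet
# and StatefulSet generation metrics. The snapshot format is an assumption.
GENERATION_METRICS = {
    "k8s.ksm.daemonset_metadata_generation.value",
    "k8s.ksm.statefulset_metadata_generation.value",
}

def changed_workloads(previous, current):
    """previous/current map (metric name, workload name) -> generation number."""
    return [key for key, gen in current.items()
            if key[0] in GENERATION_METRICS and gen > previous.get(key, gen)]

prev = {("k8s.ksm.daemonset_metadata_generation.value", "node-agent"): 4}
curr = {("k8s.ksm.daemonset_metadata_generation.value", "node-agent"): 5}
print(changed_workloads(prev, curr))  # the node-agent DaemonSet was updated
```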

Monitor the Count and Status of Cluster Resources

Are all expected cluster resources (Jobs, CronJobs, Namespaces, Nodes) present and active?

The k8s.ksm.job_info.value, k8s.ksm.cronjob_info.value, k8s.ksm.namespace.status_phase.value, and k8s.ksm.node.info.value metrics are used to track the count and status of key Kubernetes resources. Monitoring these metrics helps detect missing or unexpectedly deleted resources, abnormal growth or shrinkage, and issues with namespaces or nodes.
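
A minimal sketch of this kind of drift check: count the reported series per resource type and compare the counts against a baseline. The baseline values and datapoint shape are illustrative, and only series with a value of 1 are counted so each namespace contributes a single (active) phase series.

```python
# Minimal sketch: count reported series per resource type and compare against
# a baseline so unexpected growth or disappearance stands out.
# Baseline values and datapoint shape are assumptions for the example.
from collections import Counter

TRACKED = {
    "k8s.ksm.job_info.value": "jobs",
    "k8s.ksm.cronjob_info.value": "cronjobs",
    "k8s.ksm.namespace.status_phase.value": "namespaces",
    "k8s.ksm.node.info.value": "nodes",
}

def count_drift(datapoints, baseline):
    counts = Counter(TRACKED[dp["name"]] for dp in datapoints
                     if dp["name"] in TRACKED and dp["value"] == 1)
    return {kind: (counts.get(kind, 0), want)
            for kind, want in baseline.items() if counts.get(kind, 0) != want}

sample = [{"name": "k8s.ksm.node.info.value", "value": 1, "labels": {"node": "node-a"}}]
print(count_drift(sample, {"nodes": 3}))  # {'nodes': (1, 3)}
```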