Level 2 Metrics Maturity with Edge Delta

Node and Resource Usage Monitoring

Overview

At Level 2 maturity, you extend your Kubernetes observability practices beyond basic workload health to include comprehensive monitoring of node and resource utilization. This step is essential for detecting early signs of instability, resource exhaustion, or underlying infrastructure issues that could impact application reliability. By systematically tracking CPU, memory, and storage metrics at both the container and node levels, you gain deeper visibility into how resources are consumed across the cluster, allowing for more accurate capacity planning and quicker response to resource-related incidents.

This maturity level leverages metrics to monitor resource consumption and identify bottlenecks before they affect application performance. Additionally, node-level metrics help ensure that the underlying infrastructure is healthy and performant.

At this level, Kubelet, cAdvisor, and Node Exporter metrics are required. Ensure these are enabled in the Metrics Source node configuration.

Track Resource Utilization of Pods and Nodes

Are pods and nodes exceeding recommended thresholds for CPU and memory usage?

To track CPU usage at the container level, the k8s.container.cpu.usage_seconds.value and k8s.container.cpu.usage_seconds.rate metrics can be used. These metrics monitor the total CPU time consumed by containers and the rate at which CPU is being used. High or rapidly increasing values can signal excessive CPU consumption by workloads, which may lead to node instability or trigger CPU throttling.

For memory usage, the k8s.container.memory.usage_bytes.value, k8s.container.memory.max_usage_bytes.value, and k8s.container.memory.working_set_bytes.value metrics allow you to track the current memory usage, maximum memory usage, and working set memory for containers. Monitoring these metrics helps identify when containers are approaching or exceeding their memory limits, potentially causing OOMKills or degraded performance.

At the node level, the k8s.node.cpu.seconds.value and k8s.node.cpu.seconds.rate metrics track the total CPU consumption across the node. Additionally, the k8s.node.memory.available_bytes.value, k8s.node.memory.total_bytes.value, and k8s.node.memory.free_bytes.value metrics provide insight into the memory resources available on each node. Monitoring these metrics helps ensure nodes have sufficient resources for scheduling and running workloads effectively.
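As a rough illustration, the sketch below applies such threshold checks to metric samples that are assumed to have already been collected into plain Python dictionaries keyed by the metric names above. The data shape, the 512 MiB limit, and the thresholds are assumptions made for the example; this is not an Edge Delta API.

```python
# Minimal sketch: flag containers near their memory limit and nodes low on memory.
# Metric samples are assumed to be plain dicts keyed by the metric names above.

def container_memory_pressure(sample: dict, limit_bytes: int, threshold: float = 0.9) -> bool:
    """True if the container's working set is within `threshold` of its memory limit."""
    working_set = sample["k8s.container.memory.working_set_bytes.value"]
    return working_set >= threshold * limit_bytes

def node_memory_headroom(sample: dict) -> float:
    """Available memory as a fraction of total memory for a node."""
    available = sample["k8s.node.memory.available_bytes.value"]
    total = sample["k8s.node.memory.total_bytes.value"]
    return available / total

# Made-up example values: a 512 MiB container limit and a node with 16 GiB of RAM.
container_sample = {"k8s.container.memory.working_set_bytes.value": 500 * 1024**2}
node_sample = {
    "k8s.node.memory.available_bytes.value": 2 * 1024**3,
    "k8s.node.memory.total_bytes.value": 16 * 1024**3,
}

if container_memory_pressure(container_sample, limit_bytes=512 * 1024**2):
    print("container close to its memory limit; OOMKill risk")
if node_memory_headroom(node_sample) < 0.10:
    print("node has less than 10% memory available")
```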

By continuously observing CPU and memory metrics at both the container and node levels, you can proactively detect and address resource bottlenecks before they impact application performance or node stability.

Monitor Node Availability and Readiness

Are all nodes in the cluster ready and available for scheduling?

The k8s.node.info.value metric is used to track the presence and registration of nodes in the cluster. Monitoring the occurrence and total count of this metric across your cluster helps detect when nodes become unavailable, are removed, or are newly added. An unexpected drop in the node count, or the absence of a specific node from this metric, indicates that the node may no longer be registered or reachable, which could impact workload scheduling and the overall capacity of the cluster.
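A minimal sketch of such a check, assuming the set of node names seen in each scrape of k8s.node.info.value is already available; the node names and counts are invented for the example:

```python
# Minimal sketch: detect a drop in registered nodes between two scrapes of
# k8s.node.info.value. Node names here are made-up examples.

def missing_nodes(previous: set, current: set) -> set:
    """Return nodes that were registered previously but are absent now."""
    return previous - current

prev_scrape = {"node-a", "node-b", "node-c"}
curr_scrape = {"node-a", "node-c"}

gone = missing_nodes(prev_scrape, curr_scrape)
if gone:
    print(f"nodes no longer reporting k8s.node.info: {sorted(gone)}")
```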

By routinely checking this metric, you can ensure that all nodes expected to participate in the cluster remain available for pod scheduling, and quickly detect any issues related to node registration or connectivity.

Detect Node-Level Storage Issues Impacting Persistent Volume Reliability

Are node storage resources healthy, and could they be affecting volume provisioning or performance?

To monitor storage health at the node level, metrics such as k8s.node.filesystem.avail_bytes.value, k8s.node.filesystem.free_bytes.value, and k8s.node.filesystem.size_bytes.value provide insights into available disk space. Low available space on nodes can prevent the provisioning of new volumes or cause failures in workloads that rely on existing volumes. These metrics help identify nodes that are running low on disk space, enabling you to take action before storage exhaustion disrupts your workloads.

In addition to disk space, tracking disk performance with the k8s.node.disk.reads_completed.value, k8s.node.disk.writes_completed.value, k8s.node.disk.read_bytes.value, and k8s.node.disk.written_bytes.value metrics is essential for understanding the volume of read and write operations and the amount of data being processed by the disks. High or abnormal values may indicate I/O bottlenecks that can degrade the performance of applications relying on persistent storage.
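The sketch below shows one way to combine these signals: a free-space check plus a write-throughput estimate derived from two consecutive scrapes of the cumulative written-bytes counter. The sample values, the 10% threshold, and the 60-second scrape interval are assumptions for illustration.

```python
# Minimal sketch: flag nodes low on filesystem space and estimate disk write
# throughput from two consecutive scrapes of a cumulative counter.

def filesystem_free_fraction(sample: dict) -> float:
    """Fraction of the node filesystem still available."""
    return (sample["k8s.node.filesystem.avail_bytes.value"]
            / sample["k8s.node.filesystem.size_bytes.value"])

def write_throughput_bytes_per_sec(prev: dict, curr: dict, interval_sec: float) -> float:
    """Approximate write throughput from the cumulative written-bytes counter."""
    delta = (curr["k8s.node.disk.written_bytes.value"]
             - prev["k8s.node.disk.written_bytes.value"])
    return delta / interval_sec

node_fs = {
    "k8s.node.filesystem.avail_bytes.value": 8 * 1024**3,
    "k8s.node.filesystem.size_bytes.value": 100 * 1024**3,
}
prev = {"k8s.node.disk.written_bytes.value": 4_000_000_000}
curr = {"k8s.node.disk.written_bytes.value": 4_600_000_000}

if filesystem_free_fraction(node_fs) < 0.10:
    print("node filesystem below 10% free; volume provisioning may fail")
print(f"~{write_throughput_bytes_per_sec(prev, curr, 60):.0f} bytes/s written")
```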

Finally, the k8s.node.load.15min.value and k8s.node.memory.available_bytes.value metrics provide insights into system load and memory pressure, which can indirectly affect disk responsiveness and overall node reliability. By monitoring these metrics, you can proactively detect and resolve issues that may impact persistent volume operations or node performance.

Observe Deployment and Rollout Progression

Is the rollout of new Deployments, StatefulSets, or DaemonSets progressing as expected without stuck or failed updates?

To monitor the state of Kubernetes Deployments, metrics such as k8s.deployment.replicas.available.value and k8s.deployment.replicas.updated.value are useful. By comparing the available replicas with the updated replicas, you can detect if a rollout is stuck, delayed, or if the deployment has failed to reach the desired state. A persistent gap between the desired replica count and the available replicas often indicates an issue with pod scheduling or readiness that is blocking the update.

For StatefulSets and DaemonSets, the k8s.statefulset.replicas.available.value and k8s.daemonset.replicas.available.value metrics track how many replicas are available as updates roll out. Monitoring these values, along with the corresponding desired pod counts, can help determine whether updates are being applied successfully or whether pods are failing to reconcile with the new configuration.
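A minimal sketch of this comparison, assuming the desired count comes from the workload spec and the updated and available counts from the metrics above; the replica numbers are made-up examples:

```python
# Minimal sketch: compare desired, updated, and available replica counts to
# spot a stuck rollout. The numbers below are invented for the example.

def rollout_stuck(desired: int, updated: int, available: int) -> bool:
    """A rollout looks stuck if not all replicas have been updated, or if
    updated replicas are not all becoming available."""
    return updated < desired or available < updated

# desired=5 from the Deployment spec; updated/available from
# k8s.deployment.replicas.updated.value and k8s.deployment.replicas.available.value
if rollout_stuck(desired=5, updated=5, available=3):
    print("rollout not progressing: updated replicas are not all available")
```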

By regularly monitoring these metrics, you can identify rollout progression issues early and respond proactively to minimize service disruptions or incomplete updates.

Correlate Application Performance Metrics with Infrastructure State

Are changes in pod or node states affecting application response times, error rates, or user experience?

Correlating application performance metrics with infrastructure state can help identify the root causes of performance issues. For containers and pods, metrics like k8s.container.cpu.usage_seconds.value, k8s.container.memory.usage_bytes.value, and k8s.container.memory.working_set_bytes.value are essential for monitoring resource consumption and stability. Sudden spikes or sustained high usage in these metrics may coincide with a drop in application performance or user experience.

At the node level, metrics such as k8s.node.cpu.seconds.value, k8s.node.cpu.seconds.rate, k8s.node.memory.available_bytes.value, and k8s.node.load.15min.value provide insight into node resource contention and systemic pressures. These can indirectly affect workloads and end-user experience, making it essential to monitor both infrastructure and application performance metrics to diagnose potential issues quickly.
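One simple way to correlate the two signal types is to line up application latency samples with node CPU samples taken at the same timestamps and report windows where both are elevated. In the sketch below the sample series, the thresholds, and the normalization of the CPU rate to node capacity are all assumptions for illustration.

```python
# Minimal sketch: flag time windows where an application latency spike
# coincides with node CPU saturation. All values are invented examples.

latency_ms = {0: 120, 60: 135, 120: 480, 180: 510}          # app p95 latency per minute
node_cpu_rate = {0: 0.45, 60: 0.50, 120: 0.93, 180: 0.95}   # k8s.node.cpu.seconds.rate,
                                                            # normalized here to node capacity (assumption)

for ts in sorted(latency_ms):
    if latency_ms[ts] > 300 and node_cpu_rate.get(ts, 0) > 0.85:
        print(f"t={ts}s: latency spike coincides with node CPU saturation")
```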

By analyzing these metrics in conjunction with application logs or telemetry, you can identify whether performance degradation is caused by container, pod, or node-level events and address the root cause efficiently.

Validate Cluster Autoscaler Behavior

Are cluster autoscaling events triggering when needed, and are additional nodes joining the cluster as expected?

The k8s.node.info.value metric is used to monitor the total number of nodes in the cluster. An increase in the count following an autoscaling event indicates that new nodes have successfully joined the cluster, while a decrease may signal scale-down activity or unexpected node loss. Correlating this metric with resource usage metrics such as k8s.node.cpu.seconds.value, k8s.node.cpu.seconds.rate, k8s.node.memory.available_bytes.value, and k8s.node.memory.total_bytes.value can help you assess whether scaling events are triggered by actual resource pressure, such as high CPU or low available memory.
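A minimal sketch of this validation, assuming node counts are taken from the number of distinct k8s.node.info.value series before and after the scaling event and that resource pressure has already been evaluated separately; the counts and the pressure flag are illustrative.

```python
# Minimal sketch: confirm that a scale-up actually added nodes after resource
# pressure was observed. Counts and the pressure flag are invented examples.

def scale_up_effective(node_count_before: int, node_count_after: int,
                       memory_pressure_seen: bool) -> bool:
    """True if the cluster grew following observed resource pressure."""
    return memory_pressure_seen and node_count_after > node_count_before

# node counts derived from distinct k8s.node.info.value series per scrape
if scale_up_effective(node_count_before=6, node_count_after=8, memory_pressure_seen=True):
    print("autoscaler added nodes in response to memory pressure")
else:
    print("pressure observed but node count did not increase; investigate the autoscaler")
```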

By observing node count changes alongside resource utilization trends, you can ensure the Cluster Autoscaler is responding appropriately to workload demands, with newly added nodes becoming available for pod scheduling as expected.

Assess Workload Distribution and Balance

Are workloads evenly distributed across nodes, or are some nodes overloaded or underutilized?

The k8s.node.info.value metric can be used to identify all active nodes in the cluster. By comparing the CPU and memory usage across nodes, using metrics like k8s.node.cpu.seconds.value, k8s.node.cpu.seconds.rate, k8s.node.memory.available_bytes.value, and k8s.node.memory.total_bytes.value, you can detect imbalances in workload distribution. Nodes with consistently high resource consumption are considered hotspots, while nodes with low usage may be underutilized.
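As a small illustration, the sketch below measures imbalance as the spread between the busiest and least busy node; the per-node utilization figures and the 30-point spread threshold are assumptions for the example.

```python
# Minimal sketch: detect workload imbalance from per-node CPU utilization.
# The per-node values below are invented examples.

node_cpu_utilization = {   # fraction of each node's CPU capacity in use
    "node-a": 0.82,
    "node-b": 0.35,
    "node-c": 0.40,
}

busiest = max(node_cpu_utilization, key=node_cpu_utilization.get)
idlest = min(node_cpu_utilization, key=node_cpu_utilization.get)
spread = node_cpu_utilization[busiest] - node_cpu_utilization[idlest]

if spread > 0.30:
    print(f"imbalance: {busiest} at {node_cpu_utilization[busiest]:.0%}, "
          f"{idlest} at {node_cpu_utilization[idlest]:.0%}")
```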

Monitoring these metrics regularly helps ensure that workloads are distributed efficiently, supporting better resource utilization and improved reliability. It also aids in identifying opportunities for optimizing scheduling policies or adjusting autoscaling configurations to achieve a more balanced distribution of workloads across the cluster.

Monitor Job and CronJob Execution Outcomes

Are jobs and cronjobs completing successfully and on schedule, with abnormal failures immediately detected?

The k8s.job.replicas.value and k8s.cronjob.replicas.value metrics track the presence and count of jobs and cronjobs within the cluster. Monitoring these metrics helps identify when new jobs or cronjobs are created or removed, but they do not directly provide success or failure counts. However, an unexpected drop or accumulation in these counts may signal abnormal completions, failures, or irregularities in the job schedule.

To investigate execution issues further, correlate these job metrics with resource usage metrics like k8s.container.cpu.usage_seconds.value and k8s.container.memory.usage_bytes.value for the pods owned by jobs. This helps identify resource-related issues that may cause job failures or irregular behavior.
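A minimal sketch of the presence check described above, assuming the set of jobs that should exist is known from your schedule and the observed set comes from the current scrape of k8s.job.replicas.value; the job names are invented examples.

```python
# Minimal sketch: compare jobs observed in the current scrape against the jobs
# a schedule says should exist. Job names are made-up examples.

expected_jobs = {"nightly-backup", "hourly-report", "cache-warmup"}
observed_jobs = {"nightly-backup", "cache-warmup"}

missing = expected_jobs - observed_jobs
if missing:
    print(f"expected jobs not reporting metrics: {sorted(missing)}")
```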

By closely monitoring these metrics, you can quickly detect abnormal job or cronjob activity, investigate the causes, and respond promptly to failures or scheduling issues.

Track Resource Quotas and Limits Enforcement

Are resource quotas and limit ranges enforced and preventing resource exhaustion or abuse across namespaces?

Metrics such as k8s.pod.container_resource_limits_cpu.value, k8s.pod.container_resource_limits_memory.value, k8s.pod.container_resource_requests_cpu.value, and k8s.pod.container_resource_requests_memory.value help ensure that containers are properly configured with appropriate CPU and memory limits and requests. Monitoring these metrics across all namespaces allows you to identify containers that lack resource specifications or those with unusually high limits, which could lead to resource exhaustion and instability if abused.
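The sketch below shows one form such a check could take: flag containers with no memory limit or with a limit above a policy ceiling. The container records, the 4 GiB ceiling, and the policy itself are assumptions for the example.

```python
# Minimal sketch: flag containers with no memory limit or with a limit above a
# namespace policy ceiling. Records and the ceiling are invented examples.

POLICY_MAX_MEMORY_LIMIT = 4 * 1024**3  # 4 GiB ceiling, an example policy value

containers = [
    {"name": "api", "k8s.pod.container_resource_limits_memory.value": 512 * 1024**2},
    {"name": "batch", "k8s.pod.container_resource_limits_memory.value": None},
    {"name": "cache", "k8s.pod.container_resource_limits_memory.value": 8 * 1024**3},
]

for c in containers:
    limit = c["k8s.pod.container_resource_limits_memory.value"]
    if limit is None:
        print(f"{c['name']}: no memory limit set")
    elif limit > POLICY_MAX_MEMORY_LIMIT:
        print(f"{c['name']}: memory limit exceeds policy ceiling")
```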

Tracking these metrics over time supports compliance with organizational resource policies, helps prevent resource exhaustion, and promotes healthy multi-tenancy within the Kubernetes environment.

Detect Early Signs of Node Resource Exhaustion

Are any nodes approaching their resource limits, risking pod eviction or node instability?

Metrics such as k8s.node.cpu.seconds.value and k8s.node.cpu.seconds.rate track the total and per-second CPU consumption by each node. Rapidly increasing or consistently high values may indicate nodes nearing CPU saturation, which can lead to pod throttling or evictions. Similarly, the k8s.node.memory.available_bytes.value, k8s.node.memory.total_bytes.value, and k8s.node.memory.free_bytes.value metrics track the available, total, and free memory on each node. Low available memory is a key early warning sign of potential resource exhaustion, which could cause pod evictions or node instability.
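One simple early-warning heuristic is to extrapolate the recent decline in available memory and estimate when it would reach zero. The sketch below assumes consecutive samples of k8s.node.memory.available_bytes.value and a linear trend; the sample values and the linearity assumption are illustrative only.

```python
# Minimal sketch: project when a node runs out of memory by extrapolating the
# recent decline in k8s.node.memory.available_bytes.value. Samples are invented.

samples = [            # (timestamp_sec, available_bytes) from consecutive scrapes
    (0,   6 * 1024**3),
    (300, 5 * 1024**3),
    (600, 4 * 1024**3),
]

(t0, first), (tn, last) = samples[0], samples[-1]
decline_per_sec = (first - last) / (tn - t0)

if decline_per_sec > 0:
    seconds_left = last / decline_per_sec
    print(f"at the current rate, available memory reaches zero in ~{seconds_left/60:.0f} minutes")
```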

By continuously monitoring these node-level CPU and memory metrics, you can proactively detect resource pressure and intervene before workloads are disrupted or node health is compromised.