Level 4 Metrics Maturity with Edge Delta

Control Plane Monitoring

Overview

At this level, the focus is on monitoring the health and responsiveness of Kubernetes control plane components—including the API server, etcd, controller manager, and scheduler. Failures or performance issues in these components can cause systemic delays in workload orchestration, autoscaling, and cluster state reconciliation.

To ingest control plane metrics with Edge Delta, you must configure a Prometheus input node. These metrics are not available by default and must be explicitly scraped from the appropriate control plane endpoints (e.g., API server, scheduler, controller manager). Edge Delta supports Prometheus-style metric ingestion through static configurations or via integration with a Target Allocator. For a working example of how to configure scrape jobs for these endpoints, refer to the Prometheus Source configuration guide.
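The exact node settings depend on your pipeline, but the sketch below shows Prometheus-style scrape jobs for the four control plane endpoints. Job names, ports, and certificate paths are assumptions based on a typical kubeadm-style cluster; adapt them to your environment and to the Prometheus Source configuration guide.

```yaml
# Illustrative Prometheus-style scrape jobs for control plane metrics.
# Ports and credential paths are assumptions for a kubeadm-style cluster.
scrape_configs:
  - job_name: kube-apiserver
    scheme: https
    kubernetes_sd_configs:
      - role: endpoints
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Keep only the default/kubernetes service endpoints (the API server).
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        action: keep
        regex: default;kubernetes

  - job_name: kube-scheduler
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["127.0.0.1:10259"]   # scheduler secure metrics port

  - job_name: kube-controller-manager
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["127.0.0.1:10257"]   # controller manager secure metrics port

  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /etc/kubernetes/pki/etcd/ca.crt
      cert_file: /etc/kubernetes/pki/etcd/healthcheck-client.crt
      key_file: /etc/kubernetes/pki/etcd/healthcheck-client.key
    static_configs:
      - targets: ["127.0.0.1:2379"]    # etcd client/metrics port
```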

By observing these metrics, you can detect when controllers are lagging in applying desired cluster state, which may impact workload availability, autoscaling behavior, or resource finalization.

Detect Kubernetes API Server Slowness

Are API server response times increasing, indicating underlying performance or availability issues?

The apiserver_request_duration_seconds metric provides insight into how long API requests take to complete. It is available from the API server’s /metrics endpoint and must be scraped explicitly. The metric is particularly useful when broken down by its verb and resource labels, which show exactly which operations are degraded.

Consistently high latencies in this metric may indicate control plane stress, overloaded etcd backends, or network congestion between components.
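One way to surface this is a Prometheus-style alerting rule over the metric’s histogram buckets. The rule below is a sketch; the group name, the one-second threshold, and the severity label are illustrative assumptions.

```yaml
groups:
  - name: control-plane-apiserver          # group name is illustrative
    rules:
      - alert: APIServerHighRequestLatency
        # 99th percentile request latency over 5 minutes, broken down by verb and
        # resource; long-lived WATCH and CONNECT requests are excluded.
        expr: |
          histogram_quantile(0.99,
            sum by (verb, resource, le) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s for {{ $labels.verb }} {{ $labels.resource }}"
```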

Detect etcd Health and Leadership Issues

Is the etcd cluster maintaining a healthy leader and responding to client requests?

The etcd_server_has_leader metric indicates whether each etcd node currently sees a leader. A value of 0 suggests a loss of quorum or split-brain condition. This metric is emitted from etcd’s /metrics endpoint and must be scraped explicitly.

Leadership instability in etcd can lead to failures persisting cluster state, delayed control loop execution, and failed Kubernetes API operations.
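A minimal alerting rule for this condition might look like the following sketch; the rule name, wait duration, and severity are assumptions.

```yaml
groups:
  - name: control-plane-etcd               # group name is illustrative
    rules:
      - alert: EtcdMemberHasNoLeader
        # Fires when any etcd member reports that it currently sees no leader.
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has no leader"
```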

Detect Scheduler Latency or Failure

Is the Kubernetes scheduler functioning properly and placing pods without delay?

The scheduler_e2e_scheduling_latency_seconds and scheduler_schedule_attempts_total metrics reflect how long it takes to schedule pods and how many scheduling attempts are made (the exact name of the latency histogram varies by Kubernetes version; newer releases expose it as scheduler_scheduling_attempt_duration_seconds). These metrics are exposed by the kube-scheduler component and must be scraped from its /metrics endpoint.

Spikes in scheduling latency or repeated failed attempts can indicate resource pressure, misconfigured taints and tolerations, or pod affinity/anti-affinity conflicts. By monitoring this data, you can ensure pods are scheduled quickly and efficiently.
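The sketch below pairs a latency rule with a failed-attempts rule. The histogram name depends on your Kubernetes version, and the thresholds are illustrative assumptions.

```yaml
groups:
  - name: control-plane-scheduler          # group name is illustrative
    rules:
      - alert: SchedulerHighSchedulingLatency
        # p99 scheduling-attempt latency; older clusters expose this histogram as
        # scheduler_e2e_scheduling_duration_seconds_bucket, so adjust the metric
        # name to your Kubernetes version.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
      - alert: SchedulerFailedAttempts
        # Sustained rate of attempts that did not end in a successful placement
        # (result is "unschedulable" or "error").
        expr: sum(rate(scheduler_schedule_attempts_total{result!="scheduled"}[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
```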

Monitor Controller Manager Health

Are controllers (e.g., deployment, replication, node) executing reconciliation loops successfully?

The workqueue_depth and workqueue_queue_duration_seconds metrics indicate the depth and processing latency of controller queues within the kube-controller-manager. These metrics are exposed via the controller manager’s /metrics endpoint and require scraping.

A growing queue depth or increased queue duration may signal that control loops are falling behind—whether due to external API slowness (e.g., cloud provider APIs), internal contention, or misconfigurations.
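As a sketch, the rules below watch for sustained backlog and slow queue processing. The job label assumes a scrape job named kube-controller-manager (as in the earlier example), and the thresholds are illustrative.

```yaml
groups:
  - name: control-plane-controller-manager  # group name is illustrative
    rules:
      - alert: ControllerWorkqueueBacklog
        # Sustained per-queue depth; the "name" label identifies the controller queue.
        expr: sum by (name) (workqueue_depth{job="kube-controller-manager"}) > 100
        for: 15m
        labels:
          severity: warning
      - alert: ControllerWorkqueueSlow
        # p99 time items spend waiting in a queue before they are processed.
        expr: |
          histogram_quantile(0.99,
            sum by (name, le) (
              rate(workqueue_queue_duration_seconds_bucket{job="kube-controller-manager"}[5m])
            )
          ) > 10
        for: 15m
        labels:
          severity: warning
```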