Level 4 Metrics Maturity with Edge Delta
3 minute read
Overview
At this level, the focus is on monitoring the health and responsiveness of Kubernetes control plane components—including the API server, etcd, controller manager, and scheduler. Failures or performance issues in these components can cause systemic delays in workload orchestration, autoscaling, and cluster state reconciliation.
To ingest control plane metrics with Edge Delta, you must configure a Prometheus input node. These metrics are not available by default and must be explicitly scraped from the appropriate control plane endpoints (e.g., API server, scheduler, controller manager). Edge Delta supports Prometheus-style metric ingestion through static configurations or via integration with a Target Allocator. For a working example of how to configure scrape jobs for these endpoints, refer to the Prometheus Source configuration guide.
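For orientation, the snippet below is a minimal sketch of generic Prometheus-style scrape jobs for these endpoints, assuming a default kubeadm-style layout (API server reached through the built-in kubernetes service, scheduler on port 10259, controller manager on port 10257, etcd metrics on port 2381). It is not Edge Delta node syntax; use it only to identify the targets and ports involved, then follow the Prometheus Source configuration guide to define the equivalent scrape jobs in your pipeline.

```yaml
# Sketch only: generic Prometheus-style scrape jobs for control plane targets.
# Ports and TLS settings assume a default kubeadm-style layout; adjust for your
# cluster and translate into an Edge Delta Prometheus input node by following
# the Prometheus Source configuration guide.
scrape_configs:
  - job_name: kube-apiserver
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only the built-in "kubernetes" service endpoints in the default namespace.
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: kube-scheduler
    scheme: https
    tls_config:
      insecure_skip_verify: true    # the scheduler serves a self-signed certificate by default
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["127.0.0.1:10259"]

  - job_name: kube-controller-manager
    scheme: https
    tls_config:
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    static_configs:
      - targets: ["127.0.0.1:10257"]

  - job_name: etcd
    static_configs:
      - targets: ["127.0.0.1:2381"]  # etcd exposes plain-HTTP metrics here under kubeadm defaults
```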
By observing these metrics, you can detect when controllers are lagging in applying desired cluster state, which may impact workload availability, autoscaling behavior, or resource finalization.
Detect Kubernetes API Server Slowness
Are API server response times increasing, indicating underlying performance or availability issues?
The apiserver_request_duration_seconds metric provides insight into how long API requests take to complete. It is available from the API server’s /metrics endpoint and must be scraped explicitly. This metric is particularly useful when broken down by HTTP verb and resource type dimensions to understand which operations are degraded.
Consistently high latencies in this metric may indicate control plane stress, overloaded etcd backends, or network congestion between components.
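As a concrete starting point, the sketch below expresses this check as a generic Prometheus-style alerting rule, using the metric’s histogram buckets to compute p99 latency per verb and resource. The 1s threshold, the 10m hold duration, and the exclusion of long-lived WATCH and CONNECT requests are illustrative assumptions; tune them for your environment, or reuse the expression on its own in whatever alerting layer you run.

```yaml
# Sketch: alert when p99 API server request latency stays high.
# Threshold (1s) and hold duration (10m) are illustrative assumptions.
groups:
  - name: control-plane-apiserver
    rules:
      - alert: APIServerHighRequestLatency
        expr: |
          histogram_quantile(
            0.99,
            sum by (verb, resource, le) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for {{ $labels.verb }} {{ $labels.resource }} exceeds 1s"
```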
Detect etcd Health and Leadership Issues
Is the etcd cluster maintaining a healthy leader and responding to client requests?
The etcd_server_has_leader metric indicates whether each etcd node currently sees a leader. A value of 0 suggests a loss of quorum or split-brain condition. This metric is emitted from etcd’s /metrics endpoint and must be scraped explicitly.
Leadership instability in etcd can lead to issues persisting state, delayed control loop execution, and failed Kubernetes API operations.
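A minimal sketch of this check as a generic Prometheus-style alerting rule is shown below; the 1m hold duration is an illustrative assumption.

```yaml
# Sketch: alert when any etcd member reports no leader.
# etcd_server_has_leader is a gauge: 1 = leader visible, 0 = no leader.
groups:
  - name: control-plane-etcd
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has no leader"
```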
Detect Scheduler Latency or Failure
Is the Kubernetes scheduler functioning properly and placing pods without delay?
The scheduler_e2e_scheduling_latency_seconds and scheduler_schedule_attempts_total metrics reflect how long it takes to schedule pods and how many scheduling attempts are made. These metrics are exposed by the kube-scheduler component and must be scraped from its /metrics endpoint.
Spikes in scheduling latency or repeated failed attempts can indicate resource pressure, taint/toleration misconfigurations, or pod affinity/anti-affinity conflicts. By monitoring this data, you can ensure pods are scheduled quickly and efficiently.
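The sketch below covers both signals as generic Prometheus-style alerting rules. The latency expression assumes the scheduling latency metric is exposed as a histogram with a _bucket series (the exact metric name varies across Kubernetes versions), and the 1s threshold, windows, and the result="unschedulable" filter are illustrative assumptions.

```yaml
# Sketch: alerts on scheduling latency and on pods that cannot be scheduled.
# Thresholds and windows are illustrative assumptions; the histogram metric name
# may differ depending on your Kubernetes version.
groups:
  - name: control-plane-scheduler
    rules:
      - alert: SchedulerHighLatency
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(scheduler_e2e_scheduling_latency_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
      - alert: SchedulerUnschedulablePods
        expr: sum(rate(scheduler_schedule_attempts_total{result="unschedulable"}[5m])) > 0
        for: 15m
        labels:
          severity: warning
```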
Monitor Controller Manager Health
Are controllers (e.g., deployment, replication, node) executing reconciliation loops successfully?
The workqueue_depth and workqueue_queue_duration_seconds metrics indicate the depth and processing latency of controller queues within the kube-controller-manager. These metrics are exposed via the controller manager’s /metrics endpoint and require scraping.
A growing queue depth or increased queue duration may signal that control loops are falling behind—whether due to external API slowness (e.g., cloud provider APIs), internal contention, or misconfigurations.
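As a final sketch, the generic Prometheus-style rules below watch both signals per queue via the name label. The depth threshold (25), the 1s p99 latency threshold, and the job="kube-controller-manager" selector (which depends on how you named the scrape job) are illustrative assumptions.

```yaml
# Sketch: alerts for controller-manager work queues that are backing up.
# Depth threshold (25), p99 latency threshold (1s), and the job selector
# are illustrative assumptions; adjust them to match your scrape configuration.
groups:
  - name: control-plane-controller-manager
    rules:
      - alert: WorkqueueDepthHigh
        expr: sum by (name) (workqueue_depth{job="kube-controller-manager"}) > 25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Workqueue {{ $labels.name }} depth is persistently high"
      - alert: WorkqueueSlowProcessing
        expr: |
          histogram_quantile(
            0.99,
            sum by (name, le) (
              rate(workqueue_queue_duration_seconds_bucket{job="kube-controller-manager"}[5m])
            )
          ) > 1
        for: 15m
        labels:
          severity: warning
```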