一、k8s组件简介

Kubernetes（简称K8s）是一个开源的容器编排平台，它可以自动化应用程序的部署、扩展和管理。K8s由一组组件组成，这些组件共同协作来提供平台的完整功能。

下面是一些常见的K8s组件及其功能：

kube-apiserver: 是Kubernetes API的前端，接收REST请求并将其转换为内部数据结构操作。
kube-controller-manager: 负责维护集群的状态，例如Pod数量、节点健康状况等。
kube-scheduler: 负责将新的Pod分配给集群中的节点。
kubelet: 运行在每个节点上，负责启动Pod并与kube-apiserver通信以确保Pod状态正确。
kube-proxy: 负责在集群内部的网络中实现服务发现和负载均衡。

二、k8s组件监控指标梳理

1. kube-apiserver

# HELP apiserver_request_duration_seconds [STABLE] Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
# TYPE apiserver_request_duration_seconds histogram
apiserver响应的时间分布，按照url 和 verb 分类
一般按照instance和verb+时间 汇聚
 
# HELP apiserver_request_total [STABLE] Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
# TYPE apiserver_request_total counter
apiserver的请求总数，按照verb、 version、 group、resource、scope、component、 http返回码分类统计
 
# HELP apiserver_current_inflight_requests [STABLE] Maximal number of currently used inflight request limit of this apiserver per request kind in last second.
# TYPE apiserver_current_inflight_requests gauge
最大并发请求数, 按mutating(非get list watch的请求)和readOnly(get list watch)分别限制
超过max-requests-inflight(默认值400)和max-mutating-requests-inflight(默认200)的请求会被限流
apiserver变更时要注意观察，也是反馈集群容量的一个重要指标
 
# HELP apiserver_response_sizes [STABLE] Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
# TYPE apiserver_response_sizes histogram
apiserver 响应大小，单位byte, 按照verb、 version、 group、resource、scope、component分类统计
 
# HELP watch_cache_capacity [ALPHA] Total capacity of watch cache broken by resource type.
# TYPE watch_cache_capacity gauge
按照资源类型统计的watch缓存大小
 
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
每秒钟用户态和系统态cpu消耗时间, 计算apiserver进程的cpu的使用率
 
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
apiserver的内存使用量(单位:Byte)
 
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
apiserver中包含的controller的工作队列，已处理的任务总数
 
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
apiserver中包含的controller的工作队列深度，表示当前队列中要处理的任务的数量，数值越小越好 
例如APIServiceRegistrationController admission_quota_controller

apiserver_request_count：API server 接收到的请求总数；
apiserver_request_duration_seconds：API server 请求的持续时间，按出现次数分组输出；
apiserver_request_latencies：API server 请求的延迟时间（秒）的直方图；
apiserver_response_sizes：API server 响应数据的大小（字节）的直方图；
apiserver_storage_data_key_value_size：存储每个 Etcd 键值对的平均大小和数量的直方图。

2. kube-controller-manager

# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
请求apiserver的耗时分布，按照url+verb统计
 
# HELP cronjob_controller_cronjob_job_creation_skew_duration_seconds [ALPHA] Time between when a cronjob is scheduled to be run, and when the corresponding job is created
# TYPE cronjob_controller_cronjob_job_creation_skew_duration_seconds histogram
cronjob 创建到运行的时间分布
 
# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. 'name' is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
控制器的选举状态，0表示backup， 1表示master 
 
# HELP node_collector_zone_health [ALPHA] Gauge measuring percentage of healthy nodes per zone.
# TYPE node_collector_zone_health gauge
每个zone的健康node占比
 
# HELP node_collector_zone_size [ALPHA] Gauge measuring number of registered Nodes per zones.
# TYPE node_collector_zone_size gauge
每个zone的node数
 
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
cpu使用量（也可以理解为cpu使用率）
 
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
控制器打开的fd数
 
# HELP pv_collector_bound_pv_count [ALPHA] Gauge measuring number of persistent volume currently bound
# TYPE pv_collector_bound_pv_count gauge
当前绑定的pv数量
 
# HELP pv_collector_unbound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently unbound
# TYPE pv_collector_unbound_pvc_count gauge
当前没有绑定的pvc数量 
 
 
# HELP pv_collector_bound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently bound
# TYPE pv_collector_bound_pvc_count gauge
当前绑定的pvc数量
 
# HELP pv_collector_total_pv_count [ALPHA] Gauge measuring total number of persistent volumes
# TYPE pv_collector_total_pv_count gauge
pv总数量
 
 
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
各个controller已接受的任务总数
与apiserver的workqueue_adds_total指标类似
 
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
各个controller队列深度，表示一个controller中的任务的数量
与apiserver的workqueue_depth类似，这个是指各个controller中队列的深度，数值越小越好
 
# HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
# TYPE workqueue_queue_duration_seconds histogram
任务在队列中的等待耗时,按照控制器分别统计
 
# HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram
任务出队到被处理完成的时间，按照控制分别统计
 
# HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter
任务进入队列重试的次数
 
# HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
正在处理的任务中，最长耗时任务的处理时间
 
# HELP endpoint_slice_controller_syncs [ALPHA] Number of EndpointSlice syncs
# TYPE endpoint_slice_controller_syncs counter
endpoint_slice 同步的数量(1.20以上)
 
# HELP get_token_fail_count [ALPHA] Counter of failed Token() requests to the alternate token source
# TYPE get_token_fail_count counter
获取token失败的次数
 
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
controller gc的cpu使用率

kube_controller_manager_out_of_sync_total：控制器 manger 检测到的没有同步数据的对象的总数；
kube_controller_manager_event_total：事件处理计数器；