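Metrics Collection for Tekton Components

Overview

Tekton components expose Prometheus-compatible metrics over HTTP endpoints. Once ServiceMonitor resources are deployed, Prometheus (or VictoriaMetrics) can automatically discover and scrape these metrics.

Namespace note: this document assumes tekton-pipelines as the namespace for the control-plane components (Pipelines, Triggers, Results, Chains). The main exception is EventListener services, which run in the application namespace where the EventListener was created.

If your deployment uses a different namespace, update both the commands below and the namespaceSelector field of the ServiceMonitor resources.

This document covers metrics for the following Tekton components:

  • Tekton Pipelines - PipelineRun / TaskRun execution metrics
  • Tekton Triggers - metrics for EventListener, TriggerBinding, and related resources
  • Tekton Results - run deletion and storage metrics
  • Tekton Chains - signing and provenance metrics
  • Controller Framework - infrastructure metrics shared by all controllers

It also covers:

  • How to configure metrics behavior through config-observability
  • How to deploy ServiceMonitor resources for scraping
  • How to verify that metrics collection works

Prerequisites

  • The Tekton control-plane components are installed and running (at least the ones you plan to scrape: Pipelines, Triggers, Results, and/or Chains).
  • kubectl is configured for the target cluster, and your account is allowed to create ServiceMonitor resources in the monitoring namespace.
  • A monitoring stack (Prometheus or a compatible VictoriaMetrics) is deployed and can discover and scrape ServiceMonitor resources (or the equivalent scrape-discovery object used by your platform).
  • Your Prometheus/VictoriaMetrics instance is configured to discover the ServiceMonitor objects you create (its namespace and label selectors match them).
  • Network policies and firewalls allow the scraper Pods to reach the Tekton metrics ports (9090 for most control-plane services, 9000 for the Triggers controller and the EventListener sink).
  • For EventListener sink metrics, an EventListener must exist in the target namespace and expose the http-metrics port.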
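Tekton Pipelines

The Tekton Pipelines component consists of several sub-services, each exposing metrics on port 9090:

| Service | Description | Metrics port |
| --- | --- | --- |
| tekton-pipelines-controller | Main reconciler for PipelineRun / TaskRun | 9090 |
| tekton-pipelines-webhook | Admission webhook | 9090 |
| tekton-events-controller | CloudEvents controller | 9090 |
| tekton-pipelines-remote-resolvers | Remote resource resolution | 9090 |

Pipeline controller metrics use the prefix tekton_pipelines_controller_.

PipelineRun metrics

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| pipelinerun_duration_seconds | Histogram / LastValue | PipelineRun execution time (seconds) | status, namespace, pipeline, pipelinerun, reason* |
| pipelinerun_total | Counter | Total number of completed PipelineRuns | status |
| running_pipelineruns | LastValue (Gauge) | Number of PipelineRuns currently running | Controlled by metrics.running-pipelinerun.level (see below) |
| running_pipelineruns_waiting_on_pipeline_resolution | LastValue (Gauge) | Number of PipelineRuns waiting for Pipeline reference resolution | - |
| running_pipelineruns_waiting_on_task_resolution | LastValue (Gauge) | Number of PipelineRuns waiting for Task reference resolution | - |

* Labels marked with * are optional and depend on the config-observability configuration.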

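running_pipelineruns label levels

The labels of the running_pipelineruns metric are controlled by metrics.running-pipelinerun.level:

| Level | Labels |
| --- | --- |
| "" (default, cluster level) | no labels |
| "namespace" | namespace |
| "pipeline" | namespace, pipeline |
| "pipelinerun" | namespace, pipeline, pipelinerun |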

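The level-to-label mapping in the table above can be written down as a small lookup, which is handy when checking that a dashboard query groups by labels that actually exist. This is only a sketch: the helper name running_pipelinerun_labels is hypothetical and not part of Tekton.

```python
# Mapping copied from the table above; the helper itself is illustrative only.
RUNNING_PIPELINERUN_LABELS = {
    "": [],
    "namespace": ["namespace"],
    "pipeline": ["namespace", "pipeline"],
    "pipelinerun": ["namespace", "pipeline", "pipelinerun"],
}


def running_pipelinerun_labels(level: str) -> list[str]:
    """Return the labels expected on running_pipelineruns for a given level."""
    if level not in RUNNING_PIPELINERUN_LABELS:
        raise ValueError(f"unsupported metrics.running-pipelinerun.level: {level!r}")
    return list(RUNNING_PIPELINERUN_LABELS[level])
```

For example, with the default level "" the metric carries no labels, so a query such as sum by (namespace) (...) would collapse to a single unlabeled series.
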
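Status label values

Status label values for PipelineRun metrics:

  • success - the PipelineRun completed successfully
  • failed - the PipelineRun failed
  • cancelled - the PipelineRun was cancelled

Status label values for TaskRun metrics:

  • success - the TaskRun completed successfully
  • failed - the TaskRun failed

TaskRun metrics

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| taskrun_duration_seconds | Histogram / LastValue | Execution time of standalone TaskRuns (seconds) | status, namespace, task, taskrun, reason* |
| pipelinerun_taskrun_duration_seconds | Histogram / LastValue | Execution time of TaskRuns that belong to a PipelineRun | status, namespace, task, taskrun, pipeline, pipelinerun, reason* |
| taskrun_total | Counter | Total number of completed TaskRuns | status |
| running_taskruns | LastValue (Gauge) | Number of TaskRuns currently running | - |
| running_taskruns_waiting_on_task_resolution_count | LastValue (Gauge) | Number of TaskRuns waiting for Task reference resolution | - |
| running_taskruns_throttled_by_quota | LastValue (Gauge) | Number of TaskRuns throttled by ResourceQuota | namespace* |
| running_taskruns_throttled_by_node | LastValue (Gauge) | Number of TaskRuns throttled by node-level resource limits | namespace* |
| taskruns_pod_latency_milliseconds | LastValue | TaskRun Pod scheduling latency (milliseconds) | namespace, pod, task, taskrun |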

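config-observability configuration

The config-observability ConfigMap in the tekton-pipelines namespace controls the metrics behavior of the Pipeline controller. This ConfigMap is managed by the Tekton Operator and should be configured through the spec.pipeline.options.configMaps field of the TektonConfig resource. For details, see "Adjusting optional configuration items for subcomponents".

Hot-reload behavior: config-observability is watched at runtime. Most key changes (such as metrics.*) take effect without restarting Pods. Dashboards and queries may take one or two scrape intervals to reflect the change. A restart is needed only when Pod spec settings change (for example, CONFIG_OBSERVABILITY_NAME in the Deployment).

Example configuration through TektonConfig: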

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    options:
      disabled: false
      configMaps:
        config-observability:
          data:
            metrics.backend-destination: prometheus

            # Aggregation level for PipelineRun metrics.
            # Allowed values: "pipelinerun" | "pipeline" (default) | "namespace"
            #   - "pipelinerun": includes the pipeline + pipelinerun labels; duration uses LastValue
            #   - "pipeline": includes only the pipeline label
            #   - "namespace": no pipeline/pipelinerun labels
            metrics.pipelinerun.level: "pipeline"

            # Aggregation level for TaskRun metrics.
            # Allowed values: "taskrun" | "task" (default) | "namespace"
            #   - "taskrun": includes the task + taskrun labels; duration uses LastValue
            #   - "task": includes only the task label
            #   - "namespace": no task/taskrun labels
            metrics.taskrun.level: "task"

            # Duration metric type for PipelineRun / TaskRun.
            # Allowed values: "histogram" (default) | "lastvalue"
            # Note: when pipelinerun.level is "pipelinerun" or taskrun.level is "taskrun",
            #       the duration type is forced to "lastvalue" regardless of this setting.
            metrics.pipelinerun.duration-type: "histogram"
            metrics.taskrun.duration-type: "histogram"

            # Aggregation level for the running PipelineRun metric.
            # Allowed values: "pipelinerun" | "pipeline" | "namespace" | "" (default, cluster level)
            metrics.running-pipelinerun.level: ""

            # Include the reason label in the duration metrics (pipelinerun_duration_seconds,
            # taskrun_duration_seconds, pipelinerun_taskrun_duration_seconds).
            # Allowed values: "true" | "false" (default)
            # Warning: enabling this increases label cardinality.
            # Note: despite the key name, this setting does not affect the count metrics
            # (pipelinerun_total / taskrun_total); it only affects the duration metrics.
            metrics.count.enable-reason: "false"

            # Include the namespace label in the throttled TaskRun metrics.
            # Allowed values: "true" | "false" (default)
            metrics.taskrun.throttle.enable-namespace: "false"
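
The interaction between the level and duration-type keys described in the comments above can be sketched as a small function. It only restates the documented rule (per-run levels force lastvalue); the function name effective_duration_type is hypothetical and this is not the controller's actual code.

```python
def effective_duration_type(level: str, configured: str = "histogram") -> str:
    """Return the duration aggregation that takes effect for a level setting."""
    if configured not in ("histogram", "lastvalue"):
        raise ValueError(f"unsupported duration-type: {configured!r}")
    # Per-run levels force lastvalue regardless of the configured type.
    if level in ("pipelinerun", "taskrun"):
        return "lastvalue"
    return configured
```

With level "pipeline" and duration-type "histogram" the result stays "histogram"; switching the level to "pipelinerun" yields "lastvalue", so *_bucket queries stop returning data for new observations.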

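Histogram bucket boundaries

When the duration type is histogram, the following bucket boundaries are used (in seconds):

10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400

These correspond to 10 s, 30 s, 1 min, 5 min, 15 min, 30 min, 1 h, 1.5 h, 3 h, 6 h, 12 h, and 24 h.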
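
To illustrate how run durations land in these buckets, the sketch below computes Prometheus-style cumulative bucket counts (each le bucket counts every observation less than or equal to its bound, and +Inf counts everything). The function name and the sample durations are made up for illustration.

```python
import bisect
import math

# Bucket boundaries in seconds, as listed above.
BUCKETS = [10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400]


def cumulative_bucket_counts(durations):
    """Return {upper_bound: count} with cumulative, le-inclusive counts."""
    bounds = BUCKETS + [math.inf]
    counts = {bound: 0 for bound in bounds}
    for duration in durations:
        # Index of the first bucket whose bound is >= the observation.
        first = bisect.bisect_left(bounds, duration)
        # The observation counts toward that bucket and every larger one.
        for bound in bounds[first:]:
            counts[bound] += 1
    return counts
```

A run longer than 24 hours is visible only in the +Inf bucket, which is why quantiles for very long runs cannot exceed 86400 seconds in histogram mode.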

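Recommended production configuration

For production, use aggregated levels to keep label cardinality under control:

metrics.pipelinerun.level: "pipeline"
metrics.taskrun.level: "task"
metrics.pipelinerun.duration-type: "histogram"
metrics.taskrun.duration-type: "histogram"
metrics.count.enable-reason: "false"

If you need per-run granularity while debugging, you can temporarily switch to:

metrics.pipelinerun.level: "pipelinerun"
metrics.taskrun.level: "taskrun"

Note that this significantly increases the number of time series.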
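
A rough way to reason about the cardinality impact is to count the series one duration metric produces per label combination. The sketch below assumes the standard Prometheus histogram exposition (one _bucket series per finite boundary, plus the +Inf bucket, _sum, and _count) and one series per combination in lastvalue mode. The function name and the example numbers are illustrative, and the result is an upper bound since not every label combination necessarily occurs.

```python
def estimate_duration_series(label_combinations: int,
                             duration_type: str = "histogram",
                             bucket_count: int = 12) -> int:
    """Estimate the time series produced by one duration metric."""
    if duration_type == "histogram":
        # finite buckets + the +Inf bucket + _sum + _count
        return label_combinations * (bucket_count + 3)
    if duration_type == "lastvalue":
        return label_combinations
    raise ValueError(f"unsupported duration-type: {duration_type!r}")
```

For instance, 20 pipelines with 3 status values at the "pipeline" level give at most 60 combinations, or 900 histogram series. At the "pipelinerun" level every run is its own combination, so 5000 runs mean 5000 lastvalue series that keep growing with each new run.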


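Tekton Triggers

The Tekton Triggers component exposes two kinds of metrics from different processes.

Controller metrics (port 9000)

The Triggers controller reports resource count metrics every 60 seconds.

| Service | Metrics port |
| --- | --- |
| tekton-triggers-controller | 9000 |

Triggers controller metrics use the prefix controller_.

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| eventlistener_count | LastValue (Gauge) | Number of EventListener resources | - |
| triggerbinding_count | LastValue (Gauge) | Number of TriggerBinding resources | - |
| clustertriggerbinding_count | LastValue (Gauge) | Number of ClusterTriggerBinding resources | - |
| triggertemplate_count | LastValue (Gauge) | Number of TriggerTemplate resources | - |
| clusterinterceptor_count | LastValue (Gauge) | Number of ClusterInterceptor resources | - |

EventListener sink metrics

Each EventListener Pod exposes additional HTTP and event-processing metrics. These metrics come from the EventListener sink process, not the controller. Their Prometheus prefix is eventlistener_.

| Metric name (Prometheus) | Type | Description | Labels |
| --- | --- | --- | --- |
| eventlistener_http_duration_seconds | Histogram | Duration of EventListener HTTP requests | - |
| eventlistener_event_received_count | Counter | Total number of events received by the sink | status |
| eventlistener_triggered_resources | Counter | Total number of resources created by triggers | kind |

  • Bucket boundaries for eventlistener_http_duration_seconds: 0.001, 0.01, 0.1, 1, 10 (seconds)
  • Status values for eventlistener_event_received_count: succeeded, failed
  • The kind value of eventlistener_triggered_resources is the Kubernetes resource Kind of the created object (for example PipelineRun or TaskRun)

These sink metrics are exposed per EventListener Pod and do not come from the central controller. If the EventListener Pods expose a metrics port, a separate ServiceMonitor or PodMonitor may be needed to scrape them.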


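Tekton Results

Tekton Results has two sub-services that expose metrics.

| Service | Description | Metrics port |
| --- | --- | --- |
| tekton-results-watcher | Watches and cleans up PipelineRun/TaskRun resources | 9090 |
| tekton-results-api | gRPC/REST API server | 9090 |

Watcher metrics

Watcher metrics use the prefix watcher_.

Deletion metrics

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| pipelinerun_delete_count | Counter | Total number of deleted PipelineRuns | status, namespace |
| pipelinerun_delete_duration_seconds | Histogram / LastValue | Time from PipelineRun completion to deletion | status, namespace, pipeline* |
| taskrun_delete_count | Counter | Total number of deleted TaskRuns | status, namespace |
| taskrun_delete_duration_seconds | Histogram / LastValue | Time from TaskRun completion to deletion | status, namespace, pipeline, task |

* Optional labels depend on the Results Watcher's config-observability configuration.

Note: pipelinerun_delete_count, pipelinerun_delete_duration_seconds, taskrun_delete_count, and taskrun_delete_duration_seconds are recorded only when the Watcher actually deletes runs. Unless the --completed_run_grace_period flag on the tekton-results-watcher Deployment is set to a non-zero value, these metrics stay empty (no data points). The flag defaults to 0, which disables automatic deletion. A positive value (such as 10m) enables deletion after the grace period; a negative value means runs are deleted immediately after they are archived.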
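
The three cases of --completed_run_grace_period described in the note above can be summarized as follows. This is a sketch of the documented semantics only; the function name and the returned strings are hypothetical labels, not Watcher output.

```python
def deletion_behaviour(grace_period_seconds: float) -> str:
    """Classify the Watcher's deletion mode from the grace period value."""
    if grace_period_seconds == 0:
        # Default: automatic deletion is disabled, delete metrics stay empty.
        return "disabled"
    if grace_period_seconds > 0:
        return "delete-after-grace-period"
    # Negative values: delete as soon as the run has been archived.
    return "delete-immediately-after-archiving"
```

Only the two non-"disabled" modes can ever produce data points for the deletion metrics.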

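Results Watcher status label values:

  • success - the run completed successfully
  • failed - the run failed
  • cancelled - the run was cancelled

Shared metrics

These metrics are registered jointly by the PipelineRun and TaskRun reconcilers in the Watcher and track storage-related events.

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| runs_not_stored_count | Counter | Number of runs that had not been stored in Results when they were deleted | kind, namespace |
| run_storage_latency_seconds | Histogram | Time from run completion to successful storage | kind, namespace |

The kind label identifies the run type (PipelineRun / TaskRun in some metric series, pipelinerun / taskrun in others).

Note: runs_not_stored_count is recorded only when a run is deleted externally (for example with kubectl delete) while the Watcher holds a finalizer to coordinate archiving. The metric stays empty unless all of the following conditions are met:

  1. The --logs_api flag is false (log storage disabled). When logs are enabled, the Watcher skips finalizer-based coordination.
  2. The --disable_crd_update flag is false (annotation updates enabled).
  3. The --store_deadline flag is set to a non-zero duration. This is the maximum time the Watcher waits for archiving to finish before it gives up and allows deletion.
  4. The run is deleted externally before it has been archived successfully (no results.tekton.dev/stored=true annotation), and store_deadline has elapsed.

Under normal conditions (runs are archived before deletion, or deletion is triggered by the Watcher itself through --completed_run_grace_period), the counter stays at zero. A non-zero value indicates possible data loss: a run was deleted before its state was saved to the Results API.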
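
The four conditions above combine into a single predicate. The sketch below encodes them as stated in this section; the WatcherFlags dataclass and the function name are hypothetical and do not mirror the Watcher's real types.

```python
from dataclasses import dataclass


@dataclass
class WatcherFlags:
    logs_api: bool                 # --logs_api
    disable_crd_update: bool       # --disable_crd_update
    store_deadline_seconds: float  # --store_deadline, expressed in seconds


def counts_as_not_stored(flags: WatcherFlags,
                         deleted_externally: bool,
                         stored_annotation: bool,
                         deadline_elapsed: bool) -> bool:
    """Return True when a deletion would increment runs_not_stored_count."""
    # Conditions 1-3: finalizer-based coordination must be active.
    finalizer_active = (
        not flags.logs_api
        and not flags.disable_crd_update
        and flags.store_deadline_seconds != 0
    )
    # Condition 4: external deletion of an unarchived run after the deadline.
    return (finalizer_active and deleted_externally
            and not stored_annotation and deadline_elapsed)
```

If any single input flips (for example logs_api is true, or the run already carries results.tekton.dev/stored=true), the predicate is false and the counter is not expected to move.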

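Quick reproduction (test environments): if you do not see this metric, it usually means the trigger conditions were not met, not that the metric is missing.

  1. Configure the Results Watcher through TektonConfig so that logs_api=false, disable_crd_update=false, and store_deadline is non-zero (for example 30s).
  2. Temporarily set the Results API replica count to 0 (spec.result.options.deployments.tekton-results-api.spec.replicas: 0) so that runs cannot be archived.
  3. Create a TaskRun or PipelineRun and wait for it to complete.
  4. After store_deadline has expired, delete the run externally (kubectl delete ...).
  5. Check watcher_runs_not_stored_count (the component-prefixed name) in the Watcher's /metrics output or in Prometheus; it should have increased.
  6. Restore the original TektonConfig (re-enable the Results API replicas and the normal logs_api setting).

Histogram bucket boundaries for run_storage_latency_seconds (seconds):

0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300, 600, 1800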

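Watcher config-observability

The Results Watcher has its own config-observability ConfigMap (named through the CONFIG_OBSERVABILITY_NAME environment variable, usually tekton-results-config-observability). This ConfigMap is managed by the Tekton Operator and should be configured through the spec.result.options.configMaps field of the TektonConfig resource. For details, see "Adjusting optional configuration items for subcomponents".

Hot-reload behavior: the Results Watcher also watches this ConfigMap, and most key changes do not require a Pod restart. A restart is needed only when Deployment-level settings (such as environment variables or arguments) change.

The following keys are supported:

| Key | Default | Allowed values | Description |
| --- | --- | --- | --- |
| metrics.pipelinerun.level | pipeline | pipeline, namespace | Controls the pipeline label on the deletion duration metrics |
| metrics.taskrun.level | task | task, namespace | Controls the task label on the deletion duration metrics |
| metrics.pipelinerun.duration-type | histogram | histogram, lastvalue | Aggregation type of the deletion duration metrics for both PipelineRun and TaskRun |
| metrics.taskrun.duration-type | histogram | histogram, lastvalue | Parsed but currently unused; metrics.pipelinerun.duration-type controls both |

Note: unlike Tekton Pipelines, the Results Watcher does not support the per-run granularity levels pipelinerun / taskrun, and it has no metrics.count.enable-reason, metrics.running-pipelinerun.level, or metrics.taskrun.throttle.enable-namespace keys.

Known upstream issue: taskrun_delete_duration_seconds uses metrics.pipelinerun.duration-type (rather than metrics.taskrun.duration-type) to decide its aggregation type, which looks like a copy-paste error in the source code.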
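
The practical consequence of this known issue can be sketched as a lookup: both deletion duration metrics follow the PipelineRun key. The function below only restates the behavior described in this section (assuming the documented default of histogram); its name is hypothetical and it is not the Watcher's code.

```python
def results_delete_duration_types(config: dict[str, str]) -> dict[str, str]:
    """Return the aggregation type that takes effect for each deletion metric."""
    pipelinerun_type = config.get("metrics.pipelinerun.duration-type", "histogram")
    # metrics.taskrun.duration-type is parsed but, per the known issue above,
    # does not influence taskrun_delete_duration_seconds.
    return {
        "pipelinerun_delete_duration_seconds": pipelinerun_type,
        "taskrun_delete_duration_seconds": pipelinerun_type,
    }
```

So to switch taskrun_delete_duration_seconds to lastvalue, the key to change is metrics.pipelinerun.duration-type.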

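API Server metrics

The API server exposes the standard gRPC Prometheus metrics on port 9090 through the go-grpc-prometheus library, including:

  • grpc_server_handled_total - total number of RPCs completed by the server
  • grpc_server_started_total - total number of RPCs started on the server
  • grpc_server_msg_received_total / grpc_server_msg_sent_total - message counts
  • grpc_server_handling_seconds (when PROMETHEUS_HISTOGRAM is enabled) - RPC handling duration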

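Tekton Chains

Tekton Chains is a security component that generates, signs, and stores provenance for artifacts built with Tekton Pipelines. It observes completed TaskRuns and PipelineRuns, then creates attestations and signatures.

| Service | Description | Metrics port |
| --- | --- | --- |
| tekton-chains-metrics | Chains watcher/controller | 9090 (http-metrics) |

Chains controller metrics use the prefix watcher_ (the same as the Results Watcher, but the custom metric names differ, so there is no conflict).

Chains metrics

All Chains metrics are counters without labels.

| Metric name (Prometheus) | Type | Description |
| --- | --- | --- |
| watcher_taskrun_sign_created_total | Counter | Total number of TaskRun signing messages |
| watcher_taskrun_payload_stored_total | Counter | Total number of payloads stored for TaskRuns |
| watcher_taskrun_marked_signed_total | Counter | Total number of TaskRuns marked as signed |
| watcher_pipelinerun_sign_created_total | Counter | Total number of PipelineRun signing messages |
| watcher_pipelinerun_payload_stored_total | Counter | Total number of payloads stored for PipelineRuns |
| watcher_pipelinerun_marked_signed_total | Counter | Total number of PipelineRuns marked as signed |

Note: the official Tekton Chains documentation also mentions *_signing_failures_total counters for TaskRun and PipelineRun, but this metric was not found in the current upstream source. Verify against your deployed version.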


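Controller Framework metrics

All Tekton controllers automatically expose the following infrastructure metrics. They use the same prefix as the component's custom metrics (for example tekton_pipelines_controller_, controller_, watcher_).

| Metric name (without prefix) | Type | Description |
| --- | --- | --- |
| client_latency | Histogram | Kubernetes API client request latency (seconds) |
| client_results | Counter | Kubernetes API request count (by status code) |
| workqueue_depth | Gauge | Current workqueue depth |
| workqueue_adds_total | Counter | Total number of workqueue adds |
| workqueue_queue_latency_seconds | Histogram | Time items wait in the workqueue |
| workqueue_work_duration_seconds | Histogram | Time spent processing workqueue items |
| workqueue_retries_total | Counter | Total number of workqueue retries |
| workqueue_unfinished_work_seconds | Histogram | Duration of unfinished workqueue items |
| workqueue_longest_running_processor_seconds | Histogram | Duration of the longest-running workqueue processor |
| reconcile_count | Counter | Total number of reconciler invocations (labels: reconciler, success, namespace_name) |
| reconcile_latency | Histogram | Latency of reconciler invocations (labels: reconciler, success, namespace_name) |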

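Deploying ServiceMonitors

To let Prometheus scrape the Tekton components, deploy ServiceMonitor resources.

For prerequisites, see the Prerequisites section above.

Follow the guidance that matches your monitoring stack:

  • With Prometheus (Prometheus Operator), labels such as metadata.labels.prometheus: kube-prometheus must match spec.serviceMonitorSelector of the Prometheus CR; otherwise the ServiceMonitor is not scraped.
  • With VictoriaMetrics, a label such as prometheus: kube-prometheus is usually not required; create a ServiceMonitor or VMServiceScrape according to your monitoring configuration.

With Prometheus, the following commands locate and verify the selector:

# 1) Find the Prometheus CR (resource type: monitoring.coreos.com/v1, Kind=Prometheus)
$ kubectl get prometheus -A

# 2) Inspect the ServiceMonitor selector of the target Prometheus instance
$ kubectl get prometheus -n <prometheus-namespace> <prometheus-name> -o yaml | yq '.spec.serviceMonitorSelector'

If the cluster has no Prometheus CR, monitoring is usually managed by the platform (for example VictoriaMetrics) or implemented some other way. In that case the prometheus: kube-prometheus label is usually not needed; follow your platform's scrape rules.

For more information, see "Integrating external metrics".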
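
To make the selector matching concrete, the sketch below evaluates a Kubernetes-style label selector (matchLabels plus matchExpressions with the In, NotIn, Exists, and DoesNotExist operators) against a set of labels. It is a simplified illustration of standard label-selector semantics, not the Prometheus Operator's implementation; in this sketch an empty selector matches every label set.

```python
def selector_matches(selector: dict, labels: dict[str, str]) -> bool:
    """Return True when the labels satisfy every requirement of the selector."""
    for key, value in (selector.get("matchLabels") or {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions") or []:
        key, operator = expr["key"], expr["operator"]
        values = expr.get("values") or []
        if operator == "In":
            ok = key in labels and labels[key] in values
        elif operator == "NotIn":
            # A missing key also satisfies NotIn.
            ok = key not in labels or labels[key] not in values
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {operator!r}")
        if not ok:
            return False
    return True
```

For example, a selector of {"matchLabels": {"prometheus": "kube-prometheus"}} does not match a ServiceMonitor labeled only with app.kubernetes.io/name: tekton-pipelines, which is the situation where the commented-out prometheus label in the manifests below needs to be enabled.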

Pipeline ServiceMonitor

Pipeline ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-pipelines
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-pipelines
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor matches the Pipeline services labeled app.kubernetes.io/part-of: tekton-pipelines (including remote-resolvers) and scrapes them in the tekton-pipelines namespace.

Triggers ServiceMonitor

Triggers ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-triggers-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-triggers
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-triggers
      app.kubernetes.io/component: controller
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor collects only the Triggers controller metrics (controller_*); it does not include the EventListener sink metrics.

EventListener Sink ServiceMonitor

EventListener Sink ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-eventlistener-sink-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-eventlistener-sink
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: eventlistener
      operator: Exists
    - key: app.kubernetes.io/managed-by
      operator: In
      values:
      - EventListener
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    any: true

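EventListener services usually run in application namespaces, so this example uses namespaceSelector.any: true to scrape across namespaces. For a tighter scope, use matchNames instead and list the allowed namespaces explicitly.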

Results ServiceMonitor

Results services carry both the app.kubernetes.io/part-of: tekton-results label and an app.kubernetes.io/name label. To target exactly the API and the Watcher (and exclude Postgres), this example matches on app.kubernetes.io/name.

Results ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-results-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-results
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: app.kubernetes.io/name
      operator: In
      values:
      - tekton-results-api
      - tekton-results-watcher
  endpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  - port: metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

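The Results API server uses the port name prometheus (9090) and the Watcher uses the port name metrics (9090). Each service exposes only one of these port names, so only the matching endpoint is scraped.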

Chains ServiceMonitor

Chains ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-chains-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-chains
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-chains
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

Verification

After deploying the ServiceMonitor resources, verify that Prometheus is scraping the targets.

Checking the metrics endpoints directly

# Pipeline controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-pipelines-controller 9090:9090
$ curl -s http://localhost:9090/metrics | grep tekton_pipelines_controller_

# HELP tekton_pipelines_controller_client_latency How long Kubernetes API requests take
# TYPE tekton_pipelines_controller_client_latency histogram
tekton_pipelines_controller_client_latency_bucket{name="",le="1e-05"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.0001"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.001"} 0

# Triggers controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-triggers-controller 9000:9000
$ curl -s http://localhost:9000/metrics | grep controller_

# HELP controller_client_latency How long Kubernetes API requests take
# TYPE controller_client_latency histogram
controller_client_latency_bucket{name="",le="1e-05"} 0
controller_client_latency_bucket{name="",le="0.0001"} 1
controller_client_latency_bucket{name="",le="0.001"} 2

# EventListener sink metrics (replace the namespace and service name)
$ kubectl port-forward -n <eventlistener-namespace> svc/<eventlistener-service> 9000:9000
$ curl -s http://localhost:9000/metrics | grep eventlistener_

# HELP eventlistener_client_latency How long Kubernetes API requests take
# TYPE eventlistener_client_latency histogram
eventlistener_client_latency_bucket{name="",le="1e-05"} 0
eventlistener_client_latency_bucket{name="",le="0.0001"} 0
eventlistener_client_latency_bucket{name="",le="0.001"} 0

# HELP eventlistener_triggered_resources Count of the number of triggered eventlistener resources
# TYPE eventlistener_triggered_resources counter
eventlistener_triggered_resources{kind="PipelineRun"} 10

# Results watcher
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-watcher 9091:9090
$ curl -s http://localhost:9091/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

# Results API
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-api-service 9092:9090
$ curl -s http://localhost:9092/metrics | grep grpc_server_

# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Aborted",grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0

# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 337606
grpc_server_started_total{grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 10301
grpc_server_started_total{grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 832

# Chains controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-chains-metrics 9093:9090
$ curl -s http://localhost:9093/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

EventListener sink metrics such as eventlistener_event_received_count and eventlistener_http_duration_seconds are request-driven. Send at least one request to the EventListener before verifying them.
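
When checking /metrics output in a script rather than by eye, the text can be parsed into (name, labels, value) tuples. The sketch below handles the simple "name{labels} value" lines shown in the sample output above; it ignores comment lines and does not handle optional timestamps or every escaping rule of the exposition format. The SAMPLE text reuses lines from this document, and parse_samples is a hypothetical helper.

```python
import re

SAMPLE = '''\
# HELP eventlistener_triggered_resources Count of the number of triggered eventlistener resources
# TYPE eventlistener_triggered_resources counter
eventlistener_triggered_resources{kind="PipelineRun"} 10
eventlistener_client_latency_bucket{name="",le="1e-05"} 0
'''

# metric name, optional {label block}, then the sample value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # skip lines this sketch does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Filtering the result for names starting with eventlistener_ gives a quick check that the sink metrics are present after a test request has been sent.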

Checking Prometheus targets

# Verify that the ServiceMonitor resources exist
$ kubectl get servicemonitor -n tekton-pipelines

NAME                                AGE
tekton-chains-metrics               10m
tekton-eventlistener-sink-metrics   10m
tekton-pipelines-metrics            10m
tekton-results-metrics              10m
tekton-triggers-metrics             10m

# Check the Prometheus targets (through the Prometheus UI or API)
# Look for targets whose job label matches the ServiceMonitor name

Example PromQL queries

# Cumulative PipelineRun success rate (avoids misreading windows with no completions)
100 * sum(tekton_pipelines_controller_pipelinerun_total{status="success"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)

# PipelineRuns completed in the last 5 minutes (throughput)
round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))

# PipelineRun duration P95 (histogram mode)
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)

# TaskRun duration P95 (histogram mode, covering standalone TaskRuns and TaskRuns inside Pipelines)
histogram_quantile(0.95,
  (
    sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m]))
    +
    sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m]))
  )
)

# PipelineRun duration (lastvalue mode)
avg_over_time(tekton_pipelines_controller_pipelinerun_duration_seconds[5m])

# PipelineRuns currently running (a single series avoids duplicate legend entries)
max(tekton_pipelines_controller_running_pipelineruns)

# TaskRuns throttled by resource quota
max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)

# Trigger resource counts
controller_eventlistener_count
controller_triggertemplate_count

# Chains signing activity
watcher_taskrun_sign_created_total
watcher_pipelinerun_sign_created_total
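MonitorDashboard example

The following MonitorDashboard resource provides a ready-to-use monitoring dashboard for the Tekton components. Deploy it to the cpaas-system namespace, under the tekton folder.

Important: every panel must include a unique integer id, datasource: prometheus, and an empty transformations: []. Every target must include datasource: prometheus and a refId. The duration P50/P95 panels in this document use *_bucket queries and therefore require metrics.*.duration-type=histogram; if you use lastvalue, replace them with a LastValue-style expression such as avg_over_time(...).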

Tekton Pipeline 仪表盘

Tekton Pipeline Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-pipeline
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Pipeline 概览
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_pipelinerun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_taskrun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun success rate (cumulative)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 16, y: 0 }
        targets:
          - datasource: prometheus
            expr: "100 * sum(tekton_pipelines_controller_pipelinerun_total{status=\"success\"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)"
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: thresholds }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds:
              mode: absolute
              steps:
                - { color: red, value: null }
                - { color: orange, value: 80 }
                - { color: green, value: 95 }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 12
        title: PipelineRuns completed in the last 5 minutes
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: "round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))"
            legendFormat: completed
            refId: A
        fieldConfig:
          defaults:
            unit: short
            decimals: 0
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: Running PipelineRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Running TaskRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: Throttled TaskRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)
            legendFormat: by quota
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_node)
            legendFormat: by node
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: PipelineRun duration P50 / P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: TaskRun duration P50 / P95 (standalone)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 13
        title: TaskRun duration P50 / P95 (within a Pipeline)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 9
        title: 工作队列深度
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_workqueue_depth)
            legendFormat: depth
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 10
        title: Reconcile 调用次数(按成功)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 24 }
        targets:
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="true"}[5m]))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="false"}[5m]))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 11
        title: 解析等待
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_pipeline_resolution)
            legendFormat: PR 等待 pipeline
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_task_resolution)
            legendFormat: PR 等待 task
            refId: B
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_waiting_on_task_resolution_count)
            legendFormat: TR 等待 task
            refId: C
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

Interpreting the Tekton Pipelines dashboard (FAQ)

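  • "PipelineRun 总数(按状态)" is a counter of completion events recorded by the controller, not the total number of PipelineRun objects. In the current implementation, user-initiated cancellations (spec.status=Cancelled) may not pass through this counting path, so the cancelled series may be missing. To verify cancellation volume, check the PipelineRun objects and events.
  • "正在运行的 PipelineRun" is a real-time snapshot (the number currently running) and can change independently of "PipelineRun 总数".
  • "最近 5 分钟完成的 PipelineRun" is a throughput metric; a value of 0 is normal during low traffic or idle periods.
  • "PipelineRun 成功率(累积)" is cumulative since the controller started, not a success rate over a 5-minute window. Short-lived failures do not immediately cause large swings.
  • "Reconcile 调用次数(按成功)" counts controller reconcile loop invocations, not PipelineRuns.
  • Status series appear only for label values that have samples in the selected time range. Without samples, neither the curve nor the legend entry is shown.
  • "TaskRun 持续时间 P50 / P95(独立)" and "TaskRun 持续时间 P50 / P95(Pipeline 内)" are shown separately to avoid unstable mixed queries. If the environment exposes only one of the two histograms, an empty panel for the other is normal.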
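
To illustrate the difference between a cumulative success rate and a windowed one, the following hypothetical panel target computes a success percentage over the last 5 minutes only. It is a sketch, not part of the dashboard above. It assumes the status label on tekton_pipelines_controller_pipelinerun_total uses the value "success"; confirm the actual label values in your environment first. The legendFormat and refId are illustrative.

```yaml
# Hypothetical target, not part of the dashboard above.
# Assumption: completed runs carry the label status="success".
targets:
  - datasource: prometheus
    # The "> 0" filter drops the sample when nothing completed in the window,
    # so an idle period renders as a gap instead of NaN.
    expr: 100 * sum(increase(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m])) / (sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])) > 0)
    legendFormat: success % (5m)
    refId: A
```

Unlike the cumulative panel, this value reacts quickly to a burst of failures, but it is also noisier when only a few runs complete per window.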

Tekton Triggers dashboard

Tekton Triggers Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-triggers
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Triggers 概览
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: EventListener 数量
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TriggerTemplate 数量
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 5, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: TriggerBinding 数量
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 10, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: ClusterTriggerBinding
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 15, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: ClusterInterceptor
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: 所有触发器资源计数(趋势)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 24, x: 0, y: 6 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: B
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: C
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: D
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: E
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

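Interpreting the Tekton Triggers dashboard (FAQ)

  • "EventListener 数量", "TriggerTemplate 数量", "TriggerBinding 数量", "ClusterTriggerBinding", and "ClusterInterceptor" are snapshots of object counts, not request volume or event-processing throughput.
  • "所有触发器资源计数(趋势)" shows the same resource counts as a combined trend. Brief deviations from the single-resource panels within a scrape interval are normal.
  • A value of 0 is normal when no Triggers resources exist; it does not indicate a scrape failure.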
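
One way to tell "no Triggers resources" apart from "no samples at all" is PromQL's absent() function, which returns 1 only when the selector matches no series. The target below is a minimal sketch of that idea; the legend text and refId are illustrative and not part of the dashboard above.

```yaml
# Hypothetical target, not part of the dashboard above.
targets:
  - datasource: prometheus
    # Returns 1 while controller_eventlistener_count has no series at all
    # (for example, the target is not being scraped); returns nothing otherwise.
    expr: absent(controller_eventlistener_count)
    legendFormat: metric missing
    refId: A
```

A flat line at 0 in the count panels together with an empty result here suggests the metric is being scraped and there are simply no EventListeners.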

Tekton Results dashboard

Tekton Results Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-results
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Results 概览
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun Reconcile 调用次数(最近 5 分钟)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun Reconcile 调用次数(最近 5 分钟)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: TaskRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Workqueue Depth (PipelineRun vs TaskRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"})
            legendFormat: pipelinerun
            refId: A
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"})
            legendFormat: taskrun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: Workqueue Adds (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 16 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: pipelinerun adds
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: taskrun adds
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: gRPC Request Rate (Results API)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: "sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m]))"
            legendFormat: requests
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: gRPC Error Percentage (Results API, excl. NotFound/Canceled)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 24 }
        targets:
          - datasource: prometheus
            expr: "100 * ((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\",grpc_code!~\"OK|NotFound|Canceled\"}[5m])) or vector(0)) / clamp_min((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m])) or vector(0)), 0.001))"
            legendFormat: error %
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: red, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

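Interpreting the Tekton Results dashboard (FAQ)

  • This dashboard version is built on the Results Watcher reconcile/workqueue metrics and the Results API gRPC metrics, so its panels stay populated in common deployments (logs_api=true, automatic deletion disabled).
  • "PipelineRun Reconcile 调用次数(最近 5 分钟)" and "TaskRun Reconcile 调用次数(最近 5 分钟)" each show the 5-minute increase for success=true and success=false.
  • "PipelineRun Reconcile Latency P95" and "TaskRun Reconcile Latency P95" are computed from the watcher reconcile latency histogram; the curves may be sparse when traffic is low.
  • "Workqueue Depth (PipelineRun vs TaskRun)" shows the current queue depth; "Workqueue Adds (last 5m)" shows the number of items enqueued in the last 5 minutes.
  • "gRPC Error Percentage (Results API, excl. NotFound/Canceled)" is the percentage of all requests that ended in an unexpected error. It excludes the common business return codes NotFound and Canceled.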
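
When the error percentage rises, a per-code breakdown can show which return codes are responsible. The target below is a sketch of that breakdown, reusing the grpc_service selector from the dashboard and the grpc_code label already referenced in the error-percentage query. The legendFormat and refId are illustrative.

```yaml
# Hypothetical target, not part of the dashboard above.
targets:
  - datasource: prometheus
    # One series per gRPC status code, including OK, NotFound and Canceled,
    # so the excluded codes can be compared against the real errors.
    expr: sum by (grpc_code) (rate(grpc_server_handled_total{grpc_service=~"tekton.results.*"}[5m]))
    legendFormat: "{{grpc_code}}"
    refId: A
```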

Tekton Chains dashboard

Tekton Chains Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-chains
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Chains 概览
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: TaskRun 签名创建数(最近 5 分钟)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: PipelineRun 签名创建数(最近 5 分钟)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: Payloads 存储数(最近 5 分钟,TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_payload_stored_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_payload_stored_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: 标记已签名(最近 5 分钟,TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_marked_signed_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_marked_signed_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

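Interpreting the Tekton Chains dashboard (FAQ)

  • "TaskRun 签名创建数(最近 5 分钟)", "PipelineRun 签名创建数(最近 5 分钟)", "Payloads 存储数(最近 5 分钟,TaskRun vs PipelineRun)", and "标记已签名(最近 5 分钟,TaskRun vs PipelineRun)" use increase(...[5m]) and show the increase over the last five minutes.
  • When there is no new signing or storage activity, these curves drop to 0; this does not indicate a component failure.
  • The payloads-stored and marked-signed panels represent different processing stages, so their values do not always match.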
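
To compare the two stages directly, the difference between them can be plotted as a single series. The target below is a rough sketch using the TaskRun metrics from the dashboard; the legendFormat and refId are illustrative, and a non-zero value is not by itself evidence of a problem, because the stages are counted at different points in processing.

```yaml
# Hypothetical target, not part of the dashboard above.
targets:
  - datasource: prometheus
    # Payloads stored minus TaskRuns marked as signed over the same 5-minute window.
    # sum() collapses per-pod series so both sides match on an empty label set.
    expr: sum(increase(watcher_taskrun_payload_stored_total[5m])) - sum(increase(watcher_taskrun_marked_signed_total[5m]))
    legendFormat: stored minus marked signed (TaskRun)
    refId: A
```

If either metric has no samples in the window, the expression returns no data rather than 0.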