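Metrics Collection for Tekton Components

Overview

Tekton components expose Prometheus-compatible metrics over HTTP endpoints. Once ServiceMonitor resources are deployed, Prometheus (or VictoriaMetrics) can automatically discover and scrape these metrics.

Namespace note: This document uses tekton-pipelines as the default namespace for the control plane components (Pipelines, Triggers, Results, Chains). The main exception is EventListener Services, which run in the application namespace where the EventListener is created.

If your deployment uses a different namespace, update both the namespace in the commands below and the namespaceSelector field in the ServiceMonitor resources.

This document covers metrics for the following Tekton components:

  • Tekton Pipelines - PipelineRun / TaskRun execution metrics
  • Tekton Triggers - metrics for EventListener, TriggerBinding, and related resources
  • Tekton Results - Run deletion and storage metrics
  • Tekton Chains - signing and provenance metrics
  • Controller Framework - infrastructure metrics shared by all controllers

It also covers:

  • How to configure metrics behavior through config-observability
  • How to deploy ServiceMonitor resources for scraping
  • How to verify that metrics collection works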

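Prerequisites

  • The Tekton control plane components are installed and running (at least the ones you plan to scrape: Pipelines, Triggers, Results, and/or Chains).
  • kubectl is configured for the target cluster, and your account can create ServiceMonitor resources in the monitoring namespace.
  • A monitoring stack (Prometheus or a compatible VictoriaMetrics) is deployed and can discover and scrape ServiceMonitor resources (or the equivalent scrape-discovery objects your platform uses).
  • Your Prometheus/VictoriaMetrics instance is configured to discover the ServiceMonitor objects you create (namespace and label selectors must match).
  • Network policies and firewalls allow the scraper Pods to reach the Tekton metrics ports (most control plane Services use 9090; the Triggers controller and EventListener sinks use 9000).
  • If you need EventListener sink metrics, the EventListener must exist in its target namespace and expose the http-metrics port.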

Tekton Pipelines

The Tekton Pipelines component consists of several sub-services that expose metrics on port 9090:

| Service | Description | Metrics port |
|---|---|---|
| tekton-pipelines-controller | Main reconciler for PipelineRun / TaskRun | 9090 |
| tekton-pipelines-webhook | Admission webhook | 9090 |
| tekton-events-controller | CloudEvents controller | 9090 |
| tekton-pipelines-remote-resolvers | Remote resource resolution | 9090 |

Pipeline controller metrics use the prefix tekton_pipelines_controller_.

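PipelineRun metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| pipelinerun_duration_seconds | Histogram / LastValue | PipelineRun execution time (seconds) | status, namespace, pipeline*, pipelinerun*, reason* |
| pipelinerun_total | Counter | Total number of completed PipelineRuns | status |
| running_pipelineruns | LastValue (Gauge) | Number of PipelineRuns currently running | Controlled by metrics.running-pipelinerun.level (see below) |
| running_pipelineruns_waiting_on_pipeline_resolution | LastValue (Gauge) | Number of PipelineRuns waiting for Pipeline reference resolution | - |
| running_pipelineruns_waiting_on_task_resolution | LastValue (Gauge) | Number of PipelineRuns waiting for Task reference resolution | - |

* Labels marked with * are optional and depend on the config-observability configuration.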

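running_pipelineruns label levels

The labels on the running_pipelineruns metric are controlled by metrics.running-pipelinerun.level:

| Level | Labels |
|---|---|
| "" (default, cluster) | No labels |
| "namespace" | namespace |
| "pipeline" | namespace, pipeline |
| "pipelinerun" | namespace, pipeline, pipelinerun |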

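As a rough illustration of the level table above, the following sketch groups a list of running PipelineRuns into one gauge series per label set. The function name, the run dictionaries, and the example namespaces are hypothetical; this is not the controller's implementation, only a model of how the level setting changes the number of series.

```python
# Labels attached to running_pipelineruns for each value of
# metrics.running-pipelinerun.level, copied from the table above.
RUNNING_PIPELINERUN_LABELS = {
    "": (),
    "namespace": ("namespace",),
    "pipeline": ("namespace", "pipeline"),
    "pipelinerun": ("namespace", "pipeline", "pipelinerun"),
}


def running_pipelinerun_series(level, runs):
    """Count running PipelineRuns per label set (one gauge value per series).

    `runs` is a list of dicts with the keys namespace, pipeline, pipelinerun.
    """
    labels = RUNNING_PIPELINERUN_LABELS[level]
    series = {}
    for run in runs:
        key = tuple((name, run[name]) for name in labels)
        series[key] = series.get(key, 0) + 1
    return series


# Hypothetical running PipelineRuns.
runs = [
    {"namespace": "team-a", "pipeline": "build", "pipelinerun": "build-1"},
    {"namespace": "team-a", "pipeline": "build", "pipelinerun": "build-2"},
    {"namespace": "team-b", "pipeline": "deploy", "pipelinerun": "deploy-1"},
]

for level in RUNNING_PIPELINERUN_LABELS:
    print(repr(level), len(running_pipelinerun_series(level, runs)), "series")
```

With the cluster-level default there is a single series; each finer level splits it further, up to one series per run.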
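Status label values

For PipelineRun metrics:

  • success - the PipelineRun completed successfully
  • failed - the PipelineRun failed
  • cancelled - the PipelineRun was cancelled

For TaskRun metrics:

  • success - the TaskRun completed successfully
  • failed - the TaskRun failed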

TaskRun metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| taskrun_duration_seconds | Histogram / LastValue | Execution time of standalone TaskRuns (seconds) | status, namespace, task*, taskrun*, reason* |
| pipelinerun_taskrun_duration_seconds | Histogram / LastValue | Execution time of TaskRuns that are part of a PipelineRun | status, namespace, task*, taskrun*, pipeline*, pipelinerun*, reason* |
| taskrun_total | Counter | Total number of completed TaskRuns | status |
| running_taskruns | LastValue (Gauge) | Number of TaskRuns currently running | - |
| running_taskruns_waiting_on_task_resolution_count | LastValue (Gauge) | Number of TaskRuns waiting for Task reference resolution | - |
| running_taskruns_throttled_by_quota | LastValue (Gauge) | Number of TaskRuns throttled by ResourceQuota | namespace* |
| running_taskruns_throttled_by_node | LastValue (Gauge) | Number of TaskRuns throttled by node-level resource constraints | namespace* |
| taskruns_pod_latency_milliseconds | LastValue | Pod scheduling latency of a TaskRun (milliseconds) | namespace, pod, task*, taskrun* |

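config-observability configuration

The config-observability ConfigMap in the tekton-pipelines namespace controls the metrics behavior of the Pipeline controller. This ConfigMap is managed by the Tekton Operator and should be configured through the spec.pipeline.options.configMaps field of the TektonConfig resource. For details, see Adjusting Optional Configuration Items for Subcomponents.

Hot-reload behavior: config-observability is watched at runtime. Changes to most keys (for example metrics.*) take effect without restarting the Pod. Allow one or two scrape intervals for the change to show up in dashboards and queries. A restart is needed only when Pod spec settings change (for example, changing CONFIG_OBSERVABILITY_NAME in the Deployment).

Example configuration through TektonConfig: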

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    options:
      disabled: false
      configMaps:
        config-observability:
          data:
            metrics.backend-destination: prometheus

            # PipelineRun metrics aggregation level.
            # Values: "pipelinerun" | "pipeline" (default) | "namespace"
            #   - "pipelinerun": includes pipeline + pipelinerun labels; duration uses LastValue
            #   - "pipeline": includes pipeline label only
            #   - "namespace": no pipeline/pipelinerun labels
            metrics.pipelinerun.level: "pipeline"

            # TaskRun metrics aggregation level.
            # Values: "taskrun" | "task" (default) | "namespace"
            #   - "taskrun": includes task + taskrun labels; duration uses LastValue
            #   - "task": includes task label only
            #   - "namespace": no task/taskrun labels
            metrics.taskrun.level: "task"

            # Duration metric type for PipelineRun / TaskRun.
            # Values: "histogram" (default) | "lastvalue"
            # Note: When pipelinerun.level is "pipelinerun" or taskrun.level is "taskrun",
            #       duration type is forced to "lastvalue" regardless of this setting.
            metrics.pipelinerun.duration-type: "histogram"
            metrics.taskrun.duration-type: "histogram"

            # Running PipelineRun metrics aggregation level.
            # Values: "pipelinerun" | "pipeline" | "namespace" | "" (default, cluster-level)
            metrics.running-pipelinerun.level: ""

            # Include reason label on duration metrics (pipelinerun_duration_seconds,
            # taskrun_duration_seconds, pipelinerun_taskrun_duration_seconds).
            # Values: "true" | "false" (default)
            # Warning: Enabling this increases label cardinality.
            # Note: Despite the key name, this does NOT affect count metrics
            # (pipelinerun_total / taskrun_total), only duration metrics.
            metrics.count.enable-reason: "false"

            # Include namespace label on throttled TaskRun metrics.
            # Values: "true" | "false" (default)
            metrics.taskrun.throttle.enable-namespace: "false"

Histogram Buckets

When the duration type is histogram, the following bucket boundaries are used (in seconds):

10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400

These correspond to 10s, 30s, 1m, 5m, 15m, 30m, 1h, 1.5h, 3h, 6h, 12h, and 24h.
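Because the buckets are coarse, percentiles derived from them are estimates. The sketch below shows the general idea of bucket-based quantile estimation (linear interpolation inside the bucket that contains the target rank), using the boundaries listed above. The function names and the sample durations are hypothetical, and the code is a simplified model rather than Prometheus's exact implementation.

```python
import math

# Bucket upper bounds in seconds from the list above, plus the implicit +Inf bucket.
BUCKETS = [10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400, math.inf]


def cumulative_counts(durations):
    """Return cumulative counts per bucket (a value counts when value <= bound)."""
    return [sum(1 for d in durations if d <= le) for le in BUCKETS]


def estimate_quantile(q, counts):
    """Estimate quantile q (0 < q <= 1) by interpolating inside the matching bucket."""
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(counts):
        if count >= rank:
            break
    if math.isinf(BUCKETS[i]):
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return BUCKETS[i - 1]
    lower = BUCKETS[i - 1] if i > 0 else 0.0
    prev = counts[i - 1] if i > 0 else 0
    if count == prev:
        return lower
    return lower + (BUCKETS[i] - lower) * (rank - prev) / (count - prev)


# Hypothetical run durations in seconds.
counts = cumulative_counts([45, 50, 200, 400])
print(estimate_quantile(0.5, counts), estimate_quantile(0.95, counts))
```

In this example the longest run took 400 seconds, yet the estimated P95 lands well inside the 300 to 900 second bucket, which is why wide buckets should be kept in mind when reading duration panels.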

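Recommended production configuration

In production, use aggregated levels to keep label cardinality under control:

metrics.pipelinerun.level: "pipeline"
metrics.taskrun.level: "task"
metrics.pipelinerun.duration-type: "histogram"
metrics.taskrun.duration-type: "histogram"
metrics.count.enable-reason: "false"

If you need to troubleshoot at the granularity of individual runs, you can temporarily switch to:

metrics.pipelinerun.level: "pipelinerun"
metrics.taskrun.level: "taskrun"

Note that this significantly increases the number of time series.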
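A back-of-the-envelope estimate can make the cardinality difference concrete. The sketch below assumes the 12 bucket boundaries listed under Histogram Buckets and a purely hypothetical workload (20 namespace/pipeline pairs, 3 status values, 500 PipelineRuns per day); the function names are illustrative only.

```python
FINITE_BUCKETS = 12  # bucket boundaries listed under "Histogram Buckets"


def histogram_series(label_combinations):
    """Series exposed by one histogram metric: finite buckets, +Inf, _sum, _count."""
    return label_combinations * (FINITE_BUCKETS + 1 + 2)


def per_run_series(runs_per_day, days):
    """Distinct LastValue series created when every run carries its own label value."""
    return runs_per_day * days


# Hypothetical workload: 20 namespace/pipeline pairs, 3 status values,
# 500 PipelineRuns per day, observed over 30 days.
bounded = histogram_series(20 * 3)
growing = per_run_series(500, 30)
print(bounded, growing)
```

At the pipeline level the series count is bounded by the number of label combinations, whereas at the pipelinerun level every new run adds another series for the duration metric.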


Tekton Triggers

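The Tekton Triggers component exposes two kinds of metrics from different processes.

Controller metrics (port 9000)

The Triggers controller reports resource count metrics every 60 seconds.

| Service | Metrics port |
|---|---|
| tekton-triggers-controller | 9000 |

Triggers controller metrics use the prefix controller_.

| Metric name | Type | Description | Labels |
|---|---|---|---|
| eventlistener_count | LastValue (Gauge) | Number of EventListener resources | - |
| triggerbinding_count | LastValue (Gauge) | Number of TriggerBinding resources | - |
| clustertriggerbinding_count | LastValue (Gauge) | Number of ClusterTriggerBinding resources | - |
| triggertemplate_count | LastValue (Gauge) | Number of TriggerTemplate resources | - |
| clusterinterceptor_count | LastValue (Gauge) | Number of ClusterInterceptor resources | - |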

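EventListener Sink metrics

Each EventListener Pod exposes additional HTTP and event-processing metrics. These metrics come from the EventListener sink process (not the controller). The Prometheus metric prefix is eventlistener_.

| Metric name (Prometheus) | Type | Description | Labels |
|---|---|---|---|
| eventlistener_http_duration_seconds | Histogram | Duration of EventListener HTTP requests | - |
| eventlistener_event_received_count | Counter | Total number of events received by the sink | status |
| eventlistener_triggered_resources | Counter | Total number of resources created by triggers | kind |

  • Histogram buckets for eventlistener_http_duration_seconds: 0.001, 0.01, 0.1, 1, 10 (seconds)
  • status values for eventlistener_event_received_count: succeeded, failed
  • kind values for eventlistener_triggered_resources: the Kubernetes resource Kind of the created object (for example PipelineRun, TaskRun)

These sink metrics are exposed per EventListener Pod, not by a central controller. If the EventListener Pods expose a metrics port, you may need a separate ServiceMonitor or PodMonitor to scrape them.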


Tekton Results

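Tekton Results has two sub-services that expose metrics.

| Service | Description | Metrics port |
|---|---|---|
| tekton-results-watcher | Watches and cleans up PipelineRun/TaskRun resources | 9090 |
| tekton-results-api | gRPC/REST API server | 9090 |

Watcher metrics

Watcher metrics use the prefix watcher_.

Deletion metrics

| Metric name | Type | Description | Labels |
|---|---|---|---|
| pipelinerun_delete_count | Counter | Total number of deleted PipelineRuns | status, namespace |
| pipelinerun_delete_duration_seconds | Histogram / LastValue | Time from PipelineRun completion to deletion | status, namespace, pipeline* |
| taskrun_delete_count | Counter | Total number of deleted TaskRuns | status, namespace |
| taskrun_delete_duration_seconds | Histogram / LastValue | Time from TaskRun completion to deletion | status, namespace, pipeline*, task* |

* Optional labels depend on the Results Watcher's config-observability settings.

Note: pipelinerun_delete_count, pipelinerun_delete_duration_seconds, taskrun_delete_count, and taskrun_delete_duration_seconds are recorded only when the Watcher actually performs a deletion. Unless the --completed_run_grace_period flag on the tekton-results-watcher Deployment is set to a non-zero value, these metrics stay empty (no data points). The flag defaults to 0, which disables automatic deletion. Set it to a positive duration (for example 10m) to enable deletion after the grace period, or to a negative value to delete immediately after archiving.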
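The three cases of --completed_run_grace_period described in the note can be summarized in a few lines. This is only a restatement of the note as code; the function name and the returned strings are hypothetical and do not come from the Results source.

```python
from datetime import timedelta


def deletion_policy(grace_period):
    """Describe what the Watcher does for a given --completed_run_grace_period."""
    if grace_period == timedelta(0):
        # Default: automatic deletion is disabled, so delete metrics stay empty.
        return "disabled"
    if grace_period > timedelta(0):
        return "delete after grace period"
    # Negative values: delete as soon as the run has been archived.
    return "delete immediately after archiving"


print(deletion_policy(timedelta(minutes=10)))
```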

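Status label values for the Results Watcher:

  • success - the Run completed successfully
  • failed - the Run failed
  • cancelled - the Run was cancelled

Shared metrics

These metrics are registered jointly by the PipelineRun and TaskRun reconcilers in the Watcher and track storage-related events.

| Metric name | Type | Description | Labels |
|---|---|---|---|
| runs_not_stored_count | Counter | Number of runs deleted without being stored in Results | kind, namespace |
| run_storage_latency_seconds | Histogram | Time from run completion to successful storage | kind, namespace |

The kind label identifies the run type (PipelineRun / TaskRun in some metric series, pipelinerun / taskrun in others).

Note: runs_not_stored_count is recorded only when a run is deleted externally (for example with kubectl delete) while the Watcher holds a finalizer to coordinate archiving. The metric stays empty unless all of the following conditions hold:

  1. The --logs_api flag is false (log storage disabled). If logs are enabled, the Watcher skips finalizer-based coordination entirely.
  2. The --disable_crd_update flag is false (annotation updates enabled).
  3. The --store_deadline flag is set to a non-zero duration. This is the maximum time the Watcher waits for archiving to complete before giving up and allowing deletion.
  4. The run is deleted externally before it is successfully archived (no results.tekton.dev/stored=true annotation), and store_deadline has already expired.

In normal operation (runs are archived before deletion, or the Watcher itself triggers deletion through --completed_run_grace_period), this counter stays at 0. A non-zero value indicates possible data loss: a run was deleted before its state was saved to the Results API.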
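The four conditions above can be read as a single predicate. The sketch below encodes them directly; the class, field, and function names are hypothetical mirrors of the flags and are not taken from the Watcher source.

```python
from dataclasses import dataclass


@dataclass
class WatcherFlags:
    logs_api: bool                 # --logs_api
    disable_crd_update: bool       # --disable_crd_update
    store_deadline_seconds: float  # --store_deadline, 0 means unset


def increments_runs_not_stored(flags, stored, deadline_expired):
    """Return True when an external delete would be counted as 'not stored'.

    stored: the run carries the results.tekton.dev/stored=true annotation.
    deadline_expired: store_deadline has already passed for this run.
    """
    finalizer_active = (
        not flags.logs_api
        and not flags.disable_crd_update
        and flags.store_deadline_seconds > 0
    )
    return finalizer_active and not stored and deadline_expired


flags = WatcherFlags(logs_api=False, disable_crd_update=False, store_deadline_seconds=30)
print(increments_runs_not_stored(flags, stored=False, deadline_expired=True))
```

If any one input changes (logs enabled, annotation present, deadline unset or not yet expired), the predicate is false and the counter is not expected to move.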

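Quick reproduction (test environments): If you do not see this metric, it usually means the trigger conditions were not met, not that the metric is missing.

  1. Through TektonConfig, configure the Results Watcher so that logs_api=false, disable_crd_update=false, and store_deadline is non-zero (for example 30s).
  2. Through TektonConfig, temporarily set the Results API replica count to 0 (spec.result.options.deployments.tekton-results-api.spec.replicas: 0) so that runs cannot be archived.
  3. Create a TaskRun or PipelineRun and wait for it to complete.
  4. Wait until store_deadline has passed, then delete the run externally (kubectl delete ...).
  5. Check watcher_runs_not_stored_count (the component-prefixed name in the exposition format) in the Watcher's /metrics output or in Prometheus; it should have increased.
  6. Restore the original TektonConfig (re-enable the Results API replicas and the normal logs_api setting).

The run_storage_latency_seconds histogram uses the following bucket boundaries (in seconds):

0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300, 600, 1800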

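Watcher config-observability

The Results Watcher has its own config-observability ConfigMap (named through the CONFIG_OBSERVABILITY_NAME environment variable, usually tekton-results-config-observability). This ConfigMap is managed by the Tekton Operator and should be configured through the spec.result.options.configMaps field of the TektonConfig resource. For details, see Adjusting Optional Configuration Items for Subcomponents.

Hot-reload behavior: The Results Watcher also watches this ConfigMap and applies changes to most keys without restarting the Pod. A restart is needed only when Deployment-level settings (for example env vars or args) change.

It supports the following keys:

| Key | Default | Values | Description |
|---|---|---|---|
| metrics.pipelinerun.level | pipeline | pipeline, namespace | Controls the pipeline label on delete duration metrics |
| metrics.taskrun.level | task | task, namespace | Controls the task label on delete duration metrics |
| metrics.pipelinerun.duration-type | histogram | histogram, lastvalue | Aggregation type of the delete duration metrics for both PipelineRun and TaskRun |
| metrics.taskrun.duration-type | histogram | histogram, lastvalue | Parsed but currently unused; metrics.pipelinerun.duration-type controls both |

Note: Unlike Tekton Pipelines, the Results Watcher does not support the per-run pipelinerun / taskrun granularity levels. It also has no metrics.count.enable-reason, metrics.running-pipelinerun.level, or metrics.taskrun.throttle.enable-namespace keys.

Known upstream issue: taskrun_delete_duration_seconds uses metrics.pipelinerun.duration-type (not metrics.taskrun.duration-type) to determine its aggregation type. This appears to be a copy-paste bug in the Results source code.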

API Server metrics

The API server exposes standard gRPC Prometheus metrics on port 9090 through the go-grpc-prometheus library. These include:

  • grpc_server_handled_total - total number of RPCs completed on the server
  • grpc_server_started_total - total number of RPCs started on the server
  • grpc_server_msg_received_total / grpc_server_msg_sent_total - message counts
  • grpc_server_handling_seconds (if PROMETHEUS_HISTOGRAM is enabled) - RPC handling duration

Tekton Chains

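Tekton Chains is a security component that generates, signs, and stores provenance for artifacts built with Tekton Pipelines. It observes completed TaskRuns and PipelineRuns, then creates attestations and signatures.

| Service | Description | Metrics port |
|---|---|---|
| tekton-chains-metrics | Chains watcher/controller | 9090 (http-metrics) |

Chains controller metrics use the prefix watcher_ (the same as the Results Watcher, but the custom metric names differ, so they do not collide).

Chains metrics

All Chains metrics are Counters without labels.

| Metric name (Prometheus) | Type | Description |
|---|---|---|
| watcher_taskrun_sign_created_total | Counter | Total number of signed messages for TaskRuns |
| watcher_taskrun_payload_stored_total | Counter | Total number of stored payloads for TaskRuns |
| watcher_taskrun_marked_signed_total | Counter | Total number of TaskRuns marked as signed |
| watcher_pipelinerun_sign_created_total | Counter | Total number of signed messages for PipelineRuns |
| watcher_pipelinerun_payload_stored_total | Counter | Total number of stored payloads for PipelineRuns |
| watcher_pipelinerun_marked_signed_total | Counter | Total number of PipelineRuns marked as signed |

Note: The official Tekton Chains documentation also mentions *_signing_failures_total counters for TaskRun and PipelineRun, but these metrics do not exist in the current upstream source code. Verify against the version you have deployed.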


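Controller Framework metrics

All Tekton controllers automatically expose the following infrastructure metrics. They use the same prefix as the component's custom metrics (for example tekton_pipelines_controller_, controller_, watcher_).

| Metric name (without prefix) | Type | Description |
|---|---|---|
| client_latency | Histogram | Kubernetes API client request latency (seconds) |
| client_results | Counter | Kubernetes API request count (by status code) |
| workqueue_depth | Gauge | Current workqueue depth |
| workqueue_adds_total | Counter | Total number of workqueue adds |
| workqueue_queue_latency_seconds | Histogram | Time an item waits in the workqueue |
| workqueue_work_duration_seconds | Histogram | Time spent processing a workqueue item |
| workqueue_retries_total | Counter | Total number of workqueue retries |
| workqueue_unfinished_work_seconds | Histogram | Duration of unfinished workqueue items |
| workqueue_longest_running_processor_seconds | Histogram | Duration of the longest-running workqueue processor |
| reconcile_count | Counter | Total number of reconciler invocations (labeled by reconciler, success, namespace_name) |
| reconcile_latency | Histogram | Reconciler invocation latency (labeled by reconciler, success, namespace_name) |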

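Setting up ServiceMonitor

To enable Prometheus scraping for Tekton components, deploy ServiceMonitor resources.

For prerequisites, see Prerequisites.

Follow the guidance below according to your monitoring stack:

  • If you use Prometheus (Prometheus Operator), labels such as metadata.labels.prometheus: kube-prometheus must match the spec.serviceMonitorSelector of the Prometheus CR; otherwise the ServiceMonitor is not scraped.
  • If you use VictoriaMetrics, a label like prometheus: kube-prometheus is usually not required; create ServiceMonitor / VMServiceScrape resources according to your monitoring configuration.

With Prometheus, use the following commands to find and verify the selector:

# 1) Locate the Prometheus CR (resource type: monitoring.coreos.com/v1, Kind=Prometheus)
$ kubectl get prometheus -A

# 2) Inspect the ServiceMonitor selector on the target Prometheus instance
$ kubectl get prometheus -n <prometheus-namespace> <prometheus-name> -o yaml | yq '.spec.serviceMonitorSelector'

If no Prometheus CR exists in the cluster, monitoring is usually platform-managed (for example VictoriaMetrics) or implemented differently. In that case a label like prometheus: kube-prometheus is usually not required; follow your platform's scrape rules.

For more information, see Integrating External Metrics.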

Pipeline ServiceMonitor

Pipeline ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-pipelines
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-pipelines
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor matches the Pipeline services labeled app.kubernetes.io/part-of: tekton-pipelines (including remote-resolvers) and scrapes them in the tekton-pipelines namespace.

Triggers ServiceMonitor

Triggers ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-triggers-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-triggers
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-triggers
      app.kubernetes.io/component: controller
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor collects only the Triggers controller metrics (controller_*). It does not include EventListener sink metrics.

EventListener Sink ServiceMonitor

EventListener Sink ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-eventlistener-sink-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-eventlistener-sink
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: eventlistener
      operator: Exists
    - key: app.kubernetes.io/managed-by
      operator: In
      values:
      - EventListener
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    any: true

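EventListener Services usually run in application namespaces, so this example uses namespaceSelector.any: true to scrape across namespaces. If you need a tighter scope, switch to matchNames and list the allowed namespaces explicitly.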
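To reason about which Services the matchExpressions selector above would pick up, the following sketch evaluates the two operators it uses (Exists and In) against a label map. The selector is copied from the ServiceMonitor; the function name and the example label values ("github", "Helm") are hypothetical.

```python
def matches_expressions(labels, expressions):
    """Return True when a label map satisfies every matchExpressions entry."""
    for expr in expressions:
        key, operator = expr["key"], expr["operator"]
        if operator == "Exists":
            if key not in labels:
                return False
        elif operator == "In":
            # The key must be present and its value must be one of the listed values.
            if labels.get(key) not in expr["values"]:
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {operator}")
    return True


# Selector copied from the EventListener Sink ServiceMonitor above.
SINK_SELECTOR = [
    {"key": "eventlistener", "operator": "Exists"},
    {"key": "app.kubernetes.io/managed-by", "operator": "In", "values": ["EventListener"]},
]

# Hypothetical Service labels.
print(matches_expressions(
    {"eventlistener": "github", "app.kubernetes.io/managed-by": "EventListener"},
    SINK_SELECTOR,
))
```

Both expressions must hold, so a Service that lacks the eventlistener label, or that is managed by something other than EventListener, is not selected.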

Results ServiceMonitor

Results services carry both the app.kubernetes.io/part-of: tekton-results and app.kubernetes.io/name labels. To target exactly the API and the Watcher (and exclude Postgres), this example matches on app.kubernetes.io/name:

Results ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-results-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-results
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: app.kubernetes.io/name
      operator: In
      values:
      - tekton-results-api
      - tekton-results-watcher
  endpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  - port: metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

The Results API server uses the port name prometheus (9090), and the Watcher uses the port name metrics (9090). Each service exposes only one of these port names, so only the matching endpoint is scraped.

Chains ServiceMonitor

Chains ServiceMonitor YAML
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-chains-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-chains
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-chains
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

Verification

After deploying the ServiceMonitor resources, verify that Prometheus is scraping the targets.

Check the metrics endpoints directly

# Pipeline controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-pipelines-controller 9090:9090
$ curl -s http://localhost:9090/metrics | grep tekton_pipelines_controller_

# HELP tekton_pipelines_controller_client_latency How long Kubernetes API requests take
# TYPE tekton_pipelines_controller_client_latency histogram
tekton_pipelines_controller_client_latency_bucket{name="",le="1e-05"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.0001"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.001"} 0

# Triggers controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-triggers-controller 9000:9000
$ curl -s http://localhost:9000/metrics | grep controller_

# HELP controller_client_latency How long Kubernetes API requests take
# TYPE controller_client_latency histogram
controller_client_latency_bucket{name="",le="1e-05"} 0
controller_client_latency_bucket{name="",le="0.0001"} 1
controller_client_latency_bucket{name="",le="0.001"} 2

# EventListener sink metrics (replace namespace/service)
$ kubectl port-forward -n <eventlistener-namespace> svc/<eventlistener-service> 9000:9000
$ curl -s http://localhost:9000/metrics | grep eventlistener_

# HELP eventlistener_client_latency How long Kubernetes API requests take
# TYPE eventlistener_client_latency histogram
eventlistener_client_latency_bucket{name="",le="1e-05"} 0
eventlistener_client_latency_bucket{name="",le="0.0001"} 0
eventlistener_client_latency_bucket{name="",le="0.001"} 0

# HELP eventlistener_triggered_resources Count of the number of triggered eventlistener resources
# TYPE eventlistener_triggered_resources counter
eventlistener_triggered_resources{kind="PipelineRun"} 10

# Results watcher
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-watcher 9091:9090
$ curl -s http://localhost:9091/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

# Results API
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-api-service 9092:9090
$ curl -s http://localhost:9092/metrics | grep grpc_server_

# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Aborted",grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0

# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 337606
grpc_server_started_total{grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 10301
grpc_server_started_total{grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 832

# Chains controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-chains-metrics 9093:9090
$ curl -s http://localhost:9093/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

EventListener sink metrics such as eventlistener_event_received_count and eventlistener_http_duration_seconds are request-driven. Send at least one request to the EventListener before verifying them.
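When checking a saved /metrics response, a small script can report which expected metric families have no samples yet. The sketch below parses Prometheus text exposition lines with a simplified regular expression; the sample text reuses the EventListener output shown earlier, and the function names are hypothetical.

```python
import re

# Sample /metrics output, taken from the EventListener example above.
SAMPLE = '''
# HELP eventlistener_client_latency How long Kubernetes API requests take
# TYPE eventlistener_client_latency histogram
eventlistener_client_latency_bucket{name="",le="1e-05"} 0
# HELP eventlistener_triggered_resources Count of the number of triggered eventlistener resources
# TYPE eventlistener_triggered_resources counter
eventlistener_triggered_resources{kind="PipelineRun"} 10
'''

# Simplified sample-line pattern: name, optional labels, value, optional timestamp.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(\s+-?\d+)?$')


def sample_names(exposition_text):
    """Return the set of sample names, skipping comments and blank lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_families(exposition_text, expected):
    """List expected families with no samples (histograms match via a suffix)."""
    names = sample_names(exposition_text)
    return [
        family for family in expected
        if not any(n == family or n.startswith(family + "_") for n in names)
    ]


expected = [
    "eventlistener_event_received_count",
    "eventlistener_http_duration_seconds",
    "eventlistener_triggered_resources",
]
print(missing_families(SAMPLE, expected))
```

A family that is reported as missing here is not necessarily broken; for request-driven metrics it may simply mean that no request has reached the EventListener yet.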

Check Prometheus targets

# Verify that the ServiceMonitor resources exist
$ kubectl get servicemonitor -n tekton-pipelines

NAME                                AGE
tekton-chains-metrics               10m
tekton-eventlistener-sink-metrics   10m
tekton-pipelines-metrics            10m
tekton-results-metrics              10m
tekton-triggers-metrics             10m

# Check the Prometheus targets (through the Prometheus UI or API)
# Look for targets whose job label matches the ServiceMonitor name

PromQL example queries

# Cumulative PipelineRun success rate (avoids misleading values in windows with no completions)
100 * sum(tekton_pipelines_controller_pipelinerun_total{status="success"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)

# PipelineRuns completed in the last 5 minutes (throughput)
round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))

# PipelineRun duration P95 (histogram mode)
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)

# TaskRun duration P95 (histogram mode, standalone plus in-pipeline TaskRuns)
histogram_quantile(0.95,
  (
    sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m]))
    +
    sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m]))
  )
)

# PipelineRun duration (lastvalue mode)
avg_over_time(tekton_pipelines_controller_pipelinerun_duration_seconds[5m])

# Currently running PipelineRuns (single series, avoids duplicate legends)
max(tekton_pipelines_controller_running_pipelineruns)

# TaskRuns throttled by resource quota
max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)

# Trigger resource counts
controller_eventlistener_count
controller_triggertemplate_count

# Chains signing activity
watcher_taskrun_sign_created_total
watcher_pipelinerun_sign_created_total
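The arithmetic behind the cumulative success-rate query can be checked by hand. The sketch below mirrors the expression 100 * success / clamp_min(total, 1) for plain numbers; the function name and the counter values are hypothetical, and it models only the arithmetic, not PromQL's handling of absent series.

```python
def success_rate(success_total, all_total):
    """Mirror 100 * success / clamp_min(total, 1) for two counter values."""
    # clamp_min(total, 1) keeps the denominator at 1 or above,
    # so a total of zero does not cause a division by zero.
    return 100 * success_total / max(all_total, 1)


print(success_rate(19, 20))
print(success_rate(0, 0))
```

With a total of zero the clamped denominator yields 0 instead of an undefined ratio.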

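MonitorDashboard examples

The following MonitorDashboard resources provide ready-to-use dashboards for monitoring Tekton components. Deploy them to the tekton folder in the cpaas-system namespace.

Important: Every panel must include id (a unique integer), datasource: prometheus, and transformations: []. Every target must include datasource: prometheus and refId. The Duration P50/P95 panels in this document use *_bucket queries and require metrics.*.duration-type=histogram; if you use lastvalue, replace those queries with LastValue-style expressions such as avg_over_time(...).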

Tekton Pipeline Dashboard

Tekton Pipeline Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-pipeline
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Pipeline Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun Total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_pipelinerun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun Total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_taskrun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun Success Rate (cumulative)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 16, y: 0 }
        targets:
          - datasource: prometheus
            expr: "100 * sum(tekton_pipelines_controller_pipelinerun_total{status=\"success\"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)"
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: thresholds }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds:
              mode: absolute
              steps:
                - { color: red, value: null }
                - { color: orange, value: 80 }
                - { color: green, value: 95 }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 12
        title: Completed PipelineRuns (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: "round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))"
            legendFormat: completed
            refId: A
        fieldConfig:
          defaults:
            unit: short
            decimals: 0
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: Running PipelineRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Running TaskRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: TaskRuns Throttled
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)
            legendFormat: by quota
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_node)
            legendFormat: by node
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: PipelineRun Duration P50 / P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: TaskRun Duration P50 / P95 (Standalone)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 13
        title: TaskRun Duration P50 / P95 (In-Pipeline)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 9
        title: Workqueue Depth
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_workqueue_depth)
            legendFormat: depth
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 10
        title: Reconcile Count (by success)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 24 }
        targets:
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="true"}[5m]))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="false"}[5m]))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 11
        title: Resolution Waiting
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_pipeline_resolution)
            legendFormat: PR waiting pipeline
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_task_resolution)
            legendFormat: PR waiting task
            refId: B
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_waiting_on_task_resolution_count)
            legendFormat: TR waiting task
            refId: C
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

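Tekton Pipeline Dashboard Interpretation (FAQ)

  • PipelineRun Total (by status) is a counter of completion events recorded by the controller, not the total number of PipelineRun objects. In the current implementation, user-initiated cancellations (spec.status=Cancelled) may not pass through this counting path, so a cancelled series may not appear. To verify the number of cancellations, inspect the PipelineRun objects and events.
  • Running PipelineRuns is a real-time snapshot (the number currently running). It can change independently of PipelineRun Total.
  • Completed PipelineRuns (last 5m) represents throughput (the number of runs newly completed in the last 5 minutes). A value of 0 during low-traffic or idle periods is normal.
  • PipelineRun Success Rate (cumulative) is accumulated since the controller started; it is not a 5-minute window success rate. Short-term failures do not cause an immediate large change.
  • Reconcile Count (by success) counts the controller's reconcile loops, not the number of PipelineRuns.
  • Status series are shown only for label values that actually have samples in the selected time range. If a status has no samples in the window, its curve and legend entry do not appear.
  • TaskRun Duration P50 / P95 (Standalone) and TaskRun Duration P50 / P95 (In-Pipeline) are split into separate panels to avoid the instability of a mixed query. In environments that expose only one histogram family, it is normal for the other chart to be empty.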
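
Because the cumulative success rate reacts slowly, a windowed variant can be useful next to it. The following is a minimal sketch of an extra panel target, not part of the dashboard above. It assumes that the status label of tekton_pipelines_controller_pipelinerun_total carries the value success in your Tekton version; the legend text and refId are illustrative.

```yaml
# Sketch only: success percentage over the last 5 minutes.
# Assumption: completed runs are counted with status="success"; verify the
# label values against your own metrics endpoint before relying on this.
targets:
  - datasource: prometheus
    # "or vector(0)" keeps the numerator defined when no success series exists.
    # The "> 0" filter drops idle windows, so the panel shows no data instead of NaN.
    expr: 100 * (sum(increase(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m])) or vector(0)) / (sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])) > 0)
    legendFormat: success % (5m)
    refId: A
```

With this target, a burst of failures shows up within one 5-minute window, while the cumulative panel continues to reflect the long-term average.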

Tekton Triggers Dashboard

Tekton Triggers Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-triggers
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Triggers Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: EventListener Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TriggerTemplate Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 5, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: TriggerBinding Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 10, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: ClusterTriggerBinding
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 15, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: ClusterInterceptor
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: All Trigger Resource Counts (trend)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 24, x: 0, y: 6 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: B
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: C
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: D
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: E
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

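Tekton Triggers Dashboard Interpretation (FAQ)

  • EventListener Count, TriggerTemplate Count, TriggerBinding Count, ClusterTriggerBinding, and ClusterInterceptor are snapshots of object counts, not request volume or event-processing throughput.
  • All Trigger Resource Counts (trend) shows the combined trend of the same resource counts. A brief deviation from the single-resource charts within one scrape interval is normal.
  • A value of 0 when no Triggers resources exist is normal and does not indicate a scrape failure.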
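
A value of 0 and a missing series can look alike on a chart. As a sketch, an additional target based on the PromQL absent() function returns 1 only when the metric has no series at all, which typically points to a scrape or discovery problem and not to an empty cluster. The legend text and refId below are illustrative and not part of the dashboard above.

```yaml
# Sketch only: distinguish "count is 0" from "metric is not being scraped".
targets:
  - datasource: prometheus
    # absent() yields a single sample with value 1 when no matching series exists,
    # and yields nothing when the metric is present (even if its value is 0).
    expr: absent(controller_eventlistener_count)
    legendFormat: eventlistener metric missing
    refId: A
```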

Tekton Results Dashboard

Tekton Results Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-results
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Results Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun Reconcile Count (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun Reconcile Count (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: TaskRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Workqueue Depth (PipelineRun vs TaskRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"})
            legendFormat: pipelinerun
            refId: A
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"})
            legendFormat: taskrun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: Workqueue Adds (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 16 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: pipelinerun adds
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: taskrun adds
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: gRPC Request Rate (Results API)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: "sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m]))"
            legendFormat: requests
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: gRPC Error Percentage (Results API, excl. NotFound/Canceled)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 24 }
        targets:
          - datasource: prometheus
            expr: "100 * ((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\",grpc_code!~\"OK|NotFound|Canceled\"}[5m])) or vector(0)) / clamp_min((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m])) or vector(0)), 0.001))"
            legendFormat: error %
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: red, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

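Tekton Results Dashboard Interpretation (FAQ)

  • This dashboard version is based on the Results Watcher reconcile/workqueue metrics and the Results API gRPC metrics, so it continues to show data under common deployment modes (logs_api=true, automatic deletion disabled).
  • PipelineRun Reconcile Count (last 5m) and TaskRun Reconcile Count (last 5m) show the increase over the last 5 minutes for success=true and success=false, respectively.
  • PipelineRun Reconcile Latency P95 and TaskRun Reconcile Latency P95 are computed from the watcher reconcile latency histogram. Under low traffic, the curve may be sparse.
  • Workqueue Depth shows the current queue depth; Workqueue Adds (last 5m) shows the number of items enqueued in the last 5 minutes.
  • gRPC Error Percentage (Results API, excl. NotFound/Canceled) is the percentage of abnormal errors among all requests, excluding common business return codes (NotFound and Canceled).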
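
To see which return codes drive the error percentage, including the excluded NotFound and Canceled codes, the request rate can be broken down by grpc_code. This is a minimal sketch of an extra target that reuses the metric and label names from the panels above. It assumes the dashboard supports Grafana-style {{label}} legend templates, which this document does not confirm.

```yaml
# Sketch only: Results API request rate per gRPC status code.
targets:
  - datasource: prometheus
    expr: "sum by (grpc_code) (rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m]))"
    # Assumption: legend templating with {{grpc_code}} is available on your platform.
    legendFormat: "{{grpc_code}}"
    refId: A
```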

Tekton Chains Dashboard

Tekton Chains Dashboard YAML
kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-chains
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Chains Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: TaskRun Signatures Created (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: PipelineRun Signatures Created (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: Payloads Stored (last 5m, TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_payload_stored_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_payload_stored_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: Marked Signed (last 5m, TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_marked_signed_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_marked_signed_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

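Tekton Chains Dashboard Interpretation (FAQ)

  • TaskRun Signatures Created (last 5m), PipelineRun Signatures Created (last 5m), Payloads Stored (last 5m), and Marked Signed (last 5m) all use increase(...[5m]), which represents the increase over the last 5 minutes.
  • When there is no new signing or storage activity, these curves drop to 0; this does not mean the component is unhealthy.
  • Payloads Stored and Marked Signed represent different processing stages, so their values do not always match.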
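
If the gap between the two stages is of interest, it can be plotted directly. The sketch below subtracts the 5-minute increase of marked-signed TaskRuns from the 5-minute increase of stored payloads, using the metric names from the panels above. The legend text and refId are illustrative, and a nonzero value is not by itself an error, since the stages count different things.

```yaml
# Sketch only: difference between payloads stored and TaskRuns marked signed (last 5m).
targets:
  - datasource: prometheus
    # sum() removes per-series labels so the two sides can be matched one-to-one.
    # The result is empty if either metric has no series in the window.
    expr: sum(increase(watcher_taskrun_payload_stored_total[5m])) - sum(increase(watcher_taskrun_marked_signed_total[5m]))
    legendFormat: stored minus marked signed (TaskRun)
    refId: A
```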