Integrating with Alauda DevOps Pipelines

This page shows how to leverage the scheduling and resource management capabilities of the Alauda Build of Kueue when running Alauda DevOps Pipelines (Tekton Pipelines).

Prerequisites

  • Alauda DevOps Pipelines is installed.
  • The Alauda Build of Kueue is installed.
  • The Alauda Build of Hami is installed (used here to demonstrate vGPU resources).
  • The Alauda Container Platform Web CLI can communicate with your cluster.

Procedure

  1. In Alauda Container Platform, create a project and a namespace, for example, a project named test and a namespace named test-1.

  2. Create the queue resources by running the following command:

    cat <<EOF | kubectl create -f -
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cluster-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 9
          - name: "memory"
            nominalQuota: 36Gi
          - name: "pods"
            nominalQuota: 5
          - name: "nvidia.com/gpualloc"
            nominalQuota: "2"
          - name: "nvidia.com/total-gpucores"
            nominalQuota: "50"
          - name: "nvidia.com/total-gpumem"
            nominalQuota: "20000"
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    ---
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: test-1
      name: test
    spec:
      clusterQueue: cluster-queue
    EOF
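The ClusterQueue, ResourceFlavor, and LocalQueue above work together: workloads submitted to the test LocalQueue are admitted only while their aggregated requests fit within each covered resource's nominalQuota. A minimal Python sketch of that admission idea (simplified; real Kueue also handles multiple flavors, cohorts, borrowing, and preemption, and the `admits` helper below is illustrative, not Kueue's API):

```python
# Simplified model of Kueue's nominal-quota admission check.
# Real Kueue also handles multiple flavors, cohorts, borrowing, and
# preemption; this only compares aggregated requests to nominalQuota.

# nominalQuota of "default-flavor" in the ClusterQueue above,
# normalized to plain numbers (memory in Gi, gpumem in MiB-like units).
NOMINAL_QUOTA = {
    "cpu": 9,
    "memory": 36,                       # Gi
    "pods": 5,
    "nvidia.com/gpualloc": 2,
    "nvidia.com/total-gpucores": 50,
    "nvidia.com/total-gpumem": 20000,
}

def admits(usage: dict, request: dict) -> bool:
    """Return True if `request` fits into the quota left after `usage`."""
    return all(
        usage.get(res, 0) + request.get(res, 0) <= quota
        for res, quota in NOMINAL_QUOTA.items()
    )

# An idle queue admits a small workload...
print(admits({}, {"cpu": 1, "memory": 1, "pods": 1}))             # True
# ...but not one exceeding even a single covered resource.
print(admits({}, {"pods": 1, "nvidia.com/total-gpucores": 100}))  # False
```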
  3. Create a Pipeline resource in Alauda Container Platform through the Web CLI or the UI:

    apiVersion: tekton.dev/v1
    kind: Pipeline
    metadata:
      name: test
      namespace: test-1
    spec:
      tasks:
        - name: run-script
          taskSpec:
            description: test
            metadata: {}
            spec: null
            steps:
              - computeResources:
                  limits:
                    cpu: "2"
                    memory: 2Gi
                    nvidia.com/gpualloc: "2"
                    nvidia.com/gpucores: "50"
                    nvidia.com/gpumem: 8k
                  requests:
                    cpu: "1"
                    memory: 1Gi
                image: nvidia/cuda:11.0-base
                imagePullPolicy: IfNotPresent
                name: run-script
                script: |
                  #!/bin/sh
                  nvidia-smi
                securityContext:
                  allowPrivilegeEscalation: false
                  capabilities:
                    drop:
                      - ALL
                  runAsNonRoot: true
                  runAsUser: 65532
                  seccompProfile:
                    type: RuntimeDefault
          timeout: 30m0s
  4. Create a PipelineRun resource in Alauda Container Platform through the Web CLI or the UI:

    apiVersion: tekton.dev/v1
    kind: PipelineRun
    metadata:
      generateName: test-
      labels:
        tekton.dev/pipeline: test
        kueue.x-k8s.io/queue-name: test
      namespace: test-1
    spec:
      pipelineRef:
        name: test
      taskRunTemplate:
        podTemplate:
          securityContext:
            fsGroup: 65532
            fsGroupChangePolicy: OnRootMismatch
        serviceAccountName: default
      timeouts:
        pipeline: 1h0m0s
    1. The kueue.x-k8s.io/queue-name: test label specifies the LocalQueue that manages all Pods of this PipelineRun.
    2. spec.pipelineRef.name specifies the Pipeline resource referenced by this PipelineRun.
  5. Observe the Pod of the PipelineRun:

    kubectl -n test-1 get pod | grep test

    You will see that the Pod is in the SchedulingGated state:

    test-dw4q7-run-script-pod   0/1     SchedulingGated   0          13s
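The gating follows from the vGPU arithmetic. Assuming the nvidia.com/total-* demand is the per-GPU limit multiplied by the number of requested GPUs (nvidia.com/gpualloc) — an assumption consistent with the quota change in the next step — the step needs 2 × 50 = 100 total-gpucores against a quota of only 50, while its 2 × 8000 = 16000 total-gpumem fits within 20000:

```python
# Assumed aggregation: total demand = per-GPU limit x requested GPU count.
# (This mirrors how the nvidia.com/total-* quotas appear to relate to the
# per-GPU limits in the step's computeResources; it is an inference from
# the quota update in the next step, not a documented formula.)
gpualloc = 2      # nvidia.com/gpualloc
gpucores = 50     # nvidia.com/gpucores (per GPU)
gpumem = 8000     # nvidia.com/gpumem: 8k -> 8000

total_gpucores = gpualloc * gpucores  # 100
total_gpumem = gpualloc * gpumem      # 16000

print(total_gpucores <= 50)     # fits the original gpucores quota of 50?
print(total_gpumem <= 20000)    # fits the gpumem quota of 20000?
```

With the original quota, total-gpucores does not fit, so Kueue keeps the Pod gated; raising the quota to 100 in the next step makes the workload admissible.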
  6. Increase the nvidia.com/total-gpucores quota from 50 to 100:

    cat <<EOF | kubectl apply -f -
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: ClusterQueue
    metadata:
      name: cluster-queue
    spec:
      namespaceSelector: {}
      resourceGroups:
      - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 9
          - name: "memory"
            nominalQuota: 36Gi
          - name: "pods"
            nominalQuota: 5
          - name: "nvidia.com/gpualloc"
            nominalQuota: "2"
          - name: "nvidia.com/total-gpucores"
            nominalQuota: "100"
          - name: "nvidia.com/total-gpumem"
            nominalQuota: "20000"
    EOF

    You will see that the Pod is now in the Running state:

    test-dw4q7-run-script-pod   1/1     Running   0          13s