Fine-tuning with Kubeflow Trainer v2

This tutorial shows how to run a supervised fine-tuning (SFT) job on Kubernetes using Kubeflow Trainer v2 together with LlamaFactory.

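Overview

Kubeflow Trainer v2 separates the job template (TrainingRuntime) from the job run (TrainJob), which lets you:

Define a reusable TrainingRuntime that contains the container image, the training pipeline steps (dataset init → model init → trainer), and the LlamaFactory configuration.
Submit multiple TrainJob runs that reference the same runtime and override only what changes between experiments: the base model, dataset URL, hyperparameters, or GPU resources.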
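
As an illustration of this split, the following is a minimal sketch of a TrainJob that reuses an existing TrainingRuntime and overrides only per-experiment values. It assumes the `trainer.kubeflow.org/v1alpha1` API version; the runtime name `llamafactory-sft`, the environment variable names, and the repository URLs are hypothetical placeholders, not values taken from the notebook. Check the field names against your installed CRD (for example with `kubectl explain trainjob.spec`) before relying on them.

```yaml
# Sketch only: per-experiment TrainJob referencing a shared TrainingRuntime.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainJob
metadata:
  generateName: trainjob-sft-qwen3-
  namespace: mlops-demo-ai-test
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: TrainingRuntime
    name: llamafactory-sft                  # hypothetical runtime name
  trainer:
    # Override only what differs for this experiment.
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: BASE_MODEL_URL                # hypothetical variable name
        value: "https://git.example.com/models/base-model.git"
      - name: DATASET_URL                   # hypothetical variable name
        value: "https://git.example.com/datasets/sft-data.git"
      - name: LEARNING_RATE                 # hypothetical variable name
        value: "1e-4"
```

Because the image and pipeline steps live in the runtime, a second experiment would only need a new TrainJob with different `env` or `resourcesPerNode` values.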

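Prerequisites

Before you begin, make sure the following requirements are met:

Requirement	Details
Kubeflow Trainer v2	Installed in your cluster (the `trainer.kubeflow.org` API group is available)
Kueue	Installed in your cluster for job scheduling and quota management (optional but recommended)
Shared PVC	A `PersistentVolumeClaim` accessible to all Pods (for example `team-model-cache-pvc`), backed by NFS, Ceph, or local storage such as topolvm
Git credentials	A Kubernetes `Secret` named `aml-image-builder-secret` containing the keys `MODEL_REPO_GIT_USER` and `MODEL_REPO_GIT_TOKEN`, used to access private Git repositories
GPU nodes	Nodes equipped with NVIDIA GPUs; the examples use `Tesla-T4` nodes, so adjust the `nodeSelector` to match your cluster
`kubectl` access	`kubectl` configured with permission to create `TrainingRuntime` and `TrainJob` resources in the target namespace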
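
If the Git credentials Secret does not exist yet, it could look roughly like the sketch below. The Secret name and the two keys come from the table above; the namespace and the placeholder values are assumptions and must be replaced with your own.

```yaml
# Sketch only: Git credentials Secret with the keys the examples expect.
apiVersion: v1
kind: Secret
metadata:
  name: aml-image-builder-secret
  namespace: mlops-demo-ai-test              # assumed namespace; use your own
type: Opaque
stringData:
  MODEL_REPO_GIT_USER: "REPLACE_WITH_GIT_USERNAME"    # placeholder value
  MODEL_REPO_GIT_TOKEN: "REPLACE_WITH_GIT_TOKEN"      # placeholder value
```

Using `stringData` lets you write the values in plain text; Kubernetes stores them base64-encoded under `data`.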

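RBAC permissions

If you hit RBAC permission errors while creating or managing Kubeflow Trainer v2 resources, stop and contact your cluster administrator before continuing. Ask the administrator to create a temporary Role and bind it to your account or namespace so that you have read and write access to the trainjobs and trainingruntimes custom resources.

The following example shows how a cluster administrator can grant these permissions to the ServiceAccount used by a workbench named aml-editor: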

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aml-editor-trainer-rw
  namespace: mlops-demo-ai-test
rules:
  - apiGroups:
      - trainer.kubeflow.org
    resources:
      - trainjobs
      - trainingruntimes
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aml-editor-trainer-rw
  namespace: mlops-demo-ai-test
subjects:
  - kind: ServiceAccount
    name: aml-editor
    namespace: mlops-demo-ai-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aml-editor-trainer-rw

Replace mlops-demo-ai-test with the namespace where the workbench and the Trainer v2 resources run.

Build a Trainer image or use the prebuilt image

You can use our prebuilt image alaudadockerhub/fine_tune_with_llamafactory:v0.1.11, or build your own image from the Containerfile provided in aml-docs.

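Download the notebook and run the example

Download the notebook into your current workbench in Alauda AI; if you do not have a workbench yet, create one, then open the notebook.
Follow the instructions in the notebook to create a TrainingRuntime and submit a TrainJob that fine-tunes a model with LlamaFactory. The notebook includes example configuration that uses the team-model-cache-pvc shared PVC and the Git credentials.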

Fine-tuning on Ascend NPUs with MindSpeed-LLM

For Huawei Ascend NPU clusters, use the MindSpeed-LLM NPU notebook instead of the LlamaFactory GPU notebook.

The MindSpeed-LLM notebook demonstrates how to:

Use the prebuilt alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7 image.
Create a Trainer v2 TrainingRuntime with runtimeClassName: ascend and schedulerName: hami-scheduler.
Submit a Qwen3 fine-tuning TrainJob that requests Ascend resources (for example huawei.com/Ascend910B4).
Run the MindSpeed-LLM workflow: Hugging Face checkpoint conversion, dataset preprocessing, and SFT training.

Use this notebook when your cluster provides Ascend NPUs and the training image must include torch_npu, mindspeed, and mindspeed_llm.
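
To show where the Ascend-specific settings would sit, here is a minimal sketch of a TrainingRuntime pod template. The `runtimeClassName`, `schedulerName`, image, and resource name come from the list above. The runtime name, the namespace, the resource count, and the exact nesting (the `trainer.kubeflow.org/v1alpha1` schema with a JobSet-style `replicatedJobs` entry named `node`) are assumptions; the notebook remains the authoritative version.

```yaml
# Sketch only: TrainingRuntime fragment for Ascend NPU nodes.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainingRuntime
metadata:
  name: mindspeed-llm-npu                   # hypothetical runtime name
  namespace: mlops-demo-ai-test             # assumed namespace; use your own
spec:
  mlPolicy:
    numNodes: 1
  template:
    spec:
      replicatedJobs:
        - name: node                        # assumed trainer step name
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  runtimeClassName: ascend
                  schedulerName: hami-scheduler
                  containers:
                    - name: node
                      image: alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7
                      resources:
                        limits:
                          huawei.com/Ascend910B4: 1   # assumed count
```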

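Scheduling with Kueue

Kueue provides job queueing, quota management, and fair scheduling for Kubernetes workloads. When Kueue is installed in your cluster, a TrainJob submitted to a Kueue queue stays suspended until Kueue admits it based on available quota.

How it works

The cluster administrator creates a ClusterQueue that defines resource quotas (CPU, memory, GPU).
The namespace administrator creates a LocalQueue that points to that ClusterQueue.
Users add the kueue.x-k8s.io/queue-name label to their TrainJob to submit it to a LocalQueue.
Kueue evaluates the resource requests, admits the workload once quota is available, and unsuspends the job.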
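
For orientation, the cluster-side half of this flow might look like the sketch below: a ResourceFlavor plus a ClusterQueue named cluster-queue. The field layout follows the ClusterQueue schema as commonly shown for earlier Kueue API versions and is assumed to carry over to `kueue.x-k8s.io/v1beta2`; the flavor name and all quota numbers are illustrative placeholders, not recommendations.

```yaml
# Sketch only: quota definition created by the cluster administrator.
apiVersion: kueue.x-k8s.io/v1beta2          # assumed to match the installed Kueue
kind: ResourceFlavor
metadata:
  name: default-flavor                      # hypothetical flavor name
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}                     # accept LocalQueues from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 16              # illustrative quota
            - name: memory
              nominalQuota: 64Gi            # illustrative quota
            - name: nvidia.com/gpu
              nominalQuota: 2               # illustrative quota
```

A TrainJob whose total requests exceed the remaining quota stays suspended until other workloads finish and release capacity.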

For more details on configuring ClusterQueue and LocalQueue, see the Kueue documentation.

Create a LocalQueue (optional)

Before submitting a TrainJob through Kueue, create a LocalQueue in your namespace that references an existing ClusterQueue:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: local-queue
  namespace: mlops-demo-ai-test
spec:
  clusterQueue: cluster-queue

Save the manifest as kf-local-queue.yaml, then apply it:

kubectl apply -f kf-local-queue.yaml

Submit a TrainJob with Kueue (optional)

To integrate with Kueue, add the kueue.x-k8s.io/queue-name label to the TrainJob's metadata.labels. This tells Kueue which LocalQueue the job belongs to:

metadata:
  generateName: trainjob-sft-qwen3-
  namespace: mlops-demo-ai-test
  labels:
    kueue.x-k8s.io/queue-name: local-queue

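The rest of the TrainJob spec stays unchanged. See the notebook for a complete example.

NOTE

When Kueue is enabled, the cluster may be configured with a PodsReady timeout (for example 5 minutes). If your training image is large and not yet cached on the node, the first attempt may be evicted because the image pull exceeds the timeout. Resubmitting the job usually succeeds, because the image is then cached locally.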
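
For reference, a PodsReady timeout of this kind is typically set by the cluster administrator in the Kueue controller configuration. The fragment below is a sketch that assumes the `config.kueue.x-k8s.io/v1beta1` Configuration format; newer Kueue releases may use a different API version or field set, so treat it as a pointer to what to ask your administrator about rather than something to apply as-is.

```yaml
# Sketch only: fragment of the Kueue controller Configuration (admin-managed).
apiVersion: config.kueue.x-k8s.io/v1beta1   # assumed configuration API version
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 5m        # workloads whose Pods are not ready in time are evicted
```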
