Creating CPU-Only and GPU-Accelerated Hardware Profiles

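In a production AI platform, you typically need to serve different types of machine learning workloads. For example, traditional machine learning models (such as scikit-learn or XGBoost) and simple data-processing tasks need only CPU resources, while large language models (LLMs) and complex deep learning models require GPU acceleration.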

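By creating separate Hardware Profiles for CPU-only and GPU-accelerated workloads, you can isolate the two classes of service and prevent lightweight CPU models from inadvertently consuming expensive GPU resources.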

Example 1: CPU-only hardware profile

A CPU-only profile contains no accelerator identifiers (for example, nvidia.com/gpu) and relies solely on the cpu and memory identifiers.

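When creating a CPU-only profile, make sure that:

The Accelerator resource type is excluded entirely.
The Node Selector does not target any GPU-specific nodes.
The name and description clearly indicate that the profile is intended for standard ML inference or lightweight models.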
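
The first two checks above can be automated before a profile is applied. The following is a minimal sketch, assuming the manifest has already been loaded into a plain Python dict that mirrors the YAML layout. The function name cpu_only_violations and the GPU_HINTS keyword list are hypothetical and not part of any platform API; the keyword match is only a naming heuristic and should be adapted to the labels used in your cluster.

```python
# Hypothetical helper: flags profiles that are not strictly CPU-only.
# GPU_HINTS is an assumed naming convention, not a platform rule.
GPU_HINTS = ("gpu", "nvidia", "accelerator")


def cpu_only_violations(profile: dict) -> list:
    """Return the reasons why a profile does not qualify as CPU-only."""
    problems = []
    spec = profile.get("spec", {})

    # Check 1: no identifier may use the Accelerator resource type.
    for item in spec.get("identifiers", []):
        if item.get("resourceType") == "Accelerator":
            problems.append(f"accelerator identifier present: {item.get('identifier')}")

    # Check 2: node selector keys and values should not look GPU-specific.
    node = spec.get("scheduling", {}).get("node", {})
    for key, value in node.get("nodeSelector", {}).items():
        text = f"{key}={value}".lower()
        if any(hint in text for hint in GPU_HINTS):
            problems.append(f"node selector targets GPU nodes: {key}={value}")

    return problems


cpu_profile = {
    "spec": {
        "identifiers": [
            {"identifier": "cpu", "resourceType": "CPU"},
            {"identifier": "memory", "resourceType": "Memory"},
        ],
        "scheduling": {
            "type": "Node",
            "node": {"nodeSelector": {"node-role.kubernetes.io/worker": "true"}},
        },
    }
}

gpu_profile = {
    "spec": {
        "identifiers": [
            {"identifier": "nvidia.com/gpu", "resourceType": "Accelerator"},
            {"identifier": "cpu", "resourceType": "CPU"},
        ],
        "scheduling": {
            "type": "Node",
            "node": {"nodeSelector": {"accelerator": "nvidia-t4"}},
        },
    }
}

print(cpu_only_violations(cpu_profile))
print(cpu_only_violations(gpu_profile))
```

The third check, a clear name and description, is a matter of editorial judgment and is deliberately left out of the sketch.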

The following is an example of a CPU-only hardware profile:

apiVersion: infrastructure.opendatahub.io/v1alpha1
kind: HardwareProfile
metadata:
  name: standard-cpu-profile
  namespace: kube-public
spec:
  # Do not include nvidia.com/gpu
  identifiers:
    - identifier: "cpu"
      displayName: "CPU"
      minCount: "1"
      maxCount: "8"
      defaultCount: "2"
      resourceType: CPU
    - identifier: "memory"
      displayName: "Memory"
      minCount: "2Gi"
      maxCount: "16Gi"
      defaultCount: "4Gi"
      resourceType: Memory
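  # Standard CPU nodes. The selector must match your node labels exactly;
  # some distributions set this role label to an empty string rather than "true".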
  scheduling:
    type: Node
    node:
      nodeSelector:
        node-role.kubernetes.io/worker: "true"
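
Each identifier in a manifest like the one above carries a minimum, maximum, and default count, and a default outside the allowed range is an easy mistake to make. Whether the platform rejects such a manifest is not stated here, so a pre-apply check may be useful. The sketch below is a simplified illustration: parse_quantity and range_errors are hypothetical helpers, and the parser handles only plain integers and the binary suffixes Ki, Mi, Gi, and Ti, not the full Kubernetes quantity syntax.

```python
# Simplified parser: plain integers and binary suffixes only (an assumption;
# real Kubernetes quantities also allow forms such as "500m" or "1G").
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text) -> int:
    """Convert a count such as "8" or "16Gi" to an integer."""
    text = str(text).strip()
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def range_errors(identifiers: list) -> list:
    """Return the identifiers whose defaultCount is outside [minCount, maxCount]."""
    errors = []
    for item in identifiers:
        low = parse_quantity(item["minCount"])
        high = parse_quantity(item["maxCount"])
        default = parse_quantity(item["defaultCount"])
        if not low <= default <= high:
            errors.append(item["identifier"])
    return errors


# Values copied from the CPU-only example profile.
cpu_identifiers = [
    {"identifier": "cpu", "minCount": "1", "maxCount": "8", "defaultCount": "2"},
    {"identifier": "memory", "minCount": "2Gi", "maxCount": "16Gi", "defaultCount": "4Gi"},
]

print(range_errors(cpu_identifiers))  # prints []
```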

Example 2: GPU-accelerated hardware profile

A GPU-accelerated profile explicitly requires the nvidia.com/gpu identifier, which ensures that any workload selecting the profile is allocated physical GPU resources.

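When creating a GPU-accelerated profile:

Include the identifier for the specific accelerator (for example, nvidia.com/gpu).
If the GPU nodes are tainted, add the matching Tolerations (for example, nvidia.com/gpu:NoSchedule).
Optionally, add a Node Selector to target a specific GPU architecture (for example, accelerator: nvidia-t4).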
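
To see why a toleration is needed on tainted GPU nodes, it can help to model how Kubernetes decides whether a toleration matches a taint. The sketch below reimplements the documented matching rules in plain Python for illustration only; the function name tolerates is hypothetical, and the real decision is made by the Kubernetes scheduler, not by user code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint under Kubernetes rules."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False

    operator = toleration.get("operator", "Equal")  # Equal is the Kubernetes default
    key = toleration.get("key", "")

    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint key.
        return key == "" or key == taint.get("key")

    # Equal: key and value must both match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


# Taint assumed to be set on the GPU nodes (nvidia.com/gpu:NoSchedule).
gpu_taint = {"key": "nvidia.com/gpu", "effect": "NoSchedule"}

# Toleration in the style used by a GPU-accelerated profile.
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}

print(tolerates(gpu_toleration, gpu_taint))  # prints True
print(tolerates({"key": "other", "operator": "Exists"}, gpu_taint))  # prints False
```

A workload without a matching toleration is kept off the tainted GPU nodes, which is the same isolation the CPU-only profile relies on.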

The following is an example of a GPU-accelerated hardware profile:

apiVersion: infrastructure.opendatahub.io/v1alpha1
kind: HardwareProfile
metadata:
  name: gpu-t4-profile
  namespace: kube-public
spec:
  identifiers:
    # Crucially include the GPU resource
    - identifier: "nvidia.com/gpu"
      displayName: "GPU"
      minCount: "1"
      maxCount: "4"
      defaultCount: "1"
      resourceType: Accelerator
    - identifier: "cpu"
      displayName: "CPU"
      minCount: "4"
      maxCount: "16"
      defaultCount: "8"
      resourceType: CPU
    - identifier: "memory"
      displayName: "Memory"
      minCount: "16Gi"
      maxCount: "64Gi"
      defaultCount: "32Gi"
      resourceType: Memory
  scheduling:
    type: Node
    node:
      nodeSelector:
        accelerator: nvidia-t4
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
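
How the platform applies a profile to a workload is not described in this section. As a rough mental model only, the sketch below shows one way the profile fields could map onto pod-level settings. The function render_pod_fields is hypothetical, and setting requests equal to limits for every resource is a simplifying assumption. Kubernetes does require requests and limits to be equal for extended resources such as nvidia.com/gpu when both are set.

```python
from typing import Optional


def render_pod_fields(profile: dict, counts: Optional[dict] = None) -> dict:
    """Build container resources and scheduling fields from a profile dict."""
    counts = counts or {}
    spec = profile["spec"]

    requests, limits = {}, {}
    for item in spec["identifiers"]:
        name = item["identifier"]
        # Use the user-selected count if given, otherwise the profile default.
        amount = str(counts.get(name, item["defaultCount"]))
        requests[name] = amount
        limits[name] = amount  # assumption: requests mirror limits

    node = spec.get("scheduling", {}).get("node", {})
    return {
        "resources": {"requests": requests, "limits": limits},
        "nodeSelector": dict(node.get("nodeSelector", {})),
        "tolerations": list(node.get("tolerations", [])),
    }


# Dict form of the GPU example profile (only the fields the sketch reads).
gpu_profile = {
    "spec": {
        "identifiers": [
            {"identifier": "nvidia.com/gpu", "defaultCount": "1"},
            {"identifier": "cpu", "defaultCount": "8"},
            {"identifier": "memory", "defaultCount": "32Gi"},
        ],
        "scheduling": {
            "type": "Node",
            "node": {
                "nodeSelector": {"accelerator": "nvidia-t4"},
                "tolerations": [
                    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
                ],
            },
        },
    }
}

fields = render_pod_fields(gpu_profile, {"nvidia.com/gpu": "2"})
print(fields["resources"]["limits"])
print(fields["nodeSelector"])
```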

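By offering these two distinct profiles, platform administrators can give data scientists the environment they need without wasting high-value compute resources on simple tasks.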
