Create an Inference Service Using the CLI

This document describes how to create a KServe inference service from the command line and provides an example YAML file.

Prerequisites

  • You have access to a Kubernetes cluster with KServe installed.
  • You have configured kubectl to communicate with your cluster.
  • You have a namespace and permission to create InferenceService resources in it.
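
You can verify these prerequisites from the command line before proceeding. The commands below are a minimal sketch; they assume the standard KServe CRD name and use a placeholder namespace:

# Confirm that the KServe InferenceService CRD is installed
kubectl get crd inferenceservices.serving.kserve.io

# Confirm that you are allowed to create InferenceService resources in your namespace
kubectl auth can-i create inferenceservices.serving.kserve.io -n <your-namespace>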

Create the InferenceService

Create a YAML file named qwen-2-vllm.yaml with the following content:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    # 1. Optional: model repository reference annotations
    aml-model-repo: Qwen2.5-0.5B-Instruct
    aml-model-repo-branch: main
    aml-model-repo-id: "5010"
    aml-model-repo-tag-commit: 81e07c4246b054b3f3bd2b2c998fdd6da98b1d7f
    aml-pipeline-tag: text-generation
    cpaas.io/description: ""
    serving.knative.dev/progress-deadline: 2400s
    serving.kserve.io/deploymentMode: Standard
  labels:
    # 2. Required: model repository labels
    aml-model-group: kubeflow-admin-cpaas-io
    aml-model-repo: Qwen2.5-0.5B-Instruct
    aml-model-subgroup: amlmodels
    aml-pipeline-tag: text-generation
    aml.cpaas.io/runtime-type: vllm
    service.subdomain: "qwen-2-mlops-demo-ai-test"
    # 3. Required: service name and namespace
  name: qwen-2
  namespace: mlops-demo-ai-test
spec:
  predictor:
    # 4. GPU affinity settings
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: nvidia.com/cuda.runtime.major
              operator: In
              values:
              - "12"
            - key: nvidia.com/cuda.runtime.minor
              operator: Gt
              values:
              - "5"
          weight: 100
    maxReplicas: 1
    minReplicas: 1
    model:
      command:
      - bash
      - -c
      - |
        set -ex
        # 1. Check the GPU count
        GPU_COUNT=$(python3 -c "import torch; print(torch.cuda.device_count())")
        echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"
        if [ ${GPU_COUNT} -lt 1 ]; then
            echo "No GPUs found. Please check if the container have aquired any GPU device"
            exit 1
        fi

        # 2. Check the model path
        MODEL_DIR="/mnt/models/${MODEL_NAME}"
        # a. With the git lfs storage initializer, the model is located at /mnt/models/<model_name>
        # b. With the hf storage initializer, the model is located at /mnt/models
        if [ ! -d "${MODEL_DIR}" ]; then
            MODEL_DIR="/mnt/models"
            echo "[WARNING] Model directory ${MODEL_DIR}/${MODEL_NAME} not found, using ${MODEL_DIR} instead"
        fi

        # 3. Check whether a GGUF model is being used
        c=$(find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' | wc -l)
        echo "find ${c} gguf files"
        if [ "${c}" -gt 1 ]; then
            echo "[ERROR] More than one gguf file found in ${MODEL_DIR}"
            echo "Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use gguf-split tool to merge them to a single-file model."
            exit 1
        elif [ "${c}" -eq 1 ]; then
            n=$(find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' -print)
            echo "[INFO] Using GGUF model file: ${n}"
            MODEL_PATH="${n}"
        else
            echo "[INFO] Using standard model directory"
            MODEL_PATH="${MODEL_DIR}"
        fi

        # 4. Start the vLLM server
        if [ "$ENABLE_CHUNKED_PREFILL" = "False" ]; then
            PARAM_ENABLE_CHUNKED_PREFILL="--no-enable-chunked-prefill"
        else
            PARAM_ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
        fi
        if [ "$ENFORCE_EAGER" = "True" ]; then
            PARAM_ENFORCE_EAGER="--enforce-eager"
        else
            PARAM_ENFORCE_EAGER=""
        fi
        python3 -m vllm.entrypoints.openai.api_server \
        --port 8080 \
        --served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \
        --model ${MODEL_PATH} \
        --dtype ${DTYPE} \
        --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
        --tensor-parallel-size ${GPU_COUNT} \
        ${PARAM_ENFORCE_EAGER} ${PARAM_ENABLE_CHUNKED_PREFILL} $@
      - bash # becomes $0 of the -c script, so any extra runtime arguments map to $@
      env:
      - name: DTYPE
        value: half
      - name: ENABLE_CHUNKED_PREFILL
        value: "False"
      - name: ENFORCE_EAGER
        value: "True"
      - name: GPU_MEMORY_UTILIZATION
        value: "0.95"
      - name: MODEL_NAME
        value: '{{ index .Annotations "aml-model-repo" }}'
      modelFormat:
        name: transformers
      name: ""
      protocolVersion: v2
      # 5. Required: resources used by this service
      resources:
        limits:
          cpu: "4"
          ephemeral-storage: 10Gi
          memory: 16Gi
          nvidia.com/gpualloc: "1"
          nvidia.com/gpucores: "50"
          nvidia.com/gpumem: "8192"
        requests:
          cpu: "1"
          memory: 2Gi
      # 6. Required: select the ClusterServingRuntime and the storageUri
      runtime: aml-vllm-0.9.2-cuda-12.6
      storageUri: hf://kubeflow-admin-cpaas-io/Qwen2.5-0.5B-Instruct:81e07c4246b054b3f3bd2b2c998fdd6da98b1d7f
    securityContext:
      seccompProfile:
        type: RuntimeDefault 
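
The nodeAffinity block above prefers GPU nodes whose CUDA runtime labels report a 12.x runtime with a minor version greater than 5 (that is, 12.6 or newer). As a quick check, assuming your GPU nodes expose the nvidia.com/cuda.runtime.* labels (for example, via NVIDIA GPU Feature Discovery), you can list the label values per node:

kubectl get nodes -L nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor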

Then apply the YAML file to your cluster with kubectl. The namespace passed with -n must match the metadata.namespace set in the file:

kubectl apply -f qwen-2-vllm.yaml -n <your-namespace>
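
Optionally, you can validate the manifest against the API server without creating anything by running a server-side dry run first:

kubectl apply -f qwen-2-vllm.yaml -n <your-namespace> --dry-run=server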

Check the InferenceService Status

You can check the status of the InferenceService with the following command:

kubectl get inferenceservice qwen-2 -n <your-namespace>

The output should look similar to the following:

NAME     URL                                           READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION      AGE
qwen-2   http://qwen-2.mlops-demo-ai-test.example.com  True           100                              qwen-2-predictor-00001   2m

When the READY column shows True, the InferenceService is ready to accept requests.
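
You can also wait for readiness or inspect the underlying workload directly. The commands below are a sketch using the name and namespace from the example; the pod label selector shown is the one KServe typically applies to predictor pods:

# Block until the InferenceService reports the Ready condition (or the timeout expires)
kubectl wait inferenceservice/qwen-2 -n mlops-demo-ai-test --for=condition=Ready --timeout=10m

# If it does not become Ready, inspect the resource and its predictor pods
kubectl describe inferenceservice qwen-2 -n mlops-demo-ai-test
kubectl get pods -n mlops-demo-ai-test -l serving.kserve.io/inferenceservice=qwen-2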

Test the InferenceService Using the OpenAI-Compatible API

Once the InferenceService is ready, you can test it using the OpenAI-compatible API.

  1. Get the URL of the InferenceService:

    SERVICE_URL=$(kubectl get inferenceservice qwen-2 -n <your-namespace> -o jsonpath='{.status.url}')
    echo $SERVICE_URL
  2. Send a request to the /v1/chat/completions endpoint using curl:

    curl -X POST ${SERVICE_URL}/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen-2",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Kubernetes?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
      }'

    Note: the model parameter in the request body must match one of the --served-model-name values configured in the InferenceService YAML (in this example, qwen-2 or mlops-demo-ai-test/qwen-2).
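
    To see which model names the server actually exposes, you can query the /v1/models endpoint of the OpenAI-compatible API using the same SERVICE_URL:

    curl ${SERVICE_URL}/v1/models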

The response should contain the text generated by the model.