扩展推理运行时

简介

本文将引导你了解如何为 Large Language Models (LLMs) 以及图像分类、目标检测和文本分类等其他模型添加新的推理运行时。

Alauda AI 自带一个内置的 vLLM 推理引擎。通过自定义推理运行时，你可以引入其他推理引擎，例如 Seldon MLServer、 Triton Inference Server。

通过引入自定义运行时，你可以扩展平台对更广泛模型类型和 GPU 类型的支持，并针对特定场景优化性能，以满足更广泛的业务需求。

本节将演示如何通过自定义 Xinference 服务运行时扩展当前 AI 平台，以部署 LLMs 并暴露一个兼容 OpenAI 的 API。

场景

如果你遇到以下任一情况，可以考虑扩展 AI 平台的 Inference Service 运行时：

支持新的模型类型：当前默认推理运行时 vLLM 原生不支持你的模型。
兼容其他硬件类型：你需要在 AMD GPU 或 Huawei Ascend NPU 等硬件上执行 LLM 推理。
特定场景下的性能优化：在某些推理场景中，新运行时（例如 Xinference）相比现有运行时可能提供更好的性能或资源利用率。
自定义推理逻辑：你需要引入自定义推理逻辑或依赖库，而这些很难在现有默认运行时中实现。

前提条件

开始之前，请确保满足以下条件：

你的 ACP 集群已部署并且运行正常。
你的 AI Platform 版本为1.3 或更高。
你已准备好必要的推理运行时镜像。例如，对于 Xinference 运行时，镜像可能类似于 xprobe/xinference:v1.2.2（GPU 版本）或 xprobe/xinference:v1.2.2-cpu（CPU 版本）。
你拥有集群管理员权限（创建 CRD 实例所需）。

标准工作流（示例：Xinference）

按照以下步骤扩展平台。我们使用 Xinference 作为基础示例来演示标准流程。

创建 Inference Runtime 资源

你需要基于目标硬件环境（GPU/CPU/NPU）创建相应的 Inference Runtime ClusterServingRuntime 资源。

准备 Runtime YAML 配置：

根据你要添加的运行时类型（例如 Xinference）以及目标硬件环境，准备相应的 YAML 配置文件。以下是 Xinference 运行时在不同硬件环境下的示例：

GPU Runtime 示例

# This is a sample YAML for Xinference GPU runtime
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: aml-xinference-cuda-12.1 # Name of the runtime resource
  labels:
    cpaas.io/runtime-class: xinference # required runtime type label
    cpaas.io/accelerator-type: "nvidia"
    cpaas.io/cuda-version: "12.1"
  annotations:
    cpaas.io/display-name: xinference-cuda-12.1 # Display name in the UI
spec:
  containers:
  - name: kserve-container
    image: xprobe/xinference:v1.2.2  # Replace with your actual GPU runtime image
    env:
    # Required across all runtimes – path to the model directory
    - name: MODEL_PATH
      value: /mnt/models/{{ index .Annotations "aml-model-repo" }}
    # The MODEL_UID parameter is optional for other runtimes.
    - name: MODEL_UID 
      value: '{{ index .Annotations "aml-model-repo" }}'
    # The MODEL_ENGINE parameter is required by the Xinference runtime, while it can be omitted for other runtimes.
    - name: MODEL_ENGINE 
      value: "transformers"
    # Required parameter for xinference runtime, please set it based on your model family, value: "llama" # e.g., "llama", "chatglm", etc.
    - name: MODEL_FAMILY 
      value: ""
    command:
    - bash
    - -c
    - |
        set +e
        if [ "${MODEL_PATH}" == "" ]; then
            echo "Need to set MODEL_PATH!"
            exit 1
        fi
        if [ "${MODEL_ENGINE}" == "" ]; then
            echo "Need to set MODEL_ENGINE!"
            exit 1
        fi
        if [ "${MODEL_UID}" == "" ]; then
            echo "Need to set MODEL_UID!"
            exit 1
        fi
        if [ "${MODEL_FAMILY}" == "" ]; then
            echo "Need to set MODEL_FAMILY!"
            exit 1
        fi

        xinference-local --host 0.0.0.0 --port 8080 &
        PID=$!
        while [ true ];
        do
            curl http://127.0.0.1:8080/docs
            if [ $? -eq 0 ]; then
                break
            else
                echo "waiting xinference-local server to become ready..."
                sleep 1
            fi
        done

        set -e
        xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} -e http://127.0.0.1:8080 $@
        xinference list -e http://127.0.0.1:8080
        echo "model load succeeded, waiting server process: ${PID}..."
        wait ${PID}
    # Add this line to use $@ in the script:
    # see: https://unix.stackexchange.com/questions/144514/add-arguments-to-bash-c
    - bash
    resources:
      limits:
        cpu: 2
        memory: 6Gi
      requests:
        cpu: 2
        memory: 6Gi
    startupProbe:
      httpGet:
        path: /docs
        port: 8080
        scheme: HTTP
      failureThreshold: 60 
      periodSeconds: 10
      timeoutSeconds: 10
  supportedModelFormats:
    - name: transformers # The model format supported by the runtime
      version: "1"

提示：请务必将 image 字段值替换为你实际准备的运行时镜像路径。你也可以修改 annotations.cpaas.io/display-name 字段，以自定义 AI Platform UI 中该运行时的显示名称。

应用 YAML 文件创建资源：

在具有集群管理员权限的终端中，执行以下命令应用你的 YAML 文件并创建推理运行时资源：
kubectl apply -f your-xinference-runtime.yaml
TIP
- 重要提示：请参考上面的示例，并根据你的实际环境和推理需求创建/配置运行时。 这些示例仅供参考。你需要调整诸如镜像、资源 limits 和 requests 等参数，以确保运行时与你的模型和硬件环境兼容并高效运行。
- 注意：只有在创建运行时资源后，才能在推理服务发布页面使用这个自定义运行时！

发布 Xinference Inference Service 并选择运行时

一旦 Xinference 推理运行时资源成功创建，你就可以在 AI Platform 上发布 LLM Inference Service 时选择并配置它。

为模型配置推理框架：

确保你即将发布的模型仓库的模型详情页上，已通过 File Management 元数据编辑功能选择了合适的framework。此处选择的 framework 值必须与创建推理服务运行时时 supportedModelFormats 字段中包含的某个值匹配。请确保模型的 framework 值已列在该推理运行时的 supportedModelFormats 字段中。
进入 Inference Service 发布页面：

登录 AI Platform，进入 “Inference Services” 或 “Model Deployment” 模块，然后点击 “Publish Inference Service”。
选择 Xinference 运行时：

在推理服务创建向导中，找到 “Runtime” 或 “Inference Framework” 选项。从下拉菜单或列表中选择你在第 1 步创建的 Xinference 运行时（例如 “Xinference CPU Runtime” 或 “Xinference GPU Runtime (CUDA)”）。

设置环境变量： Xinference 运行时需要特定的环境变量才能正常工作。在推理服务配置页面中，找到 “Environment Variables” 或 “More Settings” 部分，并添加以下环境变量：

环境变量

参数名	描述
`MODEL_FAMILY`	必填。指定你正在部署的 LLM 模型所属的 family 类型。Xinference 使用此参数识别并加载该模型对应的正确推理逻辑。例如，如果你部署的是 Llama 3 模型，请将其设置为 `llama`；如果是 ChatGLM 模型，请将其设置为 `chatglm`。请根据模型的实际 family 进行设置。

示例：
- 变量名：MODEL_FAMILY
- 变量值：llama（如果你使用的是 Llama 系列模型，请查看 docs 了解更多详情。或者你也可以运行 xinference registrations -t LLM 来列出所有受支持的模型 family。）

特定运行时示例

在了解标准工作流后，可参考以下示例了解其他运行时的具体配置。

MLServer

MLServer 运行时用途广泛，可用于 NVIDIA GPU 和 CPU。

kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
  annotations:
    cpaas.io/display-name: mlserver-cuda11.6-x86-arm
  creationTimestamp: 2026-01-05T07:02:33Z
  generation: 1
  labels:
    cpaas.io/accelerator-type: nvidia
    cpaas.io/cuda-version: "11.6"
    cpaas.io/runtime-class: mlserver
  name: aml-mlserver-cuda-11.6
spec:
  containers:
    - command:
        - /bin/bash
        - -lc
        - |
          if [ "$MODEL_TYPE" = "text-to-image" ]; then
            MODEL_IMPL="mlserver_diffusers.StableDiffusionRuntime"
          else
            MODEL_IMPL="mlserver_huggingface.HuggingFaceRuntime"
          fi            

          MODEL_DIR="${MLSERVER_MODEL_URI}/${MLSERVER_MODEL_NAME}"
          # a. using git lfs storage initializer, model will be in /mnt/models/<model_name>
          # b. using hf storage initializer, model will be in /mnt/models
          if [ ! -d "${MODEL_DIR}" ]; then
              MODEL_DIR="${MLSERVER_MODEL_URI}"
              echo "[WARNING] Model directory ${MODEL_DIR}/${MLSERVER_MODEL_NAME} not found, using ${MODEL_DIR} instead"
          fi

          export MLSERVER_MODEL_IMPLEMENTATION=${MODEL_IMPL}
          export MLSERVER_MODEL_EXTRA="{\"task\":\"${MODEL_TYPE}\",\"pretrained_model\":\"${MODEL_DIR}\"}"            

          mlserver start $MLSERVER_MODEL_URI $@
        - bash
      env:
        - name: MLSERVER_MODEL_URI
          value: /mnt/models
        - name: MLSERVER_MODEL_NAME
          value: '{{ index .Annotations "aml-model-repo" }}'
        - name: MODEL_TYPE
          value: '{{ index .Annotations "aml-pipeline-tag" }}'
      image: alaudadockerhub/seldon-mlserver:1.6.0-cu116-v1.3.1
      name: kserve-container
      resources:
        limits:
          cpu: 2
          memory: 6Gi
        requests:
          cpu: 2
          memory: 6Gi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
        runAsUser: 1000
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /v2/models/{{ index .Annotations "aml-model-repo" }}/ready
          port: 8080
          scheme: HTTP
        periodSeconds: 10
        timeoutSeconds: 10
  labels:
    modelClass: mlserver_sklearn.SKLearnModel
  supportedModelFormats:
    - name: mlflow
      version: "1"
    - name: transformers
      version: "1"

Triton Inference Server

Triton Inference Server 运行时专为 NVIDIA GPU 设计，并支持多种模型格式。与 MLServer 类似，你需要先创建 ClusterServingRuntime 资源，然后再创建推理服务。

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  annotations:
    cpaas.io/display-name: triton-cuda12-x86
  labels:
    cpaas.io/accelerator-type: nvidia
    cpaas.io/cuda-version: "12.1"
    cpaas.io/runtime-class: triton
  name: aml-triton-cuda-12
spec:
  containers:
    - command:
        - /bin/bash
        - -c
        - >
          tritonserver --log-verbose=1 --http-port=8080
          --model-repository=/mnt/models
      env:
        - name: OMP_NUM_THREADS
          value: "1"
        - name: MODEL_REPO
          value: '{{ index .Annotations "aml-model-repo" }}'
      image: alaudadockerhub/tritonserver:25.02-py3
      name: kserve-container
      resources:
        limits:
          cpu: 2
          memory: 6Gi
        requests:
          cpu: 2
          memory: 6Gi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
        runAsUser: 1000
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /v2/models/{{ index .Annotations "aml-model-repo" }}/ready
          port: 8080
          scheme: HTTP
        periodSeconds: 10
        timeoutSeconds: 10
  supportedModelFormats:
    - name: triton
      version: "1"

使用说明：

创建 ClusterServingRuntime：使用 kubectl apply -f triton-runtime.yaml 应用上述 YAML 配置
准备你的模型：确保你的模型是 Triton 支持的格式（例如 TensorFlow、PyTorch、ONNX）
设置模型框架：在模型仓库中，将 framework 元数据设置为 triton，以匹配 supportedModelFormats 字段
创建 Inference Service：发布推理服务时，从运行时下拉菜单中选择 Triton 运行时

vLLM-ascend（Ascend NPU）

vLLM-ascend 运行时适用于 Huawei Ascend NPU。它保留了 vLLM 的 OpenAI 兼容服务风格，但需要为可写路径和组权限额外配置一些 InferenceService 设置。

此示例已在 Ascend 910B4 上验证。它也应适用于其他 Ascend NPU 型号，但你应根据实际环境调整资源键、镜像以及相关版本字段。

1. ClusterServingRuntime

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  annotations:
    aml.cpaas.io/model-type: '["generative"]'
    aml.cpaas.io/user-create: "true"
    cpaas.io/display-name: vllm-ascend-cann8.5
    helm.sh/resource-policy: keep
  labels:
    cpaas.io/accelerator-type: ascend
    cpaas.io/cann-version: "8.5.1"
    cpaas.io/runtime-class: vllm
  name: aml-vllm-ascend-cann-8.5.1
spec:
  containers:
    - command:
        - bash
        - -c
        - >
          set -ex

          # 1. check model path

          MODEL_DIR="/mnt/models/${MODEL_NAME}"

          # a. using git lfs storage initializer, model will be in
          /mnt/models/<model_name>

          # b. using hf storage initializer, model will be in /mnt/models

          if [ ! -d "${MODEL_DIR}" ]; then
              MODEL_DIR="/mnt/models"
              echo "[WARNING] Model directory ${MODEL_DIR}/${MODEL_NAME} not found, using ${MODEL_DIR} instead"
          fi


          # 2. check if using gguf models

          c=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' | wc -l`

          echo "find ${c} gguf files"

          if [ "${c}" -gt 1 ]; then
              echo "[ERROR] More than one gguf file found in ${MODEL_DIR}"
              echo "Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use gguf-split tool to merge them to a single-file model."
              exit 1
          elif [ "${c}" -eq 1 ]; then
              n=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' -print`
              echo "[INFO] Using GGUF model file: ${n}"
              MODEL_PATH="${n}"
          else
              echo "[INFO] Using standard model directory"
              MODEL_PATH="${MODEL_DIR}"
          fi


          # 3. launch vllm server

          python3 -m vllm.entrypoints.openai.api_server \

          --port 8080 \

          --served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \

          --model ${MODEL_PATH} \

          --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \

          $@
        - bash
      env:
        - name: MODEL_NAME
          value: '{{ index .Annotations "aml-model-repo" }}'
        - name: GPU_MEMORY_UTILIZATION
          value: "0.95"
      image: quay.io/ascend/vllm-ascend:v0.18.0rc1
      name: kserve-container
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
      resources:
        limits:
          cpu: 2
          memory: 6Gi
        requests:
          cpu: 2
          memory: 6Gi
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
        privileged: false
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      startupProbe:
        exec:
          command:
            - sh
            - -c
            - >
              curl -s -o /dev/null -w "%{http_code}" -X POST
              "http://127.0.0.1:8080/v1/completions" -H "Content-Type:
              application/json" -d '{"model": "{{ .Name }}", "prompt": "ping"}'
              | grep -q "200"
        failureThreshold: 60
        periodSeconds: 10
        timeoutSeconds: 180
      volumeMounts:
        - mountPath: /dev/shm
          name: devshm
  protocolVersions:
    - v2
  supportedModelFormats:
    - name: transformers
      version: "1"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 1Gi
      name: devshm

2. InferenceService 示例所需更改

在使用 vLLM-ascend 发布推理服务时，需要对你的 InferenceService 示例进行以下必需更改：

kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
  name: qwen35
  namespace: demo
  annotations:
    aml-model-repo: Qwen3.5-0.8B
    modelFormat: transformers
    serving.kserve.io/deploymentMode: Standard
  labels:
    aml.cpaas.io/runtime-type: vllm
spec:
  predictor:
    model:
      env:
        - name: HOME
          value: /tmp
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: "4"
          huawei.com/Ascend910B4: "1"
          memory: 16Gi
        requests:
          cpu: "2"
          memory: 8Gi
      runtime: aml-vllm-ascend-0.18.0rc1
      storageUri: pvc://qwen35/Qwen3.5-0.8B
    securityContext:
      fsGroup: 1000
      seccompProfile:
        type: RuntimeDefault
      supplementalGroups:
        - 1000

HOME 将临时文件和缓存指向 /tmp，该目录对运行时容器可写。
fsGroup: 1000 会使挂载文件继承组 1000，有助于将文件权限与允许访问 Ascend 设备的组对齐。
supplementalGroups: [1000] 会将容器进程加入组 1000，使其能够以预期的组权限访问 Ascend 设备及相关挂载文件。

MindIE（Ascend NPU）

MindIE 专为 Huawei Ascend 硬件设计。它在资源管理和元数据方面的配置差异较大。

此示例已在 Ascend 310P 上验证。它也应适用于其他 Ascend NPU 型号，但你应根据实际环境调整镜像、资源配置以及相关版本字段。

1.ClusterServingRuntime

# This is a sample YAML for Ascend NPU runtime
kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
  annotations:
    cpaas.io/display-name: mindie-2.2RC1
  labels:
    cpaas.io/accelerator-type: npu
    cpaas.io/cann-version: 8.3.0
    cpaas.io/runtime-class: mindie
  name: mindie-2.2rc1-310p
spec:
  containers:
    - command:
        - bash
        - -c
        - |
          REAL_SCRIPT=$(echo "$RAW_SCRIPT" | sed 's/__LT__/\x3c/g')
          echo "$REAL_SCRIPT" > /tmp/startup.sh
          chmod +x /tmp/startup.sh
          
          if [ ! -d "${MODEL_PATH}" ]; then
              MODEL_DIR="/mnt/models"
              echo "[WARNING] Model directory ${MODEL_PATH} not found, using ${MODEL_DIR} instead"
              MODEL_PATH="${MODEL_DIR}"
          fi

          CONFIG_FILE="${MODEL_PATH}/config.json"
          echo "Checking for file: ${CONFIG_FILE}"

          ls -ld "${MODEL_PATH}"
          chmod -R 755 "${MODEL_PATH}"
          echo "Fixing MODEL_PATH permission..."
          ls -ld "${MODEL_PATH}"

          /tmp/startup.sh --model-name "${MODEL_NAME}" --model-path "${MODEL_PATH}" --ip "${MY_POD_IP}"
      env:
        - name: RAW_SCRIPT
          value: |
            #!/bin/bash
            #
            #  Copyright 2024 Huawei Technologies Co., Ltd
            #
            #  Licensed under the Apache License, Version 2.0 (the "License");
            #  you may not use this file except in compliance with the License.
            #  You may obtain a copy of the License at
            #
            #  http://www.apache.org/licenses/LICENSE-2.0
            #
            #  Unless required by applicable law or agreed to in writing, software
            #  distributed under the License is distributed on an "AS IS" BASIS,
            #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
            #  See the License for the specific language governing permissions and
            #  limitations under the License.
            #  ============================================================================
            #

            ##
            # Script Instruction
            ##
            ### Name:
            ### run_mindie.sh - Use to Start MindIE Service given a specific model
            ###
            ### Usage:
            ###   bash run_mindie.sh --model-name xxx --model-path /path/to/model
            ###
            ### Required:
            ###   --model-name             :Given a model name to identify MindIE Service.
            ###   --model-path             :Given a model path which contain necessary files such as yaml/conf.json/tokenizer/vocab etc.
            ### Options:
            ###   --help                   :Show this message.
            ###   --ip                     :The IP address bound to the MindIE Server business plane RESTful interface,default value: 127.0.0.1.
            ###   --port                   :The port bound to the MindIE Server business plane RESTful interface,default value: 1025.
            ###   --management-ip          :The IP address bound to the MindIE Server management plane RESTful interface,default value: 127.0.0.2.
            ###   --management-port        :The port bound to the MindIE Server management plane RESTful interface,default value: 1026.
            ###   --metrics-port           :The port bound to the performance indicator monitoring interface,default value: 1027.
            ###   --max-seq-len            :Maximum sequence length,default value: 2560.
            ###   --max-iter-times         :The global maximum output length of the model,default value: 512.
            ###   --max-input-token-len    :The maximum length of the token id,default value: 2048.
            ###   --max-prefill-tokens     :Each time prefill occurs, the total number of input tokens in the current batch,default value: 8192
            ###   --truncation             :Whether to perform parameter rationalization check interception,default value: false.
            ###   --template-type          :Reasoning type,default value: "Standard"
            ###   --max-preempt-count      :The upper limit of the maximum number of preemptible requests in each batch,default value: 0.
            ###   --support-select-batch   :Batch selection strategy,default value: false.
            ###   --npu-mem-size           :This can be used to apply for the upper limit of the KV Cache size in the NPU,default value: 8.
            ###   --max-prefill-batch-size :The maximum prefill batch size,default value: 50.
            ###   --world-size             :Enable several cards for inference.
            ###                             1. If it is not set, the parallel config in the YAML file is obtained by default. Set worldsize = dp*mp*pp.
            ###                             2. If set, modify the parallel config in the YAML file. set parallel config: dp:1 mp:worldSize pp:1
            ###   --ms-sched-host          :MS Scheduler IP address,default value: 127.0.0.1.
            ###   --ms-sched-port          :MS Scheduler port,default value: 8119.
            ###   For more details about config description, please check MindIE homepage: https://www.hiascend.com/document/detail/zh/mindie/10RC3/mindiellm/llmdev/mindie_llm0004.html
            help() {
              awk -F'### ' '/^###/ { print $2 }' "$0"
            }

            if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then
              help
              exit 1
            fi

            ##
            # Get device info
            ##
            total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)

            if [[ -z "$total_count" ]]; then
                echo "Error: Unable to retrieve device info. Please check if npu-smi is available for current user (id 1001), or if you are specifying an occupied device."
                exit 1
            fi

            echo "$total_count device(s) detected!"

            ##
            # Set toolkit envs
            ##
            echo "Setting toolkit envs..."
            if [[ -f "/usr/local/Ascend/ascend-toolkit/set_env.sh" ]];then
                            source /usr/local/Ascend/ascend-toolkit/set_env.sh
                    else
                            echo "ascend-toolkit package is incomplete please check it."
                            exit 1
                    fi
            echo "Toolkit envs set succeeded!"

            ##
            # Set MindIE envs
            ##
            echo "Setting MindIE envs..."
            if [[ -f "/usr/local/Ascend/mindie/set_env.sh" ]];then
                            source /usr/local/Ascend/mindie/set_env.sh
                    else
                            echo "mindie package is incomplete please check it."
                            exit 1
                    fi
            echo "MindIE envs set succeeded!"

            ##
            # Default MS envs
            ##

            # Set PYTHONPATH
            MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
            export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH

            ##
            # Receive args and modify config.json
            ##
            export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
            CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json
            echo "MindIE Service config path:$CONFIG_FILE"
            #default config
            BACKEND_TYPE="atb"
            MAX_SEQ_LEN=2560
            MAX_PREFILL_TOKENS=8192
            MAX_ITER_TIMES=512
            MAX_INPUT_TOKEN_LEN=2048
            TRUNCATION=false
            HTTPS_ENABLED=false
            MULTI_NODES_INFER_ENABLED=false
            NPU_MEM_SIZE=8
            MAX_PREFILL_BATCH_SIZE=50
            TEMPLATE_TYPE="Standard"
            MAX_PREEMPT_COUNT=0
            SUPPORT_SELECT_BATCH=false
            IP_ADDRESS="127.0.0.1"
            PORT=8080
            MANAGEMENT_IP_ADDRESS="127.0.0.2"
            MANAGEMENT_PORT=1026
            METRICS_PORT=1027

            #modify config
            while [[ "$#" -gt 0 ]]; do
                case $1 in
                    --model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
                    --model-name) MODEL_NAME="$2"; shift ;;
                    --max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
                    --max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
                    --max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
                    --max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
                    --truncation) TRUNCATION="$2"; shift ;;
                    --world-size) WORLD_SIZE="$2"; shift ;;
                    --template-type) TEMPLATE_TYPE="$2"; shift ;;
                    --max-preempt-count) MAX_PREEMPT_COUNT="$2"; shift ;;
                    --support-select-batch) SUPPORT_SELECT_BATCH="$2"; shift ;;
                    --npu-mem-size) NPU_MEM_SIZE="$2"; shift ;;
                    --max-prefill-batch-size) MAX_PREFILL_BATCH_SIZE="$2"; shift ;;
                    --ip) IP_ADDRESS="$2"; shift ;;
                    --port) PORT="$2"; shift ;;
                    --management-ip) MANAGEMENT_IP_ADDRESS="$2"; shift ;;
                    --management-port) MANAGEMENT_PORT="$2"; shift ;;
                    --metrics-port) METRICS_PORT="$2"; shift ;;
                    --ms-sched-host) ENV_MS_SCHED_HOST="$2"; shift ;;
                    --ms-sched-port) ENV_MS_SCHED_PORT="$2"; shift ;;
                    *)
                        echo "Unknown parameter: $1"
                        echo "Please check your inputs."
                        exit 1
                        ;;
                esac
                shift
            done

            if [ -z "$MODEL_WEIGHT_PATH" ] || [ -z "$MODEL_NAME" ]; then
                echo "Error: Both --model-path and --model-name are required."
                exit 1
            fi
            MODEL_NAME=${MODEL_NAME:-$(basename "$MODEL_WEIGHT_PATH")}
            echo "MODEL_NAME is set to: $MODEL_NAME"

            WORLD_SIZE=$total_count
            NPU_DEVICE_IDS=$(seq -s, 0 $(($WORLD_SIZE - 1)))

            #validate config
            if [[ "$BACKEND_TYPE" != "atb" ]]; then
                echo "Error: BACKEND must be 'atb'. Current value: $BACKEND_TYPE"
                exit 1
            fi

            if [[ ! "$IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] ||
              [[ ! "$MANAGEMENT_IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
                echo "Error: IP_ADDRESS and MANAGEMENT_IP_ADDRESS must be valid IP addresses. Current values: IP_ADDRESS=$IP_ADDRESS, MANAGEMENT_IP_ADDRESS=$MANAGEMENT_IP_ADDRESS"
                exit 1
            fi

            if [[ ! "$PORT" =~ ^[0-9]+$ ]] || (( PORT __LT__ 1025 || PORT > 65535 )) ||
              [[ ! "$MANAGEMENT_PORT" =~ ^[0-9]+$ ]] || (( MANAGEMENT_PORT __LT__ 1025 || MANAGEMENT_PORT > 65535 )); then
                echo "Error: PORT and MANAGEMENT_PORT must be integers between 1025 and 65535. Current values: PORT=$PORT, MANAGEMENT_PORT=$MANAGEMENT_PORT"
                exit 1
            fi

            if [ "$MAX_PREFILL_TOKENS" -lt "$MAX_SEQ_LEN" ]; then
                MAX_PREFILL_TOKENS=$MAX_SEQ_LEN
                echo "MAX_PREFILL_TOKENS was less than MAX_SEQ_LEN. Setting MAX_PREFILL_TOKENS to $MAX_SEQ_LEN"
            fi

            MODEL_CONFIG_FILE="${MODEL_WEIGHT_PATH}/config.json"
            if [ ! -f "$MODEL_CONFIG_FILE" ]; then
                echo "Error: config.json file not found in $MODEL_WEIGHT_PATH."
                exit 1
            fi
            chmod 600 "$MODEL_CONFIG_FILE"
            #update config file
            chmod u+w ${MIES_INSTALL_PATH}/conf/
            sed -i "s/\"backendType\"\s*:\s*\"[^\"]*\"/\"backendType\": \"$BACKEND_TYPE\"/" $CONFIG_FILE
            sed -i "s/\"modelName\"\s*:\s*\"[^\"]*\"/\"modelName\": \"$MODEL_NAME\"/" $CONFIG_FILE
            sed -i "s|\"modelWeightPath\"\s*:\s*\"[^\"]*\"|\"modelWeightPath\": \"$MODEL_WEIGHT_PATH\"|" $CONFIG_FILE
            sed -i "s/\"maxSeqLen\"\s*:\s*[0-9]*/\"maxSeqLen\": $MAX_SEQ_LEN/" "$CONFIG_FILE"
            sed -i "s/\"maxPrefillTokens\"\s*:\s*[0-9]*/\"maxPrefillTokens\": $MAX_PREFILL_TOKENS/" "$CONFIG_FILE"
            sed -i "s/\"maxIterTimes\"\s*:\s*[0-9]*/\"maxIterTimes\": $MAX_ITER_TIMES/" "$CONFIG_FILE"
            sed -i "s/\"maxInputTokenLen\"\s*:\s*[0-9]*/\"maxInputTokenLen\": $MAX_INPUT_TOKEN_LEN/" "$CONFIG_FILE"
            sed -i "s/\"truncation\"\s*:\s*[a-z]*/\"truncation\": $TRUNCATION/" "$CONFIG_FILE"
            sed -i "s|\(\"npuDeviceIds\"\s*:\s*\[\[\)[^]]*\(]]\)|\1$NPU_DEVICE_IDS\2|" "$CONFIG_FILE"
            sed -i "s/\"worldSize\"\s*:\s*[0-9]*/\"worldSize\": $WORLD_SIZE/" "$CONFIG_FILE"
            sed -i "s/\"httpsEnabled\"\s*:\s*[a-z]*/\"httpsEnabled\": $HTTPS_ENABLED/" "$CONFIG_FILE"
            sed -i "s/\"templateType\"\s*:\s*\"[^\"]*\"/\"templateType\": \"$TEMPLATE_TYPE\"/" $CONFIG_FILE
            sed -i "s/\"maxPreemptCount\"\s*:\s*[0-9]*/\"maxPreemptCount\": $MAX_PREEMPT_COUNT/" $CONFIG_FILE
            sed -i "s/\"supportSelectBatch\"\s*:\s*[a-z]*/\"supportSelectBatch\": $SUPPORT_SELECT_BATCH/" $CONFIG_FILE
            sed -i "s/\"multiNodesInferEnabled\"\s*:\s*[a-z]*/\"multiNodesInferEnabled\": $MULTI_NODES_INFER_ENABLED/" "$CONFIG_FILE"
            sed -i "s/\"maxPrefillBatchSize\"\s*:\s*[0-9]*/\"maxPrefillBatchSize\": $MAX_PREFILL_BATCH_SIZE/" "$CONFIG_FILE"
            sed -i "s/\"ipAddress\"\s*:\s*\"[^\"]*\"/\"ipAddress\": \"$IP_ADDRESS\"/" "$CONFIG_FILE"
            sed -i "s/\"port\"\s*:\s*[0-9]*/\"port\": $PORT/" "$CONFIG_FILE"
            sed -i "s/\"managementIpAddress\"\s*:\s*\"[^\"]*\"/\"managementIpAddress\": \"$MANAGEMENT_IP_ADDRESS\"/" "$CONFIG_FILE"
            sed -i "s/\"managementPort\"\s*:\s*[0-9]*/\"managementPort\": $MANAGEMENT_PORT/" "$CONFIG_FILE"
            sed -i "s/\"metricsPort\"\s*:\s*[0-9]*/\"metricsPort\": $METRICS_PORT/" $CONFIG_FILE
            sed -i "s/\"npuMemSize\"\s*:\s*-*[0-9]*/\"npuMemSize\": $NPU_MEM_SIZE/" "$CONFIG_FILE"

            ##
            # Start service
            ##
            echo "Current configurations are displayed as follows:"
            cat $CONFIG_FILE
            npu-smi info -m > ~/device_info

            ${MIES_INSTALL_PATH}/bin/mindieservice_daemon
        - name: MODEL_NAME
          value: '{{ index .Annotations "aml-model-repo" }}'
        - name: MODEL_PATH
          value: /mnt/models/{{ index .Annotations "aml-model-repo" }}
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-300I-Duo-py311-openeuler24.03-lts
      name: kserve-container
      resources:
        limits:
          cpu: 2
          memory: 6Gi
        requests:
          cpu: 2
          memory: 6Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /v1/models
          port: 8080
          scheme: HTTP
        periodSeconds: 10
        timeoutSeconds: 180
  supportedModelFormats:
    - name: transformers
      version: "1"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 8Gi
      name: dshm

2. InferenceService 的必需注解

与其他运行时不同，MindIE 在最终发布步骤中，InferenceService 的 metadata 必须包含以下注解。这样可以确保平台调度器正确地将 NPU 硬件绑定到该服务。

配置键	值	目的
`storage.kserve.io/readonly`	`"false"`	启用对模型存储卷的写访问。

3. 用户权限（Root 访问）

由于 Ascend 驱动和硬件抽象层的要求，MindIE 镜像必须以 root 用户运行。请确保你的 ClusterServingRuntime 或 InferenceService security context 已相应配置：

注意：上面的 MindIE ClusterServingRuntime YAML 示例未指定 securityContext，这意味着容器将使用镜像的默认设置运行（通常为 root）。与显式设置 runAsNonRoot: true 和 runAsUser: 1000 的 MLServer 不同，MindIE 需要 root 权限才能访问 NPU 硬件。

运行时配置对比

在继续之前，请参考下表了解不同运行时的具体要求：

Runtime	目标硬件	支持的框架	特殊要求
Xinference	CPU / NVIDIA GPU	transformers, pytorch	必须设置 `MODEL_FAMILY` 环境变量
MLServer	CPU / NVIDIA GPU	sklearn, xgboost, mlflow	标准配置
Triton	NVIDIA GPU	triton (TensorFlow, PyTorch, ONNX, etc.)	标准配置
vLLM-ascend	Huawei Ascend NPU（已在 910B4 上验证）	transformers	必须在 `InferenceService` 中添加 `HOME`、`fsGroup` 和 `supplementalGroups`
MindIE	Huawei Ascend NPU（已在 310P 上验证）	mindspore, transformers	必须在 `InferenceService` 中添加所需的 NPU 注解

#扩展推理运行时

#目录

#简介

#场景

#前提条件

#标准工作流（示例：Xinference）

#创建 Inference Runtime 资源

#发布 Xinference Inference Service 并选择运行时

#特定运行时示例

#MLServer

#Triton Inference Server

#vLLM-ascend（Ascend NPU）

#MindIE（Ascend NPU）

#运行时配置对比