#扩展推理运行时
#目录
简介场景前提条件标准工作流(示例:Xinference)特定运行时示例MLServerTriton Inference ServervLLM-ascend(Ascend NPU)MindIE(Ascend NPU)运行时配置对比#简介
本文将引导你了解如何为 Large Language Models (LLMs) 以及图像分类、目标检测和文本分类等其他模型添加新的推理运行时。
Alauda AI 自带一个内置的 vLLM 推理引擎。通过自定义推理运行时,你可以引入其他推理引擎,例如
Seldon MLServer、
Triton Inference Server。
通过引入自定义运行时,你可以扩展平台对更广泛模型类型和 GPU 类型的支持,并针对特定场景优化性能,以满足更广泛的业务需求。
本节将演示如何通过自定义 Xinference 服务运行时扩展当前 AI 平台,以部署 LLMs 并暴露一个兼容 OpenAI 的 API。
#场景
如果你遇到以下任一情况,可以考虑扩展 AI 平台的 Inference Service 运行时:
- 支持新的模型类型:当前默认推理运行时
vLLM原生不支持你的模型。 - 兼容其他硬件类型:你需要在 AMD GPU 或 Huawei Ascend NPU 等硬件上执行 LLM 推理。
- 特定场景下的性能优化:在某些推理场景中,新运行时(例如 Xinference)相比现有运行时可能提供更好的性能或资源利用率。
- 自定义推理逻辑:你需要引入自定义推理逻辑或依赖库,而这些很难在现有默认运行时中实现。
#前提条件
开始之前,请确保满足以下条件:
- 你的 ACP 集群已部署并且运行正常。
- 你的 AI Platform 版本为1.3 或更高。
- 你已准备好必要的推理运行时镜像。例如,对于 Xinference 运行时,镜像可能类似于
xprobe/xinference:v1.2.2(GPU 版本)或xprobe/xinference:v1.2.2-cpu(CPU 版本)。 - 你拥有集群管理员权限(创建 CRD 实例所需)。
#标准工作流(示例:Xinference)
按照以下步骤扩展平台。我们使用 Xinference 作为基础示例来演示标准流程。
#创建 Inference Runtime 资源
你需要基于目标硬件环境(GPU/CPU/NPU)创建相应的 Inference Runtime ClusterServingRuntime 资源。
-
准备 Runtime YAML 配置:
根据你要添加的运行时类型(例如 Xinference)以及目标硬件环境,准备相应的 YAML 配置文件。以下是 Xinference 运行时在不同硬件环境下的示例:
- GPU Runtime 示例
# This is a sample YAML for Xinference GPU runtime apiVersion: serving.kserve.io/v1alpha1 kind: ClusterServingRuntime metadata: name: aml-xinference-cuda-12.1 # Name of the runtime resource labels: cpaas.io/runtime-class: xinference # required runtime type label cpaas.io/accelerator-type: "nvidia" cpaas.io/cuda-version: "12.1" annotations: cpaas.io/display-name: xinference-cuda-12.1 # Display name in the UI spec: containers: - name: kserve-container image: xprobe/xinference:v1.2.2 # Replace with your actual GPU runtime image env: # Required across all runtimes – path to the model directory - name: MODEL_PATH value: /mnt/models/{{ index .Annotations "aml-model-repo" }} # The MODEL_UID parameter is optional for other runtimes. - name: MODEL_UID value: '{{ index .Annotations "aml-model-repo" }}' # The MODEL_ENGINE parameter is required by the Xinference runtime, while it can be omitted for other runtimes. - name: MODEL_ENGINE value: "transformers" # Required parameter for xinference runtime, please set it based on your model family, value: "llama" # e.g., "llama", "chatglm", etc. - name: MODEL_FAMILY value: "" command: - bash - -c - | set +e if [ "${MODEL_PATH}" == "" ]; then echo "Need to set MODEL_PATH!" exit 1 fi if [ "${MODEL_ENGINE}" == "" ]; then echo "Need to set MODEL_ENGINE!" exit 1 fi if [ "${MODEL_UID}" == "" ]; then echo "Need to set MODEL_UID!" exit 1 fi if [ "${MODEL_FAMILY}" == "" ]; then echo "Need to set MODEL_FAMILY!" exit 1 fi xinference-local --host 0.0.0.0 --port 8080 & PID=$! while [ true ]; do curl http://127.0.0.1:8080/docs if [ $? -eq 0 ]; then break else echo "waiting xinference-local server to become ready..." sleep 1 fi done set -e xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} -e http://127.0.0.1:8080 $@ xinference list -e http://127.0.0.1:8080 echo "model load succeeded, waiting server process: ${PID}..." wait ${PID} # Add this line to use $@ in the script: # see: https://unix.stackexchange.com/questions/144514/add-arguments-to-bash-c - bash resources: limits: cpu: 2 memory: 6Gi requests: cpu: 2 memory: 6Gi startupProbe: httpGet: path: /docs port: 8080 scheme: HTTP failureThreshold: 60 periodSeconds: 10 timeoutSeconds: 10 supportedModelFormats: - name: transformers # The model format supported by the runtime version: "1"- 提示:请务必将
image字段值替换为你实际准备的运行时镜像路径。你也可以修改annotations.cpaas.io/display-name字段,以自定义 AI Platform UI 中该运行时的显示名称。
- 提示:请务必将
-
应用 YAML 文件创建资源:
在具有集群管理员权限的终端中,执行以下命令应用你的 YAML 文件并创建推理运行时资源:
kubectl apply -f your-xinference-runtime.yamlTIP- 重要提示:请参考上面的示例,并根据你的实际环境和推理需求创建/配置运行时。 这些示例仅供参考。你需要调整诸如镜像、资源
limits和requests等参数,以确保运行时与你的模型和硬件环境兼容并高效运行。 - 注意:只有在创建运行时资源后,才能在推理服务发布页面使用这个自定义运行时!
- 重要提示:请参考上面的示例,并根据你的实际环境和推理需求创建/配置运行时。 这些示例仅供参考。你需要调整诸如镜像、资源
#发布 Xinference Inference Service 并选择运行时
一旦 Xinference 推理运行时资源成功创建,你就可以在 AI Platform 上发布 LLM Inference Service 时选择并配置它。
-
为模型配置推理框架:
确保你即将发布的模型仓库的模型详情页上,已通过 File Management 元数据编辑功能选择了合适的framework。此处选择的 framework 值必须与创建推理服务运行时时
supportedModelFormats字段中包含的某个值匹配。请确保模型的 framework 值已列在该推理运行时的supportedModelFormats字段中。 -
进入 Inference Service 发布页面:
登录 AI Platform,进入 “Inference Services” 或 “Model Deployment” 模块,然后点击 “Publish Inference Service”。
-
选择 Xinference 运行时:
在推理服务创建向导中,找到 “Runtime” 或 “Inference Framework” 选项。从下拉菜单或列表中选择你在第 1 步创建的 Xinference 运行时(例如 “Xinference CPU Runtime” 或 “Xinference GPU Runtime (CUDA)”)。
-
设置环境变量: Xinference 运行时需要特定的环境变量才能正常工作。在推理服务配置页面中,找到 “Environment Variables” 或 “More Settings” 部分,并添加以下环境变量:
-
环境变量
参数名 描述 MODEL_FAMILY必填。指定你正在部署的 LLM 模型所属的 family 类型。Xinference 使用此参数识别并加载该模型对应的正确推理逻辑。例如,如果你部署的是 Llama 3 模型,请将其设置为 llama;如果是 ChatGLM 模型,请将其设置为chatglm。请根据模型的实际 family 进行设置。 -
示例:
- 变量名:
MODEL_FAMILY - 变量值:
llama(如果你使用的是 Llama 系列模型,请查看 docs 了解更多详情。或者你也可以运行xinference registrations -t LLM来列出所有受支持的模型 family。)
- 变量名:
-
#特定运行时示例
在了解标准工作流后,可参考以下示例了解其他运行时的具体配置。
#MLServer
MLServer 运行时用途广泛,可用于 NVIDIA GPU 和 CPU。
kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
annotations:
cpaas.io/display-name: mlserver-cuda11.6-x86-arm
creationTimestamp: 2026-01-05T07:02:33Z
generation: 1
labels:
cpaas.io/accelerator-type: nvidia
cpaas.io/cuda-version: "11.6"
cpaas.io/runtime-class: mlserver
name: aml-mlserver-cuda-11.6
spec:
containers:
- command:
- /bin/bash
- -lc
- |
if [ "$MODEL_TYPE" = "text-to-image" ]; then
MODEL_IMPL="mlserver_diffusers.StableDiffusionRuntime"
else
MODEL_IMPL="mlserver_huggingface.HuggingFaceRuntime"
fi
MODEL_DIR="${MLSERVER_MODEL_URI}/${MLSERVER_MODEL_NAME}"
# a. using git lfs storage initializer, model will be in /mnt/models/<model_name>
# b. using hf storage initializer, model will be in /mnt/models
if [ ! -d "${MODEL_DIR}" ]; then
MODEL_DIR="${MLSERVER_MODEL_URI}"
echo "[WARNING] Model directory ${MODEL_DIR}/${MLSERVER_MODEL_NAME} not found, using ${MODEL_DIR} instead"
fi
export MLSERVER_MODEL_IMPLEMENTATION=${MODEL_IMPL}
export MLSERVER_MODEL_EXTRA="{\"task\":\"${MODEL_TYPE}\",\"pretrained_model\":\"${MODEL_DIR}\"}"
mlserver start $MLSERVER_MODEL_URI $@
- bash
env:
- name: MLSERVER_MODEL_URI
value: /mnt/models
- name: MLSERVER_MODEL_NAME
value: '{{ index .Annotations "aml-model-repo" }}'
- name: MODEL_TYPE
value: '{{ index .Annotations "aml-pipeline-tag" }}'
image: alaudadockerhub/seldon-mlserver:1.6.0-cu116-v1.3.1
name: kserve-container
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
runAsUser: 1000
startupProbe:
failureThreshold: 60
httpGet:
path: /v2/models/{{ index .Annotations "aml-model-repo" }}/ready
port: 8080
scheme: HTTP
periodSeconds: 10
timeoutSeconds: 10
labels:
modelClass: mlserver_sklearn.SKLearnModel
supportedModelFormats:
- name: mlflow
version: "1"
- name: transformers
version: "1"
#Triton Inference Server
Triton Inference Server 运行时专为 NVIDIA GPU 设计,并支持多种模型格式。与 MLServer 类似,你需要先创建 ClusterServingRuntime 资源,然后再创建推理服务。
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
annotations:
cpaas.io/display-name: triton-cuda12-x86
labels:
cpaas.io/accelerator-type: nvidia
cpaas.io/cuda-version: "12.1"
cpaas.io/runtime-class: triton
name: aml-triton-cuda-12
spec:
containers:
- command:
- /bin/bash
- -c
- >
tritonserver --log-verbose=1 --http-port=8080
--model-repository=/mnt/models
env:
- name: OMP_NUM_THREADS
value: "1"
- name: MODEL_REPO
value: '{{ index .Annotations "aml-model-repo" }}'
image: alaudadockerhub/tritonserver:25.02-py3
name: kserve-container
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
runAsUser: 1000
startupProbe:
failureThreshold: 60
httpGet:
path: /v2/models/{{ index .Annotations "aml-model-repo" }}/ready
port: 8080
scheme: HTTP
periodSeconds: 10
timeoutSeconds: 10
supportedModelFormats:
- name: triton
version: "1"使用说明:
- 创建 ClusterServingRuntime:使用
kubectl apply -f triton-runtime.yaml应用上述 YAML 配置 - 准备你的模型:确保你的模型是 Triton 支持的格式(例如 TensorFlow、PyTorch、ONNX)
- 设置模型框架:在模型仓库中,将 framework 元数据设置为
triton,以匹配supportedModelFormats字段 - 创建 Inference Service:发布推理服务时,从运行时下拉菜单中选择 Triton 运行时
#vLLM-ascend(Ascend NPU)
vLLM-ascend 运行时适用于 Huawei Ascend NPU。它保留了 vLLM 的 OpenAI 兼容服务风格,但需要为可写路径和组权限额外配置一些 InferenceService 设置。
此示例已在 Ascend 910B4 上验证。它也应适用于其他 Ascend NPU 型号,但你应根据实际环境调整资源键、镜像以及相关版本字段。
1. ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
annotations:
aml.cpaas.io/model-type: '["generative"]'
aml.cpaas.io/user-create: "true"
cpaas.io/display-name: vllm-ascend-cann8.5
helm.sh/resource-policy: keep
labels:
cpaas.io/accelerator-type: ascend
cpaas.io/cann-version: "8.5.1"
cpaas.io/runtime-class: vllm
name: aml-vllm-ascend-cann-8.5.1
spec:
containers:
- command:
- bash
- -c
- >
set -ex
# 1. check model path
MODEL_DIR="/mnt/models/${MODEL_NAME}"
# a. using git lfs storage initializer, model will be in
/mnt/models/<model_name>
# b. using hf storage initializer, model will be in /mnt/models
if [ ! -d "${MODEL_DIR}" ]; then
MODEL_DIR="/mnt/models"
echo "[WARNING] Model directory ${MODEL_DIR}/${MODEL_NAME} not found, using ${MODEL_DIR} instead"
fi
# 2. check if using gguf models
c=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' | wc -l`
echo "find ${c} gguf files"
if [ "${c}" -gt 1 ]; then
echo "[ERROR] More than one gguf file found in ${MODEL_DIR}"
echo "Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use gguf-split tool to merge them to a single-file model."
exit 1
elif [ "${c}" -eq 1 ]; then
n=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' -print`
echo "[INFO] Using GGUF model file: ${n}"
MODEL_PATH="${n}"
else
echo "[INFO] Using standard model directory"
MODEL_PATH="${MODEL_DIR}"
fi
# 3. launch vllm server
python3 -m vllm.entrypoints.openai.api_server \
--port 8080 \
--served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \
--model ${MODEL_PATH} \
--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
$@
- bash
env:
- name: MODEL_NAME
value: '{{ index .Annotations "aml-model-repo" }}'
- name: GPU_MEMORY_UTILIZATION
value: "0.95"
image: quay.io/ascend/vllm-ascend:v0.18.0rc1
name: kserve-container
ports:
- containerPort: 8080
name: http1
protocol: TCP
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
runAsUser: 65534
seccompProfile:
type: RuntimeDefault
startupProbe:
exec:
command:
- sh
- -c
- >
curl -s -o /dev/null -w "%{http_code}" -X POST
"http://127.0.0.1:8080/v1/completions" -H "Content-Type:
application/json" -d '{"model": "{{ .Name }}", "prompt": "ping"}'
| grep -q "200"
failureThreshold: 60
periodSeconds: 10
timeoutSeconds: 180
volumeMounts:
- mountPath: /dev/shm
name: devshm
protocolVersions:
- v2
supportedModelFormats:
- name: transformers
version: "1"
volumes:
- emptyDir:
medium: Memory
sizeLimit: 1Gi
name: devshm2. InferenceService 示例所需更改
在使用 vLLM-ascend 发布推理服务时,需要对你的 InferenceService 示例进行以下必需更改:
kind: InferenceService
apiVersion: serving.kserve.io/v1beta1
metadata:
name: qwen35
namespace: demo
annotations:
aml-model-repo: Qwen3.5-0.8B
modelFormat: transformers
serving.kserve.io/deploymentMode: Standard
labels:
aml.cpaas.io/runtime-type: vllm
spec:
predictor:
model:
env:
- name: HOME
value: /tmp
modelFormat:
name: transformers
protocolVersion: v2
resources:
limits:
cpu: "4"
huawei.com/Ascend910B4: "1"
memory: 16Gi
requests:
cpu: "2"
memory: 8Gi
runtime: aml-vllm-ascend-0.18.0rc1
storageUri: pvc://qwen35/Qwen3.5-0.8B
securityContext:
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
supplementalGroups:
- 1000HOME将临时文件和缓存指向/tmp,该目录对运行时容器可写。fsGroup: 1000会使挂载文件继承组1000,有助于将文件权限与允许访问 Ascend 设备的组对齐。supplementalGroups: [1000]会将容器进程加入组1000,使其能够以预期的组权限访问 Ascend 设备及相关挂载文件。
#MindIE(Ascend NPU)
MindIE 专为 Huawei Ascend 硬件设计。它在资源管理和元数据方面的配置差异较大。
此示例已在 Ascend 310P 上验证。它也应适用于其他 Ascend NPU 型号,但你应根据实际环境调整镜像、资源配置以及相关版本字段。
1.ClusterServingRuntime
# This is a sample YAML for Ascend NPU runtime
kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
annotations:
cpaas.io/display-name: mindie-2.2RC1
labels:
cpaas.io/accelerator-type: npu
cpaas.io/cann-version: 8.3.0
cpaas.io/runtime-class: mindie
name: mindie-2.2rc1-310p
spec:
containers:
- command:
- bash
- -c
- |
REAL_SCRIPT=$(echo "$RAW_SCRIPT" | sed 's/__LT__/\x3c/g')
echo "$REAL_SCRIPT" > /tmp/startup.sh
chmod +x /tmp/startup.sh
if [ ! -d "${MODEL_PATH}" ]; then
MODEL_DIR="/mnt/models"
echo "[WARNING] Model directory ${MODEL_PATH} not found, using ${MODEL_DIR} instead"
MODEL_PATH="${MODEL_DIR}"
fi
CONFIG_FILE="${MODEL_PATH}/config.json"
echo "Checking for file: ${CONFIG_FILE}"
ls -ld "${MODEL_PATH}"
chmod -R 755 "${MODEL_PATH}"
echo "Fixing MODEL_PATH permission..."
ls -ld "${MODEL_PATH}"
/tmp/startup.sh --model-name "${MODEL_NAME}" --model-path "${MODEL_PATH}" --ip "${MY_POD_IP}"
env:
- name: RAW_SCRIPT
value: |
#!/bin/bash
#
# Copyright 2024 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
#
##
# Script Instruction
##
### Name:
### run_mindie.sh - Use to Start MindIE Service given a specific model
###
### Usage:
### bash run_mindie.sh --model-name xxx --model-path /path/to/model
###
### Required:
### --model-name :Given a model name to identify MindIE Service.
### --model-path :Given a model path which contain necessary files such as yaml/conf.json/tokenizer/vocab etc.
### Options:
### --help :Show this message.
### --ip :The IP address bound to the MindIE Server business plane RESTful interface,default value: 127.0.0.1.
### --port :The port bound to the MindIE Server business plane RESTful interface,default value: 1025.
### --management-ip :The IP address bound to the MindIE Server management plane RESTful interface,default value: 127.0.0.2.
### --management-port :The port bound to the MindIE Server management plane RESTful interface,default value: 1026.
### --metrics-port :The port bound to the performance indicator monitoring interface,default value: 1027.
### --max-seq-len :Maximum sequence length,default value: 2560.
### --max-iter-times :The global maximum output length of the model,default value: 512.
### --max-input-token-len :The maximum length of the token id,default value: 2048.
### --max-prefill-tokens :Each time prefill occurs, the total number of input tokens in the current batch,default value: 8192
### --truncation :Whether to perform parameter rationalization check interception,default value: false.
### --template-type :Reasoning type,default value: "Standard"
### --max-preempt-count :The upper limit of the maximum number of preemptible requests in each batch,default value: 0.
### --support-select-batch :Batch selection strategy,default value: false.
### --npu-mem-size :This can be used to apply for the upper limit of the KV Cache size in the NPU,default value: 8.
### --max-prefill-batch-size :The maximum prefill batch size,default value: 50.
### --world-size :Enable several cards for inference.
### 1. If it is not set, the parallel config in the YAML file is obtained by default. Set worldsize = dp*mp*pp.
### 2. If set, modify the parallel config in the YAML file. set parallel config: dp:1 mp:worldSize pp:1
### --ms-sched-host :MS Scheduler IP address,default value: 127.0.0.1.
### --ms-sched-port :MS Scheduler port,default value: 8119.
### For more details about config description, please check MindIE homepage: https://www.hiascend.com/document/detail/zh/mindie/10RC3/mindiellm/llmdev/mindie_llm0004.html
help() {
awk -F'### ' '/^###/ { print $2 }' "$0"
}
if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then
help
exit 1
fi
##
# Get device info
##
total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
if [[ -z "$total_count" ]]; then
echo "Error: Unable to retrieve device info. Please check if npu-smi is available for current user (id 1001), or if you are specifying an occupied device."
exit 1
fi
echo "$total_count device(s) detected!"
##
# Set toolkit envs
##
echo "Setting toolkit envs..."
if [[ -f "/usr/local/Ascend/ascend-toolkit/set_env.sh" ]];then
source /usr/local/Ascend/ascend-toolkit/set_env.sh
else
echo "ascend-toolkit package is incomplete please check it."
exit 1
fi
echo "Toolkit envs set succeeded!"
##
# Set MindIE envs
##
echo "Setting MindIE envs..."
if [[ -f "/usr/local/Ascend/mindie/set_env.sh" ]];then
source /usr/local/Ascend/mindie/set_env.sh
else
echo "mindie package is incomplete please check it."
exit 1
fi
echo "MindIE envs set succeeded!"
##
# Default MS envs
##
# Set PYTHONPATH
MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH
##
# Receive args and modify config.json
##
export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json
echo "MindIE Service config path:$CONFIG_FILE"
#default config
BACKEND_TYPE="atb"
MAX_SEQ_LEN=2560
MAX_PREFILL_TOKENS=8192
MAX_ITER_TIMES=512
MAX_INPUT_TOKEN_LEN=2048
TRUNCATION=false
HTTPS_ENABLED=false
MULTI_NODES_INFER_ENABLED=false
NPU_MEM_SIZE=8
MAX_PREFILL_BATCH_SIZE=50
TEMPLATE_TYPE="Standard"
MAX_PREEMPT_COUNT=0
SUPPORT_SELECT_BATCH=false
IP_ADDRESS="127.0.0.1"
PORT=8080
MANAGEMENT_IP_ADDRESS="127.0.0.2"
MANAGEMENT_PORT=1026
METRICS_PORT=1027
#modify config
while [[ "$#" -gt 0 ]]; do
case $1 in
--model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
--model-name) MODEL_NAME="$2"; shift ;;
--max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
--max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
--max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
--max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
--truncation) TRUNCATION="$2"; shift ;;
--world-size) WORLD_SIZE="$2"; shift ;;
--template-type) TEMPLATE_TYPE="$2"; shift ;;
--max-preempt-count) MAX_PREEMPT_COUNT="$2"; shift ;;
--support-select-batch) SUPPORT_SELECT_BATCH="$2"; shift ;;
--npu-mem-size) NPU_MEM_SIZE="$2"; shift ;;
--max-prefill-batch-size) MAX_PREFILL_BATCH_SIZE="$2"; shift ;;
--ip) IP_ADDRESS="$2"; shift ;;
--port) PORT="$2"; shift ;;
--management-ip) MANAGEMENT_IP_ADDRESS="$2"; shift ;;
--management-port) MANAGEMENT_PORT="$2"; shift ;;
--metrics-port) METRICS_PORT="$2"; shift ;;
--ms-sched-host) ENV_MS_SCHED_HOST="$2"; shift ;;
--ms-sched-port) ENV_MS_SCHED_PORT="$2"; shift ;;
*)
echo "Unknown parameter: $1"
echo "Please check your inputs."
exit 1
;;
esac
shift
done
if [ -z "$MODEL_WEIGHT_PATH" ] || [ -z "$MODEL_NAME" ]; then
echo "Error: Both --model-path and --model-name are required."
exit 1
fi
MODEL_NAME=${MODEL_NAME:-$(basename "$MODEL_WEIGHT_PATH")}
echo "MODEL_NAME is set to: $MODEL_NAME"
WORLD_SIZE=$total_count
NPU_DEVICE_IDS=$(seq -s, 0 $(($WORLD_SIZE - 1)))
#validate config
if [[ "$BACKEND_TYPE" != "atb" ]]; then
echo "Error: BACKEND must be 'atb'. Current value: $BACKEND_TYPE"
exit 1
fi
if [[ ! "$IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] ||
[[ ! "$MANAGEMENT_IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
echo "Error: IP_ADDRESS and MANAGEMENT_IP_ADDRESS must be valid IP addresses. Current values: IP_ADDRESS=$IP_ADDRESS, MANAGEMENT_IP_ADDRESS=$MANAGEMENT_IP_ADDRESS"
exit 1
fi
if [[ ! "$PORT" =~ ^[0-9]+$ ]] || (( PORT __LT__ 1025 || PORT > 65535 )) ||
[[ ! "$MANAGEMENT_PORT" =~ ^[0-9]+$ ]] || (( MANAGEMENT_PORT __LT__ 1025 || MANAGEMENT_PORT > 65535 )); then
echo "Error: PORT and MANAGEMENT_PORT must be integers between 1025 and 65535. Current values: PORT=$PORT, MANAGEMENT_PORT=$MANAGEMENT_PORT"
exit 1
fi
if [ "$MAX_PREFILL_TOKENS" -lt "$MAX_SEQ_LEN" ]; then
MAX_PREFILL_TOKENS=$MAX_SEQ_LEN
echo "MAX_PREFILL_TOKENS was less than MAX_SEQ_LEN. Setting MAX_PREFILL_TOKENS to $MAX_SEQ_LEN"
fi
MODEL_CONFIG_FILE="${MODEL_WEIGHT_PATH}/config.json"
if [ ! -f "$MODEL_CONFIG_FILE" ]; then
echo "Error: config.json file not found in $MODEL_WEIGHT_PATH."
exit 1
fi
chmod 600 "$MODEL_CONFIG_FILE"
#update config file
chmod u+w ${MIES_INSTALL_PATH}/conf/
sed -i "s/\"backendType\"\s*:\s*\"[^\"]*\"/\"backendType\": \"$BACKEND_TYPE\"/" $CONFIG_FILE
sed -i "s/\"modelName\"\s*:\s*\"[^\"]*\"/\"modelName\": \"$MODEL_NAME\"/" $CONFIG_FILE
sed -i "s|\"modelWeightPath\"\s*:\s*\"[^\"]*\"|\"modelWeightPath\": \"$MODEL_WEIGHT_PATH\"|" $CONFIG_FILE
sed -i "s/\"maxSeqLen\"\s*:\s*[0-9]*/\"maxSeqLen\": $MAX_SEQ_LEN/" "$CONFIG_FILE"
sed -i "s/\"maxPrefillTokens\"\s*:\s*[0-9]*/\"maxPrefillTokens\": $MAX_PREFILL_TOKENS/" "$CONFIG_FILE"
sed -i "s/\"maxIterTimes\"\s*:\s*[0-9]*/\"maxIterTimes\": $MAX_ITER_TIMES/" "$CONFIG_FILE"
sed -i "s/\"maxInputTokenLen\"\s*:\s*[0-9]*/\"maxInputTokenLen\": $MAX_INPUT_TOKEN_LEN/" "$CONFIG_FILE"
sed -i "s/\"truncation\"\s*:\s*[a-z]*/\"truncation\": $TRUNCATION/" "$CONFIG_FILE"
sed -i "s|\(\"npuDeviceIds\"\s*:\s*\[\[\)[^]]*\(]]\)|\1$NPU_DEVICE_IDS\2|" "$CONFIG_FILE"
sed -i "s/\"worldSize\"\s*:\s*[0-9]*/\"worldSize\": $WORLD_SIZE/" "$CONFIG_FILE"
sed -i "s/\"httpsEnabled\"\s*:\s*[a-z]*/\"httpsEnabled\": $HTTPS_ENABLED/" "$CONFIG_FILE"
sed -i "s/\"templateType\"\s*:\s*\"[^\"]*\"/\"templateType\": \"$TEMPLATE_TYPE\"/" $CONFIG_FILE
sed -i "s/\"maxPreemptCount\"\s*:\s*[0-9]*/\"maxPreemptCount\": $MAX_PREEMPT_COUNT/" $CONFIG_FILE
sed -i "s/\"supportSelectBatch\"\s*:\s*[a-z]*/\"supportSelectBatch\": $SUPPORT_SELECT_BATCH/" $CONFIG_FILE
sed -i "s/\"multiNodesInferEnabled\"\s*:\s*[a-z]*/\"multiNodesInferEnabled\": $MULTI_NODES_INFER_ENABLED/" "$CONFIG_FILE"
sed -i "s/\"maxPrefillBatchSize\"\s*:\s*[0-9]*/\"maxPrefillBatchSize\": $MAX_PREFILL_BATCH_SIZE/" "$CONFIG_FILE"
sed -i "s/\"ipAddress\"\s*:\s*\"[^\"]*\"/\"ipAddress\": \"$IP_ADDRESS\"/" "$CONFIG_FILE"
sed -i "s/\"port\"\s*:\s*[0-9]*/\"port\": $PORT/" "$CONFIG_FILE"
sed -i "s/\"managementIpAddress\"\s*:\s*\"[^\"]*\"/\"managementIpAddress\": \"$MANAGEMENT_IP_ADDRESS\"/" "$CONFIG_FILE"
sed -i "s/\"managementPort\"\s*:\s*[0-9]*/\"managementPort\": $MANAGEMENT_PORT/" "$CONFIG_FILE"
sed -i "s/\"metricsPort\"\s*:\s*[0-9]*/\"metricsPort\": $METRICS_PORT/" $CONFIG_FILE
sed -i "s/\"npuMemSize\"\s*:\s*-*[0-9]*/\"npuMemSize\": $NPU_MEM_SIZE/" "$CONFIG_FILE"
##
# Start service
##
echo "Current configurations are displayed as follows:"
cat $CONFIG_FILE
npu-smi info -m > ~/device_info
${MIES_INSTALL_PATH}/bin/mindieservice_daemon
- name: MODEL_NAME
value: '{{ index .Annotations "aml-model-repo" }}'
- name: MODEL_PATH
value: /mnt/models/{{ index .Annotations "aml-model-repo" }}
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
image: swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-300I-Duo-py311-openeuler24.03-lts
name: kserve-container
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
startupProbe:
failureThreshold: 60
httpGet:
path: /v1/models
port: 8080
scheme: HTTP
periodSeconds: 10
timeoutSeconds: 180
supportedModelFormats:
- name: transformers
version: "1"
volumes:
- emptyDir:
medium: Memory
sizeLimit: 8Gi
name: dshm
2. InferenceService 的必需注解
与其他运行时不同,MindIE 在最终发布步骤中,InferenceService 的 metadata 必须包含以下注解。这样可以确保平台调度器正确地将 NPU 硬件绑定到该服务。
| 配置键 | 值 | 目的 |
|---|---|---|
storage.kserve.io/readonly | "false" | 启用对模型存储卷的写访问。 |
3. 用户权限(Root 访问)
由于 Ascend 驱动和硬件抽象层的要求,MindIE 镜像必须以 root 用户运行。请确保你的 ClusterServingRuntime 或 InferenceService security context 已相应配置:
注意:上面的 MindIE ClusterServingRuntime YAML 示例未指定 securityContext,这意味着容器将使用镜像的默认设置运行(通常为 root)。与显式设置 runAsNonRoot: true 和 runAsUser: 1000 的 MLServer 不同,MindIE 需要 root 权限才能访问 NPU 硬件。
#运行时配置对比
在继续之前,请参考下表了解不同运行时的具体要求:
| Runtime | 目标硬件 | 支持的框架 | 特殊要求 |
|---|---|---|---|
| Xinference | CPU / NVIDIA GPU | transformers, pytorch | 必须设置 MODEL_FAMILY 环境变量 |
| MLServer | CPU / NVIDIA GPU | sklearn, xgboost, mlflow | 标准配置 |
| Triton | NVIDIA GPU | triton (TensorFlow, PyTorch, ONNX, etc.) | 标准配置 |
| vLLM-ascend | Huawei Ascend NPU(已在 910B4 上验证) | transformers | 必须在 InferenceService 中添加 HOME、fsGroup 和 supplementalGroups |
| MindIE | Huawei Ascend NPU(已在 310P 上验证) | mindspore, transformers | 必须在 InferenceService 中添加所需的 NPU 注解 |