Setting Up a Local Reranker Service for Hyperflux

Follow the steps below to set up a rerank model with vLLM in Alauda AI:

  1. Upload the desired rerank model to the model repository, for example: Alibaba-NLP/gte-reranker-modernbert-base.
  2. Click the "Publish Inference Service" button, configure appropriate resources, and select a vLLM runtime (>=vllm-0.9.2-cuda-12.6-x86).
  3. Do not click "Publish" yet. Click the YAML button in the upper-right corner to switch to YAML editing mode.
  4. Modify the spec.model.command section of the YAML: delete only the original python3 launch part and replace it with the vllm serve launch command shown below; the script that precedes it does not need to be changed.
  5. After the rerank model has started, make sure its API address is reachable from the global cluster (the cluster where Hyperflux is deployed). For cross-cluster access, expose the service via a host port, Ingress, AI Gateway, or a similar mechanism.
  6. Update the Hyperflux settings: set Cohere Reranker BaseUrl to the access address of the inference service above, set Cohere Reranker Model to the model name (usually the name of the InferenceService you created), and enter any value for Cohere Reranker API key, since vLLM does not set one by default. A reference configuration follows this list, and a verification example appears after the launch command.
  7. After the smart-document container restarts successfully, the configuration is complete.
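For reference, the Hyperflux settings from step 6 end up looking like the following; the address and service name are placeholders to be replaced with the actual values of your inference service:

Cohere Reranker BaseUrl: http://<inference-service-address>:<port>
Cohere Reranker Model:   <inference-service-name>
Cohere Reranker API key: any value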

Example vLLM launch command:

set -ex

# 1. check num GPUs

GPU_COUNT=$(python3 -c "import torch; print(torch.cuda.device_count())")

echo "Starting serving model name: ${MODEL_NAME}, num gpus:
${GPU_COUNT}"

if [ ${GPU_COUNT} -lt 1 ]; then
    echo "No GPUs found. Please check if the container have aquired any GPU device"
    exit 1
fi


# 2. check model path

MODEL_DIR="/mnt/models/${MODEL_NAME}"

# a. using git lfs storage initializer, model will be in /mnt/models/<model_name>

# b. using hf storage initializer, model will be in /mnt/models

if [ ! -d "${MODEL_DIR}" ]; then
    MODEL_DIR="/mnt/models"
    echo "[WARNING] Model directory ${MODEL_DIR}/${MODEL_NAME} not found, using ${MODEL_DIR} instead"
fi


# 3. check if using gguf models

c=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' | wc -l`

echo "find ${c} gguf files"

if [ "${c}" -gt 1 ]; then
    echo "[ERROR] More than one gguf file found in ${MODEL_DIR}"
    echo "Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use gguf-split tool to merge them to a single-file model."
    exit 1
elif [ "${c}" -eq 1 ]; then
    n=`find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' -print`
    echo "[INFO] Using GGUF model file: ${n}"
    MODEL_PATH="${n}"
else
    echo "[INFO] Using standard model directory"
    MODEL_PATH="${MODEL_DIR}"
fi


# 4. launch vllm server

if [ "$ENABLE_CHUNKED_PREFILL" = "False" ]; then
    PARAM_ENABLE_CHUNKED_PREFILL="--no-enable-chunked-prefill"
else
    PARAM_ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
fi

if [ "$ENFORCE_EAGER" = "True" ]; then
    PARAM_ENFORCE_EAGER="--enforce-eager"
else
    PARAM_ENFORCE_EAGER=""
fi
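# Note: the PARAM_ENABLE_CHUNKED_PREFILL and PARAM_ENFORCE_EAGER flags prepared above come
# from the runtime's original startup script and are not used by the rerank launch command
# below; append them to `vllm serve` only if your deployment actually needs those options.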

vllm serve "${MODEL_PATH}" --task score --port 8080 \
    --served-model-name {{.Name}} {{.Namespace}}/{{.Name}}
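
Once the inference service is running, reachability from the global cluster (step 5) and the values for step 6 can be checked with two curl calls. This is a minimal sketch: http://<reranker-host>:<port> is a placeholder for the host-port, Ingress, or AI Gateway address you exposed, <inference-service-name> stands for the name passed via --served-model-name, and it assumes a recent vLLM server started with --task score exposes the /health and /v1/rerank endpoints.

# 1. basic reachability check (expects HTTP 200 from vLLM's /health endpoint)
curl -s -o /dev/null -w "%{http_code}\n" http://<reranker-host>:<port>/health

# 2. call the rerank endpoint directly; the model name must match --served-model-name
curl -s http://<reranker-host>:<port>/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<inference-service-name>",
        "query": "What is Hyperflux?",
        "documents": ["Hyperflux is a product.", "Unrelated text."]
      }'

If the second call returns a relevance score per document, use that address as Cohere Reranker BaseUrl and that model name as Cohere Reranker Model in step 6.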