Set Up an On-Premise Reranker Service for Hyperflux

Follow the steps below in Alauda AI to set up a rerank model served with vLLM:

  1. Upload the desired rerank model to the model repository, for example: Alibaba-NLP/gte-reranker-modernbert-base
  2. Click the "Publish Inference Service" button, configure appropriate resources, and select the vLLM runtime (>=vllm-0.9.2-cuda-12.6-x86).
  3. Do not click "Publish" yet. Click the YAML button in the upper right corner to switch to YAML editing mode.
  4. Modify the spec.model.command section of the YAML as follows (only the original python3 startup command needs to be replaced with the vllm serve command shown below; the script preceding it does not need to change).
  5. After the rerank model starts, make sure its API address is reachable from the global cluster (the cluster where Hyperflux is deployed). If the model runs in a different cluster, expose it via NodePort, Ingress, the AI Gateway, or similar.
  6. Update the Hyperflux configuration: set Cohere Reranker BaseUrl to the access address of the inference service above, set Cohere Reranker Model to the model name (usually the name of the created InferenceService), and set Cohere Reranker API Key to any non-empty value (vLLM does not require an API key by default).
  7. Once the smart-doc container restarts successfully, the setup is complete.
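Before wiring the service into Hyperflux (step 6), you can sanity-check the exposed endpoint. A minimal sketch, assuming vLLM's Cohere-compatible /v1/rerank endpoint; the address and model name below are placeholders, not values from this guide:

```shell
# Placeholders -- replace with your exposed service address and the
# InferenceService / served model name you created above.
RERANK_URL="http://<inference-service-address>/v1/rerank"
MODEL="gte-reranker-modernbert-base"

# Cohere-style rerank request body.
PAYLOAD='{"model": "'"${MODEL}"'", "query": "What is a panda?", "documents": ["The giant panda is a bear species endemic to China.", "Paris is the capital of France."]}'

# Validate the request body locally before sending it.
echo "${PAYLOAD}" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send the request; a healthy service returns relevance scores per document.
# (|| true keeps the check non-fatal while the placeholder URL is unset.)
curl -s "${RERANK_URL}" -H "Content-Type: application/json" -d "${PAYLOAD}" || true
```

If the call returns an HTTP error instead of scored results, recheck the NodePort/Ingress/AI Gateway exposure from step 5 before changing any Hyperflux settings.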

Sample vLLM startup command:

set -ex

# 1. check num GPUs

GPU_COUNT=$(python3 -c "import torch; print(torch.cuda.device_count())")

echo "Starting serving model name: ${MODEL_NAME}, num gpus: ${GPU_COUNT}"

if [ ${GPU_COUNT} -lt 1 ]; then
    echo "No GPUs found. Please check whether the container has acquired any GPU devices"
    exit 1
fi


# 2. check model path

MODEL_DIR="/mnt/models/${MODEL_NAME}"

# a. with the git-lfs storage initializer, the model will be in /mnt/models/<model_name>

# b. with the hf storage initializer, the model will be in /mnt/models

if [ ! -d "${MODEL_DIR}" ]; then
    MODEL_DIR="/mnt/models"
    echo "[WARNING] Model directory ${MODEL_DIR}/${MODEL_NAME} not found, using ${MODEL_DIR} instead"
fi


# 3. check if using gguf models

c=$(find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' | wc -l)

echo "found ${c} gguf files"

if [ "${c}" -gt 1 ]; then
    echo "[ERROR] More than one gguf file found in ${MODEL_DIR}"
    echo "Currently, vllm only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the gguf-split tool to merge it into a single file."
    exit 1
elif [ "${c}" -eq 1 ]; then
    n=$(find "${MODEL_DIR}" -maxdepth 1 -type f -name '*.gguf' -print)
    echo "[INFO] Using GGUF model file: ${n}"
    MODEL_PATH="${n}"
else
    echo "[INFO] Using standard model directory"
    MODEL_PATH="${MODEL_DIR}"
fi


# 4. launch vllm server

if [ "$ENABLE_CHUNKED_PREFILL" = "False" ]; then
    PARAM_ENABLE_CHUNKED_PREFILL="--no-enable-chunked-prefill"
else
    PARAM_ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
fi

if [ "$ENFORCE_EAGER" = "True" ]; then
    PARAM_ENFORCE_EAGER="--enforce-eager"
else
    PARAM_ENFORCE_EAGER=""
fi

vllm serve "${MODEL_PATH}" --task score --port 8080 \
    --served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \
    ${PARAM_ENABLE_CHUNKED_PREFILL} ${PARAM_ENFORCE_EAGER}
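The ENABLE_CHUNKED_PREFILL and ENFORCE_EAGER environment variables in the script above translate into vLLM command-line flags. The mapping can be checked standalone with sample values (the values below are illustrative, not defaults from this guide):

```shell
# Same flag-mapping logic as the startup script, run with sample values.
ENABLE_CHUNKED_PREFILL="False"
ENFORCE_EAGER="True"

if [ "$ENABLE_CHUNKED_PREFILL" = "False" ]; then
    PARAM_ENABLE_CHUNKED_PREFILL="--no-enable-chunked-prefill"
else
    PARAM_ENABLE_CHUNKED_PREFILL="--enable-chunked-prefill"
fi

if [ "$ENFORCE_EAGER" = "True" ]; then
    PARAM_ENFORCE_EAGER="--enforce-eager"
else
    PARAM_ENFORCE_EAGER=""
fi

echo "extra flags: ${PARAM_ENABLE_CHUNKED_PREFILL} ${PARAM_ENFORCE_EAGER}"
# prints: extra flags: --no-enable-chunked-prefill --enforce-eager
```

Because ENFORCE_EAGER expands to an empty string when unset, the variables are deliberately left unquoted on the vllm serve line so an empty value contributes no argument.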