Upgrade

This article describes how to upgrade from GPU-manager, or from an old HAMi version (v2.5), to the latest HAMi version.


GPU-manager to HAMi

Note

  1. GPU-manager and HAMi cannot be deployed on the same node, but they can be deployed in the same cluster.
  2. During the upgrade, applications need to be modified one by one, which will cause the business pods to restart.
  3. If you have only one GPU node, you need to uninstall GPU-manager before installing HAMi. You can do this by adjusting node labels while both plugins are deployed in the cluster: remove the nvidia-device-enable=vgpu label to delete the gpu-manager instance on the node, then add the gpu=on label to deploy the HAMi plugin on it (see the command sketch after this note).
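A minimal sketch of the label changes described in the note above, assuming you run kubectl against the GPU cluster; <node-name> is a placeholder for your GPU node:

# Remove the GPU-manager selector label so the gpu-manager instance is removed from the node
kubectl label node <node-name> nvidia-device-enable-

# Add the HAMi selector label so the HAMi plugin is scheduled onto the node
kubectl label node <node-name> gpu=on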

Procedure

Modify your applications one by one. For example:

Your application using the old GPU-manager resources:

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          tencent.com/vcuda-core: "50"
          tencent.com/vcuda-memory: "8000"

Migrate it to HAMi:

spec:
  containers:
    - image: your-image
      imagePullPolicy: IfNotPresent
      name: gpu
      resources:
        limits:
          cpu: '2'
          memory: 4Gi
          nvidia.com/gpualloc: 1     # Request 1 physical GPU (required)
          nvidia.com/gpucores: "50"  # Request 50% of the compute resources per GPU (optional)
          nvidia.com/gpumem: 8000    # Request 8000MB of video memory per GPU (optional)
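After the workload restarts with the new resource limits, you can optionally check from inside the container that the vGPU limit is applied. This is a hedged check: with HAMi, the memory reported by nvidia-smi inside the container usually reflects the nvidia.com/gpumem limit; <pod-name> is a placeholder:

# The reported memory should be close to the gpumem limit (8000 MiB in this example)
kubectl exec -it <pod-name> -- nvidia-smi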

HAMi to HAMi

Important Changes (v2.5.0 → v2.7.1)

| Version | Parameter Availability | Required Action After Upgrade |
| --- | --- | --- |
| HAMi v2.5 | Nvidia Runtime Class Name and Create Nvidia Runtime Class are not included in the pop-up form. | N/A |
| HAMi v2.6 | These parameters must be configured when deploying a plugin instance on a new node. | Update the plugin deployment parameters: Nvidia Runtime Class Name: hami-nvidia; Create Nvidia Runtime Class: true (enable the switch). |
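For reference, enabling Create Nvidia Runtime Class is expected to create a RuntimeClass named hami-nvidia in the cluster. The following is a minimal sketch of an equivalent object, assuming the NVIDIA runtime handler registered in your containerd configuration is named nvidia; verify the actual handler name in your environment:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: hami-nvidia  # must match the Nvidia Runtime Class Name parameter
handler: nvidia      # assumption: the NVIDIA runtime handler configured in containerd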

⚠️ Upgrading from v2.5 to v2.7.1 should not affect existing applications. ✅ It is recommended to restart applications with a rolling update to avoid unexpected issues.
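If you choose to restart applications as recommended above, a rolling restart can be triggered with kubectl; the deployment and namespace names below are placeholders:

# Trigger a rolling restart and wait for it to finish
kubectl rollout restart deployment <your-deployment> -n <your-namespace>
kubectl rollout status deployment <your-deployment> -n <your-namespace>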


Procedure

  1. Upgrade the ACP version if needed.
  2. Upload the HAMi v2.7.1 plugin package to ACP.
  3. Go to the Administrator -> Clusters -> Target Cluster -> Functional Components page, then click the Upgrade button; you will see that Alauda Build of HAMi can be upgraded.
  4. Update the ConfigMaps that define the extended resources used by ACP. Run the following script in your GPU cluster:
kubectl apply -f - <<EOF
apiVersion: v1
data:
  dataType: integer
  defaultValue: "1"
  descriptionEn: For resource quotas, this represents the number of GPU tasks. When creating a workload, it declares how many physical GPUs are requested; the requested GPU cores and GPU memory are per physical GPU
  descriptionZh: 资源配额代表 GPU 任务数。创建负载时代表申请的物理 gpu 个数, 申请的算力和显存都是每个物理 GPU 的使用量
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpualloc
  labelEn: gpu number
  labelZh: gpu 个数
  limits: optional
  requests: disabled
  resourceUnit: "count"
  relatedResources: "nvidia.com/gpucores,nvidia.com/gpumem"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpualloc
  namespace: kube-public
---
apiVersion: v1
data:
  dataType: integer
  defaultValue: "20"
  descriptionEn: vGPU cores; 100 cores represent the full computing power of a physical GPU
  descriptionZh: vgpu 算力, 100 算力代表一个物理 GPU 的全部算力
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpucores
  prefix: limits
  labelEn: vgpu cores
  labelZh: vgpu 算力
  limits: optional
  requests: disabled
  relatedResources: "nvidia.com/gpualloc,nvidia.com/gpumem"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
  ignoreNodeCheck: "true"
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpucores
  namespace: kube-public
---
apiVersion: v1
data:
  dataType: integer
  defaultValue: "4000"
  group: hami-nvidia
  groupI18n: '{"zh": "HAMi NVIDIA", "en": "HAMi NVIDIA"}'
  key: nvidia.com/gpumem
  prefix: limits
  labelEn: vgpu memory
  labelZh: vgpu 显存
  limits: optional
  requests: disabled
  resourceUnit: "Mi"
  relatedResources: "nvidia.com/gpualloc,nvidia.com/gpucores"
  excludeResources: "nvidia.com/mps-core,nvidia.com/mps-memory,tencent.com/vcuda-core,tencent.com/vcuda-memory"
  runtimeClassName: ""
  ignoreNodeCheck: "true"
kind: ConfigMap
metadata:
  labels:
    features.cpaas.io/enabled: "true"
    features.cpaas.io/group: hami-nvidia
    features.cpaas.io/type: CustomResourceLimitation
  name: cf-crl-hami-nvidia-gpumem
  namespace: kube-public
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cf-crl-hami-config
  namespace: kube-public
  labels:
    device-plugin.cpaas.io/config: "true"
data:
  deviceName: "HAMi"
  nodeLabelKey: "gpu"
  nodeLabelValue: "on"
EOF
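After applying the script, you can confirm that the ConfigMaps were created in kube-public by filtering on the labels set above:

# The three extended-resource definitions carry the hami-nvidia group label
kubectl get configmap -n kube-public -l features.cpaas.io/group=hami-nvidia

# The device-plugin configuration carries its own label
kubectl get configmap -n kube-public -l device-plugin.cpaas.io/config=true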

Note

If you configured resource quotas for HAMi resources in versions prior to v2.7.1, delete and reconfigure them.
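A quick, hedged way to locate existing resource quotas that still reference the HAMi resource names, so you can delete and reconfigure them:

# List all ResourceQuota objects and show entries that mention HAMi resources
kubectl get resourcequota -A -o yaml | grep -n "nvidia.com/gpu"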