Installation
Prerequisites
- Cluster administrator access to your ACP cluster
- NVIDIA driver: v450 or later
- ACP version: v3.18, v4.0, or v4.1
Procedure
Installing the NVIDIA driver on your GPU node
Refer to the installation guide on the official NVIDIA website.
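Once the driver is installed, it can be checked against the v450+ prerequisite. The following is a minimal sketch, assuming nvidia-smi is on the PATH of the GPU node; the function names and the MIN_DRIVER_MAJOR variable are hypothetical helpers, not part of any NVIDIA or ACP tooling.

```shell
#!/usr/bin/env bash
# Minimum driver major version, taken from the Prerequisites list (v450+).
MIN_DRIVER_MAJOR=450

# driver_meets_minimum VERSION
# Succeeds when the major component of VERSION (e.g. "535.104.05")
# is at least MIN_DRIVER_MAJOR. Returns 2 when VERSION is not numeric.
driver_meets_minimum() {
  local major
  major="${1%%.*}"
  case "$major" in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# check_installed_driver
# Reads the driver version reported by nvidia-smi for the first GPU
# and compares it with the minimum.
check_installed_driver() {
  local version
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
  if driver_meets_minimum "$version"; then
    echo "driver $version OK"
  else
    echo "driver '${version:-not found}' does not meet v${MIN_DRIVER_MAJOR}+" >&2
    return 1
  fi
}
```

After sourcing these functions on the GPU node, running check_installed_driver reports whether the installed driver satisfies the prerequisite.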
Installing NVIDIA Container Runtime
Refer to the installation guide of the NVIDIA Container Toolkit.
Add the NVIDIA yum repository on the GPU node
Note: Make sure the GPU node can access nvidia.github.io.
The repository was added successfully when the message "Metadata cache created." appears.
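The exact command for this step is not preserved on this page, so the following is only a sketch of one way to add the repository. It assumes the repository file URL published in the NVIDIA Container Toolkit installation guide and the default /etc/yum.repos.d location; verify both against the current guide before use. The function name add_nvidia_repo is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Assumption: repo URL as published in the NVIDIA Container Toolkit guide.
REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
# Assumption: default yum repository directory.
REPO_FILE="/etc/yum.repos.d/nvidia-container-toolkit.repo"

# add_nvidia_repo
# Downloads the repo definition to a temporary file first, so that a failed
# download (for example, no access to nvidia.github.io) never leaves a
# half-written repo file behind, then refreshes the yum metadata cache.
add_nvidia_repo() {
  local tmp status
  tmp="$(mktemp)" || return 1
  if ! curl -fsSL "$REPO_URL" -o "$tmp"; then
    rm -f "$tmp"
    echo "cannot download $REPO_URL; check access to nvidia.github.io" >&2
    return 1
  fi
  # Install the repo file, then rebuild the metadata cache.
  sudo install -m 0644 "$tmp" "$REPO_FILE" && sudo yum makecache
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Running add_nvidia_repo on the GPU node ends with yum makecache, which is the command that prints the "Metadata cache created." message on success.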
Installing NVIDIA Container Runtime
The installation is successful when the message "Complete!" appears.
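As a sketch of this step, assuming the package name nvidia-container-toolkit used by the NVIDIA Container Toolkit installation guide (the original command is not preserved here), the installation and a follow-up check could look like this. The function name install_toolkit is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: package name from the NVIDIA Container Toolkit guide.
TOOLKIT_PACKAGE="nvidia-container-toolkit"

# install_toolkit
# Installs the package with yum, then asks rpm whether it is present.
# rpm -q exits non-zero when the package is not installed.
install_toolkit() {
  sudo yum install -y "$TOOLKIT_PACKAGE" || return 1
  rpm -q "$TOOLKIT_PACKAGE"
}
```

The rpm -q query gives a scriptable confirmation in addition to the "Complete!" message printed by yum.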
Downloading Cluster plugin
The Alauda Build of NVIDIA GPU Device Plugin cluster plugin can be retrieved from the Customer Portal.
Please contact Consumer Support for more information.
Uploading the Cluster plugin
For more information on uploading the cluster plugin, please refer to
Installing Alauda Build of NVIDIA GPU Device Plugin
- Add the label "nvidia-device-enable=pgpu" to your GPU node so that nvidia-device-plugin is scheduled on it.
  Note: The same node cannot have both the gpu=on and nvidia-device-enable=pgpu labels at the same time.
- Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of NVIDIA GPU Device Plugin cluster plugin.
  Note: Deploy form parameters can be kept at their defaults, or modified once you know how they are used.
- Verify the result. The plugin shows the status "Installed" in the UI, or you can check the pod status.
- Finally, Extended Resources appears in the resources form when you create an application in ACP, and you can then select GPU core.
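The labeling step, together with the rule that gpu=on and nvidia-device-enable=pgpu must not coexist on one node, can be sketched with kubectl as follows. This is an illustration under assumptions: label_gpu_node is a hypothetical helper, and it expects kubectl to be configured for the target cluster.

```shell
#!/usr/bin/env bash
# Label required by the device plugin, as stated in the procedure.
PLUGIN_LABEL_KEY="nvidia-device-enable"
PLUGIN_LABEL_VALUE="pgpu"

# label_gpu_node NODE
# Refuses to label NODE when it already carries gpu=on, because both labels
# cannot be set on the same node; otherwise adds the plugin label.
label_gpu_node() {
  local node current
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: label_gpu_node NODE" >&2
    return 2
  fi
  # Read the current value of the "gpu" label (empty when it is not set).
  current="$(kubectl get node "$node" -o jsonpath='{.metadata.labels.gpu}')" || return 1
  if [ "$current" = "on" ]; then
    echo "node $node has gpu=on; remove it before adding ${PLUGIN_LABEL_KEY}=${PLUGIN_LABEL_VALUE}" >&2
    return 1
  fi
  kubectl label node "$node" "${PLUGIN_LABEL_KEY}=${PLUGIN_LABEL_VALUE}"
}
```

For example, label_gpu_node followed by the node name labels that node, or stops with an error if the conflicting gpu=on label is present.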
Installing Alauda Build of DCGM-Exporter
- Go to the Administrator -> Marketplace -> Cluster Plugin page, switch to the target cluster, and then deploy the Alauda Build of DCGM-Exporter cluster plugin. Set the node labels in the popup form:
  - Node Label Key: nvidia-device-enable
  - Node Label Value: pgpu
  If you need to enable dcgm-exporter for HAMi, you can add another label:
  - Node Label Key: gpu
  - Node Label Value: on
- Verify the result. The plugin shows the status "Installed" in the UI, or you can check the pod status.
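The pod status check can be sketched with kubectl. The namespace and the exact pod names used by the plugin are not stated on this page, so this example searches all namespaces for a name fragment; the fragment "dcgm-exporter" in the usage note and the function name plugin_pods_ready are assumptions.

```shell
#!/usr/bin/env bash
# plugin_pods_ready FRAGMENT
# Lists pods in all namespaces whose name contains FRAGMENT and succeeds only
# when at least one such pod exists and every one of them is in Running status.
plugin_pods_ready() {
  local fragment all bad
  fragment="$1"
  # Columns with --all-namespaces: NAMESPACE NAME READY STATUS RESTARTS AGE
  all="$(kubectl get pods --all-namespaces --no-headers |
    awk -v frag="$fragment" 'index($2, frag)')"
  if [ -z "$all" ]; then
    echo "no pods matching '$fragment'" >&2
    return 1
  fi
  # Keep only the rows whose STATUS column is not "Running".
  bad="$(printf '%s\n' "$all" | awk '$4 != "Running"')"
  if [ -n "$bad" ]; then
    echo "pods not running:" >&2
    printf '%s\n' "$bad" >&2
    return 1
  fi
  printf '%s\n' "$all"
}
```

Calling plugin_pods_ready dcgm-exporter prints the matching pods when all of them are Running. The same check can be pointed at the device plugin by passing a fragment of its pod name instead.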