Frequently Asked Questions

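How to use Kubeflow plugins when PSA=restricted is set in Kubernetes
How to configure Kubeflow to use an alternative platform address for login?
How to launch a Kubeflow Pipeline Run with external S3/MinIO storage
Configuring Kubeflow Notebook to use custom GPU resources
Pod startup failure: probe timeout (kube-ovn environments)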

How to use Kubeflow plugins when PSA=restricted is set in Kubernetes

If your namespace is set to PSA=restricted, Kubeflow components may fail, for example when creating a notebook or running a Kubeflow Pipeline. To resolve this, change the PSA level of the namespace to baseline:

kubectl label --overwrite ns [NAMESPACE_NAME] pod-security.kubernetes.io/audit=baseline
kubectl label --overwrite ns [NAMESPACE_NAME] pod-security.kubernetes.io/enforce=baseline
kubectl label --overwrite ns [NAMESPACE_NAME] pod-security.kubernetes.io/warn=baseline

Note: Check with your cluster administrator to confirm that changing the PSA level is acceptable.
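The three label commands above can also be combined into a single call. The following is a minimal sketch rather than an official procedure; the helper names psa_labels and set_psa_level are hypothetical and exist only for illustration.

```shell
# Print the three Pod Security Admission labels for a given level,
# one per line (enforce, audit, warn).
psa_labels() {
  level="$1"
  for mode in enforce audit warn; do
    printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$level"
  done
}

# Apply all three labels to a namespace in one kubectl call.
# The unquoted substitution is intentional: each line becomes one argument.
set_psa_level() {
  ns="$1"
  level="$2"
  kubectl label --overwrite ns "$ns" $(psa_labels "$level")
}

# Usage (not run here):
#   set_psa_level my-namespace baseline
#   kubectl get ns my-namespace --show-labels
```

Listing the namespace labels afterwards is a quick way to confirm that all three modes were updated.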

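How to configure Kubeflow to use an alternative platform address for login?

In some environments, the platform access address is configured as an intranet address, and users must log in through an "alternative platform address". In this case, the OIDC issuer is still based on the original platform address, but the login page URL must be updated to the alternative address.

Steps:

Locate the ModuleInfo resource:

In the global cluster, find the ModuleInfo resource that corresponds to the kfbase plugin with the following command: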

kubectl get moduleinfoes -l cpaas.io/module-name=kfbase,cpaas.io/cluster-name=<deployed-cluster-name>

Edit the ModuleInfo resource:

Add a valuesOverride section under spec as shown below. Replace <Alternative-Platform-Address> with the actual alternative address.

......
spec: 
  valuesOverride: 
    mlops/kfbase:  
      oidcAuthURL: https://<Alternative-Platform-Address>/dex/auth
......

Restart OAuth2 Proxy:

Apply the change by restarting the oauth2-proxy deployment in the target cluster:

kubectl rollout restart deploy -n kubeflow-oauth2-proxy oauth2-proxy
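As an alternative to editing the resource interactively, the same change could in principle be applied with a JSON merge patch. This is a sketch under assumptions: the helper render_oidc_patch, the placeholder <moduleinfo-name>, and the host platform.example.com are hypothetical, and it assumes the ModuleInfo resource can be patched the same way it can be edited.

```shell
# Render a JSON merge patch that sets oidcAuthURL under spec.valuesOverride.
# $1: the alternative platform address (host only, without a scheme).
render_oidc_patch() {
  printf '{"spec":{"valuesOverride":{"mlops/kfbase":{"oidcAuthURL":"https://%s/dex/auth"}}}}\n' "$1"
}

# Usage (not run here; <moduleinfo-name> comes from the "kubectl get moduleinfoes" step):
#   kubectl patch moduleinfoes <moduleinfo-name> --type merge \
#     -p "$(render_oidc_patch platform.example.com)"
#   kubectl rollout restart deploy -n kubeflow-oauth2-proxy oauth2-proxy
#   kubectl rollout status deploy -n kubeflow-oauth2-proxy oauth2-proxy
```

A merge patch only touches the keys it names, so other entries already present under valuesOverride should be left as they are.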

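How to launch a Kubeflow Pipeline Run with external S3/MinIO storage

When Kubeflow is installed with an external S3/MinIO storage service, you need to add a "KFP Launcher" ConfigMap that sets the storage used by the current namespace or user. See the Kubeflow documentation at https://www.kubeflow.org/docs/components/pipelines/operator-guides/configure-object-store/#s3-and-s3-compatible-provider for details. If no configuration is set, Pipeline runs may still try to access the default service address, for example "minio-service.kubeflow:9000", which is not the external storage you intended to use.

The following simple example can serve as a starting point: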

apiVersion: v1
data:
  defaultPipelineRoot: s3://mlpipeline
  providers: |-
    s3:
      default:
        endpoint: minio.minio-system.svc:80
        disableSSL: true
        region: us-east-2
        forcePathStyle: true
        credentials:
          fromEnv: false
          secretRef:
            secretName: mlpipeline-minio-artifact
            accessKeyKey: accesskey
            secretKeyKey: secretkey
kind: ConfigMap
metadata:
  name: kfp-launcher
  namespace: wy-testns

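Set the following values in this ConfigMap to point to your own S3/MinIO storage:

defaultPipelineRoot: the location where pipeline intermediate data is stored
endpoint: the S3/MinIO service endpoint. It must not start with "http" or "https"
disableSSL: whether to disable "https" access to the endpoint
region: the S3 region. With MinIO, any value works
credentials: the AK/SK (access key and secret key) stored in a Secret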
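Because the endpoint must be given without a scheme while disableSSL carries the http/https choice, it can help to derive both values from a full URL. The helpers below are a hypothetical sketch, not part of Kubeflow; the example URLs are assumptions.

```shell
# Strip an optional http:// or https:// prefix and a trailing slash,
# leaving the host[:port] form expected by the endpoint field.
endpoint_host() {
  ep="$1"
  ep="${ep#http://}"
  ep="${ep#https://}"
  printf '%s\n' "${ep%/}"
}

# Derive the disableSSL value from the scheme of the original URL.
# Returns non-zero when the URL has no recognizable scheme.
disable_ssl_for() {
  case "$1" in
    http://*)  echo true ;;
    https://*) echo false ;;
    *) echo "cannot infer disableSSL without a scheme: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   endpoint_host   http://minio.minio-system.svc:80   # prints minio.minio-system.svc:80
#   disable_ssl_for http://minio.minio-system.svc:80   # prints true
```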

After this ConfigMap is added, newly started Kubeflow Pipeline Runs automatically read the configuration and store the data used by Kubeflow Pipeline in the configured storage.
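The ConfigMap example references a Secret named mlpipeline-minio-artifact with the keys accesskey and secretkey. If that Secret does not yet exist in the namespace, a sketch like the following could create it. The helper render_artifact_secret is hypothetical, and it assumes the Secret should live in the same namespace as the ConfigMap and that the key values contain no double quotes or backslashes.

```shell
# Render a Secret manifest holding the S3/MinIO access key and secret key.
# $1: namespace, $2: access key, $3: secret key.
render_artifact_secret() {
  ns="$1"
  access_key="$2"
  secret_key="$3"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: mlpipeline-minio-artifact
  namespace: ${ns}
type: Opaque
stringData:
  accesskey: "${access_key}"
  secretkey: "${secret_key}"
EOF
}

# Usage (not run here; AK and SK are assumed to be set in your shell):
#   render_artifact_secret wy-testns "$AK" "$SK" | kubectl apply -f -
```

Using stringData lets the values be written in plain text; the API server stores them base64-encoded under data.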

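Configuring Kubeflow Notebook to use custom GPU resources

You can add other GPU resource types so that the Kubeflow Notebook web page can create instances that use that hardware, for example when using Ascend NPUs.

Edit the ConfigMap by running the following commands: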

kubectl -n kubeflow get configmap | grep jupyter-web-app-config
kubectl -n kubeflow edit configmap jupyter-web-app-config-<actual-cm-suffix>

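Find the section below and add your GPU resource type, for example "your-custom.com/gpu".

Note: You can only add resource types that take integer values, such as 1, 2, 4, or 8. In addition, you cannot add "Virtual" or "Shared" GPU resources that require both "Cores" and "Memory" to be set together, as with HAMi.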

################################################################
# GPU/Device-Plugin Resources
################################################################
gpus:
  readOnly: false
 
  # configs for gpu/device-plugin limits of the container
  # https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins
  value:
    # the `limitKey` of the default vendor
    # (to have no default, set as "")
    vendor: ""
 
    # the list of available vendors in the dropdown
    #  `limitsKey` - what will be set as the actual limit
    #  `uiName` - what will be displayed in the dropdown UI
    vendors:
    - limitsKey: "nvidia.com/gpu"
      uiName: "NVIDIA"
    - limitsKey: "amd.com/gpu"
      uiName: "AMD"
    - limitsKey: "habana.ai/gaudi"
      uiName: "Intel Gaudi"
    - limitsKey: "your-custom.com/gpu"
      uiName: "Your Custom Vendor"
    # the default value of the limit
    # (possible values: "none", "1", "2", "4", "8")
    num: "none"
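When several vendors need to be added, the entries can be generated instead of typed by hand. This is a minimal sketch; render_vendor_entry is a hypothetical helper, and the indentation it prints assumes the vendors list is indented as in the snippet above.

```shell
# Print one entry for the "vendors" list of the gpus section.
# $1: the resource name used as the limit key, $2: the label shown in the UI.
render_vendor_entry() {
  limits_key="$1"
  ui_name="$2"
  printf '    - limitsKey: "%s"\n      uiName: "%s"\n' "$limits_key" "$ui_name"
}

# Usage:
#   render_vendor_entry "your-custom.com/gpu" "Your Custom Vendor"
#
# Before adding an entry, it may be worth confirming that nodes actually
# advertise the resource (not run here):
#   kubectl describe nodes | grep 'your-custom.com/gpu'
```

If no node reports the resource, notebooks that request it are unlikely to be scheduled.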

Pod startup failure: probe timeout (kube-ovn environments)

Symptom: A large number of Pods in the kubeflow namespace are stuck in CrashLoopBackOff or Init:1/2, and the Pod Events show errors such as:

Startup probe failed: Get "http://<pod-ip>:<port>/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Liveness probe failed: ...context deadline exceeded...

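Cause: The default-allow-same-namespace NetworkPolicy deployed by kfbase only allows inbound traffic from Pods in the same namespace and from a small number of system namespaces. In clusters that use kube-ovn as the CNI, health check traffic sent by kubelet reaches Pods through the kube-ovn join subnet (100.64.0.0/16 by default). The source IP of this traffic does not match any existing NetworkPolicy rule, so the OVN ACL drops it and all probes time out.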
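To see whether a given source address would fall inside the join subnet, the CIDR match can be reproduced with plain shell arithmetic. This is an illustrative sketch for IPv4 only; the helper names and the sample addresses are hypothetical, and it does not inspect any real NetworkPolicy or OVN ACL.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs="$IFS"
  IFS=.
  set -- $1          # split the address on dots into $1..$4
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if the address $1 lies inside the CIDR $2 (for example 100.64.0.0/16).
ip_in_cidr() {
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Usage:
#   ip_in_cidr 100.64.0.2 100.64.0.0/16 && echo "inside join subnet"
#   ip_in_cidr 10.3.0.5   100.64.0.0/16 || echo "outside join subnet"
```

A node's join IP that matches the join CIDR but no ipBlock or selector in the existing policies is consistent with the cause described above.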

Fix: Create a NetworkPolicy that allows inbound traffic from the kube-ovn join subnet:

# 1. Check the CIDR of the kube-ovn join subnet
kubectl get subnet join -o jsonpath='{.spec.cidrBlock}'
# Example output: 100.64.0.0/16

# 2. Check the IP of each node on the join subnet
kubectl get nodes -o custom-columns='NAME:.metadata.name,JOIN_IP:.metadata.annotations.ovn\.kubernetes\.io/ip_address'

# 3. Verify whether the probe timeout is related to the NetworkPolicy (temporary test)
# Change the ingress of default-allow-same-namespace to [{}] (allow all inbound traffic),
# then observe whether the Pods recover. Be sure to revert the change after confirmation.

# 4. Get the join subnet CIDR, then create a NetworkPolicy that allows it
JOIN_CIDR=$(kubectl get subnet join -o jsonpath='{.spec.cidrBlock}')

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kubelet-probes
  namespace: kubeflow
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: ${JOIN_CIDR}
EOF

Note: The join subnet CIDR can differ between clusters. Always obtain the actual value by running kubectl get subnet join. The common default is 100.64.0.0/16.
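Because the manifest is built from a shell variable, an empty or malformed JOIN_CIDR would produce an invalid NetworkPolicy. The sketch below adds a simple guard before rendering. The helper render_probe_policy is hypothetical, and it assumes a single IPv4 CIDR; a dual-stack cluster may report more than one CIDR, which this check would reject.

```shell
# Render the allow-kubelet-probes NetworkPolicy for a namespace and CIDR.
# Fails without printing a manifest if the CIDR is empty or not a.b.c.d/n.
render_probe_policy() {
  ns="$1"
  cidr="$2"
  if ! printf '%s\n' "$cidr" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}/[0-9]{1,2}$'; then
    echo "invalid or empty CIDR: '$cidr'" >&2
    return 1
  fi
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kubelet-probes
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: ${cidr}
EOF
}

# Usage (not run here):
#   JOIN_CIDR=$(kubectl get subnet join -o jsonpath='{.spec.cidrBlock}')
#   render_probe_policy kubeflow "$JOIN_CIDR" | kubectl apply -f -
#   kubectl -n kubeflow get pods
```

Once the policy is applied, watching the Pods in the kubeflow namespace should show whether the probes start succeeding.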
