Creating a Cluster

This document provides a complete walkthrough for creating a Kubernetes cluster on the DCS platform with Cluster API. The process involves deploying and configuring several Kubernetes resources that work together to provision and manage the cluster infrastructure.

Prerequisites

Before creating a cluster, make sure all of the following prerequisites are met:

1. DCS platform installed and running

The DCS platform must be fully installed and operating normally. Make sure you have:

  • The endpoint URL for accessing the DCS platform services
  • Valid authentication credentials (authUser and authKey)
  • Appropriate permissions to create and manage virtual machines
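
A quick reachability check of the DCS endpoint before you start can save debugging later; the URL below is only a placeholder for your own DCS API address:

# Confirm the DCS API endpoint is reachable (-k allows a self-signed certificate).
curl -k https://dcs.example.com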

2. Virtual machine template prepared

To prepare for the Kubernetes installation, you must:

  • Upload the provided MicroOS image to the DCS platform
  • Create a virtual machine template based on that image
  • Ensure the template contains all required Kubernetes components

3. Required plugins installed

Install the following plugins in the global cluster:

  • Cluster API Provider Kubeadm - provides Kubernetes cluster bootstrapping capability
  • Cluster API Provider DCS - implements DCS infrastructure integration and management

For detailed installation instructions, see the installation guide.

4. Public image registry configured

Configure public image registry credentials, including:

  • Image registry address configuration
  • Appropriate authentication credential settings

For detailed configuration steps, see the Alauda Container Platform documentation: Configuration → Clusters → How To → Update Public Registry Credentials.

Cluster Creation Overview

At a high level, you will create the following Cluster API resources in the global cluster to provision the infrastructure and bootstrap a fully functional Kubernetes cluster.

WARNING

Important namespace requirement

To ensure correct integration as a workload cluster, all resources must be deployed in the cpaas-system namespace. Deploying resources in any other namespace may cause integration failures.
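
Every manifest in this guide therefore sets namespace: cpaas-system. A quick check that the namespace exists on the global cluster before you apply anything:

# The cpaas-system namespace should already exist on the global cluster.
kubectl get namespace cpaas-system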

Control Plane Configuration

The control plane manages cluster state, scheduling, and the Kubernetes API. The following sections show how to configure a highly available control plane.

WARNING

Notes on configuration parameters

When configuring resources, modify parameters with care:

  • Replace only the values enclosed in <> with the actual values from your environment
  • Keep all other parameters as-is, since they represent optimized or required configuration
  • Modifying non-placeholder parameters may cause cluster instability or integration failures

Configuration workflow

Perform the following steps in order:

  1. Plan the network and deploy the API load balancer
  2. Configure the DCS credentials (Secret)
  3. Create the IP and hostname pool
  4. Create the control plane DCSMachineTemplate
  5. Configure the KubeadmControlPlane
  6. Configure the DCSCluster
  7. Create the Cluster

After the manifests are applied, a Kubernetes control plane will be created on DCS.
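
Once the manifests from steps 2–7 have been filled in, they can be applied to the global cluster in that order. A sketch, assuming each resource is saved to a file named after it (the file names are illustrative only):

kubectl apply -f dcs-credentials-secret.yaml        # step 2: DCS credentials Secret
kubectl apply -f control-plane-iphostname-pool.yaml # step 3: DCSIpHostnamePool
kubectl apply -f cp-dcs-machine-template.yaml       # step 4: DCSMachineTemplate
kubectl apply -f kubeadm-control-plane.yaml         # step 5: KubeadmControlPlane
kubectl apply -f dcs-cluster.yaml                   # step 6: DCSCluster
kubectl apply -f cluster.yaml                       # step 7: Cluster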

Network Planning and Load Balancer

Before creating the control plane resources, plan the network architecture and deploy a load balancer for high availability.

Requirements

  • Network planning: plan the IP address range for the control plane nodes
  • Load balancer: deploy and configure one for access to the API server
  • IP binding: bind the load balancer to the IP addresses in the control plane IP pool
  • Connectivity: ensure network connectivity between all components

The load balancer distributes API server traffic across the control plane nodes, providing availability and fault tolerance.
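
Any TCP load balancer that forwards port 6443 to the control plane nodes will work. Below is a minimal sketch assuming HAProxy on a dedicated VM; the 192.0.2.x addresses are placeholders for the IPs you plan in the control plane IP pool:

# Write a minimal TCP-mode HAProxy configuration for the Kubernetes API (assumed setup).
cat >/etc/haproxy/haproxy.cfg <<'EOF'
defaults
    mode tcp
    timeout connect 10s
    timeout client  1m
    timeout server  1m

frontend kube-apiserver
    bind *:6443
    default_backend control-plane

backend control-plane
    balance roundrobin
    option tcp-check
    server cp-01 192.0.2.11:6443 check
    server cp-02 192.0.2.12:6443 check
    server cp-03 192.0.2.13:6443 check
EOF

# Restart HAProxy to pick up the new configuration.
systemctl restart haproxy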

Configuring DCS Credentials

DCS credentials are stored in a Secret resource.

In the following example, <auth-secret-name> is the name of the Secret to be created:

apiVersion: v1
data:
  authUser: <base64-encoded-auth-user>
  authKey: <base64-encoded-auth-key>
  endpoint: <base64-encoded-endpoint>
kind: Secret
metadata:
  name: <auth-secret-name>
  namespace: cpaas-system
type: Opaque

Parameter | Description
.data.authUser | DCS platform API user login name (base64-encoded)
.data.authKey | DCS platform API user login password (base64-encoded)
.data.endpoint | DCS platform API address, including the http or https scheme (base64-encoded)
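
A sketch of producing the base64-encoded values and applying the Secret; the literal credentials and the file name are placeholders for illustration only:

# Base64-encode each value (note -n: no trailing newline) and paste the output into the manifest.
echo -n 'my-dcs-user' | base64
echo -n 'my-dcs-api-key' | base64
echo -n 'https://dcs.example.com' | base64

# Apply the Secret to the global cluster.
kubectl apply -f dcs-credentials-secret.yaml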

Configuring the IP and Hostname Pool

Plan in advance the IP addresses, hostnames, DNS servers, and other network information for the control plane virtual machines.

WARNING

The number of machine entries configured must be no fewer than the number of control plane nodes.

In the following example, <control-plane-iphostname-pool-name> is the resource name:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DCSIpHostnamePool
metadata:
  name: <control-plane-iphostname-pool-name>
  namespace: cpaas-system
spec:
  pool:
  - ip: "<control-plane-ip-1>"
    mask: "<control-plane-mask>"
    gateway: "<control-plane-gateway>"
    dns: "<control-plane-dns>"
    hostname: "<control-plane-hostname-1>"
    machineName: "<control-plane-machine-name-1>"

Parameter | Description
.spec.pool[].ip | IP address of the virtual machine to be created
.spec.pool[].mask | Subnet mask
.spec.pool[].gateway | Gateway IP address
.spec.pool[].dns | DNS server IP (separate multiple entries with commas)
.spec.pool[].machineName | Name of the virtual machine on the DCS platform
.spec.pool[].hostname | Hostname of the virtual machine
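
The KubeadmControlPlane later in this guide uses replicas: 3, so the pool needs at least three entries. A sketch with example values only (the 192.0.2.x addresses, names, and hostnames are placeholders):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DCSIpHostnamePool
metadata:
  name: cp-iphostname-pool   # example name
  namespace: cpaas-system
spec:
  pool:
  - ip: "192.0.2.11"
    mask: "255.255.255.0"
    gateway: "192.0.2.1"
    dns: "192.0.2.2,192.0.2.3"
    hostname: "cp-01"
    machineName: "demo-cp-01"
  - ip: "192.0.2.12"
    mask: "255.255.255.0"
    gateway: "192.0.2.1"
    dns: "192.0.2.2,192.0.2.3"
    hostname: "cp-02"
    machineName: "demo-cp-02"
  - ip: "192.0.2.13"
    mask: "255.255.255.0"
    gateway: "192.0.2.1"
    dns: "192.0.2.2,192.0.2.3"
    hostname: "cp-03"
    machineName: "demo-cp-03"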

Configuring the Machine Template (Control Plane)

The DCS machine template declares the configuration of the DCS machines that subsequent Cluster API components will create. It specifies the VM template, additional disks, CPU, memory, and other settings.

WARNING

You may add extra custom disks in the dcsMachineDiskSpec section, but you must keep every disk entry shown in the example (including the systemVolume entry and the /var/lib/kubelet, /var/lib/containerd, and /var/cpaas mount points). When adding disks, make sure none of these critical entries is omitted.

In the following example, <cp-dcs-machine-template-name> is the name of the control plane machine template:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DCSMachineTemplate
metadata:
  name: <cp-dcs-machine-template-name>
  namespace: cpaas-system
spec:
  template:
    spec:
      vmTemplateName: <vm-template-name>
      location:
        type: folder
        name: <folder-name>
      resource: # optional; template defaults are used if not specified
        type: cluster # cluster | host, optional
        name: <cluster-name> # optional
      vmConfig:
        dvSwitchName: <dv-switch-name> # optional
        portGroupName: <port-group-name> # optional
        dcsMachineCpuSpec:
          quantity: <control-plane-cpu>
        dcsMachineMemorySpec: # MB
          quantity: <control-plane-memory>
        dcsMachineDiskSpec: # GB
        - quantity: 0
          datastoreClusterName: <datastore-cluster-name>
          systemVolume: true
        - quantity: 10
          datastoreClusterName: <datastore-cluster-name>
          path: /var/lib/etcd
          format: xfs
        - quantity: 100
          datastoreClusterName: <datastore-cluster-name>
          path: /var/lib/kubelet
          format: xfs
        - quantity: 100
          datastoreClusterName: <datastore-cluster-name>
          path: /var/lib/containerd
          format: xfs
        - quantity: 100
          datastoreClusterName: <datastore-cluster-name>
          path: /var/cpaas
          format: xfs
      ipHostPoolRef:
        name: <control-plane-iphostname-pool-name>

Key parameters

Parameter | Type | Description
.spec.template.spec.vmTemplateName | string | Name of the DCS virtual machine template
.spec.template.spec.location | object | Location where the virtual machine is created (auto-selected if not specified)
.spec.template.spec.location.type | string | Type of the creation location (currently only "folder" is supported)
.spec.template.spec.location.name | string | Name of the folder in which the virtual machine is created
.spec.template.spec.resource | object | Compute resource used to create the virtual machine (auto-selected if not specified)
.spec.template.spec.resource.type | string | Compute resource type: cluster or host
.spec.template.spec.resource.name | string | Compute resource name
.spec.template.spec.vmConfig | object | Virtual machine configuration
.spec.template.spec.vmConfig.dvSwitchName | string | Virtual switch name (template default is used if not specified)
.spec.template.spec.vmConfig.portGroupName | string | Port group name (must belong to the switch above; template default is used if not specified)
.spec.template.spec.vmConfig.dcsMachineCpuSpec.quantity | int | Virtual machine CPU configuration (number of cores)
.spec.template.spec.vmConfig.dcsMachineMemorySpec.quantity | int | Virtual machine memory size (MB)
.spec.template.spec.vmConfig.dcsMachineDiskSpec[] | object | Virtual machine disk configuration
.spec.template.spec.vmConfig.dcsMachineDiskSpec[].quantity | int | Disk size (GB); for the system disk, 0 means the template's system disk size is used
.spec.template.spec.vmConfig.dcsMachineDiskSpec[].datastoreClusterName | string | Name of the datastore cluster the disk belongs to
.spec.template.spec.vmConfig.dcsMachineDiskSpec[].systemVolume | bool | Whether this disk is the system disk (only one disk may be true)
.spec.template.spec.vmConfig.dcsMachineDiskSpec[].path | string | Mount path for the disk (not mounted if not specified)
.spec.template.spec.vmConfig.dcsMachineDiskSpec[].format | string | Filesystem format
.spec.template.spec.ipHostPoolRef.name | string | Name of the referenced DCSIpHostnamePool
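
A quick way to validate the filled-in manifest against the provider's CRD schema before creating it; the file name is illustrative:

# Server-side dry run validates the manifest against the DCSMachineTemplate CRD without persisting it.
kubectl apply --dry-run=server -f cp-dcs-machine-template.yaml

# Create it for real and confirm it exists.
kubectl apply -f cp-dcs-machine-template.yaml
kubectl get dcsmachinetemplate -n cpaas-system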

Configuring the KubeadmControlPlane

The current DCS control plane implementation relies on the Cluster API kubeadm control plane provider, so a KubeadmControlPlane resource must be configured.

Most parameters in the example are already optimized or required configuration, but some may need to be customized for your environment.

In the following example, <kcp-name> is the resource name:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: <kcp-name>
  namespace: cpaas-system
  annotations:
    controlplane.cluster.x-k8s.io/skip-kube-proxy: ""
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
  kubeadmConfigSpec:
    users:
    - name: boot
      sshAuthorizedKeys:
      - "<ssh-authorized-keys>"
    format: ignition
    files:
    - path: /etc/kubernetes/admission/psa-config.yaml
      owner: "root:root"
      permissions: "0644"
      content: |
        apiVersion: apiserver.config.k8s.io/v1
        kind: AdmissionConfiguration
        plugins:
        - name: PodSecurity
          configuration:
            apiVersion: pod-security.admission.config.k8s.io/v1
            kind: PodSecurityConfiguration
            defaults:
              enforce: "privileged"
              enforce-version: "latest"
              audit: "baseline"
              audit-version: "latest"
              warn: "baseline"
              warn-version: "latest"
            exemptions:
              usernames: []
              runtimeClasses: []
              namespaces:
              - kube-system
              - cpaas-system
    - path: /etc/kubernetes/patches/kubeletconfiguration0+strategic.json
      owner: "root:root"
      permissions: "0644"
      content: |
        {
          "apiVersion": "kubelet.config.k8s.io/v1beta1",
          "kind": "KubeletConfiguration",
          "protectKernelDefaults": true,
          "tlsCertFile": "/etc/kubernetes/pki/kubelet.crt",
          "tlsPrivateKeyFile": "/etc/kubernetes/pki/kubelet.key",
          "streamingConnectionIdleTimeout": "5m",
          "clientCAFile": "/etc/kubernetes/pki/ca.crt"
        }
    - path: /etc/kubernetes/encryption-provider.conf
      owner: "root:root"
      append: false
      permissions: "0644"
      content: |
        apiVersion: apiserver.config.k8s.io/v1
        kind: EncryptionConfiguration
        resources:
        - resources:
          - secrets
          providers:
          - aescbc:
              keys:
              - name: key1
                secret: <base64-encoded-secret>
    - path: /etc/kubernetes/audit/policy.yaml
      owner: "root:root"
      append: false
      permissions: "0644"
      content: |
        apiVersion: audit.k8s.io/v1
        kind: Policy
        # Don't generate audit events for all requests in RequestReceived stage.
        omitStages:
        - "RequestReceived"
        rules:
        # The following requests were manually identified as high-volume and low-risk,
        # so drop them.
        - level: None
          users:
          - system:kube-controller-manager
          - system:kube-scheduler
          - system:serviceaccount:kube-system:endpoint-controller
          verbs: ["get", "update"]
          namespaces: ["kube-system"]
          resources:
          - group: "" # core
            resources: ["endpoints"]
        # Don't log these read-only URLs.
        - level: None
          nonResourceURLs:
          - /healthz*
          - /version
          - /swagger*
        # Don't log events requests.
        - level: None
          resources:
          - group: "" # core
            resources: ["events"]
        # Don't log devops requests.
        - level: None
          resources:
          - group: "devops.alauda.io"
        # Don't log get list watch requests.
        - level: None
          verbs: ["get", "list", "watch"]
        # Don't log lease operation
        - level: None
          resources:
          - group: "coordination.k8s.io"
            resources: ["leases"]
        # Don't log access review and token review requests.
        - level: None
          resources:
          - group: "authorization.k8s.io"
            resources: ["subjectaccessreviews", "selfsubjectaccessreviews"]
          - group: "authentication.k8s.io"
            resources: ["tokenreviews"]
        # Don't log imagewhitelists and namespaceoverviews operations
        - level: None
          resources:
          - group: "app.alauda.io"
            resources: ["imagewhitelists"]
          - group: "k8s.io"
            resources: ["namespaceoverviews"]
        # Secrets, ConfigMaps can contain sensitive & binary data,
        # so only log at the Metadata level.
        - level: Metadata
          resources:
          - group: "" # core
            resources: ["secrets", "configmaps"]
        # devops installmanifests and katanomis can contain huge data and sensitive data, only log at the Metadata level.
        - level: Metadata
          resources:
          - group: "operator.connectors.alauda.io"
            resources: ["installmanifests"]
          - group: "operators.katanomi.dev"
            resources: ["katanomis"]
        # Default level for known APIs
        - level: RequestResponse
          resources:
          - group: "" # core
          - group: "aiops.alauda.io"
          - group: "apps"
          - group: "app.k8s.io"
          - group: "authentication.istio.io"
          - group: "auth.alauda.io"
          - group: "autoscaling"
          - group: "asm.alauda.io"
          - group: "clusterregistry.k8s.io"
          - group: "crd.alauda.io"
          - group: "infrastructure.alauda.io"
          - group: "monitoring.coreos.com"
          - group: "operators.coreos.com"
          - group: "networking.istio.io"
          - group: "extensions.istio.io"
          - group: "install.istio.io"
          - group: "security.istio.io"
          - group: "telemetry.istio.io"
          - group: "opentelemetry.io"
          - group: "networking.k8s.io"
          - group: "portal.alauda.io"
          - group: "rbac.authorization.k8s.io"
          - group: "storage.k8s.io"
          - group: "tke.cloud.tencent.com"
          - group: "devopsx.alauda.io"
          - group: "core.katanomi.dev"
          - group: "deliveries.katanomi.dev"
          - group: "integrations.katanomi.dev"
          - group: "artifacts.katanomi.dev"
          - group: "builds.katanomi.dev"
          - group: "versioning.katanomi.dev"
          - group: "sources.katanomi.dev"
          - group: "tekton.dev"
          - group: "operator.tekton.dev"
          - group: "eventing.knative.dev"
          - group: "flows.knative.dev"
          - group: "messaging.knative.dev"
          - group: "operator.knative.dev"
          - group: "sources.knative.dev"
          - group: "operator.devops.alauda.io"
          - group: "flagger.app"
          - group: "jaegertracing.io"
          - group: "velero.io"
            resources: ["deletebackuprequests"]
          - group: "connectors.alauda.io"
          - group: "operator.connectors.alauda.io"
            resources: ["connectorscores", "connectorsgits", "connectorsocis"]
        # Default level for all other requests.
        - level: Metadata
    preKubeadmCommands:
    - while ! ip route | grep -q "default via"; do sleep 1; done; echo "NetworkManager started"
    - mkdir -p /run/cluster-api && restorecon -Rv /run/cluster-api
    - if [ -f /etc/disk-setup.sh ]; then bash /etc/disk-setup.sh; fi
    postKubeadmCommands:
    - chmod 600 /var/lib/kubelet/config.yaml
    clusterConfiguration:
      imageRepository: cloud.alauda.io/alauda
      dns:
        imageTag: <dns-image-tag>
      etcd:
        local:
          imageTag: <etcd-image-tag>
      apiServer:
        extraArgs:
          audit-log-format: json
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          audit-log-maxsize: "200"
          profiling: "false"
          audit-log-mode: batch
          audit-log-path: /etc/kubernetes/audit/audit.log
          audit-policy-file: /etc/kubernetes/audit/policy.yaml
          tls-cipher-suites: "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
          encryption-provider-config: /etc/kubernetes/encryption-provider.conf
          admission-control-config-file: /etc/kubernetes/admission/psa-config.yaml
          tls-min-version: VersionTLS12
          kubelet-certificate-authority: /etc/kubernetes/pki/ca.crt
        extraVolumes:
        - name: vol-dir-0
          hostPath: /etc/kubernetes
          mountPath: /etc/kubernetes
          pathType: Directory
      controllerManager:
        extraArgs:
          bind-address: "::"
          profiling: "false"
          tls-min-version: VersionTLS12
          flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
      scheduler:
        extraArgs:
          bind-address: "::"
          tls-min-version: VersionTLS12
          profiling: "false"
    initConfiguration:
      patches:
        directory: /etc/kubernetes/patches
      nodeRegistration:
        kubeletExtraArgs:
          node-labels: "kube-ovn/role=master"
          provider-id: PROVIDER_ID
          volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
          protect-kernel-defaults: "true"
    joinConfiguration:
      patches:
        directory: /etc/kubernetes/patches
      nodeRegistration:
        kubeletExtraArgs:
          node-labels: "kube-ovn/role=master"
          provider-id: PROVIDER_ID
          volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
          protect-kernel-defaults: "true"
  machineTemplate:
    nodeDrainTimeout: 1m
    nodeDeletionTimeout: 5m
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DCSMachineTemplate
      name: <cp-dcs-machine-template-name>
  replicas: 3
  version: <control-plane-kubernetes-version>

Key parameters

Parameter | Type | Description
.spec.kubeadmConfigSpec | object | kubeadm bootstrap provider parameters, used to customize the VM startup configuration (users, network, files, etc.)
.spec.kubeadmConfigSpec.users[] | object | User configuration
.spec.machineTemplate.infrastructureRef | string | Name of the DCSMachineTemplate used to create DCSMachine resources
.spec.replicas | int | Number of control plane VM replicas (must not exceed the number of entries configured in the referenced IpHostnamePool)
.spec.version | string | Control plane Kubernetes version (must match the VM template version)
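
Two placeholders in the KubeadmControlPlane above have well-known formats: the aescbc key in encryption-provider.conf must be a base64-encoded random key (32 bytes is the commonly recommended length for Kubernetes EncryptionConfiguration), and sshAuthorizedKeys takes SSH public keys. A sketch of generating both; the key file name is arbitrary:

# Generate a random 32-byte key for the aescbc provider and base64-encode it
# (use the output as <base64-encoded-secret>).
head -c 32 /dev/urandom | base64

# Generate an SSH key pair and paste the public key into sshAuthorizedKeys.
ssh-keygen -t ed25519 -f ./cluster-boot-key -N ""
cat ./cluster-boot-key.pub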

Configuring the DCSCluster

DCSCluster is the infrastructure cluster declaration. Because the DCS platform does not currently provide a native load balancer, you must configure a load balancer manually in advance and bind it to the IP addresses configured in the "Configuring the IP and Hostname Pool" section.

In the following example, <dcs-cluster-name> is the resource name:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: DCSCluster
metadata:
  name: "<dcs-cluster-name>"
  namespace: cpaas-system
spec:
  controlPlaneLoadBalancer: # high-availability configuration
    host: <load-balancer-ip-or-domain-name>
    port: 6443
    type: external
  credentialSecretRef: # reference to the credentials Secret
    name: <auth-secret-name>
  controlPlaneEndpoint: # Cluster API convention; keep consistent with controlPlaneLoadBalancer
    host: <load-balancer-ip-or-domain-name>
    port: 6443
  networkType: kube-ovn
  site: <site>  # DCS platform parameter: resource pool ID

Key parameters

Parameter | Type | Description
.spec.controlPlaneLoadBalancer | object | How the control plane API server is exposed
.spec.controlPlaneLoadBalancer.type | string | Currently only "external" is supported
.spec.controlPlaneLoadBalancer.host | string | Load balancer IP or domain name
.spec.controlPlaneLoadBalancer.port | int64 | Port number
.spec.credentialSecretRef.name | string | DCS cluster credentials (see "Configuring DCS Credentials")
.spec.controlPlaneEndpoint | object | Address at which the API server is exposed (Cluster API convention)
.spec.networkType | string | Currently only "kube-ovn" is supported
.spec.site | string | DCS platform site ID
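
Before applying the DCSCluster, it can help to confirm the referenced Secret exists and that the load balancer address is reachable; the placeholders below are the same ones used elsewhere in this guide:

# Confirm the credentials Secret referenced by credentialSecretRef is present.
kubectl get secret <auth-secret-name> -n cpaas-system

# Check that the load balancer address resolves and that port 6443 accepts connections
# (the backends will not answer until the control plane machines come up).
nc -vz <load-balancer-ip-or-domain-name> 6443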

Configuring the Cluster

The Cluster resource in Cluster API declares the cluster and must reference the corresponding control plane resource and infrastructure cluster resource:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    capi.cpaas.io/resource-group-version: infrastructure.cluster.x-k8s.io/v1beta1
    capi.cpaas.io/resource-kind: DCSCluster
    cpaas.io/kube-ovn-version: <kube-ovn-version>
    cpaas.io/kube-ovn-join-cidr: <kube-ovn-join-cidr>
  labels:
    cluster-type: DCS
  name: <cluster-name>
  namespace: cpaas-system
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - <pods-cidr>
    services:
      cidrBlocks:
      - <services-cidr>
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: <kcp-name>
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DCSCluster
    name: <dcs-cluster-name>

Key parameters

Parameter | Type | Description
.spec.clusterNetwork.pods.cidrBlocks[] | string | Pod network CIDR
.spec.clusterNetwork.services.cidrBlocks[] | string | Service network CIDR
.spec.controlPlaneRef | object | Control plane reference (see "Configuring the KubeadmControlPlane")
.spec.infrastructureRef | object | Infrastructure cluster reference (see "Configuring the DCSCluster")
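
Once the Cluster resource is applied, provisioning can be followed from the global cluster; the clusterctl call is optional and assumes the CLI is installed locally:

# Watch the Cluster and its Machines move through the provisioning phases.
kubectl get cluster,machines -n cpaas-system -w

# Optional: show the full condition tree of the cluster (requires clusterctl).
clusterctl describe cluster <cluster-name> -n cpaas-system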

Node Deployment

See the Deploy Nodes page for instructions.

Cluster Verification

After all cluster resources have been deployed, verify that the cluster was created successfully and is running normally.

Using the Console

  1. Go to the Administrator view of the console
  2. Navigate to Clusters → Clusters
  3. Find the newly created cluster in the cluster list
  4. Confirm the cluster status shows Running
  5. Check that all control plane and worker nodes are Ready

Using kubectl

You can also verify the cluster with kubectl commands:

# Check the cluster status
kubectl get cluster -n cpaas-system <cluster-name>

# Verify the control plane nodes
kubectl get kubeadmcontrolplane -n cpaas-system <kcp-name>

# Check machine status
kubectl get machines -n cpaas-system

# Verify the cluster deployment status
kubectl get clustermodule <cluster-name> -o jsonpath='{.status.base.deployStatus}'

Expected results

A successfully created cluster should show:

  • Cluster status: Running / Provisioned
  • All control plane machines: Running
  • All worker nodes (if deployed): Running
  • Kubernetes nodes: Ready
  • Cluster module status: Completed