#Configuring High Availability for Virtual Machines

#Overview

Hardware is imperfect and software contains bugs. When node-level failures occur, such as a kernel hang or a network interface controller (NIC) failure, the work required from the cluster does not decrease, and workloads from the affected nodes need to be restarted somewhere. However, some workloads, such as those that use ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.

Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as fencing, before initiating recovery of the workload, known as remediation, and ideally recovery of the node as well.

It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, Alauda Container Platform provides multiple components that automate failure detection, fencing, and remediation.

#Glossary

| Acronym | Term |
| --- | --- |
| SNR | Self Node Remediation |
| NHC | Node Health Check |

#Component Overview

  • Self Node Remediation Operator

    The Self Node Remediation Operator is an Alauda Container Platform add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI) or an API for node provisioning.

    Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.

  • Node Health Check Operator

    The Node Health Check Operator is an Alauda Container Platform add-on Operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system, so it must be configured with an external system that provides these features. By default, it is configured to use the Self Node Remediation system.

#Flow of events during fencing and remediation

#Procedure

#Operator Listing

  • Download the Alauda Build of SelfNodeRemediation installation package corresponding to your platform architecture.
  • Upload the Alauda Build of SelfNodeRemediation installation package using the Upload Packages mechanism.
  • Download the Alauda Build of NodeHealthCheck installation package corresponding to your platform architecture.
  • Upload the Alauda Build of NodeHealthCheck installation package using the Upload Packages mechanism.

#Deploying Self Node Remediation Operator

  1. Log in and go to the Administrator page.

  2. Click Marketplace > OperatorHub to enter the OperatorHub page.

  3. Find the Alauda Build of SelfNodeRemediation, click Install, and navigate to the Install Alauda Build of SelfNodeRemediation page.

    Configuration Parameters:

    | Parameter | Recommended Configuration |
    | --- | --- |
    | Channel | The default channel is stable. |
    | Installation Mode | Cluster: All namespaces in the cluster share a single Operator instance for creation and management, resulting in lower resource usage. |
    | Installation Place | Select Recommended. Only the workload-availability namespace is supported. |
    | Upgrade Strategy | Manual: When a new version is available in OperatorHub, manual confirmation is required to upgrade the Operator to the latest version. |
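Once the installation finishes, the result can also be checked from the command line. The following is a minimal sketch, assuming the Operator was installed into the workload-availability namespace and that OperatorHub installations are managed by OLM, so that a ClusterServiceVersion exists. The helper name `check_operator_install` is hypothetical.

```shell
# Hypothetical helper: list the Operator's ClusterServiceVersion and pods.
# Assumes an OLM-managed installation in the given namespace.
check_operator_install() {
  ns="${1:-workload-availability}"
  # With OLM, the CSV phase reads "Succeeded" once the Operator is installed.
  kubectl -n "$ns" get clusterserviceversions
  # Operator pods (and, for Self Node Remediation, agent pods) should be Running.
  kubectl -n "$ns" get pods -o wide
}
```

Run `check_operator_install` with no arguments to use the recommended namespace, or pass another namespace as the first argument.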

#Configuring Self Node Remediation Operator (optional)

The Self Node Remediation Operator creates the SelfNodeRemediationConfig CR with the name self-node-remediation-config. The CR is created in the namespace of the Self Node Remediation Operator.

Note

A change in the SelfNodeRemediationConfig CR re-creates the Self Node Remediation daemon set.

The SelfNodeRemediationConfig CR resembles the following YAML file:

apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: workload-availability
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 
  watchdogFilePath: /dev/watchdog 
  isSoftwareRebootEnabled: true 
  apiServerTimeout: 15s 
  apiCheckInterval: 5s 
  maxApiErrorThreshold: 3 
  peerApiServerTimeout: 5s 
  peerDialTimeout: 5s 
  peerRequestTimeout: 5s 
  peerUpdateInterval: 15m 
  hostPort: 30001 
  customDsTolerations: 
  - effect: NoSchedule
    key: node-role.kubernetes.io.infra
    operator: Equal
    value: "value1"
    tolerationSeconds: 3600

Parameters

| Parameter | Description |
| --- | --- |
| safeTimeToAssumeNodeRebootedSeconds | Specify an optional time duration that the Operator waits before recovering affected workloads running on an unhealthy node. Starting replacement pods while the original pods are still running on the failed node can lead to data corruption and a violation of run-once semantics. The Operator calculates a minimum duration using the values in the apiServerTimeout, apiCheckInterval, maxApiErrorThreshold, peerDialTimeout, and peerRequestTimeout fields, as well as the watchdog timeout and the cluster size at the time of remediation. |
| watchdogFilePath | Specify the file path of the watchdog device on the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path. If a watchdog device is unavailable, the SelfNodeRemediationConfig CR uses a software reboot. |
| isSoftwareRebootEnabled | Specify whether to enable software reboot of the unhealthy nodes. The default value is true. To disable the software reboot, set the value to false. |
| apiServerTimeout | Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be greater than or equal to 10 milliseconds. |
| apiCheckInterval | Specify the frequency to check connectivity with each API server. The interval must be greater than or equal to 1 second. |
| maxApiErrorThreshold | Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1. |
| peerApiServerTimeout | Specify the timeout duration for the peer to connect to the API server. The timeout duration must be greater than or equal to 10 milliseconds. |
| peerDialTimeout | Specify the timeout duration for establishing a connection with the peer. The timeout duration must be greater than or equal to 10 milliseconds. |
| peerRequestTimeout | Specify the timeout duration to get a response from the peer. The timeout duration must be greater than or equal to 10 milliseconds. |
| peerUpdateInterval | Specify the frequency to update peer information, such as the IP address. The interval must be greater than or equal to 10 seconds. |
| hostPort | Specify an optional value to change the port that Self Node Remediation agents use for internal communication. The value must be greater than 0. The default value is port 30001. |
| customDsTolerations | Specify custom tolerations for the Self Node Remediation agents that run in the DaemonSet, to support remediation for different types of nodes. |

Note
  • The Self Node Remediation Operator creates the CR by default in the deployment namespace.
  • The name for the CR must be self-node-remediation-config.
  • You can only have one SelfNodeRemediationConfig CR.
  • Deleting the SelfNodeRemediationConfig CR disables Self Node Remediation.
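Because only one SelfNodeRemediationConfig CR exists and its name is fixed, a single field can be changed in place with a merge patch instead of re-applying the whole file. The following is a minimal sketch, assuming the CR lives in the workload-availability namespace; the helper name `snr_set_safe_time` and the value 240 are illustrative only. As noted above, changing the CR re-creates the Self Node Remediation daemon set.

```shell
# Hypothetical helper: set safeTimeToAssumeNodeRebootedSeconds on the single
# SelfNodeRemediationConfig CR using a JSON merge patch
# (custom resources do not support strategic merge patches).
snr_set_safe_time() {
  seconds="$1"
  # Reject empty or non-numeric input before calling the API.
  case "$seconds" in
    ''|*[!0-9]*) echo "seconds must be a non-negative integer" >&2; return 1 ;;
  esac
  kubectl -n workload-availability patch selfnoderemediationconfig \
    self-node-remediation-config --type merge \
    -p "{\"spec\":{\"safeTimeToAssumeNodeRebootedSeconds\":${seconds}}}"
}
```

For example, `snr_set_safe_time 240` raises the waiting time to 240 seconds. Choose a value that fits your watchdog timeout and cluster size.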

#Configuring Self Node Remediation Template (optional)

The Self Node Remediation Operator also creates the SelfNodeRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes, with the aim of recovering workloads faster. The following remediation strategies are available:

  • Automatic

    This remediation strategy simplifies the remediation process by letting the Self Node Remediation Operator decide on the most suitable remediation strategy for the cluster. This strategy checks if the OutOfServiceTaint strategy is available on the cluster. If the OutOfServiceTaint strategy is available, the Operator selects the OutOfServiceTaint strategy. If the OutOfServiceTaint strategy is not available, the Operator selects the ResourceDeletion strategy. Automatic is the default remediation strategy.

  • ResourceDeletion

    This remediation strategy removes the pods on the node rather than removing the node object.

  • OutOfServiceTaint

    This remediation strategy implicitly causes the removal of the pods and associated volume attachments on the node rather than removing the node object. It achieves this by placing the out-of-service taint (node.kubernetes.io/out-of-service) on the node.
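The choice that the Automatic strategy makes can be sketched as a two-way decision. This is only an illustration of the description above, not the Operator's code: the function name `resolve_automatic_strategy` is hypothetical, and how the Operator detects whether OutOfServiceTaint is available is not modeled here.

```shell
# Sketch of the Automatic strategy's decision as described above.
# $1: "true" if the OutOfServiceTaint strategy is available on the cluster
#     (the detection itself is out of scope for this sketch).
resolve_automatic_strategy() {
  if [ "$1" = "true" ]; then
    echo "OutOfServiceTaint"
  else
    echo "ResourceDeletion"
  fi
}
```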

The Self Node Remediation Operator creates the SelfNodeRemediationTemplate CR self-node-remediation-automatic-strategy-template for the Automatic remediation strategy.

The SelfNodeRemediationTemplate CR resembles the following YAML file:

apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  creationTimestamp: "2022-03-02T08:02:40Z"
  name: self-node-remediation-<remediation_object>-deletion-template 
  namespace: workload-availability
spec:
  template:
    spec:
      remediationStrategy: <remediation_strategy>  

Parameters

| Parameter | Description |
| --- | --- |
| remediation_strategy | Values: Automatic, ResourceDeletion, OutOfServiceTaint |
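To define an additional template with a fixed strategy, the manifest above can be generated with the placeholder filled in. The following is a minimal sketch: the helper name `snr_template_manifest` and the template name used in the usage note are hypothetical, and the namespace is assumed to be workload-availability as in the example above.

```shell
# Hypothetical helper: print a SelfNodeRemediationTemplate manifest.
# $1: template name (illustrative), $2: remediation strategy.
snr_template_manifest() {
  name="$1"
  strategy="$2"
  # Accept only the three strategies listed in the table above.
  case "$strategy" in
    Automatic|ResourceDeletion|OutOfServiceTaint) ;;
    *) echo "unsupported remediationStrategy: $strategy" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  name: ${name}
  namespace: workload-availability
spec:
  template:
    spec:
      remediationStrategy: ${strategy}
EOF
}
```

For example, `snr_template_manifest my-resource-deletion-template ResourceDeletion | kubectl apply -f -` would create a template that always uses the ResourceDeletion strategy.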

#Deploying Node Health Check Operator

  1. Log in and go to the Administrator page.

  2. Click Marketplace > OperatorHub to enter the OperatorHub page.

  3. Find the Alauda Build of NodeHealthCheck, click Install, and navigate to the Install Alauda Build of NodeHealthCheck page.

    Configuration Parameters:

    | Parameter | Recommended Configuration |
    | --- | --- |
    | Channel | The default channel is stable. |
    | Installation Mode | Cluster: All namespaces in the cluster share a single Operator instance for creation and management, resulting in lower resource usage. |
    | Installation Place | Select Recommended. Only the workload-availability namespace is supported. |
    | Upgrade Strategy | Manual: When a new version is available in OperatorHub, manual confirmation is required to upgrade the Operator to the latest version. |

#Create NodeHealthCheck instance

Execute the following command on a control plane node of the cluster:

cat << EOF | kubectl apply -f -
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-<name>
spec:
  minHealthy: <minHealthy>
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-automatic-strategy-template
    namespace: workload-availability
  selector: <selector>
  unhealthyConditions:
    - duration: 300s
      status: 'False'
      type: Ready
    - duration: 300s
      status: Unknown
      type: Ready
EOF
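A filled-in version of the command above might look like the following sketch, which targets worker nodes only. The name suffix `worker`, the helper name `nhc_worker_manifest`, and the node label key `node-role.kubernetes.io/worker` are assumptions; check the labels on your nodes before using it.

```shell
# Hypothetical helper: print a filled-in NodeHealthCheck manifest for
# worker nodes. The name suffix and label key are assumptions.
nhc_worker_manifest() {
  cat <<'EOF'
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-worker
spec:
  minHealthy: 51%
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-automatic-strategy-template
    namespace: workload-availability
  selector:
    matchExpressions:
      # Assumed label: adjust to the label your worker nodes carry.
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    - duration: 300s
      status: 'False'
      type: Ready
    - duration: 300s
      status: Unknown
      type: Ready
EOF
}
```

Pipe the output to the cluster with `nhc_worker_manifest | kubectl apply -f -`.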

Parameters:

| Parameter | Description |
| --- | --- |
| name | The name of the resource. |
| minHealthy | Specify the minimum proportion of healthy nodes. Faulty nodes are remediated only when the proportion of healthy nodes is greater than or equal to this value. The default value is 51%. |
| selector | Specify a LabelSelector that matches the nodes to be checked and remediated. Avoid selecting control-plane and worker nodes in the same instance. |
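The effect of minHealthy is easiest to see with numbers. The sketch below follows the description in the table literally (healthy nodes divided by selected nodes must be at least minHealthy) and uses integer arithmetic only; the function name `remediation_allowed` is hypothetical, and the Operator's own rounding may differ in edge cases.

```shell
# Return 0 (remediation allowed) when healthy/total >= min_percent.
# Integer arithmetic only: healthy * 100 >= total * min_percent.
# $1: healthy nodes, $2: nodes matched by the selector, $3: percent (default 51)
remediation_allowed() {
  healthy="$1"
  total="$2"
  min_percent="${3:-51}"
  [ "$total" -gt 0 ] && [ $((healthy * 100)) -ge $((total * min_percent)) ]
}
```

With the default of 51%, one failed node out of five leaves 80% healthy, so `remediation_allowed 4 5` succeeds. Two failed nodes out of four leave exactly 50% healthy, so `remediation_allowed 2 4` fails and no remediation is started.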

#Verification (optional)

Simulate a failure of the node on which the virtual machine is running, and confirm that the virtual machine is automatically rescheduled onto another node.
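While the simulated failure is in progress, the relevant state can be observed from another terminal. The following is a minimal sketch, assuming the plural resource names `nodehealthchecks` and `selfnoderemediations` for the CRDs installed by the two Operators and KubeVirt's `vmi` short name for virtual machine instances; the helper name `ha_status` is hypothetical.

```shell
# Hypothetical helper: snapshot the state relevant to virtual machine failover.
ha_status() {
  # Node readiness: the failed node is expected to leave the Ready state.
  kubectl get nodes
  # Health check instances and their current status.
  kubectl get nodehealthchecks
  # Remediation objects created for unhealthy nodes, if any.
  kubectl get selfnoderemediations --all-namespaces
  # The node column shows where each virtual machine instance runs.
  kubectl get vmi --all-namespaces -o wide
}
```

Run `ha_status` before and after the failure; once remediation completes, the virtual machine instance should be listed on a different node.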