Configuring High Availability for Virtual Machines
Overview
Hardware is imperfect and software contains bugs. When node-level failures occur, such as kernel hangs or network interface controller (NIC) failures, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. However, some workloads, such as those using ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics.
Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as fencing, before initiating recovery of the workload, known as remediation, and ideally recovery of the node as well.
It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To reduce the need for such intervention, Alauda Container Platform provides multiple components that automate failure detection, fencing, and remediation.
Glossary
Component Overview
- Self Node Remediation Operator
  The Self Node Remediation Operator is an Alauda Container Platform add-on Operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI) or an API for node provisioning.
  Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check.
- Node Health Check Operator
  The Node Health Check Operator is an Alauda Container Platform add-on Operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system, so it must be configured with an external system that provides these features. By default, it is configured to use the Self Node Remediation system.
Flow of events during fencing and remediation
Procedure
Operator Listing
- Download the Alauda Build of SelfNodeRemediation installation package corresponding to your platform architecture.
- Upload the Alauda Build of SelfNodeRemediation installation package using the Upload Packages mechanism.
- Download the Alauda Build of NodeHealthCheck installation package corresponding to your platform architecture.
- Upload the Alauda Build of NodeHealthCheck installation package using the Upload Packages mechanism.
Deploying Self Node Remediation Operator
- Log in and go to the Administrator page.
- Click Marketplace > OperatorHub to enter the OperatorHub page.
- Find the Alauda Build of SelfNodeRemediation, click Install, and navigate to the Install Alauda Build of SelfNodeRemediation page.
Configuration Parameters:
Configuring Self Node Remediation Operator (optional)
The Self Node Remediation Operator creates the SelfNodeRemediationConfig CR with the name self-node-remediation-config. The CR is created in the namespace of the Self Node Remediation Operator.
A change in the SelfNodeRemediationConfig CR re-creates the Self Node Remediation daemon set.
The SelfNodeRemediationConfig CR resembles the following YAML file:
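As a minimal sketch, a SelfNodeRemediationConfig CR might look like the following. The API group and field names are assumed to match the upstream medik8s Self Node Remediation project. The namespace is a placeholder, and every value shown is illustrative rather than a recommended setting.

```yaml
# Sketch only: field names assumed from the upstream medik8s project;
# values are illustrative, not tuned recommendations.
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config      # the name must be exactly this
  namespace: <operator-namespace>         # placeholder: namespace where the Operator is deployed
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180  # assumed wait before treating the node as rebooted
  watchdogFilePath: /dev/watchdog           # watchdog device used to trigger the reboot
  isSoftwareRebootEnabled: true             # fall back to a software reboot if no watchdog is usable
  apiServerTimeout: 15s
  apiCheckInterval: 5s
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
```

Because a change to this CR re-creates the Self Node Remediation daemon set, it is reasonable to edit it in a maintenance window and to confirm the field names against the CRD installed on the cluster first.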
Parameters
- The Self Node Remediation Operator creates the CR by default in the deployment namespace.
- The name for the CR must be self-node-remediation-config.
- You can only have one SelfNodeRemediationConfig CR.
- Deleting the SelfNodeRemediationConfig CR disables Self Node Remediation.
Configuring Self Node Remediation Template (optional)
The Self Node Remediation Operator also creates the SelfNodeRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes, which is intended to recover workloads faster. The following remediation strategies are available:
- Automatic
  This remediation strategy simplifies the remediation process by letting the Self Node Remediation Operator decide on the most suitable remediation strategy for the cluster. It checks whether the OutOfServiceTaint strategy is available on the cluster. If the OutOfServiceTaint strategy is available, the Operator selects the OutOfServiceTaint strategy. If the OutOfServiceTaint strategy is not available, the Operator selects the ResourceDeletion strategy. Automatic is the default remediation strategy.
- ResourceDeletion
  This remediation strategy removes the pods on the node rather than removing the node object.
- OutOfServiceTaint
  This remediation strategy implicitly causes the removal of the pods and associated volume attachments on the node rather than removing the node object. It achieves this by placing the out-of-service taint on the node.
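To make the OutOfServiceTaint strategy more concrete, the fragment below sketches how the taint would appear on a Node object. It assumes the standard Kubernetes out-of-service taint key, and the `nodeshutdown` value and `NoExecute` effect are assumptions based on the upstream Kubernetes non-graceful node shutdown feature. The node name is hypothetical.

```yaml
# Sketch only: what a tainted Node might look like after remediation starts.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                      # hypothetical node name
spec:
  taints:
    - key: node.kubernetes.io/out-of-service
      value: nodeshutdown             # assumed value
      effect: NoExecute               # evicts pods that do not tolerate the taint
```

With this taint present, Kubernetes can force-delete the pods and detach their volumes from the unreachable node, which is why the node object itself does not need to be removed.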
The Self Node Remediation Operator creates a SelfNodeRemediationTemplate CR named self-node-remediation-automatic-strategy-template, which the Automatic remediation strategy uses.
The SelfNodeRemediationTemplate CR resembles the following YAML file:
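As a minimal sketch, a SelfNodeRemediationTemplate CR for the Automatic strategy might look like the following. The API group and the `remediationStrategy` field are assumed to match the upstream medik8s project, and the namespace is a placeholder.

```yaml
# Sketch only: structure assumed from the upstream medik8s project.
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  name: self-node-remediation-automatic-strategy-template
  namespace: <operator-namespace>     # placeholder: namespace where the Operator is deployed
spec:
  template:
    spec:
      # One of: Automatic, ResourceDeletion, OutOfServiceTaint
      remediationStrategy: Automatic
```

To pin a specific strategy instead of letting the Operator choose, one option is to create a separate template with a different name and set `remediationStrategy` to `ResourceDeletion` or `OutOfServiceTaint`.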
Parameters
Deploying Node Health Check Operator
- Log in and go to the Administrator page.
- Click Marketplace > OperatorHub to enter the OperatorHub page.
- Find the Alauda Build of NodeHealthCheck, click Install, and navigate to the Install Alauda Build of NodeHealthCheck page.
Configuration Parameters:
Create NodeHealthCheck instance
Execute the following command on the cluster control node:
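As a minimal sketch, a NodeHealthCheck manifest that delegates remediation to Self Node Remediation might look like the following. It could be saved to a file and applied with `kubectl apply -f <file>`. The API group and field names are assumed to match the upstream medik8s project. The resource name, namespace, selector, thresholds, and durations are illustrative assumptions.

```yaml
# Sketch only: field names assumed from the upstream medik8s project;
# name, selector, and durations are illustrative.
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample          # hypothetical name
spec:
  minHealthy: 51%                       # skip remediation if fewer selected nodes are healthy
  remediationTemplate:
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    name: self-node-remediation-automatic-strategy-template
    namespace: <operator-namespace>     # placeholder: namespace of the Self Node Remediation Operator
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker   # assumes worker nodes carry this label
        operator: Exists
  unhealthyConditions:
    - type: Ready
      status: "False"
      duration: 300s                    # how long the condition must persist
    - type: Ready
      status: Unknown
      duration: 300s
```

Shorter durations shorten virtual machine recovery time but raise the risk of remediating a node that is only briefly unreachable, so the values are worth tuning for each cluster.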
Parameters:
Verification (optional)
Simulate a failure of the node on which the virtual machine is running, and confirm that the virtual machine is automatically rescheduled onto another node.