etcd Backup and Restore

The etcd service on the cluster is a distributed key-value store responsible for storing cluster configuration information. etcd is deployed on all control plane nodes of the cluster.

After installing the Alauda Container Platform Cluster Enhancer plugin, an EtcdBackupConfiguration resource is automatically created for the cluster. This resource describes the backup data sources (control plane nodes and backup paths), the backup storage locations, the backup method, and more. Each execution of the backup policy generates a new backup record, so you can back up the cluster configuration on demand or automatically on a periodic schedule.

Prerequisites

To enable etcd backup:

  1. Download Alauda Container Platform Cluster Enhancer from the Customer Portal.
  2. Upload the package to the platform.
  3. Install the plugin on your cluster.

After installation, an EtcdBackupConfiguration resource is automatically created.
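
A quick way to confirm that the resource was created is to list it with kubectl, for example:

kubectl get etcdbackupconfigurations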

How it works

  • etcd backup is provided by the Alauda Container Platform Cluster Enhancer plugin.
  • Both local storage and S3-compatible object storage are supported. By default, backups are stored locally in /cpaas. Configuring S3 storage creates an additional copy in the S3 bucket; local backups continue to be generated.
  • For clusters running on Immutable OS, S3 storage is required (local storage is not supported).

Configuration Reference

You can configure the EtcdBackupConfiguration resource to customize backup schedules, retention policies, and storage options.

Schedule and Retention

  • schedule: Defines the backup frequency using standard cron syntax.
    • Example: 0 0 * * * (Run backup daily at midnight).
  • localStorage: Configures local backup storage.
    • path: The directory on the host where backups are stored. Default is /cpaas.
    • ttl: The retention period for backup files in seconds. Backups older than this duration will be automatically deleted.
      • Example: 7776000 (90 days).
  • paused: Set to true to temporarily suspend automatic backups without deleting the configuration.

Example configuration:

apiVersion: enhancement.cluster.alauda.io/v1
kind: EtcdBackupConfiguration
metadata:
  name: etcd-backup-default
spec:
  schedule: "0 0 * * *"       # Run daily at 00:00
  paused: false               # Enable backups
  localStorage:
    path: /cpaas-backup       # Custom backup path
    ttl: "7776000"            # Retain for 90 days
  # ... other fields
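
If you maintain this configuration as a manifest file, one way to apply schedule or retention changes (a sketch, assuming you have permission to edit the resource and the file is named etcd-backup-configuration.yaml) is:

# Apply the updated manifest and confirm the fields took effect
kubectl apply -f etcd-backup-configuration.yaml
kubectl get etcdbackupconfiguration etcd-backup-default -o yaml | grep -E "schedule|ttl|paused"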

Viewing Backup Records

To view etcd backup records, you can use the platform UI or the command line.

Using the Platform UI

  1. In the left navigation bar, click Operation Center > Monitor > Dashboards.
  2. Click Switch in the upper right corner of the page.
  3. Click the Cluster etcd backup dashboard to view the etcd backup records.

Using the CLI

You can verify backup status and history by checking the status field of the EtcdBackupConfiguration resource:

kubectl get etcdbackupconfiguration etcd-backup-default -o yaml

The output contains a status.records list with details for each backup, including:

  • backupTimestamp: Time the backup was created.
  • fileName: Name of the backup file (e.g., snapshot-etcd-<date>-<time>-<ip>.tar).
  • result: Outcome of the backup operation (e.g., Success).
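
For a condensed view of the same history, a jsonpath query can print one line per record (a sketch, assuming the record fields are named as listed above):

kubectl get etcdbackupconfiguration etcd-backup-default \
  -o jsonpath='{range .status.records[*]}{.backupTimestamp}{"\t"}{.fileName}{"\t"}{.result}{"\n"}{end}'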

S3 Backup Configuration

To enable S3 storage for etcd backups, follow these steps:

Prerequisites

  • Alauda Container Platform Cluster Enhancer is installed on the cluster.

Step 1: Create S3 Secret

Prepare your S3 access credentials and create a Kubernetes secret in the cpaas-system namespace:

export ACCESS_KEY="your-access-key"
export SECRET_KEY="your-secret-key"

kubectl create secret generic etcd-backup-s3-secret \
  --from-literal=ACCESS_KEY="$ACCESS_KEY" \
  --from-literal=SECRET_KEY="$SECRET_KEY" \
  --dry-run=client -n cpaas-system -o yaml | kubectl apply -f -
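
You can confirm the secret exists before referencing it in the next step:

kubectl get secret etcd-backup-s3-secret -n cpaas-system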

Step 2: Configure EtcdBackupConfiguration

Modify the EtcdBackupConfiguration resource to add the remoteStorage field with S3 configuration:

spec:
  remoteStorage:
    s3:
      endpoint: "your-s3-endpoint"  # e.g.: https://s3.bucket.com
      region: "your-s3-region"
      bucket: "your-s3-bucket"
      dir: "your-s3-bucket-dir"
      skipTLSVerify: false  # Set to true only for self-signed certificates
      secretRef: etcd-backup-s3-secret
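
One way to add this block without editing a manifest file (a sketch; the endpoint, region, bucket, and dir values below are placeholders, and the resource is addressed the same way as in the earlier kubectl get example) is a merge patch:

kubectl patch etcdbackupconfiguration etcd-backup-default --type merge -p \
  '{"spec":{"remoteStorage":{"s3":{"endpoint":"https://s3.bucket.com","region":"your-s3-region","bucket":"your-s3-bucket","dir":"your-s3-bucket-dir","skipTLSVerify":false,"secretRef":"etcd-backup-s3-secret"}}}}'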

Step 3: Verify Backup

Trigger a manual etcd backup to verify the configuration:

# Set environment variables
token="your-platform-token"
platform_url="your-platform-url"
cluster_name="your-cluster-name"

# Trigger backup
curl $platform_url/kubernetes/$cluster_name/apis/enhancement.cluster.alauda.io/v1/etcdbackupconfigurations/etcd-backup-default/exec \
  -k -H "Authorization: Bearer $token"

After the backup completes, verify that backup files exist in your S3 bucket.
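
For example, if you use the AWS CLI against your S3-compatible endpoint (the bucket, directory, and endpoint below are placeholders; --endpoint-url is needed for non-AWS services), you can list the uploaded objects:

aws s3 ls "s3://your-s3-bucket/your-s3-bucket-dir/" --endpoint-url "https://s3.bucket.com"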

etcd Restore

Warning:

  • This operation performs a disaster recovery of the etcd cluster and overwrites the existing data. Ensure you have a valid backup snapshot before proceeding.
  • This procedure entails significant risks. If you are unsure about the operation, please contact technical support.
  • During the recovery process, the Kubernetes API Server will be unavailable.

Prerequisites

  • The Kubernetes cluster is deployed using hostnames (kubectl get node shows hostnames as node names).
  • An etcd backup snapshot is available.
  • The cluster is malfunctioning due to the failure of etcd nodes (e.g., more than half of the control plane nodes are down).
  • This recovery procedure is specifically designed for a 3-node control plane cluster. If your cluster has 5 or more control plane nodes, please contact technical support for assistance.

Step 1: Backup Original Data and Modify etcd Configuration

Execute the following commands on all control plane nodes:

# Create backup directory
mkdir -p /root/backup_$(date +%Y%m%d%H)/old-etcd/

# Stop kubelet
systemctl stop kubelet

# Save a copy of the etcdctl binary for later use
cp $(find /var/lib/containerd/ -name etcdctl | tail -1) /root/etcdctl

# Remove etcd containers
crictl ps -a | grep etcd | awk '{print $1}' | xargs -r crictl rm -f

# Back up the etcd data and Kubernetes configuration, then clear the etcd data directory
cp -a /var/lib/etcd/* /root/backup_$(date +%Y%m%d%H)/old-etcd/
rm -rf /var/lib/etcd/*
cp -r /etc/kubernetes/ /root/backup_$(date +%Y%m%d%H)/old-etcd/

# Modify etcd.yaml to use existing cluster state
sed -i '/initial-cluster-state=/d' /etc/kubernetes/manifests/etcd.yaml
sed -i '/initial-cluster=/a\    - --initial-cluster-state=existing' /etc/kubernetes/manifests/etcd.yaml

Note: Verify the indentation of --initial-cluster-state=existing in /etc/kubernetes/manifests/etcd.yaml.
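
A quick way to confirm that the flag was inserted with the expected indentation:

grep -n "initial-cluster" /etc/kubernetes/manifests/etcd.yaml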

Step 2: Copy Backup Snapshot

Copy the latest etcd backup snapshot to the /tmp directory on the first control plane node and name it snapshot.db.
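
For example, from the machine that holds the backup file (the source path below is a placeholder; adjust it to wherever your snapshot is stored):

scp /path/to/your/etcd-snapshot.db root@<first-control-plane-ip>:/tmp/snapshot.db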

Step 3: Restore etcd

Execute the following script on the first control plane node to restore the snapshot.

Note: The following script assumes a 3-node control plane cluster. If your cluster has 5 or more nodes, please contact technical support.

#!/usr/bin/env bash

# Set etcd node IPs (Replace with actual IPs)
export ETCD_1=1.1.1.1
export ETCD_2=2.2.2.2
export ETCD_3=3.3.3.3

# Set corresponding node hostnames (Replace with actual hostnames)
export ETCD_1_HOSTNAME=etcd-1
export ETCD_2_HOSTNAME=etcd-2
export ETCD_3_HOSTNAME=etcd-3

export ETCDCTL_API=3

# Loop for 3 nodes. Adjust '1 2 3' if you have a different number of nodes.
for n in 1 2 3; do
  ip_var=ETCD_${n}
  host_var=ETCD_${n}_HOSTNAME

  ip=${!ip_var}
  host=${!host_var}

  echo "Restoring for node: ${host} (${ip})..."

  rm -rf /tmp/etcd
  /root/etcdctl snapshot restore /tmp/snapshot.db \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --skip-hash-check=true \
    --data-dir=/tmp/etcd \
    --name "${host}" \
    --initial-cluster \
      ${ETCD_1_HOSTNAME}=https://${ETCD_1}:2380,\
${ETCD_2_HOSTNAME}=https://${ETCD_2}:2380,\
${ETCD_3_HOSTNAME}=https://${ETCD_3}:2380 \
    --initial-advertise-peer-urls https://"${ip}":2380 && \
    mv /tmp/etcd /root/etcd_"${host}"

  echo "Restoration for ${host} completed. Data directory: /root/etcd_${host}"
done

After the script completes, three data directories (/root/etcd_<hostname>, one per control plane node) are present in the /root directory.
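
You can confirm the restored directories exist before distributing them:

ls -d /root/etcd_*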

Step 4: Distribute Restored Data

  1. Transfer the restored data directories to the corresponding control plane nodes. Use scp or a similar tool to copy the directories generated in Step 3 (/root/etcd_<hostname>) from the first node to the others.

    For example, transfer to etcd-2 and etcd-3:

    # Replace <etcd-2-ip> and <etcd-3-ip> with actual IPs
    scp -r /root/etcd_etcd-2 root@<etcd-2-ip>:/root/
    scp -r /root/etcd_etcd-3 root@<etcd-3-ip>:/root/
  2. Restore the data to the etcd data directory (/var/lib/etcd) on each control plane node.

    # On etcd-1:
    cp -r /root/etcd_etcd-1/member/* /var/lib/etcd/
    
    # On etcd-2:
    cp -r /root/etcd_etcd-2/member/* /var/lib/etcd/
    
    # On etcd-3:
    cp -r /root/etcd_etcd-3/member/* /var/lib/etcd/
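
Before restarting components, you can compare the layout of /var/lib/etcd with the copy saved in Step 1 to confirm it matches the directory structure etcd was using before the failure, for example:

# Compare the restored layout with the pre-failure layout backed up in Step 1
ls /var/lib/etcd/
ls /root/backup_*/old-etcd/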

Step 5: Restart Cluster Components

Execute the following commands on all control plane nodes:

# Remove Kubernetes control plane containers
crictl ps -a | grep -E "kube-api|kube-sche|kube-contro" | awk '{print $1}' | xargs -r crictl rm -f

# Restart kubelet
systemctl restart kubelet
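
After kubelet restarts, the static control plane pods are recreated automatically; you can watch for their containers with, for example:

crictl ps | grep -E "etcd|kube-apiserver|kube-scheduler|kube-controller"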

Step 6: Verify Recovery

  1. Check that the etcd cluster is healthy. You can run this command inside any etcd pod or with the etcdctl binary saved to the host in Step 1:

    # Using etcdctl on the host
    export ETCDCTL_API=3
    /root/etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --endpoints=https://127.0.0.1:2379 \
      endpoint health
  2. Check if the Kubernetes pods are running correctly:

    kubectl get po -n kube-system
  3. Restart kubelet on all nodes (both control plane and worker nodes) to ensure all components reconnect to the recovered etcd:

    systemctl restart kubelet
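
Beyond per-endpoint health, you can also confirm that all control plane members rejoined the cluster, using the same certificates and the etcdctl binary saved in Step 1:

/root/etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --endpoints=https://127.0.0.1:2379 \
  member list -w table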

Configuration Management

To modify the default etcd backup configuration, contact technical support for detailed configuration options and advanced settings.