Tag: disaster recovery

  • The Ultimate Guide to Disaster Recovery for Your Kubernetes Clusters

    Kubernetes allows us to run containerized applications at scale without drowning in the details of application load balancing. You can ensure high availability for your applications running on Kubernetes by running multiple replicas (pods) of the application. All the complexity of container orchestration is safely hidden away so that you can focus on developing your application instead of deploying it. Learn more about high availability of Kubernetes clusters and how you can use kubeadm for high availability in Kubernetes here.

    But using Kubernetes has its own challenges, and getting Kubernetes up and running takes some real work. If you are not familiar with setting up Kubernetes, you might want to take a look here.

    Kubernetes allows us to have a zero downtime deployment, yet service interrupting events are inevitable and can occur at any time. Your network can go down, your latest application push can introduce a critical bug, or in the rarest case, you might even have to face a natural disaster.

    When you are using Kubernetes, sooner or later you will need to set up backups. If your cluster ever goes into an unrecoverable state, you will need a backup to return to the previous stable state of the Kubernetes cluster.

    Why Backup and Recovery?

    There are three reasons why you need a backup and recovery mechanism in place for your Kubernetes cluster:

    1. To recover from disasters: for example, someone accidentally deleted the namespace where your deployments reside.
    2. To replicate the environment: you want to replicate your production environment to staging before a major upgrade.
    3. To migrate the Kubernetes cluster: say you want to migrate your cluster from one environment to another.

    What to Backup?

    Now that you know why, let’s see what exactly you need to back up. There are two things:

    1. The state of your Kubernetes control plane is stored in etcd, so you need to back up the etcd state to recover all the Kubernetes resources.
    2. If you have stateful containers (which you will have in the real world), you need to back up the persistent volumes as well.

    How to Backup?

    There have been various tools, such as Heptio Ark (now Velero) and kube-backup, to back up and restore Kubernetes clusters on cloud providers. But what if you are not using a managed Kubernetes cluster? You might have to get your hands dirty if you are running Kubernetes on bare metal, just like we are.

    We are running a 3-master Kubernetes cluster with an etcd member on each master, three in total. If we lose one master, we can still recover it, because the etcd quorum is intact. But if we lose two masters, we need a mechanism to recover from that situation as well for production-grade clusters.
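    Why exactly one lost master is survivable comes straight from etcd's quorum rule: a cluster of n members needs a majority (n/2 + 1, integer division) alive, so it tolerates (n - 1)/2 failures. A small sketch of that arithmetic (the helper names are ours, purely illustrative):

    ```shell
    # etcd quorum arithmetic: a cluster of n members needs a majority
    # alive to keep accepting writes.
    quorum() { echo $(( $1 / 2 + 1 )); }

    # Number of member failures the cluster can survive.
    fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

    # For our 3-member cluster: quorum 3 -> 2, fault_tolerance 3 -> 1,
    # so losing two masters breaks quorum and forces a restore from snapshot.
    ```

    This is also why the jump from 3 to 5 members only buys one extra failure: `fault_tolerance 5` is 2.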

    Want to know how to set up a multi-master Kubernetes cluster? Keep reading!

    Taking etcd backup:

    The mechanism for taking an etcd backup differs depending on how the etcd cluster is set up in your Kubernetes environment.

    There are two ways to set up an etcd cluster in a Kubernetes environment:

    1. Internal etcd cluster: the etcd cluster runs as containers/pods inside the Kubernetes cluster, and it is the responsibility of Kubernetes to manage those pods.
    2. External etcd cluster: the etcd cluster runs outside the Kubernetes cluster, mostly in the form of Linux services, and its endpoints are provided to the Kubernetes cluster to write to.
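    One quick way to tell which of the two cases you are in: an internal etcd shows up as static pods named etcd-&lt;node&gt; in kube-system. A hedged sketch (the helper name is ours; on a live cluster you would feed it real kubectl output):

    ```shell
    # Classify the etcd topology from the pod list in kube-system.
    # Internal etcd appears as static pods named etcd-<node>.
    detect_etcd_topology() {
      # $1: output of `kubectl get pods -n kube-system -o name`
      if printf '%s\n' "$1" | grep -q '^pod/etcd-'; then
        echo internal
      else
        echo external
      fi
    }

    # On a live cluster:
    # detect_etcd_topology "$(kubectl get pods -n kube-system -o name)"
    ```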

    Backup Strategy for Internal Etcd Cluster:

    To take a backup from inside an etcd pod, we will use the Kubernetes CronJob functionality, which does not require the etcdctl client to be installed on the host.

    The following Kubernetes CronJob definition takes an etcd backup every minute:

    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: backup
      namespace: kube-system
    spec:
      # activeDeadlineSeconds: 100
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: backup
                # Same image as in /etc/kubernetes/manifests/etcd.yaml
                image: k8s.gcr.io/etcd:3.2.24
                env:
                - name: ETCDCTL_API
                  value: "3"
                command: ["/bin/sh"]
                args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
                volumeMounts:
                - mountPath: /etc/kubernetes/pki/etcd
                  name: etcd-certs
                  readOnly: true
                - mountPath: /backup
                  name: backup
              restartPolicy: OnFailure
              hostNetwork: true
              volumes:
              - name: etcd-certs
                hostPath:
                  path: /etc/kubernetes/pki/etcd
                  type: DirectoryOrCreate
              - name: backup
                hostPath:
                  path: /data/backup
                  type: DirectoryOrCreate
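    A CronJob that runs every minute fills /data/backup quickly. A minimal pruning sketch, assuming the filename pattern from the manifest above (the helper name and retention count are our own choices, not part of the setup):

    ```shell
    # Keep only the newest $2 snapshots in directory $1; delete the rest.
    prune_snapshots() {
      dir=$1; keep=$2
      # ls -1t sorts newest first; tail skips the ones we keep.
      ls -1t "$dir"/etcd-snapshot-*.db 2>/dev/null \
        | tail -n +"$((keep + 1))" \
        | xargs -r rm --
    }

    # Example: keep roughly the last hour of minutely snapshots.
    # prune_snapshots /data/backup 60
    ```

    This could run from a second CronJob, or be appended to the backup command itself.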

    Backup Strategy for External Etcd Cluster:

    If you are running the etcd cluster on Linux hosts as a service, you should set up a Linux cron job to back up your cluster.

    Run the following command to save an etcd backup:

    ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save /path/for/backup/snapshot.db
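    For example, the one-liner above can be wrapped in a small script and scheduled from cron. This is only a sketch: the paths, endpoint, and helper below are assumptions, not part of the original setup; only the timestamped naming scheme is reused from the CronJob manifest.

    ```shell
    #!/bin/sh
    # Sketch: daily etcd backup for an external cluster, run from cron.
    BACKUP_DIR=/var/backups/etcd        # assumed location
    ENDPOINT=https://127.0.0.1:2379     # assumed etcd client endpoint

    # Reuse the timestamped naming scheme from the internal-cluster CronJob.
    snapshot_name() { echo "etcd-snapshot-$1.db"; }

    # On the etcd host (requires etcdctl):
    # ETCDCTL_API=3 etcdctl --endpoints "$ENDPOINT" \
    #   snapshot save "$BACKUP_DIR/$(snapshot_name "$(date +%Y-%m-%d_%H:%M:%S_%Z)")"

    # Hypothetical crontab entry for a 2 a.m. daily backup:
    # 0 2 * * * /usr/local/bin/etcd-backup.sh
    ```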

    Disaster Recovery

    Now, let’s say the Kubernetes cluster went completely down and we need to recover it from the etcd snapshot.

    In the normal flow, you would start the etcd cluster and run kubeadm init on the master node with the etcd endpoints.

    Make sure you put the backed-up certificates into the /etc/kubernetes/pki folder before running kubeadm init, so that it picks up the same certificates.

    Restore Strategy for Internal Etcd Cluster:

    docker run --rm \
        -v '/data/backup:/backup' \
        -v '/var/lib/etcd:/var/lib/etcd' \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.2.24 \
        /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-2018-12-09_11:12:05_UTC.db' ; mv /default.etcd/member/ /var/lib/etcd/"
    
    kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd

    Restore Strategy for External Etcd Cluster

    Restore etcd on the 3 nodes using the following commands:

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-188.db \
        --name master-0 \
        --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
        --initial-cluster-token my-etcd-token \
        --initial-advertise-peer-urls http://10.0.1.188:2380

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-136.db \
        --name master-1 \
        --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
        --initial-cluster-token my-etcd-token \
        --initial-advertise-peer-urls http://10.0.1.136:2380

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-155.db \
        --name master-2 \
        --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
        --initial-cluster-token my-etcd-token \
        --initial-advertise-peer-urls http://10.0.1.155:2380
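    The three commands differ only in --name, the snapshot file, and the advertised peer URL; the --initial-cluster string is identical on all nodes. A sketch of building that shared string from name=ip pairs (the helper name is ours):

    ```shell
    # Build the --initial-cluster value shared by all restore commands,
    # from name=ip pairs such as master-0=10.0.1.188.
    initial_cluster() {
      out=""
      for pair in "$@"; do
        name=${pair%%=*}; ip=${pair#*=}
        out="${out:+$out,}$name=http://$ip:2380"
      done
      echo "$out"
    }

    # initial_cluster master-0=10.0.1.188 master-1=10.0.1.136 master-2=10.0.1.155
    ```

    Keeping this string in one place avoids the easy mistake of the node names drifting between the three commands.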

    The above three commands will give you three restored folders on the three nodes, named:

    master-0.etcd, master-1.etcd and master-2.etcd

    Now, stop the etcd service on all the nodes, replace each node’s etcd data directory with its restored folder, and start the etcd service again. You can now see all the nodes, but after some time you will notice that only one master node is Ready and the other nodes have gone into the NotReady state. You need to join those two nodes again using the existing ca.crt file (you should have a backup of it).

    Run the following command on master node:

    kubeadm token create --print-join-command

    It will print the kubeadm join command; add an --ignore-preflight-errors flag and run that command on the other two nodes to bring them into the Ready state.

    Conclusion

    One way to deal with master failure is to set up a multi-master Kubernetes cluster, but even that does not let you skip etcd backup and restore entirely: it is still possible to accidentally destroy data in an HA environment.

    Need help with disaster recovery for your Kubernetes Cluster? Connect with the experts at Velotio!

    For more insights into Kubernetes Disaster Recovery check out here.

  • Kubernetes Migration: How To Move Data Freely Across Clusters

    This blog focuses on migrating Kubernetes clusters from one cloud provider to another. We will be migrating our entire data from Google Kubernetes Engine to Azure Kubernetes Service using Velero.

    Prerequisite

    • A Kubernetes cluster, version 1.10 or above

    Setup Velero with Restic Integration

    Velero consists of a client installed on your local machine and a server that runs in your Kubernetes cluster, much like Helm.

    Installing Velero Client

    You can find the latest release for your OS and architecture and download Velero from there:

    $ wget https://github.com/vmware-tanzu/velero/releases/download/v1.3.1/velero-v1.3.1-linux-amd64.tar.gz

    Extract the tarball (adjust the file name to match the version and platform you downloaded) and move the Velero binary to /usr/local/bin:

    $ tar -xvzf velero-v1.3.1-linux-amd64.tar.gz
    $ sudo mv velero-v1.3.1-linux-amd64/velero /usr/local/bin/
    $ velero help

    Create a Bucket for Velero on GCP

    Velero needs an object storage bucket where it will store the backup. Create a GCS bucket using:

    gsutil mb gs://<bucket-name>

    Create a Service Account for Velero

    # Create a Service Account
    gcloud iam service-accounts create velero --display-name "Velero service account"
    SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list --filter="displayName:Velero service account" --format 'value(email)')
    
    # Define Permissions for the Service Account
    ROLE_PERMISSIONS=(
        compute.disks.get
        compute.disks.create
        compute.disks.createSnapshot
        compute.snapshots.get
        compute.snapshots.create
        compute.snapshots.useReadOnly
        compute.snapshots.delete
        compute.zones.get
    )
    
    # Create a Role for Velero
    PROJECT_ID=$(gcloud config get-value project)
    
    gcloud iam roles create velero.server \
        --project $PROJECT_ID \
        --title "Velero Server" \
        --permissions "$(IFS=","; echo "${ROLE_PERMISSIONS[*]}")"
    
    # Create a Role Binding for Velero
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
        --role projects/$PROJECT_ID/roles/velero.server
    
    # Grant Velero object-admin access to the bucket
    gsutil iam ch serviceAccount:$SERVICE_ACCOUNT_EMAIL:objectAdmin gs://<bucket-name>
    
    # Generate Service Key file for Velero and save it for later
    gcloud iam service-accounts keys create credentials-velero \
        --iam-account $SERVICE_ACCOUNT_EMAIL

    Install Velero Server on GKE and AKS

    Use the --use-restic flag on the velero install command to enable restic integration.

    $ velero install \
        --use-restic \
        --bucket <bucket-name> \
        --provider gcp \
        --secret-file ./credentials-velero \
        --use-volume-snapshots=false
    $ velero plugin add velero/velero-plugin-for-gcp:v1.0.1
    $ velero plugin add velero/velero-plugin-for-microsoft-azure:v1.0.0

    After that, you can see a restic DaemonSet and a Velero Deployment in your Kubernetes cluster.

    $ kubectl get po -n velero

    Restic Components

    In addition, there are three more Custom Resource Definitions and their associated controllers to provide restic support.

    Restic Repository

    • Maintains the complete lifecycle of Velero’s restic repositories.
    • Restic lifecycle commands such as restic init, check, and prune are handled by this CRD’s controller.

    PodVolumeBackup

    • This CRD backs up persistent volumes based on pod annotations in the selected namespaces.
    • The controller executes backup commands in the pod to initiate the backup.

    PodVolumeRestore

    • This controller restores the volumes of the respective pods captured in restic backups, and it is responsible for executing the restore commands.

    Backup an application on GKE

    For this blog post, we assume the Kubernetes cluster already has an application that uses persistent volumes. Alternatively, you can install WordPress as an example, as explained here.

    We will perform GKE Persistent disk migration to Azure Persistent Disk using Velero. 

    Follow the below steps:

    1. To find what to back up, check the Deployment or StatefulSet for the name of the volume that is mounted; that is the persistent volume which will be backed up. For example, here the pod needs to be annotated with the volume name “data”.
    volumes:
      - name: data
        persistentVolumeClaim:
          claimName: mongodb

    2. Annotate the pods with the volume names you’d like to back up; only those volumes will be backed up:
    $ kubectl -n NAMESPACE annotate pod/POD_NAME backup.velero.io/backup-volumes=VOLUME_NAME1,VOLUME_NAME2

    For example, 

    $ kubectl -n application annotate pod/wordpress-pod backup.velero.io/backup-volumes=data

    3. Take a backup of the entire namespace in which the application is running. You can also specify multiple namespaces, or skip this flag to back up all namespaces by default. We will back up only one namespace in this blog.
    $ velero backup create testbackup --include-namespaces application

    4. Monitor the progress of the backup:
    $ velero backup describe testbackup --details
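    When many pods need annotating, steps 2 and 3 above can be scripted. This is a sketch with a helper name of our own choosing, not part of Velero; only the annotation key comes from the docs above:

    ```shell
    # Build the Velero backup annotation command for one pod.
    annotate_cmd() {
      # $1: namespace, $2: pod name, remaining args: volume names to back up
      ns=$1; pod=$2; shift 2
      vols=$(IFS=,; printf '%s' "$*")   # join volume names with commas
      echo "kubectl -n $ns annotate pod/$pod backup.velero.io/backup-volumes=$vols"
    }

    # Against a live cluster:
    # eval "$(annotate_cmd application wordpress-pod data)"
    ```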

       

    Once the backup is complete, you can list it using:

    $ velero backup get

    You can also check the backup in the GCP Console under Storage.
    Select the bucket you created and you should see a similar directory structure.

    Restore the application to AKS

    Follow the below steps to restore the backup:

    1. Make sure the same StorageClass is available in Azure as is used by the GKE Persistent Volumes. For example, if the StorageClass of the PVs is “persistent-ssd”, create the same on AKS using the template below:
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: persistent-ssd  # same name as the GKE StorageClass
    provisioner: kubernetes.io/azure-disk
    parameters:
      storageaccounttype: Premium_LRS
      kind: Managed

    2. Run the Velero restore:
    $ velero restore create testrestore --from-backup testbackup

    You can monitor the progress of the restore:

    $ velero restore describe testrestore --details

    You can also check in the GCP Console: a new folder, “restores”, is created under the bucket.

    After some time, you should see that the application namespace is back and the WordPress and MySQL pods are running again.
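    A quick way to confirm the restore worked is to check that every PVC in the namespace is Bound. A hedged sketch: the helper (name is ours) parses phases you would obtain from kubectl, shown here as a comment:

    ```shell
    # Return success only if every PVC phase in the input is "Bound".
    all_pvcs_bound() {
      # $1: space-separated PVC phases, e.g. "Bound Bound Pending"
      for phase in $1; do
        [ "$phase" = Bound ] || return 1
      done
      return 0
    }

    # On a live cluster:
    # all_pvcs_bound "$(kubectl -n application get pvc -o jsonpath='{.items[*].status.phase}')" \
    #   && echo "restore looks healthy"
    ```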

    Troubleshooting

    For any errors/issues related to Velero, you may find below commands helpful for debugging purposes:

    # Describe the backup to see the status
    $ velero backup describe testbackup --details
    
    # Check backup logs, and look for errors if any
    $ velero backup logs testbackup
    
    # Describe the restore to see the status
    $ velero restore describe testrestore --details
    
    # Check restore logs, and look for errors if any
    $ velero restore logs testrestore
    
    # Check velero and restic pod logs, and look for errors if any
    $ kubectl -n velero logs VELERO_POD_NAME/RESTIC_POD_NAME
    # NOTE: You can change the default log level to debug by adding --log-level=debug as an argument to the container command in the velero pod template spec.
    
    # Describe the BackupStorageLocation resource and look for any errors in Events
    $ kubectl describe BackupStorageLocation default -n velero

    Conclusion

    Migrating persistent workloads across Kubernetes clusters on different cloud providers is difficult, but it becomes possible with the restic integration in the Velero backup tool. Restic support is still described as beta quality on the official site. I performed a GKE to AKS migration and it went through successfully. You can try other combinations of cloud providers for your migrations.

    The only drawback of using Velero to migrate data is that a very large dataset may take a while to migrate. It took me almost a day to move a 350 GB disk from GKE to AKS. But if your data is comparatively small, this is a very efficient and hassle-free way to migrate it.