How to backup Hashicorp Vault w/ Raft storage on Kubernetes

Context

Our team has experimented with Hashicorp Vault as our new credentials management solution. Thanks to the official Vault Helm Chart, we could get a production-ready vault cluster running on Kubernetes—with minimal effort.

Architecture

Our 5-node vault cluster is highly available by using the provided Integrated Storage Raft backend. The statefulset and persistent volume claims allow each node to have its own data storage, which offers backups and redundancy for our database.

The Problem

Unfortunately, the open-source version of Vault does not offer an out-of-the-box automated backup solution, and it is only available in Vault Enterprise.

While the backup feature is no longer automated, Vault can still access it through CLI and HTTP API. We use a CronJob that runs alongside our Vault Kubernetes deployment to trigger snapshot saves every few minutes—a manual version of the old automatic configuration.

How?

Prerequisite

You have a working vault cluster
You have sufficient access to the cluster
You have an operating S3 storage instance

Setup Policy and Authentication

Configure the snapshot agent with a minimal backup policy.

echo '
path "sys/storage/raft/snapshot" {
   capabilities = ["read"]
}' | vault policy write snapshot -

The app-role authentication method allows applications or machines to authenticate with Vault using roles defined by the administrator.

The AppRole authentication method is a perfect fit for the Snapshot Agent, which will be used to connect to our Vault cluster. Copy down both the role ID and secret ID - you will need them later!

vault auth enable approle
vault write auth/approle/role/snapshot-agent token_ttl=2h token_policies=snapshot
vault read auth/approle/role/snapshot-agent/role-id -format=json | jq -r .data.role_id
vault write -f auth/approle/role/snapshot-agent/secret-id -format=json | jq -r .data.secret_id

Prepare Secrets

Let's save the sensitive information we gathered as secrets. We will use them later.

apiVersion: v1
kind: Secret
metadata:
  name: vault-snapshot-agent-token
type: Opaque
data:
  # we use gotmpl here
  # you can replace them with base64-encoded value
  VAULT_APPROLE_ROLE_ID: {{ .Values.approle.secretId | b64enc | quote }}
  VAULT_APPROLE_SECRET_ID: {{ .Values.approle.secretId | b64enc | quote }}

apiVersion: v1
kind: Secret
metadata:
  name: vault-snapshot-s3
type: Opaque
data:
  # we use gotmpl here
  # you can replace them with base64-encoded value
  AWS_ACCESS_KEY_ID: {{ .Values.backup.accessKeyId | b64enc | quote }}
  AWS_SECRET_ACCESS_KEY: {{ .Values.backup.secretAccesskey | b64enc | quote }}
  AWS_DEFAULT_REGION: {{ .Values.backup.region | b64enc | quote }}

The CronJob

Let's define the CronJob that does all the actual work.

We configure the VAULT_ADDR environment variable to http://vault-active.vault.svc.cluster.local:8200. Using vault-active Service can ensure the snapshot request is made against the leader node, assuming you have enabled Service Registration, which is the default. The URL may vary depending on your vault helm chart deployment release name and target namespace.

I overcomplicated the cronjob by using multiple containers to perform a simple backup and upload task. The intention was to avoid building custom images, and I don't want another image that needs maintenance.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-snapshot-cronjob
spec:
  schedule: "@every 12h"
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
          - name: share
            emptyDir: {}
          containers:
          - name: snapshot
            image: vault:1.7.2
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            args:
            - -ec
            # The offical vault docker image actually doesn't come with `jq`. You can 
            # - install it during runtime (not a good idea and your security team may not like it)
            # - ship `jq` static binary in a standalone image and mount it using a shared volume from `initContainers`
            # - build your custom `vault` image
            - |
              curl -sS https://webinstall.dev/jq | sh
              export VAULT_TOKEN=$(vault write auth/approle/login role_id=$VAULT_APPROLE_ROLE_ID secret_id=$VAULT_APPROLE_SECRET_ID -format=json | /jq/jq -r .auth.client_token);
              vault operator raft snapshot save /share/vault-raft.snap; 
            envFrom:
            - secretRef:
                name: vault-snapshot-agent-token
            env:
            - name: VAULT_ADDR
              valut: http://vault-active.vault.svc.cluster.local:8200
            volumeMounts:
            - mountPath: /share
              name: share
          - name: upload
            image: amazon/aws-cli:2.2.14
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            args:
            - -ec
            # the script wait untill the snapshot file is available
            # then upload to s3
            # for folks using non-aws S3 like IBM Cloud Object Storage service, add a `--endpoint-url` option
            # run `aws --endpoint-url <https://your_s3_endpoint> s3 cp ...`
            # change the s3://<path> to your desired location
            - |
              until [ -f /share/vault-raft.snap ]; do sleep 5; done;
              aws s3 cp /share/vault-raft.snap s3://vault/vault_raft_$(date +"%Y%m%d_%H%M%S").snap;
            envFrom:
            - secretRef:
                name: vault-snapshot-s3
            volumeMounts:
            - mountPath: /share
              name: share
          restartPolicy: OnFailure

Wrapping Up

You now have all the resources needed to automate vault backup for the Raft backend. You can just run kubectl apply -f * or build your own Helm Chart and distribute it internally on whatever chart repository you use.

How to backup Hashicorp Vault with Raft storage on Kubernetes