Building Kubernetes Clusters using Kubernetes
Wait…what? Yes, I heard this reaction when I first presented the idea of using Kubernetes to build Kubernetes clusters.
But — I can’t think of a better tool for cloud infrastructure automation than Kubernetes itself. Using one central K8s cluster we build and manage hundreds of other K8s clusters. In this article I’m going to show you how.
Note: SAP Concur uses AWS EKS, and a similar concept can be applied to Google’s GKE, Azure’s AKS, or any other cloud provider’s Kubernetes offering.
Production-ready
Building a Kubernetes cluster in any major cloud provider has never been easier. Bringing AWS EKS clusters up and running is as easy as:
$ eksctl create cluster
However, building a production-ready Kubernetes cluster requires more than that. While the definition of production-ready may vary, SAP Concur uses these 4 build stages to build and deliver production-ready Kubernetes clusters.
4 build stages
- Preflight tests: A collection of basic tests against the target AWS environment to ensure all requirements are in place before we start an actual cluster build. For example: available IP addresses per subnet, AWS exports, SSM parameters or other variables.
- EKS control plane and nodegroup: This is an actual AWS EKS cluster build with attached worker nodes.
- Addons installation: Frosting. This is what makes your cluster sweeter :-) Install addons like Istio, logging integration, autoscaler, etc. The list of addons is comprehensive and totally optional.
- Cluster validation: At this stage we validate the cluster (EKS core components and addons) from a functional perspective before we sign it off and hand it over. The more tests you write the better sleep you get. (Especially when you are the on-call person!)
Glue ‘em up!
The 4 build stages each use different tools and techniques (which I’ll describe later). We were looking for a tool that glues them all together while supporting sequences and parallelism; is event-driven; and preferably can visualize each build.
Voila! We found the Argo family of products, specifically Argo Events and Argo Workflows. Both run on Kubernetes as CRDs and use the same declarative YAML approach as other Kubernetes deployments.
We found the perfect combination: Imperative Orchestration, Declarative Automation
Break into pieces with Argo Workflows
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD.
Note: If you are familiar with K8s YAML, I promise this will be easy.
Let’s have a look at what each of the 4 build stages may look like in Argo Workflows.
1. Preflight tests
We use the BATS testing framework as an efficient tool for writing tests. Writing a preflight test in BATS can be as easy as:
#!/usr/bin/env bats

@test "More than 100 available IP addresses in subnet MySubnet" {
  AvailableIpAddressCount=$(aws ec2 describe-subnets --subnet-ids MySubnet | jq -r '.Subnets[0].AvailableIpAddressCount')
  [ "${AvailableIpAddressCount}" -gt 100 ]
}
Running the above BATS test file (avail-ip-addresses.bats) and 3 other fictional BATS tests in parallel using Argo Workflows may look like this:
- name: preflight-tests
  templateRef:
    name: argo-templates
    template: generic-template
  arguments:
    parameters:
    - name: command
      value: "{{item}}"
  withItems:
  - bats /tests/preflight/accnt-name-export.bats
  - bats /tests/preflight/avail-ip-addresses.bats
  - bats /tests/preflight/dhcp.bats
  - bats /tests/preflight/subnet-export.bats
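The task above delegates to a shared template through templateRef. The article does not show that template, so the following is only a minimal sketch of what it could look like. The container image is a placeholder, and running the command through `sh -c` is an assumption.

```yaml
# Hypothetical definition of the shared template referenced by templateRef.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: argo-templates
spec:
  templates:
  - name: generic-template
    inputs:
      parameters:
      - name: command            # filled from the calling task's arguments
    container:
      # Placeholder image; assumed to bundle aws, bats, kubectl and helm.
      image: example.com/cluster-builder:latest
      command: [sh, -c]
      args: ["{{inputs.parameters.command}}"]
```

With a template of this shape, each entry under withItems becomes its own pod, which is what lets the four preflight tests run in parallel.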
2. EKS control plane and nodegroup
You are free to choose how to build the actual EKS cluster: the available tools are eksctl, CloudFormation, or Terraform templates. Building core EKS using CloudFormation templates (eks-controlplane.yaml and eks-nodegroup.yaml) with Argo Workflows, in two steps with a dependency between them, may look like this:
- name: eks-controlplane
  dependencies: ["preflight-tests"]
  templateRef:
    name: argo-templates
    template: generic-template
  arguments:
    parameters:
    - name: command
      value: |
        aws cloudformation deploy \
          --stack-name {{workflow.parameters.CLUSTER_NAME}} \
          --template-file /eks-core/eks-controlplane.yaml \
          --capabilities CAPABILITY_IAM
- name: eks-nodegroup
  dependencies: ["eks-controlplane"]
  templateRef:
    name: argo-templates
    template: generic-template
  arguments:
    parameters:
    - name: command
      value: |
        aws cloudformation deploy \
          --stack-name {{workflow.parameters.CLUSTER_NAME}}-nodegroup \
          --template-file /eks-core/eks-nodegroup.yaml \
          --capabilities CAPABILITY_IAM
3. Addons installation
Install addons using plain kubectl, helm, kustomize or a combination of these. For example, installing the metrics-server addon with helm template and kubectl, only if the metrics-server addon is requested (conditional), can look like this in Argo Workflows:
- name: metrics-server
  dependencies: ["eks-nodegroup"]
  templateRef:
    name: argo-templates
    template: generic-template
  when: "'{{workflow.parameters.METRICS-SERVER}}' != none"
  arguments:
    parameters:
    - name: command
      value: |
        helm template /addons/{{workflow.parameters.METRICS-SERVER}}/ \
          --name "metrics-server" \
          --namespace "kube-system" \
          --set global.registry={{workflow.parameters.CONTAINER_HUB}} | \
          kubectl apply -f -
4. Cluster validation
For validating addons functionality we use the brilliant BATS library DETIK which makes writing K8s tests a piece of cake.
#!/usr/bin/env bats

load "lib/utils"
load "lib/detik"

DETIK_CLIENT_NAME="kubectl"
DETIK_CLIENT_NAMESPACE="kube-system"

@test "verify the deployment metrics-server" {
  run verify "there are 2 pods named 'metrics-server'"
  [ "$status" -eq 0 ]
  run verify "there is 1 service named 'metrics-server'"
  [ "$status" -eq 0 ]
  run try "at most 5 times every 30s to find 2 pods named 'metrics-server' with 'status' being 'running'"
  [ "$status" -eq 0 ]
  run try "at most 5 times every 30s to get pods named 'metrics-server' and verify that 'status' is 'running'"
  [ "$status" -eq 0 ]
}
Executing the above BATS DETIK test file (metrics-server.bats), only if the metrics-server addon is installed, may look like this in Argo Workflows:
- name: test-metrics-server
  dependencies: ["metrics-server"]
  templateRef:
    name: worker-containers
    template: addons-tests-template
  when: "'{{workflow.parameters.METRICS-SERVER}}' != none"
  arguments:
    parameters:
    - name: command
      value: |
        bats /addons/test/metrics-server.bats
Imagine how many validation tests you can plug in here. Do you need to run Sonobuoy conformance tests, Popeye — A Kubernetes Cluster Sanitizer or Fairwinds’ Polaris? Plug ‘em in using Argo Workflows!
If you have come to this point, you have built a fully functional production-ready AWS EKS cluster with the metrics-server addon installed, tested, and ready to hand over. Well done!
But don’t leave now, the best comes at the end.
WorkflowTemplates
Argo Workflows supports WorkflowTemplates, which allow workflows to be reused. Each of the 4 build stages is a WorkflowTemplate. We have essentially created building blocks that can be combined as needed. One “master” Workflow can execute all build stages in order (as in the examples above), or each stage can be executed independently. That flexibility is achieved with the help of Argo Events.
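As a rough illustration of how such a “master” Workflow might chain the four stages, here is a minimal sketch. The WorkflowTemplate names (preflight, eks-core, addons, validation), the `main` template inside each, and the parameter values are all hypothetical. The article does not show the real definitions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cluster-build-
spec:
  entrypoint: build
  arguments:
    parameters:
    - name: CLUSTER_NAME
      value: demo-cluster        # example value only
    - name: METRICS-SERVER
      value: none                # example value only
  templates:
  - name: build
    dag:
      tasks:
      # Each task points at one stage's WorkflowTemplate (hypothetical names).
      - name: preflight
        templateRef:
          name: preflight
          template: main
      - name: eks-core
        dependencies: [preflight]
        templateRef:
          name: eks-core
          template: main
      - name: addons
        dependencies: [eks-core]
        templateRef:
          name: addons
          template: main
      - name: validation
        dependencies: [addons]
        templateRef:
          name: validation
          template: main
```

Workflow-level parameters are global to the run, so the referenced templates can read them as `{{workflow.parameters.CLUSTER_NAME}}` without the value being passed to every task.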
Argo Events
Argo Events is an event-driven workflow automation framework for Kubernetes which helps you trigger K8s objects, Argo Workflows, Serverless workloads, etc. on events from a variety of sources like webhook, s3, schedules, messaging queues, gcp pubsub, sns, sqs, etc.
Cluster builds are triggered by an API call (Argo Events) with a JSON payload. In addition, each of the 4 build stages (WorkflowTemplates) has its own API endpoint. Kubernetes operators (read: humans) benefit greatly from this:
- not sure about the state of the cloud environment? call preflight API
- want to build a naked EKS cluster? call eks-core (control-plane and nodegroup) API
- want to re/install addons to existing EKS cluster? call addons API
- things went south with the cluster and you quickly need to run a collection of tests? call test API
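One way the per-stage endpoints above could be wired up is a webhook EventSource plus one Sensor per stage. The sketch below covers only the preflight endpoint. The resource names, port, service account, payload key (`cluster_name`) and WorkflowTemplate name are assumptions. Field names follow recent Argo Events releases and may differ in older versions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: cluster-api                 # hypothetical name
spec:
  service:
    ports:
    - port: 12000
      targetPort: 12000
  webhook:
    preflight:                      # one entry like this per build stage
      port: "12000"
      endpoint: /preflight
      method: POST
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: preflight
spec:
  template:
    serviceAccountName: operate-workflow-sa   # assumed; must be allowed to create Workflows
  dependencies:
  - name: preflight-request
    eventSourceName: cluster-api
    eventName: preflight
  triggers:
  - template:
      name: run-preflight
      k8s:
        operation: create
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              generateName: preflight-
            spec:
              workflowTemplateRef:
                name: preflight     # hypothetical WorkflowTemplate with an entrypoint
              arguments:
                parameters:
                - name: CLUSTER_NAME
                  value: placeholder   # overwritten from the JSON payload
        parameters:
        - src:
            dependencyName: preflight-request
            dataKey: body.cluster_name
          dest: spec.arguments.parameters.0.value
```

With this in place, a POST to /preflight with a body such as `{"cluster_name": "demo-cluster"}` would create a Workflow whose CLUSTER_NAME parameter is taken from the payload.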
Argo features
Both Argo Events and Argo Workflows come with a large set of features out of the box that save you from writing them yourself.
The top 7 most handy for our use case are:
- Parallelism
- Dependencies
- Retries: note the red failed preflight and validation tests in the images above. Argo retried them automatically, and they subsequently succeeded.
- Conditionals
- S3 support
- WorkflowTemplates
- Events Sensor Parameters
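To make the retries item concrete, this is roughly how a retry policy can be attached to a template in Argo Workflows. The template name, image, limit and backoff values are illustrative assumptions, not the settings used for the builds described here.

```yaml
# Fragment of a templates list; all values are examples only.
- name: generic-template
  retryStrategy:
    limit: 3                 # retry a failed step up to 3 times
    retryPolicy: Always      # retry on both failures and errors
    backoff:
      duration: "30s"        # wait before the first retry
      factor: 2              # double the wait after each attempt
  inputs:
    parameters:
    - name: command
  container:
    image: example.com/cluster-builder:latest   # placeholder image
    command: [sh, -c]
    args: ["{{inputs.parameters.command}}"]
```

Declaring the policy once on a shared template means every task that references it inherits the same retry behaviour.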
Conclusion
Being able to use a large number of tools that work together and define the desired infrastructure state imperatively allows flexibility without compromises and high project velocity. We are looking to use Argo Events and Workflows for other automation tasks. The possibilities are endless.