Static IPs for TCE and TKGm on vSphere

Preface

Tanzu Kubernetes Grid Multi Cloud (TKGm) and Tanzu Community Edition (TCE) are both great distributions of Kubernetes.

Both are based on the same underlying technology and framework, so this post applies equally to both systems.

Problem Statement

One of the challenges we have encountered many times with TKGm and TCE is that they require us to use DHCP for node networks.

While this may not seem like a big issue, in many cases it actually is if not dealt with correctly.

The main issue is that Kubernetes components rely on many certificates and all communication is secured; however, in Cluster API, which is the provisioning mechanism used in Tanzu clusters, the certificates are generated for the IPs the nodes receive when they are first deployed.

Basically this means that the IP address MUST remain the same for the entire lifecycle of the machine.

In an ideal situation this should not be an issue, but as we know the world is not ideal.

We have power outages, host crashes, scheduled downtime etc. that can all be causes for our clusters to go down.

When that happens, depending on how you set up the environment, the nodes may or may not receive the same IP when they are powered back on.

For worker nodes this can be solved pretty easily by Cluster API using the Machine Health Check functionality, since unresponsive worker nodes can simply be recreated, but for control plane nodes it's not as easy.

If you have a single control plane node (don't do this!!!), then unless you can manually change the IP back to the original one you are out of luck. With a 3 node control plane, if one node gets the wrong IP all is fine and a new node will replace it, but if two or all three nodes lose their original IPs you will be in a bad place as well.

Current Official Solution

The TKG documentation mentions that for control plane nodes you should create DHCP reservations, which ensure that a node always receives the same IP, but this is hard to maintain. Every time a new control plane node is created, you have to manually edit the DHCP server to add a reservation for the node's IP and MAC address.
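
For example, if your DHCP server happens to be dnsmasq (purely an illustration, any DHCP server has an equivalent concept), each reservation is a dhcp-host line mapping a node's MAC address to a fixed IP. The MACs and IPs below are made up:

cat << EOF >> /etc/dnsmasq.d/tkg-control-plane-reservations.conf
# One reservation per control plane node. Every time Cluster API replaces a
# node, a new MAC address appears and this file has to be updated by hand.
dhcp-host=00:50:56:aa:bb:01,10.0.0.11
dhcp-host=00:50:56:aa:bb:02,10.0.0.12
dhcp-host=00:50:56:aa:bb:03,10.0.0.13
EOF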

A new node will be created for multiple reasons, and managing this at scale is not a fun task.

A Better (Yet Unsupported) Solution

The true solution for this would be to use static IP allocation for nodes.

Currently, TKGm and TCE do not support static IP address management, but adding in the functionality is actually really easy!

The reason DHCP is used is that nodes are constantly spun up and torn down in a Cluster API (CAPI) environment, so we need a way for VMs to be assigned an IP automatically when CAPI creates a machine.

While DHCP is one way to achieve this, an IP Address Management (IPAM) solution is another option that can serve the same purpose while also solving the issue of dynamic IP addresses, as the IPAM can provide static IPs for our nodes.

TKGm and TCE support some of the most common CAPI providers, but there is a whole plethora of providers out there, including one called CAPM3 (Cluster API Provider Metal3), which is a bare metal based CAPI provider.

The CAPM3 project includes not only the infrastructure provider, but also a separate controller called the CAPM3 IPAM Controller, which implements a simple to use, in-cluster, CRD based IPAM that integrates well with CAPI based solutions.

This controller adds 3 key CRDs to our cluster:

  1. IP Pool – a way to define a pool of IPs that nodes will be assigned an IP from
  2. IP Claim – Similar to a PVC but for an IP
  3. IP Address – The actual IP object which is assigned to a VM via a claim

Basically, the flow is that for each Machine object an IP Claim needs to be created, which in turn allocates an IP and generates an IP Address CR, and then that data needs to be plumbed back into our machines.
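
To make that a bit more concrete, here is a rough sketch of what a claim and the resulting address look like. The names and values are made up, and the exact fields may vary slightly between versions of the ipam.metal3.io/v1alpha1 API, so treat this as illustrative only:

# An IP Claim requesting an address from the pool named "my-cluster"
apiVersion: ipam.metal3.io/v1alpha1
kind: IPClaim
metadata:
  name: my-cluster-control-plane-abc12
  namespace: default
spec:
  pool:
    name: my-cluster
---
# The IP Address the IPAM controller allocates to satisfy that claim
apiVersion: ipam.metal3.io/v1alpha1
kind: IPAddress
metadata:
  name: ip-my-cluster-abc12
  namespace: default
spec:
  claim:
    name: my-cluster-control-plane-abc12
  pool:
    name: my-cluster
  address: 10.0.0.11
  prefix: 24
  gateway: 10.0.0.1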

In CAPM3, the infrastructure provider handles this itself, but for CAPV we need something to automate it. Luckily, such a solution has already been created by the team at Spectro Cloud and is called "cluster-api-provider-vsphere-static-ip". This controller basically acts as the bridge between the CAPM3 IPAM provider and the CAPV objects we deploy via Tanzu.

Cluster API vSphere (CAPV) by default sets the network address allocation type to DHCP, but if we change it to static via the VSphereMachineTemplate CRD, machine provisioning will wait until the VSphereMachine CR is updated with a static IP address, and only then provision the node with that static IP configuration.
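
Roughly speaking, once the static IP has been plumbed in, the network device section of the machine spec ends up looking something like the sketch below. The addresses are made up, and the field names are based on my reading of the CAPV network device spec, so double check them against the CAPV version you are running:

spec:
  network:
    devices:
    - networkName: my-portgroup   # node network portgroup (hypothetical name)
      dhcp4: false                # DHCP disabled by our overlay
      ipAddrs:
      - 10.0.0.11/24              # static IP injected from the IP Address CR
      gateway4: 10.0.0.1
      nameservers:
      - 8.8.8.8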

This means that the final flow we are looking for is:

  1. The cluster is created
  2. The first Control Plane node CRD is created but is set to get a static IP and is in a pending state
  3. The Spectro Cloud controller sees this and creates an IP Claim CR for us
  4. The CAPM3 IPAM will create an IP Address CR for us from the configured pool of IPs
  5. The Spectro Cloud controller sees the newly created IP Address CR and populates the data of that CR onto our vSphere Machine CR
  6. CAPV sees that all the details it needs are now in the objects spec and provisions the VM
  7. The rest of the deployment process continues as normal

This process basically is the same for each and every VM in the cluster.

The key thing we still haven't mentioned, though, is how you specify which IP Pool a cluster should have its nodes' IPs allocated from.

This is accomplished via labels that we apply to the VSphereMachineTemplate as well as to the IP Pool; when they match, they form a pair.

It is also important to note that an IP Pool is scoped to a single cluster. This means you will have a 1:1 ratio between an IP Address Pool and a Kubernetes workload cluster.
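
As a minimal sketch of that pairing (the cluster and network names are made up; the label keys are the same ones the overlays below will set):

kind: VSphereMachineTemplate
metadata:
  labels:
    cluster.x-k8s.io/ip-pool-name: my-cluster    # selects the pool below
    cluster.x-k8s.io/network-name: my-portgroup
---
kind: IPPool
metadata:
  name: my-cluster                               # one pool per workload cluster
  labels:
    cluster.x-k8s.io/network-name: my-portgroup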

While this may sound complex, the implementation is very simple and the UX is pretty straightforward once the solution is set up.

How Do We Implement This

The preparation phase has 4 key steps

  1. Deploy a Management Cluster (not covered in this post)
  2. Deploy the CAPM3 IPAM Controller
  3. Deploy the Spectro Cloud CAPV Static IP Controller
  4. Add YTT overlays to our Tanzu config files

Once we have completed these 4 steps, we can start creating clusters with static IPs!

Deploying the IPAM controller

The easiest way to do this is via a simple kubectl apply command on our management cluster:

kubectl create ns capm3-system
kubectl apply -f https://github.com/metal3-io/ip-address-manager/releases/download/v1.1.3/ipam-components.yaml

Basically we are creating a namespace and then simply installing the controller and all its resources from the official release artifact.
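
If you want to verify the controller came up, something like the following should show its pod running (the exact pod and deployment names may differ between releases):

kubectl get pods -n capm3-system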

Deploying the CAPV Static IP Controller

Currently the Spectro Cloud team doesn't release any artifacts, so you need to build the code and manifests yourself using the different make targets in the repo.

To make your life easier, I have already run this and pasted the generated manifest in a gist for consumption, which points to the container image I built for this, hosted on GHCR.

kubectl apply -f https://gist.githubusercontent.com/vrabbi/b20af526c091cced11495f578a5a3fc5/raw/128d922f9497272b952580d6e2e357020669a5db/capv-ipam-controller.yaml

This is all there is to install on our cluster!
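
As a quick sanity check, you can look for the controller's deployment once the manifest has been applied. I am not hard-coding the namespace or deployment name here since they are defined by the manifest, so adjust the grep as needed:

kubectl get deployments -A | grep -i static-ip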

Creating the needed overlays

This step is needed in order for us to integrate our Tanzu Cluster creation with the newly installed components.

The first overlay is pretty simple. It sets DHCP to false on all VSphereMachineTemplates and adds the labels needed to match them with an IP Pool.

cat << EOF > ~/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt/vsphere-static-ip-overlay.yaml
#@ load("@ytt:overlay", "overlay")
#@ load("@ytt:data", "data")
#@ if data.values.USE_STATIC_IPS:
#@overlay/match by=overlay.subset({"kind": "VSphereMachineTemplate"}), expects="1+"
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  #@overlay/match missing_ok=True
  labels:
    #@overlay/match missing_ok=True
    cluster.x-k8s.io/ip-pool-name: #@ data.values.CLUSTER_NAME
    #@overlay/match missing_ok=True
    cluster.x-k8s.io/network-name: #@ data.values.VSPHERE_NETWORK.split("/")[-1]
spec:
  template:
    spec:
      network:
        devices:
        #@overlay/match by=overlay.index(0)
        - dhcp4: false
#@ end
EOF

The second file we need to create defines the additional data values we want available for configuring the IPAM integration in our cluster configuration file.

cat << EOF > ~/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt/vsphere-static-ip-default-values.yaml

#@data/values
#@overlay/match-child-defaults missing_ok=True
---
USE_STATIC_IPS: false
FIRST_IP:
LAST_IP:
SUBNET_PREFIX: 24
DEFAULT_GATEWAY:
DNS_SERVER: 8.8.8.8
EOF

As can be seen, we have a total of 6 new values that we can use to configure this solution on a per-cluster basis:

  1. USE_STATIC_IPS – a Boolean set to false by default in order not to change the default TCE/TKGm behavior. You must set this to true in the cluster config file to enable static IP management for the cluster.
  2. FIRST_IP – as mentioned, an IP Pool is needed per cluster, and this value is configured in that IP Pool as the first IP in the range of IPs the IPAM solution will manage for our cluster.
  3. LAST_IP – the last IP the IP Pool will manage, closing the IP address range that starts at FIRST_IP.
  4. SUBNET_PREFIX – the subnet prefix of the network your nodes are being provisioned to. By default I have set it to 24, which is the equivalent of a 255.255.255.0 netmask (a class C sized network).
  5. DEFAULT_GATEWAY – the default gateway of the node network.
  6. DNS_SERVER – the DNS server you want configured on the cluster's nodes.

The final file we need to add is a file that will create the IP Pool for us at cluster creation time:

cat << EOF > ~/.config/tanzu/tkg/providers/infrastructure-vsphere/ytt/vsphere-static-ip-ippool-addition.yaml
#@ load("@ytt:data", "data")
#@ if data.values.USE_STATIC_IPS:
---
apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: #@ data.values.CLUSTER_NAME
  namespace: #@ data.values.NAMESPACE
  labels:
    cluster.x-k8s.io/network-name: #@ data.values.VSPHERE_NETWORK.split("/")[-1]
spec:
  clusterName: #@ data.values.CLUSTER_NAME
  pools:
  - start: #@ data.values.FIRST_IP
    end: #@ data.values.LAST_IP
    prefix: #@ data.values.SUBNET_PREFIX
    gateway: #@ data.values.DEFAULT_GATEWAY
  prefix: #@ data.values.SUBNET_PREFIX
  gateway: #@ data.values.DEFAULT_GATEWAY
  namePrefix: #@ "ip-{}".format(data.values.CLUSTER_NAME)
  dnsServers:
  - #@ data.values.DNS_SERVER
#@ end
EOF

Once these 3 files are in place, we can start to create our clusters with static IPs!

Creating a Cluster

As mentioned, if you want to use static IPs there are a few variables you will need to set in your cluster's configuration file before deploying the cluster.
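
For example, the static IP related section of a cluster config file might look like this (all of the addresses here are made up, adjust them to your environment):

USE_STATIC_IPS: true
FIRST_IP: 10.0.0.20
LAST_IP: 10.0.0.30
SUBNET_PREFIX: 24
DEFAULT_GATEWAY: 10.0.0.1
DNS_SERVER: 10.0.0.2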

Once you have added those values as described above, you can simply create the cluster via the Tanzu CLI

tanzu cluster create -f <CLUSTER-CONFIG-FILE>

If you want to see which objects are created, you can easily do so via kubectl, as everything is a Kubernetes resource.
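
For example, the following should list the IPAM objects created for a cluster deployed to the default namespace (adjust the namespace if you deploy your clusters elsewhere):

kubectl get ippools.ipam.metal3.io,ipclaims.ipam.metal3.io,ipaddresses.ipam.metal3.io -n default
kubectl get vspheremachines -n default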

If you have the kubectl lineage plugin or the kubectl tree plugin installed, the visibility into the solution becomes really awesome:

Kubectl Lineage Example:

Kubectl Tree Example:

The Future

Currently, work is being done in upstream Cluster API to add an official set of IPAM APIs that will allow providers to build a solution like this directly into their providers in a streamlined manner.

CAPV is already looking into this very strongly, and once this functionality is released in core Cluster API, I am sure that shortly afterwards we will see similar solutions implemented by the official providers in a standardized way, which will make adopting such features as a supported part of Tanzu much more feasible.

In the meantime, this is a great solution for those specific use cases where you really don't have the ability to manage the clusters via DHCP and want the simplicity of static IPs.

Summary

While at first the solution looks complex, the setup takes less than 5 minutes and the added value in my opinion is huge!

I use this in my home lab and it works great, but make sure to validate it thoroughly and repeatedly before implementing it in any sort of production environment.

As mentioned, there is no support for this solution today, but if static IPs matter to you and you want this functionality in TCE/TKG, make sure your voice is heard: raise issues and comment on existing ones in the TCE or Tanzu Framework GitHub repos, and talk to your VMware account team about why you want this type of solution.
