Multi Cluster TAP – Why Do We Need It?

One of the great features of Tanzu Application Platform (TAP) is that it supports multi cluster deployments.

Many people may ask why this matters, and whether it is really needed.

To answer these questions, let’s take a quick look back at TAS, a great VMware maintained PaaS solution, and at how it handled, or didn’t handle, the multi cluster story.

Tanzu Application Service (Cloud Foundry)

In TAS, the entire solution is deployed on a per environment basis. This means that the entire stack, otherwise referred to as a foundation, must be deployed in every environment where I want to run a TAS based application.

If we take a simple use case where a company has 3 different environments (Dev, QA and Prod), this would mean that I would deploy the entire TAS stack 3 times.

There would also be no connection at the platform level between the 3 different environments.

This means that when I move from Dev to QA and finally to Prod, in each environment the platform would compile my code, build the container / blob, create the deployment manifests and finally deploy my application.

This also means that whenever I make a change in one environment, I must propagate the same change to each of my additional environments, either manually or via some custom automation.

While writing these automations is not a terribly complex task, it is a level of overhead we must maintain over time.

Another key issue with this approach is that my artifact is being built multiple times and not just once. While Buildpacks and the overall CF platform do offer the benefit of reproducibility, making the outputs of the multiple builds essentially the same, it is still a repeated task that can make things like attestation, artifact tracing and overall confidence in the idea of artifact promotion much more difficult to achieve.

Another key aspect that makes multiple foundations a challenge is the fact that they take up A LOT of resources!

For example, when running on vSphere, the minimal footprint as documented in the TAS documentation requires:

82 vCPUs, 120GB RAM and 2TB of storage.

Now if we multiply that by our 3 environments, and this is all for a single AZ deployment, which is not recommended for production, we are talking about a minimum of 246 vCPUs, 360GB RAM and 6TB of storage!

That can become very expensive over time.
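To make the arithmetic above concrete, here is a minimal Python sketch. The only inputs are the documented minimal footprint quoted above and the 3 environments from our example; the `Footprint` class and its `scaled` method are names I made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Resources needed by one foundation (illustrative helper, not a TAS API)."""
    vcpus: int
    ram_gb: int
    storage_tb: int

    def scaled(self, count: int) -> "Footprint":
        # Every environment gets its own full copy of the stack.
        return Footprint(self.vcpus * count, self.ram_gb * count, self.storage_tb * count)


# Minimal single AZ footprint on vSphere, as quoted above.
MINIMAL_FOUNDATION = Footprint(vcpus=82, ram_gb=120, storage_tb=2)
ENVIRONMENTS = ["Dev", "QA", "Prod"]

total = MINIMAL_FOUNDATION.scaled(len(ENVIRONMENTS))
print(f"{total.vcpus} vCPUs, {total.ram_gb}GB RAM, {total.storage_tb}TB storage")
# prints "246 vCPUs, 360GB RAM, 6TB storage"
```

Adding a fourth environment to `ENVIRONMENTS` grows the total by another full foundation, which is exactly the scaling problem described here.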

Tanzu Application Platform

In TAP, the story is very different.

One of the key reasons that TAP can support a Multi Cluster topology is that unlike TAS, TAP is built to be customized, manipulated, and tweaked to your needs.

Because the steps in our software supply chain on TAP are configurable, we can have each cluster that runs TAP perform different tasks and steps.

TAP is also based on an OSS project called Cartographer. Cartographer, as the underlying tool for choreographing the supply chain, has a clear separation between CI and CD.

While for CI we have what is referred to as a Cluster Supply Chain, in the CD world we have a Cluster Delivery. Both resources are similar in their structure, but each one is purpose built for its specific domain.

When it comes to promotion between environments, we are given 2 OOTB solutions that offer us a great mechanism for promoting artifacts from one environment to X other environments.

The first option is to use GitOps. TAP will build our image for us and also generate our Kubernetes manifests, amongst other tasks it performs. Once the manifests are generated, not only can TAP apply them to the cluster, it can also push them to a Git repository. This allows for traceability and also helps us promote artifacts, as the artifacts are now saved outside of the cluster, in a central location.

The other OOTB option is to use what I refer to as OCIOps, which other places may refer to as RegOps. This is where the manifests are stored not in Git but rather in an OCI bundle in a container registry. While the technology used is different, the underlying objective and outcome are the same. In both cases our configuration is pushed to a central location, from which it will be pulled and run in additional clusters.
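The pattern shared by both options, build once, push to a central location, pull everywhere else, can be sketched in a few lines of Python. This is a toy model and not how TAP or Cartographer implement it: `CentralStore` is a hypothetical in-memory stand-in for a Git repository or an OCI registry, and the manifest bytes are a placeholder.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content address of an artifact, in the familiar sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class CentralStore:
    """Hypothetical stand-in for a Git repository or an OCI registry."""

    def __init__(self) -> None:
        self._objects = {}

    def push(self, content: bytes) -> str:
        ref = digest(content)
        self._objects[ref] = content
        return ref

    def pull(self, ref: str) -> bytes:
        content = self._objects[ref]
        # A consumer can always check that it got exactly what was built.
        if digest(content) != ref:
            raise ValueError(f"content does not match {ref}")
        return content


# Build cluster: render the manifests once and publish them.
store = CentralStore()
manifests = b"rendered manifests for my-app (placeholder content)"
ref = store.push(manifests)

# Runtime clusters: pull the very same artifact by reference, no rebuild.
deployed = {env: store.pull(ref) for env in ["qa", "prod"]}
print(ref, all(content == manifests for content in deployed.values()))
```

The point of the sketch is that every environment after the first consumes the same bytes, identified by the same reference, instead of producing its own copy.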

So Why Multi Cluster?

The reasons for wanting to go down a multi cluster approach can be split into multiple general sections:

  1. Resource Utilization
  2. Security
  3. Artifact traceability and attestation
  4. Scalability
  5. Multi Cloud
  6. Industry Standards
  7. Visibility

Let’s break these down and understand how Multi Cluster Topologies can help.

Resource Utilization

As mentioned before, it is inevitable that we will need multiple environments. While that is true, the need to deploy the entire platform in every environment can get very costly in terms of cash, resources and also time. The time aspect is key and is often overlooked. The need to rebuild an app at every promotion can be time consuming, and when you deal with large scales, this can quickly build up and become a true bottleneck.

The fact that you don’t need to run Tanzu Build Service, Tekton, image scanning mechanisms and more in every cluster can save huge amounts of resources, and the fact that you don’t need to run Knative and Contour in your build cluster, for example, saves even more.

While the numbers may not seem huge at the beginning, when we start to scale out our landscape, the numbers simply keep growing and that’s where this type of a multi cluster topology can really shine!

Security

Another key aspect that Multi Cluster Topologies can help with is in the realm of security.

Many organizations have regulations and security requirements that can, for example, limit access to the internet from our clusters. Beyond that, we want our environments to be as locked down as possible overall, and to open up our clusters only to the bare minimum needed, even within our company’s network.

By separating the build aspects from the runtime aspects into separate clusters, we can, for example, have our build cluster open either to the internet, for pulling down dependencies from the sources we need, or to our internal artifact registries. At the same time, we can lock down our runtime clusters so that they only have access, for example, to our container registry for pulling images and our Git server for pulling down manifests, without the need to have the runtime clusters open to our build time infrastructure.

This can offer huge flexibility: on the one hand we do not impede developer agility and speed, while on the other we keep our environment safe.
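As a rough illustration of this split, here is a small Python model of per cluster egress rules. It is only a sketch of the idea: the hostnames are hypothetical, and in a real environment these rules would live in firewalls or network policies rather than in a Python dictionary.

```python
# Hypothetical destinations, used only for illustration.
REGISTRY = "registry.example.com"
GIT_SERVER = "git.example.com"
ARTIFACT_REPO = "artifacts.example.com"

# Each cluster role is opened up only to what it actually needs.
EGRESS_ALLOWLIST = {
    # Build cluster: fetches dependencies, pushes images and manifests.
    "build": {ARTIFACT_REPO, REGISTRY, GIT_SERVER},
    # Runtime cluster: pulls images and manifests, nothing else.
    "run": {REGISTRY, GIT_SERVER},
}


def may_connect(cluster_role: str, destination: str) -> bool:
    """Return True if a cluster of this role is allowed to reach the destination."""
    return destination in EGRESS_ALLOWLIST.get(cluster_role, set())


print(may_connect("build", ARTIFACT_REPO))  # True
print(may_connect("run", ARTIFACT_REPO))    # False
print(may_connect("run", REGISTRY))         # True
```

The runtime allowlist stays short no matter how many dependency sources the build cluster needs, which is what keeps the runtime clusters easy to lock down.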

Artifact traceability and attestation

Another key aspect that Multi Cluster Topologies can help with is related to security but has many other implications and benefits as well.

The idea of artifact traceability and attestation is not new, but it is getting a lot of attention these days, especially since we have seen multiple large scale attacks in the industry that could have been prevented if such mechanisms had been in place.

The general idea of being able to trace the provenance of an artifact throughout its lifecycle is key to making sure that our software supply chain is secure and that nothing malicious has altered our software along the way.

This is a topic that is being invested in heavily in TAP, and it is already seeing its first steps through the built in integration with Cosign, which will automatically sign our images when they are built in the build cluster. The next step is a built in configuration we have in the runtime clusters that can validate and enforce that only images signed by a specific build cluster, for example, can be run in the cluster. This allows us to be certain that only images whose provenance we know can be run in our clusters.
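The sign at build time, verify at admission flow can be sketched in Python as follows. Please read this as a toy model only: Cosign works with real signature schemes and its own tooling, whereas this sketch swaps in an HMAC from the Python standard library as a stand-in, and the key, the `sign` function and the `admit` function are all hypothetical names.

```python
import hashlib
import hmac
from typing import Optional

# Assumption: a shared secret stands in for the build cluster's signing identity.
BUILD_CLUSTER_KEY = b"hypothetical-build-cluster-key"


def sign(image_digest: str, key: bytes) -> str:
    """Build cluster side: produce a signature over the image digest."""
    return hmac.new(key, image_digest.encode(), hashlib.sha256).hexdigest()


def admit(image_digest: str, signature: Optional[str], trusted_key: bytes) -> bool:
    """Runtime cluster side: only admit images signed by the trusted build cluster."""
    if signature is None:
        return False  # unsigned images are rejected outright
    expected = sign(image_digest, trusted_key)
    return hmac.compare_digest(expected, signature)


image = "sha256:" + hashlib.sha256(b"my-app image contents (placeholder)").hexdigest()
good_signature = sign(image, BUILD_CLUSTER_KEY)
foreign_signature = sign(image, b"some-other-cluster-key")

print(admit(image, good_signature, BUILD_CLUSTER_KEY))     # True
print(admit(image, None, BUILD_CLUSTER_KEY))               # False
print(admit(image, foreign_signature, BUILD_CLUSTER_KEY))  # False
```

Because the image is built and signed exactly once, the runtime clusters only ever need to trust one signer, which is a much simpler policy than trusting a separate build in every environment.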

Another key aspect is that because TAP uses Cloud Native Buildpacks by default to build images, we get an SBOM attached to each image that is built, which is key information when trying to attest to the origin of an image and to what is in that image.

While much more work is needed in this area, by having the signing and SBOM generation capabilities built in from the beginning into the platform, we are already in a pretty good place.

This is a huge improvement over the traditional PaaS approach of rebuilding an artifact in every environment, as the attack surface in that case is much wider, and we don’t actually have traceability back to the original source, which is our development environment. In TAP this is all possible with little to no configuration after deploying the platform, which is pretty amazing!

Scalability

Another key aspect that multi cluster can help with is scalability of our platform.

Typically we see customers starting with relatively small Kubernetes clusters, but as time passes, these clusters keep growing. We have had many cases where, for many reasons including scalability issues, customers have needed to split clusters.

This is easily dealt with in TAP, as we can simply add another CD target, which would be our new cluster, without affecting our running environment.

Multi Cloud

TAP is truly a multi cloud PaaS.

While tools like TAS could be run on different clouds, there were complexities involved in getting this to work and to manage it at scale.

A key factor that differentiates the 2 tools in this regard is where the platform begins, and what it needs to integrate with.

TAS included within it the container runtime and container orchestrator as well as the platform itself. This meant that the platform was tightly coupled to the underlying container orchestration tool. It also meant that the integration point between TAS and our cloud or clouds of choice was specific to each cloud provider. TAS had to be deployed with a specific cloud provider interface (CPI) that knew how to interact with the specific cloud provider to perform tasks such as managing VMs, LBs, Security Groups etc.

This means that the configuration was different for every cloud provider, and that you could only deploy TAS on a cloud that had a TAS Cloud Provider implementation.

In TAP, the platform does not include a container orchestrator. Rather, it simply relies on you installing the platform on a conformant Kubernetes cluster, which could be of any form, shape or size.

Because the integration layer for TAP is simply Kubernetes, you can run TAP anywhere that has Kubernetes. This can be vSphere, GCP, AWS or Azure, which are all supported in TAS as well, or it could even be bare metal servers in your datacenter, or any other cloud provider’s Kubernetes offering, as long as that cluster is a conformant Kubernetes cluster.

Decoupling the runtime from the platform is a huge benefit, as it means the TAP development team can concentrate on building out the platform itself and does not need to worry about the underlying cloud provider the platform is running on.

This also helps us as users be assured that we could migrate to a different cloud, or add another cloud into our environment and onboard it into TAP very easily, without any special configurations or changes!

Industry Standards

When we look at the broader ecosystem today, the idea of a multi cluster supply chain, with purpose built clusters between which artifacts are promoted using GitOps principles, is becoming a de facto standard.

TAP as a platform is not only built from industry standard tools such as Kubernetes, Knative, Contour, Cert Manager, Buildpacks, Tekton, FluxCD etc., but it also enables us to utilize industry standard practices such as GitOps in a really easy to consume manner.

This can be extremely helpful when onboarding new developers or platform engineers into our environments, as the concepts and methodologies used in TAP are industry standards, which means that we have a much better chance of finding people who are familiar with these approaches than with the proprietary and tool specific approaches we see in other PaaS platforms.

Visibility

Another key aspect that TAP in a multi cluster topology offers us is in the realm of visibility.

When we have multiple environments, visibility across our landscape becomes a challenging topic to tackle.

TAP, which uses the CNCF Backstage project for its GUI, has a huge advantage because Backstage is built in a way that easily allows connecting multiple clusters into the same GUI, giving true single pane of glass visibility across our entire landscape.

TAP GUI includes within it built in support to visualize our workloads across clusters, side by side, allowing us to see how a particular service is performing and where it is running across all of our clusters.

Summary

The idea of a multi cluster PaaS is truly exciting in my opinion. I think that having this capability brings the opportunity to think of how we build out our environments in ways we haven’t thought of as possible till today.

I’m truly excited to see how the Multi Cluster TAP functionality will grow and evolve over time.

For those interested in how I set up my Multi Cluster TAP environments on vSphere, you can check out my GitHub repo, which has a simple Bash script that can deploy 5 TKGm clusters and fully configure TAP on them in a simple and automated way.
