  • Updating TBS Dependencies in TAP

    TLDR – as of January 26th, you will need to update your TBS dependencies if you are building Spring Boot applications.
    This is related to the linked known issue in TAP 1.4, but it also affects previous versions of TAP.

    Preface

    A few days ago I was working on some TAP-related testing, and all of a sudden a workload that had always worked started to fail on me.

    I couldn't figure out why my app was failing, so I started to look into what was going on.

    I quickly saw that the build itself was failing for my container, with the following error message:

        [creator]     unable to invoke layer creator
        [creator]     unable to contribute spring-cloud-bindings layer
        [creator]     unable to get dependency spring-cloud-bindings
        [creator]     unable to download https://repo.spring.io/release/org/springframework/cloud/spring-cloud-bindings/1.10.0/spring-cloud-bindings-1.10.0.jar
        [creator]     could not download https://repo.spring.io/release/org/springframework/cloud/spring-cloud-bindings/1.10.0/spring-cloud-bindings-1.10.0.jar: 401
        [creator]     ERROR: failed to build: exit status 1
    

    After some quick searching on the internet I stumbled upon the following blog post, which describes a change being made in the Spring Artifactory instance: anonymous access for downloads of released artifacts is being removed, and people should pull these from Maven Central instead.

    The issue I was hitting was due to the fact that the Spring Boot buildpack that is part of the TAP 1.4 release still tries to pull the spring-cloud-bindings dependency from the Spring repository, and I happened to be testing things during one of the brown-outs ahead of the final removal of anonymous access on January 26th, 2023.
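
    To make the "pull these from Maven Central" part concrete, here is a minimal sketch that builds the Maven Central URL for an artifact from its coordinates, using the standard Maven repository layout. The helper name maven_central_url is made up for this example; it is not part of any buildpack or CLI.

    ```shell
    # Build the Maven Central URL for a released jar from its coordinates.
    # maven_central_url is a hypothetical helper name used only in this sketch.
    maven_central_url() {
      # org.springframework.cloud -> org/springframework/cloud
      group_path=$(printf '%s' "$1" | tr '.' '/')
      printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
        "$group_path" "$2" "$3" "$2" "$3"
    }

    # Example: the dependency from the failing build above.
    # maven_central_url org.springframework.cloud spring-cloud-bindings 1.10.0
    # The result can be passed to "curl -sI" to check that the jar is reachable.
    ```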

    Solving The Issue

    TAP ships releases on a quarterly cycle, plus monthly patch releases which typically help solve these types of issues. Tanzu Build Service (TBS), the underlying project used in TAP for building images, releases new buildpacks and dependencies far more frequently in order to patch CVEs, bump dependencies, add new features, add new buildpacks, etc.

    As VMware Tanzu Buildpacks use the Paketo buildpacks as the base for most of the buildpacks, I looked to see if the Paketo Spring Boot buildpack had already solved this issue, and indeed I found that it had been fixed in the 5.22.1 release of the buildpack.

    From there, through some simple browsing of the Tanzu Buildpacks documentation, I found that not only had an upstream release been cut by Paketo solving this issue, but VMware had also already released a Tanzu Buildpack release for downstream consumption which incorporates the fix.

    So now that I had found that a new release was available, it was time to figure out how to update my TBS dependencies.

    While I could just update the relevant buildpack, I decided that if I'm already updating dependencies, I might as well update the whole set of dependencies available, as it's always best to stay in line with the latest patches.

    To do this, VMware have a very detailed document in the TBS documentation, which is also linked to from the TAP documentation.

    With all that information, let's go through the general steps I used.

    Download the descriptor YAML from Tanzu Network

    The dependency bundles and their relevant descriptor files can be found at the following URL

    In my case I used the version 100.0.396 which was the latest version at the time.

    The file I downloaded was the descriptor-100.0.396.yaml file which is the full dependencies descriptor, however you could also use the lite-descriptor-100.0.396.yaml file if you so desire.

    Relocating the dependencies to my local registry

    In line with the best practice from VMware for TAP, which is to relocate all images to your own registry and not to depend on the VMware public registry, I decided to relocate the dependencies.

    Step 1 – Log in to your registry and to the Tanzu Network registry

    export MY_REGISTRY_URL=harbor.vrabbi.cloud
    export MY_REGISTRY_REPO=tbs/full-deps
    docker login registry.tanzu.vmware.com
    docker login registry.pivotal.io
    docker login $MY_REGISTRY_URL
    

    Step 2 – Relocate dependencies

    export DEP_VERSION="100.0.396"
    imgpkg copy -b registry.tanzu.vmware.com/tbs-dependencies/full:$DEP_VERSION \
      --to-repo $MY_REGISTRY_URL/$MY_REGISTRY_REPO
    

    Step 3 – Update Dependencies to be pulled from my registry

    imgpkg pull -b $MY_REGISTRY_URL/$MY_REGISTRY_REPO:$DEP_VERSION -o ./tbs-deps
    kbld -f ./tbs-deps/.imgpkg/images.yml \
      -f ./tbs-deps/tanzu.descriptor.v1alpha3/descriptor-$DEP_VERSION.yaml \
      > ./tbs-dependencies-from-my-registry.yaml
    

    Step 4 – Update the dependencies in my cluster

    kp import -f ./tbs-dependencies-from-my-registry.yaml
    

    With this done, we have now successfully updated TAP to use the new buildpacks, build and run images, and all the new and fancy changes that have been added by the TBS team since the TAP 1.4 release came out.

    Summary

    Updating dependencies is a process one should do on a regular basis in order to always stay as secure and up to date as possible. TAP lets us easily update our dependencies out of band from TAP releases, which is great, and allows us to perform the updates we need as frequently as we need to, without the need to constantly be chasing after the next TAP release the second it is released, just to get the security patches we have been looking for.

    While the above process seems like quite a few steps and includes some manual tasks around downloading the dependency file from Tanzu Network, the download can easily be scripted using the Pivnet CLI, and the whole process can be automated using whatever CI/CD tools you use.
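
    As a rough sketch of what that automation could look like, the function below downloads the descriptor with the Pivnet CLI instead of the browser. The PIVNET_TOKEN variable and the function name are assumptions of mine, and the product slug is assumed to be tbs-dependencies (matching the registry path used earlier), so verify both against your Tanzu Network account before relying on it.

    ```shell
    # Download a TBS dependency descriptor from Tanzu Network with the pivnet CLI.
    # Assumptions: PIVNET_TOKEN holds a Tanzu Network API token, and the product
    # slug is "tbs-dependencies". download_descriptor is a hypothetical name.
    download_descriptor() {
      version="$1"
      # Pass "lite-descriptor" as the second argument to fetch the lite file.
      file="${2:-descriptor}-${version}.yaml"
      pivnet login --api-token "$PIVNET_TOKEN"
      pivnet download-product-files \
        --product-slug tbs-dependencies \
        --release-version "$version" \
        --glob "$file"
    }

    # Example:
    # PIVNET_TOKEN=... download_descriptor 100.0.396
    ```

    From there, the relocation and kp import steps shown earlier can run unchanged in a pipeline.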

    Hopefully this helps you if you run into this issue specifically, or really in any case, where you want to update dependencies out of band from a TAP release.

  • TAP 1.4 – What’s New

    TAP 1.4 is a major release for Tanzu Application Platform which includes a bunch of really awesome new features, as well as improvements to existing features in terms of stability and UX.

    My favorite new features:

    1. Developer Namespace Preparation Controller
    2. TLS Management with the new Shared Cluster Issuer Integration
    3. Inner Loop support for Dotnet Core Applications
    4. Prisma Scanner (Alpha)

    My favorite enhancements:

    1. Live update and Debug can run together
    2. Application Accelerator enhancements

    Let's take a look at each of these features and what they bring to the platform.

    Developer Namespace Preparation Controller

    One of the key difficulties with managing TAP before version 1.4 was the walls of YAML that were needed in order to prepare a namespace for TAP workloads.

    This process entailed creating RBAC resources, Scan Templates, Scan Policies, Service Accounts, secrets, Testing Pipelines, etc.

    While this was always possible to automate using tools like Kyverno, as I discussed in a previous blog post, having a built-in solution is a huge improvement.

    As of TAP 1.4, we now have an additional component called “Namespace Provisioner” which allows operators to easily prepare namespaces for TAP workloads.

    The solution can deploy out of the box resources, that correspond to the Supply chain you have installed in the cluster, as well as any other resources you desire via a simple git integration mechanism, which is based on the Carvel tooling, namely kapp controller and YTT.

    The docs are quite informative and give good examples of how to extend the functionality of the new package to adapt it to your own needs.

    While this is just the initial release, and it is not perfect yet, it already covers the vast majority of use cases and is a huge step in the right direction towards making TAP much easier to consume, which is awesome!

    TLS Management with the new Shared Cluster Issuer Integration

    One of the new features that came along with TAP 1.4 is the new and improved integration with Cert Manager.

    While Cert Manager was always included in TAP, this new integration gives us some pretty great new functionalities.

    First of all, TAP components that are exposed via ingress are now signed with certificates generated from a single CA, making TLS trust much smoother. While not all components have moved to using this issuer, the majority have, and the rest will be added in upcoming releases.

    Beyond the TAP components themselves, we also now get auto-generated per-workload certificates using the same issuer for all deployed applications! This is a huge improvement and makes the OOTB security posture of TAP much better.

    By default the integration uses a self-signed CA issuer. However, you can easily swap in any other issuer, be it LetsEncrypt, Venafi, Vault or really any other CA you use, and TAP will use it to sign the certificates used by the system and workloads!

    Previously the common solution was to use a wildcard certificate. While that did work in many cases, it had some serious limitations as well as security implications, which are resolved through this new integration. This is really a great step towards a secure-by-default app platform!

    Inner Loop support for Dotnet Core Applications

    The inner loop development support for Spring Boot in VSCode and IntelliJ is one of the greatest features of TAP. In this release, VMware have added support for inner loop development of .NET Core applications via a new IDE plugin for Visual Studio.

    This is a great addition to TAP, as it broadens the number of use cases and customers the platform is applicable to.

    .NET Core is extremely popular, and with Steeltoe we can even get Spring-like functionality, such as actuator endpoints, really easily for .NET-based applications, which is awesome!

    Supporting Visual Studio for this flow is a huge improvement as well, as many enterprise organizations have standardized on Visual Studio as their IDE.

    While currently the 2 supported languages for inner loop development are Java and .NET, I have been able to make it work with Golang and Python as well, which shows how easy the solution is to extend. I’m sure we will see more languages and IDEs added in future releases, further enhancing the experience for developers in the language and IDE of their choosing.

    Prisma Scanner (Alpha)

    One of the most exciting new features in my opinion is the new Alpha integration of Prisma as a scanner for both Source code and image scanning in TAP.

    TAP has always had a pluggable mechanism for scanners, and as TAP has matured, more and more scanners have been added to the platform.

    Prisma is one of the most common scanners on the market today, and is used by many customers, especially those in the cloud, that use Prisma as not only a scanner, but also as a very powerful CSPM and Kubernetes runtime protection tool.

    Adding support for Prisma in TAP makes onboarding to the platform in brownfield environments much easier, as policies and knowledge around Prisma already exist in many customers' environments, making this a huge plus for many potential TAP users!

    While building your own scanner integration is not a very difficult task in many cases, as I discussed at length in a previous post about integrating Trivy as a scanner within TAP, having a supported integration from VMware is always better, and this is a great addition!

    Live update and Debug can run together

    This is a huge improvement in functionality added in TAP 1.4 in both IntelliJ and VSCode.
    The inner loop development capabilities are a key feature of TAP, and being able to run remote debugging and live update in parallel on our workloads increases developer velocity and makes the developer experience so much better.

    It truly is awesome to see all of the improvements being made in this area from release to release, and seeing how this is helping customers is really awesome!

    Application Accelerator enhancements

    The entry point into TAP for developers is Application Accelerator, and I believe that it is one of the most integral parts of TAP, which brings some amazing capabilities to all organizations.

    The 4 key features that have been added to App Accelerator in TAP 1.4 which I am excited about are:

    1. Loop transform
    2. Custom Type definitions
    3. Git repo creation from VSCode
    4. Code completion & validation for accelerator YAML definitions

    Let's look at each of these new features briefly.

    Loop Transform

    This is a huge improvement with many different use cases. Previously, if we had an array as an input, that array was passed to a transform step as is, and the transform itself had to deal with the array input. This meant that a transform was only executed a single time per accelerator run.

    Now with the loop transform, we can iterate over the elements in an array and apply transformations for each and every element of that array.

    This can be extremely beneficial when trying to do bulk operations, and I have already started working on some new accelerators, which I will be sharing publicly, that utilize this feature extensively.

    Custom Types

    This is another huge feature add, which when paired with the loop transform becomes even more powerful.

    Till now, every input was of a simple type: basically a string, a number, or a boolean. In many cases we have a set of values that are all related to one another, which would be better suited to something like a map in programming terms. A simple example of this could be setting key-value pairs for things like labels or annotations in Kubernetes.

    The cool thing about being able to define custom types, which as mentioned are basically maps in programming terms, is that when they are used with the loop transform, our options are truly endless. We can build some pretty amazing and complex accelerators, which can offer huge value to our organizations in a simple and streamlined approach.

    Git Repo Creation From VSCode

    Since the VSCode extension was added for application accelerators in TAP 1.2, I have been using it all the time. I love backstage, and the app accelerator plugin in TAP GUI is great, but in the end when working on code, the place you really want to work from is your IDE, and the VSCode plugin is exactly what I wished for.

    While this was great and offered huge value already, in TAP 1.3 we got a great new feature in app accelerator which was the ability to auto create the Git Repo for us when generating a new project. This however was only possible when accessing app accelerator via TAP GUI and was not part of the VSCode extension.

    Now in TAP 1.4 I can have the same experience and functionality directly from my IDE as well, which makes the experience so much better and enables all the needed functionality from app accelerator to be done without ever leaving my IDE. I can now generate a project from an accelerator, create a repo, deploy an app in my iterate cluster, live update and debug it, and deploy it to my production cluster without ever leaving the comfort of my precious IDE!

    Code completion & validation for accelerator YAML definitions

    This is a really nice feature, which is available in VSCode as well as Intellij.

    It's not always easy to remember the exact syntax, or what the exact structure should be, for an accelerator YAML file. With this new feature, we get code completion as well as validation for our accelerator YAML files in these 2 IDEs.

    Making the experience of authoring accelerators easier is a great thing, as it democratizes who can create these accelerators and can enable quicker delivery of new accelerators within your organization, which in turn brings value quicker to your customers, who are the developers in your org.

    Summary

    TAP 1.4 is a huge milestone, and has added some amazing features. This blog only covers the tip of the iceberg of all the changes and enhancements that were added into TAP 1.4.

    The pace at which TAP is advancing is truly amazing, and now more than ever, you should start hopping on the TAP train, because it's a pretty awesome ride!

  • TAP PR Flow with Azure DevOps

    TAP has an awesome GitOps PR flow, but OOTB it does not work with Azure DevOps. In this post we will see how we can make it work with Azure DevOps as well.

    Why Azure DevOps Is Difficult

    While Azure DevOps does have Git repositories, there are 3 main issues with the Azure DevOps Git implementation.

    1. They have not actually implemented the Git v1 API and require the use of the “multi ack” protocol, which is not implemented in go-git, the main Git implementation library in Go and the one used in nearly all GitOps tools.

    2. The URL of a Git repository does not follow the standard path format implemented in GitHub, GitLab, Gitea, etc., which makes templating difficult.

    3. Azure DevOps does not support cloning repositories with the standard “.git” suffix added to the URL. Many platforms and solutions add this suffix automatically, and with Azure DevOps it simply does not work.

    Beyond these specific low-level issues in Azure DevOps, and very likely because of them, jx-scm from the JenkinsX project, the tool which is used in TAP to create PRs in the GitOps flow, does not support Azure DevOps.

    Time to find a solution

    How to create a PR

    When searching for how to create a PR via the command line in Azure DevOps, I found that the official method is to use the Azure CLI with the Azure DevOps extension added. The issue I found is that this would increase the size of the container image used to create the PRs by 1.2GB!

    With this being immediately disregarded as a possible solution for me, I decided that using the REST API via simple curl commands is probably the way to go.

    While Azure DevOps is a clunky Git implementation, the REST API is actually very well documented, so finding the right API was actually really easy.

    Once I found the right API, I took a look at the OOTB Tekton ClusterTask which is used for pushing commits and opening a PR, and updated the script to support Azure DevOps as well.

    The change that was needed is in the final step, which is called “open-pr”. The new script basically needs to check if the git provider type is Azure DevOps and, if so, use our custom logic; otherwise, it simply uses the default as it comes.

    The final script looks like this:

    #!/usr/bin/env bash
    
    set -o errexit
    set -o pipefail
    
    # Branch name written by an earlier step of the task
    head_branch=$(cat /workspaces/ws/commit_branch | tr -d '\n')
    # Pull the token out of a https://<user>:<token>@<host> credentials entry
    token=$(cat "$(credentials.path)/.git-credentials" | sed -e 's/https:.*://' | sed -e 's/@.*//')
    if [[ "$(params.git_server_kind)" == "azure" ]]; then
      # jx-scm does not support Azure DevOps, so call the REST API directly
      JSON_FMT='{"sourceRefName":"refs/heads/%s","targetRefName":"refs/heads/%s","title":"%s","description":"%s"}\n'
      bodyJson=$(printf "$JSON_FMT" "$head_branch" "$(params.base_branch)" "$(params.pull_request_title)" "$(params.pull_request_body)")
      # repository_name is <PROJECT NAME>/_git/<REPO NAME>; the API path uses /_apis/git/repositories/ instead
      repo_api_path=$(echo "$(params.repository_name)" | sed 's|/_git/|/_apis/git/repositories/|g')
      uri="$(params.git_server_address)/$(params.repository_owner)/$repo_api_path/pullRequests?api-version=3.0"
    
      echo "$bodyJson"
      echo "$uri"
    
      # "test" is only a placeholder user name; the token is what authenticates
      result=$(curl -X POST -u "test:$token" -H "Content-Type: application/json" -d "$bodyJson" "$uri")
    
      # Record the URL of the new PR as the task result
      echo "$result" | jq -r '.repository.webUrl + "/pullrequest/" + (.pullRequestId|tostring)' > "$(results.pr-url.path)"
    else
      jx-scm pull-request create \
      --kind "$(params.git_server_kind)" \
      --server "$(params.git_server_address)" \
      --token "$token" \
      --owner "$(params.repository_owner)" \
      --name "$(params.repository_name)" \
      --head "$head_branch" \
      --title "$(params.pull_request_title)" \
      --body "$(params.pull_request_body)" \
      --base "$(params.base_branch)" \
      --allow-update 2>&1 |
      tee stdoutAndSterr.txt
      cat stdoutAndSterr.txt | sed -n -e 's/^.*\. url: //p' > "$(results.pr-url.path)"
    fi
    

    As you can see, we are simply checking the parameter which is supplied in the tap-values file called git_server_kind and if it is set to “azure” we use our own custom logic, and otherwise we use the logic provided by VMware using the jx-scm CLI.

    Solving the Path issue

    This actually doesn’t need to be solved, but what it requires is that when filling in the TAP Values, the repository_name variable under the gitops section for the supply chain config, must be in the format <PROJECT NAME>/_git/<REPO NAME>.
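
    To see why the format matters, here is a small standalone sketch of the same rewrite the open-pr script performs, turning that repository_name value into the pull request endpoint. The function name azure_pr_uri is made up for this example; the sed expression and the api-version are the ones from the script above.

    ```shell
    # Build the Azure DevOps pull request endpoint from the TAP values.
    # azure_pr_uri is a hypothetical helper name used only in this sketch.
    azure_pr_uri() {
      server="$1"; owner="$2"; repo_name="$3"
      # <PROJECT NAME>/_git/<REPO NAME> -> <PROJECT NAME>/_apis/git/repositories/<REPO NAME>
      api_path=$(printf '%s' "$repo_name" | sed 's|/_git/|/_apis/git/repositories/|')
      printf '%s/%s/%s/pullRequests?api-version=3.0\n' "$server" "$owner" "$api_path"
    }

    # Example:
    # azure_pr_uri https://dev.azure.com vrabbi test-az-devops-tap/_git/test-az-devops-tap
    ```

    If the value is missing the /_git/ segment, the sed expression matches nothing and the resulting URL will not point at the repositories API.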

    While this could be solved, and most likely hidden from the end user, in multiple different ways, it would require changes in too many places. It simply was not worth the effort, as this is a global setting and not a value that end users need to configure on a day-to-day basis.

    Solving the lack of Git v1 API support

    As mentioned above, Azure DevOps does not work with go-git, which is the default Git implementation used in TAP and is also the standard in nearly all Go-based solutions that integrate with Git.

    Luckily TAP does support configuring the Git implementation to use, with the default being go-git, but also supporting libgit2.

    libgit2 does include support for the multi ack protocol, which is what Azure DevOps requires, so that issue can be solved by adding the variable “git_implementation: libgit2” under the ootb_delivery_basic top-level key as well as under the configured supply chain's top-level key.

    libgit2 does work, however it should not be used unless no other option is available, as it is much less stable than go-git. But in this situation, we do not have a real choice.

    Solving the “.git” suffix issue

    This issue unfortunately requires changes to be made in multiple different locations, as the OOTB templates add the “.git” suffix to repo URLs when templating the resources out.

    The templates we need to change are the ones that create the gitrepository CRs and the ones that create the delivery CRs.

    This means we have 2 ClusterSourceTemplates to update:

    • source-template
    • delivery-source-template

    And we also need to update the ClusterTemplate that creates the delivery resource:

    • deliverable-template

    The change in all of them is the same. We have 2 YTT functions in all 3 of these templates called “git_repository” and “mono_repository”.

    In these functions we simply need to remove the addition of the “.git” suffix at the end.

    Once you have completed these changes, you are ready to use the GitOps flow and PR flow with Azure DevOps.

    Example GitOps section in TAP Values files for Azure DevOps

    An example of the relevant sections in the TAP Values that you can use once the above steps are completed would be

    ootb_delivery_basic:
      git_implementation: libgit2
    ootb_supply_chain_testing_scanning:
      git_implementation: libgit2
      gitops:
        server_address: https://dev.azure.com
        repository_owner: vrabbi
        repository_name: test-az-devops-tap/_git/test-az-devops-tap
        branch: main
        commit_strategy: pull_request
        pull_request:
          server_kind: azure
          commit_branch: ""
          pull_request_title: ready for review
          pull_request_body: generated by supply chain
        ssh_secret: git-creds

    Summary

    While TAP OOTB currently doesn’t support Azure DevOps in the GitOps PR flow, as I hope you can see from this post, the extensible nature of TAP makes such integrations relatively easy to do.

    Building this solution was a matter of about 1.5 hours of investigation and trial and error, which is not bad considering the intricacies of this particular problem. I hope this again shows just how flexible the platform really is, and how easy it is to customize it to your own needs!

  • Making TKGm feel like EKS

    Recently I was having a discussion with a customer regarding management of kubernetes environments in multi cloud scenarios.

    They are a very large EKS shop, and are also planning now on having a big kubernetes footprint on premises.

    One of the things that came up, was the desire to have the experience of deploying an app to either environment as similar as possible, so that in an ideal world, the target location should not matter at all to the application teams.

    Another key thing that came up was the need for applications running in the on premises datacenters, to communicate and connect to AWS services like RDS, S3, etc.

    While we were discussing this, I was trying to imagine what such a design would look like, and I knew that the key factor in this case, beyond solving this issue, was to require as little change on the existing environment as possible, due to the size of their environment in EKS.

    This is when I started to look into if and how I could provide as close to the EKS experience as possible in an on-premises environment.

    The Key Component

    The key component I tried to tackle was IRSA which stands for “IAM Roles For Service Accounts”.

    This is an amazing feature in AWS which is deeply integrated into EKS, in which I can map a Service Account in a kubernetes cluster to a specific IAM role, thereby enabling it to access AWS resources without the need for static credentials saved in secrets or similar solutions.

    This is a key feature in EKS and this customer was heavily relying on it within their environment, and I knew this would be an interesting task to try and solve.

    A Bit Of History Around IRSA

    In 2014, AWS Identity and Access Management added support for federated identities using OpenID Connect (OIDC). This feature allows you to authenticate AWS API calls with supported identity providers and receive a valid OIDC JSON web token (JWT). You can pass this token to the AWS STS AssumeRoleWithWebIdentity API operation and receive IAM temporary role credentials. You can use these credentials to interact with any AWS service, including Amazon S3 and DynamoDB.

    Kubernetes has long used service accounts as its own internal identity system. Pods can authenticate with the Kubernetes API server using an auto-mounted token (which was a non-OIDC JWT) that only the Kubernetes API server could validate. These legacy service account tokens don’t expire, and rotating the signing key is a difficult process. In Kubernetes version 1.12, support was added for a new ProjectedServiceAccountToken feature. This feature is an OIDC JSON web token that also contains the service account identity and supports a configurable audience.

    Amazon EKS hosts a public OIDC discovery endpoint for each cluster that contains the signing keys for the ProjectedServiceAccountToken JSON web tokens so external systems, such as IAM, can validate and accept the OIDC tokens that are issued by Kubernetes.

    How do we make this work On Premises

    After researching the subject, and trying to find the correct approach, I found an architecture I thought would work, and decided to try it out.

    The first thing we need to do is to create an OIDC discovery endpoint that is accessible from AWS services.

    This can easily be done by creating an S3 bucket and adding in a discovery json file at the correct path.
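
    As a rough illustration of what such a discovery file contains, here is a minimal sketch that prints one for a given issuer URL. The field set is my assumption, based on what self-hosted IRSA setups commonly publish, and the bucket name and region in the example are hypothetical; the “correct path” for OIDC discovery is .well-known/openid-configuration under the issuer URL.

    ```shell
    # Print an OIDC discovery document for the given issuer URL.
    # The field set is an assumption based on common self-hosted IRSA setups;
    # compare it with the guide you follow before using it.
    oidc_discovery_json() {
      issuer="$1"
      printf '{\n'
      printf '  "issuer": "%s",\n' "$issuer"
      printf '  "jwks_uri": "%s/keys.json",\n' "$issuer"
      printf '  "authorization_endpoint": "urn:kubernetes:programmatic_authorization",\n'
      printf '  "response_types_supported": ["id_token"],\n'
      printf '  "subject_types_supported": ["public"],\n'
      printf '  "id_token_signing_alg_values_supported": ["RS256"],\n'
      printf '  "claims_supported": ["sub", "iss"]\n'
      printf '}\n'
    }

    # Example (bucket name and region are hypothetical):
    # oidc_discovery_json "https://my-irsa-bucket.s3.eu-west-1.amazonaws.com" > discovery.json
    # aws s3 cp discovery.json s3://my-irsa-bucket/.well-known/openid-configuration
    ```

    The jwks_uri points at the keys.json file, which is why that file has to end up in the same bucket.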

    The first challenge was that we need the OIDC discovery endpoint to contain the signing keys for our service accounts projected tokens.

    At this point I stumbled upon a document on GitHub which covers the steps in detail.

    While the document seemed very interesting, it doesn’t work well with the flow that I was trying to achieve, as TKGm manages its own certificates, and the guide assumes that you are basically doing kubernetes the hard way, and manually generating the certificates and configuring the API Server ahead of deployment time.

    In general the approach they mention is as follows:

    1. Create an S3 bucket
    2. Generate an RSA signing key pair
    3. Generate an OIDC discovery json file
    4. Use a Go program to generate the keys.json file
    5. Push the files to the S3 bucket
    6. Set up your API server with flags pointing at the S3 bucket URI and at the certificates
    7. Create an IAM IDP and connect it to the S3 bucket
    8. Deploy Cert Manager (not documented well but needed)
    9. Deploy the Pod Identity Webhook

    Because TKG will auto generate our certificates for us, and I don’t want to do that out of band, and I also want to do this via Terraform, I have changed the order and steps a bit in order to fit into the TKGm flow better:

    1. Create an S3 bucket
    2. Generate an OIDC discovery json file
    3. Push the discovery json file to the S3 bucket
    4. Create an IAM IDP and connect it to the S3 bucket
    5. Create a YTT overlay to add 2 needed flags to the API server for TKG clusters
    6. Create the TKGm cluster
    7. Deploy Cert Manager
    8. Retrieve the certificate from the cluster
    9. Generate the keys.json file using Terraform
    10. Push the keys.json file to the S3 bucket
    11. Deploy the Pod Identity Webhook

    What does the flow look like

    So as mentioned above, I wanted to do this via Terraform, and as such I have built an example Terraform module that can set the entire configuration up for us, in a multi step process:

    To deploy a cluster with IRSA we first need to apply our configuration that is needed before the cluster is created. This will create the IAM IDP, S3 bucket, OIDC Discovery json file, and any requested IAM roles for IRSA usage.

    Once we do that, the Terraform module will output the manual steps we need to take in order to proceed.

    The first step is to create a YTT overlay file which will add 2 flags to the API Server pods for new clusters.

    The first flag we are adding is the “api-audiences” flag, which is needed in order to set the correct aud value in the tokens generated for service accounts. When using IRSA, the value should be “sts.amazonaws.com” in all cases, and therefore we have hardcoded the value.

    The second flag is the “service-account-issuer” flag, which needs to be set to the URL of our S3 bucket which is hosting the OIDC discovery endpoint for us. In this case, we have templated out the bucket name, which is built in our Terraform code to include the cluster name, and by doing it this way we ensure that the YTT overlay can work for all new clusters. In TKGm clusters this flag is typically set automatically by Kubeadm to https://kubernetes.default.svc.cluster.local, which is the internal DNS name for the Kubernetes API server, and we need to change it so that the issuer field in the OIDC tokens for service accounts points to the correct issuer.
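
    A quick way to check that both flags took effect is to decode a service account token from the new cluster and look at its iss and aud claims. The helper below is a minimal sketch for that; the name jwt_payload is my own, and it only decodes the payload, it does not verify the signature.

    ```shell
    # Decode the payload (second segment) of a JWT and print it as JSON.
    # jwt_payload is a hypothetical helper; it does NOT verify the signature.
    jwt_payload() {
      # Take the payload segment and map base64url characters to plain base64
      seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
      # base64url drops the "=" padding; restore it so base64 -d accepts the input
      pad=$(( (4 - ${#seg} % 4) % 4 ))
      while [ "$pad" -gt 0 ]; do
        seg="${seg}="
        pad=$((pad - 1))
      done
      printf '%s' "$seg" | base64 -d
    }

    # Example, assuming a kubectl version that supports "create token":
    # jwt_payload "$(kubectl create token default --audience sts.amazonaws.com)"
    ```

    In the decoded output, iss should be the S3 bucket URL and aud should contain sts.amazonaws.com.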

    Once we have set that overlay, the next step is to create our TKGm cluster using the Tanzu CLI as we typically would.

    Once the cluster is created, we then need to get the admin kubeconfig of the cluster which can be done as you typically would.

    The last manual step we need to take is to install cert-manager, which can be done using the Tanzu package CLI:

    Now that we have cert-manager deployed, we can run the Terraform module again after making 1 change to our tfvars file, which is to change the “cluster_created” variable from false to true.

    This change signals to the Terraform module to do the following:

    1. Retrieve the public key of the service account signing keypair from the management cluster, where it is stored in a Kubernetes secret
    2. Retrieve a service account token from a secret in the workload cluster
    3. Extract the needed data from both of these sources to build the JWKS JSON, which is needed for the keys.json file
    4. Generate the keys.json file and push it to the S3 bucket
    5. Deploy the Pod Identity Webhook components to the workload cluster
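
    Steps 3 and 4 above boil down to encoding an RSA public key as a JWKS document. The following Python sketch shows the general idea, assuming the modulus and exponent of the public key have already been extracted. The key ID is treated as an opaque input, since how the module derives it is not shown here, and the modulus in the example is a placeholder number, not a real key.

```python
import base64
import json


def b64url_uint(value: int) -> str:
    """Encode a non-negative integer as unpadded base64url, as JWK fields expect."""
    length = max(1, (value.bit_length() + 7) // 8)
    raw = value.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def build_jwks(modulus: int, exponent: int, key_id: str) -> dict:
    """Wrap a single RSA public key in a JWKS structure (the keys.json content)."""
    return {
        "keys": [
            {
                "kty": "RSA",
                "alg": "RS256",
                "use": "sig",
                "kid": key_id,
                "n": b64url_uint(modulus),
                "e": b64url_uint(exponent),
            }
        ]
    }


if __name__ == "__main__":
    toy_modulus = 0xC5A1F37B9D  # placeholder only, NOT a real RSA modulus
    print(json.dumps(build_jwks(toy_modulus, 65537, "example-key-id"), indent=2))
```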

    With this we are now done and can use IRSA based authentication in our clusters, just like we would for an EKS cluster.

    What are the general use cases

    The use case that brought me to look into this, as mentioned above, was bridging the gap between the on-prem and EKS experiences, to make the co-existence of the two more seamless. There are a few more use cases this could help with, which in many cases are even more prevalent:

    1. Velero – We can use IRSA instead of saving access keys and secret keys in a Kubernetes secret in order to use an S3 bucket as our backup target. This avoids keeping static, plain-text credentials in a secret within the cluster, making it a good choice where security is a concern.
    2. Kpack – We can use IRSA so that Kpack can push images to an ECR registry without needing to refresh our credentials every 12 hours as is needed by ECR.
    3. Kaniko – Like with Kpack, we can use IRSA to support pushing images to an ECR registry
    4. FluentBit – We can use IRSA with FluentBit to ship logs to things like OpenSearch or CloudWatch seamlessly.
    5. Crossplane – This is a huge benefit, as Crossplane with the AWS provider allows us to manage nearly all AWS services directly as CRDs in our Kubernetes clusters. Using IRSA for Crossplane can make it extremely easy to deploy and manage AWS workloads from our on-prem environment in a secure and seamless manner.
    6. AWS Controllers For Kubernetes (ACK) – Like Crossplane but specific to AWS, we can use ACK with IRSA in order to support IaC use cases against AWS in a secure manner.
    7. Thanos – A great solution for long term storage for prometheus metrics, which stores the metrics in an object store of your choosing such as AWS S3. We can have Thanos use IRSA easily in order to keep our prometheus metrics for as long as we need in a really simple process.
    8. External DNS – we can utilize IRSA together with external DNS in order to have automated DNS record management for ingress resources we create in our cluster in Route53!
    9. Cert Manager – We can utilize IRSA with the AWS Private CA plugin for Cert Manager, to auto generate certificates for us and manage their lifecycle using an AWS ACM Private CA!
    10. External Secrets – We can use IRSA for the external secrets operator, in order to pull secrets from AWS Secrets Manager or from AWS Parameter Store, in a secure and simple manner.

    And that is just a partial list of off the shelf solutions, used very often in many environments, where IRSA could be of great help.
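
    For any of the tools above, the IAM role they assume needs a trust policy that allows web identity federation from the cluster's OIDC issuer. A minimal Python sketch of such a policy is shown below. The account ID, issuer host, namespace, and service account name are all placeholder values for illustration.

```python
import json


def irsa_trust_policy(account_id: str, issuer_host: str,
                      namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy scoped to one Kubernetes service account.

    issuer_host is the issuer URL without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Matches the audience set via the API server flag.
                        f"{issuer_host}:aud": "sts.amazonaws.com",
                        # Restricts the role to a single service account.
                        f"{issuer_host}:sub":
                            f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # All values below are hypothetical.
    policy = irsa_trust_policy(
        "111122223333", "irsa-my-cluster.s3.eu-west-2.amazonaws.com",
        "velero", "velero",
    )
    print(json.dumps(policy, indent=2))
```

    The service account in the cluster is then annotated with eks.amazonaws.com/role-arn pointing at the role, which is what the Pod Identity Webhook looks for when injecting the token into pods.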

    Where is the code

    To try this out you can find the terraform code in the dedicated Github Repository.

    Summary

    This was a fun experiment and I think this has some true benefits for many use cases. I hope to streamline the process more in the future, when TKG cluster provisioning can be streamlined via the Cluster Class mechanism in TKG 2, which will hopefully allow for a single step end to end deployment mechanism for this.


  • Building Learning Center Workshops in TAP

    One of the great technologies included in Tanzu Application Platform is the Learning Center Operator.

    Learning Center provides a platform for creating and self-hosting workshops. It allows content creators to create workshops from markdown files that are displayed to the learner in a terminal shell environment with an instructional wizard UI. The UI can embed slide content, an integrated development environment (IDE), a web console for accessing the Kubernetes cluster, and other custom web applications.

    TLDR

    tanzu accelerator create lc-workshop \
      --git-repository https://github.com/vrabbi/learning-center-workshop-accelerator.git \
      --git-branch main \
      --display-name "Learning Center Workshop" \
      --icon-url "https://avatars.githubusercontent.com/u/58150231?s=280&v=4"

    How To Build A Workshop Officially

    To get started building a workshop, VMware provide a zip file on Tanzu Network with samples for getting started.

    While the idea of a sample repository is great, the content is not up to date, and does not work as is.

    The changes one would need to make include changing the apiVersion of the Kubernetes custom resources, changing the base image, as the one used is not compatible, and updating the Dockerfile for building a custom base image (if needed) to use the compatible base image. Only then can one start building out their custom content.

    Application Accelerator To The Rescue

    As I have been working on building out a lot of workshops for many different technologies and use cases, I wanted to streamline the process of getting started on a new workshop, and immediately thought of Application Accelerator as a great way to set this up.

    When approaching the task, I started looking deeply into the Learning Center docs to see what could be configured. The amount of customization available is huge, and would be way too much for an accelerator, so I decided to limit the scope to the main fields I typically want control over. If additional customizations are needed afterwards, they can be made manually on a case by case basis.

    What does the UX look like

    The user first will go to TAP GUI or to their VSCode where the application accelerator plugin has been configured and click on the relevant accelerator:

    Next they are greeted with a form to fill out basic information about the workshop:

    These are the most general settings, and are required for every workshop you generate.

    Next we include settings about the workshop in terms of difficulty, duration, and max concurrent sessions:

    We then have some resource settings that are set to the operator’s defaults but, depending on your use case, may need to be changed. These include resource quotas for the session’s namespace, as well as how much memory to allocate to the pod that the workshop will be running in:

    We then get to the workshop environment settings, where we can enable or disable different features in the environment based on the needs of our specific workshop.

    This includes things like the terminal layout, whether to include the embedded VSCode instance, enabling a kubernetes UI Console and if so choosing whether to use octant or the kubernetes default dashboard, as well as some more interesting and specialized features like deploying a container registry per session, or adding docker to the workshop environment, to allow users to build images within the workshop itself:

    We then have a mandatory field where one must fill out the list of workshop exercises they want to build. This list affects the generated files, as the modules we want in our workshop need to be updated in multiple files. Doing it via the accelerator greatly streamlines the process of building a workshop, and allows you to focus solely on the content itself without dealing with the configuration needed to make it come together:

    And finally we have the last section which is gated behind an advanced features checkbox, as it is not really needed to be configured in most cases, but can be extremely beneficial when you need these settings:

    Behind this advanced features checkbox, I have currently added 3 features.

    Using vCluster

    A really great thing in learning center is the level of isolation we get between each workshop session, which is needed as we are allowing users to dynamically create workshop sessions in our shared environment. Learning center does this by isolating each workshop session to a dedicated namespace.

    While this is great for many use cases, when building workshops around Kubernetes technologies, we often need our users to have higher-level permissions than we can offer them in the shared environment. This could include things like installing CRDs, creating namespaces, and so on.

    A great solution for this is to integrate our workshop with another really cool project called vCluster, which is built and maintained by Loft.

    vCluster enables us to deploy a virtual Kubernetes cluster on a real Kubernetes cluster. This basically entails deploying a Kubernetes control plane in a pod, using K3s (a lightweight distribution of Kubernetes), alongside another component called a syncer.

    The syncer is in charge of syncing resources between the hosting cluster (where our workshop instance is running) and the virtual cluster running on top of it when needed.

    The reason this is so cool is that while we can give our end users cluster admin access in the virtual cluster, the pods themselves, which get synced down to the actual cluster in order to run, are all confined to a single namespace. This means that we can still benefit from the namespace isolation we get in Learning Center, while unblocking our users to play around with Kubernetes with full admin rights.

    Installing Carvel CLIs

    This can be extremely helpful when dealing with kubernetes based workshops, especially in the Tanzu ecosystem, where carvel is everywhere!

    Because Carvel is a rapidly growing and evolving toolset, keeping a workshop environment up to date with the latest Carvel tools can be a challenge. Therefore, in this approach, I added a script that runs when the workshop session is created and downloads and installs the latest binaries for you. While this does slow down the startup time of a workshop slightly, I believe the tradeoff is worth it in many cases, and it prevents you from needing to build and maintain a custom container image for the workshop.

    Installing Kubectl Plugins

    While kubectl is a very powerful CLI on its own, one of the key strengths it has is actually the ease of extending it via plugins.

    Plugins can be installed and configured manually, or one could use krew, which is the kubectl plugin management tool, which itself is a kubectl plugin!

    When enabling this feature, krew gets installed in the workshop, and a set of some of my favorite plugins get installed as well.

    An example of a plugin I have included is the view-secret plugin.

    Many tasks we do in Kubernetes are repetitive, and some of them are not so fun to do from the CLI. For example, if I want to get the value of a key in a secret, I would typically need to run a command similar to:

    kubectl get secret my-secret -o json | jq -r .data.my_key | base64 --decode

    While this works, wouldn’t it be nice to have a way of doing it without shelling out to other commands, and with a much shorter syntax like:

    kubectl view-secret my-secret my_key

    Well this is just an example of one of the 9 plugins I have included!
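
    To make it clear what both commands are doing under the hood, here is a small Python sketch of the same lookup: values under the data field of a secret are base64 encoded, so reading a key means parsing the JSON, picking the key, and decoding it. The sample secret below is made up for the example and stands in for the output of kubectl get secret my-secret -o json.

```python
import base64
import json


def secret_value(secret_json: str, key: str) -> str:
    """Return the decoded value of one key from a Kubernetes Secret in JSON form."""
    secret = json.loads(secret_json)
    return base64.b64decode(secret["data"][key]).decode("utf-8")


# Hypothetical secret content, shaped like `kubectl get secret -o json` output.
SAMPLE_SECRET = json.dumps({
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "my-secret"},
    "data": {"my_key": base64.b64encode(b"s3cr3t-value").decode("ascii")},
})

if __name__ == "__main__":
    print(secret_value(SAMPLE_SECRET, "my_key"))
```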

    Back to the UX

    Now that the form has been filled out, let’s explore the files this will generate for us.

    As we can see, we have a full layout of our workshop generated for us, with all the configuration needed.

    If we look for example at the file modules.yaml under the workshop folder we can see that this required file, contains basically the navigation and flow of our workshop pointing at the relevant markdown files of each exercise.

    Templating this out simply streamlines the process, and makes writing the content itself your only requirement for building the workshop.

    This same thing goes for the fact that we auto generate the markdown files for each of the exercises you have filled out in the form, making it easy for you to simply go to the workshop/content/exercises directory and to begin adding your content.
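
    To give a feel for what this templating amounts to, here is a rough Python sketch that takes the list of exercises from the form and derives both the markdown file paths and a modules.yaml style listing. This is not the accelerator's actual implementation. The slug rules and the YAML layout (a top level modules key with a name per module) are assumptions for illustration, so check them against the Learning Center docs copied into the generated repo.

```python
import re
from typing import List


def slugify(title: str) -> str:
    """Turn an exercise title into a file-name friendly slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def exercise_files(exercises: List[str]) -> List[str]:
    """One markdown file per exercise under workshop/content/exercises."""
    return [f"workshop/content/exercises/{slugify(t)}.md" for t in exercises]


def render_modules_yaml(exercises: List[str]) -> str:
    """Render a modules.yaml style listing (assumed layout, see the text above)."""
    lines = ["modules:"]
    for title in exercises:
        lines.append(f"  exercises/{slugify(title)}:")
        # Titles containing double quotes would need escaping; ignored here.
        lines.append(f'    name: "{title}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    titles = ["Deploy an App", "Scale the App"]  # hypothetical form input
    print("\n".join(exercise_files(titles)))
    print(render_modules_yaml(titles))
```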

    One other key thing I added is some documentation to help you start building the workshop content.

    The docs for Learning Center are very detailed and, as mentioned, the amount of possible customizations is huge! As such, I decided to copy into the repo the pages from the documentation that I believe are the most useful for a workshop author, so that the docs are accessible to them in an easy and streamlined manner.

    The Final Step

    The final step is to push this content to git and start working on the content.

    As of TAP 1.3 we can now have TAP GUI automatically create a git repository for us and push the generated accelerator based project to it!

    Once we fill out the details of what we want the repo called and under which user or org it should be created, we can generate the accelerator:

    And within just a few seconds:

    We can now clone the repository locally and start working on the content!

    Summary

    This I believe is a great example of how Application Accelerator can be extremely useful for use cases that are not necessarily code or dev related.

    Learning Center is a truly awesome piece of technology and bringing a simple UX for getting started building a workshop can hopefully help people start to use it more and more for their own unique use cases.

    Building labs is a great way of introducing new technologies or teaching new company standards to your users, and is also a great way to run workshops where participants don’t need to preconfigure a bunch of things on their own machines to get hands-on experience!

    If you want to try out the accelerator it can be found in the following git repository and can be simply added to your environment via one simple command:

    tanzu accelerator create lc-workshop \
      --git-repository https://github.com/vrabbi/learning-center-workshop-accelerator.git \
      --git-branch main \
      --display-name "Learning Center Workshop" \
      --icon-url "https://avatars.githubusercontent.com/u/58150231?s=280&v=4"
  • TAP With ECR – Crossplane And Kyverno To The Rescue

    Preface

    Recently when working with TAP in an EKS environment, I started to get really annoyed with ECR.

    While ECR is a perfectly capable OCI Registry, it has a serious drawback when using TAP, which is the need to generate a repository in advance before you can push an image.

    As far as I know, ECR is the only registry that has such a requirement, and as TAP pushes 1 or 2 different images per workload via the supply chain (1 for the image itself and 1 for the deliverable if using the Registry Ops model), the burden of creating ECR repos was becoming a real pain.

    Finding A Solution

    When looking at how to make the situation better, many options were available: I could deploy Harbor in the environment and use that instead, keep suffering with the current approach, or find a solution using tools I know and love from the CNCF landscape.

    In the end I decided to go with the third approach, and settled on using 2 of my favorite tools, Crossplane and Kyverno, to get the job done.

    Solution Pre Requisites

    I won’t go into detail on how to install Crossplane or Kyverno in this post, as the resources out there for these tasks are great. I will instead base this post on the following already being configured in your environment:

    1. TAP is installed
    2. Crossplane is installed
    3. Crossplane AWS provider is installed and configured with appropriate credentials to manage ECR repos
    4. Kyverno is installed

    Once we have these prerequisites, we are ready to build out the solution.

    The Solution

    The first thing we want to do is build a crossplane Composite Resource Definition and a corresponding Composition, that will create the ECR repos for us:

    Composite Resource Definition:

    apiVersion: apiextensions.crossplane.io/v1
    kind: CompositeResourceDefinition
    metadata:
      name: xworkloadecrrepos.tap.vrabbi.cloud
    spec:
      group: tap.vrabbi.cloud
      names:
        kind: XWorkloadECRRepo
        plural: xworkloadecrrepos
      claimNames:
        kind: WorkloadECRRepo
        plural: workloadecrrepos
      versions:
      - name: v1alpha1
        served: true
        referenceable: true
        schema:
          openAPIV3Schema:
            type: object
            properties:
              spec:
                type: object
                properties:
                  parameters:
                    type: object
                    properties:
                      workloadName:
                        type: string
                      repoPrefix:
                        type: string
                      region:
                        type: string
                      providerName:
                        type: string
                    required:
                    - region
                    - repoPrefix
                    - workloadName
                    - providerName
                required:
                - parameters
    

    Composition:

    apiVersion: apiextensions.crossplane.io/v1
    kind: Composition
    metadata:
      name: workloadecrrepo
      labels:
        crossplane.io/xrd: xworkloadecrrepos.tap.vrabbi.cloud
        provider: aws
    spec:
      writeConnectionSecretsToNamespace: crossplane-system
      compositeTypeRef:
        apiVersion: tap.vrabbi.cloud/v1alpha1
        kind: XWorkloadECRRepo
      resources:
      - name: imagerepo
        base:
          apiVersion: ecr.aws.crossplane.io/v1beta1
          kind: Repository
          spec:
            forProvider:
              forceDelete: true
        patches:
        - type: CombineFromComposite
          combine:
            variables:
            - fromFieldPath: "spec.parameters.repoPrefix"
            - fromFieldPath: "spec.parameters.workloadName"
            - fromFieldPath: "spec.claimRef.namespace"
            strategy: string
            string:
              fmt: "%s/%s-%s"
          toFieldPath: "metadata.annotations[crossplane.io/external-name]"
          policy:
            fromFieldPath: Required
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.region"
          toFieldPath: "metadata.labels[region]"
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.providerName"
          toFieldPath: "spec.providerConfigRef.name"
      - name: bundlerepo
        base:
          apiVersion: ecr.aws.crossplane.io/v1beta1
          kind: Repository
          spec:
            forProvider:
              forceDelete: true
        patches:
        - type: CombineFromComposite
          combine:
            variables:
            - fromFieldPath: "spec.parameters.repoPrefix"
            - fromFieldPath: "spec.parameters.workloadName"
            - fromFieldPath: "spec.claimRef.namespace"
            strategy: string
            string:
              fmt: "%s/%s-%s-bundle"
          toFieldPath: "metadata.annotations[crossplane.io/external-name]"
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.region"
          toFieldPath: "spec.forProvider.region"
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.region"
          toFieldPath: "metadata.labels[region]"
        - type: FromCompositeFieldPath
          fromFieldPath: "spec.parameters.providerName"
          toFieldPath: "spec.providerConfigRef.name"
    

    Now that we have those defined, we basically have a new namespaced CRD called “WorkloadECRRepo” that we can use. An example of one would look like:

    apiVersion: tap.vrabbi.cloud/v1alpha1
    kind: WorkloadECRRepo
    metadata:
      name: example-app-repos
      namespace: default
    spec:
      parameters:
        workloadName: example-app
        region: eu-west-2
        repoPrefix: xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads
        providerName: aws-provider
    

    This in turn would create for us 2 ECR repos:

    • xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads/example-app-default
    • xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads/example-app-default-bundle
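
    The two names above come straight from the fmt strings in the CombineFromComposite patches. As a quick sanity check, this small Python sketch applies the same two format strings to the values from the example claim. It only mirrors the string formatting, not anything Crossplane does beyond that.

```python
from typing import Tuple


def repo_names(repo_prefix: str, workload_name: str, namespace: str) -> Tuple[str, str]:
    """Apply the two fmt strings from the Composition to the claim's values."""
    image_repo = "%s/%s-%s" % (repo_prefix, workload_name, namespace)
    bundle_repo = "%s/%s-%s-bundle" % (repo_prefix, workload_name, namespace)
    return image_repo, bundle_repo


if __name__ == "__main__":
    for name in repo_names(
        "xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads",
        "example-app",
        "default",
    ):
        print(name)
```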

    Now that in and of itself makes life much easier, but adding in Kyverno makes it even easier!

    One of the great features we have in Kyverno is the ability to have what’s called a Generate Policy. A Generate Policy basically allows you to define a set of resources that should be created whenever a specific resource is created or updated.

    A simple and helpful example is that when a namespace is created, we want to auto create a default network policy, and also create an image pull secret and maybe a CA cert secret in that namespace to help our developers get started.

    The idea I had was to create a Kyverno Generate Policy that automatically creates an instance of the CRD mentioned above whenever a workload is created.

    This would allow us to automate the ECR repo creation and completely hide the complexity from our end users.

    The first step for allowing this is that Kyverno needs to be given the RBAC rights to create and manage the resources it will be generating for us:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        app.kubernetes.io/instance: kyverno
        app.kubernetes.io/name: kyverno
      name: kyverno:tap-helpers
    rules:
    - apiGroups:
      - tap.vrabbi.cloud
      resources:
      - workloadecrrepos
      verbs:
      - create
    

    Next we need to define the cluster policy that will generate the WorkloadECRRepo resource:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: create-workload-ecr-repos
    spec:
      background: false
      rules:
      - name: create-workload-ecr-repos
        match:
          any:
          - resources:
              kinds:
              - Workload
        generate:
          kind: WorkloadECRRepo
          apiVersion: tap.vrabbi.cloud/v1alpha1
          name: "{{request.object.metadata.name}}-ecr-repos"
          namespace: "{{request.namespace}}"
          synchronize: false
          data:
            metadata:
              ownerReferences:
              - apiVersion: carto.run/v1alpha1
                kind: Workload
                name: "{{request.object.metadata.name}}"
                uid: "{{request.object.metadata.uid}}"
            spec:
              parameters:
                workloadName: "{{request.object.metadata.name}}"
                region: eu-west-2
                repoPrefix: xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads
                providerName: aws-provider
    

    As you can see, the policy defines that the rule is applied when the source resource is of the type Workload, and we stamp out the object using parameters from the incoming workload resource, such as its name and namespace.

    Another key thing you can see I have added is the ownerReferences section. This is a nice trick one should strongly consider using in Kyverno Generate Policies, as it ties the lifecycle of the generated resource to that of the source resource whose creation triggered it. By doing so, when the workload resource is deleted, the ECR repos will be deleted as well.

    This enables us to clean up after ourselves and not leave garbage around in the system.
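
    To visualize what the policy stamps out, here is a small Python sketch that performs the same variable substitution by hand for a given workload. It is only an illustration of the resulting object, not how Kyverno works internally, and the workload in the example (its name, namespace, and uid) is hypothetical.

```python
import json


def generate_repo_claim(workload: dict, region: str,
                        repo_prefix: str, provider_name: str) -> dict:
    """Mimic the generate rule: build a WorkloadECRRepo owned by the workload."""
    meta = workload["metadata"]
    return {
        "apiVersion": "tap.vrabbi.cloud/v1alpha1",
        "kind": "WorkloadECRRepo",
        "metadata": {
            "name": f"{meta['name']}-ecr-repos",
            "namespace": meta["namespace"],
            # The owner reference ties the claim's lifecycle to the workload.
            "ownerReferences": [
                {
                    "apiVersion": "carto.run/v1alpha1",
                    "kind": "Workload",
                    "name": meta["name"],
                    "uid": meta["uid"],
                }
            ],
        },
        "spec": {
            "parameters": {
                "workloadName": meta["name"],
                "region": region,
                "repoPrefix": repo_prefix,
                "providerName": provider_name,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical incoming workload.
    workload = {"metadata": {"name": "example-app", "namespace": "dev",
                             "uid": "00000000-0000-0000-0000-000000000000"}}
    claim = generate_repo_claim(
        workload, "eu-west-2",
        "xxxxxxx.dkr.ecr.eu-west-2.amazonaws.com/tap/workloads", "aws-provider",
    )
    print(json.dumps(claim, indent=2))
```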

    With all of this in place, we can hide the complexity and limitations of ECR from our end users, and have the same UX we get from other registries in TAP, with ECR as well.

    Summary

    While this project was interesting in and of itself, I really think it speaks to the bigger picture when it comes to one of the key benefits of TAP.

    Because TAP is completely based on Kubernetes, and is managed via CRDs and controllers in a very Kubernetes-native manner, the possible integrations with the CNCF landscape are endless, and that truly is a game changer in the PaaS world. As great as platforms like Heroku or CF may be, they don’t have a community as large or active as the CNCF landscape and the Kubernetes ecosystem as a whole that can easily be leveraged to extend and enhance them.

    I really love seeing how like in this scenario, different tools can come together to build a full cohesive solution, and through innovation, and collaboration, we can overcome limitations and inconveniences we encounter in ecosystem tooling for the benefit of our end users.

  • Integrating Trivy scanner in TAP

    Introduction

    One of the really great features in TAP is the pluggable architecture of the scanning tools.

    TAP by default integrates with Grype as the source code and image scanner, however it also has beta support currently for Snyk and for Carbon Black (both limited currently for image scanning only).

    While these are the solutions provided by VMware, the pluggable architecture of the scanning components allows us to easily plug in our own scanner of choice. This could be an open source scanner like Trivy or a proprietary tool like Aqua or Prisma.

    In this blog post, we will discuss how one could build such a custom integration, using the very common scanner Trivy.

    TLDR

    I have built the Trivy integration for both source code and image scanning, and it is published, including the source code, on GitHub at the following URL: https://github.com/vrabbi-tap/tap-scanner-integrations
    Instructions on installing the packages can be found in that repo.

    Overview

    The goal is to create a scantemplate that will provide the exact same UX and features as the provided grype one offers but simply change which scanner we want to use.
    The default scantemplate we will use as our baseline is called private-image-scan-template which is automatically created in your developer namespace defined in the TAP values file.
    The scan template has the following flow in which each step runs in its own dedicated container:

    1. Scan the image
    2. Configure access to the metadata store
    3. Input the scan results to the metadata store
    4. Check compliance against the defined scan policy
    5. Aggregate results and create final output

    In order to integrate our own scanner, the only image we need to build ourselves is the image used for the scanning process itself. We will also make some changes to the command line flags passed to the other containers, but we can use the provided images as is without any issues.

    The way data is passed between the containers is via a shared volume, which is mounted into all containers at the path /workspace.

    The general process we need to follow is:

    1. Build an image that accepts an image URI as an input via an environment variable
    2. The image should run the scan against the inputted image
    3. The image must output the scan results in CycloneDX or SPDX SBOM formats
    4. The image must output a summary YAML with the CVE count in the image split by severity
    5. The SBOM and summary YAML should be saved to files in the shared mounted volume so that they can be used in the following steps of the scanning process.

    Pre Requisites

    1. The first thing we need is a TAP environment with the testing and scanning supply chain installed.
      We will utilize the out of the box scan templates defined for grype later on as a baseline to build from.
    2. A machine with docker installed to build our image

    Let’s get this working

    Creating the scanning script

    We will be basing the logic of our script on the script which is used in the official grype scan template which can be retrieved by running the following commands:

    IMAGE=`kubectl get scantemplate -n default private-image-scan-template -o json | jq -r '.spec.template.initContainers[] | select(.name == "scan-plugin") | .image'`
    docker pull $IMAGE
    id=$(docker create $IMAGE)
    docker cp $id:/image/scan-image.sh ./grype-script.sh
    docker rm -v $id
    

    We will start just like the grype script by accepting the scan directory, scan file path and whether or not to pull the image as variables.

    #!/bin/bash
    set -eu
    SCAN_DIR=$1
    SCAN_FILE=$2
    PULL_IMAGE=""
    
    if [[ $# -gt 2 ]]
    then
        PULL_IMAGE=$3
    fi
    

    The next step is to change directories to the shared volume, where we have write permissions

    pushd $SCAN_DIR &> /dev/null
    

    The next step is very important, we need to specify how to reference the image in the scan command in 2 different cases. The first case which is designated by the variable PULL_IMAGE being an empty string is when the source image is a publicly accessible image, not requiring credentials to pull it. The second case is where credentials are needed (this is the default assumption in TAP). In this second case, we need to pull down the image as a tarball and then tell our scanner to scan the local tarball instead of pulling from the registry.
    There are many ways to pull the image but we will use the same tool as VMware use in the grype image called krane, as it also will be beneficial in a later step.

    if [[ -z $PULL_IMAGE ]]
    then
        ARGS=$IMAGE
    else
        krane pull $IMAGE myimage
        ARGS="--input myimage"
    fi
    

    The next step is to run the scan itself and output the SBOM, with the vulnerability data embedded in it, to a file in a supported format, which in the case of Trivy will be CycloneDX JSON.

    trivy image $ARGS --format cyclonedx --security-checks vuln > $SCAN_FILE
    

    While this does give us a valid CycloneDX SBOM as an output, TAP requires 2 specific fields be set correctly in order for the metadata store to be able to index the data correctly which trivy does not do out of the box. The needed fields are ".metadata.component.name" which should be the image repo URI without a tag or a digest at the end, and ".metadata.component.version" which should be the sha256 value of the image.
    In order to solve this, we will extract that data from the SBOM if it is there, and otherwise we will parse the inputted image URI, and finally we will add these fields to the outputted BOM file.

    NAME=`cat $SCAN_FILE | jq -r '.metadata.component.properties[] | select(.name == "aquasecurity:trivy:RepoDigest") | .value | split("@") | .[0]'`
    DIGEST=`cat $SCAN_FILE | jq -r '.metadata.component.properties[] | select(.name == "aquasecurity:trivy:RepoDigest") | .value | split("@") | .[1]'`
    if [[ -z $NAME ]]; then
      NAME=`echo $IMAGE | awk -F "@" '{print $1}'`
    fi
    if [[ -z $DIGEST ]]; then
      DIGEST=`echo $IMAGE | awk -F "@" '{print $2}'`
    fi
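    if [[ -z $DIGEST ]]; then
      if [[ -z $PULL_IMAGE ]]; then
        # Public image: no local tarball was pulled, so resolve the digest from the registry
        DIGEST=`krane digest $IMAGE`
      else
        # Private image: use the tarball that krane pulled earlier
        DIGEST=`krane digest --tarball myimage`
      fi
    fi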
    cat $SCAN_FILE | jq '.metadata.component.name="'$NAME'"' | jq '.metadata.component.version="'$DIGEST'"' > $SCAN_FILE.tmp && mv $SCAN_FILE.tmp $SCAN_FILE
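
    One thing to watch out for in the fallback path is that splitting the image URI on "@" leaves a tag in place if the image was referenced as repo:tag, while the name field should have neither a tag nor a digest. The following Python sketch shows one way the parsing could be done so that tags are dropped without tripping over a registry port. It is an illustration of the parsing rule only, not part of the scan script, and the image names in the example are made up.

```python
from typing import Optional, Tuple


def split_image_ref(image: str) -> Tuple[str, Optional[str]]:
    """Return (repository, digest) for an image reference, dropping any tag.

    The digest is None when the reference does not include one.
    """
    repo, _, digest = image.partition("@")
    # A tag is a ':' that appears after the last '/', so a registry port
    # such as registry.example.com:5000 is left untouched.
    last_slash = repo.rfind("/")
    last_colon = repo.rfind(":")
    if last_colon > last_slash:
        repo = repo[:last_colon]
    return repo, (digest or None)


if __name__ == "__main__":
    # Hypothetical image references.
    for ref in (
        "registry.example.com:5000/tap/app:1.0@sha256:abc123",
        "registry.example.com:5000/tap/app",
        "registry.example.com/tap/app:latest",
    ):
        print(ref, "->", split_image_ref(ref))
```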
    

    Now we need to create the summary report with the number of CVEs at each of the different criticality levels. TAP has 5 defined levels: critical, high, medium, low, unknown. While this may seem easy, the issue is that in the SBOM we may receive multiple different ratings for a single vulnerability, coming from different sources. In this example, I have decided to go with the highest criticality level found for each CVE.

    critical=0
    high=0
    medium=0
    low=0
    unknown=0
    
    for row in $(cat $SCAN_FILE | jq -r '.vulnerabilities[] | @base64'); do
      VULN=`echo ${row} | base64 --decode`
      if [[ `echo $VULN | jq '.ratings[] | select(.severity == "critical")'` != "" ]]; then
        critical=$((critical+1))
      elif [[ `echo $VULN | jq '.ratings[] | select(.severity == "high")'` != "" ]]; then
        high=$((high+1))
      elif [[ `echo $VULN | jq '.ratings[] | select(.severity == "medium")'` != "" ]]; then
        medium=$((medium+1))
      elif [[ `echo $VULN | jq '.ratings[] | select(.severity == "low")'` != "" ]]; then
        low=$((low+1))
      elif [[ `echo $VULN | jq '.ratings[] | select(.severity == "info")'` != "" ]]; then
        low=$((low+1))
      else
        unknown=$((unknown+1))
      fi
    done
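
    The "highest rating wins" rule can also be looked at in isolation. The sketch below is only an illustration of that rule, not a replacement for the loop above: highest_severity is a hypothetical function that takes the severities reported for one CVE as arguments and prints the level it would be counted under.

    ```shell
    # Hypothetical helper illustrating the "highest rating wins" rule.
    # Usage: highest_severity <severity>...
    highest_severity() (
      # Walk the levels from most to least severe and stop at the first hit.
      for level in critical high medium low info; do
        for sev in "$@"; do
          if [ "$sev" = "$level" ]; then
            # The script above counts "info" ratings as low.
            [ "$level" = "info" ] && level="low"
            printf '%s\n' "$level"
            exit 0
          fi
        done
      done
      # No recognized rating at all.
      printf 'unknown\n'
    )
    ```

    For example, highest_severity medium critical low prints critical, which matches the order of the if/elif chain in the script.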
    

    Now that we have the counters for each level we can create the summary YAML file:

    trivyVersion=`trivy --version --format json | jq -r .Version`
    cat << EOF > $SCAN_DIR/out.yaml
    scan:
      cveCount:
        critical: $critical
        high: $high
        medium: $medium
        low: $low
        unknown: $unknown
      scanner:
        name: Trivy
        vendor: Aqua
        version: $trivyVersion
      reports:
      - /workspace/scan.json
    EOF
    

    And the final step is to print the content of our 2 files, first the SBOM itself and then the summary YAML:

    cat $SCAN_FILE
    cat $SCAN_DIR/out.yaml
    

    The script can now be saved on your machine. In my case, I called it scan-image.sh, the same name VMware uses for the script in the Grype scanner.

    Creating the Dockerfile

    Now that we have our script all configured and ready to be used, we need to build a container image with the script and all needed tools within it.
    While most of the dependencies are very common and can be downloaded as precompiled binaries, krane is not available as a precompiled binary and building it from source would be too much of a pain. To solve this, we will simply copy that file from the Grype scanner image we referenced earlier, as part of the Dockerfile and build process.
    I have decided to use ubuntu as my base image. The dependencies we need beyond krane are jq, wget, curl, trivy, and of course the script we built before.

    chmod 755 scan-image.sh
    IMAGE=`kubectl get scantemplate -n default private-image-scan-template -o json | jq -r '.spec.template.initContainers[] | select(.name == "scan-plugin") | .image'`
    cat <<EOF > Dockerfile
    FROM ubuntu
    RUN apt-get update && apt-get install -y wget curl && rm -rf /var/lib/apt/lists/*
    RUN wget "http://stedolan.github.io/jq/download/linux64/jq" && chmod 755 jq && mv jq /usr/local/bin/jq && curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin && mkdir /workspace
    COPY --from=$IMAGE /usr/local/bin/krane /usr/local/bin/krane
    COPY scan-image.sh /usr/local/bin/
    USER 65534:65533
    EOF
    

    Now that we have our Dockerfile created, we can build the image and tag it with the repo URL we want this saved to, for example:

    docker build . -t harbor.vrabbi.cloud/tap/trivy-scanner:1.0.0
    

    Now we can push the image to our registry:

    docker push harbor.vrabbi.cloud/tap/trivy-scanner:1.0.0
    

    Creating the scan template CR

    The final preparation step is to create our custom scan template CR YAML.
    The first step is to output the out of the box scan template to a file we can edit:

    kubectl get scantemplate -n default private-image-scan-template -o yaml > private-scan-template.yaml
    

    We now need to clean up the YAML a bit, and remove the following fields:

    • .metadata.annotations
    • .metadata.creationTimestamp
    • .metadata.generation
    • .metadata.labels
    • .metadata.namespace
    • .metadata.resourceVersion
    • .metadata.uid

    The only field under metadata we should have left is the metadata.name field.
    Now we need to make a few changes to the spec of the scan template itself.
    The first change is to point the initContainer with the name "scan-plugin" to use the image we just created and pushed to our registry.
    After that we need to change the arguments passed to the container from:

    ./image/scan-image.sh /workspace /workspace/scan.xml true
    

    To:

    scan-image.sh /workspace /workspace/scan.json true
    

    That deals with our scan step. Next we need to make a change in the 3rd initContainer, which has the name "metadata-store-plugin". Here the change is also in the arguments passed to the container, where we need to change the input format it expects as well as the SBOM file name. We need to change the args section from:

        - args:
          - image
          - add
          - --cyclonedxtype
          - xml
          - --path
          - /workspace/scan.xml
    

    To:

        - args:
          - image
          - add
          - --cyclonedxtype
          - json
          - --path
          - /workspace/scan.json
    

    The final change we need to make is in the policy compliance step, which is the final initContainer and is named "compliance-plugin", where we also need to change the file name and the input format type in the args section from:

        - args:
          - check
          - --policy
          - $(POLICY)
          - --scan-results
          - /workspace/scan.xml
          - --parser
          - xml
          - --format
          - yaml
          - --output
          - /workspace/compliance-plugin/out.yaml
    

    To:

        - args:
          - check
          - --policy
          - $(POLICY)
          - --scan-results
          - /workspace/scan.json
          - --parser
          - json
          - --format
          - yaml
          - --output
          - /workspace/compliance-plugin/out.yaml
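
    If you prefer not to make these three argument changes by hand, they can be sketched as a single sed call. This assumes the exported template contains the strings exactly as shown above and that the only list items whose whole value is "xml" are the format arguments. The function name scan_template_to_json is hypothetical, and the image of the "scan-plugin" container and the metadata cleanup still need to be edited separately:

    ```shell
    # Hypothetical helper: print a copy of the scan template with the
    # XML-specific arguments switched to JSON. The input file is not modified.
    scan_template_to_json() {
      sed \
        -e 's#\./image/scan-image\.sh#scan-image.sh#' \
        -e 's#/workspace/scan\.xml#/workspace/scan.json#g' \
        -e 's#^\([[:space:]]*-[[:space:]]*\)xml[[:space:]]*$#\1json#' \
        "$1"
    }
    ```

    Running scan_template_to_json private-scan-template.yaml > private-scan-template-json.yaml and then diffing the two files is a quick way to confirm that only the expected lines changed.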
    

    Replacing the initial Scan Template

    Because the initial scan template is managed by a Carvel package, a change made in place will be undone within a matter of minutes, the next time kapp-controller reconciles our packages.
    To work around this for testing, we can pause the reconciliation of the relevant package.

    kubectl patch pkgi -n tap-install grype --patch '{"spec": {"paused": true}}' --type=merge
    

    Now we can apply the updated scantemplate to the cluster:

    kubectl apply -f private-scan-template.yaml
    

    If you want to revert to the initial scantemplate, you can simply unpause the package, and kapp-controller will restore the original on its next reconciliation:

    kubectl patch pkgi -n tap-install grype --patch '{"spec": {"paused": false}}' --type=merge
    

    Same UX – Different Scanner

    As TAP is based entirely on Kubernetes, which is always the source of truth, one of the great things is that TAP GUI picks up the data about which scanner was used to scan the image, and we get visibility into that in the supply chain plugin inside of TAP GUI.

    Even with the custom scanner in use, we lose none of the capabilities and nice features we get with TAP, like the visibility into CVEs from the supply chain plugin.

    The data is also integrated in exactly the same way in the security analysis plugin which was added in TAP 1.3, and which gives you a clear and concise view of the entire security landscape of your workloads.

    Summary

    While there is a bit of work in building an integration like this, especially in terms of the parsing of the data and making sure you output the data in the correct way, the fact that TAP allows us to do this, and when we do, we still get the same great UX, is pretty amazing.
    As CycloneDX is becoming a widely adopted standard, and scanners like Aqua Enterprise and Prisma Cloud already support it as an output format, integrating them into TAP is almost identical to what we have done here. The huge benefit is that you can truly integrate TAP with your existing tooling in a non-disruptive yet very beneficial way.

  • Finding and Removing Stale CNS Volumes in vSphere

    Finding and Removing Stale CNS Volumes in vSphere

    During a recent reorganization of our vSphere lab environment, I was made aware that we had a very weird situation.
    We use vSAN as our main storage in the lab and we found that it was over 95 percent used, but only 50 percent could be attributed to VMs and system data.

    After looking into this a bit, we found that the vast majority of the storage being used was actually orphaned FCDs (First Class Disks), which are the vSphere objects that map to PVs (Persistent Volumes) in Kubernetes when using the vSphere CSI.

    This actually makes sense, as we spin up Kubernetes clusters for testing on a daily basis, and when you delete a cluster, if you don’t delete the CSI volumes in advance, they will simply stay around indefinitely.

    When trying to assess how best to clean up the environment, it became very clear that this aspect of the vSphere CSI and the CNS Volume implementation in vSphere simply does not have a great flow for analyzing the situation, and there is no way to perform bulk operations.

    When looking at the vSphere UI we can navigate to the relevant Datastore, and under the monitoring tab we will have a section called “Cloud Native Storage” with a “Container Volumes” page.

    These are all the CNS volumes that are on this specific data store. If we click on the info card next to the volume name, we can get additional information about a volume that vSphere has collected.

    The first tab includes basic vSphere related information including IDs, Datastore placement, Overall health, and any other data relating to the applied storage policies and the compliance of the volume with the storage policy.

    The next tab is where we really get interesting information, which includes Kubernetes data about the persistent volume that this volume represents.

    This data includes the PV name, the namespace and name of the related PVC, all labels that are applied on that PVC, the pod or pods it is mounted to as well as one other key data point, which is the Kubernetes cluster that this PV was created in.

    With this information, we should be able to build out a report of our persistent volumes with all the needed information, in order to assess what can and cannot be deleted.

    The first place I looked for such info was vRealize Operations, however CNS Volumes are not even collected by the vCenter adapter, and as such we need to find another solution.

    The next place I looked to get this info in a clear way was RVTools. While RVTools has a tab which includes a list of what it believes to be orphaned VMDKs, and they do include the CNS volumes, the needed Kubernetes metadata is not available, making it a no-go as well.

    With this being the case, I decided to check what could be done with a CLI based solution. For this specific use case, I decided to go with the Go based CLI tool govc, which is fast and really easy to use.

    Setting up GOVC:

    export GOVC_INSECURE=true
    export GOVC_URL=lab-vcsa.vrabbi.cloud
    export GOVC_USERNAME=scott
    export GOVC_PASSWORD=SuperSecretPassword

    The next step is to list the volumes on the vsanDatastore, which can be done using the volume.ls command:

    govc volume.ls -ds=vsanDatastore

    However, this command simply returned the name and ID of the volumes and not the metadata we need.

    Luckily, govc can give us all of this data when using the "-json" flag:

    govc volume.ls -ds=vsanDatastore -json

    As we can see, we truly get a huge amount of data in the call, for every CNS volume on the datastore, making this a great starting point.

    The next part is the need to parse this data and extract only the needed parts which for this use case includes:

    • Cluster Name
    • PV Name
    • Namespace of the PVC
    • Owner (The vSphere user the CSI driver used to create this volume)
    • Size of the volume
    • Volume ID
    • Datastore URL the volume resides on

    Once we find where this data is located within the JSON body we get as a response, we can use the common CLI tool for json manipulation “jq” for extracting the needed data:

    govc volume.ls -ds=vsanDatastore -json | jq -r '.[]' | jq '.[] | {cluster: .Metadata.ContainerCluster.ClusterId, pvc: .Name, namespace: .Metadata.EntityMetadata[0].Namespace, owner: .Metadata.ContainerCluster.VSphereUser, sizeGB: (.BackingObjectDetails.CapacityInMb/1024), datastoreUrl: .DatastoreUrl, id: .VolumeId.Id}'

    This command will have an output similar to the following:

    As we can see, the output is exactly the data we want, however it would be nice if we had this in CSV format which would allow us to open it in excel for example, making the analysis much easier to do.

    We can make a simple addition to the end of this command, again using jq, in order to convert this JSON data into a simple CSV.

    The addition first needs to convert the multiple JSON documents output by the previous command into a single array, which is done via the following:

    jq -s '.'

    The next step is to extract the keys and make them the column headers, place all of the values in the rows beneath them, and finally export this as a CSV:

    jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv'

    With all of this in place, we should receive a CSV formatted list of all the CNS volumes on our vsanDatastore, along with the metadata we need in order to analyze it.

    The last step is to simply output this into a file which can be done very easily with a redirection of the output stream to a file. In the end, the final command to get this data is:

    govc volume.ls -ds=vsanDatastore -json | jq -r '.[]' | jq '.[] | {cluster: .Metadata.ContainerCluster.ClusterId, pvc: .Name, namespace: .Metadata.EntityMetadata[0].Namespace, owner: .Metadata.ContainerCluster.VSphereUser, sizeGB: (.BackingObjectDetails.CapacityInMb/1024), datastoreUrl: .DatastoreUrl, id: .VolumeId.Id}' | jq -s '.' | jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' > vsanDatastore-cns-volumes.csv

    Once we have gone through the list of volumes, and have the list of CNS volumes we want to delete, we can simply create a text file containing the ID of those volumes which can be found in the 3rd column of the outputted CSV.

    Once we have a file that looks like this:

    We can simply run a command that loops through this list and via govc, will delete them using the govc volume.rm command:

    xargs -a cns-vols-to-delete.txt -I{} -d'\n' govc volume.rm {}
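
    Since volume deletion cannot be undone, it can be worth previewing what will run first. The following is a minimal sketch of a dry-run wrapper. The function name delete_cns_volumes and its second argument are hypothetical, and the only govc call it makes is the same govc volume.rm used above. It also skips blank lines and strips carriage returns and CSV quotes, which tend to sneak in when the IDs are copied out of a spreadsheet.

    ```shell
    # Hypothetical helper: delete (or just list) the CNS volumes named in a file.
    # Usage: delete_cns_volumes <file> [dry_run]   (dry_run defaults to "true")
    delete_cns_volumes() (
      file="$1"
      dry_run="${2:-true}"
      # The "|| [ -n ... ]" part also handles a last line without a newline.
      while IFS= read -r id || [ -n "$id" ]; do
        # Strip carriage returns and double quotes left over from CSV exports.
        id="$(printf '%s' "$id" | tr -d '\r"')"
        [ -z "$id" ] && continue
        if [ "$dry_run" = "true" ]; then
          printf 'govc volume.rm %s\n' "$id"
        else
          # Keep govc from consuming the rest of the ID list on stdin.
          govc volume.rm "$id" < /dev/null
        fi
      done < "$file"
    )
    ```

    Running delete_cns_volumes cns-vols-to-delete.txt prints the commands without executing them, and delete_cns_volumes cns-vols-to-delete.txt false performs the actual deletion.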

    Summary

    While the UI is not very useful for these tasks, and neither are the standard monitoring or reporting tools for vSphere, it is good that this information is accessible via automation tools like govc. This allows us to solve issues like this, in the interim at least, until a more streamlined approach is available via the vSphere UI or other official VMware tooling.

    Another hope I have is that down the road, Kubernetes distributions like TKG, OCP, Rancher, EKS-A and others will have a mechanism to delete all of the remaining PVs at cluster deletion time, which would eliminate this issue.

    I hope this was helpful, as for me in the lab it was able to save over 6TB of storage which is huge!

  • Tanzu CLI on Windows made easy

    Tanzu CLI on Windows made easy

    As the Tanzu ecosystem grows, the reliance on Tanzu CLI grows as well. While the CLI itself works very well cross platform, the main issue we consistently encounter is the pain of installing the Tanzu CLI on a Windows machine.

    The process documented by VMware in the TKG and TAP documentation is a long list of manual steps, many of which require administrative permissions, and is simply a really bad user experience.

    In this blog post we will see a small POC I put together that can help solve this issue.

    While this POC is based on the TAP version of the Tanzu CLI, it could easily be tweaked to work with the TKG version as well.

    What options do we have

    When thinking about how to simplify the installation process, my initial thought was to build a WinGet package, which would be easily discoverable in the new WinGet ecosystem, or maybe to build a Chocolatey package.

    While those 2 options are valid in many environments, some enterprises simply don’t allow these package managers, and even today the vast majority of machines don’t have them installed.

    As I wanted to build a solution that would make the process better for as many people as possible, I had to find another option.

    Another simple option would have been to just write a PowerShell script to do the installation. While this is an option and may be the right choice in some cases, I wanted to have a better experience and actually manage the Tanzu CLI as a package, with full lifecycle management from installation, through upgrades, and finally uninstalling the CLI and of course cleaning up when uninstalling.

    This all led me to the realization that the best option for what I was looking for was to build an MSI installer for the Tanzu CLI.

    How to build an MSI

    The first thing I needed to figure out was how to even build an MSI.

    After some research I found 2 projects that seemed very interesting:

    1. Powershell Pro Tools
    2. Wix Toolset

    When evaluating the 2 options, it was very clear that PowerShell Pro Tools was the right choice for a POC project, as it is much easier to get started with. In fact, PowerShell Pro Tools uses the Wix mechanism under the hood and simply adds an abstraction layer on top of it in PowerShell to make our lives much easier.

    While the PowerShell Pro Tools module is great and can cover many needs, I did end up having some issues with it, which led me to also use the Wix toolset itself to patch the generated config for my needs.

    How to build the initial config

    The first step is to install the PowerShell Pro Tools module:

    install-module PowerShellProTools

    Once we have the module installed, we can start building our project.

    You will need to download the Tanzu CLI zip file first as the MSI will need to contain these files.

    Next we will need to generate a GUID which we need to save somewhere we will remember. This GUID is used as an ID of our MSI program which can be used later on when a new version comes out, so that the MSI can perform an upgrade.

    $upgradeCode = ([guid]::NewGuid().ToString())        
    write-host "Tanzu CLI MSI Upgrade Code: $upgradeCode"

    Once we have the Tanzu CLI zip file on our machine we can begin defining some variables in PowerShell:

    # This is the Directory where auto generated files and eventually our MSI will be placed
    $outputDirName = "output"
    # This is the TAP version we are building this package for
    $tapVersion = "1.2.1"
    # This is the name of the zip file we have downloaded already
    $tanzuZipFileName = "tanzu-framework-windows-amd64.zip"
    # This is your company's name
    $company = "TeraSky Israel"
    # This is where you put the upgrade code from above
    $upgradeCode = 'a8473e56-43ec-4665-9132-2ff94ac32b33'
    # The name of the Product which in our case is TAP
    $productName = "TAP"
    # The current directory where all files will be located
    $DIR=(pwd).path

    Now that we have the needed variables set, we need to create a few additional files:

    1. A script which performs the installation of the CLI and its plugins
    2. A script which uninstalls the CLI and its plugins
    3. An icon to be used for the program
    4. A file with VMware’s EULA for the Tanzu CLI

    All of these files, as well as the final script that puts all of this solution together can be found in my GitHub repo .

    Once we have those files locally in our current directory we can begin building our MSI itself.

    A key feature that we are using in this MSI is the ability to run scripts at specific hooks in the installation process. By default the MSI will simply copy files to a location we request, but we have a set of commands we need to run after the Tanzu CLI zip is copied over to the machine, in order to install it as well as its plugins.

    In order to define these actions we have a simple PowerShell command provided via the PowerShell Pro Tools module:

    $installAction = New-InstallerCustomAction -FileId 'InstallPlugins' -CheckReturnValue -RunOnInstall   -arguments '-NoProfile -WindowStyle Normal -InputFormat None -ExecutionPolicy Bypass'
    
    $uninstallAction = New-InstallerCustomAction -FileId 'UninstallPlugins' -RunOnUninstall -arguments '-NoProfile -WindowStyle Normal -InputFormat None -ExecutionPolicy Bypass'

    Now that we have these actions defined in variables, we can run the command to build our MSI:

    New-Installer -Productname $productName -Manufacturer $company -platform x64 -UpgradeCode $upgradeCode -Content {
      New-InstallerDirectory -PredefinedDirectory "ProgramFiles64Folder" -Content {
        New-InstallerDirectory -DirectoryName "tanzu" -Content {
          New-InstallerFile -Source .\$tanzuZipFileName -Id 'bundle'
          New-InstallerFile -Source .\install-plugins.ps1 -Id 'InstallPlugins'
          New-InstallerFile -Source .\uninstall-tanzu-cli.ps1 -Id 'UninstallPlugins'
          New-InstallerFile -Source .\EULA.rtf -Id 'EULA'
        }
      }
    } -OutputDirectory (Join-Path $PSScriptRoot "$outputDirName") -RequiresElevation -version $tapVersion -CustomAction $installAction,$uninstallAction -AddRemoveProgramsIcon $DIR\tanzu-icon.ico
    
    

    This command will generate a few configuration files as well as the initial MSI file, however we are going to delete the MSI file, tweak the configuration files and then regenerate the MSI.

    The first change we need to make is to the way our scripts at installation and uninstall phases are run. As some of the changes we are making require administrative permissions, we need to make sure that the scripts run with elevated permissions.

    This can be achieved via the following command:

    ((Get-Content -path $outputDirName\TAP.$tapVersion.x64.wxs -Raw) -replace '<CustomAction Id=','<CustomAction Impersonate="no" Id=') | Set-Content -Path $outputDirName\TAP.$tapVersion.x64.wxs

    Basically, we need to add 'Impersonate="no"' to the custom action field in our XML config file.

    The next change we need to make is to integrate a UI form which will have our EULA in it which should come up as part of the installation:

    ((Get-Content -Path $outputDirName\TAP.$tapVersion.x64.wxs -Raw) -replace '</Product>','<UIRef Id="WixUI_Minimal" /><UIRef Id="WixUI_ErrorProgressText" /></Product>') | Set-Content -Path $outputDirName\TAP.$tapVersion.x64.wxs

    Once these tweaks have been made, we can now delete the initial MSI and build the new one based on our modified configuration file.

    As mentioned above, the PowerShell Pro Tools module uses the Wix toolset under the hood, which means we don’t need to install anything else in order to build our MSI. The tools are already present, we just need to find them:

    $modulePath= (Get-Module -ListAvailable PowerShellProTools)[0].path
    $modulePath = $modulePath.substring(0,$modulePath.LastIndexOf("\"))
    $binPath = "$modulePath\Wix\bin"

    With the variables set with the path to the needed tools, we can now rebuild our MSI using 2 simple commands:

    & $binPath\candle.exe ".\TAP.$tapVersion.x64.wxs"
    & $binPath\light.exe -dWixUILicenseRtf="$DIR\EULA.rtf" -ext WixUIExtension ".\TAP.$tapVersion.x64.wixobj" -o "TAP.$tapVersion.x64.msi"

    Now that we have rebuilt our MSI, let's see what it looks like:

    As we can see, we have a few configuration files, along with our MSI installer.

    If we try to install the MSI via the UI:

    Once we accept the EULA, we now get the UAC prompt as the installation requires admin permissions:

    Once we accept the UAC prompt, the Tanzu CLI installation will begin, and about a minute later we should receive:

    While the UI installation is nice, because this is an MSI, we can also perform the installation via automation using the msiexec.exe CLI tool for example:

    # Must be run from an elevated command prompt
    msiexec /i TAP.1.2.1.x64.msi ACCEPT=YES /qr+

    Summary

    This was a pretty fun POC, and I think it proves that the experience of installing Tanzu CLI can be much better and more streamlined.

    Hopefully, this is beneficial for those out there looking to simplify the installation of Tanzu CLI, or of other tools that currently have a similar installation experience.

    There are many things that would be needed to make something like this official, and production grade, but as a POC of about 1.5 hours overall, I think the results are pretty cool!

  • Auto Generation of Certs for TAP Workloads

    Auto Generation of Certs for TAP Workloads

    When we set up a TAP environment, one of the key aspects we must take into account is how we will be exposing our workloads outside of the cluster.

    By default, TAP workloads are deployed as Knative services and are exposed via plain HTTP which is not a very secure or production ready solution.

    Another option we get very easily with a few additional lines in our configuration file for TAP, is the ability to provide a secret that has a wildcard certificate in it that will be used for all of our workloads in the cluster.

    While using a wildcard certificate for a platform has become a common practice as we have seen with TAS, OCP and others, wouldn’t it be great if we could have actual workload specific certificates auto generated for us by the platform and not need to use a wildcard?

    In this post we will cover a way to achieve this in which our CA of choice is an Active Directory CA server.

    Why not just use a wildcard

    While the setup and configuration of a wildcard certificate is extremely easy it definitely has some drawbacks we must consider.

    The most TAP centric issue we can encounter with wildcard certificates is that they simply don’t work at scale within TAP without changing how the ingress URLs are generated.

    By default, TAP uses the following naming conventions for ingress URLs:

    <WORKLOAD NAME>.<NAMESPACE>.<DOMAIN>
    

    While that may seem appealing, it does not work well with wildcards. A wildcard can only stand in for a single segment of a domain name (the first one), so there is no way to create a wildcard certificate that covers an arbitrary number of subdomain levels.

    This means that in order to use the default naming convention, a wildcard for "*.<DOMAIN>" is not enough. We would also need to add Subject Alternative Names in the format "*.<NAMESPACE>.<DOMAIN>" for every namespace in which we want to deploy workloads.

    Knowing all of the needed namespaces is not feasible upfront which means we need a solution that is more dynamic.
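
    To make the single-segment rule concrete, here is a small sketch of the matching logic in plain shell. It is only an illustration of the rule described above, not how any TLS library implements it, and the function name wildcard_matches as well as the example hostnames are hypothetical:

    ```shell
    # Hypothetical helper: succeed if <host> is covered by <pattern>, where a
    # leading "*." in the pattern may replace exactly one DNS label.
    # Usage: wildcard_matches <pattern> <host>
    wildcard_matches() (
      pattern="$1"
      host="$2"
      case "$pattern" in
        '*.'*) suffix="${pattern#\*}" ;;      # e.g. ".example.com"
        *) [ "$pattern" = "$host" ]; exit ;;  # no wildcard: exact match only
      esac
      case "$host" in
        *"$suffix") label="${host%"$suffix"}" ;;
        *) exit 1 ;;
      esac
      # The wildcard must cover exactly one non-empty label, so no dots.
      case "$label" in
        ''|*.*) exit 1 ;;
      esac
      exit 0
    )
    ```

    With a workload named app in the namespace dev, wildcard_matches '*.example.com' 'app.dev.example.com' fails, while wildcard_matches '*.dev.example.com' 'app.dev.example.com' succeeds, which is why a separate SAN would be needed per namespace.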

    Solution #1 – Change the Domain Template

    This is probably the easiest solution. We simply add a few more lines to our TAP configuration to change the convention used to generate the ingress hostname from:

    <WORKLOAD NAME>.<NAMESPACE>.<DOMAIN>
    

    To:

    <WORKLOAD NAME>-<NAMESPACE>.<DOMAIN>
    

    By doing this, we put everything that is dependent on the workload into a single segment of the FQDN, and then a wildcard certificate will work.

    To do this the additional lines we would need to add to our TAP values files for any cluster created with the full, run, or iterate profiles would be:

    cnrs:
      domain_template: '{{.Name}}-{{.Namespace}}.{{.Domain}}'
    

    While the above solution is an easy way to solve the issue of the wildcard certificate mentioned above, there are 2 other issues with wildcards we need to take into consideration.

    1. From a security perspective, the biggest concern with wildcard certificates is that when one server or sub-domain covered by the wildcard is compromised, all sub-domains may be compromised. In other words, the upfront simplicity of the wildcard can create significant problems should things go wrong.
    2. From a maintenance perspective we also have the need to remember on a typically yearly basis to replace the wildcard certificate in our clusters. If we were to forget to change the certificate in the cluster in time, ALL of our workloads would have TLS issues at the same time which could cause severe impact on your business.

    As we can see, while using a wildcard may suffice for some use cases and can be an easy way to get started, there has to be a better way.

    Generating Certificates At Runtime With Cert-Manager

    One of the components included in TAP is Cert-Manager.
    Cert-Manager is an industry standard Kubernetes operator that can manage the entire lifecycle of certificates in the context of a Kubernetes environment.

    When we work with public domains, we can use the integration for example with LetsEncrypt or really any ACME server, and generate our certificates in an easy and automated manner.

    One of the nice things with using Cert-Manager is that Knative which is the default deployment mechanism for our workloads in TAP, has an OOTB integration with Cert-Manager, in which it can auto generate the certificates when a new Knative Service is deployed!

    This allows us to not need to worry about certificate creation, and let the platform deal with it automatically!

    The way that this works is that you create a ClusterIssuer CR in your cluster, which is a custom resource that is provided by Cert-Manager, that is utilized for issuing the certificates we request of it.

    The Issue Of On-Prem Environments

    While the idea of auto generating trusted certificates sounds great, it has some challenges when working in a typical On-Prem environment.

    Typically we see that Microsoft’s Active Directory CA solution, is the most commonly used CA when dealing with this type of environment, and unfortunately, Cert-Manager does not have an integration with this CA.

    While we could build such an integration (It has been done in the past), this would require a lot of work and maintenance that simply is not an ideal situation or even possible for many organizations to undertake.

    We could also decide to use the self-signed issuer type in Cert-Manager, and simply allow Cert-Manager to create self-signed certificates for each of our workloads.

    While self signed certs may work for demo environments or even development environments, they really are not a solution for a production grade platform because everyone will receive certificate warnings any time they try to access the application.

    The Solution – Using an Intermediate CA

    When dealing with certificates, we have the concept of an intermediate or subordinate CA.

    An intermediate CA, is a certificate that has been signed by the root CA, and has been given the "permissions" to issue certificates on behalf of the root CA.

    Once we have an intermediate CA's full chain in PEM format, as well as the private key for the intermediate CA in PEM format, we can use the Cert-Manager CA ClusterIssuer type and have Cert-Manager generate certificates that are signed by the dedicated intermediate CA we have provided it.

    How to set this up

    The first step is to install TAP as we always would without this solution. We also do not need to change the default naming template for our services, as certificates can be generated for the default naming convention as well!

    Once we have deployed TAP, the next step is to configure Cert-Manager and create the ClusterIssuer we will be using.

    In order to do this, you will need the certificate chain (cert.cer) and the private key (cert.key), both in PEM format, saved as files on your working machine. We can then create a secret from these files:

    kubectl create secret generic tap-intermediate-ca -n cert-manager \
      --from-file=tls.crt=cert.cer --from-file=tls.key=cert.key
    

    Now that we have the secret created with our intermediate CA data, we can create the ClusterIssuer:

    cat << EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: ca-issuer
    spec:
      ca:
        secretName: tap-intermediate-ca
    EOF
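
    Before moving on, it is worth confirming that Cert-Manager has accepted the issuer. The following is a minimal sketch, assuming kubectl is pointed at the cluster; the helper name issuer_ready is hypothetical, and it simply reads the Ready condition that Cert-Manager sets on the ClusterIssuer status.

```shell
#!/usr/bin/env bash
# Succeeds when the named ClusterIssuer reports Ready=True.
# issuer_ready is a hypothetical helper name, not a Cert-Manager command.
issuer_ready() {
  local issuer="${1:-ca-issuer}" status
  status="$(kubectl get clusterissuer "$issuer" \
    -o 'jsonpath={.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```

    Running `issuer_ready ca-issuer && echo "issuer is ready"` should print the message once the secret has been read successfully. If it does not, `kubectl describe clusterissuer ca-issuer` usually shows the reason in the status conditions.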
    

    Now that everything is set up and ready to be configured in TAP, we will create one final secret containing a YTT overlay that configures Knative to use the newly created ClusterIssuer and automatically generate TLS certificates for our workloads:

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: cnrs-tls-overlay
      namespace: tap-install
    type: Opaque
    stringData:
      tls-overlay.yaml: |
        #@ load("@ytt:overlay", "overlay")
        #@ load("@ytt:data", "data")
        
        ---
        #@overlay/match by=overlay.subset({"metadata":{"name":"config-certmanager"}})
        ---
        data:
          #@overlay/remove missing_ok=True
          _example:
          #@overlay/match missing_ok=True
          issuerRef: |
            kind: ClusterIssuer
            name: ca-issuer
        
        #@overlay/match by=overlay.subset({"metadata":{"name":"config-network"}})
        ---
        data:
          #@overlay/remove missing_ok=True
          _example:
          #@overlay/match missing_ok=True
          autoTLS: "Enabled"
          #@overlay/match missing_ok=True
          httpProtocol: "Redirected"
          #@overlay/match missing_ok=True
          default-tls-secret: "kube-system/wildcard"
          #@overlay/match missing_ok=True
          domainTemplate: "{{.Name}}.{{.Namespace}}.{{.Domain}}"
        
        #@ def kapp_config():
        apiVersion: kapp.k14s.io/v1alpha1
        kind: Config
        #@ end
        
        #@overlay/match by=overlay.subset(kapp_config())
        ---
        rebaseRules:
        #@overlay/append
        - path: [data]
          type: copy
          sources: [new, existing]
          resourceMatchers:
          - kindNamespaceNameMatcher: {kind: ConfigMap, namespace: knative-serving, name: config-certmanager}
          - kindNamespaceNameMatcher: {kind: ConfigMap, namespace: knative-serving, name: config-network}
    EOF
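
    To make the domainTemplate in the overlay concrete, here is a tiny illustration of the hostname it produces, which is also the name the generated certificate is issued for. The function name and the sample values (hello, dev, tap.example.com) are assumptions for the example only; Knative renders the real template itself.

```shell
#!/usr/bin/env bash
# Mimics the template {{.Name}}.{{.Namespace}}.{{.Domain}} for a single service.
# render_default_domain is a hypothetical helper used only for illustration.
render_default_domain() {
  local name="$1" namespace="$2" domain="$3"
  printf '%s.%s.%s\n' "$name" "$namespace" "$domain"
}

render_default_domain hello dev tap.example.com
```

    For a workload named hello in the dev namespace, with tap.example.com as the shared domain, this prints hello.dev.tap.example.com.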
    

    Now we can tell TAP to apply this overlay with a small addition to our TAP values file:

    package_overlays:
    - name: cnrs
      secrets:
      - name: cnrs-tls-overlay
    

    The final step is to simply apply the changes to TAP using the Tanzu CLI:

    tanzu package installed update tap -n tap-install -f <YOUR TAP VALUES FILE>
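
    Once the update has reconciled, a quick way to confirm that the overlay landed is to read the values back from the Knative ConfigMaps. This is a sketch under the assumption that kubectl has access to the cluster; the helper names knative_config_value and overlay_applied are hypothetical, while the ConfigMap names and keys are the ones set in the overlay above.

```shell
#!/usr/bin/env bash
# Prints one key from a ConfigMap in the knative-serving namespace.
knative_config_value() {
  local configmap="$1" key="$2"
  kubectl get configmap "$configmap" -n knative-serving \
    -o "jsonpath={.data.${key}}"
}

# Succeeds only when the auto-TLS settings from the overlay are present.
overlay_applied() {
  [ "$(knative_config_value config-network autoTLS)" = "Enabled" ] &&
    [ "$(knative_config_value config-network httpProtocol)" = "Redirected" ]
}
```

    Calling `overlay_applied && echo "auto-TLS overlay is active"` gives a simple yes or no answer, and `knative_config_value config-certmanager issuerRef` should print the ClusterIssuer reference.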
    

    Summary

    While the setup currently takes a bit of extra work, I strongly believe that this is a more secure and more flexible solution.

    This mechanism can work with any CA, and it is a very simple way to allow for more complex or unique naming conventions for your Knative service ingress URLs in a secure and managed way.

    Another key benefit is that an intermediate CA is typically made valid for 5 years, whereas a standard certificate is valid for only 1 year, if not less. Because Cert-Manager is the one managing the certificates, it also manages their lifecycle and will automatically renew and rotate them before they expire, keeping your mind clear and freeing you from replacing certificates on a very frequent basis.

    The one thing we have not covered, but which is a very good idea if you go down this path, is to use External DNS, which can automatically create and manage DNS records for all of your workloads' URLs, freeing you from the need to create wildcard DNS records as well!

    Using industry standards like Cert-Manager and External DNS to enhance the TAP experience makes for a truly great setup that offers a secure, flexible and easy to maintain platform!
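
    Even with automatic renewal it can be reassuring to check how much lifetime a certificate has left, for example the intermediate CA itself, which Cert-Manager does not renew for you. The helper below is a minimal sketch assuming openssl is installed; expires_within_days is a hypothetical name, and it relies on the -checkend option of openssl x509.

```shell
#!/usr/bin/env bash
# Succeeds when the PEM certificate in file $1 expires within the next $2 days.
# expires_within_days is a hypothetical helper, not part of Cert-Manager or TAP.
expires_within_days() {
  local cert_file="$1" days="$2"
  [ -r "$cert_file" ] || { echo "cannot read $cert_file" >&2; return 2; }
  # -checkend N exits 0 when the certificate is still valid N seconds from now.
  if openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" >/dev/null; then
    return 1
  fi
  return 0
}
```

    For example, `expires_within_days cert.cer 90 && echo "time to plan the intermediate CA renewal"` checks the first certificate in the chain file created earlier.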