Making TKGm feel like EKS

Recently I was having a discussion with a customer regarding the management of Kubernetes environments in multi-cloud scenarios.

They are a very large EKS shop, and are now also planning on having a big Kubernetes footprint on premises.

One of the things that came up was the desire to make the experience of deploying an app to either environment as similar as possible, so that in an ideal world, the target location would not matter at all to the application teams.

Another key thing that came up was the need for applications running in the on-premises datacenters to connect to AWS services like RDS, S3, etc.

While we were discussing this, I was trying to imagine what such a design would look like, and I knew that the key factor in this case, beyond solving the issue itself, was to require as little change to the existing environment as possible, given the size of their EKS footprint.

This is when I started to look into whether, and how, I could provide an experience as close to EKS as possible in an on-premises environment.

The Key Component

The key component I tried to tackle was IRSA, which stands for “IAM Roles For Service Accounts”.

This is an amazing AWS feature, deeply integrated into EKS, with which I can map a service account in a Kubernetes cluster to a specific IAM role, thereby enabling it to access AWS resources without the need for static credentials saved in secrets or similar solutions.
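
For context, this is what consuming IRSA looks like on the application side: a service account annotated with the ARN of the IAM role it should assume (the names and account ID below are made up):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: default
  annotations:
    # The pod identity webhook sees this annotation and injects the
    # AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables,
    # plus a projected token volume, into pods using this service account.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/s3-reader-role
```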

This is a key feature of EKS, this customer was relying on it heavily within their environment, and I knew this would be an interesting task to try and solve.

A Bit Of History Around IRSA

In 2014, AWS Identity and Access Management added support for federated identities using OpenID Connect (OIDC). This feature allows you to authenticate AWS API calls with supported identity providers and receive a valid OIDC JSON web token (JWT). You can pass this token to the AWS STS AssumeRoleWithWebIdentity API operation and receive IAM temporary role credentials. You can use these credentials to interact with any AWS service, including Amazon S3 and DynamoDB.

Kubernetes has long used service accounts as its own internal identity system. Pods can authenticate with the Kubernetes API server using an auto-mounted token (which was a non-OIDC JWT) that only the Kubernetes API server could validate. These legacy service account tokens don’t expire, and rotating the signing key is a difficult process. In Kubernetes version 1.12, support was added for a new ProjectedServiceAccountToken feature. This feature is an OIDC JSON web token that also contains the service account identity and supports a configurable audience.
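
To illustrate the building block (this example is mine, with made-up names, not anything EKS-specific), a pod can request such a token with a specific audience via a projected volume:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-demo
spec:
  containers:
  - name: app
    image: amazon/aws-cli
    command: ["sleep", "3600"]
    volumeMounts:
    - name: aws-token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: aws-token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          audience: sts.amazonaws.com   # the configurable "aud" claim
          expirationSeconds: 3600       # these tokens expire, unlike legacy ones
```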

Amazon EKS hosts a public OIDC discovery endpoint for each cluster that contains the signing keys for the ProjectedServiceAccountToken JSON web tokens so external systems, such as IAM, can validate and accept the OIDC tokens that are issued by Kubernetes.

How do we make this work on premises

After researching the subject and trying to find the correct approach, I found an architecture I thought would work and decided to try it out.

The first thing we need to do is create an OIDC discovery endpoint that is accessible from AWS services.

This can easily be done by creating an S3 bucket and adding a discovery JSON file at the correct path.
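
For reference, the discovery document, served at /.well-known/openid-configuration under the issuer URL, looks roughly like this (the bucket URL is a placeholder):

```json
{
  "issuer": "https://my-cluster-irsa-oidc.s3.us-east-1.amazonaws.com",
  "jwks_uri": "https://my-cluster-irsa-oidc.s3.us-east-1.amazonaws.com/keys.json",
  "authorization_endpoint": "urn:kubernetes:programmatic_authorization",
  "response_types_supported": ["id_token"],
  "subject_types_supported": ["public"],
  "id_token_signing_alg_values_supported": ["RS256"],
  "claims_supported": ["sub", "iss"]
}
```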

The first challenge was that the OIDC discovery endpoint needs to contain the signing keys for our service accounts’ projected tokens.

At this point I stumbled upon a document on GitHub which covers the steps in detail.

While the document seemed very interesting, it doesn’t work well with the flow I was trying to achieve: TKGm manages its own certificates, whereas the guide assumes that you are basically doing Kubernetes the hard way, manually generating the certificates and configuring the API server ahead of deployment time.

In general, the approach it describes is as follows:

  1. Create an S3 bucket
  2. Generate an RSA signing key pair
  3. Generate an OIDC discovery JSON file
  4. Use a Go program to generate the keys.json file
  5. Push the files to the S3 bucket
  6. Set up your API server with flags pointing at the S3 bucket URI and at the certificates
  7. Create an IAM IDP and connect it to the S3 bucket
  8. Deploy Cert Manager (not documented well, but needed)
  9. Deploy the Pod Identity Webhook

Because TKG will auto-generate our certificates for us, and I don’t want to do that out of band, and because I also want to do this via Terraform, I have changed the order and the steps a bit in order to fit the TKGm flow better:

  1. Create an S3 bucket
  2. Generate an OIDC discovery JSON file
  3. Push the discovery JSON file to the S3 bucket
  4. Create an IAM IDP and connect it to the S3 bucket
  5. Create a YTT overlay to add the 2 needed flags to the API server for TKG clusters
  6. Create the TKGm cluster
  7. Deploy Cert Manager
  8. Retrieve the certificate from the cluster
  9. Generate the keys.json file using Terraform
  10. Push the keys.json file to the S3 bucket
  11. Deploy the Pod Identity Webhook

What does the flow look like

As mentioned above, I wanted to do this via Terraform, so I have built an example Terraform module that can set the entire configuration up for us in a multi-step process:

To deploy a cluster with IRSA, we first need to apply the configuration that is needed before the cluster is created. This will create the IAM IDP, the S3 bucket, the OIDC discovery JSON file, and any requested IAM roles for IRSA usage.
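
A minimal sketch of those pre-cluster resources (the resource and variable names here are illustrative, not the actual module’s):

```hcl
# Bucket that hosts the OIDC discovery document and, later, keys.json.
# The discovery objects must be publicly readable so IAM can fetch them.
resource "aws_s3_bucket" "oidc" {
  bucket = "${var.cluster_name}-irsa-oidc"
}

resource "aws_s3_object" "discovery" {
  bucket       = aws_s3_bucket.oidc.id
  key          = ".well-known/openid-configuration"
  content_type = "application/json"
  content = templatefile("${path.module}/discovery.json.tpl", {
    issuer = "https://${aws_s3_bucket.oidc.bucket_regional_domain_name}"
  })
}

# The IAM OIDC identity provider pointing at the bucket-hosted issuer.
resource "aws_iam_openid_connect_provider" "irsa" {
  url             = "https://${aws_s3_bucket.oidc.bucket_regional_domain_name}"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [var.s3_ca_thumbprint]
}
```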

Once we do that, the Terraform module will output the manual steps we need to take in order to proceed.

The first step is to create a YTT overlay file which will add the 2 flags to the API server pods for new clusters.

The first flag we are adding is the “api-audiences” flag, which is needed in order to set the correct aud value in the tokens generated for service accounts. When using IRSA, the value should be “sts.amazonaws.com” in all cases, and therefore we have hardcoded the value.

The second flag is the “service-account-issuer” flag, which needs to be set to the URL of the S3 bucket that hosts our OIDC discovery endpoint. In this case we have templated out the bucket name, which is built in our Terraform code to include the cluster name; by doing it this way, we ensure that the YTT overlay works for all new clusters. In TKGm clusters this flag is typically set automatically by kubeadm to https://kubernetes.default.svc.cluster.local, the internal DNS name of the Kubernetes API server, and we need to change it so that the issuer field in the OIDC tokens for service accounts points to the correct issuer.
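
A sketch of such an overlay, assuming TKGm 1.x where custom overlays go under ~/.config/tanzu/tkg/providers/ytt/03_customizations/ (the issuer URL below is a hardcoded placeholder where the real overlay templates in the bucket name):

```yaml
#@ load("@ytt:overlay", "overlay")

#@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
---
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          #@overlay/match missing_ok=True
          api-audiences: sts.amazonaws.com
          #@overlay/match missing_ok=True
          service-account-issuer: https://my-cluster-irsa-oidc.s3.us-east-1.amazonaws.com
```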

Once we have set that overlay, the next step is to create our TKGm cluster using the Tanzu CLI as we typically would.

Once the cluster is created, we then need to get the admin kubeconfig of the cluster, which again can be done as we typically would.
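
For example, with illustrative cluster and file names, those two steps look like this:

```sh
# Create the workload cluster from a standard TKGm cluster configuration file
tanzu cluster create my-cluster --file my-cluster-config.yaml

# Retrieve the admin kubeconfig and switch to its context
tanzu cluster kubeconfig get my-cluster --admin
kubectl config use-context my-cluster-admin@my-cluster
```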

Now the last manual step we need to take is to install cert-manager, which can be done using the Tanzu package CLI:
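
For example (the package version here is illustrative; pick one available in your TKG release):

```sh
# List the cert-manager versions available in your package repository
tanzu package available list cert-manager.tanzu.vmware.com -A

# Install cert-manager into the workload cluster
tanzu package install cert-manager \
  --package-name cert-manager.tanzu.vmware.com \
  --version 1.7.2+vmware.1-tkg.1 \
  --namespace tkg-system
```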

Now that we have Cert Manager deployed, we can run the Terraform module again after making one change to our tfvars file: changing the “cluster_created” variable from false to true.

This change signals to the Terraform module to do the following:

  1. Retrieve the service account signing key pair’s public key from the management cluster, where it is stored in a Kubernetes secret
  2. Retrieve a service account token from the workload cluster from a secret
  3. Extract the needed data from both of these sources to build the JWKS JSON that is needed for the keys.json file (a sketch of its shape is shown below)
  4. Generate the keys.json file and push it to the S3 bucket
  5. Deploy the Pod Identity Webhook components to the workload cluster
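
The keys.json file itself is a standard JWKS document; its shape is roughly the following (all values below are placeholders):

```json
{
  "keys": [
    {
      "use": "sig",
      "kty": "RSA",
      "kid": "f1bd5c7e0f6bf4f2a5b8c9d0e1a2b3c4d5e6f7a8",
      "alg": "RS256",
      "n": "base64url-encoded-RSA-modulus-of-the-signing-public-key",
      "e": "AQAB"
    }
  ]
}
```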

With this we are now done and can use IRSA-based authentication in our clusters, just like we would with an EKS cluster.
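
As a quick smoke test (using the hypothetical s3-reader service account from the earlier example), we can run a one-off pod and ask STS who we are:

```sh
# With IRSA working, this prints the assumed-role ARN of the mapped IAM
# role instead of an authentication error.
kubectl run aws-cli-test --rm -it --image=amazon/aws-cli \
  --overrides='{"spec":{"serviceAccountName":"s3-reader"}}' \
  -- sts get-caller-identity
```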

What are the general use cases

While I mentioned above the use case that brought me to look into this matter, which was around bridging the gap between the on-prem and EKS experiences to make the coexistence of the two more seamless, there are a few more use cases, in many cases even more prevalent, that this could help with:

  1. Velero – we can use IRSA instead of saving access keys and secret keys in a Kubernetes secret in order to use an S3 bucket as our backup target. This has security benefits over static, plain-text credentials saved in a secret in the cluster, making it a good choice where security is a concern.
  2. Kpack – we can use IRSA so that Kpack can push images to an ECR registry without needing to refresh our credentials every 12 hours as is required by ECR.
  3. Kaniko – like with Kpack, we can use IRSA to support pushing images to an ECR registry.
  4. FluentBit – we can use IRSA with FluentBit to ship logs to things like OpenSearch or CloudWatch seamlessly.
  5. Crossplane – this is a huge benefit, as Crossplane with the AWS provider allows us to manage nearly all AWS services directly as CRDs in our Kubernetes clusters. Using IRSA for Crossplane makes it extremely easy to deploy and manage AWS workloads from our on-prem environment in a secure and seamless manner.
  6. AWS Controllers For Kubernetes (ACK) – like Crossplane but specific to AWS, we can use ACK with IRSA in order to support IaC use cases against AWS in a secure manner.
  7. Thanos – a great solution for long-term storage of Prometheus metrics, which stores the metrics in an object store of your choosing such as AWS S3. We can have Thanos use IRSA easily in order to keep our Prometheus metrics for as long as we need in a really simple process.
  8. External DNS – we can utilize IRSA together with External DNS in order to have automated DNS record management in Route53 for ingress resources we create in our cluster!
  9. Cert Manager – we can utilize IRSA with the AWS Private CA plugin for Cert Manager to auto-generate certificates for us and manage their lifecycle using an AWS ACM Private CA!
  10. External Secrets – we can use IRSA with the External Secrets Operator in order to pull secrets from AWS Secrets Manager or AWS Parameter Store in a secure and simple manner.

And that is just a partial list of off-the-shelf solutions, used very often in many environments, where IRSA could be of great help.

Where is the code

To try this out, you can find the Terraform code in the dedicated GitHub repository.

Summary

This was a fun experiment, and I think it has some true benefits for many use cases. I hope to streamline the process further in the future, once TKG cluster provisioning moves to the ClusterClass mechanism in TKG 2, which will hopefully allow for a single-step, end-to-end deployment mechanism for this.

