Home – vRabbi's Blog

ECR Is Finally Usable!

AWS Finally Listens: After 5+ Years, ECR Gets Two Game-Changing Features

The Long Wait Is Over

In late 2025 and early 2026, Amazon Web Services released two of the most requested features in ECR’s history:

Automatic repository creation on image push (December 19th, 2025)
Cross-repository layer sharing via blob mounting (January 20th, 2026)

For those of us who have been managing container workflows on AWS, these announcements are nothing short of a miracle. These two features alone transform ECR from “that registry we avoid” to “a genuinely competitive option.”

For over five years, developers have been asking, begging, and practically pleading for these features. The infamous GitHub issues accumulated:

Feature	Issue	Opened	Upvotes	Wait Time
Create on Push	#853	April 2020	700+	5+ years
Layer Sharing	#531	October 2019	300+	6+ years

Many had lost hope. But AWS finally delivered—on both fronts.

Part 1: Automatic Repository Creation on Push

The Problem

If you’ve ever worked with container registries, you know that the standard workflow is simple:

Build your image
Tag it with your registry URL
Push it

That’s it. Docker Hub, Google Container Registry (GCR), Azure Container Registry (ACR), Quay, Harbor, Artifactory, Nexus—every single major container registry works this way. If the repository doesn’t exist, it gets created automatically.

Except ECR.

With Amazon ECR, before you could push an image, you had to:

Pre-create the repository using the AWS Console, CLI, CloudFormation, or Terraform
Configure the repository settings (encryption, lifecycle policies, permissions)
Then push your image

This might seem like a minor inconvenience, but it created massive headaches in real-world scenarios:

The Developer Experience Nightmare

Consider a development team practicing continuous integration. Every time a developer creates a new microservice, they need to:

Request a new ECR repository (or have permissions to create one)
Wait for the infrastructure team to provision it (if they don’t have permissions)
Update their CI/CD pipeline to reference the new repository
Finally push their image

Compare this to GCR where you just… push. The repository appears. Done.

CI/CD Pipeline Complexity

CI/CD pipelines became unnecessarily complex. You couldn’t just have a simple docker push command. You needed error handling, repository existence checks, and creation logic wrapped around every push:

# The ECR workaround everyone hated
aws ecr describe-repositories --repository-names "${REPO_NAME}" 2>/dev/null || \
    aws ecr create-repository --repository-name "${REPO_NAME}"
docker push "${ECR_URL}/${REPO_NAME}:${TAG}"

This added complexity, potential race conditions, and additional IAM permissions requirements.

Platform Integration Challenges

Modern platforms like Kubernetes, Tanzu Application Platform (RIP), and others often need to push images dynamically. I wrote about this exact challenge back in 2022 in my blog post TAP with ECR – Crossplane and Kyverno to the Rescue, where I detailed the elaborate workarounds we needed just to make TAP work with ECR.

The solution involved:

Crossplane to dynamically provision ECR repositories as Kubernetes resources
Kyverno policies to automatically trigger repository creation based on workload definitions
Complex RBAC and IAM role mappings

It worked, but it was a lot of infrastructure complexity just to achieve what should have been a built-in feature.

The Community Frustration

The comments on the GitHub issue tell the story:

“This is the largest blocker for us in fully transitioning to ECR. Unbelievable it doesn’t exist yet…” — @MichaelErmer, July 2022

“We really need this feature as well, otherwise we simply can’t use ECR.” — @vrabbi (yes, me!), October 2022

“This feature works really well in Artifactory and Nexus… if you are coming from one of these other products where this feature is baked into docker push.” — @sputmayer, December 2021

“Just so you know, this works fine in Google Cloud Container Registry.” — @simonlsk, May 2023

The Solution: Repository Creation Templates with Create on Push

AWS has solved this problem elegantly with Repository Creation Templates that now support a Create on Push applied-for type.

How It Works

The feature works in three key steps:

Create a Repository Creation Template: Define the settings you want applied to any new repositories (encryption, lifecycle policies, permissions, tags, etc.)
Specify “Create on Push” as an Applied-For Type: This tells ECR to use this template when someone pushes to a repository that doesn’t exist
Push Your Images: When you push to a non-existent repository, ECR automatically creates it with your template settings applied

Here’s the workflow visualized:

Developer pushes image to: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp/frontend:v1.0
                                    |
                                    v
                    Repository "myapp/frontend" doesn't exist
                                    |
                                    v
              ECR checks for matching repository creation template
                                    |
                                    v
        Template with prefix "myapp" found with CREATE_ON_PUSH enabled
                                    |
                                    v
        Repository created with template settings (encryption, policies, etc.)
                                    |
                                    v
                          Image pushed successfully!
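
The flow above can be modeled in a few lines of Python. This is a toy in-memory sketch and not how ECR is implemented: the `Template` and `Registry` classes and their fields are hypothetical names, and template lookup is reduced to "first template whose prefix matches".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for a repository creation template."""
    prefix: str                     # e.g. "myapp", or "ROOT" for the catch-all
    applied_for: set[str]           # e.g. {"CREATE_ON_PUSH"}
    settings: dict = field(default_factory=dict)


@dataclass
class Registry:
    """Toy in-memory registry: repository name -> settings applied at creation."""
    templates: list[Template]
    repositories: dict[str, dict] = field(default_factory=dict)

    def push(self, repo: str, tag: str) -> str:
        if repo not in self.repositories:
            # Simplification: take the first template whose prefix matches.
            template = next(
                (t for t in self.templates
                 if t.prefix == "ROOT" or repo.startswith(t.prefix + "/")),
                None,
            )
            if template is None or "CREATE_ON_PUSH" not in template.applied_for:
                return f"rejected: repository {repo} does not exist"
            # Create the repository with the template's settings, then accept the push.
            self.repositories[repo] = dict(template.settings)
        return f"pushed {repo}:{tag}"


registry = Registry(templates=[
    Template("myapp", {"CREATE_ON_PUSH"}, {"imageTagMutability": "MUTABLE"}),
])
print(registry.push("myapp/frontend", "v1.0"))  # repository is created on the fly
print(registry.push("other/service", "v1.0"))   # no matching template, so rejected
```

The second push fails because no template covers the `other` prefix, which mirrors the pre-feature behavior for any repository that has no template.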

Template Configuration Options

Repository creation templates give you control over:

Setting	Description
Prefix	Repository namespace prefix (e.g., `prod/`, `myapp/`) or `ROOT` for all
Applied For	`PULL_THROUGH_CACHE`, `CREATE_ON_PUSH`, or `REPLICATION`
Image Tag Mutability	`MUTABLE` or `IMMUTABLE`
Encryption	`AES256` (default) or `KMS` with custom keys
Repository Policy	IAM resource-based access control
Lifecycle Policy	Automatic image cleanup rules
Resource Tags	Metadata for organization and cost tracking

Part 2: Cross-Repository Layer Sharing (Blob Mounting)

The Problem

ECR’s repository-per-image model created another significant pain point: no layer sharing between repositories.

Container images are built in layers. When you have dozens of microservices all built on the same base image (say, node:18-alpine or your company’s custom base image), those base layers are identical across all your service images.

In most container registries, these common layers are stored once and shared. When you push a new image, the registry recognizes “I already have this layer” and skips the upload.

Not ECR. Every repository was an island. Push the same base layer to 50 different repositories? ECR would happily store 50 copies and charge you for all of them.

The GitHub issue #531, opened in October 2019, captured the frustration:

“Currently ECR doesn’t cache image layers between repositories and with ECR’s model of creating a repo per image this leads to quite poor performance, especially in situations where there are many images being built on a common base-image.” — @jespersoderlund

The impact was significant:

Slow pushes: Every image push uploaded the full image, even when layers already existed in other repositories
Increased storage costs: Duplicate layers stored across repositories
Poor replication performance: Cross-region replication had to transfer duplicate data
Frustrated platform teams: Managing hundreds of microservices meant dealing with these inefficiencies at scale

The Solution: Blob Mounting

On January 20th, 2026, AWS released blob mounting for ECR. This feature enables cross-repository layer sharing within a registry.

How It Works

When blob mounting is enabled:

During an image push, ECR checks if the layer already exists in any repository within the same registry
If found, ECR “mounts” (references) the existing layer instead of storing a duplicate
The push completes faster, and you don’t pay for duplicate storage

Push image to: myapp/service-a (has layers A, B, C)
                    |
                    v
        Layer A already exists in myapp/service-b
                    |
                    v
        ECR mounts existing Layer A (no re-upload needed)
                    |
                    v
        Layers B and C uploaded normally
                    |
                    v
        Push completes faster with less storage used

Key Concepts

Blob mounting only works within the same registry (same account and region)
Repositories must use identical encryption type and keys
Blob mounting is not supported for images created via pull through cache
If you disable blob mounting later, existing mounted layers continue to work
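
To make the rules above concrete, here is a small Python model of the decision a registry makes per layer. It is only a sketch under stated assumptions: `Encryption`, `Repo`, and `plan_push` are hypothetical names, the registry is a plain dictionary, and the only eligibility rule modeled is "identical encryption type and key".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Encryption:
    """Encryption settings of a repository (type plus optional KMS key)."""
    type: str = "AES256"
    kms_key: str | None = None


@dataclass
class Repo:
    encryption: Encryption = field(default_factory=Encryption)
    layers: set[str] = field(default_factory=set)  # layer digests already stored


def plan_push(registry: dict[str, Repo], target: str, layers: list[str]) -> dict[str, str]:
    """Decide, per layer digest, whether it already exists, can be mounted, or must be uploaded."""
    dest = registry[target]
    plan = {}
    for digest in layers:
        if digest in dest.layers:
            plan[digest] = "exists"
            continue
        # Only repositories with identical encryption settings can act as mount sources.
        source = next(
            (name for name, repo in registry.items()
             if name != target
             and repo.encryption == dest.encryption
             and digest in repo.layers),
            None,
        )
        plan[digest] = f"mount from {source}" if source else "upload"
    return plan


registry = {
    "myapp/service-b": Repo(layers={"sha256:A"}),
    "myapp/kms-svc": Repo(Encryption("KMS", "key-1"), {"sha256:B"}),
    "myapp/service-a": Repo(),
}
plan = plan_push(registry, "myapp/service-a", ["sha256:A", "sha256:B", "sha256:C"])
print(plan)
```

Layer A is mounted from `myapp/service-b`, while layer B is uploaded even though another repository holds it, because that repository uses different encryption settings.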

Enabling Blob Mounting

Using the AWS Console

Open the Amazon ECR console
Navigate to Private registry > Features & Settings > Blob mounting
Click Enable

Using the AWS CLI

aws ecr put-account-setting --name BLOB_MOUNTING --value ENABLED

Using Terraform

The Terraform support is merged and will be available in the next AWS provider release:

resource "aws_ecr_account_setting" "blob_mounting" {
  name  = "BLOB_MOUNTING"
  value = "ENABLED"
}

Getting Started with Create on Push

Using the AWS Console

Open the Amazon ECR console
Navigate to Private registry > Repository creation templates
Click Create template
Choose A specific prefix or Any prefix in your ECR registry
For Applied for, select CREATE_ON_PUSH (and optionally PULL_THROUGH_CACHE and REPLICATION)
Configure your desired settings (encryption, lifecycle policies, etc.)
Click Create

Using the AWS CLI

Create a template configuration file create-on-push-template.json:

{
  "prefix": "ROOT",
  "description": "Default template for all repositories created on push",
  "appliedFor": ["CREATE_ON_PUSH"],
  "imageTagMutability": "MUTABLE",
  "encryptionConfiguration": {
    "encryptionType": "AES256"
  },
  "lifecyclePolicy": "{\"rules\":[{\"rulePriority\":1,\"description\":\"Keep last 30 images\",\"selection\":{\"tagStatus\":\"any\",\"countType\":\"imageCountMoreThan\",\"countNumber\":30},\"action\":{\"type\":\"expire\"}}]}"
}

Then apply it:

aws ecr create-repository-creation-template \
    --cli-input-json file://create-on-push-template.json

Using Terraform

The aws_ecr_repository_creation_template resource was added in AWS provider version 6.28.0. Here are practical examples:

Basic Create on Push Template

resource "aws_ecr_repository_creation_template" "default" {
  prefix       = "ROOT"
  description  = "Default settings for all auto-created repositories"
  applied_for  = ["CREATE_ON_PUSH"]

  image_tag_mutability = "MUTABLE"

  encryption_configuration {
    encryption_type = "AES256"
  }
}

Production Template with Full Configuration

resource "aws_ecr_repository_creation_template" "production" {
  prefix       = "prod"
  description  = "Production repositories with strict settings"
  applied_for  = ["CREATE_ON_PUSH", "REPLICATION"]

  image_tag_mutability = "IMMUTABLE"

  encryption_configuration {
    encryption_type = "KMS"
    kms_key         = aws_kms_key.ecr.arn
  }

  resource_tags = {
    Environment = "production"
    ManagedBy   = "terraform"
    CostCenter  = "platform-team"
  }

  # Custom IAM role required when using KMS or resource tags
  custom_role_arn = aws_iam_role.ecr_template_role.arn

  lifecycle_policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 50 tagged images"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["v"]
          countType     = "imageCountMoreThan"
          countNumber   = 50
        }
        action = {
          type = "expire"
        }
      },
      {
        rulePriority = 2
        description  = "Expire untagged images older than 7 days"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 7
        }
        action = {
          type = "expire"
        }
      }
    ]
  })

  repository_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowPull"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action = [
          "ecr:GetDownloadUrlForLayer",
          "ecr:BatchGetImage",
          "ecr:BatchCheckLayerAvailability"
        ]
      }
    ]
  })
}

# KMS key for repository encryption
resource "aws_kms_key" "ecr" {
  description             = "KMS key for ECR repository encryption"
  deletion_window_in_days = 7
  enable_key_rotation     = true
}

# IAM role for repository creation template
resource "aws_iam_role" "ecr_template_role" {
  name = "ecr-repository-creation-template-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "ecr.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy" "ecr_template_policy" {
  name = "ecr-template-policy"
  role = aws_iam_role.ecr_template_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ecr:CreateRepository",
          "ecr:TagResource",
          "ecr:PutLifecyclePolicy",
          "ecr:SetRepositoryPolicy"
        ]
        Resource = "*"
      },
      {
        Effect = "Allow"
        Action = [
          "kms:Encrypt",
          "kms:Decrypt",
          "kms:GenerateDataKey*"
        ]
        Resource = aws_kms_key.ecr.arn
      }
    ]
  })
}

Multiple Environment Templates

locals {
  environments = {
    dev = {
      prefix           = "dev"
      tag_mutability   = "MUTABLE"
      encryption_type  = "AES256"
      image_retention  = 10
    }
    staging = {
      prefix           = "staging"
      tag_mutability   = "MUTABLE"
      encryption_type  = "AES256"
      image_retention  = 20
    }
    prod = {
      prefix           = "prod"
      tag_mutability   = "IMMUTABLE"
      encryption_type  = "KMS"
      image_retention  = 100
    }
  }
}

resource "aws_ecr_repository_creation_template" "environment" {
  for_each = local.environments

  prefix       = each.value.prefix
  description  = "${each.key} environment repositories"
  applied_for  = ["CREATE_ON_PUSH"]

  image_tag_mutability = each.value.tag_mutability

  encryption_configuration {
    encryption_type = each.value.encryption_type
    kms_key         = each.value.encryption_type == "KMS" ? aws_kms_key.ecr.arn : null
  }

  # resource_tags are set for every environment, so the custom role is always required
  custom_role_arn = aws_iam_role.ecr_template_role.arn

  resource_tags = {
    Environment = each.key
  }

  lifecycle_policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last ${each.value.image_retention} images"
        selection = {
          tagStatus   = "any"
          countType   = "imageCountMoreThan"
          countNumber = each.value.image_retention
        }
        action = {
          type = "expire"
        }
      }
    ]
  })
}

Complete Terraform Configuration: Both Features

Here’s a complete Terraform configuration that enables both new features:

# Enable blob mounting for cross-repository layer sharing
resource "aws_ecr_account_setting" "blob_mounting" {
  name  = "BLOB_MOUNTING"
  value = "ENABLED"
}

# Create a default repository creation template for create-on-push
resource "aws_ecr_repository_creation_template" "default" {
  prefix       = "ROOT"
  description  = "Default settings for all auto-created repositories"
  applied_for  = ["CREATE_ON_PUSH", "PULL_THROUGH_CACHE", "REPLICATION"]

  image_tag_mutability = "MUTABLE"

  encryption_configuration {
    encryption_type = "AES256"
  }

  lifecycle_policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep last 50 images"
        selection = {
          tagStatus   = "any"
          countType   = "imageCountMoreThan"
          countNumber = 50
        }
        action = {
          type = "expire"
        }
      }
    ]
  })
}

IAM Permissions Required

To create and manage repository creation templates, ensure your IAM principal has these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:CreateRepositoryCreationTemplate",
        "ecr:UpdateRepositoryCreationTemplate",
        "ecr:DescribeRepositoryCreationTemplates",
        "ecr:DeleteRepositoryCreationTemplate",
        "ecr:CreateRepository",
        "ecr:PutLifecyclePolicy",
        "ecr:SetRepositoryPolicy",
        "ecr:PutAccountSetting",
        "ecr:GetAccountSetting"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/ecr-*"
    }
  ]
}

Important Considerations

Template Prefix Matching

Templates are matched based on the most specific prefix. For example:

A repository named prod/team/myapp would match a template with prefix prod/team over one with just prod
Use ROOT as the prefix to create a catch-all template for repositories that don’t match any other template
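
The matching rule can be sketched in Python as follows. This is an illustration, not ECR's code: `match_template` is a hypothetical helper, and it assumes that a prefix only matches on a `/` boundary (so `prod` covers `prod/...` but not `production/...`).

```python
from __future__ import annotations


def match_template(repo_name: str, prefixes: list[str]) -> str | None:
    """Return the most specific template prefix for repo_name, or None."""
    candidates = [
        p for p in prefixes
        if p != "ROOT" and repo_name.startswith(p.rstrip("/") + "/")
    ]
    if candidates:
        # Most specific means the longest matching prefix.
        return max(candidates, key=len)
    # ROOT is only a fallback for names that match nothing else.
    return "ROOT" if "ROOT" in prefixes else None


prefixes = ["prod", "prod/team", "ROOT"]
print(match_template("prod/team/myapp", prefixes))  # prod/team
print(match_template("prod/billing", prefixes))     # prod
print(match_template("dev/sandbox", prefixes))      # ROOT
```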

Templates Don’t Affect Existing Repositories

Repository creation templates only apply when ECR creates a new repository. Existing repositories are not modified. If you need to update existing repositories, you’ll need to do that separately.

KMS and Tags Require an IAM Role

If your template uses KMS encryption or resource tags, you must specify a customRoleArn. This role is assumed by ECR when creating repositories. Without it, repository creation will fail.
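
A quick pre-flight check for this rule could look like the sketch below. It assumes the template is a dictionary shaped like the CLI input JSON shown earlier; the `resourceTags` key name and the `missing_custom_role` helper are assumptions made for illustration.

```python
def missing_custom_role(template: dict) -> bool:
    """True if the template uses KMS or resource tags but sets no customRoleArn."""
    uses_kms = template.get("encryptionConfiguration", {}).get("encryptionType") == "KMS"
    uses_tags = bool(template.get("resourceTags"))
    return (uses_kms or uses_tags) and not template.get("customRoleArn")


template = {
    "prefix": "prod",
    "appliedFor": ["CREATE_ON_PUSH"],
    "encryptionConfiguration": {"encryptionType": "KMS"},
}
print(missing_custom_role(template))  # True: KMS is used but no role is set
```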

Blob Mounting Encryption Requirements

For blob mounting to work between repositories, they must use identical encryption types and keys. If you have repositories with different encryption configurations, layers cannot be shared between them.

Summary

After waits of more than five and six years respectively, ECR has finally caught up with every other container registry on the planet. Together, these two features transform ECR from a clunky, DevOps-hostile service into something genuinely competitive.

What we gained with Create on Push:

Seamless push workflows—just push, the repository appears
Consistent repository settings across your organization
Simplified CI/CD pipelines—no more pre-creation scripts
Better platform integrations (goodbye Crossplane workarounds!)
Parity with other major container registries

What we gained with Blob Mounting:

Faster image pushes by reusing existing layers
Reduced storage costs by eliminating duplicate layers
Improved replication performance
Better efficiency for microservices architectures with shared base images

What’s different from other registries:

You need to create at least one repository creation template first
You need to explicitly enable blob mounting at the registry level
Templates give you more control over default settings than most registries offer

This is exactly what the community asked for, and AWS delivered. For anyone who commented on issues #853 and #531, voted them up, or wrote workarounds like I did—our patience has been rewarded.

Now if you’ll excuse me, I have some Crossplane and Kyverno resources to delete.

Resources

Create on Push

Blob Mounting

My Previous Blog: TAP with ECR – Crossplane and Kyverno to the Rescue — A historical record of the workarounds we needed

Posted on January 2026

January 27, 2026

CNCF, containers, DevOps

KYAML in Kubernetes: a solution in search of a problem?

With Kubernetes 1.35, KYAML moved to beta and was enabled by default as a new output format for kubectl. The intent is clear: reduce ambiguity, improve safety, and avoid long-standing YAML footguns like implicit typing.

On paper, that sounds reasonable.

In practice, I think KYAML is a mistake.

This post isn’t meant to be inflammatory, and I fully understand why KYAML exists. But after looking at the format, using it, and thinking about where it fits in the broader configuration ecosystem, I don’t see a legitimate use case where KYAML is the right answer.

What problem is KYAML trying to solve?

KYAML exists primarily to address YAML ambiguity. The canonical example is the so-called “Norway problem,” where values like:
```
enabled: NO
```
are interpreted as booleans instead of strings. KYAML avoids this by enforcing:
- quoted strings
- explicit {} maps and [] lists
- a restricted, predictable subset of YAML
The goal is to make Kubernetes configuration:
- safer
- less surprising
- more machine-friendly
These are all valid goals, but here’s the key question: Is this problem big enough to justify a new format?

In my view, no.

The ambiguity problem is real — but overstated

Yes, YAML’s implicit typing can surprise people. Yes, NO, YES, ON, timestamps, and numbers can bite you. But in real-world Kubernetes usage this is a known issue, it is well documented, and it is easy to avoid with basic discipline.

Quote your strings, treat YAML as data rather than prose, and lint your manifests.
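
As a rough illustration of how little linting this takes, here is a toy check in Python. It is a deliberately naive line-based sketch, not a YAML parser: the `find_ambiguous_scalars` name is made up, and the pattern only covers the yes/no/on/off family of spellings.

```python
from __future__ import annotations

import re

# Unquoted spellings that YAML 1.1-style parsers may resolve to booleans.
AMBIGUOUS = re.compile(r"^(y|n|yes|no|on|off)$", re.IGNORECASE)


def find_ambiguous_scalars(manifest: str) -> list[tuple[int, str, str]]:
    """Return (line number, key, value) for unquoted values such as NO, yes, or on."""
    findings = []
    for lineno, line in enumerate(manifest.splitlines(), start=1):
        key, sep, value = line.partition(":")
        value = value.split(" #", 1)[0].strip()  # drop trailing comments
        if sep and AMBIGUOUS.match(value):
            findings.append((lineno, key.strip().lstrip("- "), value))
    return findings


manifest = """\
country: NO
enabled: "NO"
debug: on
name: api
"""
print(find_ambiguous_scalars(manifest))
```

Only the unquoted `NO` and `on` are reported; the quoted value on the second line is left alone.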

This is not a systemic failure — it’s a sharp edge that experienced users already understand, and solving only this problem does not justify introducing an entirely new representation style.

KYAML takes the worst parts of JSON and YAML

This is my core objection.

KYAML combines JSON’s visual noise ({}, [], commas everywhere) with YAML’s complexity and special rules, without fully embracing the strengths of either:
- If you dislike whitespace sensitivity – use JSON
- If you want human-readable configuration – use YAML
KYAML sits awkwardly in the middle. It looks like JSON, but it isn’t JSON. It parses as YAML, but it doesn’t feel like YAML. Calling it “just YAML” is technically correct but practically misleading.

“Technically valid YAML” is not the same as “usable YAML”

One of the most common defenses of KYAML is:

“It’s still YAML. Any YAML parser can read it.”

That’s true, and also irrelevant. Formats are not just about parsers. They’re about readability, ergonomics, familiarity, and how humans reason about structure.

KYAML does not look like YAML, it does not read like YAML, and it does not behave like YAML in the ways people care about.

If I showed a KYAML manifest to someone and asked “what format is this?”, almost no one would answer “YAML” without context, and that disconnect matters.

Human readability still matters — a lot

Kubernetes configuration is read far more often than it is written: by on-call engineers, in debug sessions, during quick reviews and copy/paste fixes, and in teaching and documentation.

Block-style YAML excels here:
```
containers:
  - name: api
    image: nginx:v1
    ports:
      - containerPort: 8080
```
KYAML does not:
```
containers: [
  {
    name: "api",
    image: "nginx:v1",
    ports: [{ containerPort: 8080 }],
  },
]
```
This is denser, noisier, and harder to scan, especially at scale. And multi-line strings? They become actively painful.

For a system that routinely embeds scripts, configs, policies, and templates inside manifests, this is a serious issue in my mind.

As one Reddit commenter put it:

It’s basically JSON with trailing commas and comment support… annoying quotes, commas and brackets? No thanks.

Multi-line strings are where KYAML really falls apart

One area where KYAML’s weaknesses become impossible to ignore is multi-line strings.

Kubernetes manifests frequently embed shell scripts, application configuration files, policy documents, JSON blobs, NGINX, Envoy, or HAProxy configs, etc.

This is exactly where block-style YAML shines.

Block YAML: readable and intention-revealing

With standard YAML, multi-line content is explicit, readable, and easy to reason about:
```
data:
  startup.sh: |
    #!/bin/sh
    set -e

    echo "Starting application"
    exec /app/server
```
At a glance, it’s obvious where the string starts, where it ends, what belongs to it, and what does not.

This is one of YAML’s biggest strengths as a human-oriented configuration format.

KYAML: dense, noisy, and hostile to humans

The same content in KYAML-style flow syntax becomes significantly harder to read and edit:
```
data: {
  startup.sh: "#!/bin/sh\nset -e\n\necho \"Starting application\"\nexec /app/server\n",
}
```
This has several practical downsides:
- Newlines are hidden behind escape sequences
- Quoting and escaping noise overwhelms the content
- Editing becomes error-prone
- Diff reviews become painful
- Copy/paste workflows degrade badly
At this point, the configuration stops being something you read and becomes something you decode. This is not an edge case. Multi-line strings are not rare in Kubernetes; they are everywhere: ConfigMaps, Secrets (before base64), init scripts, policy engines, sidecar configs, templated applications, and more.

A format that makes this use case worse is a poor fit for Kubernetes in practice.
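
The diff-review point can be demonstrated with a short Python sketch. It renders the same script both ways and compares what a reviewer would see after a one-word change. The helper names are made up, and `json.dumps` is used as a stand-in for the double-quoted escaping shown above.

```python
from __future__ import annotations

import difflib
import json

old_script = '#!/bin/sh\nset -e\n\necho "Starting application"\nexec /app/server\n'
new_script = old_script.replace("Starting application", "Starting API")


def block_style(script: str) -> list[str]:
    """Render the script as a YAML block scalar under data.startup.sh."""
    body = [("    " + line) if line else "" for line in script.splitlines()]
    return ["data:", "  startup.sh: |", *body]


def flow_style(script: str) -> list[str]:
    """Render the script as one double-quoted, escaped string."""
    return ["data: {", f"  startup.sh: {json.dumps(script)},", "}"]


def changed_lines(before: list[str], after: list[str]) -> list[str]:
    """Lines that a line-based diff would mark as added."""
    return [d[2:] for d in difflib.ndiff(before, after) if d.startswith("+ ")]


block_changed = changed_lines(block_style(old_script), block_style(new_script))
flow_changed = changed_lines(flow_style(old_script), flow_style(new_script))
print(block_changed)  # only the echo line
print(flow_changed)   # the entire escaped script on one line
```

In block style the diff touches only the `echo` line. In the escaped form the changed line carries the whole script, so the reviewer has to find the edit inside it.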

Even supporters of KYAML have acknowledged this weakness. As one Reddit commenter put it:

“I think I like everything except for multi-line strings… this just seems so obnoxious.”

That reaction is telling. If a format fails at one of Kubernetes’ most common real-world patterns, that’s not a minor flaw; it’s a fundamental mismatch.

Readability matters more than theoretical safety here. Yes, KYAML avoids certain ambiguity issues, but in exchange, it degrades one of the most valuable properties of YAML: clear, readable representation of structured text.

For me, this alone is enough to disqualify KYAML as a default or recommended format for Kubernetes manifests.

If YAML isn’t enough, KYAML isn’t the answer

Here’s where I think the Kubernetes ecosystem already has better answers.

If you want stronger typing, validation, composition, abstraction, and safety guarantees, then configuration languages like KCL, Carvel YTT, and Cue make sense.

These tools solve real problems: large-scale config management, reusability, correctness, and policy enforcement. KYAML does none of this. It doesn’t add expressiveness, it doesn’t add validation, and it doesn’t add abstraction. It only changes formatting.

If we’re going to introduce something new, it should meaningfully improve what we can do — not just how braces are arranged.

“But it’s better for machines”

This is the strongest argument in favor of KYAML, and even here, I’m unconvinced.

If the goal is machine-friendliness, JSON already exists, is unambiguous, is universally supported, and is widely understood.

I regularly switch to JSON precisely because I want to avoid whitespace sensitivity. KYAML simply adds complexity without gaining the clarity of a true data format.

Where I land

I don’t believe KYAML has a compelling use case.

It doesn’t replace YAML for humans, it doesn’t replace JSON for machines, and it doesn’t replace higher-level config languages for safety or scale.

It solves a narrow problem that already has known, simple mitigations — and in doing so, creates new ergonomic and cognitive costs.

This all being said, it doesn’t mean it won’t see adoption. Kubernetes has enormous gravity, and defaults matter.

But from my perspective, KYAML feels like a format born from good intentions, solving the wrong problem, at the wrong layer.

Summary

It will be interesting to see whether the industry embraces it or not.

I have yet to meet a person who truly likes this new format, but I know many are out there, just as many people believe go-templating and jsonnet are good choices, and I don’t.

I do still like that Kubernetes supports normal YAML and JSON for input and output, so I am not against the notion of having another option. I just think this option is worse than what we already have, so it makes no sense to me why we are investing in it at all.

KYAML is a well-intentioned experiment. As an optional output format, it has some value — especially for tools that need predictable structure. But as a default manifest representation, it fails the core requirement of being human-friendly and intuitive.

It fixes one specific YAML footgun at the cost of introducing a format most people will honestly hate reading. Worse, it doesn’t actually solve the deeper problems that make Kubernetes configuration hard at scale.

I’m not anti-improvement. I just think this isn’t the improvement we needed.

If the industry truly wants safer, more robust configs, we should embrace type-safe config languages, schema-aware tooling, and problem-focused ergonomics, not yet another YAML dialect that looks like JSON but behaves worse for humans.

December 28, 2025

CNCF, DevOps, Platform Engineering

Backstage As The Ultimate MCP Server

The evolution of developer platforms has always been about providing the right abstractions at the right time. But what if your platform could not just serve humans through beautiful UIs, but also empower AI agents to interact with your business logic in a standardized, secure way? This is the promise of Backstage’s recent integration with the Model Context Protocol (MCP) – and it’s a game-changer for platform engineering.

The MCP Revolution in Backstage

Over the past few releases, Backstage has undergone a significant transformation in how it exposes functionality to different consumers. Three key milestones have defined this journey:

Backstage 1.40: The Actions Registry and MCP Server

With version 1.40, Backstage introduced the Actions Registry – a centralized system for registering and discovering actions within backend plugins. This wasn’t just another API endpoint; it was the foundation for a new way of thinking about plugin functionality. Actions could now be defined once, with proper input/output schemas, descriptions, and metadata.

The Actions Registry provided:
- Type-safe action definitions with Zod schema validation
- Centralized discovery of available actions across all plugins
- Standardized metadata including names, titles, and descriptions
- Permission integration ensuring consistent authorization across all action invocations
- Automatic MCP server exposure of all registered actions
But there was a critical limitation: authentication required static tokens that had to be pre-configured. These tokens weren’t tied to specific users, meaning:
- No user-specific identity in audit logs
- Manual token creation and management
- Tokens shared across users or use cases
- Permissions couldn’t be enforced at the user level for MCP calls

Backstage 1.43: Scaffolder Integration

Version 1.43 brought the Actions Registry into the scaffolder, allowing template authors to leverage any registered action as a scaffolder step. This meant that functionality previously locked in plugin APIs could now be orchestrated through templates, dramatically expanding the power of software templates.

Backstage 1.43: Dynamic Client Registration – The Game Changer

The experimental Dynamic Client Registration support for MCP Actions Backend solved the authentication problem that had limited 1.40’s MCP capabilities. Instead of static tokens, the system now supports OAuth flows that bring true user authentication to MCP interactions.

Here’s how the OAuth flow works in practice:
1. An AI assistant (like Claude Desktop or Cursor) attempts to call an MCP action
2. A browser window pops up prompting you to log in to Backstage
3. You authenticate using your normal Backstage login (SSO, GitHub, Microsoft, etc.)
4. You authorize the MCP client to act on your behalf
5. The MCP server receives your user token for all subsequent calls
6. The AI assistant can now make authenticated calls as you until the token expires
This transformation meant:
- Every MCP call is authenticated as YOU – proper user identity throughout
- User-specific permissions are enforced (what you can do in the UI, you can do via MCP)
- No pre-configuration needed – no static tokens to create or manage
- Audit trails properly show which user performed each action
- Zero trust security model – each user proves their identity dynamically
- Session-based access – tokens can expire, requiring re-authentication just like the web UI
The difference between 1.40 and 1.43 is the difference between “the MCP bot has access” and “you have access through the MCP bot.” One is a security risk; the other is a security-first architecture.

The Security Breakthrough: User-Aware AI Interactions

The importance of the shift from static tokens to dynamic OAuth-based authentication in 1.43 cannot be overstated. This is the difference between treating AI assistants as service accounts with broad permissions and treating them as extensions of individual users with specific permissions.

Consider what this enables:
- A junior developer’s AI assistant can’t delete production resources they don’t have access to
- Audit logs show “John deployed via AI assistant” not “MCP service account deployed”
- Permission policies apply uniformly: if you can’t do it in the UI, your AI can’t do it either
- Compliance and governance requirements are easier to meet, because every action is attributed to a real user
This is true zero-trust architecture applied to AI interactions. The AI assistant is not a privileged actor – it’s an interface through which your existing identity and permissions flow.
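To make the "if you can’t do it in the UI, your AI can’t do it either" idea concrete, here is a minimal sketch of a single decision function shared by every channel. All names in it (`User`, `Channel`, `authorize`, the permission strings) are hypothetical and are not part of the Backstage permission framework; the only point is that the channel is logged but never changes the outcome.

```typescript
// Hypothetical model: none of these names come from Backstage itself.
type Permission = 'catalog.read' | 'deploy.staging' | 'deploy.production';
type Channel = 'ui' | 'mcp';

interface User {
  name: string;
  permissions: ReadonlySet<Permission>;
}

interface AuditEntry {
  user: string;
  channel: Channel;
  permission: Permission;
  allowed: boolean;
}

const auditLog: AuditEntry[] = [];

// One decision function for every channel: the channel is recorded for
// auditing but never influences the result.
function authorize(user: User, permission: Permission, channel: Channel): boolean {
  const allowed = user.permissions.has(permission);
  auditLog.push({ user: user.name, channel, permission, allowed });
  return allowed;
}

const junior: User = {
  name: 'john',
  permissions: new Set<Permission>(['catalog.read', 'deploy.staging']),
};

const viaUi = authorize(junior, 'deploy.production', 'ui');
const viaMcp = authorize(junior, 'deploy.production', 'mcp');
console.log(viaUi, viaMcp); // both false
console.log(auditLog);
```

Because the audit entry carries the real user name, the log reads as "john, via mcp" rather than as an anonymous service account.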

The Platform Engineering Dream: Write Once, Expose Everywhere

The true power of this architecture lies in its simplicity. As a plugin developer, you define your action once:
```
actionsRegistry.register({
  name: 'get_crossplane_resources',
  title: 'Get Crossplane Resources',
  description: 'Returns Crossplane resources and their dependencies',
  schema: {
    input: z => z.object({
      backstageEntityName: z.string().describe('The name of the Backstage entity'),
      backstageEntityKind: z.string().describe('The kind of the Backstage entity. Defaults to component.').optional(),
      backstageEntityNamespace: z.string().describe('The namespace of the Backstage entity. Defaults to default.').optional(),
    }),
    output: z => z.object({
      resources: z.array(z.object({
        type: z.enum(['XRD', 'Claim', 'Resource']).describe('The type of the resource'),
        name: z.string().describe('The name of the resource'),
        // ... more schema definition
      })),
    }),
  },
  action: async ({ input, credentials }) => {
    // Permission check
    const authorized = await permissions.authorize(
      [
        { permission: listClaimsPermission },
        { permission: listCompositeResourcesPermission },
      ],
      { credentials }
    );

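    // Deny only when none of the requested permissions is allowed
    if (authorized.every(a => a.result !== AuthorizeResult.ALLOW)) {
      // NotAllowedError (from @backstage/errors) maps to HTTP 403, unlike InputError (400)
      throw new NotAllowedError('Access denied');
    }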

    // Business logic
    const result = await service.getResources({...input});
    return { output: result };
  },
});
```
This single action definition becomes:
1. An MCP tool that AI assistants can call
2. A scaffolder action usable in software templates
3. A backend API accessible to frontend plugins
And critically, permission enforcement happens at the action level, ensuring consistent authorization regardless of how the action is invoked.
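The "define once, expose everywhere" idea can be sketched without any Backstage dependencies. The `MiniRegistry` class below is a hypothetical, heavily simplified stand-in (no schemas, credentials, or permissions), and the `action:` prefix for template steps is an assumption made purely for illustration.

```typescript
// Hypothetical, simplified stand-in for an actions registry.
interface ActionDefinition<I, O> {
  name: string;
  title: string;
  description: string;
  run: (input: I) => O;
}

class MiniRegistry {
  private readonly actions = new Map<string, ActionDefinition<any, any>>();

  register<I, O>(action: ActionDefinition<I, O>): void {
    if (this.actions.has(action.name)) {
      throw new Error(`Action ${action.name} is already registered`);
    }
    this.actions.set(action.name, action);
  }

  // Interface 1: a tool listing that an AI assistant could discover
  listTools(): { name: string; description: string }[] {
    return Array.from(this.actions.values()).map(action => ({
      name: action.name,
      description: action.description,
    }));
  }

  // Interface 2: step identifiers a template engine could reference
  listTemplateSteps(): string[] {
    return Array.from(this.actions.keys()).map(name => `action:${name}`);
  }

  // Interface 3: direct invocation from other backend code
  invoke(name: string, input: unknown): unknown {
    const action = this.actions.get(name);
    if (!action) {
      throw new Error(`Unknown action ${name}`);
    }
    return action.run(input);
  }
}

const registry = new MiniRegistry();
registry.register({
  name: 'get_greeting',
  title: 'Get Greeting',
  description: 'Returns a greeting for the given name',
  run: (input: { name: string }) => ({ greeting: `Hello, ${input.name}` }),
});

console.log(registry.listTools());
console.log(registry.listTemplateSteps());
console.log(registry.invoke('get_greeting', { name: 'Backstage' }));
```

All three views are derived from the same stored definition, so there is no second copy of the business logic to keep in sync.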

Real-World Implementation Patterns

In the backstage-plugins repository, we’ve implemented MCP integration using two distinct patterns:

Pattern 1: Direct MCP Actions in Plugins

For plugins with well-defined domain logic, we expose MCP actions directly from the backend plugin. This approach is used in:
- crossplane-resources-backend: Exposes actions for managing Crossplane claims, composites, and managed resources
- kyverno-policy-reports-backend: Provides policy validation and compliance checks
- kro-resources-backend: Manages KRO (Kubernetes Resource Orchestrator) resources
- vcf-automation-backend: Integrates with VMware Cloud Foundation Automation
- vcf-operations-backend: Provides observability and operations insights
- educates-backend: Manages training portal sessions and workshops
Here’s an example from the Kyverno plugin showing how simple it is:
```
export function registerMcpActions(
  actionsRegistry: typeof actionsRegistryServiceRef.T,
  service: KubernetesService,
  permissions: PermissionsService,
  auth: AuthService
) {
  actionsRegistry.register({
    name: 'get_kyverno_policy_reports',
    title: 'Get Kyverno Policy Reports',
    description: 'Returns policy reports for a given entity',
    schema: {
      input: z => z.object({
        entity: z.object({
          metadata: z.object({
            name: z.string().describe('The name of the entity'),
            namespace: z.string().describe('The namespace of the entity'),
          }),
        }).describe('The entity to get policy reports for'),
      }),
      output: z => z.object({
        reports: z.array(z.object({
          // Schema definition
        })),
      }),
    },
    action: async ({ input, credentials }) => {
      // Permission check
      const decision = await permissions.authorize(
        [{ permission: showKyvernoReportsPermission }],
        { credentials }
      );

      if (decision[0].result !== AuthorizeResult.ALLOW) {
        // NotAllowedError (from @backstage/errors) maps to HTTP 403, unlike InputError (400)
        throw new NotAllowedError('Access denied');
      }

      // Execute business logic
      const reports = await service.getPolicyReports({ entity: input.entity });
      return { output: { reports } };
    },
  });
}
```
Pattern 2: Dedicated MCP Bridge Plugins

For core Backstage functionality that doesn’t naturally expose actions (like the catalog or RBAC systems), we’ve created dedicated bridge plugins that use the discovery service and plugin-to-plugin communication to call other plugins’ REST APIs:
- catalog-mcp-backend: Bridges to the Catalog API for entity queries
- scaffolder-mcp-backend: Bridges to the Scaffolder API for template operations
- rbac-mcp-backend: Bridges to the RBAC system for permission management
This pattern is particularly elegant because it demonstrates how you can wrap any existing backend API with MCP actions. Here’s how the catalog-mcp plugin makes authenticated calls to the Catalog API:
```
export function registerMcpActions(
  actionsRegistry: typeof actionsRegistryServiceRef.T,
  discovery: DiscoveryService,
  auth: AuthService
) {
  actionsRegistry.register({
    name: 'get_entities_by_owner',
    title: 'Get Entities by Owner',
    description: 'Retrieves all catalog entities owned by a specific user or group',
    schema: {
      input: z => z.object({
        owner: z.string().describe('Owner reference in format "user:namespace/name"'),
      }),
      output: z => z.object({
        entities: z.array(z.any()).describe('Array of catalog entities'),
        count: z.number().describe('Total number of entities found'),
      }),
    },
    action: async ({ input, credentials }) => {
      // Discover the catalog service
      const catalogUrl = await discovery.getBaseUrl('catalog');
      
      // Get authenticated token for catalog API
      const { token } = await auth.getPluginRequestToken({
        onBehalfOf: credentials,
        targetPluginId: 'catalog',
      });

      // Make authenticated API call
      const response = await fetch(
        `${catalogUrl}/entities/by-query?filter=spec.owner=${encodeURIComponent(input.owner)}`,
        { headers: { Authorization: `Bearer ${token}` } }
      );

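      // Fail loudly on non-2xx responses instead of parsing an error body
      if (!response.ok) {
        throw new Error(`Catalog request failed: ${response.status} ${response.statusText}`);
      }

      const data = await response.json();
      return { output: { entities: data.items, count: data.items.length } };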
    },
  });
}
```
This pattern means that any existing Backstage plugin can be MCP-enabled without modifying its core functionality.
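When wrapping a REST API like this, the two places bugs tend to hide are URL construction and response handling. The sketch below isolates both as pure functions so they can be checked without a running Backstage instance. The helper names (`buildOwnerQueryUrl`, `toActionOutput`) and the base URL are hypothetical, and the filter is encoded as a whole here, which the server decodes to the same expression as in the example above.

```typescript
// Hypothetical helpers; names and shapes are illustrative only.
interface ActionOutput {
  entities: unknown[];
  count: number;
}

// Build the by-query URL, encoding the whole filter expression.
function buildOwnerQueryUrl(catalogBaseUrl: string, owner: string): string {
  const filter = encodeURIComponent(`spec.owner=${owner}`);
  return `${catalogBaseUrl}/entities/by-query?filter=${filter}`;
}

// Turn an HTTP status and a parsed JSON body into the action output,
// failing loudly instead of returning a half-parsed response.
function toActionOutput(status: number, body: unknown): ActionOutput {
  if (status < 200 || status >= 300) {
    throw new Error(`Catalog request failed with status ${status}`);
  }
  const items =
    typeof body === 'object' && body !== null
      ? (body as { items?: unknown }).items
      : undefined;
  if (!Array.isArray(items)) {
    throw new Error('Catalog response did not contain an items array');
  }
  return { entities: items, count: items.length };
}

const ownerQueryUrl = buildOwnerQueryUrl(
  'http://localhost:7007/api/catalog', // assumed local backend URL
  'user:default/jane',
);
console.log(ownerQueryUrl);

const sampleOutput = toActionOutput(200, { items: [{ kind: 'Component' }] });
console.log(sampleOutput.count);
```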

Configuration: Declaring Your MCP Sources

To enable MCP actions in your Backstage instance, you simply declare which plugins should have their actions exposed in your app-config.yaml:
```
backend:
  actions:
    pluginSources:
      - 'catalog'
      - 'vcf-automation'
      - 'kro'
      - 'vcf-operations'
      - 'ai-rules'
      - 'kyverno'
      - 'educates'
      - 'crossplane'
      - 'scaffolder-mcp'
      - 'rbac-mcp'
      - 'catalog-mcp'
```
That’s it. No complex routing rules, no additional authentication setup. The MCP Actions Backend automatically discovers and exposes all registered actions from these plugins, maintaining the same permission model that protects your REST APIs and frontend components.
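Conceptually, `pluginSources` behaves like an allowlist: only actions registered by the listed plugins are exposed. The sketch below models that idea only; it is not the MCP Actions Backend implementation, and the `internal-tools` plugin and `rotate_secrets` action are invented for the example.

```typescript
// Hypothetical model of an allowlist; not the MCP Actions Backend's code.
interface RegisteredAction {
  pluginId: string;
  name: string;
}

function exposedActions(
  registered: RegisteredAction[],
  pluginSources: string[],
): RegisteredAction[] {
  const allowed = new Set(pluginSources);
  return registered.filter(action => allowed.has(action.pluginId));
}

const registeredActions: RegisteredAction[] = [
  { pluginId: 'catalog-mcp', name: 'get_entities_by_owner' },
  { pluginId: 'kyverno', name: 'get_kyverno_policy_reports' },
  // Invented plugin that is deliberately left out of pluginSources
  { pluginId: 'internal-tools', name: 'rotate_secrets' },
];

const exposed = exposedActions(registeredActions, ['catalog-mcp', 'kyverno']);
console.log(exposed.map(action => action.name));
```

Anything not listed stays invisible to MCP clients, which keeps the exposed surface an explicit decision rather than a side effect of installing a plugin.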

Why This Matters: The AI Platform Play

We’re witnessing a fundamental shift in how platforms are consumed. The assumption that humans are the only consumers is being challenged daily. AI agents, automation systems, and intelligent workflows need programmatic access to platform capabilities – and they need it to be:
1. Discoverable: AI agents should be able to explore what’s possible
2. Well-documented: Each action needs clear descriptions and type information
3. Secure: Authorization must be consistent across all access methods
4. Reliable: Type-safe interfaces prevent integration bugs
Backstage’s MCP integration delivers on all these requirements. But more importantly, it does so without asking platform teams to duplicate effort. The same business logic that powers your internal developer portal now powers your AI agents, your automation workflows, and your software templates.

This is what a mature platform looks like – multiple interfaces over a single source of truth, with security and governance baked in at the core.

The Future Is Multi-Interface

The days of single-purpose platforms are behind us. Modern platforms need to serve:
- Humans through intuitive web interfaces
- Developers through CLI tools and APIs
- AI agents through protocols like MCP
- Automation systems through event-driven architectures
Backstage’s architecture has evolved to support this multi-interface reality. By treating actions as first-class citizens with proper schemas, permissions, and metadata, it enables a single implementation to serve all consumers.

This is the promise of platform engineering realized: build once, expose everywhere, secure by default.

Try It Yourself

Want to see this in action? At KubeCon NA 2025, I had the privilege of presenting a hands-on workshop alongside industry experts Ana Margarita Medina (Upbound), Cortney Nickerson (Nirmata), and Christian Hernandez (GitOps Advocate). Together, we led “Build Your Internal Developer Platform With the Experts: A Hands-On Workshop” covering:
- Building a complete IDP with Backstage, Crossplane, Argo CD, and Kyverno
- Implementing MCP integration across multiple plugins
- Leveraging AI agents to interact with your platform
- Production-proven patterns from real-world implementations
The complete workshop materials, including all the plugins mentioned in this post, are available at:

https://github.com/back-stack/kubecon-na-2025

The repository includes:
- Complete plugin implementations with MCP actions from terasky-oss/backstage-plugins
- Configuration examples and best practices
- A fully functional local development environment
- Documentation on extending the patterns for your use cases
Conclusion

Backstage’s MCP integration represents more than just another feature – it’s a paradigm shift in how we think about platform engineering. By providing a standardized way to expose business logic across multiple interfaces while maintaining security and type safety, Backstage has positioned itself as the ultimate MCP server for enterprise platforms.

What makes this truly revolutionary is the 1.43 authentication breakthrough. By enabling dynamic OAuth flows instead of static tokens, Backstage ensures that AI assistants operate within the same security boundaries as their human users. This isn’t just convenient – it’s the security model that enterprises require before they can truly embrace AI-assisted workflows.

The future of developer platforms isn’t just about serving web pages to humans. It’s about creating intelligent, composable systems that can be consumed by humans, AI agents, and automation systems alike – all while maintaining the same rigorous security, governance, and audit capabilities. With the Actions Registry and dynamic MCP authentication, Backstage has given us the tools to build that future today.

The question is no longer “Can we build a platform that AI agents can use securely?” but rather “What will we build now that AI agents can use our platform as us?”

This post covers Backstage features introduced in versions 1.40 and 1.43. For the latest updates and documentation, visit the Backstage documentation.
December 10, 2025

AI, Backstage, CNCF, DevOps, Platform Engineering, security
Bringing the Power of MCPs to the VI Admins
For years, VI admins have relied on vROPS—now called VMware Cloud Foundation Operations (VCF Operations)—to monitor, troubleshoot, and optimize their environments. But working with these tools often means context switching, learning multiple interfaces, and spending valuable time digging through dashboards to find the right piece of data.

What if you could talk to your infrastructure directly? What if you could say, “What’s wrong with my VM roy-vra?” and get an actionable answer back?

That’s the power of the Model Context Protocol (MCP)—and why we built the VCF Operations MCP Server.

What is MCP?

The Model Context Protocol (MCP) is an emerging standard for connecting Large Language Model (LLM)-based agents with real systems. It provides a common way to define “tools” (APIs, queries, commands) that the agent can call when it needs to act or fetch information.

Think of MCP as a universal adapter for making AI useful instead of just chatty.
- It standardizes how LLMs call into external systems.
- It enables elicitations (interactive back-and-forth to resolve ambiguities).
- It allows VI admins to ask questions naturally—while the MCP server does the heavy lifting behind the scenes.
Introducing the VCF Operations MCP Server

Our MCP server connects directly to VCF Operations and exposes over 40 tools that cover the lifecycle of monitoring, troubleshooting, compliance, and optimization.

Here’s what it gives you:
- Resource Discovery: VMs, hosts, clusters, datastores, vCenters, vSAN clusters.
- Metrics & Performance: Retrieve VM and infrastructure metrics, forecasts, and relationships.
- Alerting & Anomalies: See alerts, symptoms, root causes, and anomalies.
- Troubleshooting & Optimization: SLA checks, recommendations, RCA insights.
- vSAN Management: Capacity analysis, health checks, and performance troubleshooting.
- Adapter Management: Explore adapters and resource kinds.
- Reports: Generate, download, and list reports with improved naming.
- Configuration: Manage SSL/TLS and connection settings.
Why VI Admins Should Care

Traditional monitoring requires dashboards, pivoting between tabs, and filtering lists. With MCP:
- Ask in plain English: “What’s the CPU usage over the past week for the VM with the IP 172.16.20.10?”
- Get guided answers: The MCP server uses elicitation to clarify ambiguous queries. “Did you mean VM roy-vra in cluster A or cluster B?”
- Automate reporting: “Generate a VM performance report for the vSphere cluster where the VM with the IP 10.100.100.100 runs.”
- Troubleshoot faster: “Give me a root cause analysis of the issues for VM roy-vra with actionable next steps.”
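The elicitation behavior described above boils down to a simple rule: answer directly when a name resolves to exactly one resource, and ask a follow-up question when it does not. Here is a minimal sketch of that rule. It does not use any MCP SDK, and the `Vm`, `Resolution`, and `resolveVm` names are hypothetical.

```typescript
// Hypothetical types; this is not the MCP elicitation API.
interface Vm {
  name: string;
  cluster: string;
}

type Resolution =
  | { kind: 'resolved'; vm: Vm }
  | { kind: 'elicit'; question: string; options: string[] }
  | { kind: 'not_found' };

function resolveVm(vmName: string, inventory: Vm[]): Resolution {
  const matches = inventory.filter(vm => vm.name === vmName);
  const first = matches[0];
  if (first === undefined) {
    return { kind: 'not_found' };
  }
  if (matches.length === 1) {
    return { kind: 'resolved', vm: first };
  }
  // More than one candidate: ask the user instead of guessing
  const options = matches.map(vm => vm.cluster);
  return {
    kind: 'elicit',
    question: `Did you mean VM ${vmName} in ${options.join(' or ')}?`,
    options,
  };
}

const vmInventory: Vm[] = [
  { name: 'roy-vra', cluster: 'cluster A' },
  { name: 'roy-vra', cluster: 'cluster B' },
  { name: 'db-01', cluster: 'cluster A' },
];

const ambiguous = resolveVm('roy-vra', vmInventory);
const unique = resolveVm('db-01', vmInventory);
console.log(ambiguous);
console.log(unique);
```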
Easy to Run

The MCP server is designed to be simple to deploy:
- Run directly with Python 3.11+
- Or run as a Docker container
- Remote MCP support is coming soon
Once running, you can connect it to your favorite MCP-enabled agents. Currently, elicitation support varies between agents: it works well in Cursor and in GitHub Copilot for VS Code, while other agents such as Claude Desktop and Claude Code will hopefully add support soon.

Example Scenarios

Here are some real-world use cases where MCP makes a difference:

1. Quick Health Checks

Prompt: “What is wrong with my VM 172.16.20.125?”
Response:

2. Deep dive on a specific area

Prompt: “Please provide a detailed report on the CPU usage for the past 2 weeks for the VM roy-vra”
Response:

3. Report Generation

Prompt: “Which reports can i generate for the VM roy-vra”
Response:

Prompt: “Please generate a capacity report for this VM in PDF format”
Response:

The Future of MCP for VI Admins

The VCF Operations MCP Server is just the beginning. As more agents support MCP and its emerging feature set, VI admins will be able to:
- Seamlessly integrate AI into daily operations.
- Eliminate time wasted searching for the right dashboard.
- Move from reactive troubleshooting to proactive optimization.
We’re making infrastructure conversations natural—so you can focus less on clicks and more on outcomes.

Closing Thoughts

The role of the VI admin is evolving. With MCP, you don’t just have dashboards—you have an AI-powered assistant that knows your environment, speaks your language, and helps you troubleshoot and optimize faster.

If you’ve ever wished your infrastructure could talk back, now it can.

Bringing the power of MCPs to VI admins starts here.
September 10, 2025

AI, Observability, VCF, vrealize
Crossing Planes – Enhanced Backstage + Crossplane Integration

Two of my favorite CNCF projects are Backstage and Crossplane. Just under 6 months ago we released to the community the first versions of our Backstage plugins, which integrate Crossplane and Backstage. It has been great to see the growing interest and adoption of the plugins over the past few months. With over 3.5k downloads of the frontend plugin and over 7.6k downloads of the Kubernetes Ingestor plugin, we can see that adoption is growing, and so are external contributions from multiple contributors at different companies, which is really awesome to see.

I have had the privilege of talking with multiple companies using these plugins, and the feedback has been really great.

At the end of May we had an awesome conference here in Israel called Platforma, the first Platform Engineering conference in Israel. The conference itself was great, with good attendance, great talks, and great hallway conversations with some amazing technologists from many different companies, all coming together to discuss Platform Engineering.

At Platforma I had the honor of giving a talk about integrating Backstage and Crossplane. Below you can see the recording of the session, which was a really fun one.

The feedback for the session was great, and it was really cool to have multiple people come up to me afterwards and mention that they are already using these plugins in their organizations.

One question did keep coming up though, which was when we will add support for Crossplane V2.

Crossplane V2 is the upcoming major release of Crossplane, which will include many hugely beneficial redesigns and improvements that the community is super excited about. For more information on what to expect with Crossplane V2, I recommend watching the video below:

As the changes are quite extensive, and the options have grown to supporting multiple different API versions, scopes of resources, and modes of operation, I was hesitant at first to start this work until the Crossplane V2 design was finalized.

While that was my initial plan, I kept getting more and more requests in GitHub issues, DMs on Slack, messages from friends on WhatsApp, and more. Finally I decided it was time to invest some design and development effort to get this work started.

The benefit was that there were almost no breaking changes in Crossplane V2, so the changes were mostly additive and the code refactoring was less than initially expected.

I am glad to announce that the TeraSky-OSS Backstage plugins now seamlessly support both Crossplane V1 and V2, and should work with all supported permutations of the new APIs.

As the proposal for Crossplane V2 is not yet finalized, things may change and new releases will be needed, possibly including breaking changes based on the final design. We are committed to updating these plugins as the design matures and reaches its final state.

The current goal is for Crossplane V2 to GA in August 2025, but the preview version is already available including docs which have been revamped for Crossplane V2 and are available on the Crossplane site.

We also now have support for custom auth methods in the Kubernetes Ingestor plugin thanks to an external contribution. This allows the plugins to integrate, for example, with Upbound hosted control planes and not just with OSS Crossplane, again growing the possibilities and increasing the usability of the plugins.

If you are looking into Crossplane V2 and want a cool and nice UI to create XRs and to visualize and manage them on day 2, all following GitOps practices, go try out the plugins!

Thanks again to the awesome Backstage and Crossplane communities for all the feedback and the growing number of contributions in recent months!

June 10, 2025

Backstage, CNCF, Crossplane, DevOps, Platform Engineering
Building useful READMEs with runme.dev

One of the things I go through periodically is the list of new projects added to the CNCF. This is a great way to see some really cool and awesome open source software!

As I was going through the recently added projects I found one that really clicked, and I was super excited to try it out!

The project is called runme.dev and they have built an awesome project similar to Jupyter Notebooks but for Operational DevOps workbooks!

Runme.dev is a CLI and VS Code extension that transforms markdown files—particularly README.md, notebooks, and technical documentation—into executable, interactive workflows. It parses shell code blocks and integrates with your terminal, allowing developers to run commands directly within the markdown context. This improves reproducibility, streamlines onboarding, and reduces friction in DevOps, data science, and API workflows. By making docs runnable, Runme helps eliminate context switching and keeps execution history in sync with documentation.

The idea of using simple metadata on code blocks in Markdown files to seamlessly make your docs executable is amazing, and the potential is huge!

Not only can it trigger simple shell commands and executables, it also has awesome integrations with nice visualizations for cloud resources, which embed a console view of the resources directly in the notebook’s output sections.

The customization capabilities are also really amazing! The combination of simplicity and powerful capabilities is truly awesome to see.

In the short video below I go through an example of turning an existing README.md file I have for setting up a basic Crossplane environment on kind into an interactive and powerful notebook.

As can be seen in the video, the power of the tool is immense, and that is just the basic options, with much more advanced configuration knobs and opportunities we did not explore here.

This is an amazing addition to the CNCF in my mind, and I really look forward to seeing this project grow and to embedding it in my day-to-day work as well, because the potential of such a simple yet powerful tool is really awesome!

April 29, 2025

CNCF, Crossplane, DevOps, Platform Engineering
Enhancing The Backstage Plugins For Crossplane

What has changed

The initial feedback regarding the Backstage plugins we built, which can be found here, has been really great, especially for the Crossplane related plugins!

While the initial proof of concept work showed potential, it was heavily in need of better performance, enhanced functionality, and better integrations between the different parts of the stack.

We have been hard at work making some key needed changes and in this post I want to discuss some of the key changes we have made, and what these mean practically.

Performance Improvements

We have reevaluated and improved the pulling of data from Kubernetes clusters to now work at nearly 10x the speed due to more exact and targeted API calls against our clusters, making the UX much smoother. We have also added in partial rendering of data as it comes in instead of the previous behavior where we waited for all data before anything was shown. This gives a much more fluid user experience.

Additional Resource Data

While initially the plugin would show the managed resources, the Claim and the Composite resources, many times we also want to see the additional resources which are relevant to this claim. For this we have added an additional table which provides the same visibility into the relevant XRD, Composition, Providers, and Functions. We have also added the provider-config being used for each MR into the Managed resources table to give a better picture at a glance.

Crossplane Overview Card

While the Crossplane frontend plugin still provides the 2 tabs with tabular and graph based visualizations, sometimes a bird’s-eye view is all that is needed. This is now available via a simple card that can be added to the component’s overview page, showing basic data about the claim, its status, and general information. This overview card streamlines the developer experience by providing upfront data about the status of their Crossplane resources without them needing to deep dive and switch tabs.

Kyverno Integration

One of the tools commonly used alongside Crossplane is Kyverno, an amazing policy management tool for Kubernetes. We have built a dedicated Kyverno Policy Report plugin which can visualize Kyverno Policy Reports related to the Kubernetes resources of a component directly in Backstage. We have also enhanced the plugin to integrate better with Crossplane Claim backed components, making the user experience much smoother. Not only can a developer create Crossplane resources via Backstage using the Software Templates auto-generated for each XRD by the Kubernetes Ingestor plugin, and visualize their Crossplane Claims and underlying resources via the Crossplane Resources plugin, they can now also get clear visibility into policy violations and policy adherence directly in Backstage via the Kyverno Policy Reports plugin! We have also added an overview card for this plugin, making a more complete and cohesive bird’s-eye view possible for the Crossplane Claims of any component.

Day 2 Updates

While the Kubernetes Ingestor plugin auto-generates Software Templates for the creation of Crossplane Claims, we all know that day 2 operations and updates are the true challenge that needs to be tackled. Day 0 and day 1 tasks can easily be streamlined, but day 2 maintenance in Backstage is still a story in the making. This is where the new Crossplane Claim Updater plugin comes into play. The new plugin is based on another plugin we released called the Entity Scaffolder Content plugin. Together, these plugins allow us to embed the Backstage Scaffolder in a tab on a component and provide contextual data as a starting point for filling out the forms of a software template. The Claim Updater, via a custom field extension for the scaffolder along with a provided Software Template, allows a user to request an update to a claim manifest. When the user runs this flow, the plugin retrieves the latest schema of the resource definition based on the XRD’s OpenAPI schema, pulls in the existing manifest from GitHub with the current values set by the user, generates a form based on the OpenAPI schema, applies the current values from Git to the form, and then allows the user to make any changes needed. When the user submits the form, a new PR is created in the GitOps repo with the new desired state. Once it is merged, GitOps tools like Flux CD or Argo CD will pick up the changes and update the claims within your Kubernetes clusters!
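The heart of that flow is the step where the latest schema and the current values from Git are combined into a pre-filled form. Here is a minimal sketch of one way that merge could work. It is not the plugin’s actual code: the `SchemaProperty` type is a tiny hypothetical subset of OpenAPI, and the choice to drop fields that no longer exist in the schema is an assumption.

```typescript
// Hypothetical, heavily simplified subset of an OpenAPI v3 schema.
interface SchemaProperty {
  type: 'string' | 'number' | 'boolean';
  default?: string | number | boolean;
}

interface ObjectSchema {
  properties: Record<string, SchemaProperty>;
}

type FormValues = Record<string, string | number | boolean | undefined>;

// Every field in the latest schema appears in the form. Values from the
// manifest in Git win over schema defaults, and fields that are no longer
// part of the schema are dropped (an assumption of this sketch).
function prefillForm(schema: ObjectSchema, current: FormValues): FormValues {
  const form: FormValues = {};
  for (const [key, property] of Object.entries(schema.properties)) {
    form[key] = current[key] ?? property.default;
  }
  return form;
}

const latestSchema: ObjectSchema = {
  properties: {
    size: { type: 'string', default: 'small' },
    replicas: { type: 'number', default: 1 },
    region: { type: 'string' },
  },
};

// Values as they might be read from the existing manifest in Git
const valuesFromGit: FormValues = { size: 'large', legacyFlag: true };

const prefilled = prefillForm(latestSchema, valuesFromGit);
console.log(prefilled);
```

The user then edits the pre-filled values, and only the submitted result is written back to the GitOps repo as a PR.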

Going Beyond Crossplane

While Crossplane is in my opinion the best mechanism for building custom APIs and service offerings within a platform, other tools exist in this area like KRO, Kratix, and KubeVela, as well as many purpose-built Kubernetes Operators, and we want to manage the lifecycle of those entities via Backstage as well. While full support for generic CRDs is not yet built out, and has many challenges in terms of design and possibilities, we have begun to extend the capabilities of the plugins to support generic CRDs. The first part of this integration is now complete: Software Templates can be generated for any CRD, and the Claim Updater plugin has been extended to support the day 2 management of custom resources not created via Crossplane claims. The visualization aspects that we have for Crossplane claims do not yet exist for generic CRDs, but we are looking into how this can be achieved. Because clusters typically have many CRDs, and we don’t want to offer all of them as Software Templates, we have added the ability to provide a static list of CRDs via the app-config.yaml as well as a label selector, where any CRD with the matching label will have a Software Template auto-generated for it. These CRs will also be automatically added as components into the software catalog, just like claims and other core Kubernetes resources.
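The two selection mechanisms, a static list and a label selector, can be sketched as a single filter. This is an illustration only: the `CrdSelection` shape, the label key, and the CRD names are hypothetical and do not reflect the plugin’s real configuration keys, and the sketch assumes that matching either mechanism is enough for a CRD to be selected.

```typescript
// Hypothetical shapes; not the plugin's real configuration schema.
interface Crd {
  name: string;
  labels: Record<string, string>;
}

interface CrdSelection {
  crds?: string[]; // static list of CRD names
  labelSelector?: { key: string; value: string };
}

// A CRD is selected if it is named explicitly OR carries the matching label.
function selectCrds(all: Crd[], selection: CrdSelection): Crd[] {
  const named = new Set(selection.crds ?? []);
  const selector = selection.labelSelector;
  return all.filter(
    crd =>
      named.has(crd.name) ||
      (selector !== undefined && crd.labels[selector.key] === selector.value),
  );
}

const clusterCrds: Crd[] = [
  { name: 'databases.example.org', labels: {} },
  { name: 'queues.example.org', labels: { 'example.org/expose': 'true' } },
  { name: 'caches.example.org', labels: {} },
];

const selectedCrds = selectCrds(clusterCrds, {
  crds: ['databases.example.org'],
  labelSelector: { key: 'example.org/expose', value: 'true' },
});
console.log(selectedCrds.map(crd => crd.name));
```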

Additional Plugins

While the main focus of this post is on the Crossplane related plugins, I also want to call out the additional plugins we have released in the same repo: a DevPod integration, for easily launching remote dev environments using DevPod in your IDE of choice directly from Backstage, and a plugin that brings in data from ScaleOps, the best workload rightsizing tool for Kubernetes on the market.

Final Thoughts

These plugins are still in their early stages, still need polishing, and are missing features, but they already prove to be very valuable and are paving a path toward being a great building block for an amazing IDP! If you haven’t seen the video that Viktor Farcic did on these plugins, I strongly recommend checking it out. You can also check out a demo I did as part of an episode of You Choose, which was a battle between Backstage and Port for the people’s choice for an Internal Developer Portal, and which can be seen at this link.

February 25, 2025

Backstage, Crossplane, gitops, Platform Engineering
Integrating Backstage With Crossplane
In the platform engineering space, two of the most impactful and interesting tools are Backstage and Crossplane. In this post, we will discuss and demonstrate how the two tools can be integrated to provide a great baseline for the ultimate developer platform.

What is Backstage

Backstage is an open platform for building developer portals. It provides a unified interface for managing software, infrastructure, and resources. By leveraging Backstage’s plugin architecture, organizations can customize and extend its capabilities to suit their needs, offering developers a centralized view of their tools and workflows.

What is Crossplane

Crossplane is an open-source tool that enables organizations to provision and manage cloud infrastructure and services using Kubernetes declaratively. It extends Kubernetes APIs with custom resource definitions (CRDs) to manage both infrastructure and application workloads in a unified way.

What is Possible Out of the Box

Out of the box, Backstage offers plugins and templates to manage software catalogs, scaffolding, CI/CD integrations, and more. Similarly, Crossplane provides a rich ecosystem for declaratively managing infrastructure resources. However, the integration of the two offers a unique capability to visualize and manage infrastructure as part of the developer workflow.

What is Missing Currently

While Backstage and Crossplane are powerful tools individually, some gaps exist in their integration:
- Seamless visualization of Crossplane resources in Backstage.
- Auto-ingestion of Kubernetes resources into Backstage catalogs.
- Enhanced scaffolding capabilities tailored for Crossplane resources.
- Efficient management and cleaning of catalog entities.
How This Can Be Solved

This is where custom plugins come into play. The plugins crossplane-resources, kubernetes-ingestor, and scaffolder-backend-module-terasky-utils bridge these gaps by providing enhanced integration capabilities.

The source code of these plugins, as well as installation instructions for each of the plugins discussed below and more, can be found on GitHub in the following repository.

Building a Catalog Processor

The kubernetes-ingestor plugin acts as a catalog processor, ingesting Kubernetes workloads and Crossplane claims directly into Backstage. It supports annotations to customize the ingestion process.
For each workload that is registered based on a standard Kubernetes resource, we will receive the following type of UX out of the box, without any annotations or custom configurations:

And for Crossplane Claims it will look like this:
Example Configuration:
```
kubernetesIngestor:
  components:
    enabled: true
    taskRunner:
      frequency: 10
      timeout: 600
  crossplane:
    claims:
      ingestAllClaims: true
```
This configuration enables automatic ingestion of claims and other Kubernetes resources into the catalog without needing to create and manage catalog-info.yaml files in git for each object. Via annotations, we can control how the Backstage component is created by providing the data directly on the Kubernetes resources. While this is very beneficial for Crossplane, it is just as useful across the entire Kubernetes ecosystem.

An example of the usage of these annotations on a standard Kubernetes Deployment resource could look like this:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yelb-appserver
  labels:
    app: yelb-appserver
    tier: middletier
  annotations:
    terasky.backstage.io/add-to-catalog: 'true'
    terasky.backstage.io/owner: dev-team
    terasky.backstage.io/component-type: service
    terasky.backstage.io/lifecycle: experimental
    terasky.backstage.io/dependsOn: 'Component:yelb-db,Component:yelb-redis'
    terasky.backstage.io/kubernetes-label-selector: 'app=yelb-appserver,tier=middletier'
    terasky.backstage.io/source-code-repo-url: "https://github.com/dambor/yelb-catalog"
    terasky.backstage.io/source-branch: "main"
    terasky.backstage.io/techdocs-path: "components/yelb-appserver"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yelb-appserver
      tier: middletier
  template:
    metadata:
      labels:
        app: yelb-appserver
        tier: middletier
    spec:
      containers:
      - name: yelb-appserver
        image: mreferre/yelb-appserver:0.7
        ports:
        - containerPort: 4567
```
This will end up with a resource that has much more information attached and is much more accurately documented in Backstage:
The potential of this, and the simplification it brings to catalog management, is endless, and with small customizations in the code, you could add any other annotations or logic as needed per your organization's needs.
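
To make the annotation-driven flow more concrete, here is a minimal sketch of how annotations like the ones above could be mapped onto a Backstage Component entity. This is not the kubernetes-ingestor source code: the function name toComponent, the ingestAll parameter, and the fallback values used when an annotation is missing are all assumptions made for illustration.

```typescript
// Hypothetical sketch, NOT the plugin's real implementation.
const PREFIX = "terasky.backstage.io";

interface K8sObject {
  metadata: {
    name: string;
    namespace?: string;
    annotations?: Record<string, string>;
  };
}

interface ComponentEntity {
  apiVersion: "backstage.io/v1alpha1";
  kind: "Component";
  metadata: { name: string; namespace: string; annotations: Record<string, string> };
  spec: { type: string; lifecycle: string; owner: string; dependsOn: string[] };
}

// Returns undefined when the resource has not opted in to the catalog.
function toComponent(obj: K8sObject, ingestAll = false): ComponentEntity | undefined {
  const ann = obj.metadata.annotations ?? {};
  const get = (key: string): string | undefined => ann[`${PREFIX}/${key}`];

  if (!ingestAll && get("add-to-catalog") !== "true") return undefined;

  // Carry the label selector over to the annotation the Backstage
  // Kubernetes plugin uses to find the matching workloads.
  const entityAnnotations: Record<string, string> = {};
  const selector = get("kubernetes-label-selector");
  if (selector) {
    entityAnnotations["backstage.io/kubernetes-label-selector"] = selector;
  }

  return {
    apiVersion: "backstage.io/v1alpha1",
    kind: "Component",
    metadata: {
      name: obj.metadata.name,
      namespace: obj.metadata.namespace ?? "default",
      annotations: entityAnnotations,
    },
    spec: {
      // The fallback values below are assumptions, not the plugin's defaults.
      type: get("component-type") ?? "service",
      lifecycle: get("lifecycle") ?? "production",
      owner: get("owner") ?? "unknown",
      dependsOn: (get("dependsOn") ?? "")
        .split(",")
        .map(s => s.trim())
        .filter(s => s.length > 0),
    },
  };
}
```

Keeping the mapping in a single pure function like this is what makes it easy to add your own annotations: each new annotation is just one more get(...) call.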

Building a Frontend Plugin

The crossplane-resources plugin provides a frontend interface for visualizing Crossplane resources. Developers can view resource details, YAML manifests, and relationships through a graph view, directly on the auto-ingested components based on any crossplane claim in your attached clusters!

Example Usage in Entity Pages
```
<EntityLayout>
  <EntityLayout.Route path="/crossplane-resources" title="Crossplane Resources">
    <CrossplaneAllResourcesTable />
  </EntityLayout.Route>
  <EntityLayout.Route path="/crossplane-graph" title="Crossplane Graph">
    <CrossplaneResourceGraph />
  </EntityLayout.Route>
</EntityLayout>
```
This adds Crossplane-specific views to Backstage’s entity pages.

There are 2 main options for the visualization.

Table View
This also supports viewing the YAML of a resource:

And an events viewer:

Graph View

This also supports viewing the YAML and events of a resource:

Generating Software Templates

Beyond ingesting resources as components into Backstage, the kubernetes-ingestor plugin can also generate Backstage software templates for Crossplane claims based on the corresponding XRDs. These templates are automatically registered in the Backstage catalog and provide an auto-updating, constantly evolving self-service portal for generating manifests for your custom APIs defined in Crossplane. You don't need to duplicate the work of defining the schema in Backstage in order to expose a user-friendly UI above the Crossplane abstractions.

The flow in these templates is very simple:

Provide the basic metadata

This is where you supply the name and namespace of the resource you want to create:

Resource Schema

The next page is the spec fields you have defined in your XRD:

This will enforce required fields, enums, field validations and more, just like you defined in your XRD!
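
As a rough illustration of the kinds of rules that get enforced, here is a small sketch that checks user input against a simplified schema. In the generated templates this validation is driven by the schema taken from the XRD, not by hand-written code like this. The SpecSchema shape is a cut-down, hypothetical subset of an OpenAPI v3 schema, and validateSpec is a made-up helper name.

```typescript
// Hypothetical, simplified subset of the OpenAPI v3 schema an XRD carries.
interface FieldSchema {
  type: "string" | "integer" | "boolean";
  enum?: string[];
  minimum?: number;
  maximum?: number;
}

interface SpecSchema {
  required?: string[];
  properties: Record<string, FieldSchema>;
}

// Returns a list of human-readable validation errors (empty when valid).
function validateSpec(schema: SpecSchema, input: Record<string, unknown>): string[] {
  const errors: string[] = [];

  // Required fields must be present and non-empty.
  for (const name of schema.required ?? []) {
    if (input[name] === undefined || input[name] === "") {
      errors.push(`${name} is required`);
    }
  }

  for (const [name, field] of Object.entries(schema.properties)) {
    const value = input[name];
    if (value === undefined) continue;

    // Enum fields only accept one of the listed values.
    if (field.enum && !field.enum.includes(String(value))) {
      errors.push(`${name} must be one of: ${field.enum.join(", ")}`);
    }

    // Numeric bounds, when the schema defines them.
    if (typeof value === "number") {
      if (field.minimum !== undefined && value < field.minimum) {
        errors.push(`${name} must be >= ${field.minimum}`);
      }
      if (field.maximum !== undefined && value > field.maximum) {
        errors.push(`${name} must be <= ${field.maximum}`);
      }
    }
  }
  return errors;
}
```

The point is that every rule comes from the schema object, so when the XRD changes, the form and its validation change with it.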

Crossplane Settings

This is where you can set the Crossplane-specific settings, such as the name of the connection secret to create, lifecycle-related fields, and the composition selection method. The selection method defaults to runtime, which means the composition will be chosen at runtime and not specified in the manifest:

You can also choose a direct reference. When this option is selected, a dropdown of all the discovered compositions is provided to select from:

Or you can select to use the label based selection:
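
The three selection methods map onto standard fields of a Crossplane claim: compositionRef for a direct reference, compositionSelector.matchLabels for label-based selection, and neither of them for runtime selection. A minimal sketch of that mapping is shown here. The mode names and the crossplaneSpecFields helper are hypothetical and are not taken from the plugin.

```typescript
// Hypothetical form model; the mode names are assumptions for illustration.
type CompositionSelection =
  | { mode: "runtime" }
  | { mode: "direct-reference"; compositionName: string }
  | { mode: "label-selector"; matchLabels: Record<string, string> };

interface CrossplaneSettings {
  connectionSecretName?: string;
  selection: CompositionSelection;
}

// Builds only the Crossplane-specific part of the claim spec.
function crossplaneSpecFields(settings: CrossplaneSettings): Record<string, unknown> {
  const spec: Record<string, unknown> = {};

  if (settings.connectionSecretName) {
    spec["writeConnectionSecretToRef"] = { name: settings.connectionSecretName };
  }

  const sel = settings.selection;
  switch (sel.mode) {
    case "runtime":
      // Nothing is written; Crossplane picks the composition at runtime.
      break;
    case "direct-reference":
      spec["compositionRef"] = { name: sel.compositionName };
      break;
    case "label-selector":
      spec["compositionSelector"] = { matchLabels: sel.matchLabels };
      break;
  }
  return spec;
}
```

Modeling the selection as a discriminated union means a manifest can never end up with both a compositionRef and a compositionSelector by accident.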

Creation Settings

These templates can be configured via the app-config.yaml to publish the generated manifests to git by creating a PR for you, or to simply generate a YAML file which you can download. The recommended approach is to auto-push the manifests to git and have a GitOps tool like FluxCD, Carvel Kapp Controller, or ArgoCD auto-deploy these manifests to your cluster, for a seamless end-to-end experience. When choosing GitOps, you can either hide the repo selection and have it defined globally in the app-config.yaml, or support repo selection in the template itself as shown below. This page also enables you to select how the manifest should be placed in the GitOps repo. The default option is cluster-scoped. When it is selected, you can choose from a list of the connected clusters where this XRD has been found, and the template will push a copy of the manifest into cluster-dedicated folders in the GitOps repo:

You could also choose to have it be namespace-scoped, for cases where you deploy the same configurations to all clusters and don’t maintain separate folder structures for each cluster:

And if needed there is also a custom option where you can supply the path in the repo you want the file created under:

There may also be cases where the environment is set up for GitOps-based management, but you don’t want to push a particular manifest to git. For this use case, you can uncheck the box at the top of the page, and no manifest will be pushed to git, but it will still be available for download directly.

Review and Submit

After reviewing the inputs and submitting the form you will receive the results:

In this example we get the link to the generated PR, as well as a link to download the generated YAML manifest directly. Because we selected 2 clusters “kubetopus” and “spectro-shared” and our SQLServerInstance is set to be deployed in the demo namespace, when we look at the PR we will see 2 files being created:
The file paths are in the format <CLUSTER NAME>/<NAMESPACE>/<RESOURCE KIND>/<RESOURCE NAME>.yaml. This enables easy GitOps configuration, but it can easily be changed with just a few lines of code in the kubernetes-ingestor plugin if needed, or you can use one of the other methods mentioned above if they meet your needs.
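
The path layout is simple enough to sketch in a few lines. The helper below is illustrative only and is not the plugin's code. The resource name my-sql-server is a made-up example, and whether the real plugin normalizes the kind (for example by lowercasing it) is not something this sketch tries to reproduce.

```typescript
// Hypothetical helper showing the cluster-scoped layout described above.
interface ManifestTarget {
  clusters: string[];
  namespace: string;
  kind: string;
  name: string;
}

// One path per selected cluster: <cluster>/<namespace>/<kind>/<name>.yaml
function clusterScopedPaths(target: ManifestTarget): string[] {
  return target.clusters.map(cluster =>
    [cluster, target.namespace, target.kind, `${target.name}.yaml`].join("/"),
  );
}

const examplePaths = clusterScopedPaths({
  clusters: ["kubetopus", "spectro-shared"],
  namespace: "demo",
  kind: "SQLServerInstance",
  name: "my-sql-server", // made-up resource name
});
console.log(examplePaths);
// [ "kubetopus/demo/SQLServerInstance/my-sql-server.yaml",
//   "spectro-shared/demo/SQLServerInstance/my-sql-server.yaml" ]
```

With a layout like this, each cluster's GitOps tool only needs to watch its own top-level folder.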

Building Custom Scaffolder Actions

The scaffolder-backend-module-terasky-utils plugin introduces two custom actions. The first, terasky:claim-template, generates YAML manifests for Crossplane claims based on the parameters entered by the user. The second, terasky:catalog-info-cleaner, cleans catalog entities so that the auto-generated component manifests from the Kubernetes Ingestor plugin can be pushed to Git, if and when you decide that you want to manage your catalog in the more traditional manner rather than via Kubernetes annotations.

Example Action: Claim Template
```
steps:
  - id: generate-claim-manifest
    name: Generate Kubernetes Resource Manifest
    action: terasky:claim-template
    input:
      parameters: ${{ parameters }}
      apiVersion: vrabbi.cloud/v1alpha1
      kind: MyCustomCrossplaneAPI
      clusters: ${{ parameters.clusters }}
```
This generates a YAML manifest for a claim and saves it to the filesystem. This can then be further processed and pushed to a git repo for example using the github, gitlab, bitbucket or azure devops actions, or exposed as a yaml file for downloading directly.

Implementing the Permission Framework

The crossplane-resources frontend plugin also includes configuration options to enable permission checks for different exposed elements, allowing you to tailor the view to different personas in the organization. This requires also installing the crossplane-permissions-backend plugin in order to add the permissions to the permission framework within backstage.

In the repository you can see that I have also integrated the community RBAC plugin, which lets you create roles and assign them to users and groups within your Backstage instance, either via CSV files or from the UI. This is a much more streamlined and simple process than implementing permission policies in TypeScript embedded within your Backstage code base. The permissions available in the plugin are granular by design, so that you can customize roles based on the different personas and requirements in your organization. When a user does not have any Crossplane-related permissions, they will see, for example:
But with the permissions to list claims and composite resources, and to view the YAML of claims only, it would look like this:
In order to enable the frontend to use the permission framework, you can simply add the following to your app-config.yaml:
```
crossplane:
  enablePermissions: true
```
Lessons Learned
- Clear integration points exist between Backstage and Crossplane, but they require thoughtful customization.
- Annotation-driven approaches simplify resource ingestion but need proper documentation for end-users.
- Combining frontend, backend, and scaffolding plugins creates a holistic developer experience.
- Building Backstage plugins is complex at the beginning but becomes much easier over time.
Next Steps
- Explore additional integrations and possible optimizations of the cluster querying, to speed up the frontend plugin rendering and to require fewer permissions from a Kubernetes perspective to run the plugins.
- Extend the scaffolder actions for more advanced workflows.
- Optimize the UI for larger-scale Crossplane resources, enabling a better UX for complex XRDs, including collapsible sections, and more UX tweaking.
Summary

By combining Backstage and Crossplane with custom plugins, organizations can offer developers a seamless platform experience. These plugins address key gaps in visibility, automation, and management, unlocking the potential for enhanced developer productivity.
December 25, 2024

Backstage, Crossplane, gitops, Platform Engineering

backstage, crossplane, IDP, platform engineering
Shift Down Security with KubeScape’s VEX Generation

One of the biggest challenges in the industry as a whole and in the DevOps world today in particular is vulnerability management.

As we in the industry try to implement better security practices and evolve the secure software supply chain, whether by choice or by necessity due to governmental or industry regulations and certifications, we run into a key issue: the difficulty of handling vulnerabilities at scale.

Vulnerability Exploitability eXchange (VEX) documents have become a critical part of modern software security practices. As organizations increasingly rely on Software Bill of Materials (SBOMs) to gain transparency into the components of their software, managing vulnerabilities has grown more complex. SBOMs provide a detailed list of all open-source libraries, dependencies, and third-party components within an application. However, while SBOMs are essential for identifying potential vulnerabilities, they often result in overwhelming lists of issues that may not be directly exploitable. This is where VEX documents come in—they act as a filter, providing actionable information about whether a vulnerability in a component is actually exploitable in a specific context.

The shift from simply identifying vulnerabilities to assessing their real-world impact has driven the rise of VEX. Without VEX documents, security teams would be forced to investigate each vulnerability individually, regardless of its exploitability, leading to resource drain and inefficiencies. VEX helps narrow the focus to only the vulnerabilities that present genuine risks, enabling better prioritization and more effective mitigation strategies. In the evolving landscape of software security, the combination of SBOMs and VEX has become a powerful duo, helping organizations shift from broad awareness to targeted action.

While the idea of VEX documents is great, generating them can be a very challenging task, but it doesn’t have to be!

Trying to figure out what is actually exploitable is a challenging task, but with some of the innovations in the industry, especially eBPF, we can make this a much more realistic problem to solve.

KubeScape, an amazing open source project and a CNCF Sandbox project, has a feature which allows for auto-generation of VEX documents for all of the applications running in our clusters!

The core functionality uses another CNCF Sandbox project called Inspektor Gadget, which KubeScape uses as a library. Inspektor Gadget is an open source eBPF debugging and data collection tool for Kubernetes and Linux. KubeScape uses it within its node-agent daemonset to collect the data needed for generating the VEX documents.

The way this works is that the node-agent, using eBPF probes, looks at the file activity of every running container. When a pod starts up on a node, the node-agent watches all of its containers for a “learning period” and saves the data in an activity log. In addition, the container images used within pods are automatically scanned by KubeScape using the Grype image scanner, which also outputs an SBOM. KubeScape then uses the SBOM and the activity log of what is actually being used within the container as the inputs to automatically generate a VEX document, which is saved to a Custom Resource we can use for any security measures we need.
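
The core idea, combining the SBOM with the activity log, can be sketched in a few lines. This is a deliberately simplified illustration and not KubeScape's implementation: the data shapes and the generateVex function are hypothetical. The status and justification strings are values defined by the OpenVEX specification, and using vulnerable_code_not_in_execute_path for unused components is an assumption made for this sketch.

```typescript
// Simplified, hypothetical data shapes for illustration only.
interface SbomComponent {
  name: string;
  files: string[];           // files this package installs in the image
  vulnerabilities: string[]; // vulnerability IDs reported by the scanner
}

interface VexStatement {
  vulnerability: string;
  product: string;
  status: "affected" | "not_affected";
  justification?: "vulnerable_code_not_in_execute_path";
}

// A component counts as "in use" if any of its files was touched
// during the learning period.
function generateVex(
  image: string,
  sbom: SbomComponent[],
  accessedFiles: Set<string>,
): VexStatement[] {
  const statements: VexStatement[] = [];
  for (const component of sbom) {
    const used = component.files.some(file => accessedFiles.has(file));
    for (const vulnerability of component.vulnerabilities) {
      statements.push(
        used
          ? { vulnerability, product: image, status: "affected" }
          : {
              vulnerability,
              product: image,
              status: "not_affected",
              justification: "vulnerable_code_not_in_execute_path",
            },
      );
    }
  }
  return statements;
}
```

Even in this toy form you can see why the result is so useful: every vulnerability in a package that was never touched at runtime drops out of the list that needs urgent attention.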

There is also a great example showing how this can be used in CI using GitHub Actions, which can be found in the following repo. While it is just an example, and could definitely be improved by adding other elements such as signing the VEX document or adding it to your OCI registry using the new OCI 1.1 referrers API specification, it offers an amazing starting point!

Currently the VEX generation feature in KubeScape is experimental and therefore is not enabled by default, but enabling it is extremely easy. We simply need to add the following flag to our helm install command of the operator: "--set capabilities.vexGeneration=enable"

For more detailed information, I recommend checking out the official docs on this feature and trying it out in your own environments!

VEX Auto generation is a perfect example of implementing a Shift Down approach which is a crucial element of any successful Internal Developer Platform.

September 23, 2024

containers, Platform Engineering, security
Stop Shifting Left, Shift Down to your platform

One of the things I have a huge issue with in the industry today is the insistence on pushing a shift left approach.

According to the CNCF glossary, Shift Left is the practice of implementing tests, security, or other development practices early in the software development lifecycle rather than towards the end.

While this sounds good in theory, it nearly always fails in practice. Shift left, in my mind, is a buzzword with no clear definition, which means something completely different to every persona in every company. Some think about testing, others think about IaC, some think about security, while others think about application deployment or observability, and so on with every area of the SDLC. The only common denominator is that we increase the cognitive load on our developers, and they end up focusing less and less on their actual domain of expertise, which is writing code.

The most common area people try to shift left is security. The best explanation of the issues with this approach, as is often the case, comes from a great quote by Kelsey Hightower:

“I think we are asking developers to do too much by shifting everything left including security. While it should be a collaborative effort, the idea that developers need to become security experts, in addition to everything else, just isn’t sustainable.”

Shift left was introduced to try and help solve key challenges such as early bug detection, improving software quality, faster time to market, cost efficiency and removing IT and operations from being the blocker of innovation.

While that all sounds great in theory, the approach has serious challenges. From what I see in the industry, these are mainly skill gaps, increased initial workload, cultural resistance, poor balancing of speed vs. quality, lack of separation of concerns, and tooling and integration challenges due to many tools being built with different personas in mind.

The best analogy for the challenge of shift left, in my opinion, is the following story:

“Hey, Mr. Plumber, you’re pretty good at installing pipes in the unfinished walls of our new houses. But it’s kind of hard to schedule the drywall guy to show up the moment you’re done and sometimes the delay causes schedules to slip, so we’re thinking it’d be better if you’d just do the drywall too. And maybe while you’re at it, you can add a coat of paint or some wallpaper, since you’ll be at the wall anyway. That way we won’t have to wait for the painters either. Don’t worry, you’ll still have the same amount of time per job that you had when you were just doing the pipes, and you’ll still get paid the same.”

With these challenges in mind, I have started to push a new concept over the past year, which I have termed “Shift Down”.

Shifting down to the platform addresses the challenges that shift left was created to solve, but does so in a transparent and seamless way, increasing developer productivity without cognitive overload while maintaining a clear separation of concerns.

This approach, in my mind, is the essence of the successful platform engineering journey a company must embark on to truly advance their SDLC and DevEx.

Some key tools in the industry can be of huge assistance when trying to implement this approach, including both open source software and commercial products.

Let's examine a few of the main tools which help push this idea into practice:

- KubeScape – Automated VEX document generation
- Harbor – Automated image scanning in the registry on push
- Snyk – Code scanning in git with PR generation for security fixes
- Backstage – Central visibility for all elements of the SDLC and runtime information
- Software Templates – Custom application starters with security and standards integrated
- Kpack – Kubernetes-native build system using Cloud Native Buildpacks
- ScaleOps – Automatically fine-tune resource requests based on real-time data
- Pixie – Auto-instrumentation and observability for your applications
- Crossplane – Define custom APIs and abstractions above any resource or API

While this is still a new and evolving area, I believe that shifting down to the platform is the only true way forward. For a more detailed and lively explanation of what this approach entails, check out the recording of my talk on this exact topic from DevConf Boston 2024 earlier this year.

September 22, 2024

Backstage, containers, Crossplane, Platform Engineering, security