Finding and Removing Stale CNS Volumes in vSphere

During a recent reorganization of our vSphere lab environment, I was made aware of a very strange situation.
We use vSAN as our main storage in the lab, and we found it was over 95% used, yet only 50% of that usage could be attributed to VMs and system data.

After looking into this a bit, we found that the vast majority of the storage being used was actually orphaned FCDs (First Class Disks), which are the vSphere objects that map to PVs (Persistent Volumes) in Kubernetes when using the vSphere CSI driver.

This actually makes sense: we spin up Kubernetes clusters for testing on a daily basis, and when a cluster is deleted without first deleting its CSI volumes, those volumes simply remain indefinitely.

When trying to assess how best to clean up the environment, it became very clear that this aspect of the vSphere CSI and the CNS volume implementation in vSphere simply does not have a great flow for analyzing the situation, and there is no way to perform bulk operations.

Looking at the vSphere UI, we can navigate to the relevant datastore, and under the Monitor tab we will find a section called “Cloud Native Storage” with a “Container Volumes” page.

These are all the CNS volumes that reside on this specific datastore. If we click on the info card next to a volume's name, we can see additional information that vSphere has collected about it.

The first tab includes basic vSphere-related information: IDs, datastore placement, overall health, and data about the applied storage policy and the volume's compliance with it.

The next tab is where we really get interesting information: Kubernetes data about the persistent volume that this CNS volume represents.

This data includes the PV name, the namespace and name of the related PVC, all labels applied to that PVC, the pod or pods it is mounted to, as well as one other key data point: the Kubernetes cluster in which this PV was created.

With this information, we should be able to build a report of our persistent volumes with all the needed details, in order to assess what can and can't be deleted.

The first place I looked for such info was vRealize Operations; however, CNS volumes are not even collected by the vCenter adapter, so we need to find another solution.

The next place I looked to get this info in a clear way was RVTools. While RVTools has a tab listing what it believes to be orphaned VMDKs, and that list does include the CNS volumes, the needed Kubernetes metadata is not available, making it a no-go as well.

With this being the case, I decided to check what could be done from the CLI. For this specific use case, I went with the Go-based CLI tool govc, which is fast and really easy to use.

Setting up GOVC requires pointing it at vCenter and providing credentials (the vCenter address below is a placeholder):

export GOVC_URL=https://vcenter.example.com
export GOVC_INSECURE=true
export GOVC_USERNAME=scott
export GOVC_PASSWORD=SuperSecretPassword

The next step is to list the volumes on the vsanDatastore, which can be done using the govc volume.ls command:

govc volume.ls -ds=vsanDatastore

However, this command returns only the name and ID of each volume, not the metadata we need.

Luckily, govc can give us all of this data when using the “-json” flag:

govc volume.ls -ds=vsanDatastore -json

As we can see, we get back a huge amount of data for every CNS volume on the datastore, making this a great starting point.

The next part is parsing this data and extracting only the fields we need, which for this use case include:

  • Cluster Name
  • PV Name
  • Namespace of the PVC
  • Owner (The vSphere user the CSI driver used to create this volume)
  • Size of the volume
  • Volume ID
  • Datastore URL the volume resides on

Once we find where this data is located within the JSON response, we can use “jq”, the common CLI tool for JSON manipulation, to extract the needed fields:

govc volume.ls -ds=vsanDatastore -json | jq -r '.[]' | jq '.[] | {cluster: .Metadata.ContainerCluster.ClusterId, pvc: .Name, namespace: .Metadata.EntityMetadata[0].Namespace, owner: .Metadata.ContainerCluster.VSphereUser, sizeGB: (.BackingObjectDetails.CapacityInMb/1024), datastoreUrl: .DatastoreUrl, id: .VolumeId.Id}'
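To make the filter's structure concrete, here is a minimal, hand-written sample shaped like the govc JSON response (the field names are taken from the jq paths above; the wrapper key name and every value are invented), run through the same extraction pipeline:

```shell
# Hypothetical single-volume response shaped like govc's JSON output:
# an object wrapping an array of volumes. All values are invented.
cat > sample.json <<'EOF'
{
  "Volume": [
    {
      "VolumeId": { "Id": "f47ac10b-58cc-4372-a567-0e02b2c3d479" },
      "Name": "pvc-f47ac10b",
      "DatastoreUrl": "ds:///vmfs/volumes/vsan:52abc/",
      "BackingObjectDetails": { "CapacityInMb": 10240 },
      "Metadata": {
        "ContainerCluster": {
          "ClusterId": "test-cluster-01",
          "VSphereUser": "k8s-csi@vsphere.local"
        },
        "EntityMetadata": [ { "Namespace": "demo-apps" } ]
      }
    }
  ]
}
EOF

# The same extraction filter, fed from the sample file instead of a
# live vCenter; it emits one small JSON document per volume.
jq -r '.[]' sample.json | jq '.[] | {cluster: .Metadata.ContainerCluster.ClusterId, pvc: .Name, namespace: .Metadata.EntityMetadata[0].Namespace, owner: .Metadata.ContainerCluster.VSphereUser, sizeGB: (.BackingObjectDetails.CapacityInMb/1024), datastoreUrl: .DatastoreUrl, id: .VolumeId.Id}'
```

For the sample above, this prints a single document with cluster "test-cluster-01", sizeGB 10, and so on; against a real datastore there is one such document per CNS volume.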

This command will have an output similar to the following:

As we can see, the output is exactly the data we want. However, it would be nice to have it in CSV format, which would let us open it in Excel, for example, making the analysis much easier.

We can add a simple addition to the end of this command, again using jq, to convert this JSON data into a simple CSV.

The addition first needs to convert the multiple JSON documents output by the previous command into a single array, which is done via the following:

jq -s '.'

The next step is to extract the keys and make them the column headers, place all the values in rows beneath them, and finally export this as CSV:

jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv'
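As a quick sanity check with made-up records, feeding two documents in the shape the extraction step emits through these two filters produces a header row followed by one CSV row per volume:

```shell
# Two invented volume records (trimmed to three fields for brevity)
# in the shape the extraction step emits, converted to CSV with the
# same two jq filters described above.
printf '%s\n' \
  '{"cluster":"test-cluster-01","pvc":"pvc-aaa","sizeGB":10}' \
  '{"cluster":"test-cluster-02","pvc":"pvc-bbb","sizeGB":20}' |
  jq -s '.' |
  jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv'
```

Note that because jq's keys builtin sorts alphabetically, the column order in the CSV is alphabetical ("cluster","pvc","sizeGB" here) rather than the order used in the extraction filter.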

With all of this in place, we should receive a CSV formatted list of all the CNS volumes on our vsanDatastore, along with the metadata we need in order to analyze it.

The last step is to simply output this into a file which can be done very easily with a redirection of the output stream to a file. In the end, the final command to get this data is:

govc volume.ls -ds=vsanDatastore -json | jq -r '.[]' | jq '.[] | {cluster: .Metadata.ContainerCluster.ClusterId, pvc: .Name, namespace: .Metadata.EntityMetadata[0].Namespace, owner: .Metadata.ContainerCluster.VSphereUser, sizeGB: (.BackingObjectDetails.CapacityInMb/1024), datastoreUrl: .DatastoreUrl, id: .VolumeId.Id}' | jq -s '.' | jq -r '(map(keys) | add | unique) as $cols | map(. as $row | $cols | map($row[.])) as $rows | $cols, $rows[] | @csv' > vsanDatastore-cns-volumes.csv

Once we have gone through the list and identified the CNS volumes we want to delete, we can create a text file containing the IDs of those volumes, which can be found in the third ("id") column of the output CSV.

Once we have a file that looks like this:

We can then run a command that loops through this list and deletes each volume via the govc volume.rm command:

xargs -a cns-vols-to-delete.txt -I{} -d'\n' govc volume.rm {}
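Since volume deletion is irreversible, it can be worth previewing the exact commands before running them for real. Replacing govc with echo turns the same xargs loop into a dry run (the file name and volume IDs below are invented):

```shell
# A stand-in for cns-vols-to-delete.txt with two invented volume IDs.
printf '%s\n' \
  'f47ac10b-58cc-4372-a567-0e02b2c3d479' \
  '9b2b1c44-7d8e-4f10-a1b2-c3d4e5f60789' > sample-vols.txt

# Dry run (GNU xargs): print each govc command instead of executing it.
# Drop the "echo" to perform the actual deletions.
xargs -a sample-vols.txt -I{} -d'\n' echo govc volume.rm {}
```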


While neither the UI nor the standard vSphere monitoring and reporting tools are very useful for these tasks, it is good that this information is accessible via automation tools like govc. That lets us solve issues like this, in the interim at least, until a more streamlined approach is available via the vSphere UI or other official VMware tooling.

Another hope I have is that down the road, Kubernetes distributions like TKG, OCP, Rancher, EKS-A and others will gain a mechanism to delete all remaining PVs at cluster deletion time, which would eliminate this issue entirely.

I hope this was helpful; in my lab it freed over 6TB of storage, which is huge!
