
Clean up NSX Edge VMs after an Update failure of a vCenter NSX Edge Cluster

  Symptoms

When a previous Update operation has failed:
  • You cannot start a new Update operation.
  • Triggering a new Update operation fails with the following error:
Operation not allowed for NSX Edge cluster of Cluster 'domain' in 'UPDATE_FAILED' state.
Purpose
This article explains how to clean up NSX Edge VMs when the Update operation fails for the vCenter NSX Edge Cluster.
Resolution
This is a known issue.

Currently, there is no resolution.
Workaround
Cleaning up after Update failures involves two steps:
  • Cleanup in the NSX Manager
  • Cleanup in the vCenter Server

Cleanup in the NSX Manager

Using NSX Manager API
  1. Retrieve the list of edge transport-nodes with this command: 
API - GET https://NSX-Manager-IP-Address/api/v1/transport-nodes?node_types=EdgeNode
  2. Using the output, select the edge transport-nodes whose names match the edge VM names, and note the corresponding transport-node IDs.
{
 "results": [
 {
 "node_id": "cfb25f8a-bfed-418b-aed2-e69b306d8673", >>>>>>>>>>> retrieve node-id
 "host_switch_spec": {
 "host_switches": [
 {
 "host_switch_name": "overlaySw",
...
 "node_settings": {
 "hostname": "edge1.domain.com",
 "dns_servers": [
 "10.162.204.1",
 "10.166.1.1"
 ],
 "enable_ssh": true,
 "allow_ssh_root_login": true
 },
 "resource_type": "EdgeNode",
 "id": "cfb25f8a-bfed-418b-aed2-e69b306d8673",
 "display_name": "edge1",    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< match edge VM name
 "external_id": "cfb25f8a-bfed-418b-aed2-e69b306d8673",
 "ip_addresses": [
 "10.160.141.28"
 ],
...
  3. Delete each of the edge transport-nodes using the node IDs:
API - DELETE https://NSX-Manager-IP-Address/api/v1/transport-nodes/<edge-transport-node-id>
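The API workflow above can be sketched in Python. This is a minimal, hedged sketch: the NSX Manager address is a placeholder, authentication is left to a caller-supplied opener, and only the matching logic is guaranteed to mirror the JSON shape shown above.

```python
import urllib.request

NSX_MANAGER = "NSX-Manager-IP-Address"  # placeholder, replace with your manager


def match_edge_node_ids(transport_nodes_response, edge_vm_names):
    """Return node ids whose display_name matches a failed edge VM name.

    transport_nodes_response is the parsed JSON body of
    GET /api/v1/transport-nodes?node_types=EdgeNode.
    """
    return [
        node["id"]
        for node in transport_nodes_response.get("results", [])
        if node.get("display_name") in edge_vm_names
    ]


def delete_edge_nodes(node_ids, opener):
    """Issue DELETE /api/v1/transport-nodes/<id> for each matched node.

    opener is a urllib opener already configured with NSX credentials;
    opener.open raises on a non-2xx response.
    """
    for node_id in node_ids:
        req = urllib.request.Request(
            f"https://{NSX_MANAGER}/api/v1/transport-nodes/{node_id}",
            method="DELETE",
        )
        opener.open(req)
```

For example, given the sample response shown earlier, `match_edge_node_ids(response, {"edge1"})` would return the single id `cfb25f8a-bfed-418b-aed2-e69b306d8673`.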

Using NSX Manager UI

Note: The NSX Manager UI can be accessed at https://NSX-Manager-IP-Address or via the NSX plugin in the vSphere Client UI.
  1. Go to NSX H5 plugin > System > Fabric > Transport nodes > Edge Transport Nodes.
  2. Match the Edge VM names with the Edge field in the Edge Transport nodes page.
  3. Verify that the Edge VMs to delete are not already part of the Edge Cluster by checking that the Edge Cluster field for each Edge VM is blank.
  4. Delete the two Edge transport nodes corresponding to the VMs specified in the failed update by using the DELETE option. This step is expected to delete the corresponding VMs from vCenter Server as well.

Cleanup in the vCenter Server

If the failed VMs still appear in the vSphere Client UI after they are deleted from the NSX Manager, manually delete the two VMs using the vSphere Client UI.

NSXD Database Cleanup
  1. Identify the cluster-id corresponding to the given cluster. This can be obtained in either of these ways:
  • Browsing to the given vCenter Cluster in the vCenter Server MOB at https://VC-IP/mob
  • From the vSphere Client UI, navigating to the given cluster and noting down the part of the URL string with the pattern domain-X123.
  2. Connect to the VCSA with SSH as the root user.
  3. Stop the WCP service:
vmon-cli --stop wcp
  4. Access the NSX database:
/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -w
  5. After entering the VCDB=# prompt:
    1. Find the nsxd.edge_configuration table entry corresponding to the given cluster:
select * from nsxd.edge_configuration where cluster_id='domain-X123';
    2. Verify that the inprogress_update_spec column is non-empty.
    3. Set the inprogress_update_spec field to NULL:
UPDATE nsxd.EDGE_CONFIGURATION SET INPROGRESS_UPDATE_SPEC=NULL WHERE CLUSTER_ID='domain-X123';
 
  6. Make note of the Edge VM names used in the failed update.
  7. Check whether there are corresponding entries in the nsxd.EDGE_VM table for the given cluster:
SELECT * FROM nsxd.EDGE_VM where cluster_id='domain-X123';
  8. If entries corresponding to the failed VMs exist in the nsxd.EDGE_VM table, selectively delete them (make sure to delete ONLY the entries corresponding to the failed VM names):
DELETE FROM nsxd.EDGE_VM where cluster_id='domain-X123' and name_in_spec='VM-NAME-TO-DELETE';
  9. Verify that only the Edge VMs that have previously been successfully deployed and configured have entries in the nsxd.EDGE_VM table.
  10. Restart the WCP service:
vmon-cli --start wcp
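The database edits above can be captured as a small helper that builds the exact SQL to run at the VCDB=# prompt. This is an illustrative sketch only: the function name and the idea of generating statements per failed VM are assumptions, while the SQL text itself mirrors the statements shown above.

```python
def edge_db_cleanup_statements(cluster_id, failed_vm_names):
    """Build the SQL statements for the nsxd database cleanup.

    cluster_id is the domain-X123 identifier; failed_vm_names are the Edge VM
    names used in the failed update. Returns the statements in execution order.
    """
    statements = [
        # Clear the stuck in-progress update spec for the cluster.
        "UPDATE nsxd.EDGE_CONFIGURATION SET INPROGRESS_UPDATE_SPEC=NULL "
        f"WHERE CLUSTER_ID='{cluster_id}';",
    ]
    # Remove ONLY the nsxd.EDGE_VM rows for the failed edge VMs.
    for name in failed_vm_names:
        statements.append(
            f"DELETE FROM nsxd.EDGE_VM WHERE cluster_id='{cluster_id}' "
            f"AND name_in_spec='{name}';"
        )
    return statements
```

For two failed VMs this yields one UPDATE followed by two targeted DELETE statements, which keeps the entries of successfully deployed Edge VMs untouched.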

Once the above steps are complete, you should be able to update the Edge Cluster to add Edge Nodes using the same configuration as before or a new configuration.
