
Clean up NSX Edge VMs after an Update failure of a vCenter NSX Edge Cluster

  Symptoms

When a previous Update operation has failed:
  • You cannot start a new Update operation.
  • Triggering a new Update operation fails with the following error:
Operation not allowed for NSX Edge cluster of Cluster 'domain' in 'UPDATE_FAILED' state.
Purpose
This article explains how to clean up NSX Edge VMs when the Update operation fails for the vCenter NSX Edge Cluster.
Resolution
This is a known issue.

Currently, there is no resolution.
Workaround
Cleaning up after Update failures involves two steps:
  • Cleanup in the NSX Manager
  • Cleanup in the vCenter Server

Cleanup in the NSX Manager

Using NSX Manager API
  1. Retrieve the list of edge transport-nodes with this command: 
API - GET https://NSX-Manager-IP-Address/api/v1/transport-nodes?node_types=EdgeNode
  2. Using the output, select the edge transport-nodes whose names match the edge VM names, and note the corresponding transport-node IDs.
{
 "results": [
 {
 "node_id": "cfb25f8a-bfed-418b-aed2-e69b306d8673", >>>>>>>>>>> retrieve node-id
 "host_switch_spec": {
 "host_switches": [
 {
 "host_switch_name": "overlaySw",
...
 "node_settings": {
 "hostname": "edge1.domain.com",
 "dns_servers": [
 "10.162.204.1",
 "10.166.1.1"
 ],
 "enable_ssh": true,
 "allow_ssh_root_login": true
 },
 "resource_type": "EdgeNode",
 "id": "cfb25f8a-bfed-418b-aed2-e69b306d8673",
 "display_name": "edge1",    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< match edge VM name
 "external_id": "cfb25f8a-bfed-418b-aed2-e69b306d8673",
 "ip_addresses": [
 "10.160.141.28"
 ],
...
  3. Delete each of the edge transport-nodes using the node IDs:
API - DELETE https://NSX-Manager-IP-Address/api/v1/transport-nodes/<edge-transport-node-id>
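The API workflow above can be sketched in Python. This is a minimal, hedged sketch: the NSX Manager address is a placeholder, authentication is left to a caller-supplied opener, and only the matching logic is guaranteed to mirror the JSON shape shown above.

```python
import urllib.request

NSX_MANAGER = "NSX-Manager-IP-Address"  # placeholder, replace with your manager


def match_edge_node_ids(transport_nodes_response, edge_vm_names):
    """Return node ids whose display_name matches a failed edge VM name.

    transport_nodes_response is the parsed JSON body of
    GET /api/v1/transport-nodes?node_types=EdgeNode.
    """
    return [
        node["id"]
        for node in transport_nodes_response.get("results", [])
        if node.get("display_name") in edge_vm_names
    ]


def delete_edge_nodes(node_ids, opener):
    """Issue DELETE /api/v1/transport-nodes/<id> for each matched node.

    opener is a urllib opener already configured with NSX credentials;
    opener.open raises on a non-2xx response.
    """
    for node_id in node_ids:
        req = urllib.request.Request(
            f"https://{NSX_MANAGER}/api/v1/transport-nodes/{node_id}",
            method="DELETE",
        )
        opener.open(req)
```

For example, given the sample response shown earlier, `match_edge_node_ids(response, {"edge1"})` would return the single id `cfb25f8a-bfed-418b-aed2-e69b306d8673`.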

Using NSX Manager UI

Note: The NSX Manager UI can be accessed at https://NSX-Manager-IP-Address or via the NSX plugin in the vSphere Client UI.
  1. Go to NSX H5 plugin > System > Fabric > Transport nodes > Edge Transport Nodes.
  2. Match the Edge VM names with the Edge field in the Edge Transport nodes page.
  3. Verify that the Edge VMs to delete are not already part of the Edge Cluster by checking that the Edge Cluster field for each Edge VM is blank.
  4. Delete the two Edge transport nodes corresponding to the VMs specified in the failed update by using the DELETE option. This step is expected to delete the corresponding VMs from vCenter Server as well.

Cleanup in the vCenter Server

If the failed VMs still appear in the vSphere Client UI after they are deleted from the NSX Manager, manually delete the two VMs using the vSphere Client UI.

NSXD Database Cleanup
  1. Identify the cluster-id corresponding to the given cluster. This can be obtained in either of these ways:
  • Browsing to the given vCenter Cluster in the vCenter Server MOB at https://VC-IP/mob
  • From the vSphere Client UI, navigating to the given cluster and noting down the part of the URL string with the pattern domain-X123.
  2. Connect to the VCSA with SSH as the root user.
  3. Stop the WCP service:
vmon-cli --stop wcp
  4. Access the NSX database:
/opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres -w
  5. After entering the VCDB=# prompt:
    1. Find the nsxd.edge_configuration table entry corresponding to the given cluster:
select * from nsxd.edge_configuration where cluster_id='domain-X123';
    2. Verify that the inprogress_update_spec column is non-empty.
    3. Set the inprogress_update_spec field to NULL:
UPDATE nsxd.EDGE_CONFIGURATION SET INPROGRESS_UPDATE_SPEC=NULL WHERE CLUSTER_ID='domain-X123';
 
  6. Make note of the Edge VM names used in the failed update.
  7. Check whether there are corresponding entries in the nsxd.EDGE_VM table for the given cluster:
SELECT * FROM nsxd.EDGE_VM where cluster_id='domain-X123';
  8. If entries corresponding to the failed VMs exist in the nsxd.EDGE_VM table, selectively delete them (make sure to delete ONLY the entries corresponding to the failed VM names):
DELETE FROM nsxd.EDGE_VM where cluster_id='domain-X123' and name_in_spec='VM-NAME-TO-DELETE';
  9. Verify that only the Edge VMs that have previously been successfully deployed and configured have entries in the nsxd.EDGE_VM table.
  10. Restart the WCP service:
vmon-cli --start wcp
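The database edits above can be captured as a small helper that builds the exact SQL to run at the VCDB=# prompt. This is an illustrative sketch only: the function name and the idea of generating statements per failed VM are assumptions, while the SQL text itself mirrors the statements shown above.

```python
def edge_db_cleanup_statements(cluster_id, failed_vm_names):
    """Build the SQL statements for the nsxd database cleanup.

    cluster_id is the domain-X123 identifier; failed_vm_names are the Edge VM
    names used in the failed update. Returns the statements in execution order.
    """
    statements = [
        # Clear the stuck in-progress update spec for the cluster.
        "UPDATE nsxd.EDGE_CONFIGURATION SET INPROGRESS_UPDATE_SPEC=NULL "
        f"WHERE CLUSTER_ID='{cluster_id}';",
    ]
    # Remove ONLY the nsxd.EDGE_VM rows for the failed edge VMs.
    for name in failed_vm_names:
        statements.append(
            f"DELETE FROM nsxd.EDGE_VM WHERE cluster_id='{cluster_id}' "
            f"AND name_in_spec='{name}';"
        )
    return statements
```

For two failed VMs this yields one UPDATE followed by two targeted DELETE statements, which keeps the entries of successfully deployed Edge VMs untouched.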

Once the above steps are complete, you should be able to update the Edge Cluster to add Edge Nodes using the same configuration as before or a new configuration.
