
ESXi host experiences PSOD with references to the FCoE module (qfle3f) in the backtrace.

Symptoms

  • During prolonged runs of connection resets, the host encounters All Paths Down (APD) or similar path-down conditions.
  • ESXi 6.5 or 6.7 host experiences PSOD with references to the FCoE module (qfle3f) in the backtrace.

PSOD: Panic bora/vmkernel/main/dlmalloc.c:4908 - Corruption in DLMALLOC referencing details ql_fcoe_delayed_wq.
 
The backtrace is similar to:
0x451b9fd9bd50:[0x418037d0ba15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x4302d004c490, 0x4180380a7558, 0x451b9fd9bdf8, 0x0, 0x100000001
 0x451b9fd9bdf0:[0x418037d0bc48]Panic_NoSave@vmkernel#nover+0x4d stack: 0x451b9fd9be50, 0x451b9fd9be10, 0x43120f780c20, 0x4180380a7539, 0x132c
 0x451b9fd9be50:[0x418037d54363]DLM_free@vmkernel#nover+0x6a8 stack: 0x43120f78acc0, 0x418037d51501, 0x5beea699da51a, 0x418037d15653, 0x0
 0x451b9fd9be70:[0x418037d51500]Heap_Free@vmkernel#nover+0x115 stack: 0x0, 0x43120f78acc0, 0x2f, 0x40000000, 0x0
 0x451b9fd9bec0:[0x418037c3d987]vmk_SpinlockDestroy@vmkernel#nover+0x48 stack: 0x43120f5df000, 0x418038ab09ed, 0x0, 0x418038abcb52, 0x43120f5df000
 0x451b9fd9bee0:[0x418038ab09ec]DeleteFabric@(qfle3f)#<None>+0x29 stack: 0x43120f5df000, 0x43120f5df200, 0x0, 0x418038ab2c00, 0x43120f5f3610
 0x451b9fd9bf40:[0x418038ab0bd9]_ReleaseFabricReference@(qfle3f)#<None>+0x2e stack: 0x43120f786000, 0x43120f786018, 0x1, 0x418038abc27b, 0x418038abc1f8
 0x451b9fd9bf70:[0x418038abc27a]ql_fcoe_do_singlethread_work@(qfle3f)#<None>+0x83 stack: 0x2f, 0x418037d2902f, 0x2f, 0x418038abc1f8, 0x418037d2902a
 0x451b9fd9bf90:[0x418037d2902e]vmkWorldFunc@vmkernel#nover+0x4f stack: 0x418037d2902a, 0x0, 0x451b8a6a3100, 0x451b9fda3000, 0x451b8a6a3100
 0x451b9fd9bfe0:[0x418037f0e322]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
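
As a rough aid for triage, the signature above (a panic in DLM_free reached through qfle3f frames) can be checked against a backtrace saved as text. The following is a minimal sketch, not a VMware tool: the function names and the regular expression are assumptions made for illustration, and a match only suggests that the backtrace resembles this issue.

```python
import re

# Matches one vmkernel backtrace frame such as:
#   0x451b9fd9bee0:[0x418038ab09ec]DeleteFabric@(qfle3f)#<None>+0x29 stack: ...
# The frame layout is inferred from the sample backtrace in this article.
FRAME_RE = re.compile(
    r"\[0x[0-9a-fA-F]+\](?P<symbol>[^@\s]+)@\(?(?P<module>[^)#\s]+)\)?#"
)


def parse_frames(backtrace):
    """Return (symbol, module) pairs in the order they appear in the text."""
    return [(m.group("symbol"), m.group("module"))
            for m in FRAME_RE.finditer(backtrace)]


def matches_qfle3f_heap_panic(backtrace):
    """Heuristic: True if the backtrace frees heap memory (DLM_free)
    and contains at least one frame from the qfle3f module."""
    frames = parse_frames(backtrace)
    symbols = {symbol for symbol, _ in frames}
    modules = {module for _, module in frames}
    return "DLM_free" in symbols and "qfle3f" in modules
```

Passing the text of the backtrace shown above to `matches_qfle3f_heap_panic` is expected to return `True`, since it contains both a `DLM_free` frame and several `qfle3f` frames.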

Cause
During the FCoE FIP discovery process, queuing the discovery timeout handler can fail under certain conditions.
This causes the current iteration of the discovery process to fail, which is expected.
However, a bug in this failure path leaves a reference to the session object in place.
As a result, resources are not fully cleaned up, which hampers any re-discovery.

Impact / Risks
A host reboot is required.

Resolution

This issue is fixed in VMware vSphere ESXi 6.7 driver version 2.0.123.0 and VMware vSphere ESXi 7.0 driver version 3.0.125.0.

To download the drivers, use the links below.
For ESXi 6.7 - MyVMware
For ESXi 7.0 - MyVMware
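
To decide whether a host already carries the fix, the installed qfle3f driver version can be compared with the fixed versions listed in this section. The sketch below is illustrative only: the function names are hypothetical, and it assumes the installed version is a dotted number that may carry a build suffix after a hyphen (the suffix is ignored). Releases without a fixed version listed here, such as ESXi 6.5, return `None`.

```python
# Fixed qfle3f driver versions, taken from the Resolution section.
FIXED_VERSIONS = {"6.7": "2.0.123.0", "7.0": "3.0.125.0"}


def parse_version(text):
    """Turn '2.0.123.0' or '2.0.123.0-<build suffix>' into (2, 0, 123, 0).

    Assumption: everything after the first '-' is a build suffix that
    does not affect the comparison.
    """
    numeric = text.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def has_fix(esxi_release, driver_version):
    """Return True/False if the release has a known fixed version,
    or None if no fixed version is listed for that release."""
    fixed = FIXED_VERSIONS.get(esxi_release)
    if fixed is None:
        return None
    return parse_version(driver_version) >= parse_version(fixed)
```

The comparison uses integer tuples rather than strings so that, for example, 2.0.123.0 is correctly treated as newer than 2.0.99.0.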

NOTE: If you are running VMware vSphere ESXi 6.5, contact the server hardware vendor.

Workaround
  • If you are not using FCoE for storage connectivity, disable the qfle3f module as a workaround:

esxcli system module set --enabled=false --module=qfle3f

Reboot the host for the above command to take effect.
  • If multiple FCoE VLANs are configured, remove the multiple-VLAN configuration on the same fabric as a workaround.

Related Information
For more information on how to install the driver, refer to Installing async drivers in ESXi 5.x/6.x/7.x using esxcli and offline bundle.
