Isolation of Intermittent Network Issue Part III (The Analysis)

Introduction:

You followed the steps here and were unable to isolate the issue. Your network team is still breathing down your neck.

If you are using vSphere 5.5 or above, the gods of virtualization are with you. They have blessed you with a cool tool: “pktcap-uw” comes to the rescue.

The instructions for using the tool can be found here.

Symptoms:

  • Intermittent network outage for the virtual machines
  • vMotion of a VM causes a network outage
  • Intermittent network outage for a VM, which gets resolved by a vMotion of the VM to another host
  • Intermittent network outage for a VM, which gets resolved when the VM’s networking configuration is modified using the Edit Settings option

Exceptions:

  1. The following procedure is not suitable for environments with VXLANs.
  2. Port channels can only be investigated with the support of the network team.

Setup details:

The test setup used to explain the analysis consists of two hosts with the following details:

Network Layout:

Esxi-1:

Test VMK: vmk1

VMK IP: 192.168.20.81

Physical Nic in use by VMK: vmnic4

Port-ID: 67108870

Esxi-2:

Test VMK: vmk1

VMK IP: 192.168.20.82

Physical Nic in use by VMK: vmnic4

Port-ID: 67108870

Commands Executed:

Esxi-1

pktcap-uw --vmk vmk1 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmk1_DIR0_esxi1.pcap --dir 0 -c 25 &
pktcap-uw --vmk vmk1 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmk1_DIR1_esxi1.pcap --dir 1 -c 25 &
pktcap-uw --switchport 67108870 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/67108870_Dir0_esxi1.pcap --dir 0 -c 25 &
pktcap-uw --switchport 67108870 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/67108870_Dir1_esxi1.pcap --dir 1 -c 25 &
pktcap-uw --uplink vmnic5 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic5_DIR0_esxi1.pcap --dir 0 -c 25 &
pktcap-uw --uplink vmnic5 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic5_DIR1_esxi1.pcap --dir 1 -c 25 &
pktcap-uw --uplink vmnic4 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic4_DIR0_esxi1.pcap --dir 0 -c 25 &
pktcap-uw --uplink vmnic4 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic4_DIR1_esxi1.pcap --dir 1 -c 25 &
vmkping -I vmk1 192.168.20.82 -c 10 &

Esxi-2

pktcap-uw --vmk vmk1 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmk1_DIR0_esxi2.pcap --dir 0 -c 25 &
pktcap-uw --vmk vmk1 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmk1_DIR1_esxi2.pcap --dir 1 -c 25 &
pktcap-uw --switchport 67108870 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/67108870_Dir0_esxi2.pcap --dir 0 -c 25 &
pktcap-uw --switchport 67108870 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/67108870_Dir1_esxi2.pcap --dir 1 -c 25 &
pktcap-uw --uplink vmnic5 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic5_DIR0_esxi2.pcap --dir 0 -c 25 &
pktcap-uw --uplink vmnic5 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic5_DIR1_esxi2.pcap --dir 1 -c 25 &
pktcap-uw --uplink vmnic4 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic4_DIR0_esxi2.pcap --dir 0 -c 25 &
pktcap-uw --uplink vmnic4 -o /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb/vmnic4_DIR1_esxi2.pcap --dir 1 -c 25 &
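Typing eight near-identical capture commands per host is error-prone, so here is a minimal sketch of a helper that prints them for review before you paste them into the ESXi shell. The function name build_capture_cmds and its three arguments are my own invention for illustration; the vmk name, uplink names and packet count are hard-coded to match this test setup and would need adjusting for yours.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the capture commands
# instead of running them, so they can be reviewed first.
# Usage: build_capture_cmds HOST_TAG OUT_DIR PORT_ID
build_capture_cmds() {
    host_tag=$1   # label used in the file names, e.g. esxi1
    out_dir=$2    # directory the .pcap files are written to
    port_id=$3    # virtual switch port ID of vmk1

    # One capture per direction at each capture point:
    # the vmk, its virtual switch port, and both uplinks of the team.
    for dir in 0 1; do
        echo "pktcap-uw --vmk vmk1 -o $out_dir/vmk1_DIR${dir}_$host_tag.pcap --dir $dir -c 25 &"
        echo "pktcap-uw --switchport $port_id -o $out_dir/${port_id}_Dir${dir}_$host_tag.pcap --dir $dir -c 25 &"
        for nic in vmnic4 vmnic5; do
            echo "pktcap-uw --uplink $nic -o $out_dir/${nic}_DIR${dir}_$host_tag.pcap --dir $dir -c 25 &"
        done
    done
}

# Print the eight commands for the second host of this setup.
build_capture_cmds esxi2 /vmfs/volumes/5bc894f7-8dffcc7d-1951-005056016fbb 67108870
```

The helper only echoes text, so nothing is captured until you run the printed lines yourself.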

To terminate the captures, you can use the following command:

kill -9 `lsof | grep pktcap-uw | awk '{print $1}' | sort -u`


How it looks in Wireshark:

We will start with vmk1_DIR0_esxi1.pcap, because in our test we are running the ping from esxi-1 vmk1 to esxi-2 vmk1.

The capture shows that 192.168.20.81 (esxi-1 vmk1) made 10 ping requests to 192.168.20.82 (esxi-2 vmk1). Since pktcap-uw captures are unidirectional, we do not see any replies in this file.

In vmk1_DIR1_esxi1.pcap we see a reply for each of the 10 ping requests. However, the replies are duplicated.

In vmk1_DIR1_esxi2.pcap we see that 192.168.20.81 (esxi-1 vmk1) made 20 ping requests to 192.168.20.82 (esxi-2 vmk1). The requests got duplicated somewhere on their way to 192.168.20.82.

In vmk1_DIR0_esxi2.pcap we see that 192.168.20.82 (esxi-2 vmk1) responded to only 10 ping requests.

The VMK ports on both ESXi-1 and ESXi-2 are sending only 10 packets each.
However, we see duplication of both the ICMP requests and the replies.
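If you would rather not count rows in Wireshark by eye, the duplication can also be spotted from the ICMP sequence numbers. The sketch below assumes you have already exported the sequence numbers of one capture to text, one per line (for example with tshark's "-T fields -e icmp.seq" on a workstation, which is my assumption and not part of the original procedure). The function name find_dup_seqs and the sample numbers are illustrative only.

```shell
#!/bin/sh
# find_dup_seqs (hypothetical name): read one ICMP sequence number per
# line on stdin and print every number that occurs more than once.
find_dup_seqs() {
    # sort -n groups equal numbers together; uniq -d keeps one line
    # for each group that has at least two members.
    sort -n | uniq -d
}

# Illustrative input only: sequence numbers 0..2 where 1 arrives twice.
printf '%s\n' 0 1 1 2 | find_dup_seqs
```

An empty result means every sequence number was seen exactly once at that capture point.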

Let’s look at this further to understand where the duplication comes from.

Reviewing the files 67108870_Dir0_esxi1.pcap (virtual port ID for vmk1) and vmnic4_DIR1_esxi1.pcap (physical up-link in use by vmk1), we see no duplication of ICMP requests.

Let’s see what the other host received.

Reviewing the file vmnic4_DIR0_esxi2.pcap (physical up-link in use by vmk1 on ESXi-2) we see no duplication of ICMP requests. 

Reviewing the file 67108870_Dir1_esxi2.pcap (virtual port ID for vmk1 on ESXi-2), we notice duplicate ICMP requests.

Since the virtual port received duplicate requests, the vmk received them as well. The same can be seen in vmk1_DIR1_esxi2.pcap.

Summary of Analysis so far

  • ESXi-1 with IP 192.168.20.81 sent out 10 ping requests to ESXi-2 with IP 192.168.20.82
  • The virtual switch port and physical NIC on ESXi-1 see 10 ping requests
  • Physical NIC on ESXi-2 (vmnic4) received 10 ping requests
  • The virtual switch port and VMK on ESXi-2 received 20 ping requests
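The summary above boils down to walking the capture points in path order and finding the first hop where the packet count changes. Here is a rough sketch of that comparison; the function name first_change and the capture-point labels are made up for illustration, and the counts are simply the ICMP request counts read off the captures in this post.

```shell
#!/bin/sh
# first_change (hypothetical name): read "capture_point packet_count"
# lines, ordered along the packet's path, and print the first hop
# where the count differs from the previous capture point.
first_change() {
    awk '
        NR > 1 && $2 != prev_count {
            printf "%s (%d) -> %s (%d)\n", prev_point, prev_count, $1, $2
            found = 1
            exit
        }
        { prev_point = $1; prev_count = $2 }
        END { if (!found) print "no change along the path" }
    '
}

# ICMP request counts from this test, in the order the requests travel.
first_change <<'EOF'
esxi1_vmk1_dir0 10
esxi1_switchport_dir0 10
esxi1_vmnic4_dir1 10
esxi2_vmnic4_dir0 10
esxi2_switchport_dir1 20
esxi2_vmk1_dir1 20
EOF
```

For these counts the change shows up between vmnic4 and the virtual switch port on ESXi-2. That only tells you where the extra packets first appear on this path, not which component injected them.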

From the above analysis it is tempting to conclude that packets are getting duplicated between the physical up-link and the virtual switch port on ESXi-2. A virtual switch could cause this when:

  • Port mirroring is configured on the DVS and the physical up-link ports are mirrored
  • An incorrectly configured DV filter is present in the stack

In my setup, neither of the above is configured.

Then why duplicate packets?

Let’s see what is happening on the second up-link in the team supporting vmk1 on ESXi-2.

Reviewing the file vmnic5_DIR0_esxi2.pcap, we notice that vmnic5 is also seeing ICMP requests and replies.

Since we see these packets in the DIR 0 capture, vmnic5 is receiving them from the external environment.

Hence, we should provide the above analysis to the network team and work with them to isolate the physical switching configuration issue.