Introduction:
One of the most challenging issues to remediate in a virtual infrastructure is an intermittent network outage affecting virtual machines. The challenge is twofold:
- Network administrators do not want to touch a production network unless they have concrete evidence of a failure on the network side.
- By default, VMware logging cannot detect failures other than a link state change on the edge switch.
Because VMware logging is of little help unless the failure is a link state failure on the edge switch, a few tests are needed to isolate the break point.
Symptoms:
- Intermittent network outages for virtual machines
- vMotion of a VM causes a network outage
- An intermittent network outage for a VM that is resolved by vMotion of the VM to another host
- An intermittent network outage for a VM that is resolved when the VM's networking configuration is modified using the Edit Settings option
Procedure
This procedure is not suitable for environments with port channels.
Step1: Identify the VMs facing the issue.
1.1 Make a list of VMs that are facing the issue regularly.
1.2 For VMs with multiple NICs, also note down the specific adapter that faces the issue.
If you are unable to isolate the VMs facing the issue due to operational challenges, you can still proceed further.
Step2: Identify a pattern of failures
2.1 Once we have a list of the VMs that notice the issue most frequently, try to understand whether there is a pattern to the failures, such as:
2.1.1 All VMs that notice the outage were on the same host.
2.1.2 All VMs that notice the outage are in the same VLAN.
2.1.3 All VMs that notice the outage are pinned to the same physical NIC on the host.
The information can be obtained from the esxtop network statistics. Run “esxtop” on the console or in an SSH session on the host, then press “n” to access the network screen. The TEAM-PNIC column shows the physical NIC that each VM port is currently using.
If you are unable to identify a pattern of failures due to operational challenges, you can still proceed further.
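The pattern check in Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outage records, the VM names and the field names (host, vlan, uplink) are hypothetical and would be filled in by hand from the list built in Step 1 and the esxtop network screen.

```python
from collections import Counter

# Hypothetical observations: one record per affected VM adapter at the time
# of an outage. VM names and values are made up for illustration.
outages = [
    {"vm": "app01", "host": "Esxi-B", "vlan": 10, "uplink": "vmnic0"},
    {"vm": "app02", "host": "Esxi-B", "vlan": 10, "uplink": "vmnic0"},
    {"vm": "db01", "host": "Esxi-B", "vlan": 20, "uplink": "vmnic0"},
]


def shared_attributes(records, keys=("host", "vlan", "uplink")):
    """Return {key: value} for every attribute that is identical in all records."""
    shared = {}
    for key in keys:
        values = {record[key] for record in records}
        if len(values) == 1:
            shared[key] = values.pop()
    return shared


def attribute_counts(records, key):
    """Count how often each value of one attribute appears, most frequent first."""
    return Counter(record[key] for record in records).most_common()


print(shared_attributes(outages))         # attributes common to every outage
print(attribute_counts(outages, "vlan"))  # distribution when nothing is fully shared
```

With the sample records, the host and the uplink are common to every outage while the VLAN is not, which would point the investigation at one physical NIC on one host rather than at a VLAN.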
Step3: Validate the Network Configuration
3.1 Identify a test VM, or create a new VM with a Linux or Windows guest OS.
3.2 Obtain an IP address for this VM, one for each VLAN you would like to verify.
3.3 Create a test port group with a configuration similar to the production port group.
3.4 The test port group should be on the same vSwitch as the production port group.
3.5 Modify the test port group settings and override the Teaming and Failover settings. Make sure one uplink is marked as active and the others are marked as unused.
3.6 If we are using a standard vSwitch, repeat steps 3.3 to 3.5 for each host in the cluster. For a distributed vSwitch, the modified port group settings apply to all hosts that are part of the distributed vSwitch.
3.7 Make a note of the physical uplink that is active for each host. Example below:
Host Name | Active Uplink | VLAN |
Esxi-A | Vmnic0 | 10 |
Esxi-B | Vmnic1 | 10 |
Esxi-C | Vmnic1 | 10 |
3.8 Attach the test VM created in step 3.1 to the test port group created in step 3.3.
3.9 From the test VM, start a continuous ping to the gateway or to an IP address that is known to have failed during the outage.
3.10 Note down the ping output and vMotion the VM to another host in the cluster, recording the outcome on each host. Example below:
Host Name | Active Uplink | VLAN | Ping Outcome |
Esxi-A | Vmnic0 | 10 | Success |
Esxi-B | Vmnic0 | 10 | Failure |
Esxi-C | Vmnic0 | 10 | Success |
3.11 Repeat steps 3.5 to 3.10 until all uplinks are tested.
3.12 At the end of this activity, you will have data similar to the example below. From this data we can see that Vmnic0 on Esxi-B and Vmnic2 on Esxi-C failed to carry traffic for the VLAN we were testing.
Host Name | Active Uplink | VLAN | Ping Outcome |
Esxi-A | Vmnic0 | 10 | Success |
Esxi-B | Vmnic0 | 10 | Failure |
Esxi-C | Vmnic0 | 10 | Success |
Esxi-A | Vmnic1 | 10 | Success |
Esxi-B | Vmnic1 | 10 | Success |
Esxi-C | Vmnic1 | 10 | Success |
Esxi-A | Vmnic2 | 10 | Success |
Esxi-B | Vmnic2 | 10 | Success |
Esxi-C | Vmnic2 | 10 | Failure |
3.13 Engage the networking team to investigate the configuration of the physical switch ports connected to Vmnic0 on Esxi-B and Vmnic2 on Esxi-C.
3.14 Repeat steps 3.4 to 3.11 until all VLANs are tested.
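The record keeping in steps 3.9 to 3.12 can be sketched in Python as follows. This is a minimal illustration, not part of the original procedure: the function names ping_once, record and failing_uplinks are hypothetical, the ping flags assume the stock Windows ping or Linux iputils ping, and the sample results simply restate the example table. The script would run inside the test VM, with the host and uplink values entered by hand after each vMotion or uplink change.

```python
import platform
import subprocess
from datetime import datetime


def ping_once(target, timeout_s=1):
    """Send one echo request with the OS ping command; True if the command exits with 0.

    Assumption: Windows ping (-n, -w in milliseconds) or Linux iputils ping
    (-c, -W in seconds). Other ping variants may need different flags, and
    the exit code may not reflect every failure mode on every platform.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), target]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), target]
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode == 0


def record(results, host, uplink, vlan, ok):
    """Append one timestamped test outcome to the results list."""
    results.append({
        "time": datetime.now().isoformat(timespec="seconds"),
        "host": host,
        "uplink": uplink,
        "vlan": vlan,
        "ok": ok,
    })


def failing_uplinks(results):
    """Return sorted (host, uplink, vlan) tuples that had at least one failed ping."""
    return sorted(
        {(r["host"], r["uplink"], r["vlan"]) for r in results if not r["ok"]}
    )


# Sample outcomes restating the example table. In a live test each outcome
# would come from a call such as ping_once("192.0.2.1"), where 192.0.2.1 is
# a placeholder for the gateway address.
sample = [
    ("Esxi-A", "Vmnic0", 10, True), ("Esxi-B", "Vmnic0", 10, False),
    ("Esxi-C", "Vmnic0", 10, True), ("Esxi-A", "Vmnic1", 10, True),
    ("Esxi-B", "Vmnic1", 10, True), ("Esxi-C", "Vmnic1", 10, True),
    ("Esxi-A", "Vmnic2", 10, True), ("Esxi-B", "Vmnic2", 10, True),
    ("Esxi-C", "Vmnic2", 10, False),
]
results = []
for host, uplink, vlan, ok in sample:
    record(results, host, uplink, vlan, ok)

print(failing_uplinks(results))
# [('Esxi-B', 'Vmnic0', 10), ('Esxi-C', 'Vmnic2', 10)]
```

The timestamp stored with each outcome is there so that a failure can later be matched against switch logs when the networking team is engaged.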
If you were unable to isolate the cause of the issue using the steps above, proceed to Isolation of Intermittent Network Issue Part II (Captures using pktcap-uw).