VMware vSphere Bitfusion Installation Gotchas!

What is VMware vSphere Bitfusion?

VMware Bitfusion is to GPUs what VMware ESXi is to physical servers.

VMware vSphere Bitfusion virtualizes hardware accelerators such as graphical processing units (GPUs) to provide a pool of shared, network-accessible resources that support artificial intelligence (AI) and machine learning (ML) workloads. vSphere Bitfusion works with artificial intelligence frameworks such as TensorFlow and PyTorch.

For more details please review https://docs.vmware.com/en/VMware-vSphere-Bitfusion/index.html

Prerequisites:
  1. The Bitfusion server must run on vCenter 7.0 and ESXi 7.0
  2. Bitfusion clients running in a VM must be hosted on ESXi 6.7 or above
  3. Before installing the Bitfusion server, make sure the ESXi servers are running on an Enterprise Plus License
  4. Firefox version 82 or above, or Chrome version 73 or above
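
If you want to sanity-check an environment against the list above before deploying, a small script can help. The following is only a sketch: the key names (`esxi_for_client` and so on) are hypothetical labels made up for this example, you have to supply the version strings yourself, and it assumes the vCenter 7.0 and ESXi 7.0 entries are minimums rather than exact versions. The license check is not covered.

```python
# Minimum versions taken from the prerequisites list above.
# ASSUMPTION: "7.0" is treated as a minimum, not an exact match.
# The key names are hypothetical labels for this sketch only.
MINIMUMS = {
    "vcenter_for_server": "7.0",
    "esxi_for_server": "7.0",
    "esxi_for_client": "6.7",
    "firefox": "82",
    "chrome": "73",
}


def parse_version(text):
    """Turn a dotted version string such as '6.7.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets(found, minimum):
    """Return True if version `found` is at least `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "7" and "7.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def unmet_prerequisites(found_versions):
    """List the labels whose supplied version is below the minimum.

    Labels missing from `found_versions` are skipped, not reported.
    """
    return [
        name
        for name, minimum in MINIMUMS.items()
        if name in found_versions and not meets(found_versions[name], minimum)
    ]


if __name__ == "__main__":
    print(unmet_prerequisites({"esxi_for_client": "6.5", "chrome": "90"}))
```

An empty list means none of the versions you supplied fall below the documented minimums.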
Where to Download VMware vSphere Bitfusion?

You can download VMware vSphere Bitfusion at https://my.vmware.com/web/vmware/downloads/details?downloadGroup=BITFUSION-202&productId=1022&rPId=51899#product_downloads

Bitfusion server installation Gotchas:
  • The Bitfusion server does not work on ESXi servers running under an Evaluation License
  • The Bitfusion server does not work on ESXi servers whose Enterprise Plus License includes add-ons such as Kubernetes
  • All vApp options and VM settings must be error-free before the first boot of the Bitfusion server
  • To save yourself the trouble of vApp redeployment, take a snapshot of the VM before powering it on for the first time
  • Revert to the snapshot before fixing any errors in vApp options
  • vApp redeployment is the easiest fix if you have made an error and have no snapshot
  • Online snapshot operations do not work on the Bitfusion server because of the PCI passthrough devices attached to it
  • The vCenter Server thumbprint must be entered in all uppercase letters
  • The Bitfusion server does not like it when the following options in the vSphere UI are used to perform power operations. I broke the appliance a few times before realizing it. You may use Shut Down Guest OS and Restart Guest OS for nodes in a stable cluster. Just avoid using them during node setup.
  • The Bitfusion Enable operation on a server or client must be performed on a powered-off VM
  • You may see the error “Error logging in with Token!” or “401: Unable to Log in” when trying to access the Bitfusion plugin or performing Bitfusion-related operations in a linked mode environment. You can only access the Bitfusion plugin and its operations by accessing vCenter Server through the URL that the Bitfusion server was pointed to during the initial setup.
    • These errors may also be observed when you are using a Firefox version below 82 or a Chrome version below 73
  • If you want to deploy an additional server using the vCenter Server clone option, the machine must be cloned before its first boot
  • The Bitfusion client may throw this error when executing bitfusion list_gpus: “Cannot contact server 10.109.44.144:56001: response didn’t return status OK: code(401) Error validating license: Unable to validate license”. The error indicates that the Bitfusion server (10.109.44.144 here) was not able to communicate with the vCenter Server. This may be due to an incorrect vCenter Server thumbprint
  • When deploying Bitfusion server nodes across multiple ESXi servers, make sure to provide a single source of time
    • This can be done using the NTP setting of the Bitfusion server node or by setting up the ESXi servers for NTP
    • Failing to do so can lead to node sync issues
    • Nodes must not drift by more than 10 milliseconds in either direction
  • Choosing to auto-download the NVIDIA drivers blocks incoming SSH. Therefore, you will need to enable incoming SSH from the console before you can access the appliance with an SSH client such as PuTTY
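
Two of the gotchas above, the uppercase thumbprint and the 10 millisecond drift limit, are easy to check with a few lines of Python before you fill in the vApp options. Treat this as a rough sketch under a few assumptions: the colon-separated output format and the hash algorithm are not stated in this post, so the algorithm is a required argument and you should confirm both against the Bitfusion documentation. All function names are made up for this example, and the clock offsets have to come from your own NTP tooling.

```python
import hashlib
import re


def normalize_thumbprint(raw):
    """Return a thumbprint as uppercase, colon-separated hex pairs.

    ASSUMPTION: colon-separated pairs are the expected format.
    Accepts input with colons, dashes, spaces, or no separators at all.
    """
    compact = re.sub(r"[\s:\-]", "", raw)
    if len(compact) % 2 or not re.fullmatch(r"[0-9A-Fa-f]+", compact):
        raise ValueError(f"not a valid hex thumbprint: {raw!r}")
    compact = compact.upper()
    return ":".join(compact[i:i + 2] for i in range(0, len(compact), 2))


def thumbprint_from_der(der_bytes, algorithm):
    """Hash a DER-encoded certificate and format the digest.

    `algorithm` is any name hashlib accepts, e.g. "sha1" or "sha256".
    Which one Bitfusion expects is NOT stated in this post.
    """
    digest = hashlib.new(algorithm, der_bytes).hexdigest()
    return normalize_thumbprint(digest)


def drift_report(offsets_ms, limit_ms=10.0):
    """Map each node to True if its clock offset is within the limit.

    `offsets_ms` maps a node name to its measured offset in milliseconds
    (positive or negative) from the shared time source.
    """
    return {node: abs(offset) <= limit_ms for node, offset in offsets_ms.items()}


if __name__ == "__main__":
    print(normalize_thumbprint("ab:cd:ef:01"))
    print(drift_report({"node-1": 3.2, "node-2": -12.5}))
```

Any node that maps to False in the drift report is outside the 10 millisecond window and is a candidate for the sync issues mentioned above.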

Stay tuned for VMware vSphere Bitfusion installation demos.