On most servers you can use the Turn On LED button in the Storage Devices view to easily identify which drives can be safely removed from the server. However, if the drive is an NVMe SSD (such as the ones shown below), this button might have no effect.
In this example an entire disk group (2 TB cache disk and three 8 TB capacity disks) is to be moved from an ESXi host in a vSAN cluster with spare capacity to a host belonging to a cluster where more datastore space is needed. The process is repeated for all hosts until the storage configuration is homogeneous across all clusters, as per vSAN recommendations.
When an entire disk group is removed, the following window is displayed, where the user must select an option for the vSAN data migration. Full data migration is recommended, as all replicas required by the configured Failures to tolerate policies are kept. Otherwise, new replicas have to be re-created either manually (the Repair objects immediately button in the vSAN Skyline health view) or automatically after the repair delay time. This Object Repair Timer is set to 1 hour by default, so if a failure occurs in the components hosting the only remaining replica before the timer expires and the resync process completes, data loss may occur. The timer can be modified, but lower settings may lead to unwanted behavior, as explained in this KB article.
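If you prefer to check or adjust the repair delay from the command line, the timer is exposed on each ESXi host as the advanced setting /VSAN/ClomRepairDelay (value in minutes). A minimal sketch, assuming SSH access to the host; a changed value should be applied consistently on every host in the cluster:
esxcli system settings advanced get -o /VSAN/ClomRepairDelay         # show the current repair delay
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 60   # set it back to the 60-minute default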
After the disks are removed from the cluster, traces of the vSAN partitions may remain on them, so I recommend selecting the respective disks in the Storage Devices view and clicking on Erase Partitions, as seen in the screenshot below. If you forget to do this and erasing the partitions fails later (e.g. on a different server the disk has been relocated to, before the partitions were erased on the old one), check the instructions in my previous post. To identify which bay the NVMe drive you just removed and erased is located in, write down the vmhbaX number found in the Path view below.
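If you prefer the CLI over the GUI Path view, the adapters can also be listed from the ESXi shell; on many servers the Description column already contains the controller's PCI address, which shortens the next step:
esxcli storage core adapter list    # lists the vmhbaX adapters with driver and description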
Now connect to the ESXi host via SSH and find the bus ID using the lspci command (filter using grep to narrow down the results):
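A minimal example, with vmhba2 standing in for the adapter number you noted earlier; on ESXi the lspci output usually shows the vmhba alias in brackets next to the PCI address of each storage controller:
lspci | grep -i nvme      # all NVMe controllers with their PCI addresses
lspci | grep vmhba2       # or filter directly for the adapter noted earlier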
Convert the hexadecimal number after the four leading zeros to decimal representation using your favourite tool, e.g. a programmer's calculator:
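For example, if the path is 0000:86:00.0, the bus ID 0x86 converts to 134. This also works directly in the ESXi shell:
printf '%d\n' 0x86    # prints 134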
In the iDRAC web-interface of your Dell server open the System/Storage/Physical Disks view and find the disk with the bus ID number calculated above.
The string in the Device Description field gives a human-readable representation of the drive's position. Now you know which drive you removed from the vSAN cluster earlier and can safely remove it from the server:
More details, e.g. how to use the racadm CLI tool instead of the iDRAC web-interface, can be found in this KB article by Dell.
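For reference, a starting point with racadm might look like the line below; the exact syntax and options differ between iDRAC generations, so treat this as a sketch and consult the Dell KB article for your firmware:
racadm storage get pdisks -o    # list physical disks with their properties, including the device description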
In VMware Cloud Foundation (VCF), workloads are usually deployed in one or more dedicated virtual infrastructure (VI) workload domains. During the VCF deployment (as shown in my earlier posts) the management workload domain (MWLD) is created with a minimum of four hosts. The MWLD contains, among other components, the management vCenter and the SDDC manager. For each VI workload domain (WLD) created using the SDDC manager, a separate vCenter is deployed in the MWLD. These vCenters manage the WLDs' hosts and use vSphere linked mode. As only fifteen vCenters can be linked per the current configuration maximums, up to 14 WLDs are currently supported in addition to the management domain. Before the SDDC manager can create a WLD, enough hosts (minimum three per WLD) need to be commissioned. Click on the button “Commission hosts” either in the Dashboard or the Inventory/Hosts view:
The hosts need to be prepared similarly to the VCF deployment. This includes the ESXi version, hardware configuration and network settings (e.g. DNS resolution), as shown in the checklist below. In a later post I will provide some helpful PowerCLI snippets to accelerate the host preparation.
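Until then, a few basic checks can already be performed in the ESXi shell of each host (the FQDN below is a hypothetical example):
esxcli system version get                 # ESXi build must match the VCF bill of materials
esxcli network ip dns server list         # DNS servers configured?
nslookup esx-w01-01.example.local         # forward lookup of the host's FQDN must succeed
esxcli network ip interface ipv4 get      # management IP configuration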
After clicking on “Proceed” the details of the hosts need to be provided. Either add each individual host manually (Select “Add new”) or perform a bulk commission by preparing and uploading a JSON file:
The JSON template provided looks like this:
{
    "hostsSpec": [
        {
            "hostfqdn": "Fully qual. domain name goes here",
            "username": "User Name goes here",
            "storageType": "VSAN/NFS",
            "password": "Password goes here",
            "networkPoolName": "Network Pool Name goes here"
        },
        {
            "hostfqdn": "Fully qual. domain name goes here",
            "username": "User Name goes here",
            "storageType": "VSAN/NFS",
            "password": "Password goes here",
            "networkPoolName": "Network Pool Name goes here"
        }
    ]
}
Not only the hosts' details (FQDN, credentials) and the storage type (preferably vSAN) need to be provided, but also the network pool to be used. Later on, license keys are required as well. A total of three license keys, for vSphere, vSAN and NSX, should be entered in the “Administration/License” screen of the SDDC manager. Network pools are created in the “Administration/Network settings” screen. In this case, VLAN IDs and subnets for vMotion and vSAN separate from the default pool (used by the MWLD) are used:
After the hosts are commissioned they show up in the “Unassigned hosts” tab:
Click on a host to show its details, e.g. manufacturer, model and storage capacity:
To create a new WLD use the “+ workload domain” button in the inventory:
Select your storage in the next dialog box. vSAN and NFS are fully supported out of the box (Fibre Channel can be added later on manually, but must be managed independently):
In the first step of the VI configuration wizard enter names for the WLD, the first cluster and the organization the domain is intended for:
Then enter a free IP address in the management subnet, an FQDN configured in your DNS servers, and the root password for the WLD's vCenter:
The most interesting part, if you are enthusiastic about VMware's SDN portfolio, is the networking screen, which allows you to choose between the legacy product NSX-V and NSX-T version 2.4, released in 2019. In both cases FQDNs, IP addresses and root/admin passwords for the NSX managers must be entered, as well as a VLAN ID used for the overlay transport (VXLAN for NSX-V; Geneve for NSX-T):
If you selected vSAN as the primary storage provider in the first step, you need to enter the PFTT (primary failures to tolerate) parameter in step four. “One failure to tolerate” means each data set is replicated once, similar to RAID 1. This means that any one of the three required hosts can fail at any point in time without data loss. If you have at least five hosts you can select PFTT=2, which means data is replicated twice, so two hosts may fail simultaneously. This is only the default setting, however; PFTT can also be set per object via storage policies later on.
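If you later want to verify what the hosts of the new cluster actually apply as the default vSAN policy, it can be queried read-only from the ESXi shell:
esxcli vsan policy getdefault    # shows e.g. hostFailuresToTolerate per object class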
In the next step, select the hosts to be used for the initial WLD creation. Further hosts can be added to the WLD later. The host selection screen previews the accumulated resources of the selected hosts:
In the License step select the license keys entered before from the drop down menus. Each license should provide enough capacity for each product (e.g. enough CPU socket count) and not be expired:
The last two screens show a review of all entered parameters and a preview of the component names which will be created:
After finishing the wizard, the creation progress can be tracked in the Tasks view at the bottom of the SDDC manager. If you click on the task, all of its subtasks and their status are shown below:
After some time the WLD creation tasks should succeed:
Open the overview of the newly created WLD under the “Inventory/Workload Domains” to show its status. The “Services” tab features links to the vCenter and the NSX-T manager GUIs:
After a host is removed from a workload domain or the entire WLD is deleted the hosts are found under the tab “Unassigned hosts” again, but their state shows “Need Cleanup”:
First select the checkbox on the left of each host needing cleanup and click on the button “Decommission selected hosts”.
Then log in to the SDDC manager via SSH (e.g. “ssh <user>@<sddc-manager-fqdn>”) and prepare a JSON file containing the hosts and their management credentials as follows:
Now run the following commands found in the VCF documentation to commence the cleanup:
su
cd /opt/vmware/sddc-support
./sos --cleanup-decommissioned-host /tmp/dirty_hosts.json
Afterwards, however, there is still the task of the network cleanup, which requires access to the Direct Console User Interface (DCUI). If the network cleanup is not performed, you will be presented with errors as shown below when trying to re-commission the hosts:
When logging into the ESXi management GUI in your browser you can see the leftover distributed virtual switch and its port groups from the previous WLD:
Perform the network cleanup by logging into the DCUI with the root user and then select “Network Restore Options”:
Then select “Restore Network Settings” option which resets any network settings and devices to the defaults:
Re-configuration of the management network settings like IP address, subnet mask, default gateway and VLAN is needed afterwards. Now all of the cleaned hosts are ready to be re-commissioned, which works as shown at the beginning of this post.
When a VMware Cloud Foundation deployment has been updated to the current version, as described previously, a few tasks should be done afterwards. First, the vSAN datastore disk format version might need an upgrade. To check this, head to the “Configure” tab of your vSAN cluster in vCenter and click on “vSAN/Disk Management”:
Of course you should first run the pre-check by clicking on the button on the right. If everything is working as it should, it will look like this:
Now you can click the “Upgrade” button, which informs you that this can take a while. You should also back up your data/VMs elsewhere, especially if you select “Allow Reduced Redundancy”, which speeds up the process:
As you can see now the disk format version has changed from “5” to “7”:
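The new version can also be double-checked from the shell of any host in the cluster, as every vSAN disk reports its on-disk format version:
esxcli vsan storage list | grep -i "format version"    # should now show 7 for all disks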
However, some vSAN issues are still displayed:
As this deployment is a “dark site”, meaning no internet access is available, the HCL database and Release catalog have to be updated manually.
The URL to download the 14.7 MB file can be found in a 2015 post by William Lam or in this KB article. The release catalog's URL is taken from another KB article; this file is less than 8 KB in size. After uploading both using the corresponding “Update from file” buttons, the screen should look like this:
The last remaining issue in this case was that the firmware version of the host bus adapter connecting the vSAN datastore devices could not be retrieved (“N/A”):
Since the firmware version listed in the host's iDRAC (see next screenshot) matches one of the “Recommended firmwares” above, I decided to simply hit “Silence alert”. Alternatively, one could look for an updated VIB file that allows the ESXi host to retrieve the firmware version from the controller.
One more effect of the upgrade from 3.0.1.1 to 3.5 is the appearance of three more VMs in vCenter. These are the old (6.5.x) instances of the Platform Services Controllers and the vCenter. New instances with version 6.7.x were deployed during the upgrade. After all settings had been imported from the old instances, they were apparently powered off and kept in case something went wrong. After some time, once you have confirmed that everything works as expected, those three VMs may be deleted from the datastore:
Sometimes, when a storage device (e.g. an SSD or HDD) has been used for a previous vSAN deployment or has other leftovers, it cannot be re-used right away (either for vSAN or a local VMFS datastore). When you try to format the drive as shown below, the error message “Cannot change the host configuration” is displayed:
The easiest fix is to change the partition scheme from GPT to MSDOS via the CLI (and back via the GUI); this approach has been described in the community before.
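A minimal sketch of the CLI part, with a hypothetical device name (double-check with the list which device you are about to relabel, as this wipes the existing partition table):
ls /vmfs/devices/disks/                                               # find the device name of the drive
partedUtil mklabel /vmfs/devices/disks/naa.55cd2e414f8a1234 msdos     # replace the GPT label with msdos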
However, even that may fail, e.g. because of the error “Read-only file system during write”. This can occur if the ESXi hypervisor finds traces of old vSAN deployments on the drive and refuses to overwrite them. In that case you first have to delete those traces manually. Log into the host in question as the root user and issue the required vSAN commands: these are the commands for listing all known vSAN disks, deleting an SSD (cache device) and deleting a (capacity) disk:
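In essence (the device names below are hypothetical placeholders; use the ones returned by the list command, and note that removing a cache device removes its entire disk group):
esxcli vsan storage list                                # list all disks claimed by vSAN on this host
esxcli vsan storage remove -s naa.55cd2e414f8a1234      # remove an SSD cache device (and its disk group)
esxcli vsan storage remove -d naa.5000c500a1b2c3d4      # remove a single capacity disk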