During the holidays a disk broke in our lab, which is a great opportunity to review what we need to do when a disk fails in vSAN and how to fix an error we hit when trying to remove the diskgroup.
Our platform:
- Version: vSphere 8.0
- Hosts: 3
- Diskgroups: 1 per host
- Disks: 1 NVMe cache disk and 2 NVMe capacity disks per diskgroup
Prerequisites
First we have to check whether we can replace only the faulty disk or whether we need to recreate the whole diskgroup. To decide what to do, remember:
- If we need to replace a cache disk, we always have to recreate the complete diskgroup.
- If encryption or deduplication is enabled we also need to recreate the diskgroup, regardless of whether the faulty disk is a cache disk or a capacity disk (both settings can be confirmed from the CLI, as shown after this list).
- In any other case, where the faulty disk is a capacity disk, we can replace only the faulty one without recreating the diskgroup.
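If we are not sure whether deduplication or encryption is enabled, a quick way to confirm it (assuming we have SSH access to a host) is to filter the output of esxcli vsan storage list, which reports both flags per disk; the pair of lines repeats for every disk:
[root@ast-esxi01:~] esxcli vsan storage list | grep -E "Deduplication|Encryption:"
Deduplication: false
Encryption: false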
Let’s also review some requirements before touching any hardware:
- We need to be extra careful when replacing a disk behind a RAID 0 controller; it’s recommended to check the vendor instructions before physically replacing the disk. Don’t forget it’s recommended to configure the disks in the controller as passthrough.
- When replacing a capacity disk it’s recommended to use the same model and size. If we cannot get the same size, it’s recommended to use the same model in the next bigger size available; just be careful with the balancing when mixing disk sizes.
- When we replace any kind of disk (capacity or cache) it’s recommended to use disks with the same or better endurance and performance ratings.
Diskgroup rebuild
The first thing we need to do is find the faulty disk. Enter the cluster configuration and open the vSAN disk management view; there we can check which host has the disk with errors.
Now we click on “Disks” on the host with the alarm.
In our case the missing disk is the cache one, which means we need to remove the diskgroup and recreate it.
Click on the ellipsis (the row of dots) next to the diskgroup label to open its options.
Click on “Remove”.
If the diskgroup is removed without alarms we can skip ahead to the point where we create the new diskgroup. In our case we got the error “General vSAN Error”.
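Before doing anything destructive, it’s worth a quick look at the host logs over SSH; the standard ESXi logs usually show why the removal failed (the exact messages vary, so this is just a starting point):
[root@ast-esxi01:~] tail -n 50 /var/log/vmkernel.log
[root@ast-esxi01:~] tail -n 50 /var/log/vobd.log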
Log in to the host with the faulty disk over SSH, where we can double-check which disk is missing and get the UUID of the diskgroup.
Here we can see that only the capacity disks remain in the list, the same information we got from the GUI. Write down the “VSAN Disk Group UUID” for the next step.
[root@ast-esxi01:~] esxcli vsan storage list
t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________D97806418B441B00
   Device: t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________D97806418B441B00
   Display Name: t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________D97806418B
   Is SSD: true
   VSAN UUID: 52389c3f-fa52-6905-39a7-c5adbfabcd9d
   VSAN Disk Group UUID: 52adc0bc-6971-fe30-4490-c89de109565e
   VSAN Disk Group Name:
   Used by this host: true
   In CMMDS: false
   On-disk format version: 17
   Deduplication: false
   Compression: false
   Checksum: 8683338899751340819
   Checksum OK: true
   Is Capacity Tier: true
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: true
   Creation Time: Mon Feb 7 13:20:14 2022
t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________53D306418B441B00
   Device: t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________53D306418B441B00
   Display Name: t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________53D306418B
   Is SSD: true
   VSAN UUID: 52e1b7f0-74c6-3ccb-c441-09d7faed25bf
   VSAN Disk Group UUID: 52adc0bc-6971-fe30-4490-c89de109565e
   VSAN Disk Group Name:
   Used by this host: true
   In CMMDS: false
   On-disk format version: 17
   Deduplication: false
   Compression: false
   Checksum: 3520512375592328882
   Checksum OK: true
   Is Capacity Tier: true
   Encryption Metadata Checksum OK: true
   Encryption: false
   DiskKeyLoaded: false
   Is Mounted: true
   Creation Time: Mon Feb 7 13:20:14 2022
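As an optional cross-check (the output format can vary between ESXi builds), the vdq utility also reports the vSAN state and UUID of every local disk, so the dead cache device should be missing or flagged there as well:
[root@ast-esxi01:~] vdq -q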
Remove the diskgroup using the UUID from the last step.
[root@ast-esxi01:~] esxcli vsan storage remove -u 52adc0bc-6971-fe30-4490-c89de109565e
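For reference, esxcli vsan storage remove also accepts -s <device> to remove a whole diskgroup by its cache disk and -d <device> to evict a single capacity disk. In our case neither helps, because the failed cache device is no longer visible, so the diskgroup UUID is the only handle left. If only a capacity disk had failed, something like this would have been enough (device name taken from the listing above, purely as an illustration):
[root@ast-esxi01:~] esxcli vsan storage remove -d t10.NVMe____WDC_WDS200T2B0C2D00PXH0__________________D97806418B441B00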
Using the GUI, check that the diskgroup has been removed. We should see that there are no disks in use.
We need to create a new diskgroup. Click on “Disks” and then “Create diskgroup”.
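If we prefer the CLI, or if the GUI misbehaves again, the diskgroup can also be created with esxcli vsan storage add. The device names below are placeholders; use the real ones reported by esxcli storage core device list on the host:
[root@ast-esxi01:~] esxcli vsan storage add -s <new_cache_device> -d <capacity_device_1> -d <capacity_device_2>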
Double-check that the diskgroup has been created correctly and is healthy.
Now we should see the 3 disks in use and in green.
The last step is to take the host out of maintenance mode. Objects with broken components should start repairing automatically (the 60-minute repair delay timer has most likely expired already).
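Exiting maintenance mode can also be done from the same SSH session, if we still have it open:
[root@ast-esxi01:~] esxcli system maintenanceMode set -e false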
Object rebuild
We have a new working diskgroup, so now we should check that all objects are back to “healthy”.
Click on the virtual objects view in the “Monitor” section of the cluster. If there are broken objects still waiting to be repaired, they will appear in red.
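The same information is available from the host shell; on recent vSAN versions this prints a summary of how many objects are in each health state (healthy, reduced availability, and so on):
[root@ast-esxi01:~] esxcli vsan debug object health summary get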
It’s recommended to wait until all objects are rebuilt automatically. This can be checked in the “Resyncing objects” section: if there are rebuild tasks pending, it shows how many GB are left and the estimated time to complete all the tasks.
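The resync progress is also exposed on the CLI, which is handy when we want to poll it from a script (available on reasonably recent vSAN releases):
[root@ast-esxi01:~] esxcli vsan debug resync summary get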
If not all objects are rebuilt automatically, or if we want to force an immediate resync, we can do it from the “Skyline Health” section. Retest first to have updated information, then select “vSAN object health” and click “Repair Objects Immediately”.
We should have all components in green now.
Extra: Recreate performance service
Sometimes, if we wait too long to bring the performance service back, the missing components cannot be recovered even if we force a repair of the objects using Skyline Health. In that case the only thing we need to do is deactivate the service and activate it again to rebuild its database.
To do it, go to vSAN Services inside the cluster configuration and edit the performance service; we can deactivate it or change its storage policy there.