You may be aware that not so long ago I put together a blog post on how to spread your users across multiple file shares using FSLogix and some excellent scripting from fellow CTP Ryan Revord. The idea was that every time the workers were rebooted, the VHDLocations Registry value would re-order itself based on the highest amount of free disk space, so that new users were always directed to the least-loaded file share in the list.
This has worked really well for us, and an added boon of the feature was that if a file share in the list was unavailable, the user would simply create a new profile on the next available one. Obviously if you are using Cloud Cache or other methods for resiliency the process would be different, but in my particular use case, where resiliency wasn’t really required, we were safe in the knowledge that if we lost a file share or an availability zone, users would simply create new vanilla profiles on one of the other file shares and continue.
Unfortunately, a Microsoft blog post was brought to my attention recently that seemed to indicate that using multiple entries in the VHDLocations Registry value would not provide failover in this fashion. As I have many thousands of users in an environment with a large number of file shares, this got my attention pretty quickly. Joining up with others in the community and at Microsoft, I did some testing and eventually ascertained that this appeared to be a bug: an older version of the FSLogix software still behaved as originally intended. Below are the results of my rudimentary testing on that old version of FSLogix, which showed the “failover” style behaviour when a file share was unavailable.
- If a profile exists in \\FS\Share1 and \\FS\Share2 and \\FS\Share2 is offline, the profile from \\FS\Share1 is used
- If a profile exists in \\FS\Share1 and \\FS\Share2 and \\FS\Share1 is offline, the profile from \\FS\Share2 is used
- If no profile exists in \\FS\Share1 and \\FS\Share2 and \\FS\Share1 is offline, a new profile is created in \\FS\Share2
- If no profile exists in \\FS\Share1 and \\FS\Share2 and \\FS\Share2 is offline, a new profile is created in \\FS\Share1
- If a profile exists in \\FS\Share1 only, and \\FS\Share1 is offline, a new profile is created in \\FS\Share2
- If a profile exists in \\FS\Share2 only, and \\FS\Share2 is offline, a new profile is created in \\FS\Share1
However, repeating the same tests on the latest version of FSLogix showed different results. If any file share in the VHDLocations list was offline, users with profiles on that file share, or on any file share after it in the list, were unable to log on, and new users were unable to log on either. Testing in the lab showed that when the first file server in the list was offline, users with profiles on the secondary, and users with no pre-existing profile, found themselves blocked from logging on with an error.
Unfortunately, we’ve had confirmation from Microsoft that this bug was introduced back in July of 2020. (Yes, my environment has been hanging by a thread for this long!) To pour salt on the wound, the original code changes were required to fix some critical logon issues, so it isn’t simply a case of Microsoft being able to roll it back. If you did decide to roll back, then going back seven months or so is going to remove a lot of enhancements and bug fixes (particularly around OneDrive). Microsoft are going to address this, but the point I’m trying to make is that it isn’t going to happen quickly, as they will have to scope out the required changes so as not to reintroduce the problems they were attempting to fix in the first place.
So, if you’ve used the method we put together in your environments, I apologize for bringing you this bad news. What you need to do now is take stock of your options. This also applies if you’re using multiple VHDLocations entries for failover in any way, not just via the scripted method, so if that’s you, please read on!
Firstly, take note of the fact that the potential scope of failure has increased, possibly greatly. If you had users spread over ten file shares before the bug was introduced, you could lose one file share and the users homed there would simply receive a new, vanilla profile. But now, if you lost the sixth file share in that list, the users on file shares 6-10 would all be unable to log on, and neither would any users who didn’t already have a profile. If you’ve got FSLogix configured to allow logon when it can’t mount a profile, then potentially those users could all log on with local profiles, but that puts a new strain on your local storage and, in many Citrix/RDSH environments, means that the users would get a new profile every time they log on as they hit different servers or VMs. In summary, your failure domain has potentially broadened quite dramatically.
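To make that change in scope concrete, here is a toy PowerShell model of the two behaviours described in this post. It is only a sketch of my test results, not FSLogix code, and the function name `Get-LogonOutcome`, its parameters and the ten share names are made up for illustration.

```powershell
# Toy model of the logon behaviour described in this post (not FSLogix code).
# Get-LogonOutcome, its parameters and the share names are hypothetical.
function Get-LogonOutcome {
    param(
        [string[]]$Shares,          # VHDLocations entries, in list order
        [string[]]$Offline = @(),   # entries that are currently unreachable
        [string]$HomeShare,         # share holding the user's profile; empty for a new user
        [switch]$PreJuly2020        # model the older "failover" behaviour
    )

    if ($PreJuly2020) {
        # Old behaviour: offline shares are simply skipped.
        $online = @($Shares | Where-Object { $Offline -notcontains $_ })
        if ($online.Count -eq 0) { return 'Logon blocked' }
        if ($HomeShare -and ($online -contains $HomeShare)) { return "Existing profile on $HomeShare" }
        return "New profile on $($online[0])"
    }

    # Current behaviour: the search stops at the first offline share it reaches.
    foreach ($share in $Shares) {
        if ($Offline -contains $share) { return 'Logon blocked' }
        if ($share -eq $HomeShare)     { return "Existing profile on $share" }
    }
    # Every share was online and no existing profile was found.
    return "New profile on $($Shares[0])"
}

$shares = 1..10 | ForEach-Object { "\\FS\Share$_" }
Get-LogonOutcome -Shares $shares -Offline '\\FS\Share6' -HomeShare '\\FS\Share3'   # Existing profile on \\FS\Share3
Get-LogonOutcome -Shares $shares -Offline '\\FS\Share6' -HomeShare '\\FS\Share8'   # Logon blocked
Get-LogonOutcome -Shares $shares -Offline '\\FS\Share6'                            # Logon blocked (new user)
Get-LogonOutcome -Shares $shares -Offline '\\FS\Share6' -HomeShare '\\FS\Share6' -PreJuly2020   # New profile on \\FS\Share1
```

With the sixth share down, only users homed on shares 1-5 get through in the current model, whereas the older model hands the affected users a fresh profile on the first share that responds.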
Secondly, it’s important to think about how you would normally respond to and recover from the failure of a file share or a group of file shares (such as in an AZ loss). If your monitoring and recovery capability is swift enough to restore file shares to service before a major interruption occurs, then you may not be as exposed as you think. Users can be configured to fall back to local profiles until the file share is restored, as long as you can prove that your response and recovery would deliver a quick fix.
Thirdly – what to do until such time as Microsoft can offer a fix? You can a) roll back to the pre-July 2020 version, which means you’d lose a lot of functionality and fixes, b) trust your monitoring and recovery to alert and repair a broken file share, c) hope your file shares all stay online (hey, it’s worked for me for this long!), or d) use object-specific Registry values. You could also try editing the script so that it runs more regularly and re-orders the list when it detects an offline file share. This is more challenging, but I believe James Kindon has done some work that may help you in this case (see here for his efforts).
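As a rough idea of what the "detect an offline share" part of such a script might look like, here is a minimal sketch. It is not Ryan's script or James's work: `Select-OnlineShare` and its `-Probe` parameter are hypothetical names, the share paths are placeholders, and I'm assuming a simple `Test-Path` is a good enough reachability check for your environment.

```powershell
# Sketch only: keep the VHDLocations entries that respond, in their existing order.
# Select-OnlineShare and the -Probe parameter are hypothetical names for illustration.
function Select-OnlineShare {
    param(
        [string[]]$Shares,
        # Reachability check; swap in something stricter if Test-Path is too slow or too lenient.
        [scriptblock]$Probe = { param($path) Test-Path -LiteralPath $path }
    )
    @($Shares | Where-Object { & $Probe $_ })
}

# Example with placeholder shares, run from a scheduled task or startup script:
# $candidates = '\\FS\Share1', '\\FS\Share2', '\\FS\Share3'
# $online = @(Select-OnlineShare -Shares $candidates)
# if ($online.Count -gt 0) {
#     Set-ItemProperty -Path 'HKLM:\SOFTWARE\FSLogix\Profiles' -Name 'VHDLocations' -Value $online -Type MultiString
# }
```

Bear in mind that dropping a share from the list means the users homed there would get a new vanilla profile elsewhere, which is the old failover behaviour rather than any kind of resiliency.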
Well, the obvious option, and most likely the official line you will get out of Microsoft, is to use object-specific Registry values to split out your file shares to dedicated groups of users. Now this doesn’t achieve what we were previously achieving with Ryan’s script – which was spreading our users dynamically across the file shares in our estate without having to subdivide them into groups – but it does reduce the scope of interruption when a single file share goes down. If you lose a file share in this configuration, you will only affect the users who are homed on that location, rather than users on subsequent file shares and new users as well.
I’m not generally a fan of using the object-specific Registry values – it means I have to split users into groups, manage the groups, and some of those groups may fill file shares up faster than others – but in this situation the only other choice I would have is to accept the fact that I’m horribly exposed. So in short – it’s better than what you’ve got until Microsoft issue a fix!
Configuring object-specific settings
These settings work by reading subkeys in the Registry under the keys that normally hold your configuration values. You can do this for either Profile Containers or Office Containers. You create subkeys under them named for the SIDs of the AD group or the AD user object you want to apply them to (for obvious reasons, I’d recommend using an AD group rather than a user). When FSLogix looks for settings to apply to a user logging on, it checks in this order (the two paths in each step refer to Profile Containers and Office Containers, respectively):
1. If a key exists for HKLM\Software\FSLogix\Profiles\ObjectSpecific\[USERSID] or HKLM\Software\Policies\FSLogix\ODFC\ObjectSpecific\[USERSID], the configuration values are read and applied from here
2. If a key exists for HKLM\Software\FSLogix\Profiles\ObjectSpecific\[GROUPSID] or HKLM\Software\Policies\FSLogix\ODFC\ObjectSpecific\[GROUPSID], the configuration values are read and applied from here
3. If a key exists for HKLM\Software\FSLogix\Profiles or HKLM\Software\Policies\FSLogix\ODFC, the configuration values are read and applied from here
So in summary, user-specific first, group-specific second, machine-specific third.
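If it helps to see that order as code, the sketch below walks the same three steps. It only illustrates the precedence described above and is not how FSLogix itself is implemented. `Resolve-FslogixKey`, its parameters and the SIDs are hypothetical, and where a user belongs to several groups that each have a key, the sketch just takes the first one it is given, since I haven't tested how FSLogix breaks that tie.

```powershell
# Sketch of the lookup order described above, not FSLogix's actual code.
# Resolve-FslogixKey, its parameters and the SIDs are hypothetical.
function Resolve-FslogixKey {
    param(
        [string]$BaseKey,            # e.g. HKLM:\SOFTWARE\FSLogix\Profiles
        [string]$UserSid,
        [string[]]$GroupSids = @(),
        # Lets the lookup be tried against a plain list of paths instead of the real Registry.
        [scriptblock]$KeyExists = { param($key) Test-Path -LiteralPath $key }
    )
    $candidates  = @("$BaseKey\ObjectSpecific\$UserSid")                        # 1. user-specific
    $candidates += @($GroupSids | ForEach-Object { "$BaseKey\ObjectSpecific\$_" })  # 2. group-specific
    $candidates += $BaseKey                                                      # 3. machine-wide
    foreach ($key in $candidates) {
        if (& $KeyExists $key) { return $key }
    }
}

# Pretend only the machine-wide key and one group key exist.
$existing = @(
    'HKLM:\SOFTWARE\FSLogix\Profiles',
    'HKLM:\SOFTWARE\FSLogix\Profiles\ObjectSpecific\S-1-5-21-0-0-0-1111'
)
Resolve-FslogixKey -BaseKey 'HKLM:\SOFTWARE\FSLogix\Profiles' `
    -UserSid 'S-1-5-21-0-0-0-5000' -GroupSids 'S-1-5-21-0-0-0-1111' `
    -KeyExists { param($key) $existing -contains $key }
# Returns the group-specific key, because no user-specific key exists.
```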
As I said, it doesn’t make much sense to use user-specific, so let’s just quickly demonstrate what would be needed to divide users into two AD groups and assign them a different VHDLocations value based on that group. Obviously, you can do this with any of the FSLogix configuration settings; you’re not limited to that one.
Firstly, split your users into groups as required. You may have some logic around this (closest file share), or you may simply want to do it at random. If you’ve got an existing environment using the method that’s now gone wrong, then you will need to split them into groups based on the file share they’re currently located on. New users will have to be assigned a group, which ideally should be the file share which is currently least loaded. You can see why I preferred using the script to doing it this way 🙂
Next, you need to get the SIDs for the group names, which you can easily do with PowerShell:
$AdObj = New-Object System.Security.Principal.NTAccount("ADGroupNameHere")
$strSID = $AdObj.Translate([System.Security.Principal.SecurityIdentifier])
$strSID.Value
Repeat as many times as necessary, noting the SIDs.
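If you have more than a couple of groups, the same lookup can be wrapped in a loop. This is just a convenience sketch around the snippet above; `Get-GroupSid` and the group names in the example are hypothetical, and it needs to run on a domain-joined Windows machine that can resolve the groups.

```powershell
# Sketch: resolve several AD groups to SIDs in one pass.
# Get-GroupSid and the example group names are hypothetical.
function Get-GroupSid {
    param([string[]]$GroupName)
    foreach ($name in $GroupName) {
        $account = New-Object System.Security.Principal.NTAccount($name)
        [pscustomobject]@{
            Group = $name
            SID   = $account.Translate([System.Security.Principal.SecurityIdentifier]).Value
        }
    }
}

# Example:
# Get-GroupSid -GroupName 'FSLogix-Share1-Users', 'FSLogix-Share2-Users'
```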
Once you’ve noted the SIDs, you then need to create Registry keys on your target machines. Computer Config | Preferences | Windows Settings | Registry is the obvious way to do this. Bear in mind that if you have previously configured FSLogix GPOs to set these settings, you don’t actually need to set them to Not Configured for this to work. As mentioned above, the object-specific entries take priority, so you don’t need to edit the GPOs at all. Add the SID to the key path as below and create your different entries in the respective values.
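If you want to try the keys out on a single test machine before building the Group Policy Preferences items, something like the sketch below would create them locally. Treat it as an assumption-laden example rather than the method used in this post: `Set-ObjectSpecificVhdLocation` is a hypothetical helper, the SIDs and share paths are placeholders, it targets the Profile Container path only, and it needs an elevated session to write to HKLM.

```powershell
# Sketch for a single test machine; Group Policy Preferences remain the way to roll this out.
# Set-ObjectSpecificVhdLocation is a hypothetical helper; SIDs and paths below are placeholders.
function Set-ObjectSpecificVhdLocation {
    param(
        [string]$Sid,
        [string[]]$VhdLocations,
        [string]$Root = 'HKLM:\SOFTWARE\FSLogix\Profiles'
    )
    $key = "$Root\ObjectSpecific\$Sid"
    # Create the SID-named subkey (and ObjectSpecific itself) if it is missing.
    if (-not (Test-Path -LiteralPath $key)) { New-Item -Path $key -Force | Out-Null }
    # Written as a multi-string so further paths can be added to the value later.
    New-ItemProperty -Path $key -Name 'VHDLocations' -PropertyType MultiString `
        -Value $VhdLocations -Force | Out-Null
}

# Example with placeholder SIDs, one file share per group:
# Set-ObjectSpecificVhdLocation -Sid 'S-1-5-21-1111111111-2222222222-3333333333-1001' -VhdLocations '\\FS\Share1'
# Set-ObjectSpecificVhdLocation -Sid 'S-1-5-21-1111111111-2222222222-3333333333-1002' -VhdLocations '\\FS\Share2'
```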
As I said before, you can configure as many settings as you wish in this way – I’m just doing VHDLocations because of its obvious relevance to the problem you may be facing!
Once you’ve done this, you can wait for replication to proceed, and you should then be in a position where you aren’t left terribly exposed by the loss of a single file share. Once Microsoft get around to fixing the code, you can go back to the previous method, and hopefully they will now be aware that, despite what they thought, people were using VHDLocations for rudimentary failover purposes.
Big thanks due to Jim Moyle for getting the feedback from Microsoft and letting me know about this, and obviously to James Richards for alerting us all to it in the first place.