Skype for Business – WARNING: Standard Edition Pool Failover Disaster

I have been setting up a Standard Edition pool pair for disaster recovery for a customer and wanted to share my experiences around failover. The deployment and migration of services, users and data from the legacy Lync installation went absolutely fine and without issues. I successfully paired the two Skype for Business Standard Edition servers together, and both the backup and replication services were happily synchronising data.

A few prudent PowerShell commands to prove correct replication returned all values as expected. Once I was happy with the configuration, I wanted to perform a controlled failover from the primary to the backup pool, including users and CMS, to prove the failover process worked as expected. At this point I would like to thank Chris Hayward (@WeakestLync) for warning me of a potential issue during failover that screws up your CMS.

It turns out that when performing the failover, Skype for Business leaves the CMS active on both servers! However, this is not immediately apparent, so I wanted to detail how I identified that this was the case and what I had to do to resolve it. I don’t have any screenshots of the problem because I was too busy trying to fix it, so I will do my best to explain.

Failing Over

Performing the failover, I followed the steps listed on TechNet (https://technet.microsoft.com/en-us/library/jj204678(v=ocs.15).aspx), as they have worked fine in previous versions and the process is the same for Skype for Business.

When I ran the Invoke-CsManagementServerFailover cmdlet with the –WhatIf parameter, the results correctly showed that the CMS was on the primary server and would be failed over to the backup pool server.

Running Get-CsManagementStoreReplicationStatus returned TRUE for every server in the topology.

Running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus returned the primary server as the Active Master and Active File Transfer Agent, with the backup server listed in the Active Replicas list as expected.

Running Get-CsService –CentralManagement showed that the primary server was active for the CMS and the backup server as false as expected

Downloading the current topology showed the primary server as the active CMS.

Running Get-CsBackupServiceStatus –PoolFqdn fe1.domain.local reported the server in a Normal state, and the same was true for the backup server.

To ensure that the CMS was properly up to date on both servers, I then ran Invoke-CsBackupServiceSync –PoolFqdn fe1.domain.local and checked for any replication issues in Event Viewer and with CLS logging using the HADR scenario. Everything looked positive.

One last step to ensure the servers were up to date was to force replication to the RTCLOCAL databases on each server by running the Invoke-CsManagementStoreReplication command.
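As a rough sketch, the pre-failover checks described above could be bundled into one read-only script. The pool names are the example names used in this post, and the property names selected (ReplicaFqdn, UpToDate, LastStatusReport) are what I would expect the replication status objects to expose, so adjust them to whatever your deployment returns.

```powershell
# Pre-failover health check (sketch). Nothing here changes the topology.
Import-Module SkypeforBusiness

$primaryPool = "fe1.domain.local"   # example name from this post
$backupPool  = "fe2.domain.local"   # example name from this post

# Every replica should report UpToDate = True
Get-CsManagementStoreReplicationStatus |
    Select-Object ReplicaFqdn, UpToDate, LastStatusReport

# Active Master, Active File Transfer Agent and the Active Replicas list
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

# Which pool currently hosts the active CMS
Get-CsService -CentralManagement

# Backup service state for both pools in the pair
foreach ($pool in $primaryPool, $backupPool) {
    Get-CsBackupServiceStatus -PoolFqdn $pool
}

# Dry run only: reports what the failover would do without doing it
Invoke-CsManagementServerFailover -WhatIf
```

Saving the output of a run like this before the failover gives you a known-good baseline to compare against afterwards.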

Once I was absolutely sure I was in a position to test this, having re-run the Get commands above to triple check everything, I decided on Chris’s advice to take a backup of the XDS and LIS databases, just in case.

Export-CsConfiguration –Filename c:\cms.zip

Export-CsLisConfiguration –Filename c:\lis.zip

Now I went ahead and followed the TechNet procedure by setting the Edge server next hop to the backup server using Set-CsEdgeServer –identity edgepool.domain.local –Registrar fe2.domain.local command.

Next, I ran Invoke-CsManagementServerFailover –BackupSqlServerFqdn fe2.domain.local –BackupSqlInstanceName RTC –Force

Here is where the problems started…

During the failover, the verification process failed to verify the CMS on the backup server, with the following error:

“Backup Central Management Store state is Active, the expected status is Backup. Note that if the local replica is out of date, the topology document may be obsolete. Ensure that the local replica is up to date, and run Test Management Server Cmdlet. Central management server verification failed. Verification execution will be retried once a minute for 14 more minutes. Since Failover has already finished, the user can press Ctrl + c to end the current verification task at any time, and Failover will not be affected”

I let all the retries complete, but none succeeded.

I then ran the following commands to see what had actually happened and what state the CMS was now in.

Running Get-CsManagementStoreReplicationStatus did not return any values at all.

Running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus did not return any values at all.

Running Get-CsService –CentralManagement showed that the backup server was the ACTIVE server for the CMS

Running Get-CsManagementConnection returned the primary server as the ACTIVE CMS

Downloading the current topology showed the primary server STILL as the ACTIVE CMS.

Running Get-CsBackupServiceStatus –PoolFqdn fe1.domain.local reported the server in an Error state, and the same was true for the backup server.

So I double checked the properties of the Active Directory Service Connection Point (SCP) for Skype for Business using ADSI Edit under the Configuration context:

CN=<topology guid>,CN=Topology Settings,CN=RTC Service,CN=Services,CN=Configuration,DC=domain,DC=local

The msRTCSIP-BackEndServer attribute was set to the primary server fe1.domain.local/RTC
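If you would rather not click through ADSI Edit, the same attribute can be read from PowerShell. This is a minimal read-only sketch that assumes the SCP lives under the Topology Settings container shown above and simply lists every object there that carries the msRTCSIP-BackEndServer attribute.

```powershell
# Read the CMS Service Connection Point from AD (read-only sketch).
$rootDse  = [ADSI]"LDAP://RootDSE"
$configNc = $rootDse.configurationNamingContext.Value
$baseDn   = "CN=Topology Settings,CN=RTC Service,CN=Services,$configNc"

$searcher = New-Object System.DirectoryServices.DirectorySearcher
$searcher.SearchRoot = [ADSI]"LDAP://$baseDn"
$searcher.Filter     = "(msRTCSIP-BackEndServer=*)"
[void]$searcher.PropertiesToLoad.Add("msRTCSIP-BackEndServer")

foreach ($result in $searcher.FindAll()) {
    # SearchResult property names come back in lower case
    [pscustomobject]@{
        Path          = $result.Path
        BackEndServer = $result.Properties["msrtcsip-backendserver"][0]
    }
}
```

The BackEndServer value is what I compared against the output of Get-CsService –CentralManagement and Get-CsManagementConnection; in my case they disagreed.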

At this point I did a lot of panicking and head scratching, using various commands, restarting services and so on, to try to get the Active server to show the backup server and to restart replication. By restarting the Replica Replicator Agent and File Transfer Agent services on both front end servers, I managed to get some results back from the following commands.

Running Get-CsManagementStoreReplicationStatus returned all servers replication status as FALSE

Running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus returned values for the Active Replicas, but nothing for the Active Master Fqdn or Active File Transfer Agent Fqdn, so replication was never going to work.

Attempting to set the SCP using Set-CsManagementServer –Identity fe2.domain.local did update the SCP in AD, but did not set this server as the Active Master or Active File Transfer Agent.

At this point there were no errors being reported in the Lync application log and users had full feature access.

I then decided to take a look at the XDS database in SQL Server Management Studio to see what it was reporting as the master server, so I opened the database and the table dbo.Component.

In this table it showed 3 entries – I was expecting only 2 as I have only 2 CMS servers!! The entries showed the following

Fqdn              Component  Registered
Fe1.domain.local  Master     0
Fe2.domain.local  Master     1
Fe1.domain.local  Fta        1

How it should have looked

Fqdn              Component  Registered
Fe1.domain.local  fta        1
Fe2.domain.local  Master     1
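For anyone who wants to inspect this table without opening Management Studio, here is a read-only sketch using Invoke-Sqlcmd. It assumes the SqlServer (or older SQLPS) PowerShell module is available, that the CMS database is named xds on the RTC instance, and that the column names match the ones shown above. It only runs a SELECT; as I found out, editing this table by hand is a very bad idea.

```powershell
# Read-only look at the CMS component registrations (sketch).
# Assumes Invoke-Sqlcmd is available and the column names match this post.
$instance = "fe1.domain.local\RTC"   # example name from this post

$rows = Invoke-Sqlcmd -ServerInstance $instance -Database "xds" `
    -Query "SELECT Fqdn, Component, Registered FROM dbo.Component"

$rows | Format-Table Fqdn, Component, Registered -AutoSize

# More than one Master row is the symptom described in this post
$masters = @($rows | Where-Object { $_.Component -eq "Master" })
if ($masters.Count -gt 1) {
    Write-Warning "Found $($masters.Count) Master entries; expected 1."
}
```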

So at this point it looked as though the XDS database was ACTIVE ON BOTH NODES. Knowing I already had a backup, I decided to try to manipulate this table to turn it back into the expected state. What a bad move that was. It only made things worse by adding a new line entry, like so:

Fqdn              Component  Registered
Fe1.domain.local  Master     0
Fe2.domain.local  Master     1
Fe1.domain.local  Fta        1
Fe1.domain.local  Master     1

Now faced with the total loss of the CMS database I had no choice but to revert my changes and restore the CMS from the backup. The below process details my recovery steps:

  1. On the primary server ran the following command Set-CsManagementServer -Identity fe1.domain.local to update the SCP back to the primary server
  2. On the primary server ran the Install-CsDatabase –CentralManagementDatabase –SqlServerFqdn fe1.domain.local –ForInstance RTC –Clean
  3. On the backup server ran the Install-CsDatabase –CentralManagementDatabase –SqlServerFqdn fe2.domain.local –ForInstance RTC –Clean
  4. Stopped the replication services and backup service on both servers
  5. On the primary server ran the Import-CsConfiguration –Filename c:\cms.zip to import the CMS data from my backup
  6. On the primary server ran the Import-CsLisConfiguration –Filename c:\lis.zip to import the LIS data from my backup
  7. Ran Enable-CsTopology
  8. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration store using the data from the CMS on the Primary Server
  9. Ran Step 2 Install / Remove components on the primary server
  10. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration store using the data from the CMS on the backup Server
  11. Ran Step 2 Install / Remove components on the backup server
  12. Ran Get-CsManagementConnection, which showed the primary server as the active node
  13. Ran Get-CsService –CentralManagement, which showed the primary server as the active node and the backup as false (expected)
  14. Started the backup and replica services on both front end servers
  15. Ran Invoke-CsBackupServiceSync –PoolFqdn fe1.domain.local
  16. Ran Invoke-CsManagementStoreReplication
  17. Ran Get-CsManagementStoreReplicationStatus and the results returned TRUE
  18. Ran Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus and the active master and active file transfer agent were now set to the primary server
  19. Event viewer showed no errors and replication was now happening OK
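Steps 15 to 18 above involve a fair amount of waiting and re-running commands, so here is a sketch of how that part could be scripted. The 10 minute timeout and 30 second polling interval are arbitrary values I picked for illustration, and the check treats an empty result as unhealthy, because an empty result is exactly what I saw when replication was broken.

```powershell
# Kick off a sync and wait for replication to settle (sketch).
Import-Module SkypeforBusiness

Invoke-CsBackupServiceSync -PoolFqdn "fe1.domain.local"   # example name from this post
Invoke-CsManagementStoreReplication

$deadline = (Get-Date).AddMinutes(10)   # assumed timeout
do {
    Start-Sleep -Seconds 30             # assumed polling interval
    $status  = @(Get-CsManagementStoreReplicationStatus)
    $stale   = @($status | Where-Object { -not $_.UpToDate })
    # No rows at all is a failure too, not a success
    $healthy = ($status.Count -gt 0) -and ($stale.Count -eq 0)
} while (-not $healthy -and (Get-Date) -lt $deadline)

if ($healthy) {
    Write-Output "All replicas report UpToDate = True."
} else {
    Write-Warning "Replication has not converged yet."
    $stale | Select-Object ReplicaFqdn, UpToDate
}

# Confirm the Active Master and Active File Transfer Agent are populated
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
```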

So the biggest lesson learned here: take a backup of the CMS before failing over the pool, just in case this happens to you. Without it I am not sure I would still have been in a job!

Workaround Theory

As I am not the only one who has experienced this issue, it could be a problem with Skype for Business itself. I feel that if I try to fail over the CMS again the same problem will occur. So I have come up with a theory that I am going to attempt to qualify in a lab, but I welcome any suggestions.

1. Create a daily backup of the XDS and LIS databases and store them on the backup pool server (done with PowerShell). Something like this gives me 5 points of recovery:

# CMS backup script workaround
# Set backup location (UNC path must be quoted)
$backupfolder = "\\fe2.domain.local\CMS_BACKUP"
# Days to keep
$retention = 5
# Backup file names, stamped with today's date
$date = Get-Date -Format "dd-MM-yy"
$cmsfilename = "CMS-$date.zip"
$lisfilename = "lis-$date.zip"
# Backup store cleanup: remove files older than the retention period
$limit = (Get-Date).AddDays(-$retention)
Get-ChildItem -Path $backupfolder -Recurse -Force | Where-Object { !$_.PSIsContainer -and $_.CreationTime -lt $limit } | Remove-Item -Force
# Export the CMS and LIS configuration to the backup folder
Import-Module SkypeforBusiness
Export-CsConfiguration -Filename "$backupfolder\$cmsfilename" -ErrorAction SilentlyContinue
Export-CsLisConfiguration -Filename "$backupfolder\$lisfilename" -ErrorAction SilentlyContinue

(Export-CsRgsConfiguration too if you have Response Groups set up)
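To make the backup genuinely daily, the script can be registered as a scheduled task. The sketch below assumes the script above has been saved as C:\Scripts\Backup-CMS.ps1 (a hypothetical path), that the server has the ScheduledTasks module (Windows Server 2012 or later), and that 02:00 is a quiet time. The task name is also made up.

```powershell
# Register the CMS backup script as a daily scheduled task (sketch).
# Script path, task name and start time are assumptions for illustration.
$scriptPath = "C:\Scripts\Backup-CMS.ps1"

$action = New-ScheduledTaskAction -Execute "powershell.exe" `
    -Argument "-NoProfile -ExecutionPolicy Bypass -File `"$scriptPath`""

$trigger = New-ScheduledTaskTrigger -Daily -At "02:00"

Register-ScheduledTask -TaskName "CMS Daily Backup" `
    -Action $action -Trigger $trigger `
    -Description "Exports CMS and LIS configuration to the backup pool server"
```

Registered like this, the task runs as the user who created it and only while that user is logged on. For real use you would want it to run under a service account that has rights to export the CMS and write access to the backup share.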

2. When failing over to the backup pool, set the Edge server(s) next hop and run Invoke-CsPoolFailover to fail the users across.

3. Then repeat the steps to reinstall the CMS on the backup server in a clean state and reset the SCP. At this point Skype for Business should (in my mind) treat the backup server as the master.

4. When failing back, repeat the process on the primary.

I guess the best method here is to move the CMS database to a SQL cluster away from the Standard Edition servers, and that is probably going to be my recommendation to customers moving forward.

Anyway, the moral of this story is to make sure you have a backup and to test failover in a controlled manner (being aware of this issue) before having to rely on it for real. If anyone has any suggestions, wants to share their experiences or receives information from Microsoft about this, please share in the comment section below.
