When you configure a highly available application or service in a Windows Server 2008 failover cluster, the group failover threshold value that is displayed is incorrect. The default value is set equal to the number of nodes that are configured in the cluster.
For example, in a five-node cluster, any highly available application or service resource grouping has a default failover threshold set equal to five. By default, the Period (hours) setting is set to six hours. Therefore, when a highly available service or application group experiences a failure of one or more resources in the group, the service or application group tries to fail the group over to another node in the cluster up to five times in a six-hour period. After the fifth failover attempt, the service or application group remains in a "Failed" state.
In this situation, a total of n - 1 failovers occur in the six-hour period, where n is the number of nodes in the cluster. Therefore, in the five-node example, four failovers occur. The failover process works correctly. However, the number that appears in the Failover Cluster Management snap-in is incorrect. In this situation, the number that appears is 5.
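The arithmetic above can be sketched in Python. This is an illustrative model only and does not call any cluster API: the function name `default_failover_counts` is hypothetical, and the n - 1 rule is taken directly from the behavior that this article describes.

```python
def default_failover_counts(node_count):
    """Return (displayed_threshold, actual_failovers) for a default group.

    Hypothetical helper: it models only the behavior described in this
    article for Windows Server 2008, not a real cluster interface.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    displayed_threshold = node_count   # value shown in the snap-in (n)
    actual_failovers = node_count - 1  # failovers observed in the period (n - 1)
    return displayed_threshold, actual_failovers


# Five-node example from the text
displayed, actual = default_failover_counts(5)
print(f"displayed={displayed}, actual={actual}")
```

For the five-node example, the sketch reports a displayed value of 5 and an actual count of 4.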
No adverse effect is associated with the incorrect display in the Failover Cluster Management snap-in. Failover continues to function correctly. No action is required.
This is a known issue in Windows Server 2008 Failover Clustering. This problem is corrected in Windows Server 2008 R2.
Steps to reproduce the issue
1. In the Failover Cluster Management snap-in navigation pane, expand one of the managed clusters that has a highly available application or service configured.
2. Expand the Services and Applications category.
3. Select and then right-click one of the groups, and then click Properties.
4. Click the Failover tab, and then view the Maximum failures in the specified period setting.
The number that you see is equal to the number of nodes in the cluster.
5. To simulate the behavior, select and then right-click a resource in the group, and then click Simulate Failure of this Resource under More Actions.
The default restart behavior for a cluster resource is to try to restart the resource on the original owning node. Therefore, the failure that you have started causes the resource to fail and then to come back online on the owning node.
6. Start a failure again. This causes the group to go offline, and then to move to another node in the cluster.
7. Repeat step 5 and step 6 until the resource remains in a "Failed" state. Make sure that you count the number of times that the group comes online on other nodes in the cluster. The final count is equal to n - 1.
For additional testing, follow these steps:
1. Select and then configure another service or application.
2. Increase the Maximum failures in the specified period setting by one.
3. Select and then right-click a resource in the group, and then click Simulate Failure of this Resource under More Actions.
4. Start a failure again.
The failover count now matches the new setting.
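The two test procedures can be compared with a small Python model. This is a toy sketch under stated assumptions, not the cluster service's actual algorithm: the function `count_failovers` and its parameters are hypothetical, the default case uses the n - 1 count reported above, and an explicitly configured value is assumed to be honored exactly, as the additional test describes.

```python
def count_failovers(node_count, configured_threshold=None):
    """Count group moves until the group stays in the "Failed" state.

    Hypothetical model of the manual test. configured_threshold=None
    means that the default value is still in effect.
    """
    if configured_threshold is None:
        # Default case: the article reports n - 1 failovers.
        allowed = node_count - 1
    else:
        # Explicit case: the count matches the configured setting.
        allowed = configured_threshold

    failovers = 0
    state = "Online"
    while state != "Failed":
        # Each pass stands for one pair of simulated failures: the first
        # restarts the resource on the owning node, and the second either
        # moves the group to another node or leaves it failed.
        if failovers < allowed:
            failovers += 1
        else:
            state = "Failed"
    return failovers


print(count_failovers(5))     # default setting on a five-node cluster
print(count_failovers(5, 6))  # setting increased by one, from 5 to 6
```

Under these assumptions, the default case yields four failovers on a five-node cluster, and the explicitly configured case yields six.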
Article ID: 950804 - Last Review: September 11, 2010 - Revision: 2.1
Applies to
- Windows Server 2008 Datacenter without Hyper-V
- Windows Server 2008 Enterprise without Hyper-V
- Windows Server 2008 for Itanium-Based Systems
- Windows Server 2008 Datacenter
- Windows Server 2008 Enterprise