Another quick post today.
I was having issues with SQL cluster monitoring at a customer. The SQL server was quite beefy: an HP server with 1.5 TB of memory running Windows Server 2012, hosting 25 SQL instances, of which 13 were failover cluster instances and the rest were part of availability groups.
As SCOM has to monitor this server at every level, you can imagine that with the number of roles (over 1,000 hosted databases) and cluster resources, it was hammering WMI hard. Very hard.
Although the monitoring intervals had already been adjusted for a cluster of this size, problems started to appear: SCOM eventually caused cluster resources to deadlock, and roles started failing over to the other node.
The culprit in this case was Microsoft.Windows.Server.MonitorClusterDisks.vbs, which kept timing out because it ran for over 300 seconds.
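To see why 100+ cluster disks blow through a 300-second script timeout, here's a rough back-of-the-envelope sketch. The per-disk WMI latency is a hypothetical figure I picked for illustration, not something measured on this server; the point is that serial per-disk enumeration scales linearly with disk count.

```python
# Back-of-the-envelope estimate of the cluster disk script's runtime.
# The per-query latency is a hypothetical assumption; on a loaded
# MSCluster WMI namespace it can easily reach several seconds per disk.
SCRIPT_TIMEOUT_S = 300        # SCOM terminates the script after 300 seconds
CLUSTER_DISKS = 100           # this environment had 100+ cluster disks
SECONDS_PER_WMI_QUERY = 3.5   # assumed latency per disk (illustrative only)

total = CLUSTER_DISKS * SECONDS_PER_WMI_QUERY
print(f"Estimated runtime: {total:.0f}s (timeout: {SCRIPT_TIMEOUT_S}s)")
print("Times out" if total > SCRIPT_TIMEOUT_S else "Completes in time")
```

Even at a modest few seconds per disk, the math never works out once you pass roughly 85 disks at that latency, which matches the behaviour seen here.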
I found this post by Kevin Holman, a big name in the SCOM community. He says that the WMI namespaces for clusters are apparently poorly optimized and not designed to handle this many objects. Another possibility is that the script simply isn't designed to work with that number of cluster disks (100+).
In my case, I didn’t really need to monitor the cluster disks, as all of them contained SQL databases, and the SQL Management Pack monitors free space on a disk hosting a database by default (provided autogrow is enabled). So I disabled the Cluster Disk discovery for the SQL servers.
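For reference, disabling a discovery like this comes down to an override stored in an unsealed override management pack; the Operations console generates a fragment along these lines when you choose "Overrides → Disable the Discovery" and scope it to a group of SQL servers. The ID, Context, and Discovery values below are hypothetical placeholders, not the real identifiers from the Windows Server management pack:

```xml
<!-- Sketch of an override MP fragment; all IDs are hypothetical placeholders. -->
<DiscoveryPropertyOverride ID="DisableClusterDiskDiscoveryForSQL"
                           Context="SQLServersGroup"
                           Discovery="ClusterDiskDiscovery"
                           Property="Enabled">
  <Value>false</Value>
</DiscoveryPropertyOverride>
```

In practice you would set this from the console rather than author the XML by hand; the scoping to a group is what keeps cluster disk monitoring intact on non-SQL clusters.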
Post-change analysis showed that the cluster service's CPU usage dropped significantly.
Here’s a graph of the CPU usage (keep in mind, the scale is 0.1).
When cluster disk monitoring was enabled:
After disabling disk monitoring:
Although Microsoft officially supports 25 clustered SQL instances on one box, it is definitely not recommended if you want to monitor your server properly, as you will run into a shortage of management resources (WMI, etc.).