SQL Resources FAIL when restarting "Passive" Node

Giganews Newsgroups
Subject: SQL Resources FAIL when restarting "Passive" Node
Posted by:  Pat R. (Pat …@discussions.microsoft.com)
Date: Thu, 8 Mar 2007

I have a two-node cluster running W2K3 EE SP1 and MSSQL 2000 SP4.  Each node
is an IBM 366, quad dual-core w/ hyperthreading and 16 GB RAM.  I am hosting
5 instances ("multi-instance") of MSSQL with a total of 120 databases, so
eggs in one basket for sure. MSDTC is separated from the instance resources
and is functional, and the resources have been running fine either all on one
node or load balanced between the nodes. All hotfixes known to date have been
applied.
Anyway, during the recent DST patching cycle I performed a rolling upgrade,
patching the "passive" node after moving all the owned resources to the
standby node (I normally load balance the resource groups). Apart from the
small interruption to SQL during the move group operation, the resources
were brought online and SQL checkpointing and recovery completed
successfully. Right after they were brought online and made available for
user connections, I patched the standby node and then restarted it
gracefully. To my surprise, when that node left the cluster as part of the
restart, the resources on the ACTIVE node FAILED, bringing all our production
applications to their knees. The only way to recover was to wait for the
restart to complete on the (patched) standby node so it could join the
cluster, then REBOOT the ACTIVE node to FORCE its removal, then restart the
MSCS service on the PASSIVE node, which was then able to grab the quorum and
FORM the cluster, after which the resources came online.

Any ideas what could be causing this?  PSS has not been able to find anything
on this behavior. :-| From my chair, though, it looks like a dependency has
been "created" somewhere, either in the registry or the quorum, and when the
specific node restarts, that dependency is broken and catapults the cluster
into the infamous SPLIT BRAIN condition.  Thanks in advance.