Production Issue: How to diagnose issue when SQL is thread starved

We have had several recent production issues with our SQL 2008 database where so many queries stacked up that we ran out of threads and sql was unresponsive.  We are having a tough time diagnosing the issue since Idera could not connect.  We couldn't tell via any of the views what led up to the issue.  Any ideas here?  We are on version 9.1.  Are newer editions better?  Also, we tried at one point turning on the query analyis feature but this at one point took down the sql server. Has anyone had luck with that in the newer versions?  

  • We are on version 10.2 and we also experience random "unable to monitor" alerts from Idera, very random.  And as you have already experienced, the query analysis is hard on the server and directly affects performance, at least that has been our experience on busy systems.

    Sorry I don't have a good answer for you but the issues you describe in version 9.1 are similar to what we have seen sporadically on 10.2

  • There is a thing called "Dedicated Admin Connection". With this kind of connection (only one per instance) you will be able to connect from Management Studio, even if all worker threads are occupied, because it runs in its own scheduler. At least you can do some basic troubleshooting then, if it was only to save the current session information to a DBA database for later analysis.

    Another option would be to create a custom counter in SQLdm called "spare workers"

    Custom Counter Query:
    -- maximum worker count
    (SELECT max_workers_count FROM sys.dm_os_sys_info)
    -- current worker count (all threads)
    (SELECT COUNT(*) FROM sys.dm_os_workers)
    -- system workers (do not count for the limit)
    (SELECT COUNT(*) FROM sys.dm_exec_sessions AS s
    INNER JOIN sys.dm_exec_requests AS r ON s.session_id = r.session_id
    INNER JOIN sys.dm_os_tasks AS t ON r.task_address = t.task_address
    INNER JOIN sys.dm_os_workers AS w ON t.worker_address = w.worker_address
    WHERE s.is_user_process = 0
    AS spare_workers -- should be above 0
    Set a threshold to get alarmed if the counter goes below a certain low value (depends on your environment). At least you could then start looking before the server is non-reactive anymore.