Recently I was tagged by Jeremiah Peschka with a question that goes a little something like this:
"So You’re On A Deserted Island With WiFi and you’re still on the clock
at work. Okay, so not a very good situational exercise here, but let’s
roll with it; we’ll call it a virtual deserted island. Perhaps what I
should simply ask is if you had a month without any walk-up work, no
projects due, no performance issues that require you to devote time
from anything other than a wishlist of items you’ve been wanting to get
accomplished at work but keep getting pulled away from I ask this
question: what would be the top items that would get your attention?"
Judging by an impromptu poll at a webcast the other month (I don't remember which one), where I work is toward the low end in number of servers per DBA. Still, for numerous reasons, we don't have any third-party tools to help automate the monitoring of our SQL Servers, so a lot of what we do is a combination of automation that we've built ourselves, manual monitoring when necessary, and prioritizing the machines that need the highest level of scrutiny.
If I had this mythical month, I'd reinforce my monitoring web along a 3-part strategy, making every effort to use it as an opportunity to learn new technologies or skills.
When something goes horribly wrong, the DBAs need to know about it right away, and preferably in advance of the users. All of our servers should notify us directly of any catastrophic errors, and if applications have database functionality that requires high availability, these components should notify the DBAs when they break. I'm happy to say that I'd have very little work to do in this area.
When it comes to reporting, we have a fair number of gaps, and I could easily see myself spending quite a bit of my 'free' month shoring up our reports and starting to collect metrics for historical analysis. I like reports for a number of reasons. They provide another line of defense behind alerts and help identify any alerting components that haven't been configured properly. They help us spot the beginnings of issues so that we can intervene before they become alerts. They help us know our systems better. The type of reporting I'm referring to here, though, is really only real-time, or over a window of no more than a month, and most likely just the last 24 hours.
The chief technologies I'd leverage for this aspect of monitoring would be PowerShell and SQL Server Reporting Services. I like PowerShell because of the ease with which I can use it to connect to many different sources of information, like the OS, hardware, SQL Server instances, AD, etc., and I like Reporting Services for the presentation layer to make the data I collect via T-SQL and PowerShell look pretty.
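As a rough illustration of the "catch it before it becomes an alert" idea, here is a minimal sketch of the logic behind such a report. It is written in Python purely for illustration, not in PowerShell, and everything in it is an assumption: the thresholds, the DiskSample shape, and the server names are all hypothetical, and the real data would come from whatever the collection scripts gather.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
WARN_FREE_PCT = 20.0   # below this, the drive shows up on the report
ALERT_FREE_PCT = 10.0  # below this, an alert should already have fired


@dataclass
class DiskSample:
    server: str
    drive: str
    free_pct: float


def classify(sample):
    """Label one sample as ALERT, WARN, or OK based on free space."""
    if sample.free_pct < ALERT_FREE_PCT:
        return "ALERT"
    if sample.free_pct < WARN_FREE_PCT:
        return "WARN"
    return "OK"


def build_report(samples):
    """Return (server, drive, free_pct, status) for every non-OK sample,
    lowest free space first."""
    rows = []
    for s in samples:
        status = classify(s)
        if status != "OK":
            rows.append((s.server, s.drive, s.free_pct, status))
    return sorted(rows, key=lambda row: row[2])


# Made-up sample data standing in for collected metrics.
samples = [
    DiskSample("SQL01", "D:", 42.0),
    DiskSample("SQL02", "E:", 17.5),
    DiskSample("SQL03", "D:", 6.0),
]

for server, drive, free_pct, status in build_report(samples):
    print(f"{status:5} {server} {drive} {free_pct:.1f}% free")
```

An ALERT row on the report with no matching notification would point to an alerting component that isn't configured properly, which is the second line of defense described above.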
Analysis and Forecasting
In this final layer, I'd start storing the data I'm gathering from my real-time or near-real-time reports so that I can make more accurate long-term projections about disk space needs, resource usage, user connections, and peak-use times, to name a few. I'd also be able to analyze historical trends to confirm or reject the notions that we currently have about our systems.
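To make the disk space projection concrete, here is a minimal sketch, again in Python for illustration only. It assumes one used-space sample per day and fits a straight line by ordinary least squares; the function names and the sample numbers are hypothetical, and a real forecast would probably need to account for seasonality and uneven growth.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_gb, capacity_gb):
    """Project how many days after the last sample the disk fills up.

    Returns None when usage is flat or shrinking, since the fitted line
    never reaches capacity in that case.
    """
    slope, intercept = linear_fit(days, used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - days[-1])


# Made-up history: five daily samples growing 2 GB per day on a 200 GB volume.
days = [0, 1, 2, 3, 4]
used_gb = [100, 102, 104, 106, 108]
print(days_until_full(days, used_gb, 200))  # 46.0
```

The longer the stored history, the less a single unusual day skews the fitted slope, which is one reason to start keeping the report data rather than discarding it.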
I know that for some of you, the monitoring I've described above is something you already have in place and couldn't live without. In fact, you may think that not having these things in place borders on criminal negligence. Well, for us, most of the systems that use SQL Server are secondary systems, at best. Yes, they are still important, but most of them could tolerate outages of a day or more without adversely impacting the business.
To keep the question alive, I'd like to tag Stuart over at codegumbo, Trevor Barkhouse, who helped me immensely with a tricky PowerShell scripting question recently, and Chad Miller, who continues to lead the way on using PowerShell with SQL Server.