Tuesday, April 24, 2007

Systems monitoring - historical views - best practice

One of the best things I have seen is the standardized use of a monitoring framework with historical reporting for the technical aspects of a system (CPU, I/O, network, memory, database, kernel, ...).

There are several tools in the marketplace that do this. IBM Tivoli, HP OpenView, BMC, EMC SMARTS (and then some) all offer solutions along these lines. The key is to deploy agents / data collectors across the estate (on each server) and have a central database and reporting website that lets IT folks select a node and display historical results for a wide variety of technical metrics. The value may seem ambiguous, but let me tell you, it makes my life much, much easier.
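To make the agent / central database / reporting split concrete, here is a minimal sketch in Python. It is not how any of the products above work internally; the table layout, the function names (`open_store`, `record`, `history`) and the metric name `cpu_pct` are all assumptions made up for illustration, and SQLite stands in for whatever central database you would really use.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the (hypothetical) central samples table and return a connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " node TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " ts REAL NOT NULL,"
        " value REAL NOT NULL)"
    )
    return conn


def record(conn, node, metric, value, ts=None):
    """What an agent would do on every collection interval: store one sample."""
    when = time.time() if ts is None else ts
    conn.execute(
        "INSERT INTO samples (node, metric, ts, value) VALUES (?, ?, ?, ?)",
        (node, metric, when, value),
    )
    conn.commit()


def history(conn, node, metric, since=0.0):
    """What the reporting site would do: one node, one metric, ordered by time."""
    rows = conn.execute(
        "SELECT ts, value FROM samples"
        " WHERE node = ? AND metric = ? AND ts >= ?"
        " ORDER BY ts",
        (node, metric, since),
    )
    return rows.fetchall()
```

The point of the sketch is the shape: every server writes the same kind of row, and the reporting side only ever needs "pick a node, pick a metric, pick a time range".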

When in crisis mode, this data helps immensely. It allows technical teams to quickly focus on, analyze and resolve issues that are normally thorny and contribute to large MTTR numbers on problem incidents. Yes, logging into the box and monitoring in real time tells you there is a problem; a historical view, however, tells you when something changed. This data is also crucial for another best practice: server capacity management and monitoring.

The key is universal rollout / standardization. Don't get trapped in technology selection mode. Pick one and implement it universally. This isn't difficult work, but it is best done within your server provisioning process, so that anything new automatically gets the standardized framework.

It is amazing how telling something as simple as a historical CPU profile is. You see processing/business utilization patterns, exactly when backups and batch jobs run and, more importantly, when something CHANGED.
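Most of the time you spot the change by eye on the chart, but the idea can be sketched in a few lines of Python. This is a deliberately naive approach, assuming evenly spaced CPU samples in percent: compare the average of the samples just before each point with the average just after it, and report where the gap is largest. The function name `find_change` and the default `window` and `threshold` values are made up for illustration.

```python
from statistics import mean


def find_change(values, window=4, threshold=20.0):
    """Return the index where the mean of the next `window` samples differs
    most from the mean of the previous `window` samples, or None if no
    shift is larger than `threshold`."""
    best_index, best_shift = None, threshold
    for i in range(window, len(values) - window + 1):
        before = mean(values[i - window:i])
        after = mean(values[i:i + window])
        shift = abs(after - before)
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index


# CPU sits around 10% and then jumps to around 60% at the seventh sample.
cpu = [10, 12, 11, 9, 10, 11, 60, 62, 61, 59, 60, 61]
print(find_change(cpu))  # prints 6
```

A sustained shift like this is the kind of thing that points you at a deployment, a config change or a new batch job. A short spike is averaged away by the window, which is usually what you want here.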

1 comment:

Michael Janke said...

Good thoughts. I'd agree that strip charts (like MRTG) are invaluable when troubleshooting. We have automatic MRTG charting of CPU, I/O, TCP connections, disk space & a few other variables for all servers.

We overlay multiple strip charts for CPU, I/O etc. to correlate events. (High CPU + high I/O)

The historical data presented by a simple MRTG strip chart is very valuable for troubleshooting and trend detection.