Thursday, March 1, 2007

systems availability vs. systems effectiveness

First of a series of posts where I will cover the area of application support & service management. This is probably one of the largest problem areas in an IT portfolio and the number one reason that leads to CIO departures.

Providing excellent day-to-day service ! What is the role of IT in this ? What is the role of a particular application support team ?

A typical CIO challenge is to take the IT group up the value chain within a company. This applies to all disciplines including providing day-to-day support.

Support teams and IT value is typically stuck at the systems availability monitoring and reporting level. 99.95% uptime. Famous words. We've all heard this. Somehow that 0.05% seems to hide a massive amount of operational impact. Putting that under the microscope usually leads to startling revelations.

An alternate strategy is to focus on systems effectiveness - my name for nothing other than business process monitoring, however, this is a little different. Here, you apply the concepts of business process monitoring in a 'systemy' way.

To clarify, each system typically performs a specific function within a process chain. Systems monitoring at the technical level covers all the engineering aspects of the platform eg.
Database, hardware, CPU, I/O, Filesystem space etc. Usually, this stuff is trapped using tools like HP Openview, BMC etc. and monitored by a 7x24 bridge operation. When alerts are received, automatic callouts are performed with an extra pair of eyes to make sure.

Better groups take this up one level. Monitoring of log files for errors eg. SQL errors, core dumps, etc. However, this is also usually insufficient. Even better groups start getting sophesticated around application level capacity monitoring - eg. thread utilization, queuing behavior and other subtleties around bad jvm characteristics eg. full GCs.

However, that also isn't usually sufficient. The trick is to customize a set of measures that are relevant to the business use of the application and monitor for that. Keep your finger on that pulse and magic happens. Your operational partners will no longer care if the system goes up or down .. they are happy for you to measure based on the business performance. An example of this is to measure performance response time and variance for transactions that are time sensitive - eg. those that a call center application calls on the back-office systems. Alternately, in the case of workflow, some measure of cycle time and right first time (on-time being trivial case of RFT).

Do this and suddenly, you have gone up the value chain and made your life simpler. Your teams grow as they go from being purely reactive to being proactive. Also, more importantly, they learn the operational side of things and recognize exactly the criticality and value of their system in the larger picture.

This isn't anything fancy. I'm not talking about a full-scale business process monitoring framework here. Full BPM requires a standardization of metrics and process and usually abstracts away from the systems design and implementation. For architectures that are a mix of legacy and new, this is usually never perfect either. I'm talking about a simple application of common sense to what you monitor. They challenge is usually understanding the design of the system and extrapolating the meaningful set of measures that the end-to-end business process depends on. This is usually very specific to the design and implementation of the system as the data must be harvested frequently and usually in real-time.

Milan Gupta

1 comment:

ash said...

Hi Milan, Thanks for such useful posts. Please provide some guidelines which can help me resolve business critical tickets.