Saturday, July 7, 2007

on monitoring enterprise shared/common components

I had an encounter with a commonly used enterprise component: single sign-on, specifically a tool called SiteMinder (now owned by Computer Associates).

First my rant, as this cost me 3 days of my life and a weekend away from my family. While it is quite a nifty tool and what appears to be a highly scalable platform, I was not prepared for the level of 'blindness' to simple things like throughput and response time. You can get all sorts of information about connections, threads etc. However, the information within the package isn't sufficient to fully monitor what is going in and out of this black box. Also, no historical graphs in the monitor? Wait, that is another product you have to buy from CA? And wouldn't it be awesome if we could reset the stats/counters at run-time? That way, you could tune, reset the stats/counters and re-measure.

OK. Got that off my chest and it wasn't completely venomous. Believe me, it's been a tough week.

Seriously, shared infrastructure typically has a really compelling business case and yes, there truly are efficiencies to be gained. However, effective monitoring becomes absolutely critical. All eggs in one basket means you save on baskets and runners, but the stakes for a mistake go way up! So you had better be careful.

Shared infrastructure is also more complex to model and monitor, especially when you are dealing with layered distributed systems. E.g., in the SiteMinder model, there are agents that consume transactions (which may be locally cached) from a series of policy servers, which in turn consume transactions from downstream services, e.g. LDAP directory servers, authentication servers ...

In such a framework, I would closely monitor the following aspects:
- daily/hourly transaction arrival rate from each agent (and the cache hit rate)
- transaction response time variance (high variance is a sign of downstream bottlenecks)
- resource consumption in the policy server (no. of threads active/in-use), monitored at peak
- the above three for each of the downstream services the policy servers consume.
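
To make the first two items concrete, here is a rough sketch in Python of how the numbers could be derived once you have some per-transaction record to work from. Everything in it is an assumption for illustration: the `Txn` record, its field names and the `summarize` helper are hypothetical and not part of SiteMinder or any CA tooling. How you actually collect such records (agent logs, a proxy, packet capture) depends on your environment.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass(frozen=True)
class Txn:
    """One observed transaction. All field names are hypothetical."""
    agent: str          # which agent issued the request
    hour: int           # hour of day, 0-23
    response_ms: float  # observed response time in milliseconds
    cache_hit: bool     # True if the agent answered from its local cache


def summarize(txns):
    """Group transactions by (agent, hour) and compute basic metrics."""
    buckets = defaultdict(list)
    for t in txns:
        buckets[(t.agent, t.hour)].append(t)

    summary = {}
    for key, group in buckets.items():
        times = [t.response_ms for t in group]
        summary[key] = {
            "arrivals": len(group),                      # arrival rate per hour
            "cache_rate": sum(t.cache_hit for t in group) / len(group),
            "mean_ms": mean(times),
            "variance_ms2": pvariance(times),            # population variance
        }
    return summary


# Tiny made-up sample, only to show the shape of the output.
sample = [
    Txn("agent-a", 9, 10.0, True),
    Txn("agent-a", 9, 30.0, False),
    Txn("agent-b", 9, 20.0, False),
]
result = summarize(sample)
for key in sorted(result):
    print(key, result[key])
```

The same grouping could be applied to the calls the policy servers make to each downstream service, by treating the policy server as the "agent". Thread usage is not covered here, since it has to come from the policy server itself rather than from transaction records.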

If your application support team can do this, they have the first clue about what is going on within their framework. If not, guess what: tomorrow you may go through what I just went through this past week.