Wednesday, March 7, 2007

database maintenance


One recurring theme I see with junior DBAs is a lack of understanding of the 'analyze table' procedure. It is crucial for proper database performance, and it can be scheduled once a week or more frequently depending on the system profile.

Basically, the 'analyze table' command gathers statistics on the table and stores them internally as hints for the query optimizer. The optimizer picks indexes based on these statistics. For dynamic tables (large growth or shrinkage, or changes to key indexed fields), it is imperative that this is performed frequently. To be safe, run it anyway after key database operations.
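As a rough illustration, here is a minimal sketch of refreshing optimizer statistics from a script. It uses SQLite's `ANALYZE` statement via Python's built-in `sqlite3` module purely as a stand-in, since the exact syntax of 'analyze table' depends on your database. The `orders` table, the index name and the `refresh_statistics` helper are made up for the example.

```python
import sqlite3

def refresh_statistics(conn):
    """Run ANALYZE and return the statistics rows SQLite keeps for its optimizer."""
    conn.execute("ANALYZE")
    # SQLite stores the gathered statistics in the internal sqlite_stat1 table.
    return conn.execute(
        "SELECT tbl, idx, stat FROM sqlite_stat1 ORDER BY idx"
    ).fetchall()

# Hypothetical table and index, populated with skewed data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",)] * 900 + [("closed",)] * 100,
)
conn.commit()

stats = refresh_statistics(conn)
for tbl, idx, stat in stats:
    print(tbl, idx, stat)
```

In a real setup the same call would sit behind whatever scheduler you already use, and would also be run after large loads or purges.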

Know what 'good' looks like for your system. Baseline and save (better still, commit to memory) the key behavioral aspects of your system, e.g. CPU utilization profile, throughput, response times, I/O levels etc. That way, you will know when this profile changes, and that change becomes your trigger to act and catch problems early. More importantly, when you implement change, you should compare the before and after profile of your system.
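A minimal sketch of the before/after comparison, assuming you have already captured the key metrics as simple name/value pairs. The metric names, the sample numbers and the 20% tolerance are all illustrative assumptions, not recommendations.

```python
def compare_to_baseline(baseline, current, tolerance=0.20):
    """Return the metrics whose relative change from baseline exceeds tolerance."""
    drifted = {}
    for name, base_value in baseline.items():
        if name not in current or base_value == 0:
            continue  # nothing to compare against
        change = (current[name] - base_value) / base_value
        if abs(change) > tolerance:
            drifted[name] = round(change, 2)
    return drifted

# Hypothetical profile captured before a change, and again after it.
baseline = {"cpu_pct": 40.0, "throughput_tps": 200.0, "response_ms": 150.0, "io_mb_s": 80.0}
after_change = {"cpu_pct": 42.0, "throughput_tps": 190.0, "response_ms": 240.0, "io_mb_s": 82.0}

drift = compare_to_baseline(baseline, after_change)
print(drift)
```

Here only the response time has moved beyond the tolerance, which is exactly the kind of signal worth chasing before users notice.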

Saturday, March 3, 2007

Change management

The first question I ask a team when something breaks is 'what changed?'. 99% of problems end up relating to change. Amazing!!

I did not put change as the top problem within my post on what keeps me busy because I believe that change is good. It is necessary and as inevitable as evolution. It is necessary for healthy growth. So IT cannot take the simplistic approach of 'if it ain't broke, don't fix it!'.

Any decent sized telco typically has an IT estate of 3000+ systems, 50K+ computing assets and 10K+ people all changing stuff. Is it any surprise that systems availability & stability actually go up during Christmas?

What is key to establishing a high performance IT organization is managing change effectively. It is instrumental to have a proper inventory of systems and, more importantly, of their inter-dependencies and impact on business process. Most importantly, you need a team that understands this model.

While end-to-end testing frameworks can flush out unexpected side-effects of change, it isn't reasonable to expect that all work will go through this framework. With 10K+ employees, there will be leakage and impact, specifically around stuff that you would least suspect.

A proper and effective change management framework would consist of the following:
  • A team as described above that performs the function of a change approval board (centralized or decentralized)
  • An e2e testing framework that certifies each change
  • An effective communication framework, typically a change ticket process, that notifies the relevant enterprise pieces that are potentially impacted by the change
  • Leadership & support from the development teams on implementing change.
  • Post-implementation verification of change (ideally monitoring key business KPIs before and after the change).
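To make the notification piece concrete, here is a minimal sketch of working out who a change ticket should notify from a systems inventory. It assumes the inter-dependencies are recorded as a simple map from each system to the systems that depend on it. The system names and the `impacted_systems` helper are hypothetical.

```python
from collections import deque

def impacted_systems(dependents, changed):
    """Breadth-first walk: every system that directly or indirectly depends on `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        system = queue.popleft()
        for dependent in dependents.get(system, []):
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)

# Hypothetical inventory: key = system, value = systems that depend on it.
dependents = {
    "inventory": ["orchestration"],
    "activation": ["orchestration"],
    "orchestration": ["crm"],
    "crm": [],
}

notify = impacted_systems(dependents, "inventory")
print(notify)
```

The walk is only as good as the inventory behind it, which is why the team that understands the model matters more than the tooling.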

Milan Gupta

proactive vs. reactive application support

A key aspect of a good application support model that is often missed is the importance of each application support group behaving as a customer of a downstream dependent system.

A common fallacy is to assume that each application and its support group is an independent unit acting purely on a reactive basis. This puts the onus of taking a business user problem and translating it to a specific application problem onto some form of centralized service management wrap. This is not the most efficient approach, as most problems are first detected within the application support teams (provided they are awake). It is imperative that they drive the resolution of the problem to their downstream dependent system.

As an example, consider the picture wherein the application arena is a simple service order processing chain wherein system A is some CRM system, system B is an orchestration / workflow layer, system C is an inventory / assignment system and system D is a service activation system. Typically, there is tight coupling and dependencies between each of these layers - any anomalies impact business KPIs and flow-through. The users of the CRM systems will see these anomalies as orders not being completed on time. The orchestration system will see these as orders stuck at a particular stage. The problem may actually lie within the assign / inventory layer or the service activation layer, where it will also be noticed by the respective app support teams.

Behavior within the teams should be as follows :
1> Ideally, the bridge monitoring system should have received alarms and alerted the respective systems .. this would represent IT being pro-active.
2> Failsafe on this would be the app-support team for the CRM system creating an IT fault on the orchestration system who in turn would transfer the fault to the assign / inventory system. Not the most effective, however, necessary as a failsafe and reinforces proper organizational behavior. For complex scenarios, a service management layer may be introduced. I still consider this pro-active.
3> Least ideal is 1 & 2 failing with the users reporting the fault - this is IT being reactive.
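The failsafe in #2 can be sketched as a fault being handed down the chain until a team finds the problem in its own system. This is only an illustration: the chain order follows the A to D example above, while the health flags and the `route_fault` helper are assumptions made for the sketch.

```python
# Order processing chain from the example: A -> B -> C -> D.
CHAIN = ["crm", "orchestration", "inventory", "activation"]

def route_fault(chain, is_healthy, raised_by):
    """Pass a fault downstream from `raised_by` until a team finds the problem in its own system."""
    path = []
    for system in chain[chain.index(raised_by):]:
        path.append(system)
        if not is_healthy[system]:
            return system, path  # this team owns the fault
    return None, path  # nobody found a problem, escalate to service management

# Hypothetical health checks as each support team would report them.
health = {"crm": True, "orchestration": True, "inventory": False, "activation": True}

owner, path = route_fault(CHAIN, health, "crm")
print(owner, path)
```

Each hop in `path` is a team acting as a customer of the next system down, rather than waiting for a user to call.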

The usual breakdown I observe is in #2, with companies mostly operating in #3. #1 requires a sophisticated business process monitoring infrastructure, something I consider to be still an industry-wide problem given the state of investment and commitment to such projects within an IT portfolio. Breakdown in #2 is usually an artifact of organizational boundaries and/or poor skillsets & focus. Each team operates in a silo and purely on a reactive basis. A truly dangerous place to be for any CIO.

Milan Gupta

Friday, March 2, 2007

synchronous vs asynchronous transactions

If you have ever built call center apps .. you will already have learned this lesson. For some reason, we keep repeating these mistakes over and over and over ...

Remember - synchronous transactions for time-sensitive stuff. For transactions that a call center rep has to wait on (while the customer is on the phone) .. use a synchronous backplane, e.g. web-services. For others (non-time-sensitive), use asynchronous. Your messaging architecture MUST support both.
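A toy sketch of a messaging layer that supports both styles, assuming a single in-process handler and an in-memory queue. A real backplane would use web-services and a message broker. The `MessagingLayer` class and its method names are invented for illustration.

```python
import queue

class MessagingLayer:
    """Toy layer offering both a blocking call and a queued, deferred call."""

    def __init__(self, handler):
        self._handler = handler
        self._backlog = queue.Queue()

    def send(self, request, time_sensitive):
        if time_sensitive:
            # Synchronous: the rep waits, so the answer comes straight back.
            return self._handler(request)
        # Asynchronous: accept the request now, process it later.
        self._backlog.put(request)
        return None

    def drain(self):
        """Process everything queued so far; stands in for a background worker."""
        results = []
        while not self._backlog.empty():
            results.append(self._handler(self._backlog.get()))
        return results

layer = MessagingLayer(lambda request: request.upper())
sync_reply = layer.send("check credit", time_sensitive=True)
async_reply = layer.send("send welcome letter", time_sensitive=False)
deferred = layer.drain()
print(sync_reply, async_reply, deferred)
```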

The thing that creates havoc the most in call centers is transaction performance variance. Not always just transaction performance. If something consistently takes 90 seconds, you will find your call center reps work around this poor performance by predicting this period of wait and filling it with other work or small talk with the customer. What makes call center agents mad is transaction variance - sometimes it only takes 4 secs, sometimes 300. That's when the customer on the line gets the embarrassed comments - 'my system is slow .. my system has frozen up ..' etc. etc.
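To see why the average alone hides the problem, here is a small sketch comparing two made-up sets of response times with the same mean but very different spread. It uses Python's standard `statistics` module. The sample values are invented.

```python
import statistics

def variability(samples):
    """Summarize response times: mean, standard deviation and their ratio."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}

# Hypothetical response times in seconds for the same transaction.
steady = [90, 91, 89, 90, 90]    # slow but predictable
erratic = [4, 300, 10, 120, 16]  # same average, wildly unpredictable

steady_profile = variability(steady)
erratic_profile = variability(erratic)
print(steady_profile)
print(erratic_profile)
```

Both sets average 90 seconds, so a report on mean response time would call them identical. Only the spread shows which one the reps can actually plan around.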

Milan Gupta

Thursday, March 1, 2007

systems availability vs. systems effectiveness

First of a series of posts where I will cover the area of application support & service management. This is probably one of the largest problem areas in an IT portfolio and the number one reason that leads to CIO departures.

Providing excellent day-to-day service ! What is the role of IT in this ? What is the role of a particular application support team ?

A typical CIO challenge is to take the IT group up the value chain within a company. This applies to all disciplines including providing day-to-day support.

Support teams and IT value are typically stuck at the systems availability monitoring and reporting level. 99.95% uptime. Famous words. We've all heard this. Somehow that 0.05% seems to hide a massive amount of operational impact. Putting that under the microscope usually leads to startling revelations.

An alternate strategy is to focus on systems effectiveness - my name for nothing other than business process monitoring, however, this is a little different. Here, you apply the concepts of business process monitoring in a 'systemy' way.

To clarify, each system typically performs a specific function within a process chain. Systems monitoring at the technical level covers all the engineering aspects of the platform, e.g. database, hardware, CPU, I/O, filesystem space etc. Usually, this stuff is trapped using tools like HP Openview, BMC etc. and monitored by a 7x24 bridge operation. When alerts are received, automatic callouts are performed with an extra pair of eyes to make sure.

Better groups take this up one level and monitor log files for errors, e.g. SQL errors, core dumps, etc. However, this is also usually insufficient. Even better groups start getting sophisticated around application level capacity monitoring - e.g. thread utilization, queuing behavior and other subtleties around bad JVM characteristics, e.g. full GCs.

However, that also isn't usually sufficient. The trick is to customize a set of measures that are relevant to the business use of the application and monitor for that. Keep your finger on that pulse and magic happens. Your operational partners will no longer care if the system goes up or down .. they are happy for you to measure based on the business performance. An example of this is to measure response time and variance for transactions that are time sensitive - e.g. those that a call center application calls on the back-office systems. Alternately, in the case of workflow, use some measure of cycle time and right first time (on-time being a trivial case of RFT).
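For the workflow case, a minimal sketch of computing cycle time, right first time and on-time from order records. The record fields (`created`, `completed`, `due`, `rework_count`) and the rule that an order is right first time when it needed no rework are assumptions for the example. Your system will define these differently.

```python
from datetime import datetime
from statistics import median

def workflow_measures(orders):
    """Cycle time, right-first-time rate and on-time rate over completed orders."""
    completed = [o for o in orders if o["completed"] is not None]
    if not completed:
        return None
    cycle_hours = [
        (o["completed"] - o["created"]).total_seconds() / 3600 for o in completed
    ]
    rft = sum(1 for o in completed if o["rework_count"] == 0) / len(completed)
    on_time = sum(1 for o in completed if o["completed"] <= o["due"]) / len(completed)
    return {"median_cycle_hours": median(cycle_hours), "rft": rft, "on_time": on_time}

# Hypothetical order records harvested from a workflow system.
due = datetime(2007, 3, 1, 17, 0)
orders = [
    {"created": datetime(2007, 3, 1, 9, 0), "completed": datetime(2007, 3, 1, 13, 0), "due": due, "rework_count": 0},
    {"created": datetime(2007, 3, 1, 9, 0), "completed": datetime(2007, 3, 2, 9, 0), "due": due, "rework_count": 1},
    {"created": datetime(2007, 3, 1, 10, 0), "completed": datetime(2007, 3, 1, 16, 0), "due": due, "rework_count": 0},
    {"created": datetime(2007, 3, 1, 11, 0), "completed": None, "due": due, "rework_count": 0},
]

measures = workflow_measures(orders)
print(measures)
```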

Do this and suddenly, you have gone up the value chain and made your life simpler. Your teams grow as they go from being purely reactive to being proactive. Also, more importantly, they learn the operational side of things and recognize exactly the criticality and value of their system in the larger picture.

This isn't anything fancy. I'm not talking about a full-scale business process monitoring framework here. Full BPM requires a standardization of metrics and process and usually abstracts away from the systems design and implementation. For architectures that are a mix of legacy and new, this is rarely perfect either. I'm talking about a simple application of common sense to what you monitor. The challenge is usually understanding the design of the system and extrapolating the meaningful set of measures that the end-to-end business process depends on. This is usually very specific to the design and implementation of the system, as the data must be harvested frequently and usually in real-time.

Milan Gupta