Wednesday, April 25, 2007

on the dark arts of search engine optimization

I have been trying to figure out why I cannot search the internet (google, msn, yahoo ...) and get to my blog, even after doing an exact search for keyword combinations exclusive to my blog. What opened up to me is a whole new world. I know, I know, I am obsolete .. what world have I been living in ?! I am an old UNIX/C guy, and html/xml really doesn't classify as programming to me. I now feel bad about poking fun at the mainframe guys back in the 90s. Having said that, now I really feel old !! I digress, sorry for the soapbox.

My quest is to get a hit in a google search using a combination of my name and the 'knownbugs' keyword. That combination should be unique, with the top hit bringing me to the website hosted on google blogspot .. the big assumption being that search engines give you results ordered by relevance (occurrence of all keywords).

Well, not quite so. Reading up on recommendations, I first researched tagging. Technorati tags are an emerging 'power player' in the world of blogs. Supposedly, 'labels', 'titles' and 'headers' in blog content/articles should be automatically picked up by the Technorati engine (invoked when you 'ping'). Alternatively, you can force a tag by using the 'Technorati Tags' method.
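For what it is worth, the 'Technorati Tags' method boils down to embedding an anchor with rel="tag" that points at a technorati.com tag page. Below is a throwaway Python sketch that generates that markup for a post footer; the exact URL format and the way spaces are encoded are from my recollection of their documentation, so verify before pasting it into a post.

# Throwaway sketch: generate 'Technorati Tags' style markup for a post footer.
# The technorati.com/tag/<tag> URL format and rel="tag" attribute are from my
# recollection of their docs -- double-check against the current instructions.

def technorati_tag(tag):
    slug = tag.strip().lower().replace(' ', '+')
    return '<a href="http://technorati.com/tag/{0}" rel="tag">{1}</a>'.format(slug, tag)

for t in ('knownbugs', 'search engine optimization'):
    print(technorati_tag(t))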

Even so, this only makes your tags visible within Technorati's blog search. To the normal google internet search, blog content hosted on blogspot appears invisible; on google's blog search, however, it works.

I also did the 'wait 30 days and magic will happen' thing. This is what some recommend as the time it takes for the spiders to crawl your content.

So, further tricks/tips. I am now in the process of getting a custom domain name. Godaddy.com offers cheap registrar services. My selection - www.knownbugs.org (or .info, .net, .biz .. unfortunately, .com was taken by someone who wants to make money by selling the name).

What a custom domain name will do is treat the blog content as regular www content and hopefully allow the search engine 'spiders' to index it, making it visible in the regular google search world.

I suspect I will find other gotchas, as there clearly is money at play here.

Instead of being in the age of 'content is king', we appear to be living in the age of 'content control is king'. This is where there is a war going on. The behemoths - google, microsoft, yahoo .. are all at play.

'Influencing' search engines is worth a lot of money nowadays. There is a massive amount of complexity behind the scenes, with 'SEO' or search engine optimization being a real growth area. I worry !

I worry about such central control over information access; however, hacks like us always have a way of breaking free.

More as my quest progresses !! I am still waiting for the DNS servers to update, so expect to be redirected to www.knownbugs.org when you go to knownbugs.blogspot.com shortly.
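In the meantime, a quick way to tell when the update has landed (at least from my own resolver's point of view) is to poll the new name until it resolves. A minimal Python sketch; the 10-minute retry interval is an arbitrary choice.

# Minimal sketch: poll the new custom domain until it resolves, so I know
# when the registrar's DNS change has reached my local resolver.
import socket
import time

DOMAIN = 'www.knownbugs.org'   # the custom domain mentioned above

while True:
    try:
        addr = socket.gethostbyname(DOMAIN)
        print('%s now resolves to %s' % (DOMAIN, addr))
        break
    except socket.gaierror:
        print('%s not resolving yet, retrying in 10 minutes' % DOMAIN)
        time.sleep(600)

Note this only shows what my local DNS resolver sees; other corners of the internet may lag behind.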

Tuesday, April 24, 2007

common pitfalls in outage/crisis situations

Teams dealing with outages typically suffer the following behavioral problems. Some of these conflict with each other in their aims .. unfortunately, there isn't a set formula I can come up with, as each situation has its own variables/complexities. A standardized process/template, however, would be great !

1> Trial and error
Don't fall for this. You know you are in trouble when you get ambiguity from the technical teams. If you are stuck, ask yourself what you can do to get additional information onto the table. Often, you will find teams stuck 'enjoying' the problem because they have no method for infusing new information or experience into the problem diagnosis. Doing things in trial & error mode also forces you into a sequential analysis mode. It also leads to the '2 hr ETA bait' (see below).

2> Debugging in live
While prevention of recurrence is key, and that requires some investment in analysis / data collection during an outage, it MUST be capped. Do not fall into the trap of "give me 20 more minutes, I have almost figured it out ..". You will hate yourself later. Walk into the situation with a time limit in mind, at which point you will trigger a failsafe way of restoring service. Communicate that upfront to the team (I am assuming you have a failsafe procedure to restart the system and restore service that you can trigger .. maybe something as simple as rebooting the servers). Of course, best practice is to always execute the failsafe and never debug in live; however, that requires a significant investment in test infrastructure capable of reproducing the problem. Remember, there is value in getting to true root cause, as that is the only way you will prevent recurrence.

3> Sequential analysis
Distinguish 'sequential analysis' from 'sequential execution'. Sequential execution is good; sequential analysis is BAD. On the analysis side, you should try to split off multiple teams (assuming you have the resources) onto different aspects of the problem. That allows you to cover all the bases quickly vs. an elongated recovery path where you are problem-solving only one thing at a time. Sequential execution is GOOD because you want to introduce only one variable at a time; otherwise, you will break the cause-and-effect chain. Usually, problem solving is about eliminating variables and then incrementally fixing one thing at a time using a measured/scientific approach.

4> 2 hr ETA bait ...
Setting expectations on 'expected time to restoral' is really, really hard. Here is the dark art of estimation at its finest. Setting no expectation is unacceptable (the 'it will be fixed when it is fixed' attitude). Your business partners/users will not be as upset about an outage as they will be about false expectations. Usually, a significant outage will require operational teams to build workaround/catch-up plans, where they may have to staff overtime or weekends. These plans depend on your estimates.

And .. what's worse is none of your technology suppliers / partners will co-operate.

In crisis situations with financial implications, vendors get very, very conservative or, worse, clam up ! In a crisis situation, you will always feel the information is inadequate to make a decision or set an expectation. I usually follow my instincts here (while, of course, harnessing whatever facts are available on the situation). Don't try this unless you have the right technical experience.

enterprise support from technology partners

Five years ago, I was pounding Microsoft on their lack of understanding of enterprise support. It is amazing how far they have come. I remember, a couple of years back, an incident relating to a system based on SQL server. It was terrible !! The answers back from Microsoft were very casual .. try this patch ! Of course, nothing was hot patchable (though luckily it didn't require a complete rebuild of the WindowsNT server), and an hour later, when we figured out that didn't work, the answer was: OK, now try this. We felt really foolish for architecting a mission critical enterprise application on SQL server.

Microsoft has really come a long way since then. I was very pleasantly surprised in a recent encounter by how much they have matured. Their crisis technical lead was clear, crisp, unfazed by pressure and clearly knew what he was talking about. That instilled confidence. He knew how to distill and present the facts and avoid making false promises. Also, their follow-the-sun model actually worked !! The transitions were seamless, with knowledge transfer occurring behind the scenes and a warm hand-off with a 1 hr overlap. Their account team was on the ball, and follow-up and follow-through were perfect. In fact, they chased me !!

Other examples of great support I have received are from BEA. BEA's account manager takes the unique honor of being the only sales guy I know who stuck with me for 36 hours straight during a crisis situation, helping with anything he could (including doing the coffee rounds). I never believed a sales guy had that kind of stamina ;-). Oracle's down systems group is also top notch.

Technology partners usually have to support crisis situations remotely. They will depend on you for information, and one of the challenges is being able to supply it to them - real-time. Simple things like file-size limits on your email servers can look like bad ideas in these circumstances. Firewalls are a fact of life, so have a strategy for how your technology partners get access to your systems/intranet when you need them to.

crisis bridge protocol

When dealing with outage situations, it is important to establish a clear bridge protocol for the participants. Hopefully, you won't have to go through these on each call, as doing so will add to your MTTR (remember, you are in an outage/crisis situation).

a> one person speaks at a time
b> identify the lead (hopefully you !)
c> people mute when not speaking
d> people do not put the bridge on hold (most PBXs will play music to the rest of the participants)
e> mute if you want to have a sidebar conversation
f> remember, if you go to sleep, you will be spotted because of your snoring
g> no calling from a cell phone (or c becomes very important)
h> clearly establish the participants and their roles/what functions they represent

Traditional conference bridges are slowly evolving into multimedia facilities - a parallel IM session is becoming commonplace, with Netmeeting/Livemeeting quickly following. At Qwest, it was nice to have the facility to dial into an 800 number and then select a sub-bridge (option 1..9). That way, the main bridge team could quickly branch sub-teams off without confusion and avoid wasting time communicating bridge numbers.

Separate out from the start a management bridge, customer bridge and the technical bridge. Chaos ensues if you mix them all into one.

Just my two minutes of brain dump .. will add more as I flesh this out/collect my thoughts.

Systems monitoring - historical views - best practice

One of the best things I have seen is the standardized use of a monitoring framework with historical reporting for the technical aspects of a system (CPU, i/o, network, memory, database, kernel ..).

There are several tools out there in the marketplace that do this. IBM Tivoli, HP Openview, BMC and EMC SMARTS (and then some) all offer solutions along these lines. The key is to instrument agents / data collectors across the estate (on each server) and have a central database & reporting web-site that allows IT folks to select a node and display historical results across a wide variety of technical aspects. The value may seem ambiguous; however, let me tell you that it makes my life much, much easier.
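To make the idea concrete, here is a bare-bones Python sketch of the per-server collector half of such a framework, assuming a Unix box and using a flat CSV file as a stand-in for the central database. The commercial tools above obviously do far more; the file path and 5-minute interval here are arbitrary choices for illustration.

# Bare-bones data collector sketch: sample a few OS-level metrics on an
# interval and append timestamped rows to a CSV that a central reporting
# job could pick up later. Unix-only (uses os.getloadavg / os.uname).
import csv
import os
import time

OUTPUT = '/var/tmp/metrics_history.csv'   # stand-in for the central store
INTERVAL = 300                            # seconds between samples

def sample():
    load1, load5, load15 = os.getloadavg()
    return {
        'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
        'host': os.uname()[1],
        'load1': load1,
        'load5': load5,
        'load15': load15,
    }

def run():
    fields = ['timestamp', 'host', 'load1', 'load5', 'load15']
    while True:
        row = sample()
        new_file = not os.path.exists(OUTPUT)
        with open(OUTPUT, 'a') as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            if new_file:
                writer.writeheader()
            writer.writerow(row)
        time.sleep(INTERVAL)

if __name__ == '__main__':
    run()

The central reporting side then just ingests these per-host rows and plots them over time.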

When in crisis mode, this data helps immensely. It is crucial data that tells you when something changed. It allows technical teams to quickly focus, analyze and resolve a set of issues that are normally thorny and contribute to large MTTR numbers on problem incidents. Yes, logging into the box and monitoring in real time tells you there is a problem; a historical view, however, tells you when something changed. This is also crucial for another best practice - server capacity management and monitoring.

The key is universal rollout / standardization. Don't get trapped in technology selection mode. Pick one and implement it universally. This isn't difficult work; it is, however, best implemented within your server provisioning process so that anything new automatically gets the standardized framework.

It is amazing how telling something as simple as a historical CPU profile is. You see processing/business utilization patterns, exactly when backups occur, batch jobs, etc., and, more importantly, when something CHANGED.
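As a toy illustration of spotting 'when something CHANGED', a few lines of Python against that kind of history is all it takes: compare each day's average CPU against a trailing baseline and flag the days that jump. The sample numbers and the 1.5x threshold below are invented purely for the example.

# Toy sketch: flag the day a metric shifted, given a history of
# (day, average_cpu_percent) samples. Values and threshold are made up.
history = [
    ('2007-04-10', 22), ('2007-04-11', 25), ('2007-04-12', 21),
    ('2007-04-13', 24), ('2007-04-14', 23), ('2007-04-15', 58),
    ('2007-04-16', 61), ('2007-04-17', 60),
]

WINDOW = 5        # days of trailing baseline
THRESHOLD = 1.5   # flag anything 50% above the baseline average

for i in range(WINDOW, len(history)):
    baseline = sum(v for _, v in history[i - WINDOW:i]) / float(WINDOW)
    day, value = history[i]
    if value > THRESHOLD * baseline:
        print('%s: CPU %.0f%% vs baseline %.1f%% -- something changed here' % (day, value, baseline))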