Friday, March 20, 2020

Production Support


I have seen my share of IT processes across a number of companies over the past 25 years. The ones around production support have always been the most interesting, specifically in situations where teams/organizations get themselves into trouble and then spend their lives on crisis/bridge calls trying to get themselves out of it. A lot can be learned about the culture of a company by watching it go through one of these ..

Whilst the basics of best practices around IT production support have been enshrined in standards such as ITIL, just following a book or set of rules has never been a recipe for the extraordinary.

One specific dynamic I have often debated in my mind is around the "segregation of duties". I remember my time in the early 90s, when it suddenly became a bad thing for us developers to have access to production. Heaven forbid, we would make changes in production on the fly .. this was true in IT telecom (at least until telecom became IT), primarily driven by the discipline of managing mission-critical networks and related lifeline services. I did have a certain respect for this, especially given the hard-coded culture of service availability in telecom companies.

Of course, in the finance industry, I found that even data-center sysadmins were not trusted with privileged access to the very servers they were meant to administer. Also the two-eyes/four-eyes principle .. something we can thank the banking sector for.

To my point and the chart above: I have tried to capture what I see as an often misunderstood and consequently dysfunctional state of affairs within an IT organization. I define "commando" as the behavior where folks make changes to production without any due diligence, testing, etc. I define "process driven" as everything by the book and, in the extreme case, overly constraining and time-consuming (without adding any value).

On the other axis, I define "trial & error" as the mode of analysis/resolution that teams resort to when they lack the technical knowledge/understanding/skills for what they are supporting. This can be methodical and process driven and will ultimately yield a result (however, to get there with any speed, you need to be lucky). NB> I don't classify the "bounce the servers" solution as necessarily "trial & error", as it is an effective step in cutting your losses on troubleshooting when you have SLAs. I define "knowledge/skills based" as the state where the highest technical skills (typically the original developers/engineers) are applied and engaged on problem solving. NB> This is NOT a line that separates Tier 2 application support from Tier 3.

On target zones, it really depends on the business impact. Typically, however, in most companies it is a one-size-fits-all approach. Companies have a hard enough time getting consistency in their performance.

Break-glass is an interesting one, and is typically meant as a safety or "panic" button for when the normal process doesn't work or speed is required. It allows the developers to take control and break through the "segregation of duties" barriers.
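For illustration only (not any particular company's tooling): break-glass access is often modelled as a grant that is temporary, tied to a live incident, and audited, rather than as standing permission. A minimal sketch in Java, with every name hypothetical:

import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a break-glass grant: time-boxed, incident-bound, audited.
class BreakGlassAccess {

    record Grant(String engineer, String incidentId, Instant expiresAt) {
        boolean isActive() {
            return Instant.now().isBefore(expiresAt);
        }
    }

    // Elevated production access is granted only against an incident reference,
    // for a bounded window, and every grant leaves an audit record.
    Grant grant(String engineer, String incidentId, Duration window) {
        Grant g = new Grant(engineer, incidentId, Instant.now().plus(window));
        audit("BREAK-GLASS: " + engineer + " elevated for incident " + incidentId
                + " until " + g.expiresAt());
        return g;
    }

    private void audit(String message) {
        System.out.println(message);   // stand-in for a real, tamper-evident audit log
    }
}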

Questions to ask when you assess where you are on the chart:
1. When in a crisis, are your smartest/highest-skilled people engaged (and accountable)?
2. Do they have access when needed (and the tools)?
3. Are they allowed to lead, or are they muted by process?


Thursday, March 12, 2020

I am always amazed when I find IT teams with this behavior:
1. Business requires something urgently from IT
2. IT assesses the change
3. IT then designs and solutions the change
4. IT then risk-assesses the change as "high risk"
5. IT then presents the solution with a "high risk" profile to the business, pretty much scaring the pants off everyone
6. Business then backs off the ask
7. End = do nothing. IT feels happy they made a good "risk-based" decision

Bright futures in such companies .. 

Thursday, January 28, 2010

barely sufficient

I observed a very talented engineering team making a basic mistake today.

I think we often misinterpret what 'barely sufficient' means in the context of agile software development and delivery. It is mistakenly interpreted as an excuse to cut down requirements. In reality, I believe it applies more to engineering and design than to requirements. There are two different syndromes to be careful of in a software project - 'scope creep' and 'creeping elegance'.

In an agile methodology, we iterate through cycles of 'design a little', 'code a little', 'test a little'. We often embark on writing software without fully understanding the problem. I am a big fan of this versus big design up front. Personally, I pretty much think on the keyboard, as I believe people learn incrementally. That said, I am always aware that there is a risk in this mode until I have a full grasp of the problem. My energy is always directed towards activities or areas that help me flesh out unknowns. Whilst in this mode, I 'hack' for speed and refactor only after I think I have my arms around the business problem.

What I found the team doing was laying the 'foundation' down, aka building middleware. That by itself wasn't a problem; however, they had taken their eyes off the business problem and failed to deliver the results in the needed timeframe.

Programming competitions are a great way to teach developers this mindset. You have a fixed duration, so you have to be quick. You have to focus on the problem and only the problem .. no deviating onto bunny trails. You have to solve the problem in the simplest, quickest way. As engineers, we love complexity, so the last discipline is the hardest.

Thursday, June 4, 2009

Service granularity and re-use

Architects in the IT organisation where I work display an interesting tendency to equate re-usability with granularity. The received wisdom seems to be that the more granular a service is, the more re-usable it is. To a certain extent it is useful to have the ability to mix and match just the bits of functionality you require. This becomes detrimental at the point where it starts to push behaviour toward the consuming systems, of which there are usually more than one.

As a case in point, a new system is being written which (among other things) exposes the ability to look up items in a cache. It employs a side-caching strategy to do this. Its API looks something like this:
MyItemCache
    findItem(itemId : String) : Item
    createItem(item : Item)
Consumers of this service will call findItem() to check for the item in the cache, using the Item if it is there.  If the Item is not there, they are expected to fetch the Item from the source system - which is a different system entirely - and then add it to the cache, using createItem().
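Concretely, every consuming system ends up carrying something like the following (a minimal sketch in Java; SourceSystemClient, ItemLookup and the method bodies are illustrative stand-ins, not the actual interfaces of the system in question):

interface Item { }

// Mirrors the side-cache API described above.
interface MyItemCache {
    Item findItem(String itemId);    // returns null on a cache miss
    void createItem(Item item);      // adds an item to the side cache
}

// Stand-in for the separate system of record.
interface SourceSystemClient {
    Item fetchItem(String itemId);
}

// The check-then-fetch-then-populate logic that each consumer must re-implement.
class ItemLookup {
    private final MyItemCache cache;
    private final SourceSystemClient source;

    ItemLookup(MyItemCache cache, SourceSystemClient source) {
        this.cache = cache;
        this.source = source;
    }

    Item getItem(String itemId) {
        Item item = cache.findItem(itemId);      // 1. check the side cache
        if (item == null) {
            item = source.fetchItem(itemId);     // 2. miss: call the source system directly
            cache.createItem(item);              // 3. and remember to populate the cache
        }
        return item;
    }
}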

There are a number of issues with this approach.
* The fact that the cache is a side cache (and therefore does not front the source system) means that consumers of this cache must also have knowledge of the source system - making the overall architecture more complicated and brittle.
* Each consuming system is expected to implement, over and over again, the logic required to check the cache, then fetch and add the Item if it is not already cached.

While this last point may seem like a small amount of code to write, it should be remembered that forcing each client system to re-implement this logic means that the difficulty of changing this logic is multiplied by the number of client systems, with all the associated co-ordination of teams that this involves. Add to that the impact that differing or buggy implementations might add to the mix and you have a problem far more damaging than the cost of a few lines of code.

An alternative approach, which would eliminate both of these problems, would be to make the cache a through-cache, with the cache itself handling the "if not cached: fetch from source" behaviour. Having eliminated the need to manually add things to the cache, the service interface is simplified, as follows:
MyService
    findItem(itemId : String) : Item
Client systems are then relieved of responsibility for implementing this logic over and over, and need not have knowledge of yet another system.  Re-use is enhanced, while complexity is reduced.  Everyone is happy!  :-)
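For illustration, a read-through implementation might look something like this (again just a sketch: ReadThroughItemCache and its in-memory map are hypothetical, Item and SourceSystemClient are the same illustrative types as in the earlier sketch, and a real implementation would add expiry/eviction):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The simplified contract that consumers see.
interface MyService {
    Item findItem(String itemId);
}

// Read-through cache: the "if not cached, fetch from source" logic lives here,
// once, instead of in every consuming system.
class ReadThroughItemCache implements MyService {
    private final Map<String, Item> cache = new ConcurrentHashMap<>();
    private final SourceSystemClient source;   // only the cache knows about the source system

    ReadThroughItemCache(SourceSystemClient source) {
        this.source = source;
    }

    @Override
    public Item findItem(String itemId) {
        // On a miss, fetch from the source system and remember the result.
        return cache.computeIfAbsent(itemId, source::fetchItem);
    }
}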

Friday, May 2, 2008

Performance Engineering

A good article from Alok Mahajan & Nikhil Sharma from Infosys! Am quite stunned, to be frank.