Friday, March 20, 2020

Production Support


I have seen my share of IT processes across a number of companies over the past 25 years. The ones around production support have always been the most interesting, specifically, in situations where teams/organizations get themselves into trouble and then spend their life on crisis/bridge calls trying to get themselves out of it. A lot can be learned about the culture of a company going through one of these ..

Whilst the basics of best-practices around IT production support have been enshrined in standards such as ITIL, just following a book/rules has never been a recipe for extra-ordinary.

One specific dynamic I have often debated in my mind is around the "segregation of duties". I remember my time in the early 90s where it became a bad thing suddenly for us developers to have access to production. Heaven forbid, we would make changes in production on the fly .. this was true in IT telecom (i.e. at least until telecom became IT), primarily driven via the discipline of managing mission critical networks and related lifeline services. I did have a certain respect for this, especially given the hard-coded culture of service availability in telecom companies.

Of course, in the finance industry, I found that even data-center sysadmins were not trusted with privileged access to the very servers they were meant to administer. Also two-eyes-four-eyes .. something we can thank the banking sector for.

To my point and the picture above. I have tried to capture what I see is an often misunderstood and consequently dysfunctional state of affairs within an IT organization. I define "commando" as the behaviors where folks make changes to production without any due diligence or testing etc. I define "process driven" as everything by the book and in the extreme case, overly constraining and time consuming (without adding any value).

On the other axis, I define "trial & error" as the mode of analysis/resolution that teams resort to when they lack the technical knowledge/understanding/skills of what they are supporting. This can be methodical and process driven and will ultimately yield a result (however, on speed, you need to be lucky). NB> I don't classify the "bounce the servers" solution as necessarily "trial & error" as it is an effective step in cutting your losses on troubleshooting when you have SLAs. I define "knowledge/skills" based as state where the highest technical skills (typically the original developers/engineers) are applied and engaged on problem solving. NB> This is NOT a line that defines Tier2 application support vs Tier3.

On target zones, it really depends on the business impact. Typically, however, in most companies, it is a one-size fits all approach. Companies have a hard enough time getting consistency in their performance.

Break-glass is an interesting one and typically is meant as a safety or "panic" button, when normal process doesn't work or speed is required. This allows for the developers to take control and break the "segregation of duties" barriers.

Questions to ask when you assess where you are on the chart :
1. when in a crisis, are your smartest/highest skilled people engaged (and accountable) ?
2. do they have access when needed (and the tools) ?
3. are they allowed to lead or are they muted by process ?


No comments: