Thursday, September 29, 2016

Operational Maturity

OK, I'm coining a new phrase. It's called operational maturity, and it's the difference between someone who possesses great skill and someone who is great at what they do.

First, a few real-life examples:

Case 1: Missed Opportunity with the Family
Joe has a server deployment due on Friday. Bob, the hardware guy, owns the task of racking the server and configuring the hardware according to organizational specifications. Joe's job is to take the server over once it is racked and to get the operating system and configuration portion done. Once Joe is done, he'll hand the server off to Sarah. She will install the applications, configure them, and ensure that when the server boots, it's ready for duty. The problem is, it's Thursday night, and Joe has only just seen the e-mail saying that the server has been racked and configured. He is scrambling to get the work completed so the server can be delivered. The OS is installed, but the server just won't boot (though if you're paying attention, that's irrelevant). Joe's night with the family is shot as he works through troubleshooting and getting the host stood up.

Case 2: It Was Fine When I Left It...
Mark put in his change order. He conformed to change control procedures by documenting the steps to execute the change. He put in a nice back-out plan, and even included a validation plan. The change was intended to resolve a performance issue where it was suspected that I/O was slow due to excessively large buffers. So the steps included changing the size of the buffers in a configuration file, making the same change to the running operating system, and testing. His validation plan was to print out the configuration file to the console, and look for the new setting in the output. This would ensure that the steps of the change were performed correctly. When he was done, he dutifully performed the validation, which passed, and disconnected for the night. Meanwhile, the server is now dropping an average of 450 packets per second, and performance is no better than it was.

Case 3: The Case of the Disappearing Server
Sheila is troubleshooting a Linux server. It's running, but it is experiencing I/O errors, and there is concern that the box may fall over. For the uninitiated, there are two ways to access the server: ssh (think PuTTY or SecureCRT; specifically, access via the network interface), and a server console connection through its out-of-band interface (HP iLO, Dell iDRAC, etc.). Just to complicate things, the server is in a secondary data center over 200 miles away. Connected to the server over ssh, she makes some configuration changes, then reboots the server. Two minutes pass, then three, then five. Pings continue to fail; something is wrong here. Attempts to connect to the out-of-band interface come back with errors indicating that the remote device cannot be reached.

Each person is technically very proficient, but they all made an error, and it was the exact same error in every case. In case 1, Bob handed the server over to Joe the night before it was due to be delivered. In case 2, Mark validated the change steps rather than the intent of the change. In case 3, Sheila assumed that when the server was rebooted, it would come back. Had Bob turned the server over to Joe two days prior, there would have been time to deal with the unruly hardware. Had Mark devised a performance test, he might have seen that packets were dropping. Had Sheila simply opened an out-of-band console session prior to the start of the work, she would have realized the danger in restarting the server, since she would have known in advance that a failed reboot would mean no access to the server at all.
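To make Mark's case concrete, here is a rough sketch of what validating the intent of a change (rather than its steps) might look like. It assumes a Linux host that exposes per-interface counters in /proc/net/dev, and it samples the drop counters twice to get a rate. The function names, the ten-second interval, and the one-drop-per-second threshold are all made up for illustration; a real validation plan would pick numbers that fit the workload.

```python
import time


def parse_drops(proc_net_dev_text):
    """Return {interface: rx_drop + tx_drop} from /proc/net/dev-style text."""
    drops = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 12:
            continue  # not a counter line we understand
        # Receive columns come first (drop is the 4th), then transmit
        # columns (drop is the 4th of those, i.e. the 12th overall).
        drops[name.strip()] = int(values[3]) + int(values[11])
    return drops


def drop_rate(before, after, seconds):
    """Drops per second for each interface present in both samples."""
    return {
        iface: (after[iface] - before[iface]) / seconds
        for iface in after
        if iface in before
    }


def read_drops(path="/proc/net/dev"):
    with open(path) as f:
        return parse_drops(f.read())


def validate_change(max_drops_per_second=1.0, interval=10):
    """Hypothetical validation step: pass only if no interface is
    dropping packets faster than the (assumed) threshold."""
    before = read_drops()
    time.sleep(interval)
    after = read_drops()
    rates = drop_rate(before, after, interval)
    passed = all(rate <= max_drops_per_second for rate in rates.values())
    return passed, rates
```

Printing the configuration file only proves the edit was made. A check along these lines, run before and after the change, would at least show whether the server got better or worse.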

So the error? In every case, the engineers planned their work around the scenario they hoped would happen. Those possessing operational maturity recognize that things can and will go wrong, and they build contingencies into their planning. We will all make mistakes, whether by mistyping a command, making an assumption, or simply overlooking a detail. But by demonstrating operational maturity, we build in the safety nets that protect the business or organization.
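A safety net does not have to be elaborate. As a small illustration from Sheila's case, the sketch below refuses to call a reboot "safe" unless the out-of-band interface answers first. It only attempts a TCP connection, which shows that something is listening, not that a console login will actually work, so treat it as a first gate rather than proof. The host name, the function names, and the choice of port 443 are assumptions for the example; check what your iLO or iDRAC actually listens on.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def safe_to_reboot(oob_host, oob_port=443, timeout=3.0):
    """Hypothetical pre-flight gate: only proceed with a remote reboot
    if the out-of-band interface can be reached beforehand.
    Port 443 is an assumption, not a universal default."""
    return is_reachable(oob_host, oob_port, timeout)


if __name__ == "__main__":
    # "oob.example.net" is a placeholder, not a real management address.
    if safe_to_reboot("oob.example.net"):
        print("Out-of-band interface answers; a failed reboot is recoverable.")
    else:
        print("No out-of-band access; fix that before touching the server.")
```

Five seconds spent on a check like this, before the first configuration change, turns "the server disappeared" into "I knew the console was down, so I waited."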