Simply, Hello
May 5th, 2008 by Mickey Panayiotakis
My plan, sitting down to write this log entry, is to present some philosophy about technology, redundancy, simplicity, and human error. I’m still battling the tense: not sure if that is my plan, or if that was my plan. I’ve realized that of more immediate concern, or at least more civilized priority, is an introduction. So allow me to introduce: Me! My name is Mickey and I am the new co-author of WCDC. Hopefully I can help make this discourse between us a little more frequent.
“Discourse,” I say? And “us”? Sure, I maintain that this is a conversation. Certainly between you, the reader, and Ernesto and me, the writers. But the “us” still remains unclear: Is it between you and me? Ernesto and me? You and Us? Us and Them? Besides bringing more frequent updates, what is my role here? Am I the yin to Ernesto’s yang? Am I the Mac to his PC? (Yes, I do use a Mac, and no, Ernesto and I don’t always agree.) After a little thinking about all that, I decided to take my own advice, the one that I meant to write about before the whole introduction business confused things: I decided to keep things simple. Forget about playing a role against my co-author, who is after all the originator of this blog and therefore has last say. Forget about keeping in costume. Just say what I have to say. Practice the fine art of spouting. Isn’t spouting what blogs are all about anyway? And what better way to start spouting than with a bit of philosophy? So here is my spout about simplicity. And this being a technology blog, I will talk about simplicity in technology.
In the empirical sciences, we often follow Occam’s razor and accept that the simplest explanation is the best. This is common practice in areas where we observe a phenomenon which we subsequently try to explain. In technology, however, we often create the phenomenon. And we quite often overcomplicate it. The problem with this is not the creation or complication: computers and technology will perform their assigned tasks regardless of how simple or complex this tangle of tasks may be. The problem is that we assume that technology might fail while failing to remember that people fail as well. In fact, people fail more frequently. We take great pains to create fail-proof technology that is self-healing, doubly redundant and fail-safe. We add spares, and spares to the spares. We double our servers, and add redundant channels and pathways. And things work. You take down part A and the Thing stays up. You bring A back and take parts C and B down and the Thing still stays up. That’s the beauty of technology: It does what it’s supposed to, no matter how complex its design. A technology architect can design a very complex system that performs exactly the way it’s supposed to, down to the last specification. Whether by assumption or by mandate from the customer, the design can be completely redundant with every piece independent of any other. And it works. The technology, that is, works.
The technology can do that, and it does. And we’ve come to expect that every piece of technology is over-engineered to that level. What we forget too often is the Human factor: sure, the technology works. But every modification (every new widget, framistat, gadget or sprocket that’s ever added, removed, or changed) needs to take into account all the intricacies of the original design. Of course, the original architect can make those changes. But every time a change happens, it requires human intervention. And every human intervention has one significant flaw: the human. As more redundancy is added to the system, more complexity is added as well. More variables enter the equation, and with each variable another chance for a human to make an error.
Case study: I had a customer a few years back that specified a fully redundant hosting network. And that’s what they got. Not a simple system, but certainly fully redundant. Fail-proof, of course. But fool-proof? Three things we can count on: Death, Taxes, and Human folly (the designer’s as well as the user’s). So over time, we had to add new elements, remove some old ones, modify existing ones. You guessed it: once in a while, the person making the changes would forget some bit of configuration, some variable. Most of the time, things would continue to work. But the redundancy was compromised and with every missed variable the system configuration resembled the original less and less. Eventually, things failed, with a long history of changes as the possible cause. Even though the equipment itself never failed, and the redundancy was never tested in a real-life situation, the system did go down. Due to human error.
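The kind of slow drift described above is something a machine can watch for on our behalf. Here is a minimal sketch, not what we actually ran for that customer: it assumes each node's configuration can be read into a plain dictionary, and the node names, setting names, and the `find_drift` helper are all hypothetical, made up purely for illustration.

```python
# Hypothetical sketch: compare two nodes that are supposed to be configured
# identically and report every setting where they disagree.

MISSING = "<missing>"  # placeholder for a setting that one node lacks


def find_drift(primary, standby):
    """Return {setting: (primary_value, standby_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(primary) | set(standby)):
        a = primary.get(key, MISSING)
        b = standby.get(key, MISSING)
        if a != b:
            drift[key] = (a, b)
    return drift


# Invented example data: someone added "new_widget" to one node only.
primary = {"vlan": 120, "gateway": "10.0.0.1", "heartbeat_sec": 2, "new_widget": "on"}
standby = {"vlan": 120, "gateway": "10.0.0.1", "heartbeat_sec": 2}

for key, (a, b) in find_drift(primary, standby).items():
    print(f"{key}: primary={a!r} standby={b!r}")
```

A check like this does not make the humans any less human. It only shortens the time between the forgotten variable and somebody noticing it.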
So, to my point: A fail-proof design? Sure, we can do that. But only when machines, and only machines, are involved. A system is operated by, maintained by, and interacts with humans. A well-designed system can be close to 100% reliable. A living being is never so. The overall system includes the human fools that operate it: the reliability of the system depends as much on the human factor as it does on the technology design. If to err is human, then (symmetrically) to be human is to err. The more knobs there are to turn, the higher the probability of human error. Machines, by contrast, rarely err. It would best serve the client to take into account the probability of human error vs. machine failure. (And to help the client realize this inequality.) Most people (architects and clients alike) assume machines will fail (because they will), yet completely ignore the fact that humans will fail (and they will fail more reliably than machines).
Told you that to tell you this: We all try to minimize machine error. But maybe (just maybe!) there are times when it’s best to compromise on full redundancy for the sake of minimizing the possibility of human error.
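To put rough numbers on the knobs-versus-spares trade-off, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not measured data: the 1% chance of botching any one setting, the 2% chance that a single unit is down, the knob counts attached to each level of redundancy, and the idea that slips and failures are independent of one another.

```python
# Back-of-the-envelope sketch. All probabilities and knob counts are assumed.

def p_human_error(knobs, p_slip):
    """Chance that at least one of `knobs` independent settings gets botched."""
    return 1 - (1 - p_slip) ** knobs


def p_machines_down(copies, p_fail):
    """Chance that all `copies` independent redundant units are down at once."""
    return p_fail ** copies


P_SLIP = 0.01  # assumed: 1% chance a person botches any single setting
P_FAIL = 0.02  # assumed: 2% chance any single unit is down

# Assumed pairing: each extra layer of redundancy brings many more knobs.
for copies, knobs in [(1, 5), (2, 20), (3, 60)]:
    machine = p_machines_down(copies, P_FAIL)
    human = p_human_error(knobs, P_SLIP)
    print(f"{copies} unit(s), {knobs} knobs: "
          f"machine risk {machine:.6f}, human risk {human:.3f}")
```

Under these made-up numbers, each added spare shrinks the machine risk, while the human risk climbs with every knob that comes along with it. Real figures will differ, but the shape of the trade-off is what matters here.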
Most outages are caused by human error, not machines. A long post to state a short philosophy: K.I.S.S.!