“Take the opportunity in an incident to apply the skill for a specific problem with effort that’s isolated,” said Andy Ellis (@csoandy), CSO for Akamai, as I broke into a conversation he was having at the 2015 RSA Conference. “[The problem is] people in a vacuum will overdesign software. They want it to solve all problems. In an incident you say, ‘I need it to solve one problem,’ let me code to do one problem. And it’s not software engineering often, it’s really just coding.”
That led me to ask Ellis about Netflix’s Chaos Monkey, a tool that deliberately creates problems on the company’s own network to keep that network rugged. He explained Netflix’s reasoning for releasing Chaos Monkey.
“The Internet is so large and the users are so massive in terms of what they can do that you cannot actually do reasonable testing. In production, we have to deal with chaos, so let’s enforce chaos on us. So if we happen to have six months where things go swimmingly well, nobody would become inured to that and say, ‘Oh great, we don’t have to worry about chaos anymore,’” said Ellis. “No, they’re enforcing chaos in their system to ensure they build robust software. They’re generating a requirement that you must operate in an environment that sucks. And frankly, the Internet kind of sucks too. Our goal is to make it suck less.”
Because Akamai is so widely distributed, it doesn’t need to impose chaos or manufacture incidents; incidents simply happen all the time. “Our Chaos Monkey is ‘terra,’” said Ellis. The Internet as it is causes enough problems that Akamai doesn’t need to create new ones.