When you look at the root cause of several recent cloud outages, it’s neither a power outage nor a hardware crash. It’s not even a natural disaster. We made hardware, disks, network, cooling, power, and even whole data centers redundant, and in doing so we arrived at the next challenge: the software. The software that runs on these super-redundant units has become the single point of failure. When the same software runs in multiple locations, one little bug can stop everything at the same time, despite all the redundant hardware. Windows Azure and other cloud services do a great job of rolling out software updates across multiple update zones, but that little bug is still just waiting to kick off at the right moment and give the press its next big story.
So, what does redundant software look like?
Don’t look at me, I don’t know the answer. But I assume that redundant software is built by two separate teams who are not allowed to talk to each other. Both teams have to solve the same problem, but in different ways: they take the same inputs and produce the same outputs, yet do the work in between differently. That way, a bug in one implementation is unlikely to show up in the other at the same time. The last decade was about hardware failures, and those are now largely behind us. This decade is when software takes center stage.
What do you think? Comments and conversation are welcome!