HPTS 1999 Position Paper

 

Fault Avoidance vs. Fault Tolerance:

Testing Doesn’t Scale

 

James Hamilton

Microsoft SQL Server Development

JamesRH@microsoft.com

 

Abstract

I argue that software testing doesn’t scale and that, as software complexity and size increase, it will eventually become prohibitively expensive and time consuming to reach the required levels of software quality via careful design, rigorous process, and painstaking testing. Addressing the problem through software simplification doesn’t deliver competitive levels of system function. Addressing the problem by componentizing the system can reduce the bloat of any single installation, but it does nothing to reduce the exploding test matrix. Actually, increasing the number of different supported component configurations may increase testing complexity. Further exacerbating the problem is that an ever-increasing percentage of transaction systems are accessible globally and directly via the Internet. As a result, windows where the system can be brought down for maintenance are often nonexistent and all failures are immediately externally visible. Our transaction systems are growing larger and more complex at the same time that availability requirements are rising and our delivery cycle times are less than ½ what they were 10 years ago. Conventional approaches to system reliability and availability involving careful process, long beta cycles, and extensive testing have never worked especially well, and these approaches are failing badly now. Rather than working harder to avoid software faults, we should realize that faults are unavoidable and instead focus on fault tolerance. Basically, if we can’t make a problem go away, we should attempt to make it invisible.

1. Software size and complexity are inevitable and in some ways desirable

It has become fashionable to complain about "bloatware": unnecessarily large and feature-rich software products. Successful software products do get increasingly large with each new release, and it does appear that large, feature-rich products end up dominating a software class. Two obvious examples are word processors and web browsers. The dominant entries in each of these software categories are quite large. Why didn’t more svelte competitors win? In the end, customers buy features – the ability to efficiently solve problems – and although nobody needs or wants to pay for all of the features of a given product, the union of the features for which customers actually are willing to pay is often huge. My conclusion is that, although most users hate fat software products, they typically buy feature-rich solutions.

We’ve seen software bloat in the desktop software world and perhaps attributed it to sloppy development processes or a lack of discipline on the part of the developers, but the same process is clearly at work in server-side transaction processing systems. The server-side software stack is inexorably growing in size. For example, SAP is just over 47 million lines of code, and it depends upon a relational database system and, of course, an operating system. There doesn’t exist a top-tier relational database system smaller than 1.5 million lines of code. Recent press reports have pegged NT Server with all supporting software at over 50 million lines of code. That is perhaps an exaggeration but, at even half that size, we would conservatively be looking at a server-side software stack of roughly 75 million lines of code.

So, although none of us are comfortable or happy with the implementation size and complexity growth of modern transaction processing systems, it is undeniably taking place. I argue that, as our systems grow, our ability to deliver reliable, predictable and available TP systems is at risk.

 

2. Componentized software doesn’t reduce development complexity

I’ve argued that modern TP systems are large and increasingly complex and that simple economics drive feature richness. Appealing to a broad class of users normally yields successful systems that have many more features than any single customer would ever want or use. If we assume that this is an inevitable outcome of producing broad-based commodity systems, perhaps we can address the size issue by building these systems from component parts. No customer needs more than a fraction of the total delivered components, so we would have addressed the bloatware concern. However, rather than declaring success, we should recognize that disk space is cheap and software size isn’t the real enemy of reliability and availability. The real problem with "bloatware" isn’t size but complexity. Large, feature-rich systems are complex, nearly impossible to properly test, and their inter-feature interactions and dependencies are subtle, hard to predict, and difficult to find. A system delivered as a large number of components can yield smaller installed systems, but the testing problem is arguably made worse. Rather than needing to test that all possible paths through a single delivered system function as expected, a componentized system would require that a combinatorial explosion of possible component groupings be tested. The result is that each delivered system is smaller, but the complexity of the development problem remains unsolved and the test complexity problem has been worsened.
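To make the test-matrix arithmetic concrete, here is a rough back-of-the-envelope sketch in Python; the component counts are illustrative values of my own choosing, not figures from any shipping system:

    # Rough sketch: how the number of testable configurations grows when a
    # system is delivered as independently selectable components.
    # Component counts are illustrative, not measured.

    def monolithic_configurations():
        # One delivered binary: a single configuration to certify.
        return 1

    def componentized_configurations(component_count):
        # If any subset of components may be installed together, the number
        # of distinct configurations is 2^n, less the empty installation.
        return 2 ** component_count - 1

    for n in (5, 10, 20, 30):
        print(f"{n:2d} components -> {componentized_configurations(n):,} configurations")
    # Prints 31, then 1,023, then 1,048,575, then 1,073,741,823 configurations.

Even if dependency rules prune many of these combinations, the count still grows exponentially in the number of independently shippable components, and that growth is the heart of the testing problem.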

Admittedly, a transaction processing system developed as a composition of modest-sized components, each of which can be independently tested, would be an order of magnitude easier to work with than many of the systems our industry is shipping today. Some componentization is already occurring, as it would not be possible to deliver systems of the size the industry is currently shipping if we were building totally monolithic systems. Many modern DB systems have well-defined internal interfaces separating component boundaries – some cleaner than others – and this does reduce complexity. Yet the leading cause of a feature not making a given release is still an inability to get it tested and stabilized prior to delivery.

Testing remains the bottleneck in TP system development and, at least to date, component software, whether externally visible or just an internal part of the system architecture, has failed to address the complexity problem. Testing is a key tool for establishing that a system faithfully implements a given function without unintended inter-feature interactions, but it does not adequately address problems beyond that fairly limited scope.

 

3. The back office has become the front office

While we are facing rapidly growing size and complexity in TP infrastructures, these systems are also increasingly directly accessible to customers. Customers, suppliers, and partners can now frequently query back office TP systems directly via the Internet. These systems are no longer hidden behind a customer service representative, so failures are much more immediately obvious and can stop business rather than merely slow it down. The back office has become the front office and, as a consequence, system availability is becoming even more important than it was in the recent past – unintended failures have catastrophic business impact across a broader range of TP systems than in the past.

In addition to a broader range of TP systems being extremely sensitive to unintended downtime, these systems also are intolerant of scheduled downtime. The business is always open, so there doesn’t exist a good time to bring the system down for maintenance, upgrading, or bulk operations.

The back office becoming the front office means that system reliability and availability are more important today than they ever were. Yet, at the same time, we are building much larger and more complex systems, which bring correspondingly larger testing problems.

 

4. Rapid development and deployment reduce software reliability

In addition to exposing back office systems directly to customers, the Internet also brings a wildly accelerating pace of business. Many companies, for example Amazon, view their TP systems, and the speed with which these systems are evolved, as among their strongest competitive advantages. For many of us, this increased dependence upon TP systems as a competitive advantage is great news, but with it comes a need to evolve these systems far faster than in the past. Modern ERP systems go through 6, 12, or even more minor revisions per year. Many electronic commerce systems are changed even more frequently. Gone are the days when database development cycles were 4 to 5 years, with deployment cycles running another 2 to 4 years.

The consequence of these changes is that, while we are building much more complicated systems than ever before, the testing and "burn-in" period is now much shorter. This is just another factor calling into question the efficacy of continuing to rely on testing and long beta cycles to achieve the desired quality and application availability goals.

 

5. Software testing doesn’t scale

As TP infrastructure system size continues to grow, system complexity and testing difficulty grow even more quickly. When I started working on database systems, the test group was a small fraction of the size of the development team; yet, on the current project, less than a decade later, the test team is somewhat larger than the development group. And, even with a substantial organization of very talented engineers completely focused on quality assurance, one of our biggest concerns remains adequate testing. Testing, and QA in general, appears to be one of the few engineering tasks still more difficult, or at least less well understood, than the development of large TP systems.

Without substantially changing the way we test these systems, the test teams will need to grow far faster than the development teams, and it’s not clear that larger teams will be particularly effective at maintaining current quality levels. We face a conundrum: TP system size and complexity are growing rapidly, development processes have, to be charitable, only improved modestly, and test and beta durations are being forced shorter, all while system availability requirements are increasing.

Current testing methodologies are incrementally improving. We can point to many test innovations that are really helping, such as automatic fault injection, random test case generation and verification, and code path analysis tools, but none of these appear to adequately address the problem. Current methodologies do allow us to functionally test complex systems fairly efficiently and with high accuracy, but they do little to help us with complex inter-feature dependencies that fail only rarely and are very difficult to reproduce.
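As one concrete illustration of the fault-injection idea mentioned above, consider the Python sketch below. It is purely illustrative; the wrapper, the fault rate, and the allocate_page function are stand-ins of my own rather than part of any particular test framework:

    import random

    # Illustrative fault-injection wrapper: during a test run, a configurable
    # fraction of calls through this wrapper fail artificially so that the
    # caller's error-handling paths actually get exercised.

    FAULT_RATE = 0.05  # fail roughly 5% of calls; tuned per test run

    class InjectedFault(Exception):
        pass

    def with_fault_injection(operation):
        def wrapper(*args, **kwargs):
            if random.random() < FAULT_RATE:
                raise InjectedFault(f"injected failure in {operation.__name__}")
            return operation(*args, **kwargs)
        return wrapper

    @with_fault_injection
    def allocate_page(size):
        # Stand-in for a real resource acquisition (buffer, lock, connection).
        return bytearray(size)

    # A harness would call allocate_page many times and verify that the system
    # degrades gracefully whenever InjectedFault is raised.

Techniques of this sort are valuable for functional robustness, but they still leave the rare, timing-dependent inter-feature failures largely untouched.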

Just as formal methods have failed to efficiently establish the correctness of anything but very small programs, current testing methodologies appear to be failing to establish the correctness of very large software systems. As the systems grow, we need a more scalable means of ensuring robust and highly available TP system software.

 

6. Fault avoidance vs. fault tolerance – a more scalable approach

Summarizing the trends we’ve observed: TP systems are increasingly directly accessible by customers and suppliers, so reliability and availability, always important features, are becoming even more so. System size and complexity are on the increase – many systems have grown by at least a factor of two over the last 5 years – yet development cycle times are being reduced. Software testing costs are climbing far faster than the size and complexity of the systems themselves, and testing is becoming both the dominant engineering cost and the limiting factor in system evolution speed.

Highly reliable mainframe TP infrastructure systems are typically correspondingly expensive, difficult (read expensive) to manage, hard to adapt, and feature-poor. Commodity TP system software is typically much more functionally rich and much easier to install and configure, but two orders of magnitude less available.

So, as an industry, we are currently offering our customers a choice between fairly reliable systems that evolve quite slowly and are very expensive to manage, and considerably less reliable systems that have been evolving much more quickly. Neither alternative is a great answer for those trying to deploy modern TP systems and, worse, both classes of systems are straining our ability to adequately test. Increasingly, the industry is finding that throwing more money at the test organization and hoping for the best just doesn’t scale.

At SIGMOD’98, Eric Brewer described the Inktomi search engine approach to high availability and, although search engines target a very forgiving domain, his presentation offers some interesting solutions to the more serious problems we’re facing. Eric argues that administrative errors are a huge cause of TP system downtime, so the first lesson is to develop systems where no operator or administrator is needed, or even allowed, in the room with the system. Develop the system such that failures don’t cause downtime, there is no need to page administrators in the middle of the night, and nobody has to race to work and desperately try to repair a failure and, in that process, possibly do further damage. The room needs to be kept dark and, to do that, the system must be able to survive all failures. In the Inktomi example, surviving system failures is easy in that, if some random percentage of the data goes offline, the results returned are still acceptable to the sponsors and users. Most TP systems don’t have that luxury. Nonetheless, his lesson remains valid: make everything redundant and ensure that the system can remain available through all failures. Companies such as Tandem have for years been offering TP systems that can survive most system failures. The difference is in the way Inktomi administers its systems: there is nobody in the room, and parts are replaced only once a week and never in a rush. This is a good step towards reliability.
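A rough sketch of this style of tolerating partial failure follows; the partition layout, the failure set, and the function names are invented for illustration and are not Inktomi’s actual design:

    # Sketch of returning partial results when some back-end partitions are
    # unavailable. Partition contents and the failed set are invented.

    PARTITIONS = {
        "p0": ["apple", "apricot"],
        "p1": ["banana"],
        "p2": ["cherry", "citrus"],
    }
    DOWN = {"p1"}  # pretend this partition has failed

    def query_partition(name, prefix):
        if name in DOWN:
            raise ConnectionError(f"partition {name} unreachable")
        return [item for item in PARTITIONS[name] if item.startswith(prefix)]

    def search(prefix):
        results, failed = [], []
        for name in PARTITIONS:
            try:
                results.extend(query_partition(name, prefix))
            except ConnectionError:
                failed.append(name)  # degrade: skip the partition, don't fail the request
        return results, failed

    print(search("a"))  # (['apple', 'apricot'], ['p1']) -- still useful output

Most TP systems can’t silently drop a partition’s worth of data, but the same structure – fan out, tolerate the failed member, keep answering – applies when each partition is itself replicated.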

The second strategy for achieving high availability is more radical and, although it’s something we all could have been doing for years, I view it as quite a departure. Inktomi is an Internet search engine, and search is a very competitive market that moves very quickly. As a consequence, they release new software frequently, often more than monthly. So, they have a very large TP system and they are attempting to revise the software at a rate that would scare most of us. What about the tough-to-find race conditions that can take weeks or months to chase down? Memory leaks? Uninitialized variables? How do they get any degree of reliability when putting transaction processing software online with little testing and no beta cycle? Rather than trying to design and develop with rigorous care and then painstakingly test the system over the better part of a year, they recognize that they don’t have time to make the system reliable, so they make it available instead. They’ve designed the system so that failures of any component don’t affect availability and, having done that, they simply allow the constituent parts to fail. As long as the system stays available, the failures of the parts really don’t matter; and we’ve already argued that it’s simply not possible to develop reliable systems on this type of schedule.

The Inktomi system suffers from all the same problems that most TP systems suffer from, but the key difference is that they assume it will have memory leaks, and so they reboot component parts nightly or as needed. They assume there will be uninitialized variables and unfound race conditions and, rather than slowly working them out of the system, they assume these faults are going to occur and design the system to be impervious to frequent failures of the constituent parts.
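A minimal sketch of that rejuvenation idea, assuming a pool of worker processes that can each be recycled without taking the service down; the worker command, pool size, and restart interval below are hypothetical:

    import subprocess, time

    # Minimal software-rejuvenation sketch: restart workers on a schedule so
    # that slow leaks never accumulate long enough to matter, and restart any
    # worker that has crashed. All parameters below are invented.

    WORKER_CMD = ["python", "worker.py"]  # hypothetical worker process
    RESTART_INTERVAL = 24 * 60 * 60       # recycle each worker daily
    POOL_SIZE = 4

    def run_pool():
        workers = [subprocess.Popen(WORKER_CMD) for _ in range(POOL_SIZE)]
        started = [time.time()] * POOL_SIZE
        while True:
            for i, proc in enumerate(workers):
                crashed = proc.poll() is not None
                stale = time.time() - started[i] > RESTART_INTERVAL
                if crashed or stale:
                    if not crashed:
                        # Recycle one worker at a time; the rest keep serving,
                        # so the restart is invisible to callers.
                        proc.terminate()
                        proc.wait()
                    workers[i] = subprocess.Popen(WORKER_CMD)
                    started[i] = time.time()
            time.sleep(5)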

Many TP designers have been building systems for years that can survive hardware and software failures; I mentioned Tandem earlier. The change in approach is to apply this technique for surviving periodic system failure to problems that we typically spend a fortune removing from our systems. It may seem almost irresponsible to release a TP system with a memory leak but, on the other hand, if the leak can be made invisible to the users of the system, it may be the right answer.

Clearly, the system still has to be functionally correct, but it turns out that we do an excellent job of testing for functional correctness. Where we frequently fail is in the removal of complex and unreproducible multi-user interactions and inter-feature dependencies. So, a system that can mask these failures can be substantially cheaper to produce, and can be evolved and adapted to meet changing customer requirements much more quickly.

My observation is that, rather than testing harder and longer in an effort to avoid software faults, we should be designing systems that remain available in the presence of these faults. In effect, we should build in fault tolerance rather than attempting to avoid faults. With enough redundancy we can produce highly reliable systems out of very unreliable parts. Essentially, this is an argument that all TP systems, even small ones, should be built upon clusters of redundant parts. We should focus on functional testing to make sure that all single-user transactions can be executed faithfully. But rather than relying on stress testing and long beta cycles, we should recognize that these techniques don’t scale, assume that the parts will fail frequently, and, instead of fighting those failures, focus on making sure they aren’t externally visible.
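The redundancy arithmetic behind that claim is worth spelling out. Assuming independent failures (an optimistic assumption) and availability numbers chosen purely for illustration:

    # Back-of-the-envelope redundancy arithmetic. If each replica is
    # independently up a fraction `a` of the time, then at least one of n
    # replicas is up 1 - (1 - a)^n of the time. Figures are illustrative.

    def system_availability(a, n):
        return 1 - (1 - a) ** n

    for n in (1, 2, 3):
        print(n, "replica(s):", round(system_availability(0.99, n), 6))
    # 1 replica(s): 0.99      (~3.7 days of downtime per year)
    # 2 replica(s): 0.9999    (~53 minutes per year)
    # 3 replica(s): 0.999999  (~32 seconds per year)

Each individually unreliable part drags the system down only when every one of its peers is down at the same time, which is why a cluster of redundant, frequently failing parts can still present a highly available whole.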