Why Large Software Systems Fail


(Originally published 10/31/2006 on zachmortensen.net)

In a previous article we took a critical look at Metcalfe’s law, focusing on the recent and well-reasoned argument that otherwise-intelligent people frequently overestimate the intrinsic value of network effects. In this article we’ll examine the second half of Metcalfe’s law — the cost curve — to demonstrate that while there might be a “critical mass crossover” point in communication networks beyond which network effects begin to dominate and the network explodes as a success, software systems exhibit the exact opposite behavior: A crossover point beyond which a system rapidly collapses on itself, creating a black hole that sucks up every unfortunate dollar that floats within its Schwarzschild radius.

If you think the following topic is esoteric, hard to follow, or even scarily geeky, you may be right. But given the amount of money you may have riding on bets that any one software architecture can be all things to all people, you owe it to yourself to become somewhat familiar with the underlying principles that seem to govern how software behaves on a large scale. Best of all, this article will provide a foundation for a subsequent empirical analysis of the long-term viability of the software and vendors that currently power the healthcare IT industry. Read on, and I’ll do my best to keep the topic understandable.

Our journey begins decades ago in the field of Electrical Engineering. Since most early computers were built and programmed by engineers, it’s only natural that these early computer scientists put their computers and programs to work solving the problems that were most relevant to their own work. One such problem is known as the Boolean satisfiability problem (SAT).

The driving force behind SAT was to reduce the cost of building circuits by simplifying them. A logic circuit is designed to take some set of Boolean (0-1) inputs, apply some logical operators (AND, OR, NOT, etc.), and produce a Boolean output that tells you something useful about the inputs. That is, of course, assuming that the circuit is actually capable of telling you something useful in the first place. Such a useful circuit is called satisfiable, meaning that there is some combination of inputs that will produce an output of 1. A circuit that is not satisfiable will always produce an output of 0 regardless of the inputs, meaning that the entire circuit can be replaced by a 0, thereby saving the cost of building a circuit that does nothing.
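
To make this concrete, here is a toy sketch in Python (the circuits are invented purely for illustration): the first circuit can be made to output 1, while the second never can, so it could be replaced by a hard-wired 0.

    def useful_circuit(a, b, c):
        # Satisfiable: outputs 1 for some inputs, e.g. a=1, b=1, c=0.
        return (a and b) or (c and not a)

    def useless_circuit(a, b):
        # (a OR b) AND (NOT a) AND (NOT b) can never output 1, so the
        # whole circuit could be replaced by a constant 0.
        return (a or b) and (not a) and (not b)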

An obvious way to solve SAT is by brute force: For a given circuit, test all possible combinations of inputs until you get an output of 1. Easy enough for circuits with 2 or 3 inputs, but consider that even a fairly trivial circuit with 10 inputs has 1024 unique combinations of those inputs and you begin to appreciate the scope of the problem. For a circuit with n inputs, you will need up to 2^n test cases. A circuit with 32 inputs could require more than 4 billion test cases to prove satisfiability. Ouch!
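
A brute-force satisfiability check is easy to write down; the sketch below (purely illustrative, with two throwaway example formulas) simply walks all 2^n input combinations, which is exactly why it stops being practical as n grows.

    from itertools import product

    def is_satisfiable(circuit, n_inputs):
        # Try every one of the 2**n_inputs input combinations and stop
        # at the first one that makes the circuit output 1 (True).
        for bits in product([False, True], repeat=n_inputs):
            if circuit(*bits):
                return True
        return False

    # (a OR c) AND NOT b is satisfiable (e.g. a=1, b=0, c=0)...
    print(is_satisfiable(lambda a, b, c: (a or c) and not b, 3))  # True
    # ...whereas a AND NOT a can never output 1.
    print(is_satisfiable(lambda a: a and not a, 1))               # False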

It’s easy to see that SAT can quickly become quite a hard problem to solve as the number of inputs grows. As a matter of fact, SAT is so hard that it came to define what the term “hard problem” means. Computer scientists proved that SAT is NP-complete: it belongs to the class NP, and every other problem in NP can be reduced to it. Nobody has found an algorithm that solves SAT deterministically in a polynomial number of steps with respect to the number of inputs, and unless P = NP, nobody ever will. In practice, then, the worst-case cost of solving SAT, or any other NP-complete problem, grows exponentially with the number of inputs.

But not all SAT instances are created equal. Some can be shown to be satisfiable very quickly, whereas others could take up to the theoretical maximum number of steps before we could prove or disprove satisfiability. In 1992, Mitchell et al. published a paper entitled “Hard and Easy Distributions of SAT Problems” that contained the results of an experiment that involved generating numerous random SAT problems of varying complexity and measuring the probability that a problem of a given complexity is satisfiable. The results were somewhat surprising.

Intuition would say that complexity and the probability of satisfiability are negatively correlated. That is, the more constraints you put on the inputs of a circuit, the greater the probability that they all confound each other and produce no meaningful result. Mitchell’s research found that there is indeed negative correlation; but while one might imagine that the relationship between probability and complexity would be somewhat linear, it turns out to be much more interesting:

Mitchell’s research showed that very simple SAT problems are almost always satisfiable, very complex problems are almost never satisfiable, and somewhere in between the very simple and the very complex there is a sharp threshold around which the probability changes quickly and dramatically; this is where the hard problems lie. The cost of determining satisfiability peaks roughly at the point where the probability equals 0.5, meaning that the most expensive problems to solve are those whose chances of being satisfiable (or not) are 50-50.

Geek trivia: Many researchers have refined this experiment since Mitchell’s original work, and interestingly enough the maximum cost and the 0.5 probability always seem to coincide at the point where complexity (the ratio of constraints to variables, measured as clauses per variable in random 3-SAT) is about 4.3, a number that appears to be woven as tightly into the fabric of information science as pi is into that of geometry or e into that of exponential growth and decay.
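
The experiment itself is easy to reproduce in miniature. The sketch below is a rough illustration, not Mitchell’s actual code; the variable count, trial count, and list of ratios are arbitrary choices small enough for a brute-force check. Run it and the sharp drop in the fraction of satisfiable instances shows up near a clauses-per-variable ratio of about 4.3.

    import random
    from itertools import product

    def random_3sat(n_vars, n_clauses):
        # A clause is 3 distinct variables, each negated at random;
        # a literal is a (variable_index, is_positive) pair.
        return [[(v, random.random() < 0.5)
                 for v in random.sample(range(n_vars), 3)]
                for _ in range(n_clauses)]

    def satisfiable(clauses, n_vars):
        # Brute force over all 2**n_vars assignments (fine for small n).
        for bits in product([False, True], repeat=n_vars):
            if all(any(bits[v] == positive for v, positive in clause)
                   for clause in clauses):
                return True
        return False

    n_vars, trials = 10, 40
    for ratio in (2.0, 3.0, 4.0, 4.3, 5.0, 6.0):
        n_clauses = int(ratio * n_vars)
        hits = sum(satisfiable(random_3sat(n_vars, n_clauses), n_vars)
                   for _ in range(trials))
        print(f"clauses/variables = {ratio:.1f}: "
              f"P(satisfiable) ~ {hits / trials:.2f}")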

Here’s where things start to get fun: If SAT can be reduced mathematically to a particular problem, then that problem is at least as hard as SAT and behaves the same way SAT does in terms of both the probability that it is solvable and the cost of the solution as a function of complexity. Deciding whether a set of design constraints can all be satisfied at once is exactly such a problem, which leads us to a fundamental law of software systems that helps us understand the economics of software production: The cost of producing software increases exponentially as we add constraints (also known as features) to a software system.

Remember that Metcalfe said the cost of creating a communication network would increase linearly and the value of the network would increase quadratically (polynomially) as functions of network size, leading to “network effects” once a “critical mass crossover” point is reached. Odlyzko et al. did a good job explaining that Metcalfe’s value function isn’t really quadratic but closer to n log n; that relationship still produces network effects, it just moves the crossover point significantly farther out than where some of the failed telecoms of the past decade bet it would be.

In contrast to that of hardware networks, the cost of creating software increases exponentially as a function of size, and this behavior completely changes the economics of network effects for featurewise growth of a software system.

Notice that for both communication networks and software systems the cost and value curves cross over each other. In the case of communication networks, the value curve overtakes the cost curve at what Metcalfe appropriately called the “critical mass crossover” point, alluding of course to the amount of fissile material necessary to create a self-sustaining nuclear chain reaction. Large communication networks are therefore self-sustaining and feasible whereas small networks are bound to peter out if they fail to reach critical mass.

In the case of software systems, however, the opposite is true: The cost curve overtakes the value curve, meaning that small software systems are feasible and sufficiently-large ones are not. It’s important to note that this cost-value crossover is an inevitability. The cost function is of a higher order (exponential) than the value function (n log n, or at best polynomial if you don’t believe Odlyzko), so cost will overtake value at some point as software complexity increases; it’s just a question of where.
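
To see how inevitable that crossover is, here is a toy calculation; the constants are invented purely for illustration, and only the shapes of the two curves matter.

    import math

    # Invented constants: a generous n log n value curve and a slowly
    # growing exponential cost curve.
    def value(n):
        return 100.0 * n * math.log(max(n, 2))

    def cost(n):
        return 5.0 * (1.15 ** n)

    # Find the feature count at which cost overtakes value.
    n = 1
    while cost(n) <= value(n):
        n += 1
    print(f"cost overtakes value at roughly {n} features")

Change the constants however you like; you only move the crossover point around, you never make it go away.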

We can use the exponential software-development cost curve to describe three classes of software companies. I think this exercise will help you see that you can classify a software company’s proximity to the critical-mass-crossover point based on some key characteristics of its business model:

1. Startup. The exponentially-increasing cost of software development starts out deceptively low, much lower than the value the software provides. Most software also starts out as a blank slate, its design largely unconstrained by previous decisions. Accordingly, in the startup phase software development is a relatively easy and very high-leverage activity. It’s easy to make the software do exactly what you think the market wants because the code and the team are small, easy to manage, and quite unconstrained. Every dollar invested into development yields many more dollars of value for the company, meaning that in this phase a company experiences what economists call increasing returns to scale, commonly called “economies of scale”. If a company doesn’t survive this phase, its demise usually has nothing to do with the economics of its software and more to do with poor product management, i.e. building the wrong product, competing in the wrong market, etc.

It should go without saying that since startup-phase software is the easiest software to get right, software startups and other companies still operating in the startup phase should account for the majority of software companies in business at any given time. Despite a lack of hard data to prove it, I’m highly confident that such is the case; even a quick tour of the exhibit floor at HIMSS always shows a disproportionate share of newcomers and other startups.

2. Slowdown. Success in the startup phase can give software entrepreneurs starry-eyed visions of world domination. I’m reminded of a Dilbert strip in which the Pointy-Haired Boss gleefully exclaims something to the effect of “The farther I scroll this spreadsheet to the right, the bigger the numbers get!” No doubt those in similarly exuberant states of mind are basing their forecasts on a linear software-cost curve, the assumption that a feature tomorrow will cost the same as did a feature yesterday or today, ad infinitum.


But we now know that the linear-cost assumption is flawed. It will cost the company more to add a feature to its product tomorrow precisely because it added features yesterday and today. The probability that those features, together with all of the features created before them, will do something unexpected (bad) increases exponentially, and so therefore does the cost of creating a design that actually works, to say nothing of verifying it later.
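
A quick back-of-the-envelope count (the feature counts below are arbitrary) shows why: the number of ways features can interact grows far faster than the number of features themselves.

    from math import comb

    for n_features in (5, 10, 20, 40):
        pairs = comb(n_features, 2)   # pairwise interactions
        combos = 2 ** n_features      # every on/off combination of features
        print(f"{n_features:>3} features: {pairs:>4} pairwise interactions, "
              f"{combos:,} combinations to reason about")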

The result is that the cost of producing software increases as the company hits the “knee” of the exponential-cost curve, a region characterized by more or less constant returns to scale where the slope of the cost function is about the same as the slope of the value function. The economies of scale of the past are now gone, and the company may find itself in dire straits if it relied exclusively on that extreme leverage for its profitability during the startup phase.

Though I can’t present supporting evidence at this time, my conjecture is that the vast majority of software companies that successfully leave the startup phase end up starving to death in the slowdown phase as the ballooning costs of software development overwhelm their profit margins. As they begin to falter, these companies may be picked off by bigger competitors or have their lunch eaten by a startup with a lower cost structure who can compete more effectively on price, and a lucky few may survive to the next phase.

3. Stagnation. Some companies may survive the slowdown phase, but if their strategy is to continue adding features in an attempt to dominate the market, their software is destined for stagnation.

An astute company that is measuring the profitability of its software-development projects will recognize when it is realizing decreasing returns to scale, or diseconomies of scale, on its continued investment in featurewise development of its software product. Diseconomies of scale in this phase may seem somewhat counterintuitive since we typically associate “economies of scale” with the larger companies who tend to be the ones that make it this far. But any economist will tell you that marginal cost (MC) curves eventually turn upward as the quantity of production increases, and the more-astute software companies understand that they cannot be exempt from the increasing marginal cost of developing features, their primary units of production.

A software company may choose to implement one or more strategies to deal with the stagnation that comes as software development costs increase rapidly on a collision course with the value curve:

  • It may begin to invest more heavily in a related professional-services business, as such businesses typically enjoy margins that are relatively high when compared with those of stagnating software products. Service businesses also scale nicely as they are not bound by the nasty exponential cost curve of software development.
  • It may play “You-Bet-Your-Company” by raising large amounts of capital to redesign the software product from the ground up to remove unnecessary constraints. This game is a gamble that when the rewrite is over the company will enjoy a lower cost structure that will enable future growth, and that no essential feature (constraint) will be left behind in the meantime by mistake.
  • It may choose to grow its software-product portfolio by acquisition rather than by its own development efforts. Buying a smaller complementary product makes good economic sense because features can be added to the new smaller product at a lower marginal cost than that of adding the same features to the stagnant product. However, the company will have to provide some level of integration of the two product lines, and tight integration of a newly-acquired product can be just as expensive as building the product from scratch in the first place. More on this topic later.
  • It may stick its head into the sand and ignore the problem, especially when facing its customers and prospects. After all, it’s likely that customers know very little about software development or the behavior of software in general despite their willingness to spend enormous amounts of money to acquire and maintain it.
  • It may foresee the inevitability of its software situation and shift its business strategy away from product excellence — featurewise domination of every market niche — and toward operational excellence by focusing its efforts on its most-valuable software assets and finding new ways to leverage those strengths into downmarket segments by simplifying its operations to lower its cost structure.

In spite of any company’s best efforts to the contrary, it is extremely likely, even certain, that decreasing returns to scale will continue and even accelerate as the complexity of software increases and eventually reaches a point of instability, just as the formation of a black hole is unavoidable once enough matter is gathered into one place. Regardless of how highly structured the matter may be, once the mass fits within its Schwarzschild radius it will unavoidably collapse on itself and form a black hole. In sharp contrast to the “critical mass crossover” point of communication networks, that of software systems isn’t one of self-sustaining success, but rather one of the inevitable and total collapse of a product sufficiently bloated with features: the Schwarzschild radius of software systems.

