“The problem… as it often is, are the metrics. It is a situation where if you can’t count what’s important, you make what you can count important.” — Army advisor James Willbanks on the use of body counts as a metric of success during the Vietnam War.
Once a week I play volleyball at an indoor sand volleyball center about 5 miles from my home in Colorado. I discovered volleyball as an adult while living in Hermosa Beach and developed a passion for the sport. Despite that, and the fact that I’ve now been playing for more than a decade, I’ll never be considered “good” by any serious volleyball player. Nevertheless, I occasionally enter competitive amateur tournaments against others who are often a level or two better than me. This happened at a pre-pandemic tournament in Vail, where my partner and I unsuspectingly signed up for an open tournament whose only qualification was that the sum of our ages exceeded 70. As it turns out, these players, though older, are pretty good. Some are very good.
As competitors, we realized that our only hope was to out-hustle our superior opponents. We ran down every ball, scrapped for every possible point and applied every ounce of will and determination we could muster. At the end of our first match, we shook hands with our opponents (remember when shaking hands was a thing?). We had covered twice as much ground, burned five times the calories and likely swallowed more sand than the rest of the field combined. We lost 21–8. As we walked off the court, we were thankful the margin wasn’t worse.
Anyone who has played competitive sports understands the relationship between effort and performance. Practice hard, play hard and results eventually follow. But what if we valued effort above results? What if we gave the Heisman Trophy to the player who tried the hardest, or the Stanley Cup to the team that was most tired at the end of the season? Are you laughing yet? Of course you are. But this is exactly how the federal government is spending your tax dollars in the defense industry, particularly when it comes to building software. What makes this more egregious is that, unlike athletics, writing code is a leveraged activity: good programmers can outperform bad programmers with a tiny fraction of the effort. In fact, the higher the creativity component of a profession, the more likely it is that inputs are disconnected from outputs. Silicon Valley guru Naval Ravikant tells us to “forget 10x programmers — 1000x programmers really exist, we just don’t want to acknowledge it.” Naval says he’ll take a single skilled software engineer over ten engineers exerting 10x the effort (see The Almanack of Naval Ravikant). So why would we pay an entire industry this way?
Since the beginning of time (or at least the beginning of software), developers have been frustrated by their customers’ demands for plans and estimates. Developers were expected to estimate a precise cost for a fixed set of requirements, and for decades this didn’t work. Agile practice emerged in the early 2000s as a response: between cost, content and schedule, a customer can demand that two of those variables be fixed, but not all three. In other words, when we are building something for the first time, those who will be building it require flexibility to account for unknowns. If there were no unknowns, you wouldn’t need a developer — you’d simply license what someone else has built, or better yet, download it or copy it for free!
While the commercial world has, for the most part, caught on to this concept, the government has not. There are obvious and rational reasons for this. For one thing, the government can’t just enter into a $100M development contract and tell the contractor to “get as much of what’s on this list done in 3 years.” Likewise, it can’t tell the contractor, “let’s see what you can do with this ten million dollars.” The government might actually be much better off if it could, but there are federal laws to follow, and at the end of the day, we all want more rigor in place when it comes to responsibly investing taxpayer money.
During the waterfall era, cost estimates for software were based largely on what’s known as “SLOC,” or Software Lines of Code. The more lines of code the contractor estimated, the higher the cost. Each SLOC was typically associated with a number of hours covering how long it took to write the code, test the code, integrate the code and, usually, fix or rewrite the code. This estimate was based on historical data from past similar projects. As programs ballooned in cost, so did the estimates upon which they were based. This continued for decades as costs for engineering, manufacturing and development (EMD) contracts skyrocketed. Meanwhile, the Agile movement of the early 2000s and the DevOps revolution of the decade that followed created a gap between the defense and commercial software development industries that was impossible to ignore any longer. The Defense Innovation Board (which included participation by Alphabet’s Eric Schmidt) published a number of reports and studies on the state of software development in the DoD confirming this widening gap. In 2018, the Commander of Air Force Space Command, General John Raymond, along with Assistant Secretary of the Air Force for Acquisition, Technology and Logistics Will Roper, signed off on a memorandum directing a “unified DevOps construct” to enable “industry partners to rapidly develop, integrate and deliver capability on a timeline of weeks to months, instead of months to years.”
Overnight (which in government speak is 1–2 years), organizational reform was underway. Consultants were brought in, a new parlance was adopted and speed of delivery became the subject of every conversation. The Space and Missile Systems Center (SMC) — Air Force Space Command’s $1B/yr acquisition arm — unveiled “SMC 2.0,” a complete overhaul that included promises to “break down bottlenecks, cut through red tape, and deliver at the speed of need.” It even stood up an entire division called ATLAS-X, chartered to represent the “intersection of agility, innovation and analytics in support of the SMC Vision: To forge an agile team that delivers innovative, war winning capabilities.” It all sounds nice, but now that it’s well underway, is it succeeding?
The answer to this question must be clear at the outset if we’re going to have any hope for change. When I asked the ATLAS-X team how SMC 2.0 was doing from an Agility standpoint, their answer was “great.” When I pressed for how they knew it was great, they told me that “every program now has at least one Agile coach.” So, at least for this organization (or the people I spoke to within it), success was measured by the presence of Agile coaches. Is that success? Perhaps, but only if you’re in the Agile coaching business.
The Cambridge Dictionary defines a proxy as “a situation, process, or activity to which another situation, etc. is compared, especially in order to calculate how successful or unsuccessful it is.” Proxies are used when the actual measurement you’re seeking is difficult or impossible to ascertain, so you use a substitute or surrogate instead. The more a proxy correlates with the actual outcome or measurement, the more value it has. For example, the unemployment rate is a reasonable proxy for the state of the economy, because there is a high correlation between people having jobs and a prosperous economy. Literacy rate can be a proxy for the quality of a country’s education system, and so on. A good proxy exhibits a high correlation between what you can measure and what you actually want to measure.
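To make “highly correlated” concrete, here is a minimal sketch in Python that scores two hypothetical candidate proxies by their Pearson correlation with the outcome we actually care about. The numbers are invented and purely illustrative:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical quarterly data: the outcome we want (value delivered to users)
# versus two candidate proxies we could actually measure.
value_delivered   = [10, 14, 18, 22, 30]   # what we actually care about
working_releases  = [2, 3, 4, 5, 7]        # proxy A: shipped, working software
story_points_done = [40, 55, 35, 90, 20]   # proxy B: points "completed"

print(pearson(value_delivered, working_releases))   # 1.0 (perfect in this toy data)
print(pearson(value_delivered, story_points_done))  # weak, even slightly negative
```

In this toy data, proxy A tracks the real outcome almost perfectly while proxy B is noise; only proxy A would make a useful scoring metric.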
In The Principles of Product Development Flow, Reinertsen argues that all measurements are a proxy for lifecycle profits. A company is compelled to deliver products that result in net positive lifecycle profits; failure to do so results in the failure of the company. The competitive nature of this scoring system almost guarantees that the most profitable companies are those delivering the most value to their users, sooner and more frequently than their competitors. Deliver something your users don’t want and they won’t buy it. Take too long and they’ll get it from someone else. It’s this competition that drove virtually all development to Agile in the 2000s and to DevOps in the 2010s.
One goal of initiatives like SMC 2.0 is to catch up on all of the advancements that have largely been ignored by an industry insulated from broad competition, an industry that has settled into an alarming stagnation. In the end, the government’s version of success looks very similar to the commercial version: it wants quality products, faster and cheaper. Agile and DevOps are simply tools that help the government build things faster, cheaper and better. With success defined in this manner, the critical unanswered question is how we keep score without the so-called basis for all proxies — a product’s P&L. (Defense industry developers obviously have P&Ls, which are unfortunately often improved by poor performance on cost-plus contracts — a topic for another post, perhaps.)
Since speed of delivery is of the utmost importance to DoD organizations seeking to equip warfighters with the best available technology, it may seem reasonable to begin with average team velocity as the most natural and most important proxy. This decision appears to be popular across the board, particularly within large industry developers. The purpose of this blog is to sound the alarm on this dangerous mistake, one that dooms most Agile transformations within DoD before they even begin.
In Agile development, a Scrum team of typically 6–8 members uses velocity to help determine how much work it can accomplish over the course of a Sprint. Work on a team’s backlog is broken up into User Stories — small pieces of functionality described from the end user’s perspective that a team can theoretically build and demonstrate within the duration of one Sprint (or iteration). Breaking User Stories up such that they can be demonstrated at the end of a Sprint establishes a feedback loop that allows the Product Owner of the development team to make changes to the team’s Product Backlog based on regular feedback. This feedback loop is the essence of Agility.
At Sprint Planning, teams decide how many User Stories to take on such that they can all be demonstrated within the Sprint. Most Scrum teams use Story Points to assist with this estimation. By sizing stories relative to one another using team-generated estimates (teams typically use the Fibonacci sequence, where fidelity decreases with magnitude), they can get a rough idea of how many stories they are likely to finish within a Sprint. Because velocity is empirical, it gets more accurate over time. For example, a team may initially estimate it can complete 15 points, but if it actually completes 6 points, that 6 is its velocity, and it becomes the capacity estimate for the following Sprint (teams may choose to use a three-Sprint rolling average or another average, provided it’s grounded in actual results).
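The empirical capacity estimate described above can be sketched in a few lines of Python. The three-Sprint rolling average is one common convention rather than a Scrum rule, and the numbers below are illustrative:

```python
def next_sprint_capacity(completed_points, window=3):
    """Estimate capacity for the next Sprint from actual completed points.

    Uses a rolling average of the last `window` Sprints; with fewer than
    `window` Sprints of history, averages whatever history exists.
    """
    if not completed_points:
        raise ValueError("need at least one completed Sprint to estimate")
    recent = completed_points[-window:]
    return sum(recent) / len(recent)

# A team plans 15 points in Sprint 1 but actually completes 6 —
# the 6, not the 15, drives the next estimate.
history = [6]
print(next_sprint_capacity(history))   # 6.0
history += [8, 10, 9]
print(next_sprint_capacity(history))   # (8 + 10 + 9) / 3 = 9.0
```

The key property is that the input is always what the team actually finished, never what anyone hoped it would finish, which is exactly what makes velocity self-correcting inside a single team.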
Why, then, is team velocity a bad proxy for organizational performance, agility and throughput? Why, when an organization makes the decision to put team velocity front and center does it cause so much harm?
1. It wasn’t meant for that: Team velocity and story points are tools for team estimation and relative performance improvement. As soon as you take these estimates away from the teams, you’ve immediately disempowered them.
2. Point inflation: As soon as teams realize they’re being graded by how many points they can complete, there’s a massive incentive to game the system. The easiest way to increase velocity is to start inflating the Story Point estimates. The second easiest way is to cherry-pick the easiest stories. Either option destroys organizations in the long term.
3. Calibration with time: Comparing one team’s velocity against another team’s is comparing apples to oranges. These are supposed to be relative sizing exercises with empirical checks every iteration. The only way to correct for this is to calibrate points to time, and once you’ve done that, you’re no longer using point estimation; you’re using time estimation. Story Points were invented precisely to divorce estimates from time and use relative size instead. Distance and time make a good analogy: how far are you from New York City? Two teams may both be 300 miles away but approaching from different directions, so Team A might get there in 6 hours while Team B takes 8. If we tell both teams the answer needs to be 6 hours, we’ve undermined the whole process and, again, disempowered our teams. Unfortunately, the most common Agile framework in government today (SAFe) recommends normalizing points so that team estimates can be rolled up across a program. But even SAFe recommends normalizing to no finer granularity than a half day, and further notes that “there is no need to recalibrate normalized estimates. It is just a common starting point.” These nuances are almost universally overlooked within the organizations we work with.
4. Meaningless to End Users: A story point means literally nothing to your end users. What matters to your end users is working code. Your performance must be correlated with how fast and how well your teams can deliver useful, working code.
5. Partial Credit: Making the story point the central focus encourages partial completion of work and out-of-control Work In Progress (WIP). We’ve observed more than one program where the strategy was to start as many User Stories as possible in order to take partial credit along the way. In one case, I observed a team take on 30 User Stories in a single Sprint in order to rack up as many story points as possible. They “completed” 50 points (many of which did not involve writing any code, let alone working code) and carried 27 of the Stories to the next Sprint.
6. Local Efficiency: Focusing on how fast you can click off points encourages teams to organize by skill set. Engineers design, coders code, testers test, integrators integrate. Cross-functionality (one of Scrum’s major tenets) is seen as being in the way. There’s no time to build T-shaped skills — continuous learning only slows things down. Eli Goldratt once said that “local optimization is enemy #1.” Using story points in this manner is the new enemy #1.
7. Morale: Developers don’t like to be treated like point machines. Walk through the halls of any of these programs and you’ll see beaten down, broken spirits.
8. Retention: Of course, most of the top talent will go elsewhere. To keep those who remain, developer salaries on these programs run 30–40% above market, a cost that we all share as taxpayers.
9. Customer Confusion: I want a working product and you’re giving me a point total instead. Beyond being meaningless to the end user, story points mean nothing to the customer either (in the government contracting space, the customer and the end user are different entities). I can’t do anything with points; where’s my product?
10. It’s really just waterfall with Scrum teams: Most of these programs have done nothing to adopt true Agile practice. Recently, I held a User Story writing workshop for an auditorium of developers where not a single one had interacted with, or could even describe, the user! At the same time, the Program Management Office (PMO) employs dozens of Cost Account Managers (CAMs) to ensure this behavior persists. As an Agile Coach on one program, I was told to divide my time each day across 8 charge codes. Some engineers had 20 or more codes to divide their time between. The number of charge codes was so out of control that the CAMs provided a spreadsheet delineating the appropriate labor mix for each person — I was told to assign 12.5%, 32%, 21%, 4% and 30% to the respective charge codes. Three hundred people participated in this meaningless exercise every day, with the results provided to the government as the primary indication of the project’s conformance to plan.
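Several of the anti-patterns above, particularly point inflation and partial credit, become visible the moment you score by finished work instead of claimed points. Here is a minimal Python sketch, with invented numbers modeled on the 30-Story Sprint described under Partial Credit:

```python
from dataclasses import dataclass

@dataclass
class Story:
    points: int
    done: bool   # meets the Definition of Done: demonstrable, working code

def sprint_report(stories):
    """Contrast points claimed for started work with points actually finished."""
    claimed = sum(s.points for s in stories)        # every started Story counts
    finished = [s for s in stories if s.done]
    delivered = sum(s.points for s in finished)     # only Done work counts
    carried = len(stories) - len(finished)          # WIP dragged to next Sprint
    return claimed, delivered, carried

# 30 Stories of 5 points each are started; only 3 reach Done.
stories = [Story(points=5, done=(i < 3)) for i in range(30)]
claimed, delivered, carried = sprint_report(stories)
print(claimed, delivered, carried)   # 150 15 27
```

A scoreboard that reports `claimed` rewards starting work; one that reports `delivered` and penalizes `carried` rewards finishing it. The gap between the two numbers is the size of the illusion.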
During Vietnam, James Willbanks realized that the Army’s choice of proxy (body count) was literally killing people. The misapplication of the Story Point in DoD Agile development is literally killing Agile. Why is this so hard to correct? We gravitate to the familiar, to what we can easily see. Reinertsen describes this phenomenon with the analogy of a drunk searching for his keys under the streetlight instead of in the place where he dropped them.
Latching onto the Story Point enables these organizations to feel safe among all those scary Agile concepts like team empowerment, iterative delivery and adaptation. The only problem is that the familiarity is grounded in a major misunderstanding of Agile principles. Once the Story Point is embraced in this manner, it subversively reverts the Agile transformation back to waterfall. There’s very little that Agile coaching can accomplish until this is corrected.
To illustrate how rampant and outrageously over-the-top this anti-pattern has become, here are some examples of how we’ve seen story points applied on DoD programs in just the past 18 months:
I could go on, but you get the idea. The only thing more concerning than what’s on that list is the almost universal obliviousness to how utterly backwards it is.
It’s hard to overstate the magnitude of the problem we’re facing. According to Bloomberg Government, defense contract spending in 2019 was $404B. Nearly all of the contracts issued by the DoD today require Agile development practices. It appears that the industry intends to move ahead with what’s being referred to as “capacity-based” contracts, where the government purchases a fixed number of Story Points up front. While there are some positive elements to this approach (e.g., the ability to adapt the backlog according to evolving needs), the Story Point’s role in all of this will ensure that no true advancement is made.
At the end of the day, there needs to be the realization that buying story points is a joke at the taxpayers’ expense. Since national security is at stake, no one should be laughing.
Recognizing the unique need of government development programs to estimate and track progress to plan, Centil has filed a patent application for a new scoring methodology that we believe gives leaders a metric that correlates highly with the outcomes they really want to achieve. We’re excited about the prospect of our Value Performance Index (VPI) because it actually forces organizations to align with fundamental principles of Agility. Misalignment will be detected early, and there will be nowhere to hide the bodies. VPI is in development and will be the subject of its own blog post very soon. In the meantime, we are looking for government programs that currently use EVM to begin collecting results with actual program data.
We feel compelled to clarify that our experiential findings apply to defense industry contracts only. There is a movement underway within DoD to completely disrupt the traditional methods of developing software described in this blog. The Air Force, for example, under the leadership of its Chief Software Officer Nicolas Chaillan, is investing in a common delivery platform with multi-level security called Platform One. The initiative has brought into existence multiple “Software Factories” where hired developers work alongside government workers to develop and deploy code following commercial best practice. It appears that the intention is to move big industry contractors to the new platform at some point as well. This is exciting, but could be disastrous unless fundamental changes occur. Establishing a scoring system that illuminates major anti-patterns is a critical first step and one the government must take now.
Please follow us on LinkedIn to track how we’re planning to 10x our nation’s ROI in defense. We are hard at work building our new scoring system demo and will release it to the public very soon!