Measuring Engineer Productivity

Following on from the previous article about declining marginal productivity with scale: wouldn’t it be great if we could measure individual engineer productivity and find evidence that either disproves or supports that core thesis?

What do engineers produce?

In order to measure Software Engineer productivity, we must first answer the question: what is it that Software Engineers produce?

“Software” is perhaps the easy answer to this question, but then what is software? Is it simply the number of lines of code, or the cumulative number of commits, or the number of UI pages/widgets/buttons, or the number of Jira tickets resolved, or the number of backend endpoints, or the number of functions, or the total number of deployments? All of these answers should be deeply dissatisfying. Many of them may broadly correlate with engineering effort, much like the electricity usage of a factory may correlate with its actual output, but a factory’s output is not electricity, and a software team’s output is not function points.

Software ultimately represents a set of solutions to a set of problems; how, then, do you objectively quantify the size of a solution? Asking for (subjective) time estimates might be one way to solve the problem, but if the solution actually takes twice as long to build as the estimate, was the solution actually twice as “large” as first thought, or did the engineer just work half as hard?

The problem with existing metrics

Even if one accepts that lines of code or the number of commits correlates with solution size, that doesn’t mean that those metrics can be reasonably compared across individual engineers. If Engineer A produces 1000 SLOC in one week, but Engineer B produces 500 SLOC, it doesn’t mean Engineer A produced twice as much or worked twice as hard. Similarly, if Engineer A submits 20 commits, but Engineer B submits 40 commits, it doesn’t mean that Engineer B produced twice as much or worked twice as hard.

Various firms in the annals of history have made the mistake of actually trying to operationalise these flawed measures, and have fairly quickly run into Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” Using metrics to set targets creates incentives. That lines of code correlates with solution size is true only until you incentivise people by the number of lines of code they produce, at which point any correlation is swiftly destroyed. Setting targets based on flawed measures incentivises a change in focus from value-creating work (e.g. creating solutions) to non-value-creating work (e.g. committing every individual line to git). It’s important to note that you don’t actually have to explicitly set targets to create incentives; simply announcing that a flawed measure is being used, or is considered valuable, is enough to implicitly create a poor incentive structure. The key point here is that operationalising useless metrics turns them from useless to actively harmful.

Software engineer productivity metrics

This line of argument leads us to the somewhat unfortunate conclusion that there are no good, objective measures of software engineer productivity.

That there are no objectively good measures of software engineer productivity should not come as a surprise, for if there were, we would all be universally using them, much like we all agree that the number of cars (of a sufficient quality) that a car factory produces is a good measure of the output of that car factory.

What now?

That we cannot directly measure engineering output may cause some initial disappointment; however, that doesn’t mean that we can’t measure anything. The success of a company’s executive team is generally not measured by the number of widgets that company outputs, but by the changes in revenue, profit, free cash flow, return on capital and compound annual growth rate that the executives create. These underlying measures are not measures of output, but are nonetheless measures of company performance, and changes in those measures reflect outcomes that the executive team have achieved. In essence then, a company’s executive team is generally judged by the outcomes they achieve, not the output they produce.

Given that we cannot measure engineer output, we should fall back to the same toolset we use for executives, and instead measure the outcomes an engineering team achieves. This creates incentives centred around outcomes rather than output, and is one of the core tenets underlying the concept of ‘OKRs’[1]. OKRs involve defining an objective (the O), which is an outcome that an individual or team wishes to achieve, and a set of key results (KRs) that can be used to quantitatively measure progress towards that objective. OKRs are generally set quarterly with a <90 day time horizon, which means they can be used to measure progress at a relatively fine resolution (week to week). The fact that OKRs have taken the world by storm could be considered evidence that they offer more value than flawed output-centric metrics for quantifying progress in the knowledge economy.
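
As a minimal sketch of how key results allow quantitative, week-to-week progress tracking (the objective, key results and numbers below are hypothetical, not taken from any particular OKR framework), an OKR can be modelled as an outcome-oriented objective plus a handful of measurable key results whose progress is rolled up:

```python
from dataclasses import dataclass, field

@dataclass
class KeyResult:
    """A quantitative measure of progress towards an objective."""
    description: str
    target: float         # value we want to reach by the end of the quarter
    current: float = 0.0  # latest measured value

    def progress(self) -> float:
        """Fraction of the target achieved so far, capped at 100%."""
        return min(self.current / self.target, 1.0) if self.target else 0.0

@dataclass
class Objective:
    """An outcome the team wants to achieve this quarter."""
    description: str
    key_results: list[KeyResult] = field(default_factory=list)

    def progress(self) -> float:
        """Average progress across all key results."""
        if not self.key_results:
            return 0.0
        return sum(kr.progress() for kr in self.key_results) / len(self.key_results)

# Hypothetical example: the objective is an outcome, not an output.
okr = Objective(
    "Reduce checkout friction for new customers",
    [
        KeyResult("Checkout conversion rate (%)", target=35, current=28),
        KeyResult("Weekly active users of the new flow", target=10_000, current=6_000),
    ],
)
print(f"Quarterly progress: {okr.progress():.0%}")  # prints "Quarterly progress: 70%"
```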

Under-achieving OKRs

OKRs are all well and good for switching the focus from output to outcomes, but they don’t offer a great degree of managerial utility if a team/individual fails to achieve them. Did the team fail to achieve the OKRs because the OKRs were unachievable, or because the team was missing key skillsets, or because the team wasn’t putting in sufficient effort, or some combination of the above?

If we are to satisfy our fiduciary duty, we must avoid being too naive or idealistic, and still concern ourselves with the question of how we measure the competence and effort of individual team members. There are no easy answers here; however, the most useful heuristic is that of comparative performance. If engineer A and engineer B both estimate that a piece of work should take 1 day, and engineer A takes 5 days to do it whilst engineer B takes 1 day, there’s a signal of under-performance by engineer A that should be investigated further. Whilst it’s rare that two engineers ever work on identical pieces of work, evidence of comparative under-performance can still accumulate over a period of weeks to months, and this is the signal that provides the most managerial utility.
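
To make the comparative-performance heuristic concrete, here is a rough sketch (the engineers, estimates and actuals are entirely hypothetical) of how such a signal might be accumulated over weeks to months: track the ratio of actual to estimated time for each piece of work, and compare each engineer’s median ratio against the team’s:

```python
from collections import defaultdict
from statistics import median

# Hypothetical work log: (engineer, estimated_days, actual_days)
work_log = [
    ("engineer_a", 1, 5), ("engineer_a", 2, 7), ("engineer_a", 1, 3),
    ("engineer_b", 1, 1), ("engineer_b", 3, 4), ("engineer_b", 2, 2),
    ("engineer_c", 2, 3), ("engineer_c", 1, 2), ("engineer_c", 5, 6),
]

ratios = defaultdict(list)
for engineer, estimated, actual in work_log:
    ratios[engineer].append(actual / estimated)  # > 1 means slower than estimated

team_median = median(r for rs in ratios.values() for r in rs)

for engineer, rs in sorted(ratios.items()):
    personal_median = median(rs)
    # One slow task means little; a sustained gap relative to the team is the
    # signal worth investigating (a prompt for a conversation, not proof).
    flag = "investigate" if personal_median > 2 * team_median else "ok"
    print(f"{engineer}: median actual/estimate = {personal_median:.1f}x ({flag})")
```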

Other metrics that have been recommended to me include the following (a rough sketch of how they might be computed follows the list):

  1. Adoption rate of features those engineers work on (heuristic: competent engineers have high impact)

  2. Estimation accuracy (heuristic: competent engineers are good at scoping, planning and avoiding distractions)
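
The sketch below shows one possible way (the function names and numbers are hypothetical) of turning these two heuristics into simple numbers:

```python
def adoption_rate(feature_users: int, eligible_users: int) -> float:
    """Heuristic 1: share of eligible users who actually used the feature."""
    return feature_users / eligible_users if eligible_users else 0.0

def estimation_accuracy(estimated_days: float, actual_days: float) -> float:
    """Heuristic 2: 1.0 means perfectly estimated; lower means the estimate
    was further off in either direction."""
    return min(estimated_days, actual_days) / max(estimated_days, actual_days)

# Hypothetical readings for a single feature and its delivery.
print(f"Adoption rate: {adoption_rate(1_200, 10_000):.0%}")     # 12%
print(f"Estimation accuracy: {estimation_accuracy(3, 5):.0%}")  # 60%
```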

Even though we cannot directly measure output, these tools (and others like them) are good substitutes that can guide us in making informed decisions about how to maximise the outcomes that we achieve for our stakeholders.

Summary

  • Software Engineers produce a set of solutions to a set of problems.

  • The size of these solutions cannot be objectively quantified.

  • ⇒ There are no good, objective measures of software engineer productivity.

  • Setting targets based on useless measures incentivises wasting resources and makes the measures actively harmful.

  • We should measure engineering outcomes rather than engineering output.

  • If we fail to achieve our targeted outcomes, there are other tools we can use to investigate further.

  • Comparative performance is the most useful signal for spotting problems.

References

  1. Doerr, J., & Page, L. (2018). Measure What Matters. Penguin Publishing Group.

  2. Seiden, J. (2019). Outcomes Over Output: Why Customer Behavior Is the Key Metric for Business Success. Independently Published.