Data is King

Let me begin with some choice quotes:

  • “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” Linus Torvalds (2006)

  • “Fold knowledge into data, so program logic can be stupid and robust.” Eric Raymond (2003)

  • “Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.” Rob Pike (1989)

  • “Strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of your program lies.” Fred Brooks (1975)

  • Most succinctly: “Show me your tables.” Fred Brooks (1975)

If data is king, how much does code matter?

Before answering, let’s pin down what we mean by ‘complexity’: for our purposes, a system is complex when it has many elements with many interconnections, making it hard to reason about as a whole.

Complexity is the enemy of progress in engineering - there is no greater impediment to product velocity and value delivery than mountains of complexity.

Where is your complexity?

Stop reading this for a brief moment and consider all of the complexity of a system you know well.

How much of that complexity lies in the data-model, and how much of that complexity lies in the codebase?

The number of elements in a codebase - variables, functions, modules - and their associated interconnections tends to dramatically outweigh the number of elements and interconnections in most data-models.
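
To put rough numbers on this: with n elements there are up to n(n-1)/2 possible pairwise interconnections, so a hypothetical schema of 20 tables has at most 190 potential relationships, while a modest codebase of 2,000 functions has nearly 2,000,000 potential call relationships - four orders of magnitude more.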

An engineer can build a relatively detailed mental model of the entire data-model of a system in a handful of days, but it can take several months for that same engineer to build a detailed mental model of the entire codebase. It naturally follows from this that the complexity of existing code can substantially reduce engineer effectiveness. For this reason, you often hear engineers complain about “code quality”, but you rarely hear them complain about “schema quality”.

How do we reduce complexity?

This is why, although data is king, we must still concern ourselves with managing and minimising the complexity of our code. We must have strategies for partitioning, organising and hiding this complexity if we are to keep making fast progress when developing systems of any meaningful scale.

The most impactful strategy for managing complexity is “information hiding”, first formally named by David Parnas in his 1972 paper "On the Criteria To Be Used in Decomposing Systems into Modules"[1]. The essence of Parnas’s thesis is that we should seek to “hide” information (or complexity) within the boundaries of a “module”, and think carefully about the information we choose to export or expose from that module - the interface, meaning both its explicit syntax and its implicit semantics. Modules in this sense can be individual functions as well as higher-order groupings of many functions (such as libraries or classes). Using modules like this “abstracts away” some of the underlying complexity, meaning engineers can avoid having to understand it until it becomes relevant to the task at hand.
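
As a minimal sketch of the idea (in Go, with hypothetical names - the principle itself is language-agnostic), consider a small counter module. Callers depend only on the exported New, Increment and Value; the representation and locking strategy stay hidden and free to change:

  package counter

  import "sync"

  // Counter hides its representation (a mutex-guarded int) behind a
  // small exported interface. Because the fields below are unexported,
  // they can change - say, to an atomic integer - without breaking any
  // caller.
  type Counter struct {
      mu sync.Mutex // hidden detail: the thread-safety strategy
      n  int        // hidden detail: the storage representation
  }

  // New returns a ready-to-use Counter.
  func New() *Counter { return &Counter{} }

  // Increment adds one to the count.
  func (c *Counter) Increment() {
      c.mu.Lock()
      defer c.mu.Unlock()
      c.n++
  }

  // Value reports the current count.
  func (c *Counter) Value() int {
      c.mu.Lock()
      defer c.mu.Unlock()
      return c.n
  }

The interface is deliberately narrow: an engineer using the counter never needs to understand the mutex, and the module’s author can replace it without a single caller changing.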

As an aside, organisations, like software systems, are themselves complex information-processing systems. For this reason, the principles underlying good organisational design resemble those underlying good software design. An organisation scales poorly if every person has to know what every other person is doing: this works up to perhaps a dozen people, and then synchronisation/coordination overhead becomes the dominating factor. The canonical solution in organisations is separation into teams with defined responsibilities (functional units/modules) and defined touch-points with other teams (interfaces).

Coming back to code, there are other less impactful but still sometimes meaningful strategies for reducing code complexity, such as coding standards that ensure consistent formatting and reduce the set of language constructs used in the codebase. There are also tools, such as automated regression tests, that don’t themselves reduce complexity but do reduce its impact: when an engineer fails to understand some relevant complexity, the tests point out the mistake before it reaches the software’s users.
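
To illustrate (again in Go, building on the hypothetical counter module sketched above), a minimal regression test pins down the module’s observable behaviour without making its internals any simpler:

  package counter

  import "testing"

  // TestIncrement encodes the expected semantics of the interface. It
  // does not reduce the complexity inside the package, but if an
  // engineer misunderstands that complexity - say, by accidentally
  // making Increment step by two - this test reports the mistake
  // before it ever reaches users.
  func TestIncrement(t *testing.T) {
      c := New()
      c.Increment()
      c.Increment()
      if got := c.Value(); got != 2 {
          t.Errorf("Value() = %d, want 2", got)
      }
  }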

Rounding off: the cost-of-failure when modelling data is much higher than the cost-of-failure when writing code, and so data is indeed king. However, codebases inevitably become more complex than data-models, and if feature velocity is a priority, this complexity cannot be ignored; it must be managed.

Summary

  • Data is king.

  • Focus on ensuring your data-model is correct before thinking too much about your code.

  • Codebases are much more complex than data-models.

  • Managing code complexity is important to ensuring fast feature delivery.

  • Modularisation is the most impactful tool for managing code complexity.

References

  1. Parnas, D. L. (December 1972). "On the Criteria To Be Used in Decomposing Systems into Modules". Communications of the ACM. 15 (12): 1053–58. doi:10.1145/361598.361623.