Compression is the Key

Taking a step back for a minute, and looking at our current time and space in the information systems and data industry (circa March 2024), I find a “silent thought” [1] hidden in plain sight - a signal amidst the noise.

Compression - of data, information, and knowledge - is key to what is happening in our world with artificial intelligence (AI), machine learning (ML), and our current infatuation with large language models (LLMs). Claude Shannon, credited with establishing information theory [2], has never seemed more validated.
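Shannon’s core idea - that predictable data carries less information and therefore compresses further - can be seen with nothing more than Python’s standard zlib module. A toy sketch of my own (the inputs are arbitrary examples):

```python
import os
import zlib

# Highly predictable data: 2,000 bytes of a repeating pattern.
predictable = b"ab" * 1000
# Unpredictable data: 2,000 random bytes.
unpredictable = os.urandom(2000)

# The repetitive input shrinks dramatically; the random input
# barely compresses at all (it may even grow slightly).
print(len(zlib.compress(predictable)))    # far fewer than 2,000 bytes
print(len(zlib.compress(unpredictable)))  # roughly 2,000 bytes
```

The gap between the two numbers is, in miniature, the signal-versus-noise distinction the rest of this piece is about.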

It’s not just me - have a look:

If you look at the history of mankind - information science, statistics, even mathematics, and, for that matter, data as a field of study - it is not surprising that compression is key. For years, we have tried to take what is big and make it small again for the purposes of reasoning, of decisioning, of making predictions, and so on.

If we look at the fields of decision science/support, or even the things we have done for years such as data warehousing, search engine technology, and mathematical or statistical modelling, it has often been about compressing vast amounts of data into models that can be used for reasoning. Whether those models are mathematical formulas, networks of equations (e.g. neural networks), ways of reasoning using statistics, or even dimensional data models used by humans to slice and dice data, or to write “boring old SQL” to ask questions of their data, the purpose has always been the same - to compress into less what is more.
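As a toy illustration of compressing vast amounts of data into a model one can reason with - my own sketch, standard library only - here ten thousand noisy observations of a linear relationship are reduced, via least squares, to just two parameters:

```python
import random

random.seed(42)

# 10,000 noisy observations of the (hidden) rule y = 3x + 7.
xs = [i / 100 for i in range(10_000)]
ys = [3.0 * x + 7.0 + random.gauss(0, 0.5) for x in xs]

# Ordinary least squares "compresses" the whole dataset
# into two numbers: a slope and an intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# The two recovered parameters land close to 3.0 and 7.0,
# and can now be used to predict unseen inputs.
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```

Ten thousand points in, two numbers out - and those two numbers are enough to answer questions about data we never stored. That trade, scaled up enormously, is the same one a trained model makes.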

The world we find ourselves in now is the world that this pursuit of compression has created - a pursuit we have been engaged in since the dawn of the information age, perhaps even since the dawn of science itself.

We now speak not just of mindless data compression - although this is part of the solution we use to make big data small again. Now we speak of training our models in such a way that they understand the semantic meaning of data. We are no longer just compressing data; we are compressing information - into models. And we are not stopping at information: if we can compress knowledge into models, so that our models become wise and efficient, we wish for that outcome too.

There lies a fundamental challenge in all of this. Our need to compress the complex into that which we can reason about simply stems from our human limitations. This is why we have invented technology - to expand beyond those limitations.

Where cars extended the range of our feet, our approaches to science, systems, AI and beyond attempt to extend the range of our brains - literally, to extend the range of our ability to think.

I think there is both danger and a certain amount of inevitability in these developments. Whether we like it or not, we are still natural creatures, and all of what we have built that we consider technology is still, in my opinion, natural. It is entirely possible that what we are seeing now is simply the continuation of evolution in weird and wonderful ways that nobody could have predicted.

And therein lies one last rub - “that nobody could have predicted”. We find ourselves in a world today where the power to predict, the power to think, the power to reason, has become embedded in our technology, in our systems, in our science. The complexity of the universe must be difficult to calculate; perhaps this itself is the challenge for which these technological advancements are required.

And this is why compression is so important: compression of data, information, and knowledge into models - models that can understand semantics, concepts, and relationships; models that are interactive, autonomous, and beyond.

We take what is big and we make it small again. How far can this go, and where does it end?

[1] By “silent thought” I mean a signal so obvious that, once you see it, you cannot unsee it - yet so obvious that you aren’t seeing it right now.
[2] https://en.wikipedia.org/wiki/Information_theory
