July 7, 2015

Connected Data

Today, data is disconnected. Data is fundamentally diverse, disparate and distributed. This has always been a challenge and it continues to be “the” challenge. Disconnected data is front and center, as enormous amounts of human effort and resources are expended in pursuit of creating human value from data. And it will continue to be “the” challenge for the future, even with advances in AI (such as deep learning) and the like.

The goal is simple in concept and difficult in practice. The goal - that which produces value for humanity - is connected data. Beyond just connected data, we also need organized and accessible data, and the means/tools to work with it. But fundamentally, it is about the connections you make with data - it is about putting the puzzle together for understanding.

I recently watched a presentation by the 14th Dalai Lama, Tenzin Gyatso, in which he made the following statement which I found particularly meaningful:

“First, comes experience and intelligence.

Second, comes wisdom.

Third, from wisdom comes vision.”

From our experience (human activity) we generate data. We use our intelligence to interpret that data (generating information and knowledge), and ultimately hope to gain wisdom. The greatest power to change comes from the vision generated by true wisdom gained through experience and the application of our intelligence. I’m no 14th Dalai Lama, but that sounds really good to me. :)

Data is not an end unto itself - data is there for us to analyze as humans, with the aid of our technology, to generate information… create and refine knowledge, and ultimately help us build our wisdom. Someday, I may write in more detail about the question… “whatever happened to the data, information, knowledge and wisdom hierarchy?” But that will be for another day.

Everyone, and I do mean everyone, in the “data” industry, from data platform vendors to data scientists to data consumers, wants to make connections with data. For it seems that when we connect our data (whether it be across conceptual, system and/or organizational boundaries), we unlock new possibilities, new opportunities, and new innovations.

Much has been said about “big data” - and I do not have much more to say about it here. While there is clearly value in huge volumes of data, it has proven just as likely that we might make major discoveries in “small data”. Statistics, by the field’s very nature, takes “big data” and turns it into “small data” via sampling. It is often with “small data” that one can really reason, even if that small data is taken from a larger data set.
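To make the sampling point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the “big data” is just 100,000 made-up numbers, and `sample_summary` is a hypothetical helper, not part of any particular tool. It compares the average of the whole data set with the average of a small random sample drawn from it.

```python
import random
import statistics


def sample_summary(population, sample_size, seed=42):
    """Return (population_mean, sample_mean) for a simple random sample.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Draw sample_size distinct items without replacement.
    sample = rng.sample(population, sample_size)
    return statistics.fmean(population), statistics.fmean(sample)


# Made-up "big data": 100,000 measurements spread evenly between 0 and 100.
rng = random.Random(0)
big_data = [rng.uniform(0, 100) for _ in range(100_000)]

population_mean, sample_mean = sample_summary(big_data, sample_size=1_000)
print(f"mean of all 100,000 values:   {population_mean:.2f}")
print(f"mean of 1,000 sampled values: {sample_mean:.2f}")
```

With a sample one hundredth the size of the original, the two averages should land close to each other, which is the sense in which the “small data” still lets you reason about the larger set.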

Powerful technologies will help us reason on scales we never contemplated before. But even then, it is in service of human endeavor, and last time I checked, the machines don’t just collect data for their own desires.

It is not that our data is big (although that is a challenge). It is not really about the characteristics of the data itself at all. I was an early pioneer of defining data in terms of volume, type, distance and time (VTDT) - now more commonly referred to as “Volume, Variety, Velocity, Veracity and Value” (the 5 V’s of data, or 4 if you leave off value). This focus on the characteristics of the data itself was mechanically necessary, but in some ways it detracts from the true innovation and value we can provide.

That value comes from connected data - data that you can bring together from any source, of any type, from any system, organize and make accessible. Connected, organized, accessible data.

Personally, I am not a fan of the “any” part of that comment above. Many vendors or systems claim to offer connectivity for “any” data source. However, as time progresses, the need to make connections between “any” data has become a reality, because of the disparate, diverse and distributed nature of our data, living in so many different, disconnected systems, files, formats (and so on). So I guess that in the end, even if it is not connecting “any” data, it will be about connecting “very many” kinds of data.

Data really is everywhere and comes in every shape and size - embedded in photographs, sound, video, formatted in millions of formats ranging from proprietary to open source/standard. Data is unstructured, semi-structured, structured and even rigidly structured.

I remember the first time I really stopped to wonder: what kind of data is embedded in a photo that might be relevant in the context of other data? The answer today is clear - a lot. Now with machine vision, we can produce much useful data about the photo. We can produce data about the colors, shapes, and so much more. The data is there, embedded in the photograph, and it is data just like any other data, which means it can be interpreted and analyzed, and it is part of the puzzle.

The characteristics of the data matter. But what matters more is the data itself, and what you do with it. This is the essence of the value we can extract from this giant, hyper-focused, human-scale exercise of data-wrangling we seem to find ourselves in these days. What you do with the data, how you connect it, what data you connect, what is discovered in the process of connecting it is what matters.

In Western culture, there is a saying: “it is not the clothes that make the man, but the man that makes the clothes.” Embedded in this quote is an interesting feedback loop. Stepping back from the philosophy a little, in real life, clothes do matter in a practical sense (both making and wearing), but clothes wouldn’t matter without humanity. There is an interplay between humanity and the clothes: the clothes are a tool of humanity that serves a need and provides a benefit. I feel that we can and do experience a similar feedback loop with data and technology (recurrent neural networks, hey-hey!).

It is not the data that makes humanity, but humanity that makes the data, and it is humanity that can connect the data…

…and connected data can provide true benefit to all humankind.
