Lately, I’ve been spending a lot of time exploring the differences between data (as in “Big Data”) and information.
There’s a very interesting conceptual model, attributed to American organizational theorist Russell Ackoff, outlining the relationship between data, information, knowledge, understanding, and wisdom (D-I-K-U-W for brevity’s sake). For a nice introduction to this model, you can read the article “Data, Information, Knowledge, and Wisdom,” by Gene Bellinger, Durval Castro, and Anthony Mills.
As the article explains, “data” is a set of symbols that represents something but carries no meaning without an understanding of what that something is, while “information” provides answers to the critical questions of “who,” “what,” “when” and “where” that make data useful. In other words, an individual piece of data, out of context, is not very useful. It is the addition of knowledge and context to data that creates usable and valuable answers to the questions a business needs answered.
A practical example of this idea is the number 40, which is a piece of data. If I tell you “40,” you really don’t know anything. 40 what? 40 days and 40 nights? 40 dogs? 40 dollars? Now, if I say 40°, you know a bit more about what I’m trying to communicate. And if I tell you that it was 40° today in Minneapolis, after weeks of days where the high temperature was less than 0, now you have what famous radio commentator Paul Harvey used to refer to as “the rest of the story.” And it’s certainly a very different story than if I said the 40° high was here in Southern California on the 4th of July! The context – in this case, the historical temperatures and the general climate – is what makes the data interesting and useful.
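To make that idea concrete, here’s a minimal sketch in Python of how the bare datum 40 only becomes information once context is attached. The field names and the historical baseline are hypothetical, purely for illustration:

```python
# A raw datum, meaningless on its own.
value = 40

# The same datum with context attached (field names and the
# historical figure below are made up for illustration).
observation = {
    "value": 40,
    "unit": "degrees Fahrenheit",
    "measure": "daily high temperature",
    "where": "Minneapolis, MN",
    "typical_high_for_date": 26,   # hypothetical historical baseline
}

# With context, the number can answer a question the bare "40" cannot.
anomaly = observation["value"] - observation["typical_high_for_date"]
print(f"{anomaly}°F warmer than a typical high for that date")
```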
This distinction between data and information seems to have been somewhat lost in the noise around Big Data. The D-I-K-U-W conceptual model would say that, to a customer, Big Data on its own isn’t valuable. In fact, it’s just an expensive problem, as growing data has to be stored, transferred, managed, etc., with all the costs this entails. What businesses really care about is Big Information, which hopefully provides that nugget of wisdom that makes the next million dollars in revenue, or saves the next million in expense. Net-net, it’s putting data into the context of the business – in a way that allows people and/or machines to find the key insights – that matters.
Of course, transforming data into information so that you can find that gem of wisdom is a non-trivial problem. It seems to me that you can accomplish this in one of three ways:
- You can understand what matters in advance, and organize the data in such a way that it can be easily put into context. Historically, data warehouses attempted (with varying degrees of success) to accomplish this.
- You can index the data in a way that makes it easy to find. This means following the practice of media and entertainment companies, which tag all their content with business metadata, and sometimes system-derived metadata as well. Metadata tagging is a very interesting topic that I’ll cover in my next post.
- Finally, you can form a theory (or an algorithm), and then apply brute computing force to the data to find out whether or not you’re right (see the sketch after this list). This is the goal of many recent approaches, whether it’s randomized searching (à la Google) or many of the modern (often Hadoop-resident) analytical tools. This is a somewhat science-based approach to data mining, which, come to think of it, may be why we now call the individuals who do this data scientists.
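As a rough illustration of that third approach, here’s a minimal Python sketch of forming a hypothesis as a predicate and brute-forcing it across a pile of raw records, the same map-and-filter pattern that Hadoop-style tools apply at much larger scale. The records, field names, and hypothesis are all hypothetical:

```python
# Hypothetical raw records; in practice this would be millions of rows
# spread across a cluster rather than a small in-memory list.
records = [
    {"store": "MSP-01", "high_temp_f": 40, "sales": 18200},
    {"store": "MSP-01", "high_temp_f": -2, "sales": 9400},
    {"store": "LAX-03", "high_temp_f": 71, "sales": 12100},
]

# The "theory": unseasonably warm days boost sales at cold-climate stores.
def matches_hypothesis(record):
    return record["store"].startswith("MSP") and record["high_temp_f"] >= 35

# Brute force: scan every record and see whether the data bears the theory out.
hits = [r for r in records if matches_hypothesis(r)]
average_sales = sum(r["sales"] for r in hits) / len(hits) if hits else 0.0
print(f"{len(hits)} matching day(s), average sales {average_sales:,.0f}")
```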
While these approaches are different, note that they all start from the assumption that the data being collected is instantly readable and re-usable. Trying to add context to a piece of data you cannot read is impossible. And yet, so much of the data that commercial enterprises have been and are storing is not immediately readable – it’s in a proprietary backup format that must be unpacked and restored before it can be used. Yes, I know that this data has historically been collected for the purpose of risk mitigation (to replace lost or broken data, or sometimes even as a poor man’s archive for compliance purposes), but if the business is going to the cost and effort of storing this secondary data, why not do it in a way that makes the data usable?
The only other way to keep this data in an instantly usable state is to maintain it in file format (on disk, presumably) while also keeping a separate backup copy. But that means more data to store, move, protect and manage, which means much more cost.
That’s why so many customers I’ve spoken with lately are focused on finding a way to maintain all their data in native format, and it may be part of what’s driving the growth of some of the new virtual machine-level backup tools (including Quantum’s own vmPRO), which store copies of data in native file format rather than enclosed in a proprietary backup wrapper.
In any case, it seems like any forward-thinking CIO would increasingly insist on going native.
Want to Know More?
Check out Quantum’s vmPRO webpage for a free download: Quantum.com/vmPRO