In keeping with my original intent for this column, namely, to provide some guidance in this lexically challenged industry of ours, I set out this month to analyze a word so prevalent that it is almost invisible, so common that you may argue it needs no analysis at all. The term in question is data. As atoms make up molecules that eventually make up this glorious world around us, simple ones and zeroes constitute data, which, in turn, constitutes our intellectual universe.
During my research I was reminded that much of what we're doing today, from databases to enterprise information systems in general, is a direct result of scientific theories and practices developed almost a half-century ago. The '90s thinkers (and adept marketing departments) taking credit for these innovations stand on the shoulders of brilliant but relatively anonymous forebears. If you take into consideration the basic concepts underlying the study of data, we walk in footsteps that date back to the 17th century.
Today data is everywhere. There are databases, data warehouses, data marts, operational data stores, as well as data mining, data management, data extraction, data migration, data quality, not to mention the reams and reams of data now available via the Internet, intranets, and extranets. But the term data and its singular form, datum, made their debut in 1646. (I must note here that data is actually a plural term. Occasionally we adapt language to suit modern tastes, and in this case instead of saying "data are," most people now will say "data is." Even fewer people use the term "datum" when referring to data in the singular.) Although lacking a dictionary from that time period, I located an online version of the 1913 edition of Webster's Revised Unabridged Dictionary (see humanities.uchicago.edu/forms_unrest/webster.form.html), which defines datum as, "Something given or admitted; a fact or principle granted; that upon which an inference or an argument is based." Even at the beginning of this century we based our decisions on data. The data simply wasn't in digital form yet.
Of course, contemporary dictionaries give a nod to modern computational technology in their definitions of the term. For example, the online version of Webster's Unabridged Dictionary (www.m-w.com/dictionary.htm; I use this rather than the printed edition because it is the most up-to-date) defines data as, "Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful."
Here is a critical point: data must be processed to be relevant. By processing I mean transformed in a way that makes it useful. Dirty data and redundancies are eliminated. The data is structured in a way that makes it possible to retrieve select information. Raw data, or data that hasn't been processed, is a lot like freshly shorn wool from a sheep. It has to go through the steps of extraction, processing, cleaning, and refinement before we can use it to dress our decisions.
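To make the point concrete, here is a minimal sketch in Python of what that processing might look like. The sample records, the field names, and the function clean_records are all hypothetical, invented purely for illustration: the sketch normalizes spacing and capitalization, drops incomplete (dirty) rows, removes redundant ones, and then structures what remains so that select information can be retrieved.

```python
def clean_records(raw_rows):
    """Normalize, drop incomplete rows, and remove duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for name, city in raw_rows:
        # Collapse stray whitespace and standardize capitalization.
        name = " ".join(name.split()).title()
        city = " ".join(city.split()).title()
        # Dirty data: skip rows with a missing field.
        if not name or not city:
            continue
        # Redundancy: skip rows already seen after normalization.
        key = (name, city)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


# Hypothetical raw input, as it might arrive from a collection step.
raw = [
    ("  ada lovelace", "london"),
    ("Ada Lovelace", "London "),   # duplicate once normalized
    ("grace hopper", ""),          # incomplete row
    ("Claude  Shannon", "petoskey"),
]

records = clean_records(raw)

# Structure the cleaned data so select information can be retrieved by city.
by_city = {}
for record in records:
    by_city.setdefault(record["city"], []).append(record["name"])

print(records)
print(by_city["London"])
```

Real cleansing rules are rarely this simple, of course; the sketch only shows the shape of the work, from raw wool to something wearable.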
Data doesn't just appear out of thin air. We go out and get it. We collect it. And presumably when we collect it we have an idea of what we want to do with it. We structure it in tables and complex objects. We determine ahead of time (I hope) what information we want to be able to derive from the data.
So what gives data meaning? When we structure data and determine how we want to store it and in which ways we want to be able to query and retrieve it, we have information. From the analysis of this information we can make a rational interpretation that results, in turn, in knowledge.
The term information, like its cousin data, has been around since the mid-1600s. Regardless of how it fared in the 1700s and 1800s, the term information went through a renaissance during the early days of computational technology in the 1950s. For the first time ever we had machines that let people store, sort, manipulate, and gain knowledge from data, a far cry from the abacus. The technology, albeit rudimentary by current standards, resulted in an information explosion. People were able to manipulate unprecedented volumes of data in ways previously impossible.
Around this same time, several theories and practices took shape as people attempted to grapple with this information explosion. They include information theory (often called Shannon Theory after Claude Shannon's 1948 paper, "A Mathematical Theory of Communication"), which established the binary digit, or bit, as the basic unit for measuring and encoding information, and information retrieval, which encompassed the entire process of storing recorded data and distributing it via computers. A decade later came information science, which encompassed the entire process of collecting, classifying, storing, retrieving, and distributing recorded knowledge. Interestingly enough, the early adopters of information science came not from the computer centers or large corporations but from the biggest data centers of all: libraries.
During my wanderings this month, I was struck by how much of today's database technology is a direct descendant of these older working concepts. We appear to be progressing in slow but steady stages of refinement. We have moved from simple storage and processing of huge amounts of data (namely early mainframe systems) to more complex data storage and manipulation with the intent of obtaining valuable knowledge from this data (namely data warehousing, data mining, and decision support). After knowledge comes wisdom. I can see it now: will wisdom warehousing be the next big wave?