Thursday, June 14, 2012

Rethinking data

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." — Sherlock Holmes in The Adventure of the Copper Beeches.

Data may be the preeminent obsession of our age[1]. We marvel at the ever-growing quantity of data on the Internet, and fortunes are made when Google sells shares for the first time on the stock market. We worry about how corporations and governments collect, protect, and share our personal information. A beloved character on a television science fiction show is named Data. We spend billions of dollars to convert the entire human genome into digital data, and having completed that, barely pause for breath before launching similar and even larger bioinformatic endeavours. All this attention being paid to data reflects a real societal transformation as ubiquitous computing and the Internet refashion our economy and, in some respects, our lives. However, as with other important transformations—think of Darwin's theory of natural selection, and the revolutionary advances in genetics and neuroscience—misinterpretation, misapplication, hype, and fads can develop. In this post, I'd like to examine the current excitement about data and where we may be going astray.

Big Data

Writing in the New York Times, Steve Lohr points out that larger and larger quantities of data are being collected—a phenomenon that has been called "Big Data":
In field after field, computing and the Web are creating new realms of data to explore — sensor signals, surveillance tapes, social network chatter, public records and more. And the digital data surge only promises to accelerate, rising fivefold by 2012, according to a projection by IDC, a research firm. 
Widespread excitement is being generated by the prospect of corporations, governments, and scientists mining these immense data sets for insights. In 2008, a special issue of the journal Nature was devoted to Big Data. Microsoft Research's 2009 book, The Fourth Paradigm: Data-Intensive Scientific Discovery, includes these reflections by Craig Mundie (p.223):
Computing technology, with its pervasive connectivity via the Internet, already underpins almost all scientific study. We are amassing previously unimaginable amounts of data in digital form—data that will help bring about a profound transformation of scientific research and insight. 
The enthusiasm in the lay press is even less restrained. Last November, Popular Science had a special issue all about data. It has a slightly breathless feel—one of the articles is titled "The Glory of Big Data"—which is perhaps not so surprising in a magazine whose slogan is "The Future Now".

Data Science

Along with the growth in data, there has been a tremendous growth in analytical and computational tools for drawing inferences from large data sets. Most prominently, techniques from computer science, in particular data mining and machine learning, have frequently been applied to big data. These approaches can often be applied automatically, which is to say, without the need to make much in the way of assumptions, and without explicitly specifying models. What is more, they tend to be scalable: it is feasible (in terms of computing resources and time) to apply them to enormous data sets. Such approaches are sometimes seen as black boxes in that the link between the inputs and the outputs is not entirely clear. To some extent these characteristics stand in contrast with statistical techniques, which have been less optimized for use with very large data sets and which make more explicit assumptions based on the nature of the data and the way they were collected. Fitted statistical models are interpretable, if sometimes rather technical.

In an article on big data, Sameer Chopra suggests that organizations should "embrace traditional statistical modeling and machine learning approaches". Some have argued that a new discipline is forming, dubbed data science[2], which combines these and other techniques for working with data. In 2010, Mike Loukides at O'Reilly Media wrote a good summary of data science, except for this odd claim:
Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis.
Leaving aside the confusion between statistics and actuarial science (not to mention stereotyped notions of typical attire), what is curious is the suggestion that "traditional statistics" has little role to play in the effective use of data. Chopra is more diplomatic: machine learning "lends itself better to the road ahead". Now, in many cases, a fast and automatic method may indeed be just what's needed. Consider the recommendations we have come to expect from online stores. They may not be perfect, but they can be quite convenient. Unfortunately, the successes of computing-intensive approaches for some applications have encouraged some grandiose visions. In an emphatic piece titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Chris Anderson, the editor in chief of Wired magazine, writes:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity.
We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Anderson proposes that instead of taking a scientific approach, we can just "throw the numbers" into a machine and through computational alchemy transform data into knowledge. (Similar thinking shows up in commonplace references to "crunching" numbers, a metaphor I have previously criticized.) The suggestion that we should "forget" the theory developed by experts in the relevant field seems particularly unwise. Theory and expert opinion are always imperfect, but that doesn't mean they should be casually discarded.

Anderson's faith in big data and blind computing power can be challenged on several grounds. Take selection bias, which can play havoc with predictions. As an example, consider the political poll conducted by The Literary Digest magazine, just before the 1936 presidential election. The magazine sent out 10 million postcard questionnaires to a mailing list drawn from its own subscribers, telephone directories, and automobile registrations, and received about 2.3 million back. In 1936, that was big data. The results clearly pointed to a victory by the Republican challenger, Alf Landon. In fact, Franklin Delano Roosevelt won by a landslide. The likely explanation for this colossal failure: for one thing, the people on that mailing list were not representative of the voting population of the United States; for another, the 23% who responded to the questionnaire were likely quite different from those who did not. This double dose of selection bias resulted in a very unreliable prediction. Today, national opinion polls typically survey between 500 and 3000 people, but those people are selected randomly and great efforts are expended to avoid bias. The moral of this story is that, contrary to the hype, bigger data is not necessarily better data. Carefully designed data collection can trump sheer volume of data. Of course it all depends on the situation.
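
The contrast between a huge self-selected poll and a small random one can be sketched with a toy simulation. Everything in it is an assumption made for illustration: the function name simulate_polls, the two voter groups, and all of the rates are invented, and none of it is a reconstruction of the actual 1936 poll. By construction, about 61% of this hypothetical electorate supports the incumbent, but only about 41% of the people the big poll manages to reach do.

```python
import random


def simulate_polls(population_size=200_000, small_n=2_000, seed=0):
    """Compare a large self-selected poll with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical electorate: 30% "affluent" voters who mostly oppose the
    # incumbent, 70% other voters who mostly support him.
    population = []
    for _ in range(population_size):
        affluent = rng.random() < 0.30
        supports = rng.random() < (0.35 if affluent else 0.72)
        population.append((affluent, supports))
    true_share = sum(s for _, s in population) / population_size

    # Large biased poll: affluent voters are far more likely to be on the
    # mailing list and to reply (60% versus 5%). These rates are assumptions.
    big_poll = [s for a, s in population
                if rng.random() < (0.60 if a else 0.05)]

    # Small poll: a simple random sample from the whole electorate.
    small_poll = [s for _, s in rng.sample(population, small_n)]

    return {
        "true_share": true_share,
        "big_biased_n": len(big_poll),
        "big_biased_share": sum(big_poll) / len(big_poll),
        "small_random_n": small_n,
        "small_random_share": sum(small_poll) / small_n,
    }


if __name__ == "__main__":
    for name, value in simulate_polls().items():
        print(name, round(value, 3))
```

In this sketch the big poll has many times more respondents than the random sample, yet it calls the election for the wrong side, while the small sample lands close to the true share. Collecting more responses through the same biased channel only makes the wrong answer more precise.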

Selection biases can also be induced during data analysis when cases with missing data are excluded, since the pattern of missingness often carries information. More generally, bias can creep into results in any number of ways, and extensive lists of biases have been compiled. One important source of bias is the well-known principle of Garbage In Garbage Out. Anderson refers to measurements taken with "unprecedented fidelity". It is true that in some areas, impressive technical improvements in certain measurements have been made, but data quality problems extend well beyond measurement precision and are common. Data quality issues can never be ignored, and can sometimes completely derail an analysis.
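
The point about excluding cases with missing data can be made concrete with a small sketch. The function name complete_case_bias, the distribution of the values, and the missingness rates are all assumptions chosen for illustration. The only feature that matters is that larger values are more likely to go unrecorded, so the pattern of missingness carries information about the values themselves.

```python
import random


def complete_case_bias(n, seed=0):
    """Return (true mean, mean after dropping cases with missing values)."""
    rng = random.Random(seed)
    values = [rng.gauss(50, 10) for _ in range(n)]

    # Assumed missingness pattern: values above 55 go unrecorded 60% of the
    # time, other values only 10% of the time.
    observed = [v for v in values
                if rng.random() >= (0.60 if v > 55 else 0.10)]

    return sum(values) / n, sum(observed) / len(observed)


if __name__ == "__main__":
    for n in (5_000, 200_000):
        true_mean, complete_case_mean = complete_case_bias(n)
        print(n, round(true_mean, 2), round(complete_case_mean, 2))
```

Under these assumptions the mean of the retained cases sits noticeably below the true mean, and the gap is about the same for the small and the large data set. Dropping incomplete records quietly changes which population is being described.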

Another limitation of Anderson's vision concerns the goals of data analysis. When the goal is prediction, it may be quite sufficient to algorithmically sift through correlations between variables. Notwithstanding the previously noted hazards of prediction, such an approach can be very effective. But data analysis is not always about prediction. Sometimes we wish to draw conclusions about the causes of phenomena. Such causal inference is best achieved through experimentation, but here a problem arises: big data is mostly observational. Anderson tries to sidestep this by claiming that with enough data "Correlation is enough":
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
But on the contrary, investigations of cause and effect (mechanistic explanations) are central to both natural and social science. And in applied fields such as government policy, it is often of fundamental importance to understand the likely effect of interventions. Correlations alone don't answer such questions. Suppose, for example, there is a correlation between A and B. Does A affect B? Does B affect A? Does some third factor C affect both A and B? This last situation is known as confounding (for a good introduction, see this article [pdf]). A classic example concerns a positive correlation between the number of drownings each month and ice cream sales. Of course this is not a causal relationship. The confounding factor here is the season: during the warmer periods of the year when people consume more ice cream, there are far more water activities and hence drownings. When a confounding factor is not taken into account, estimates of the effect of one factor on another may be biased. Worse, this bias does not go away as the quantity of data increases—big data can't help us here. Finally, confounding cannot be handled automatically; expert input is indispensable in any kind of causal analysis. We can't do without theory.
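
The ice cream example can be sketched as a simulation. All names and numbers here are assumptions made for illustration: ice_cream_and_drownings is a hypothetical function, the quantities are in arbitrary units, and season is the only thing that drives either variable. Ice cream has no effect on drownings anywhere in the code, yet by construction the naive regression slope is about 0.6.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def ice_cream_and_drownings(n_months, seed=0):
    """Return (naive slope, season-adjusted slope) for simulated months."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_months):
        warm = rng.random() < 0.5
        # Both quantities depend on the season only, plus independent noise.
        ice_cream = (10 if warm else 4) + rng.gauss(0, 1)
        drownings = (6 if warm else 2) + rng.gauss(0, 1)
        records.append((warm, ice_cream, drownings))

    # Naive analysis: ignore the season entirely.
    naive = ols_slope([r[1] for r in records], [r[2] for r in records])

    # Adjusted analysis: estimate the slope within each season, then average.
    within = []
    for season in (True, False):
        rows = [r for r in records if r[0] == season]
        within.append(ols_slope([r[1] for r in rows], [r[2] for r in rows]))
    adjusted = sum(within) / len(within)

    return naive, adjusted


if __name__ == "__main__":
    for n_months in (1_000, 100_000):
        naive, adjusted = ice_cream_and_drownings(n_months)
        print(n_months, round(naive, 3), round(adjusted, 3))
```

The naive slope stays near 0.6 whether there are a thousand simulated months or a hundred thousand, while the within-season slope stays near zero. The adjustment works here only because the confounder is known and recorded; nothing in the data announces that season is the variable to stratify on, and that knowledge has to come from outside the data.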

Big data affords many new possibilities. But just being big does not eliminate the problems that have plagued the analysis of much smaller data sets. Appropriate use of data still requires careful thought—about both the content area of interest and the best tool for the job.

Thinking about Data

It is also useful to think more broadly about the concept of data. Let's start with an examination of the word data itself, to see what baggage it carries.

We are inconsistent in how we talk about data. The words data and information are often used synonymously (think of "data processing" and "information processing"). Notions of an information hierarchy have been around for a long time. One model goes by the acronym DIKW, representing an ordered progression from Data to Information to Knowledge and eventually Wisdom. Ultimately, these are epistemological questions, and easy answers are illusory.

Nevertheless, if what we mean by data is the kind of thing stored on a memory stick, then data can be meaningless noise, the draft of a novel, a pop song, the genome of a virus, a blurry photo taken by a cellphone, or a business's sales records. Each of these types of information and an endless variety of others can be stored in digital memory: on one level all data are equivalent. Indeed the mathematical field of information theory sets aside the meaning or content of data, and focuses entirely on questions about encoding and communicating information. In the same spirit, Chris Anderson argues that we need "to view data mathematically first and establish a context for it later."

But when we consider the use of data, it makes no sense to think of all data as equivalent. The complete lyrics of all of the songs by the Beatles are not the same as a CT scan. Data are of use to us when they are "about" something. In philosophy this is the concept of intentionality, which is an aspect of consciousness. By themselves, the data on my memory stick have no meaning. A human consciousness must engage with the data for them to be meaningful. When this takes place, a complex web of contextual elements comes into play. Depending on who is reading them, the Beatles' lyrics may call to mind the music, the cultural references, the history of rock and roll, and diverse personal associations. A radiologist who examines a CT scan will recognize various anatomical features and perhaps concerning signs of pathology. Judgements of quality may also arise, whether in mistranscribed lyrics or a poorly performed CT scan.

The word data is the plural of the Latin word datum, meaning "something given". So the data are the "givens" in a problem. But in many cases, it might be helpful to think of data as taken rather than given. For example, when you take a photograph, you have a purpose in mind, you actively choose a scene, include some features and exclude others, and adjust the settings of the camera. The quality of the resulting image depends on how steady your hand is and how well you know the principles of photography. Even when a photograph is literally given to you by someone else, it was still taken by somebody. The camera never lies, but the photograph may be misunderstood or misrepresented.

When a gift is given to you, it is easy to default to the passive role of recipient. The details of how the gift was selected and acquired may be entirely unknown to you. A dealer in fine art would carefully investigate a newly acquired work to determine its provenance and authenticity. Similarly, when you receive data from an outside source, it is important to take an active role. At the very least, you should ask questions. Chris Anderson claims that "With enough data, the numbers speak for themselves." But on their own, the numbers never speak for themselves, any more than a painting stolen during WWII will whisper the secret of its rightful ownership. One common source of received data today is administrative data, that is, data collected as part of an organization's routine operations. Rather than taking such data at face value, it is important to investigate the underlying processes and context.

It is also possible to make use of received data to design a study. For example, to investigate the effect of a certain exposure, cases of a rare outcome may be selected from a data set and matched with controls, that is, individuals who are similar except that they did not experience that outcome. (This is a matched case-control study.) Appropriate care must be taken in how the cases and controls are selected, and in ensuring that any selection effects in the original database do not translate into bias in the analysis. Tools for the valid and efficient analysis of such observational studies have been investigated by epidemiologists and statisticians for over 50 years.

When we collect the data ourselves, we have an opportunity to take an active role from the start. In an experiment, we manipulate independent variables and measure the resulting values of dependent variables. Careful experimental design lets us accurately and efficiently obtain results. In many cases, however, true experiments are not possible. Instead, observational studies, where there is no manipulation of independent variables, are used. Numerous designs for observational studies exist, including case-control (as mentioned above), cohort, and cross-sectional. Again, careful design is vital to avoid bias, and to efficiently obtain results.


Excitement over a new development (be it a discovery, a trend, or a way of thinking) can sometimes spill over, like popcorn jumping from a popper. This may give rise to related but nevertheless distinct ideas. In the heat of the excitement (and not infrequently a good deal of hype), it's important to evaluate the quality of the ideas. Exaggerated claims may not be hard to identify, but they are also frequently pardoned as merely an excess of enthusiasm.

Still, the underlying bad idea may, in subtler form, gradually gain acceptance. The costs may only be appreciated much later. Today it is easy to see how damaging ideas like social Darwinism, the malignant offspring of a very good idea, proved to be. But at the time, it may have seemed like a plausible extrapolation from a brilliant new theory.

The role of data in our societies and our own lives is becoming increasingly central. We live in a world where the quantity of data is exploding and truly gargantuan data sets are being generated and analyzed. But it is important that we not become hypnotized by their immensity. It is all too easy to see data as somehow magical, and to imagine that big data combined with computational brute force will overcome all obstacles.

Let's enjoy the popcorn, but turn down the heat a little.

1. In this post, I won't worry too much about whether to treat data as singular or plural. It strikes me as a little bit like the question of whether to talk about bacteria or a bacterium. While the distinction is sometimes important, people can get awfully hung up on it, with little benefit.
2. See this interesting history of data science.