Friday, June 30, 2006

My two cents

I'm a creature of habit. Each morning, upon arriving at the hospital where I work, I stop at the café. I always get the same size and type of coffee, and I know the price in advance. Surprisingly enough, today my coffee was 2 cents cheaper. Why? Well, the Conservative government has just lowered the goods and services tax (GST) by one percentage point, from 7% to 6%.

Of course the price difference on a cup of coffee is utterly trivial. (How long before we save everyone a lot of trouble and get rid of the penny altogether?) If I'd been buying a new car I might have saved a few hundred dollars.

The Conservatives believe that this will stimulate the economy, and that may well be true, although I imagine that might take a little while. But in any case, that's an economic prediction, not a certainty. Economies don't always behave the way economic models suggest they will. (I'm being charitable.) Ultimately, it remains to be seen what will actually happen. If the economy does improve, the Conservatives will no doubt attribute the change to the reduction in the GST, but to convincingly argue for a causal relationship isn't nearly so easy. Any number of other factors could be responsible for such a change.

What seems indisputable is that in the short term at least, the government will take in less tax revenue. I haven't seen any economic analyses about the longer term. Perhaps we're supposed to just take it on faith that lower taxes are a good thing. If government revenues are reduced, then you have to increase debt or reduce spending. For fans of smaller government, the choice seems clear.

When Conservatives look at government spending, they see some prime targets: health care, education, social programs, funding for the arts. Oddly enough they seem to forget one big area of government spending: the military, which seems to enjoy some special metaphysical status. In any case, wouldn't cutting military spending be ... unpatriotic?

Tomorrow is Canada Day, and I must admit that while I generally abhor nationalism, I do have a soft spot for Canada Day. I think this is a wonderful country, and we're so fortunate in so many ways. I also love the celebration of diversity that has become such a central part of our national holiday.

I'm sympathetic to fiscal conservatives who want to cut bureaucratic waste and mismanagement. The trouble is it's much easier to talk about doing that than to actually achieve real progress. It's also pretty clear to me that while everyone would like to eliminate inefficiency, Stephen Harper's Conservative government has a lot more than that on its agenda. The move towards privatizing health care in this country isn't about eliminating inefficiency (in fact I'd argue just the opposite). It's part of the broader plan to downsize social spending in general. I, for one, would rather pay a few more cents for my coffee if it helps protect Canada's healthcare system and social programs.

Sunday, June 25, 2006

That question sucks!

Andrew Gelman commented yesterday on a recent CBS News Poll which asked the following question:
"Should U.S. troops stay in Iraq as long as it takes to make sure Iraq has a stable democracy, even if it takes a long time, or leave as soon as possible, even if Iraq is not completely stable?"
Gelman points out that it's a "double-barrelled" question with
"... the assumption that U.S. troops will 'make sure Iraq has a stable democracy,' along with the question of how long the troops should stay".
He also notes that the New York Times piece on the poll included "a yucky graph (as Tufte would put it, 'chartjunk')", which I have shown here.

It really doesn't get much better than this: an exceedingly slanted question and an exceedingly silly graph! If I may, I'd like to name it the bow tie graph ... but is that taken? On this, I defer to Kaiser over at Junk Charts. Dressing up simple percentages with multicoloured variable-sized triangle regions is ingenious, but misguided. It's a shame that so much creative effort is misspent. There are many situations where new ideas for displaying data are needed, but instead we get a never-ending stream of bizarre ways to display percentages.

The question itself stands as striking evidence of media bias. Respondents should have been given a third option: "That question sucks!" (I'm reminded of the segment on This Hour Has 22 Minutes called "That Show Sucked!") As it is, it appears that about 7% of Republicans, 5% of Democrats, and 8% of Independents didn't answer the question. Some of those who didn't answer may have felt the question was too stupid to dignify with a response. But it's quite interesting to speculate on what the responses might have been like if the following (perhaps more dignified) choice had been included:
I can't answer this question because I believe it has built-in assumptions that I disagree with.
I often find that this is my reaction to opinion poll questions. To be fair, even with an honest effort to understand people's views (which seems utterly implausible in the case above), it's not easy to ask good questions, or to provide a good set of response choices. This seems like a strong argument in favour of qualitative research methods, which avoid imposing a predefined (and possibly ideologically loaded) structure on responses. I'm very much in the quantitative camp, but for complex matters like political opinions, I can see that there may be some value, particularly in the early stages of research, in taking a qualitative approach.

When it comes to exposing ideological bias, no one's better than Tom Tomorrow. Check out this archive of cartoons from his brilliant series, This Modern World.

Saturday, June 17, 2006

Another kind of nothing

Missing values are the bane of the applied statistician. They haunt our data sets, invisible specters lurking malevolently.

The missing value is like the evil twin of zero. The introduction of a symbol representing "nothing" was a revolutionary development in mathematics. The arithmetic properties of zero are straightforward and universally understood (except perhaps when it comes to division by zero, a rather upsetting idea). In comparison, the missing value has no membership in the club of numbers, and its properties are shrouded in mystery. The missing value was a pariah until statisticians reluctantly took it in—someone had to. And it's an ill-behaved tenant, popping in and out unexpectedly, sometimes masquerading as zero, sometimes carrying important messages—always a source of mischief.

... symbolizing nothing

A variety of different symbols are used to represent missing values. The statistical software packages SAS and SPSS, for example, use a dot. The fact that it's almost invisible is oddly fitting. Other software uses NA or N/A—but does that mean "not available" or "not applicable"? These are, after all, two very different situations. The IEEE floating point standard includes NaN, meaning "not a number" (for example 0/0 is not a number). In spreadsheets, a missing value is simply an empty cell (but in Excel, at least, invalid calculations result in special types of missing values—for example 0/0 results in "#DIV/0!"). Dates and times can also be missing, as can character string variables.

Logical expressions, such as "X > 0" (which can either be TRUE or FALSE), are an interesting special case. If X is missing, then the value of the expression itself is missing. Suppose Y=5. If X is missing, what is the value of the expression "(X > 0) AND (Y > 0)"? Well, we can't say, because we need to know the values of both X and of Y to determine the result. So the value of "(X > 0) AND (Y > 0)" is missing. How about "(X > 0) OR (Y > 0)"? In this case, the answer is TRUE. It is enough to know that Y is positive to answer the question, regardless of the value of X. (There's also a logical operation called exclusive-OR, denoted XOR, which means "one or the other is true, but not both". You'd need to know both values in that case.)
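The rules just described can be sketched in a few lines of Python. This is only an illustration, not how any particular statistical package implements them: None stands in for a missing value, and the helper names gt, and3, or3 and xor3 are made up for this example.

```python
def gt(x, threshold):
    """Return x > threshold, or None (missing) when x itself is missing."""
    return None if x is None else x > threshold


def and3(a, b):
    """AND with missing values: a single FALSE settles the answer."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """OR with missing values: a single TRUE settles the answer."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def xor3(a, b):
    """Exclusive-OR: both values must be known."""
    if a is None or b is None:
        return None
    return a != b


x = None  # X is missing
y = 5     # Y = 5, as in the text

print(and3(gt(x, 0), gt(y, 0)))  # None: we would need to know X
print(or3(gt(x, 0), gt(y, 0)))   # True: Y > 0 is enough
print(xor3(gt(x, 0), gt(y, 0)))  # None: we would need to know X
```

By the same reasoning, if Y were negative the AND expression would be FALSE rather than missing, since one false operand is enough to decide it.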

Even though the rules above seem straightforward, great care must still be taken in ensuring that calculations are handling missing values appropriately. That's because in reality there are any number of different kinds of missing values. Suppose, for example, that as part of a clinical study of neurotoxic effects of chemotherapy agents, IQ is measured. What does a missing value in the data set mean? Perhaps the patient wasn't available on the day the measurement took place. Or perhaps they died. Or perhaps their cognitive disability was so severe that the test couldn't be administered. In the last two cases, the missingness might well be related to the neurotoxic effect of interest. This is known as informative missingness. Statisticians also distinguish the case where values are "missing at random" versus "missing completely at random". The latter is the best we can hope for—but it's often wishful thinking.
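To see why informative missingness matters, here is a toy sketch in Python. The scores and the cutoff are invented purely for illustration (they are not from any study); the assumption is that the test simply cannot be administered to patients whose score would fall below the cutoff.

```python
import statistics

# Hypothetical "true" IQ scores for ten patients (made-up numbers).
true_scores = [62, 70, 78, 85, 90, 95, 100, 105, 110, 115]

# Assumption for this sketch: below this score the test cannot be given,
# so the value is recorded as missing (None).
CUTOFF = 75

recorded = [s if s >= CUTOFF else None for s in true_scores]

# "Complete case" analysis: silently drop the missing values.
complete_cases = [s for s in recorded if s is not None]

true_mean = statistics.mean(true_scores)
complete_case_mean = statistics.mean(complete_cases)

print("mean of all patients:       ", true_mean)
print("mean of non-missing values: ", complete_case_mean)
```

Because the missing values here are exactly the lowest scores, the mean of the non-missing values overstates the mean for the whole group. Ignoring the missing values would make the neurotoxic effect look smaller than it is.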

Something for nothing

One potential solution to the problem of missing values is imputation, that is, filling in values ... somehow. One approach is mean imputation, in which the mean of the values that are not missing is substituted for any missing values. Seems reasonable, except that it artificially reduces variability, which can seriously distort inferences. A variety of other imputation methods have been proposed, the most sophisticated of which, multiple imputation, allows for valid variance estimates provided a number of assumptions hold. The bottom line is there are no easy solutions: you can't get something for nothing ... for nothing.
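A small Python sketch shows how mean imputation shrinks variability. The numbers are invented for illustration only: four observed values and four missing ones.

```python
import statistics

# Made-up data for illustration: four observed values, four missing.
observed = [4.0, 7.0, 9.0, 12.0]
n_missing = 4

# Mean imputation: every missing value is replaced by the observed mean.
observed_mean = statistics.mean(observed)
imputed = observed + [observed_mean] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print("mean before and after:", observed_mean, statistics.mean(imputed))
print("standard deviation, observed values only:", round(sd_observed, 2))
print("standard deviation, after mean imputation:", round(sd_imputed, 2))
```

The mean is unchanged, but the standard deviation drops, because the filled-in values add to the sample size without adding any spread. Standard errors computed from the imputed data would therefore be too small.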

Too much of nothing

The really unfortunate thing is that missing values are often the result of bad design. Perhaps the most common instance of this is surveys. Most surveys are too long! This creates at least three problems. The first is non-response (which is missing values writ large). While I might be willing to spend 30 seconds answering a questionnaire, I'd be much less interested in spending 10 minutes. The second problem is that even if I do answer the questionnaire, I may get tired and skip some questions (or perhaps only get part way through), or take less care in answering. The third problem is that long surveys also tend to be badly designed. There's a simple explanation for this: when there are a small number of questions, great care can be taken to get them right; typically when there are a large number of questions, less effort is put into designing each individual question. "Surveybloat" is ubiquitous and wasteful, often the product of design-by-committee. The desire to add "just one more question" is just too strong and the consequences are apparently too intangible (despite being utterly predictable). I would say that most surveys are at the very least twice as long as they ought to be.

In medical research, the randomized controlled trial is considered to be the "gold standard" of evidence. By randomly assigning an experimental intervention to some patients and a control (for example, standard care) to other patients, the effect of the experimental intervention can be reliably assessed. Because of the random allocation, the two groups of patients are unlikely to be very different beforehand. This is a tremendous advantage because it permits fair comparisons. But everything hinges on being able to assess the outcomes for all patients, and this is surprisingly difficult to do. Patients die or drop out of studies (due to side effects of the intervention?); forms are sometimes lost or not completed properly; it's not always possible to obtain follow-up measurements—the sources of missing values are almost endless. But each missing value weakens the study.

If this is a problem with prospective studies, in which patients are followed forward in time and pre-planned measurements are conducted, imagine the difficulties with retrospective studies, for example reviews of medical charts. Missing values are sometimes so prevalent that data sets resemble Swiss cheese. In such cases, how confident can we really be in the study findings?

Learning nothing

Most children learn about zero even before they start school. But who learns about missing values? University-level statistics courses cover t-tests, chi-square tests, analysis of variance, linear regression ... (How much of any of this is retained by most students is another question.) It's only in advanced courses that any mention is made of missing values. So graduate students in statistics (and perhaps students in a few other disciplines) learn about missing values; but even then, it's usually from a rather theoretical perspective. In day-to-day data analysis, missing values are rife. I would hazard a guess that of all the p-values reported in scientific publications, at least half the time there were missing values in the corresponding data, and they were simply ignored. In scientific publications, missing values are routinely swept under the carpet.

Missing values are a bit of a dirty secret in science. Because they are rarely mentioned in science education, it's not surprising that they are often overlooked in practice. This is terribly damaging—regardless of whether it's due to ignorance, dishonesty, or wishful thinking.

Nihil obstat

In some cases, missing values may just be an irritation with little consequence other than a reduction in sample size. It would be lovely if that were always the case, but it simply isn't. We ignore missing values at our peril.


Addendum (22 June 2006):

In my post I discussed how logical expressions are affected by missing values. The difference between a value that is not available and one that is not applicable has an interesting effect here. Suppose that following an initial assessment of a patient, a clinic administers a single-sheet questionnaire each time the patient returns. One of the questions is:
Since your last visit, have you experienced such-and-such symptom?
It might be of interest to know what proportion of patients have ever answered yes. Suppose that patients returned to the clinic up to three times. A logical expression to represent whether each patient had ever experienced the symptom would be:
symptom = v1 OR v2 OR v3
where v1 is TRUE if the patient reported the symptom on the first return visit, and likewise for v2 and v3. Suppose that a particular patient returned to the clinic three times, and answered "no" the first two times, but the questionnaire sheet for that patient's third visit was misplaced. Then v1=FALSE, v2=FALSE, and v3 is missing (not available). Following the rules that I discussed earlier for logic with missing values (these are rules used in SPSS and R, and I suspect in most other statistical packages), the value of the logical expression would be missing, which makes sense: we unfortunately don't know if the patient ever reported experiencing the symptom.

Suppose that another patient only returned to the clinic twice, also answering "no" on both visits. Then again v1=FALSE, v2=FALSE, and v3 is missing (not applicable, since there was no third visit). Blindly applying the rules, the value of the logical expression would again be missing. But this time, it's incorrect: we know that this patient never reported experiencing the symptom.

This is one of the justifications for my statement that "Even though the rules above seem straightforward, great care must still be taken in ensuring that calculations are handling missing values appropriately."
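One way to avoid the problem in the addendum is to record the two kinds of missing value separately. The following Python sketch is only an illustration under that assumption; the Missing markers and the ever_reported function are hypothetical names, not features of SPSS or R.

```python
from enum import Enum


class Missing(Enum):
    """Two different kinds of missing value (hypothetical markers)."""
    NOT_AVAILABLE = "not available"    # the visit happened, the answer was lost
    NOT_APPLICABLE = "not applicable"  # the visit never happened


def ever_reported(visits):
    """Did the patient ever report the symptom (v1 OR v2 OR v3)?

    Each entry is True, False, or one of the Missing markers.
    Visits that never happened are dropped before applying OR.
    """
    answers = [v for v in visits if v is not Missing.NOT_APPLICABLE]
    if any(v is True for v in answers):
        return True
    if any(v is Missing.NOT_AVAILABLE for v in answers):
        return Missing.NOT_AVAILABLE  # we genuinely cannot tell
    return False


# First patient: three visits, third questionnaire misplaced.
patient_a = [False, False, Missing.NOT_AVAILABLE]
# Second patient: only two visits, so the third is not applicable.
patient_b = [False, False, Missing.NOT_APPLICABLE]

print(ever_reported(patient_a))  # Missing.NOT_AVAILABLE
print(ever_reported(patient_b))  # False
```

With a single undifferentiated missing value, both patients would get the same (missing) result. Keeping the two meanings apart is what lets the second patient be counted correctly as never having reported the symptom.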

Thursday, June 15, 2006

Of buffoonery and bigotry

Tonight is the New York City premiere of a documentary film called American Zeitgeist. The film's subtitle is "Crisis and conscience in an age of terror", and it looks fascinating. Following the screening—in fact as I write this—a debate is taking place between Eric Margolis and Christopher Hitchens.

My friend Ray pointed this out, in passing, on his blog earlier this week, and I took the opportunity to comment on Christopher Hitchens. Here is an edited version of my comments:


Part of me thinks that the best response to Christopher Hitchens is simply to ignore him. How anybody can consider him to be anything but a complete buffoon is beyond me.

I think it's interesting to compare Christopher Hitchens and Ann Coulter. On the face of it, they're very different. Coulter is indisputably a joke (albeit a very nasty one), with no pretense of seriousness or intellect. Hitchens, on the other hand, has the sheen of intellectual and moral respectability.

But Coulter and Hitchens are reading from the same hateful script. Coulter plays the comic while Hitchens plays the learned professor. The groundlings are tickled by Coulter's antics, while the folks in the balconies are enthralled by Hitchens' sage pronouncements. There's something for everyone!

That is, unless you'd like a little honesty or decency.

At first blush, ignoring their nonsense seems an attractive option. But media ownership being what it is, Coulter and Hitchens are guaranteed to get lots of exposure. And they're both dangerous.

Here's a small taste of the world according to Hitchens:
"if Muslims do not want their alleged prophet identified with barbaric acts or adolescent fantasies, they should say publicly that random murder for virgins is not in their religion."
This is the kind of inflammatory rhetoric Hitchens is famous for. The same Hitchens who has been unrelenting in his support for the invasion and occupation of Iraq. One of his most recent pronouncements is that Haditha is not like My Lai. How fortunate.

Hitchens has demonstrated repeatedly that he's morally bankrupt, not least in the way he treats people and the way he cites evidence. Everyone makes mistakes from time to time, and in general people deserve the benefit of the doubt. But at a certain point you have to pull the plug. Hitchens reached that point ages ago. I'm reminded of a line from Shakespeare: "O wicked wit and gifts, that have the power So to seduce!"

For more insight into Hitchens, see this article or this one.


Another commenter took exception to my characterization of Hitchens and to what she took to be a de facto attack on anyone who agrees with him. That wasn't my intent, but if you're interested, you be the judge (click on the comments link at the end of the blog post).