Monday, August 23, 2010

An ORnery problem

Some time ago, I wrote about missing values and how they complicate the life of an applied statistician. A particularly tricky case concerns logical variables, and I give a more detailed explanation here.

Suppose X is a variable representing whether a person is at risk for developing type-2 diabetes. Two known risk factors are (A) being older and (B) being overweight. If we had a database containing the age and weight of each person in a group, we could compute X using the following logical expression:
X = A OR B.
(X, A, and B are known as logical variables, and they each take values TRUE or FALSE according to whether the corresponding condition holds.) But what happens if some ages and weights are missing from the database? Fortunately, statistical software packages like R and SPSS have built-in rules that will correctly evaluate the logical expression, even if A or B (or both) are missing. The complete truth table is as follows (where T means TRUE, F means FALSE, and a dot means missing):

A    B    A OR B
T    T    T
T    F    T
T    .    T
F    T    T
F    F    F
F    .    .
.    T    T
.    F    .
.    .    .

Note that if A is FALSE and B is missing, the result is missing. That makes sense because if the actual value of B were TRUE, the result would be TRUE, but if the actual value of B were FALSE, the result would be FALSE. Thus it is not possible to say what the value of X is.
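The rules described above can be checked directly in R, where | is the OR operator and NA is the missing value. The following is only a small sketch; the names vals and truth are chosen here for illustration.

```r
# The three possible states of a logical variable; NA is R's missing value.
vals <- c(TRUE, FALSE, NA)

# Evaluate A | B for every combination of A (rows) and B (columns).
truth <- outer(vals, vals, "|")
dimnames(truth) <- list(A = c("T", "F", "."), B = c("T", "F", "."))

print(truth)
```

The result is missing only where neither A nor B is known to be TRUE and at least one of them is missing.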

The trouble is, this logic can sometimes be perverse. Suppose X instead represents whether a patient tests positive for infection with a certain virus. There are two different blood tests (A and B), and a patient may receive one or the other, or perhaps both. If either test is positive, the patient is considered to be infected. The logical variables A and B take the values TRUE, FALSE, or missing according to whether the corresponding tests were positive, negative, or simply not performed. Shouldn't the logical expression X = A OR B handle this situation correctly? Unfortunately not. Suppose only one test was performed, and it was negative. Then the truth table shows that X will be missing, even though the patient tested negative!

Why does the logical expression handle missing values the way we want in the first case, but fail to do so in the second? The answer is that in the first case a missing value means that the age or weight of a given person is not available: the value exists, but we do not know it. In the second case, a missing test outcome means that no test was performed, so the variable representing the outcome is not applicable.

Another common case of variables that are not applicable occurs with data representing observations on multiple occasions. For example, suppose a database records whether hotel guests eat at the hotel restaurant on the first day of their stay (EAT1), the second day (EAT2), or the third day (EAT3). Some guests stay for just one day, while others stay longer. The database may look like this:

This is an example of a ragged array, and as with the blood test, the issue is that the denominator (the number of tests performed, or the number of days a guest stays at the hotel) varies. To determine whether a guest ate at the hotel restaurant at least once (which we will represent by the logical variable EAT), we might try:
EAT = EAT1 OR EAT2 OR EAT3.
Unfortunately, as with the blood test example, this can fail when there are missing values. Guest number 4 in the table above stayed just one day at the hotel and did not eat at the restaurant, so EAT should be FALSE, but the expression above gives a missing value.
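The failure is easy to reproduce in R. The data frame below is hypothetical: its values are invented for illustration, except that guest 4 matches the description above (a one-day stay with no restaurant meal). NA marks a day on which the guest was no longer at the hotel.

```r
# Hypothetical data: NA marks a day on which the guest was no longer staying.
guests <- data.frame(
  guest = 1:4,
  EAT1  = c(TRUE, FALSE, FALSE, FALSE),
  EAT2  = c(NA,   TRUE,  FALSE, NA),
  EAT3  = c(NA,   NA,    FALSE, NA)
)

# Plain OR across the three days.
guests$EAT <- guests$EAT1 | guests$EAT2 | guests$EAT3

print(guests)
```

Guests who ate at least once get TRUE regardless of missing days, but guest 4 gets NA rather than FALSE.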

Workaround in R

In R, the vertical bar operator | represents OR, and missing values are represented by NA. For the diabetes example, the following behaviour is just what we want:

> FALSE | NA
[1] NA

In other words, when a person does not have one of the risk factors, but we don't know about the other one, then we don't know if the person is at risk. But for the blood test example, we need to use the following code:

> sum(FALSE,NA,na.rm=TRUE)>0
[1] FALSE

The sum function adds up logical values by treating TRUE as 1 and FALSE as 0. If the sum of the logical values is greater than zero, then at least one of the values must have been TRUE. Setting na.rm=TRUE tells sum to ignore missing values.
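With a whole column of patients rather than a single pair of values, the same idea can be applied row by row. The sketch below uses hypothetical result vectors A and B. It also shows one possible refinement, which is an assumption about what an analyst might want rather than part of the workaround above: when no test was performed at all, sum(..., na.rm=TRUE) is 0 and the result is FALSE, so X_strict leaves such patients missing instead.

```r
# Hypothetical test results: NA means the test was not performed.
A <- c(TRUE, FALSE, FALSE, NA,   NA)
B <- c(NA,   FALSE, NA,    TRUE, NA)

tests <- cbind(A, B)

# Row-wise version of sum(..., na.rm = TRUE) > 0.
X <- rowSums(tests, na.rm = TRUE) > 0

# Optional refinement (an assumption about what is wanted): leave X missing
# for patients who received no test at all.
n_done <- rowSums(!is.na(tests))
X_strict <- ifelse(n_done == 0, NA, X)

print(data.frame(A, B, X, X_strict))
```

For a single patient, any(FALSE, NA, na.rm=TRUE) gives the same answer as the sum expression.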

Workaround in SPSS

The situation is much the same in SPSS. For the diabetes example, the following works:

COMPUTE X = A OR B.
EXECUTE.

But for the blood test example, we need to use:

COMPUTE X = SUM(A,B)>0.
EXECUTE.

Note that the SPSS function SUM ignores missing values, so it returns a missing value only when all of its arguments are missing.

Missing value mistakes

The hard part, of course, is thinking through how the missing values in a given situation should be handled. I suspect that this issue has resulted in countless errors in data analyses. Proceed with caution: a miss is as good as a mile!
