In our earlier examples of regression analysis, we were predicting a continuous variable such as human weight. But what if we wish to predict a dichotomous variable, i.e. one recording which of two outcomes occurs?
Consider the Pima dataset from earlier examples. Say we are predicting whether someone has, or will later develop, diabetes. This is coded in the test column of the dataset, 1 for having the disease, 0 for not.
As a simple example, say we try to predict test from the variables bmi and age. A linear model would be
mean test = β0 + β1 bmi + β2 age
Remember, test takes on the values 1 and 0. What happens when we take the average of a bunch of 1s and 0s? The answer is that we get the proportion of 1s. For instance, the mean of the numbers 1,0,1,1 is 3/4, which is exactly the proportion of 1s in that data.
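The point about averaging 1s and 0s can be checked directly in R. This is a minimal sketch using the four numbers from the example above; the variable name y is just for illustration.

```r
# The values from the text: three 1s and one 0.
y <- c(1, 0, 1, 1)

mean(y)        # 0.75, i.e. 3/4
mean(y == 1)   # the proportion of 1s, computed explicitly; same value
```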
In statistical terms, what the above equation is doing is expressing the probability of a 1, i.e. the probability of having diabetes, in terms of Body Mass Index and age.
Not a bad model, but one troubling point is that the right-hand side could evaluate to a number less than 0 or greater than 1, which would be impossible for a probability. In order to deal with that problem, we might use a logistic model, as follows.
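To see how a linear model can stray outside the 0 to 1 range, here is a small sketch on made-up toy data. It is not the Pima data; x and y are invented purely to illustrate the issue.

```r
# Made-up toy data (not from the Pima dataset): x is a predictor,
# y is a 0/1 outcome that switches from 0 to 1 as x grows.
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Fit an ordinary linear model to the 0/1 outcome.
fit <- lm(y ~ x)

# Smallest and largest fitted values: one falls below 0, the other above 1.
range(fitted(fit))
```

With these toy numbers, the fitted value at x = 1 is negative and the one at x = 10 exceeds 1, so neither can be read as a probability.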
Define the logistic function to be
l(t) = 1 / (1 + e^(-t))
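A quick sketch of why this function suits probabilities: whatever value of t goes in, the result lies strictly between 0 and 1. The function name l follows the definition above; plogis() is base R's built-in version of the same function.

```r
# The logistic function as defined in the text.
l <- function(t) 1 / (1 + exp(-t))

l(0)            # 0.5
l(c(-10, 10))   # very close to 0 and to 1, but still strictly inside (0, 1)

# Base R's plogis() computes the same quantity.
all.equal(l(c(-2, 0, 3)), plogis(c(-2, 0, 3)))   # TRUE
```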
We then modify the above equation to
probability of diabetes = l(β0 + β1 bmi + β2 age)
As before, the statistical details are beyond the scope of this R tutorial, but here is how you estimate the coefficients βi using R:
> glout <- glm(test ~ bmi + age, data=pima, family=binomial)
> summary(glout)
...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.40378    0.51530 -10.487  < 2e-16 ***
bmi          0.09825    0.01248   7.874 3.45e-15 ***
age          0.04561    0.00694   6.571 4.98e-11 ***
...
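For readers who want to try the same kind of glm() call without the Pima data at hand, here is a sketch on simulated data. Everything in it is an assumption made for illustration: the data frame sim, the "true" coefficients, and the seed are all invented, so the estimates will not match the output above.

```r
set.seed(1)  # arbitrary seed, only so the simulation is repeatable

# Simulated data, NOT the Pima data; the "true" coefficients are invented.
n   <- 500
bmi <- rnorm(n, mean = 32, sd = 7)
age <- round(runif(n, min = 21, max = 80))
p   <- plogis(-5 + 0.1 * bmi + 0.04 * age)
sim <- data.frame(bmi = bmi, age = age,
                  test = rbinom(n, size = 1, prob = p))

# Same model formula and family as in the text, applied to the simulated data.
simout <- glm(test ~ bmi + age, data = sim, family = binomial)
coef(simout)   # the estimated coefficients, as a named vector

# Estimated probability for a hypothetical person with BMI 32 and age 30.
newx <- data.frame(bmi = 32, age = 30)
predict(simout, newdata = newx, type = "response")
```

With type = "response", predict() applies the logistic function to the fitted linear combination, so the result is a probability rather than a value on the linear scale.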
Let’s explore those estimated βi a bit. Consider women with about average BMI, say 32, and compare 30-year-olds to those of age 40.
> l <- function(t) 1 / (1 + exp(-t))
> l(-5.40378 + 32*0.09825 + 30*0.04561)
[1] 0.2908045
> l(-5.40378 + 32*0.09825 + 40*0.04561)
[1] 0.3928424
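The two computations above can also be wrapped in a small helper, which makes it easy to try other ages or BMI values. This is only a sketch: prob_diabetes is a hypothetical name, and the coefficients are copied by hand from the summary() output rather than extracted from the fitted object.

```r
# Coefficient estimates copied from the summary() output in the text.
b0    <- -5.40378
b_bmi <-  0.09825
b_age <-  0.04561

# Hypothetical helper: estimated probability of diabetes for given BMI and age.
# plogis() is base R's logistic function, 1 / (1 + exp(-t)).
prob_diabetes <- function(bmi, age) plogis(b0 + b_bmi * bmi + b_age * age)

# Vectorized over age: BMI fixed at 32, ages 30 and 40 as in the text.
prob_diabetes(32, c(30, 40))
```

The call returns the same two probabilities as the separate l(...) computations, since R applies the arithmetic elementwise across the vector of ages.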
So, the estimated risk of diabetes increases substantially over that 10-year period, from about 29% to about 39%, for this population and BMI level.