Most real-world data is “dirty,” i.e. filled with errors. The famous New York taxi trip dataset, for instance, has one trip destination whose lattitude and longitude place it in Antartica! The impact of such erroneous data on one’s statistical analysis can be anywhere from mild to disabling. Let’s see below how one might ferret out bad data. And along the way, we’ll cover several new R concepts.
We’ll use the famous Pima Diabetes dataset. Various versions exist, but
we’ll use the one included in faraway
, an R package compiled
by Julian Faraway, author of several popular books on statistical
regression analysis in R.
I’ve placed the data file, Pima.csv
, on
my Web site. Here
is how you can read it into R:
> pima <- read.csv('http://heather.cs.ucdavis.edu/FasteR/data/Pima.csv',header=TRUE)
The dataset is in a CSV (“comma-separated values”) file. Here we read it, and assigned the resulting data frame to a variable we chose to name pima.
Note that second argument, header=TRUE
. A header in a file, if one
exists, is in the first line in the file. It states what names the
columns in the data frame are to have. If the file doesn’t have one,
set header
to FALSE. You can always add names to your data frame
later (future lesson).
Tip: It’s always good to take a quick look at a new data frame:
> head(pima)
pregnant glucose diastolic triceps insulin bmi diabetes age test
1 6 148 72 35 0 33.6 0.627 50 1
2 1 85 66 29 0 26.6 0.351 31 0
3 8 183 64 0 0 23.3 0.672 32 1
4 1 89 66 23 94 28.1 0.167 21 0
5 0 137 40 35 168 43.1 2.288 33 1
6 5 116 74 0 0 25.6 0.201 30 0
> dim(pima)
[1] 768 9
The dim
function tells us that there are
768 people in the study, 9 variables measured on each.
Since this is a study of diabetes, let’s take a look at the glucose
variable. R’s table
function is quite handy.
> table(pima$glucose)
0 44 56 57 61 62 65 67 68 71 72 73 74 75 76 77 78 79 80 81
5 1 1 2 1 1 1 1 3 4 1 3 4 2 2 2 4 3 6 6
82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
3 6 10 7 3 7 9 6 11 9 9 7 7 13 8 9 3 17 17 9
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
13 9 6 13 14 11 13 12 6 14 13 5 11 10 7 11 6 11 11 6
122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141
12 9 11 14 9 5 11 14 7 5 5 5 6 4 8 8 5 8 5 5
142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
5 6 7 5 9 7 4 1 3 6 4 2 6 5 3 2 8 2 1 3
162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
6 3 3 4 3 3 4 1 2 3 1 6 2 2 2 1 1 5 5 5
182 183 184 186 187 188 189 190 191 193 194 195 196 197 198 199
1 3 3 1 4 2 4 1 1 2 3 2 3 4 1 1
Be careful here; the first, third, fifth and so on lines are the glucose values, while the second, fourth, sixth and so on lines are the counts of women having those values. For instance, 3 women had the glucose =
Uh, oh! 5 women in the study had glucose level 0. And 1 had level 44, etc. Presumably 0 is not physiologically possible, and maybe not 44 either.
Let’s consider a version of the glucose data that at least excludes these 0s.
> pg <- pima$glucose
> pg1 <- pg[pg > 0]
> length(pg1)
[1] 763
As before, the expression pg > 0
creates a vector of TRUEs and FALSEs.
The filtering pg[pg > 0]
will only pick up the TRUE cases, and sure
enough, we see that pg1
has only 763 cases, as opposed to the
original 768.
Did removing the 0s make much difference? Turns out it doesn’t:
> mean(pg)
[1] 120.8945
> mean(pg1)
[1] 121.6868
But still, these things can in fact have major impact in many statistical analyses.
R has a special code for missing values, NA, for situations like this. Rather than removing the 0s, it’s better to recode them as NAs. Let’s do this, back in the original dataset so we keep all the data in one object:
> pima$glucose[pima$glucose == 0] <- NA
Tip: That’s pretty complicated. It’s clearer to break things up into smaller steps (I recommend this especially for beginners), as follows:
> glc <- pima$glucose
> z <- glc == 0
> glc[z] <- NA
> pima$glucose <- glc
Here is what the code does:
glc
are 0s, resulting in z
being a vector of TRUEs and FALSEs.glc
corresponding to
the TRUEs. (Note the recycling of NA
.)glc
to it.Tip: Note again the double-equal sign! If we wish to test whether, say,
a
andb
are equal, the expression must bea == b
, nota = b
; the latter would doa <- b
. This is a famous beginning programmer’s error.
As a check, let’s verify that we now have 5 NAs in the glucose variable:
> sum(is.na(pima$glucose))
[1] 5
Here the built-in R function is.na
will return a vector of TRUEs and
FALSEs. Recall that those values can always be treated as 1s and 0s,
thus summable. Thus we got our count, 5.
Let’s also check that the mean comes out right:
> mean(pima$glucose)
[1] NA
What went wrong? By default, the mean
function will not skip over
NA values; thus the mean was reported as NA too. But we can instruct
the function to skip the NAs:
> mean(pima$glucose,na.rm=TRUE)
[1] 121.6868
Your Turn: Determine which other columns in
pima
have suspicious 0s, and replace them with NA values.Now, look again at the plot we made earlier of the Nile flow histogram. There seems to be a gap between the numbers at the low end and the rest. What years did these correspond to? Find the mean of the data, excluding these cases.