Recall that earlier we found that there were several columns in the Pima dataset that contained values of 0, which were physiologically impossible. These should be coded NA. We saw how to do that recoding for the glucose variable:
> pima$glucose[pima$glucose == 0] <- NA
But there are several columns like this, and we’d like to avoid doing
this all repeatedly by hand. (What if there were several hundred such
columns?) Instead, we’d like to do this programmatically. This can be
done with R’s for loop construct (which by the way most programming
languages have as well).
Let’s first check which columns seem appropriate for recoding. Recall that there are 9 columns in this data frame.
> for (i in 1:9) print(sum(pima[,i] == 0))
[1] 111
[1] 5
[1] 35
[1] 227
[1] 374
[1] 11
[1] 0
[1] 0
[1] 500
This is known in the programming world as a for loop.
The print(etc.) is called the body of the loop. The for (i in 1:9) part says, “Execute the body of the loop with i = 1, then execute
it with i = 2, then i = 3, etc. up through i = 9.”
In other words, the above code instructs R to do the following:
i <- 1
print(sum(pima[,i] == 0))
i <- 2
print(sum(pima[,i] == 0))
i <- 3
print(sum(pima[,i] == 0))
i <- 4
print(sum(pima[,i] == 0))
i <- 5
print(sum(pima[,i] == 0))
i <- 6
print(sum(pima[,i] == 0))
i <- 7
print(sum(pima[,i] == 0))
i <- 8
print(sum(pima[,i] == 0))
i <- 9
print(sum(pima[,i] == 0))
And this amounts to doing
print(sum(pima[,1] == 0))
print(sum(pima[,2] == 0))
print(sum(pima[,3] == 0))
print(sum(pima[,4] == 0))
print(sum(pima[,5] == 0))
print(sum(pima[,6] == 0))
print(sum(pima[,7] == 0))
print(sum(pima[,8] == 0))
print(sum(pima[,9] == 0))
Now, it’s worth reviewing what those statements do, say the first. Once
again, pima[,1] == 0 yields a vector of TRUEs and FALSEs, each
indicating whether the corresponding element of column 1 is 0. When we
call sum, TRUEs and FALSEs are treated as 1s and 0s, so we get the
total number of TRUEs — which is a count of the number of elements in that
column that are 0, exactly what we wanted.
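To see the TRUE/FALSE counting at work without loading the Pima data, here is a minimal sketch on a made-up vector. The name x and its values are purely illustrative and are not taken from the dataset.

```r
# x is a hypothetical stand-in for one column of a data frame
x <- c(5, 0, 12, 0, 8)

isZero <- x == 0      # logical vector, one TRUE/FALSE per element
print(isZero)         # FALSE  TRUE FALSE  TRUE FALSE

# sum() treats TRUE as 1 and FALSE as 0, so this counts the 0s
nZeros <- sum(isZero)
print(nZeros)         # 2
```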
The variable i in for (i in 1:9)... is known as the index of the
loop. It’s just an ordinary R variable, so name it what you wish.
Instead of i, we might name it, say, colNumber.
for (colNumber in 1:9) print(sum(pima[,colNumber] == 0))
A technical point: Why did we need the explicit call to print?
Didn’t we say earlier that just typing an expression at the R > prompt
will automatically print out the value of the expression? Ah yes — but
we are not at the R prompt here! Yes, in the expanded form we see
above,
print(sum(pima[,1] == 0))
print(sum(pima[,2] == 0))
print(sum(pima[,3] == 0))
print(sum(pima[,4] == 0))
print(sum(pima[,5] == 0))
print(sum(pima[,6] == 0))
print(sum(pima[,7] == 0))
print(sum(pima[,8] == 0))
print(sum(pima[,9] == 0))
each command would be issued at the prompt. But in the for loop version
for (i in 1:9) print(sum(pima[,i] == 0))
we are calling print() from within the loop, not at the prompt.
So, the explicit call to print() is needed.
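The difference is easy to check for yourself. In this small sketch (the numbers are arbitrary), the first loop computes each value but displays nothing, while the second one prints each result.

```r
# No output: each i^2 is computed and then simply discarded
for (i in 1:3) i^2

# Output appears, because print() is called explicitly
for (i in 1:3) print(i^2)
```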
We now see there are a lot of erroneous 0s in this dataset, e.g. 35 of them in column 3. We have probably forgotten which column is which, so let’s check, using yet another built-in R function:
> colnames(pima)
[1] "pregnant" "glucose" "diastolic" "triceps" "insulin" "bmi"
[7] "diabetes" "age" "test"
Ah, so column 3 was diastolic.
Since some women will indeed have had 0 pregnancies, that column should not be recoded. And the last column states whether the test for diabetes came out positive, 1 for yes, 0 for no, so those 0s are legitimate too.
But 0s in columns 2 through 6 ought to be recoded as NAs. And the fact
that it’s a repetitive action suggests that a for loop can be used
there too:
> for (i in 2:6) pima[pima[,i] == 0,i] <- NA
You’ll probably find this line quite challenging, but be patient and, as with everything in R, you’ll find you can master it.
First, let’s write it in more easily digestible (though a bit more involved) form:
> for (i in 2:6) {
+    zeroIndices <- which(pima[,i] == 0)
+    pima[zeroIndices,i] <- NA
+ }
You can enter the code for a loop or function etc. line by line at the
prompt, as we’ve done here. R helpfully uses its + prompt (which I
did not type) to remind me that I am still in the midst of typing the
code. (After the } I simply hit Enter.)
Here I intended the body of the loop to consist of a block of two
statements, not one, so I needed to tell R that, by typing { before
writing my two statements, then letting R know I was finished with the
block, by typing }.
For your convenience, below is the code itself, no + symbols. You can
copy-and-paste into R, with the result as above.
for (i in 2:6) {
   zeroIndices <- which(pima[,i] == 0)
   pima[zeroIndices,i] <- NA
}
(If you are using RStudio, set up some work space, by selecting File |
New File | R Script. Copy-and-paste the above into the empty pane (named
SOURCE) that is created, and run it, via Code | Run Region | Run All.
If you are using an external text editor, type the code into the editor,
save to a file, say x.R, then at the R > prompt, type
source("x.R").)
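If you would like to try the source() route without creating a file by hand, the following sketch writes a tiny script to a temporary file and then runs it. The script contents and the names scriptFile and total are made up for illustration. Note that the file name is passed to source() as a character string.

```r
# Hypothetical script file; tempfile() keeps your working directory untouched
scriptFile <- tempfile(fileext = ".R")

# Write a two-line script: add up the numbers 1 through 4 with a for loop
writeLines(c(
   "total <- 0",
   "for (i in 1:4) total <- total + i"
), scriptFile)

# Run the script; by default its variables land in your workspace
source(scriptFile)
print(total)   # 10
```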
So, the block (two lines here) will be executed with i = 2, then 3,
4, 5 and 6. The line
zeroIndices <- which(pima[,i] == 0)
determines where the 0s are in column i, and then the line
pima[zeroIndices,i] <- NA
replaces those 0s by NAs.
Tip: Note that I have indented the two lines in the block. This is not required, but it is considered good practice, since it makes the block easy to spot when you or others read the code.
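To experiment with this recoding without touching the real data, here is a minimal sketch on a small made-up data frame. The data frame toy and its values are hypothetical, chosen only to loosely imitate three of the Pima columns.

```r
# Hypothetical data, loosely imitating a few Pima columns
toy <- data.frame(
   pregnant  = c(0, 2, 1, 0),
   glucose   = c(148, 0, 183, 89),
   diastolic = c(72, 66, 0, 0)
)

# Recode 0s in columns 2 and 3 only; 0 pregnancies is a legitimate value
for (i in 2:3) {
   zeroIndices <- which(toy[,i] == 0)   # row numbers where column i is 0
   toy[zeroIndices,i] <- NA             # overwrite just those rows
}

print(toy)

# Count the NAs in each column to confirm the recoding: 0, then 1, then 2
for (i in 1:3) print(sum(is.na(toy[,i])))
```

The 0s in the first column are left alone, since the loop never visits column 1.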
Sometimes our code needs to leave a loop early, which we can do using
the R break construct. Say we are adding the cubes of the numbers
1,2,3,…,n, and for some reason want to determine the first i at which
the running sum exceeds s:
> f
function(n,s)
{
   tot <- 0
   for (i in 1:n) {
      tot <- tot + i^3
      if (tot > s) {
         print(i)
         break
      }
      if (i == n) print('failed')
   }
}
> f(100,345)
[1] 6
> f(5,345)
[1] "failed"
Once the accumulated total exceeds s, we print i and leave the loop; if that has not happened by the time i reaches n, we print 'failed'.
A better approach is to use while loops, covered later in this
tutorial.
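As a preview only, here is one way the same search might look with a while loop. This is a sketch, not the version developed later in the tutorial. The function name firstExceed, and the choice to return NA on failure rather than print 'failed', are assumptions made here for illustration.

```r
# Hypothetical while-loop variant of f(): returns the first i at which
# 1^3 + 2^3 + ... + i^3 exceeds s, or NA if that never happens for i <= n
firstExceed <- function(n, s)
{
   tot <- 0
   i <- 0
   # keep going only while the goal is unmet and terms remain
   while (tot <= s && i < n) {
      i <- i + 1
      tot <- tot + i^3
   }
   if (tot > s) i else NA
}

print(firstExceed(100, 345))   # 6, the same answer f(100,345) printed
print(firstExceed(5, 345))     # NA, the case where f(5,345) printed "failed"
```

Here the loop condition states the stopping rule directly, so no break is needed.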
Tip: There is a school of thought among some R enthusiasts that one should avoid writing loops, using something called functional programming. We will cover this in Lesson 28, but I do not recommend it for R beginners. As the name implies, functional programming uses functions, and it takes a while for most R beginners to master writing functions. It makes no sense to force beginners to use functional programming before they really can write function code well. I myself, with my several decades as a coder, write some code with loops and some with functional programming. Write in whatever style you feel comfortable with, rather than being a “slave to fashion.”
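For the curious, and only as a glimpse of what the debate is about, here is the zero-counting task written both ways on a small made-up data frame (toy and its values are hypothetical). The sapply() call is one common functional-style alternative to the loop; nothing here is needed for the lessons that follow.

```r
# Hypothetical data frame with a few 0s scattered in it
toy <- data.frame(a = c(0, 1, 0), b = c(3, 0, 4), c = c(5, 6, 7))

# Loop style, as in this lesson: store the count for each column
counts <- rep(0, 3)
for (i in 1:3) counts[i] <- sum(toy[,i] == 0)
print(counts)   # 2 1 0

# Functional style: apply the same counting function to every column
fpCounts <- sapply(toy, function(col) sum(col == 0))
print(fpCounts)   # same counts, labeled with the column names
```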