Lesson 17: ‘For’ Loops

Recall that earlier we found that there were several columns in the Pima dataset that contained values of 0, which were physiologically impossible. These should be coded NA. We saw how to do that recoding for the glucose variable:

> pima$glucose[pima$glucose == 0] <- NA

But there are several columns like this, and we’d like to avoid repeating this by hand for each one. (What if there were several hundred such columns?) Instead, we’d like to do this programmatically. This can be done with R’s for loop construct (which, by the way, most programming languages have as well).

Let’s first check which columns seem appropriate for recoding. Recall that there are 9 columns in this data frame.

> for (i in 1:9) print(sum(pima[,i] == 0))
[1] 111
[1] 5
[1] 35
[1] 227
[1] 374
[1] 11
[1] 0
[1] 0
[1] 500

This is known in the programming world as a for loop.

The print(etc.) is called the body of the loop. The for (i in 1:9) part says, “Execute the body of the loop with i = 1, then execute it with i = 2, then i = 3, etc. up through i = 9.”

In other words, the above code instructs R to do the following:

i <- 1
print(sum(pima[,i] == 0))
i <- 2
print(sum(pima[,i] == 0))
i <- 3
print(sum(pima[,i] == 0))
i <- 4
print(sum(pima[,i] == 0))
i <- 5
print(sum(pima[,i] == 0))
i <- 6
print(sum(pima[,i] == 0))
i <- 7
print(sum(pima[,i] == 0))
i <- 8
print(sum(pima[,i] == 0))
i <- 9
print(sum(pima[,i] == 0))

And this amounts to doing

print(sum(pima[,1] == 0))
print(sum(pima[,2] == 0))
print(sum(pima[,3] == 0))
print(sum(pima[,4] == 0))
print(sum(pima[,5] == 0))
print(sum(pima[,6] == 0))
print(sum(pima[,7] == 0))
print(sum(pima[,8] == 0))
print(sum(pima[,9] == 0))

Now, it’s worth reviewing what those statements do, say the first. Once again, pima[,1] == 0 yields a vector of TRUEs and FALSEs, each indicating whether the corresponding element of column 1 is 0. When we call sum, TRUEs and FALSEs are treated as 1s and 0s, so we get the total number of TRUEs — which is a count of the number of elements in that column that are 0, exactly what we wanted.
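To see the TRUE/FALSE counting trick in isolation, here is a minimal sketch on a small made-up vector (the values in x are invented for illustration and are not from the Pima data):

```r
# Toy vector standing in for one column of a data frame (made-up values)
x <- c(0, 5, 0, 7, 0)

# Comparing with 0 gives one TRUE or FALSE per element
isZero <- x == 0
print(isZero)      # TRUE FALSE TRUE FALSE TRUE

# sum() treats TRUE as 1 and FALSE as 0, so this counts the zeros
zeroCount <- sum(isZero)
print(zeroCount)   # 3
```

The names isZero and zeroCount are just ordinary variables chosen for this example.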

The variable i in for (i in 1:9)... is known as the index of the loop. It’s just an ordinary R variable, so name it what you wish. Instead of i, we might name it, say, colNumber.

for (colNumber in 1:9) print(sum(pima[,colNumber] == 0))
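We typed the 9 by hand above. As a sketch of one way to avoid hard-coding the column count, the built-in function ncol() can supply it. The data frame toy below is a made-up stand-in, not the Pima data:

```r
# Hypothetical toy data frame with 3 columns (values invented for illustration)
toy <- data.frame(a = c(0, 1, 2), b = c(0, 0, 3), d = c(4, 5, 6))

# ncol(toy) is 3 here, so the loop runs with colNumber = 1, 2, 3
for (colNumber in 1:ncol(toy)) print(sum(toy[,colNumber] == 0))
```

With these made-up values the loop prints 1, then 2, then 0, the number of 0s in each column.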

A technical point: Why did we need the explicit call to print? Didn’t we say earlier that just typing an expression at the R > prompt will automatically print out the value of the expression? Ah yes — but we are not at the R prompt here! Yes, in the expanded form we see above,

print(sum(pima[,1] == 0))
print(sum(pima[,2] == 0))
print(sum(pima[,3] == 0))
print(sum(pima[,4] == 0))
print(sum(pima[,5] == 0))
print(sum(pima[,6] == 0))
print(sum(pima[,7] == 0))
print(sum(pima[,8] == 0))
print(sum(pima[,9] == 0))

each command would be issued at the prompt. But in the for loop version

for (i in 1:9) print(sum(pima[,i] == 0))

we are calling print() from within the loop, not at the prompt. So, the explicit call to print() is needed.
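A quick way to convince yourself of this is to try a tiny loop both ways. This is only an illustrative sketch, with squares of small numbers standing in for our zero counts:

```r
# Inside a loop, the value of a bare expression is not printed
for (i in 1:3) i^2          # nothing appears on the screen

# With an explicit print(), each value is shown
for (i in 1:3) print(i^2)   # shows 1, then 4, then 9
```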

We now see there are a lot of erroneous 0s in this dataset, e.g. 35 of them in column 3. We probably have forgotten which column is which, so let’s see, using yet another built-in R function:

> colnames(pima)
[1] "pregnant"  "glucose"   "diastolic" "triceps"   "insulin"   "bmi"      
[7] "diabetes"  "age"       "test"     

Ah, so column 3 was diastolic.
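To avoid matching counts to names by eye, a loop can show each column name next to its count. The sketch below assumes a small made-up data frame named toy (not the Pima data) and uses the built-in function cat(), which prints its arguments separated by spaces:

```r
# Hypothetical toy data frame borrowing three of the Pima column names
toy <- data.frame(pregnant = c(0, 2, 1),
                  glucose  = c(0, 0, 90),
                  age      = c(25, 31, 40))

# colnames(toy)[i] is the name of column i; "\n" ends the output line
for (i in 1:ncol(toy)) cat(colnames(toy)[i], sum(toy[,i] == 0), "\n")
```

Here cat() does the printing, so no call to print() is needed.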

Since some women will indeed have had 0 pregnancies, that column should not be recoded. And the last column states whether the test for diabetes came out positive, 1 for yes, 0 for no, so those 0s are legitimate too.

But 0s in columns 2 through 6 ought to be recoded as NAs. And the fact that it’s a repetitive action suggests that a for loop can be used there too:

> for (i in 2:6) pima[pima[,i] == 0,i] <- NA

You’ll probably find this line quite challenging, but be patient and, as with everything in R, you’ll find you can master it.

First, let’s write it in more easily digestible (though a bit more involved) form:

> for (i in 2:6) {
+    zeroIndices <- which(pima[,i] == 0)
+    pima[zeroIndices,i] <- NA
+ }

You can enter the code for a loop or function etc. line by line at the prompt, as we’ve done here. R helpfully uses its + prompt (which I did not type) to remind me that I am still in the midst of typing the code. (After the } I simply hit Enter.)

Here I intended the body of the loop to consist of a block of two statements, not one, so I needed to tell R that, by typing { before writing my two statements, then letting R know I was finished with the block, by typing }.

For your convenience, below is the code itself, no + symbols. You can copy-and-paste into R, with the result as above.

for (i in 2:6) {
   zeroIndices <- which(pima[,i] == 0)
   pima[zeroIndices,i] <- NA
}

(If you are using RStudio, set up some work space by selecting File | New File | R Script. Copy-and-paste the above into the empty pane (named SOURCE) that is created, and run it via Code | Run Region | Run All. If you are using an external text editor, type the code into the editor, save it to a file, say x.R, then at the R > prompt, type source("x.R"). Note the quotation marks: the file name must be given as a character string.)

So, the block (two lines here) will be executed with i = 2, then 3, 4, 5 and 6. The line

zeroIndices <- which(pima[,i] == 0)

determines where the 0s are in column i, and then the line

pima[zeroIndices,i] <- NA

replaces those 0s by NAs.
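If you would like to watch the two statements at work on something small enough to check by eye, here is a sketch on a made-up data frame. The name toy and its values are invented for illustration. Only columns 2 and 3 are recoded, mirroring the way we leave the pregnant column alone:

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(pregnant = c(0, 2, 1),
                  glucose  = c(0, 85, 90),
                  bmi      = c(30.1, 0, 0))

for (i in 2:3) {
   zeroIndices <- which(toy[,i] == 0)   # row numbers where column i is 0
   toy[zeroIndices,i] <- NA             # overwrite just those rows with NA
}

print(toy)
```

Afterward, the 0 in row 1 of pregnant is still there, while glucose has an NA in row 1 and bmi has NAs in rows 2 and 3. Note that which() reports only the positions where the comparison is TRUE, so any element that is already NA is simply skipped.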

Tip: Note that I have indented the two lines in the block. This is not required, but it is considered good practice, since it makes the block easy to spot when you or others read the code.

Sometimes our code needs to leave a loop early, which we can do using the R break construct. Say we are adding cubes of the numbers 1,2,3,…,n, and for some reason want to find the first value of i at which the running sum exceeds s:

> f
function(n,s) 
{
   tot <- 0
   for (i in 1:n) {
      tot <- tot + i^3
      if (tot > s) {
         print(i)
         break
      }
      if (i == n) print('failed')
   }
}
> f(100,345)
[1] 6
> f(5,345)
[1] "failed"

As soon as our accumulated total exceeds s, we print the current value of i and leave the loop. If we reach i = n without that happening, we print 'failed'.
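To check the answer 6 by hand, it helps to look at the running totals themselves. The sketch below uses the built-in functions cumsum() and which(), which are not part of f and are used here only to verify its output:

```r
# Running totals of the cubes 1^3, 2^3, ..., 6^3
runningTotals <- cumsum((1:6)^3)
print(runningTotals)   # 1 9 36 100 225 441

# Position of the first running total that exceeds 345
firstOver <- which(runningTotals > 345)[1]
print(firstOver)       # 6
```

The total after 5 terms is 225, which does not exceed 345, so f(5,345) never reaches the break and prints 'failed'.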

A better approach is to use while loops, covered later in this tutorial.

Tip: There is a school of thought among some R enthusiasts that one should avoid writing loops, using something called functional programming instead. We will cover this in Lesson 28, but I do not recommend it for R beginners. As the name implies, functional programming uses functions, and it takes a while for most R beginners to master writing functions. It makes no sense to force beginners to use functional programming before they can really write function code well. I myself, with my several decades as a coder, write some code with loops and some with functional programming. Write in whatever style you feel comfortable with, rather than being a “slave to fashion.”