(The reader should cover this lesson especially slowly and carefully. The concepts are simple, but putting them together requires careful inspection.)
First, let’s review what we saw in a previous lesson:
> which1200 <- which(Nile > 1200)
> Nile[which1200]
[1] 1210 1230 1370 1210 1250 1260 1220
There, we saw how to extract vector elements. We can do something similar to extract data frame rows or columns. Here is how:
Continuing the Vitamin C example, let’s compare mean tooth length for the two types of supplements. Here is the code:
> whichOJ <- which(tg$supp == 'OJ')
> whichVC <- which(tg$supp == 'VC')
> mean(tg[whichOJ,1])
[1] 20.66333
> mean(tg[whichVC,1])
[1] 16.96333
In the first two lines above, we found which rows in tg
(or
equivalently, which elements in tg$supp
) had the OJ supplement, and
recorded those row numbers in whichOJ
. Then we did the same for VC.
Now, look at the expression tg[whichOJ,1]
. Remember, data frames
are accessed with two subscript expressions, one for rows, one for
colums, in the format
d[the rows we want, the columns we want]
So, tg[whichOJ,1]
says to restrict attention to the OJ rows, and
only column 1, tooth length. We then find the mean of those restricted
numberss. This turned out to be 20.66333. Then do the same for VC.
Again, if we are pretty experienced, we can skip steps:
> tgoj <- tg[tg$supp == 'OJ',]
> tgvc <- tg[tg$supp == 'VC',]
> mean(tgoj$len)
[1] 20.66333
> mean(tgvc$len)
[1] 16.96333
Either way, we have the answer to our original question: Orange juice appeared to produce more growth than Vitamin C. (Of course, one might form a confidence interval for the difference etc.)
Just as we learned earlier how to use a sequence of TRUE and FALSE values to extract a parts of a vector, we now see how to do the analogous thing for data frames: We can use a sequence of TRUE and FALSE values to extract a certain rows or columns from a data frame.
It is imperative that the reader fully understand this lesson before continuing, trying some variations of the above example on his/her own. We’ll be using this technique often in this tutorial, and it is central to R usage in the real world.
Your turn: Try some of these operations on R’s built-in
faithful
dataset. For instance, find the number of eruptions for which the waiting time was more than 80 minutes.