Right after vectors, the next major workhorse of R is the data frame. It’s a rectangular table consisting of one row for each data point.
Say we have height, weight and age on each of 100 people. Our data frame would have 100 rows and 3 columns. The entry in, e.g., the second row and third column would be the age of the second person in our data. The second row as a whole would be all the data for that second person, i.e. the height, weight and age of that person.
Note that that row would also be considered a vector. The third column as a whole would be the vector of all ages in our dataset.
As our first example, consider the ToothGrowth dataset built into
R. Again, you can read about it in the online help by typing

``` r
> ?ToothGrowth
```

(The data turn out to be on guinea pigs, with orange juice or
Vitamin C as growth supplements.) Let's take a quick look from the
command line.
``` r
> head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
```

R’s head function displays (by default) the first 6 rows of the
given data frame. We see there are length, supplement and dosage
columns, which the curator of the data decided to name len, supp and
dose. Each column is an R vector (or, in the case of the second
column, a vector-like object called a factor, to be discussed
shortly).
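To check the column types for yourself, here is a minimal sketch. It uses only base R functions (class, levels, str) that do not appear in the text above, so treat it as an optional aside rather than part of the main walkthrough.

```r
# ToothGrowth ships with base R, so no setup is needed.

# Each column can be pulled out with $ and inspected on its own.
class(ToothGrowth$len)    # "numeric"
class(ToothGrowth$supp)   # "factor"
class(ToothGrowth$dose)   # "numeric"

# A factor stores a fixed set of categories, called levels.
levels(ToothGrowth$supp)  # "OJ" "VC"

# str() prints a compact summary of every column at once.
str(ToothGrowth)
```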
Tip: To avoid writing out the long words repeatedly, it’s handy to make a copy with a shorter name.

``` r
> tg <- ToothGrowth
```

Dollar signs are used to denote the individual columns, e.g. ToothGrowth$dose for the dose column. So for instance, we can print out the mean tooth length:

``` r
> mean(tg$len)
[1] 18.81333
```

Subscripts/indices in data frames are pairs, specifying row and column numbers. To get the element in row 3, column 1:

``` r
> tg[3,1]
[1] 7.3
```

which matches what we saw above in our head example. Or, use the
fact that tg$len is a vector:

``` r
> tg$len[3]
[1] 7.3
```

The element in row 3, column 1 in the data frame tg is element 3
in the vector tg$len. This duality between data frames and
vectors is often exploited in R.
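The duality can be checked directly. The sketch below recreates the copy tg so that it runs on its own, and adds two forms not used in the text, tg[3, "len"] and tg[["len"]][3], purely as further illustrations of reaching the same cell.

```r
tg <- ToothGrowth

a <- tg[3, 1]                  # row 3, column 1 of the data frame
b <- tg$len[3]                 # element 3 of the vector tg$len
by_name <- tg[3, "len"]        # same cell, naming the column instead of numbering it
by_brackets <- tg[["len"]][3]  # [[ ]] also extracts a column as a vector

a             # 7.3
a == b        # TRUE
a == by_name  # TRUE
```

All four expressions refer to the same number; which form to use is largely a matter of readability.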
Your Turn: The above examples are fundamental to R, so you should conduct a few small experiments on your own this time, little variants of the above. The more you do, the better!
For any data frame d, we can extract whatever subset of rows and
columns we want using the format

``` r
d[the rows we want, the columns we want]
```

Some data frames don’t have column names, but that is no obstacle. We can use column numbers, e.g.

``` r
> mean(tg[,1])
[1] 18.81333
```

Note the expression [,1]. Since there is a 1 in the second position,
we are talking about column 1. And since the first position, before the
comma, is empty, no rows are specified, so all rows are included.
That boils down to: all of column 1.
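As a small sketch of the empty-field idea in both directions: the block below leaves the row field empty to get a whole column, then leaves the column field empty to get a whole row. The variable names col1 and row2 are made up for illustration.

```r
tg <- ToothGrowth

names(tg)        # the column names: "len" "supp" "dose"

col1 <- tg[, 1]  # row field empty: all rows of column 1, as a plain vector
mean(col1)       # same value as mean(tg$len)

row2 <- tg[2, ]  # column field empty: all columns of row 2
row2             # printed as a one-row data frame
```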
A key feature of R is that one can extract subsets of data frames, just as we extracted subsets of vectors earlier. For instance,

``` r
> z <- tg[2:5,c(1,3)]
> z
   len dose
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5
```

Here we extracted rows 2 through 5, and columns 1 and 3, assigning the
result to z. To extract those columns but keep all rows, do

``` r
> y <- tg[ ,c(1,3)]
```

i.e. leave the row specification field empty.
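Columns can also be selected by name rather than by number, which is not shown above. A minimal sketch, with z2 as a made-up name for the second result:

```r
tg <- ToothGrowth

z  <- tg[2:5, c(1, 3)]            # rows 2-5, columns 1 and 3 by number
z2 <- tg[2:5, c("len", "dose")]   # same rows, same columns by name

all.equal(z, z2)   # the two results agree
dim(z)             # 4 2, i.e. 4 rows and 2 columns
```

Selecting by name tends to be more robust if the column order of a dataset ever changes.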
By the way, note that the three columns are all of the same length, a
requirement for data frames. And what is that common length in this
case? R’s nrow function tells us the number of rows in any data
frame:

``` r
> nrow(ToothGrowth)
[1] 60
```

Ah, 60 rows (60 guinea pigs, 3 measurements each).
Or, alternatively:

``` r
> tg <- ToothGrowth
> length(tg$len)
[1] 60
> length(tg$supp)
[1] 60
> length(tg$dose)
[1] 60
```

So now you know four ways to do the same thing. But isn’t one enough? Of course. But in this get-acquainted period, reading all four will help reinforce the knowledge you are now accumulating about R. So, make sure you understand how each of those four approaches produced the number 60.
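A few related queries, sketched here as an aside. Note one possible surprise: applied to a whole data frame (rather than to one of its columns), length counts columns, not rows.

```r
tg <- ToothGrowth

nrow(tg)     # 60 rows
ncol(tg)     # 3 columns
dim(tg)      # both at once: 60 3

# length() of a whole data frame counts its columns.
length(tg)   # 3
```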
The head function works on vectors too:

``` r
> head(ToothGrowth$len)
[1]  4.2 11.5  7.3  5.8  6.4 10.0
```

Like many R functions, head has an optional second argument,
specifying how many elements to print:

``` r
> head(ToothGrowth$len,10)
 [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0
```

You can create your own data frames (good for devising little tests of your understanding) as follows:

``` r
> x <- c(5,12,13)
> y <- c('abc','de','z')
> d <- data.frame(x,y)
> d
   x   y
1  5 abc
2 12  de
3 13   z
```

Look at that second line! Instead of vectors consisting of numbers, one can form vectors of character strings, complete with indexing capability, e.g.

``` r
> y <- c('abc','de','z')
> y[2]
[1] "de"
```

As noted, all the columns in a data frame must be of the same length.
Here x consists of 3 numbers, and y consists of 3 character
strings. (The string is the unit in the latter. The number of
characters in each string is irrelevant.)
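The point about lengths can be tried out with a short sketch. Two things here are assumptions beyond the text: the argument stringsAsFactors = FALSE is passed explicitly so the block behaves the same on older R versions (where strings were converted to factors by default), and the column w is a made-up example of a vector with the wrong length.

```r
x <- c(5, 12, 13)
y <- c('abc', 'de', 'z')

length(y)   # 3: the number of strings
nchar(y)    # 3 2 1: characters per string, irrelevant to the data frame

d <- data.frame(x, y, stringsAsFactors = FALSE)
names(d)    # "x" "y", taken from the variable names
d$y[2]      # "de"

# A column of length 2 cannot be combined with one of length 3.
# try() catches the resulting error so the script keeps running.
bad <- try(data.frame(x, w = c('a', 'b')), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```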
One can use negative indices for rows and columns as well, e.g.

``` r
> z <- tg[,-2]
> head(z)
   len dose
1  4.2  0.5
2 11.5  0.5
3  7.3  0.5
4  5.8  0.5
5  6.4  0.5
6 10.0  0.5
```

Here the -2 means "all columns except column 2," just as with negative indices in vectors.

Your Turn: Devise your own little examples with the ToothGrowth data. For instance, write code that finds the number of cases in which the tooth length was less than 16. Also, try some examples with another built-in R dataset, faithful. This one involves the Old Faithful geyser in Yellowstone National Park in the US. The first column gives the duration of the eruption, and the second has the waiting time until the next eruption. As mentioned, these operations are key features of R, so devise and run as many examples as possible; err on the side of doing too many!
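As a possible starting point for experiments with faithful, here is a brief sketch. It also shows one behavior worth knowing when using negative indices: if only a single column is left, R returns a plain vector unless drop = FALSE is given. The names w and w_df are made up for illustration.

```r
names(faithful)   # "eruptions" "waiting"
dim(faithful)     # 272 2

# Excluding column 1 leaves a single column, which R returns as a vector.
w <- faithful[, -1]
is.vector(w)      # TRUE

# drop = FALSE keeps the result as a one-column data frame.
w_df <- faithful[, -1, drop = FALSE]
is.data.frame(w_df)   # TRUE
```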
As mentioned, the data frame is the fundamental workhorse of R. It is made up of columns, each of which is a vector (all of equal length), a fact that often comes in handy.
Unlike the single-number indices of vectors, each element in a data frame has 2 indices, a row number and a column number. One can specify sets of rows and columns to extract subframes.
One can use the R nrow function to query the number of rows in a
data frame; ncol does the same for the number of columns.