fasteR

Lesson 24: Baseball Player Analysis (cont’d.)

This lesson will be a little longer and more detail-oriented. But it will give you more practice on a number of earlier topics, and will also bring in some new R functions for you. Spending extra time on this lesson will pay substantial dividends.

We might wonder whether the regression lines differ much among player positions. (A more statistical approach would be to include interaction terms in the model.) Let’s first see what positions are tabulated:

> table(mlb$PosCategory)
   Catcher  Infielder Outfielder    Pitcher 
        76        210        194        535

Let’s fit the regression lines separately for each position type.

There are various ways to do this, involving avoidance of loops to various degrees. But we’ll keep it simple, which will be clearer.

First, let’s split the data by position. You might at first think this is easily done using the split function, but that doesn’t work, since that function is for splitting vectors. Here we wish to split a data frame.

So what can be done instead? We need to think creatively here. One solution is this:

We need to determine the row numbers of the catchers, the row numbers of the infielders and so on. So we can take all the row numbers, 1:nrow(mlb), and apply split to that vector!

> rownums <- split(1:nrow(mlb),mlb$PosCategory)

Tip: As usual, following an intricate operation like this, we should glance at the result:

> str(rownums)
List of 4
 $ Catcher   : int [1:76] 1 2 3 35 36 66 67 68 101 102 ...
 $ Infielder : int [1:210] 4 5 6 7 8 9 37 38 39 40 ...
 $ Outfielder: int [1:194] 10 11 12 13 14 15 16 43 44 45 ...
 $ Pitcher   : int [1:535] 17 18 19 20 21 22 23 24 25 26 ...

So the output is an R list; no surprise there, as we knew before before that split produces an R list. Also not surprisingly, the elements of the list are named Catcher etc. So for example, the third outfielder is in row 12 of the data frame.

Tip: The idea here, using split on 1:nrow(mlb), was a bit of a trick. Actually, it is a common ploy for experienced R coders, but you might ask, “How could a novice come up with this idea?” The answer, as noted several times already here, is that programming is a creative process. Creativity may not come quickly! In some case, one might need to mull over a problem for a long time before coming up with a solution. Don’t give up! The more you think about a problem, the more skilled you will get, even if you sometimes come up empty-handed. And of course, there are many forums on the Web at which you can ask questions, e.g. Stack Overflow.

Tip: Now, remember, a nice thing about R lists is that we can reference their elements in various ways. The first element above, for instance, is any of rownums$Catcher, rownums[['Catcher']] and rownums[[1]], This versatility is great, as for example we can use the latter two forms to write loops.

And a loop is exactly what we need here. We want to call lm four times, once for each position. We could do this, say, with a loop beginning with

for (i in 1:4)

to iterate through the four position types, but it will be clearer if we use the names:

for (pos in c('Catcher','Infielder','Outfielder','Pitcher'))

Just an ordinary for loop. Recall such loops are of the form

for (variable in vector)...

Instead of having a numeric vector, e.g. the 1:4 above, we now have a character vector, which each element of the vector being a character string, but the principles are the same.

We could have lm and print calls in the body of the loop. But let’s be a little fancier, building up a data frame with the output. We’ll start with an empty frame, and keep adding rows to it.

Our code is

posNames <- c('Catcher','Infielder','Outfielder','Pitcher')
m <- data.frame()
for (pos in posNames) {
   posRows <- rownums[[pos]]
   lmo <- lm(Weight ~ Age, data = mlb[posRows,])
   newrow <- lmo$coefficients
   m <- rbind(m,newrow)
}

Here is the output:

> m
  X180.828029016113 X0.794925225995348
1          180.8280          0.7949252
2          170.2466          0.8589593
3          176.2884          0.7883343
4          185.5994          0.6543904

Some key things to note here.

The overall strategy is to start with an empty data frame, then keep adding rows to it, one row of regression coefficients per playing position.
In order to add rows to m, we used R’s rbind (“row bind”) function. The expression rbind(m,newrow) forms a new data frame, by tacking newrow onto m. Here we reassign the result back to m, also a common operation. (Note carefully: The rbind operation did not change m; it merely created a new data frame. To update m, we needed to assign that new data frame to m.) By the way, there is also a cbind function for columns.
In the call to lm, we used mlb[rownums[[pos]],] instead of mlb as previously, since here we wanted to fit a regression line on each position subgroup. So, we restricted attention to only those rows of mlb for which the position was equal to the current value of pos.

So, what happens is: m is initially an empty data frame. Then the loop, for its first iteration, sets pos to Catcher. Then a regression line will be fit to the rows of mlb that are for catchers. That line is returned to us from lm, and we assign it to lmo. (Once again, the name is arbitrary; I chose this one to symbolize “lm output.“) We extract the coefficients and tack them on at the end of m.

Tip: This is a very common design pattern in R (and most other languages)

Nice output, with the two columns aligned. But those column names are awful, and the row labels should be nicer than 1,2,3,4. We can fix these things:

> row.names(m) <- posNames
> names(m) <- c('intercept','slope')
> m
           intercept     slope
Catcher     180.8280 0.7949252
Infielder   170.2466 0.8589593
Outfielder  176.2884 0.7883343
Pitcher     185.5994 0.6543904

What happened here? We earlier saw the built-in row.names function, so that setting row names was easy. But what about the column names? Recall that a data frame is actually an R list, consisting of several vectors of the same length, which form the columns. So, names(m) is the names of the columns.

So with a little finessing here, we got some nicely-formatted output. Moreover, we now have our results in a data frame for further use. For instance, we may wish to plot the four lines on the same graph, and we would use rows of the data frame as input.

A little more finessing is possible. Look at the line

posNames <- c('Catcher','Infielder','Outfielder','Pitcher')

We’re using a computer! We shouldn’t have to type out these names by hand, as I did in this line. In fact, we already have them in one of our R objects, rownums; recall our earlier check:

> str(rownums)
List of 4
 $ Catcher   : int [1:76] 1 2 3 35 36 66 67 68 101 102 ...
 $ Infielder : int [1:210] 4 5 6 7 8 9 37 38 39 40 ...
 $ Outfielder: int [1:194] 10 11 12 13 14 15 16 43 44 45 ...
 $ Pitcher   : int [1:535] 17 18 19 20 21 22 23 24 25 26 ...

The elements of the R list rownums are the names of the positions! So, the better way to set posNames is

posNames <- names(rownums)

Tip: Again, the reader may be thinking, “How in the world would I have been able to realize this?” Again, the answer is that as you acquire more experience in coding, you will be more and more ability to come up with insights like this. Patience!

Finally, what about those numerical results? There is substantial variation in those estimated slopes, but again, they are only estimates. The question of whether there is substantial variation at the population level is one of statistical inference, beyond the scope of this R course.

← PREVIOUS

S3 classes

R Packages, CRAN, Etc.