Lesson 31: Linear Regression Analysis, II

Continuing our look at linear regression analysis using R, let’s look at the famous bike sharing data. (See the latter site for the documentation; e.g. temperature has been scaled, rather than measured in degrees.) It’s in a .zip file, so it will need a little extra preprocessing:

# fetch from Web, and store the downloaded data to the file 'bike.sip'
> download.file('https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip','bike.zip')
> unzip('bike.zip')  # out come the files 'day.csv' and 'hour.csv'
> day <- read.csv('day.csv',header=TRUE)
> head(day)  # always take a look around!
  instant     dteday season yr mnth holiday weekday workingday weathersit
1       1 2011-01-01      1  0    1       0       6          0          2
2       2 2011-01-02      1  0    1       0       0          0          2
3       3 2011-01-03      1  0    1       0       1          1          1
4       4 2011-01-04      1  0    1       0       2          1          1
5       5 2011-01-05      1  0    1       0       3          1          1
6       6 2011-01-06      1  0    1       0       4          1          1
      temp    atemp      hum windspeed casual registered  cnt
1 0.344167 0.363625 0.805833 0.1604460    331        654  985
2 0.363478 0.353739 0.696087 0.2485390    131        670  801
3 0.196364 0.189405 0.437273 0.2483090    120       1229 1349
4 0.200000 0.212122 0.590435 0.1602960    108       1454 1562
5 0.226957 0.229270 0.436957 0.1869000     82       1518 1600
6 0.204348 0.233209 0.518261 0.0895652     88       1518 1606

By the way, the weather variables have been rescaled to the interval [0,1]. A value of 0.28, for instance, means 28% of the way from the minimum to the maximum value of this variable.

One new concept here is the presence of indicator variables, more informally known as dummy variables. These are variables taking only the values 0 and 1, with a 1 “indicating” that some trait is present. For instance, the holiday variable is either 1 or 0, depending on whether that day was a holiday (which might affect the demand for bikes that day).

Those who manage this bike sharing service may wish to predict future demand for bikes, say the next day, to aid in their planning. As an example, let’s try to predict the number of casual riders from some weather variables and the dummy variable workingday.

> day1 <- day[,c(8,10,12:14)]
> head(day1)
  workingday     temp      hum windspeed casual
1          0 0.344167 0.805833 0.1604460    331
2          0 0.363478 0.696087 0.2485390    131
3          1 0.196364 0.437273 0.2483090    120
4          1 0.200000 0.590435 0.1602960    108
5          1 0.226957 0.436957 0.1869000     82
6          1 0.204348 0.518261 0.0895652     88
> lmout <- lm(casual ~ .,data=day1)
> lmout
...
Coefficients:
(Intercept)   workingday         temp          hum    windspeed  
     1063.6       -806.6       2149.5       -812.7      -1145.3  
 

The expression casual ~ . means, “regress casual against all the other variables in this dataset.

These numbers make sense. The negative coefficient for workingday says that, all else equal, there tend to be fewer casual riders on a work day.

By the way, we probably should expect fewer riders on very cold or very hot days, so we may wish to add a quadratic term to the model, say by doing

day1$temp2 <- temp^2  # the caret symbol means exponentiation, 
                      # i.e. 2nd power here

This would add the indicated column to day1. But we will not pursue this for now.

One of the very important features of R is generic functions. These are functions that take on different roles for objects of different classes. One such example is the plot function we saw earlier.

Try typing plot(lmout) at the R prompt. You will be shown several plots desribing the fitted regression model. What happened was that the function plot is just a placeholder. When we type plot(lmout) R says, “Hmm, what kind of object is lmout? Oh, it’s of class 'lm'. So I’m going to transfer (dispatch) this call to one involving a special plot function for that class, plot.lm.” This is in contrast to our previous calls to plot, which were invoked on vectors; those calls were dispatched to plot.default.

Another generic function is summary:

> summary(Nile)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  456.0   798.5   893.5   919.4  1032.5  1370.0 
> summary(lmout)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1063.55     101.37  10.492  < 2e-16 ***
workingday   -806.63      33.41 -24.143  < 2e-16 ***
temp         2149.52      86.32  24.901  < 2e-16 ***
hum          -812.74     112.98  -7.194 1.57e-12 ***
windspeed   -1145.31     208.55  -5.492 5.51e-08 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1
...

In the first case, the call to summary, invoked on a vector, was dispatched to summary.default, while in the second the transfer was to summary.lm. In both cases, whoever it was in the R development team who wrote these functions decided what summary information should be printed out automatically.

Again, the purpose of this tutorial is to present R, not statistics. The interested reader should consult a statistics book regarding p-values and confidence intervals. The former are show in the last column of the above summary. An approximate 95% confidence interval for, say, the population coefficient of humidity is -812.74 plus or minus 1.96 times the standard error, 112.98. Note carefully that p-values have long been considered to be poor methodology; see the ASA statement.

Another important generic function is predict. Say we want to predict casual for a work day in which temp, hum and windspeed are 0.26, 0.55, 0.18, respectively.

> newCase <- data.frame(workingday=1, temp=0.26, hum=0.55, windspeed=0.18)
> predict(lmout,newCase)
       1 
162.6296 

The predict function, which here is predict.lm, assumes that the new cases to be predicted are supplied as a data frame, with the same column names as with the original data.