We will begin by examining some numerical and graphical summaries of
the Smarket
data, which is part of the ISLR2
library. This data set consists of percentage returns for the S&P
500 stock index over \(1,250\)~days,
from the beginning of 2001 until the end of 2005. For each date, we have
recorded the percentage returns for each of the five previous trading
days, lagone
through lagfive
. We have also
recorded volume
(the number of shares traded on the
previous day, in billions), Today
(the percentage return on
the date in question) and direction
(whether the market was
Up
or Down
on this date). Our goal is to
predict direction
(a qualitative response) using the other
features.
library(ISLR2)
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
dim(Smarket)
## [1] 1250 9
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
## Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
## Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
## Lag4 Lag5 Volume Today
## Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
## 1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
## Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
## Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
## 3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
## Direction
## Down:602
## Up :648
##
##
##
##
pairs(Smarket)
The cor()
function produces a matrix that contains all
of the pairwise correlations among the predictors in a data set. The
first command below gives an error message because the
direction
variable is qualitative.
cor(Smarket)
## Error in cor(Smarket): 'x' doit être numérique
cor(Smarket[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 0.029699649 0.030596422 0.033194581 0.035688718
## Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670 -0.010853533
## Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000 -0.024051036
## Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036 1.000000000
## Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647 0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today 0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
## Lag5 Volume Today
## Year 0.029787995 0.53900647 0.030095229
## Lag1 -0.005674606 0.04090991 -0.026155045
## Lag2 -0.003557949 -0.04338321 -0.010250033
## Lag3 -0.018808338 -0.04182369 -0.002447647
## Lag4 -0.027083641 -0.04841425 -0.006899527
## Lag5 1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315 1.00000000 0.014591823
## Today -0.034860083 0.01459182 1.000000000
As one would expect, the correlations between the lag variables and
today’s returns are close to zero. In other words, there appears to be
little correlation between today’s returns and previous days’ returns.
The only substantial correlation is between Year
and
volume
. By plotting the data, which is ordered
chronologically, we see that volume
is increasing over
time. In other words, the average number of shares traded daily
increased from 2001 to 2005.
attach(Smarket)
plot(Volume)
Next, we will fit a logistic regression model in order to predict
direction
using lagone
through
lagfive
and volume
. The glm()
function can be used to fit many types of generalized linear models ,
including logistic regression. The syntax of the glm()
function is similar to that of lm()
, except that we must
pass in the argument family = binomial
in order to tell
R
to run a logistic regression rather than some other type
of generalized linear model.
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial
)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
The smallest \(p\)-value here is
associated with lagone
. The negative coefficient for this
predictor suggests that if the market had a positive return yesterday,
then it is less likely to go up today. However, at a value of \(0.15\), the \(p\)-value is still relatively large, and so
there is no clear evidence of a real association between
lagone
and direction
.
We use the coef()
function in order to access just the
coefficients for this fitted model. We can also use the
summary()
function to access particular aspects of the
fitted model, such as the \(p\)-values
for the coefficients.
coef(glm.fits)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## -0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
## Volume
## 0.135440659
summary(glm.fits)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3 0.011085108 0.04993854 0.2219750 0.8243333
## Lag4 0.009358938 0.04997413 0.1872757 0.8514445
## Lag5 0.010313068 0.04951146 0.2082966 0.8349974
## Volume 0.135440659 0.15835970 0.8552723 0.3924004
summary(glm.fits)$coef[, 4]
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974
## Volume
## 0.3924004
The predict()
function can be used to predict the
probability that the market will go up, given values of the predictors.
The type = "response"
option tells R
to output
probabilities of the form \(P(Y=1|X)\),
as opposed to other information such as the logit. If no data set is
supplied to the predict()
function, then the probabilities
are computed for the training data that was used to fit the logistic
regression model. Here we have printed only the first ten probabilities.
We know that these values correspond to the probability of the market
going up, rather than down, because the contrasts()
function indicates that R
has created a dummy variable with
a 1 for Up
.
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292
## 9 10
## 0.5176135 0.4888378
contrasts(Direction)
## Up
## Down 0
## Up 1
In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted probabilities
into class labels, Up
or Down
. The following
two commands create a vector of class predictions based on whether the
predicted probability of a market increase is greater than or less than
\(0.5\).
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"
The first command creates a vector of 1,250 Down
elements. The second line transforms to Up
all of the
elements for which the predicted probability of a market increase
exceeds \(0.5\). Given these
predictions, the table()
function can be used to produce a
confusion matrix in order to determine how many observations were
correctly or incorrectly classified. %By inputting two qualitative
vectors R will create a two by two table with counts of the number of
times each combination occurred e.g. predicted {} and market increased,
predicted {} and the market decreased etc.
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 145 141
## Up 457 507
(507 + 145) / 1250
## [1] 0.5216
mean(glm.pred == Direction)
## [1] 0.5216
The diagonal elements of the confusion matrix indicate correct
predictions, while the off-diagonals represent incorrect predictions.
Hence our model correctly predicted that the market would go up on \(507\)~days and that it would go down on
\(145\)~days, for a total of \(507+145 = 652\) correct predictions. The
mean()
function can be used to compute the fraction of days
for which the prediction was correct. In this case, logistic regression
correctly predicted the movement of the market \(52.2\),% of the time.
At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of \(1,250\) observations. In other words, \(100\%-52.2\%=47.8\%\), is the training error rate. As we have seen previously, the training error rate is often overly optimistic—it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.
To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
## [1] 252 9
Direction.2005 <- Direction[!train]
The object train
is a vector of \(1{,}250\) elements, corresponding to the
observations in our data set. The elements of the vector that correspond
to observations that occurred before 2005 are set to TRUE
,
whereas those that correspond to observations in 2005 are set to
FALSE
. The object train
is a Boolean
vector, since its elements are TRUE
and FALSE
.
Boolean vectors can be used to obtain a subset of the rows or columns of
a matrix. For instance, the command Smarket[train, ]
would
pick out a submatrix of the stock market data set, corresponding only to
the dates before 2005, since those are the ones for which the elements
of train
are TRUE
. The !
symbol
can be used to reverse all of the elements of a Boolean vector. That is,
!train
is a vector similar to train
, except
that the elements that are TRUE
in train
get
swapped to FALSE
in !train
, and the elements
that are FALSE
in train
get swapped to
TRUE
in !train
. Therefore,
Smarket[!train, ]
yields a submatrix of the stock market
data containing only the observations for which train
is
FALSE
—that is, the observations with dates in 2005. The
output above indicates that there are 252 such observations.
We now fit a logistic regression model using only the subset of the
observations that correspond to dates before 2005, using the
subset
argument. We then obtain predicted probabilities of
the stock market going up for each of the days in our test set—that is,
for the days in 2005.
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial, subset = train
)
glm.probs <- predict(glm.fits, Smarket.2005,
type = "response")
Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
## Direction.2005
## glm.pred Down Up
## Down 77 97
## Up 34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
mean(glm.pred != Direction.2005)
## [1] 0.5198413
The !=
notation means not equal to, and so the
last command computes the test set error rate. The results are rather
disappointing: the test error rate is \(52\),%, which is worse than random
guessing! Of course this result is not all that surprising, given that
one would not generally expect to be able to use previous days’ returns
to predict future market performance.