Logistic Regression - Stock Market

# Knowledge Mining: Logistic Regression

# Load ISLR library

require(ISLR)
Loading required package: ISLR
# Check dataset Smarket
?Smarket
names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
[7] "Volume"    "Today"     "Direction"
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
# Create a dataframe for data browsing
sm=Smarket

# Pairwise scatterplots of all variables, colored by Direction
pairs(Smarket, col = Smarket$Direction, cex = .5, pch = 20)

# Logistic regression
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data=Smarket,family=binomial)
summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
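The coefficients above are on the log-odds scale; exponentiating turns them into odds ratios. A quick check for Lag1:

```r
# Coefficients from summary(glm.fit) are log-odds. Exponentiating gives the
# multiplicative change in the odds of an "Up" day per one-unit increase.
exp(-0.073074)   # Lag1 odds ratio, about 0.93: odds shrink roughly 7%
```

Note that none of the coefficients is significant at the 5% level, which already hints at how weak the signal is.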
glm.probs=predict(glm.fit,type="response") 
glm.probs[1:5]
        1         2         3         4         5 
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 
glm.pred=ifelse(glm.probs>0.5,"Up","Down")
attach(Smarket)
table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
mean(glm.pred==Direction)
[1] 0.5216
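The 52% figure is just the diagonal of the confusion matrix above divided by the total number of observations:

```r
# Correct predictions (Down/Down and Up/Up) over all 1250 observations
(145 + 507) / (145 + 141 + 457 + 507)  # 652/1250 = 0.5216
```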
# Make training and test set for prediction
train = Year<2005
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data=Smarket,family=binomial, subset=train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 
glm.pred=ifelse(glm.probs >0.5,"Up","Down")
Direction.2005=Smarket$Direction[!train]
table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down Up
    Down   77 97
    Up     34 44
mean(glm.pred==Direction.2005)
[1] 0.4801587
#Fit smaller model
glm.fit=glm(Direction~Lag1+Lag2,
            data=Smarket,family=binomial, subset=train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,],type="response") 
glm.pred=ifelse(glm.probs >0.5,"Up","Down")
table(glm.pred,Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
mean(glm.pred==Direction.2005)
[1] 0.5595238
# Accuracy among "Up" predictions (precision for the Up class)
106/(76+106)
[1] 0.5824176
# Interpretation:
## Fit on the pre-2005 data with all six predictors, the model predicted the
## 2005 market direction correctly only 48% of the time - worse than guessing.
## The smaller model using only Lag1 and Lag2 predicts direction with 56%
## accuracy. This higher accuracy suggests that Lag1 and Lag2 carry most of
## whatever signal the predictors have about market direction.
## When the smaller model predicts "Up" it is right 58% of the time, so it is
## somewhat better at calling up days than at calling direction overall.

Part 2

a. What is/are the requirement(s) of LDA?

  1. the predictors should follow a (multivariate) normal distribution within each class
  2. observations should be independent of each other
  3. the covariance matrix of the predictors is assumed to be identical in each class
  4. together, these assumptions imply a linear decision boundary between the classes
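A minimal sketch of fitting LDA in R, on simulated data that satisfies these assumptions by construction (assumes the recommended MASS package; the variable names are illustrative):

```r
library(MASS)

# Two classes with normal predictors and the same covariance, differing
# only in mean - exactly the setting LDA assumes.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # class A
           matrix(rnorm(100, mean = 2), ncol = 2))   # class B, shifted mean
y <- factor(rep(c("A", "B"), each = 50))
d <- data.frame(x1 = x[, 1], x2 = x[, 2], class = y)

fit <- lda(class ~ x1 + x2, data = d)
mean(predict(fit)$class == d$class)  # in-sample accuracy, close to 1
```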

b. How is LDA different from Logistic Regression?

They are both binary classification methods for statistical analysis. There are many differences; here are a few:


-LDA assumes the predictors are normally distributed within each class, whereas logistic regression makes no distributional assumption about the predictors

-Logistic regression is fit by maximum likelihood estimation: it chooses the parameters that maximize the likelihood of the observed data. LDA instead plugs in the estimated class means and a pooled covariance matrix from the assumed normal model, which amounts to maximizing the separation between the classes relative to the variation within them

-When the classes are well separated, or the sample is small and the predictors are roughly normal, logistic regression estimates can be unstable while LDA remains stable; conversely, logistic regression is the safer choice when the normality assumption clearly fails

c. What is ROC?

Receiver Operating Characteristic. An ROC curve is used to measure the performance of a binary classification model. Sensitivity (the true-positive rate) is plotted on the Y-axis and 1 - specificity (the false-positive rate) on the X-axis, with each point on the curve corresponding to a different classification threshold. The area under the curve (AUC) quantifies the performance of the model as a whole: an AUC of 1 indicates a perfect classifier, while 0.5 corresponds to random guessing.
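A minimal base-R sketch of how an ROC curve is built by sweeping a threshold (the function names here are illustrative, not from a package):

```r
# For each threshold, record sensitivity and the false-positive rate.
roc_points <- function(probs, labels) {
  # labels: logical vector, TRUE = positive class (e.g. "Up")
  thresholds <- sort(unique(c(0, probs, 1)), decreasing = TRUE)
  sens <- sapply(thresholds, function(t) mean(probs >= t & labels) / mean(labels))
  fpr  <- sapply(thresholds, function(t) mean(probs >= t & !labels) / mean(!labels))
  data.frame(fpr = fpr, sensitivity = sens)
}

# AUC by the trapezoid rule over the ROC points
auc <- function(roc) {
  o <- order(roc$fpr)
  sum(diff(roc$fpr[o]) *
      (head(roc$sensitivity[o], -1) + tail(roc$sensitivity[o], -1)) / 2)
}

# Toy example: a classifier that ranks all positives first has AUC = 1
r <- roc_points(c(0.9, 0.8, 0.2, 0.1), c(TRUE, TRUE, FALSE, FALSE))
auc(r)  # 1
```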

d. What is sensitivity and specificity? Which is more important in your opinion?

Sensitivity is the accuracy for the positive values and specificity is the accuracy for the negative values.

Sensitivity measures the true-positive rate: how often an actual "yes" is predicted as a "yes". Sensitivity = TP / (TP + FN)

Specificity measures the true-negative rate: how often an actual "no" is predicted as a "no". Specificity = TN / (TN + FP)

In my opinion, it depends on the problem you're trying to solve whether specificity or sensitivity is more important. When predicting a life-threatening disease, sensitivity is more important because it is better to have more false positives than false negatives. In a stock-market setting, sensitivity matters more if you are trying to maximize profit, while specificity matters more if you are trying to minimize loss.
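Applying these formulas to the Lag1 + Lag2 model's 2005 confusion matrix above, with "Up" as the positive class:

```r
# Counts taken from the Lag1 + Lag2 confusion matrix for 2005
TP <- 106  # predicted Up,   actual Up
FN <- 35   # predicted Down, actual Up
FP <- 76   # predicted Up,   actual Down
TN <- 35   # predicted Down, actual Down

sensitivity <- TP / (TP + FN)  # 106/141, about 0.752
specificity <- TN / (TN + FP)  # 35/111,  about 0.315
```

So this model catches most up days but misclassifies most down days.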


e. From the following chart, for the purpose of prediction, which is more critical?

TP and TN are the most critical for prediction because they count the cases where the model's prediction matched the actual outcome

3. Calculate the prediction error from the following …

252/9896 # predicted "no" when it should have been "yes"
[1] 0.02546483
23/104 # predicted "yes" when it should have been "no"
[1] 0.2211538