One of the most common techniques in analytics is regression. In its simplest form, regression determines the relationship between two variables: given the value of one, it tells us what to expect of the other. Linear regression is one of these techniques, and it is typically fitted by Ordinary Least Squares (OLS). Before we go any further, let us clarify some terminology.
Response variable: the outcome variable we are trying to predict.
Predictor variable: the input variable we use to make the prediction.
A simple linear regression model describes the relationship between two variables x and y, where y is the dependent (i.e. response) variable and x is the independent (predictor) variable. The terms "dependent" and "independent" can be confusing: in probability theory, if y depends on x, then x cannot be statistically independent of y. So we stick with the terms response and predictor exclusively.
The linear regression model can be expressed as:
Y = a + bx
Where,
Y is the response variable
x is the predictor variable
a is the intercept
b is the slope
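The OLS fit has a simple closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). As a quick sketch on synthetic data (not the father.son set used below), lm should recover the same two values:

```r
# Closed-form OLS on synthetic data, compared against lm().
set.seed(42)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 0.2)

b <- cov(x, y) / var(x)     # slope
a <- mean(y) - b * mean(x)  # intercept

fit <- lm(y ~ x)
round(c(a, b), 4)
round(unname(coef(fit)), 4)  # same two numbers
```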
Steps to Implement Linear Regression in R
Let us take a case: we have a data set named father.son, and using fathers' heights we want to predict sons' heights with a linear regression model. Here, the fathers' heights are the predictor and the sons' heights are the response.
require(UsingR)
require(ggplot2)
head(father.son)
##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
To fit a linear regression we use the lm function. lm creates the relationship model between the response and the predictor variables.
heightsLM <- lm(sheight ~ fheight, data = father.son)
heightsLM
##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
##
## Coefficients:
## (Intercept) fheight
## 33.8866 0.5141
Here, we once again see the formula
notation that specifies to regress sheight on fheight. The interpretation of
this result is that for every extra inch of height in a father, we expect an
extra half inch in height for his son.
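As a quick check of this interpretation, we can plug the reported coefficients into the model by hand; for example, for a hypothetical 70-inch father:

```r
# Predicted son's height = intercept + slope * father's height,
# using the coefficients reported by lm above.
predicted <- 33.8866 + 0.5141 * 70
predicted  # about 69.87 inches
```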
The intercept in this case doesn't make much sense, because it represents the height of a son whose father is zero inches tall. To understand the model more clearly, we need the full report:
summary(heightsLM)
##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
## Residuals:
## Min 1Q Median 3Q Max
## -8.8772 -1.5144 -0.0079 1.6285 8.9685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.88660 1.83235 18.49 <2e-16 ***
## fheight 0.51409 0.02705 19.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
## F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
This generates a lot of information about the model, including standard errors, t values and p-values for the coefficients, the residual standard error, and R-squared. This is all diagnostic information for checking the fit of the model.
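Individual diagnostics can also be pulled out of the summary object programmatically. A minimal sketch (on synthetic data here, so the snippet does not depend on the UsingR package):

```r
# Fit a simple model on synthetic data and extract diagnostics
# from the summary object.
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
s <- summary(fit)

coef(s)      # matrix of estimates, std. errors, t values, p-values
s$r.squared  # multiple R-squared
s$sigma      # residual standard error
```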
ggplot(data = father.son, aes(x = fheight, y = sheight)) +
  geom_point(col = "darkgreen") +
  geom_smooth(method = 'lm') +
  labs(x = "Father", y = "Sons")
In this graph, the blue line running through the points is the regression line, and the grey band around it represents the uncertainty in the fit. Linear regression is, at its core, a tool for prediction.
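For prediction on new data, R's predict function takes the fitted model and a newdata data frame containing the predictor column. A minimal sketch with synthetic father/son-style data (the column names mirror father.son but the numbers are made up):

```r
# Synthetic stand-in for father.son, used so the example is
# self-contained.
set.seed(7)
fathers <- rnorm(50, mean = 68, sd = 2)
sons <- 34 + 0.5 * fathers + rnorm(50, sd = 2)
d <- data.frame(fheight = fathers, sheight = sons)

fit <- lm(sheight ~ fheight, data = d)

# Predict sons' heights for three new fathers.
p <- predict(fit, newdata = data.frame(fheight = c(64, 68, 72)))
p
```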