Wednesday, June 8, 2016

Simple Linear Regression in R

Regression is one of the most common techniques in analytics. In its simplest form, regression is used to determine the relationship between two variables: given the value of one variable, it tells us what to expect of the other.

Linear regression is one of the most widely used regression techniques; it is based on ordinary least squares (OLS). Before we go any further, let us clarify some terminology.

Response Variable: the outcome variable that we are trying to predict.

Predictor Variable: the input variable that we use to make the prediction.

A simple linear regression model describes the relationship between two variables x and y, where y is the dependent (i.e. response) variable and x is the independent (predictor) variable. Note that according to probability theory, if variable y is dependent on x then variable x cannot be independent of variable y, so to avoid this ambiguity we stick with the terms response and predictor exclusively.
The linear regression model can be expressed as:

y = a + bx

Where,
y is the response variable
x is the predictor variable
a is the intercept
b is the slope
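
Because the model is fit by ordinary least squares, the slope and intercept have simple closed-form estimates: b = cov(x, y) / var(x) and a = mean(y) - b * mean(x). As a minimal sketch on made-up toy data (this is for illustration; internally lm uses a QR decomposition rather than these formulas), we can verify the estimates by hand:

x <- c(1, 2, 3, 4, 5)            # toy predictor values (hypothetical)
y <- c(2.1, 3.9, 6.2, 8.0, 9.8)  # toy response values (hypothetical)

b <- cov(x, y) / var(x)      # closed-form slope estimate
a <- mean(y) - b * mean(x)   # closed-form intercept estimate
c(intercept = a, slope = b)

coef(lm(y ~ x))              # matches the closed-form estimates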

Steps to Implement Linear Regression in R

Let us take a case where we have a dataset named father.son: using fathers' heights we want to predict sons' heights with a linear regression model.

Here, the fathers' heights are the predictor and the sons' heights are the response.

require(UsingR)   # provides the father.son dataset
require(ggplot2)  # for plotting later on
head(father.son)  # first six rows of the data

##    fheight  sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
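
As a quick sanity check, the dataset contains 1078 father/son pairs (consistent with the 1076 residual degrees of freedom reported in the model summary further below):

nrow(father.son)

## [1] 1078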

To fit a linear regression in R we use the lm function, which builds the relationship model between the response and the predictor variables.

heightsLM <- lm(sheight ~ fheight, data = father.son)
heightsLM
##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
##
## Coefficients:
## (Intercept)      fheight 
##     33.8866       0.5141

Here we once again see the formula notation, which specifies that we regress sheight on fheight. The interpretation of this result is that for every extra inch of height in a father, we expect about an extra half inch of height in his son.
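
The fitted coefficients can also be extracted with coef and used directly. As a quick sketch, plugging a hypothetical 70-inch father into y = a + bx gives his son's expected height:

# Extract the intercept and slope from the fitted model
coef(heightsLM)

# Expected son's height for a (hypothetical) 70-inch father,
# using the rounded coefficients shown above
33.8866 + 0.5141 * 70

## [1] 69.8736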
The intercept in this case doesn't make much sense, because it represents the predicted height of a son whose father is zero inches tall. To understand the model more fully we need to see its complete report:

summary(heightsLM)

##
## Call:
## lm(formula = sheight ~ fheight, data = father.son)

## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8772 -1.5144 -0.0079  1.6285  8.9685
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 33.88660    1.83235   18.49   <2e-16 ***
## fheight      0.51409    0.02705   19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared:  0.2513, Adjusted R-squared:  0.2506
## F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

This generates a lot of information about the model, including standard errors, t-statistics and p-values for the coefficients, the residual standard error, and R-squared. This is all diagnostic information for checking the fit of the model.
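
Base R can also draw the standard diagnostic plots for an lm fit (residuals vs. fitted values, a normal Q-Q plot, and so on); one quick sketch:

# Draw the four standard regression diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(heightsLM)

To visualise the fit itself, we plot the data together with the regression line: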

ggplot(data = father.son, aes(x = fheight, y = sheight)) +
  geom_point(col = "darkgreen") +
  geom_smooth(method = "lm") +
  labs(x = "Father", y = "Son")


In this graph, the blue line running through the points is the regression line, and the grey band around it represents the uncertainty in the fit. Fundamentally, linear regression is used for prediction.
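
Since prediction is the main use, we can feed new father heights to predict; requesting a confidence interval quantifies the same uncertainty shown by the grey band. A minimal sketch with hypothetical new heights:

# Predict sons' heights for hypothetical 65- and 72-inch fathers,
# with a 95% confidence interval around the fitted line
newFathers <- data.frame(fheight = c(65, 72))
predict(heightsLM, newdata = newFathers, interval = "confidence")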

