Saturday, May 7, 2011

Multiple Linear Regression

Multiple Linear Regression: a technique for analyzing certain types of multivariate data. It helps us understand the relationship between a response variable and one or more predictor variables, and lets us estimate the value of the response variable from known values of the predictor variables.

Y as response variable: also known as the dependent variable, the outcome, or simply the output variable. This variable should be quantitative, having meaningful numerical values.
X as predictor variables: X1, X2, ..., Xp are the predictor variables, also known as input variables or covariates. These variables should also be quantitative (a categorical predictor such as gender can be included by coding it numerically, for example as 0 and 1).

The multiple linear regression model can be represented mathematically as an algebraic relationship between the response variable and one or more predictor variables.

Investopedia explains Multiple Linear Regression - MLR
MLR takes a group of random variables and tries to find a mathematical relationship between them. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points. 

MLR is often used to determine how a number of specific factors, such as the price of a commodity, interest rates, and particular industries or sectors, influence the price movement of an asset. For example, the current price of oil, lending rates, and the price movement of oil futures can all have an effect on an oil company's stock price. MLR could be used to model the impact that each of these variables has on the stock's price.

Some Examples:

  1. A data set consisting of the gender, height and age of children between 5 and 10 years old. You could use multiple linear regression to predict the height of a child (dependent variable) using both age and gender as predictors (i.e., two independent variables).
  2. The current price of oil, lending rates, and the price movement of oil futures can all have an effect on an oil company's stock price.
  3. An excellent example is a study conducted by an American university quantifying the relationship between a student's final exam score and the number of hours spent partying during the last week of the term. Here Y is the exam score and the predictor variables are X1 = hours spent studying and X2 = hours spent partying.

The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.

The model for MLR, given n observations, is:

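y_i = B0 + B1*x_i1 + B2*x_i2 + ... + Bp*x_ip + E_i,   where i = 1, 2, ..., n

Here y_i is the response for observation i, x_i1 ... x_ip are the values of the p predictor variables for that observation, B0 is the intercept, B1 ... Bp are the regression coefficients, and E_i is the random error term.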
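To make the model concrete, here is a minimal sketch of how the coefficients B0 ... Bp could be estimated by ordinary least squares. It is written in pure Python to stay dependency-free; in practice a statistics package would normally do this. The function name fit_mlr and the small data set are made up for illustration and are not taken from any real study.

```python
def fit_mlr(rows, y):
    """Estimate [B0, B1, ..., Bp] by solving the normal equations (X'X) b = X'y.

    Assumes the predictors are not perfectly collinear (X'X is invertible).
    """
    # Prepend 1.0 to each row so the first coefficient is the intercept B0.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows_demo = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_demo = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows_demo]

coefs_demo = fit_mlr(rows_demo, y_demo)
print([round(c, 6) for c in coefs_demo])  # recovers roughly B0=1, B1=2, B2=3
```

Because the demo data has no error term, the fit recovers the generating coefficients up to floating-point rounding. With real data the E_i term means the fitted line only approximates the points.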

Multiple Regression Model

Let's take a real-world example: predicting the sale price of homes (in $ thousands) based on two predictor variables.

  1. Floor size (in thousands of square feet).
  2. Lot size (as a category). A home built on a large amount of land will have a much higher price than a home with less land, all else being equal. We can therefore code lots of 0-3k square feet as category 1, 3-5k as category 2, and so on up to category 10.

After calculation we got the best-fit model Y = 122.36 + 61.9*X1 + 7.09*X2

or   Price of home = 122.36 + 61.9 * floor size + 7.09 * lot size category

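Because price is in $ thousands and floor size is in thousands of square feet, we conclude that the sale price is expected to increase by $61,900 for each additional 1,000 square feet of floor size (about $6,190 for each 100 square feet) when the lot size category is held constant.
Similarly, the sale price is expected to increase by $7,090 for each one-category increase in lot size, with floor size held constant.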
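The coefficient interpretation can be checked with a few lines of Python. This is only a sketch: the function name predict_price and the sample home (2,000 sq ft, lot category 3) are hypothetical, while the coefficients are the ones from the fitted model above. All prices are in $ thousands.

```python
def predict_price(floor_size, lot_category):
    """Predicted sale price in $ thousands.

    floor_size   -- floor size in thousands of square feet
    lot_category -- lot size category, 1 to 10
    """
    return 122.36 + 61.9 * floor_size + 7.09 * lot_category


base = predict_price(2.0, 3)     # hypothetical home: 2,000 sq ft, category 3
bigger = predict_price(2.1, 3)   # same lot, 100 sq ft more floor space
higher = predict_price(2.0, 4)   # same floor size, one lot category up

print(round(base, 2))            # about 267.43, i.e. $267,430
print(round(bigger - base, 2))   # about 6.19, i.e. $6,190 per extra 100 sq ft
print(round(higher - base, 2))   # about 7.09, i.e. $7,090 per lot category
```

Note that each difference depends only on the coefficient of the variable that changed, which is what "holding the other variable constant" means in a linear model.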

The above calculation, based on multiple regression techniques, shows how to identify whether a change in one variable is associated with a change in another variable. It does NOT establish that changing one variable will cause the other to change.

How to evaluate the MODEL
The basic question is how to evaluate whether the model is a good fit or not. We generally use three standard methods to numerically evaluate how well a regression model fits sample data.
The methods are:

  1. The regression standard error
  2. The coefficient of determination, R2
  3. The slope parameters

The regression standard error:

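Coming back to our last example, Price of home = 122.36 + 61.9 * floor size + 7.09 * lot size category, we found that the regression standard error s (reported as the "root mean square error") is 2.4752, using SAS as the statistical software to fit the regression.
As a rule of thumb, about 95% of the observed sale prices should lie within 2 standard errors of the model's predictions, that is, within 2s = 2 * 2.4752 = 4.95 (the factor of 2 comes from the normal distribution, where about 95% of values fall within 2 standard deviations of the mean).

Since price is measured in $ thousands, we can say with roughly 95% confidence that the model predicts the home price to within about +/- $4,950.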
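The arithmetic behind this range can be sketched in Python as follows. The helper name approx_95_range and the predicted price of 267.43 are hypothetical, chosen only for illustration; the standard error 2.4752 is the value reported above, in $ thousands. This is the rough "plus or minus 2s" rule of thumb, not an exact prediction interval.

```python
def approx_95_range(predicted, s):
    """Rough 95% range around a prediction: predicted +/- 2 * s."""
    margin = 2 * s
    return predicted - margin, predicted + margin


s = 2.4752                 # regression standard error, in $ thousands
margin = 2 * s
print(round(margin, 2))    # 4.95, i.e. about $4,950

# Hypothetical predicted price of 267.43 ($ thousands).
low, high = approx_95_range(267.43, s)
print(round(low, 2), round(high, 2))
```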

Coefficient of determination R2:
In layman's terms, R2 represents the proportion of the variation in the response variable that is explained by its regression relationship with the predictor variables.
The value lies between 0 and 1, that is, 0% to 100%. In our example the value of R2 is 0.9717, which means that 97.17% of the variation in the sale price of homes is explained by the linear regression relationship between sale price and (floor size, lot size).
The greater the value, the better the fit.
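For readers who want to see where such a number comes from, R2 is usually computed as 1 - SSE/SST, where SSE is the sum of squared differences between actual and predicted values and SST is the sum of squared differences between actual values and their mean. The sketch below uses made-up numbers, not the home price data, and the function name r_squared is chosen for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the response around its mean.
    sst = sum((a - mean_y) ** 2 for a in actual)
    return 1 - sse / sst


# Made-up values for illustration only.
actual_demo = [1.0, 2.0, 3.0, 4.0]
predicted_demo = [1.1, 1.9, 3.2, 3.8]

r2_demo = r_squared(actual_demo, predicted_demo)
print(round(r2_demo, 4))  # 0.98
```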

Adjusted R2 (R square):
Unfortunately, R2 is not a reliable guide for model building, because adding a predictor to a model either increases R2 or leaves it the same, even when the predictor is unimportant.
A better way is to use adjusted R2, which rewards a good fit without overfitting. It can be used to guide model building since it decreases when extra unimportant predictors are added to the model.
Coming back to our example, adjusted R2 is 0.9528, so after accounting for the number of predictors, roughly 95.28% of the variation in the sale price of homes is explained by the linear regression relationship between sale price and (floor size, lot size). This is a more conservative measure of fit than the unadjusted R2 of 97.17%.
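The usual formula is adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The post does not state the sample size, so n = 6 in the sketch below is an assumption: it is simply the value that reproduces the reported figure with p = 2.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# R2 = 0.9717 and p = 2 come from the example; n = 6 is an ASSUMED sample size.
adj = adjusted_r_squared(0.9717, n=6, p=2)
print(round(adj, 4))  # 0.9528

# With R2 held fixed, adding a third predictor lowers adjusted R2.
adj_three = adjusted_r_squared(0.9717, n=6, p=3)
print(adj_three < adj)  # True
```

The second call shows the penalty at work: unless an extra predictor raises R2 enough, adjusted R2 goes down.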
