Y as response variable: also known as the dependent variable, the outcome, or simply the output variable. This variable should be quantitative, with meaningful numerical values.
X as predictor variables: X1, X2, ..., Xp are the predictor variables, also known as input variables or covariates. These variables should also be quantitative; a categorical predictor such as gender must first be coded numerically.
The multiple linear regression model can be represented mathematically as an algebraic relationship between the response variable and one or more predictor variables.
Investopedia explains Multiple Linear Regression - MLR
MLR takes a group of random variables and tries to find a mathematical relationship between them. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual points.
MLR is often used to determine how much specific factors, such as the price of a commodity, interest rates, and particular industries or sectors, influence the movement of an asset. For example, the current price of oil, lending rates, and the price movement of oil futures can all affect the price of an oil company's stock. MLR could be used to model the impact that each of these variables has on the stock's price.
Some examples:
- A data set consisting of the gender, height and age of children between 5 and 10 years old. You could use multiple linear regression to predict the height of a child (dependent variable) using both age and gender as predictors (i.e., two independent variables).
- The current price of oil, lending rates, and the price movement of oil futures can all affect the price of an oil company's stock.
- An excellent example is a study conducted by an American university on quantifying the relationship between a student's final exam score and the number of hours spent partying during the last week of the term. Here Y is the exam score and the predictor variables are X1 = hours spent studying and X2 = hours spent partying.
The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.
The model for MLR, given n observations, is:
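y_i = B0 + B1*x_i1 + B2*x_i2 + ... + Bp*x_ip + E_i, where i = 1, 2, ..., n
Here B0 is the intercept, B1, ..., Bp are the slope coefficients of the p predictor variables, and E_i is the random error for observation i.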
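To make the model concrete, here is a minimal sketch of how the coefficients B0, B1, ..., Bp could be estimated by ordinary least squares, using only the Python standard library. The function name fit_mlr and the small data set are hypothetical, made up for illustration. The data are generated without an error term so the known coefficients can be recovered. In practice you would normally rely on a statistics package rather than hand-written elimination.

```python
def fit_mlr(X, y):
    """Least-squares estimates [b0, b1, ..., bp] for y = b0 + b1*x1 + ... + bp*xp."""
    # Prepend a column of 1s so the intercept b0 is estimated with the slopes.
    rows = [[1.0] + [float(v) for v in row] for row in X]
    k = len(rows[0])  # number of coefficients: p + 1

    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]

    # The matrix is now diagonal; read off each coefficient.
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

coefficients = fit_mlr(X, y)  # approximately [1.0, 2.0, 3.0]
print([round(c, 6) for c in coefficients])
```

With real data the error term is not zero, so the estimates only approximate the underlying relationship.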
Multiple Regression Model
Let's take a real-world example of predicting the sale price of homes (sale price in $ thousands)
based on two predictor variables:
- Floor size (in thousands of sq feet)
- Lot size (category). A home built on a large amount of land will sell for a much higher price than a home on less land, all else being equal. We can therefore group lot sizes into categories: 0-3k sq feet is category 1, 3-5k sq feet is category 2, and so on up to category 10.
The basic question is how to evaluate whether the model is a good fit or not. We generally use 3 standard methods to numerically evaluate how well a regression model fits sample data.
The methods are:
- The regression standard error
- The coefficient of determination, R2
- The slope parameters
Coming back to our last example, the fitted model is: Price of home = 122.36 + 61.9 * floor size + 7.09 * lot size category. Using SAS as the statistical software, we found that the "root mean square error" (the regression standard error, s) is 2.4752.
For a model that fits well, roughly 95% of the observed values fall within 2 standard errors of the model's predictions, so we compute 2s = 2 * 2.4752 = 4.95.
Because sale price is measured in $ thousands, we can expect the model to predict a home's price to within about +/- $4,950, approximately 95% of the time.
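The arithmetic above can be sketched in a few lines of Python. The coefficients and the standard error 2.4752 are the values reported in the text. The helper names, the sample home (2,000 sq feet, lot category 4), and the formula s = sqrt(SSE / (n - p - 1)) as the definition of the reported root mean square error are assumptions made for illustration.

```python
import math


def regression_standard_error(observed, predicted, num_predictors):
    """s = sqrt(SSE / (n - p - 1)): the typical size of a residual."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / (n - num_predictors - 1))


def predict_price(floor_size, lot_category):
    """Fitted equation from the text; the result is in $ thousands."""
    return 122.36 + 61.9 * floor_size + 7.09 * lot_category


s = 2.4752      # root mean square error reported in the text
margin = 2 * s  # about 4.95, i.e. roughly $4,950

# Hypothetical home: floor size 2.0 (thousand sq feet), lot category 4.
price = predict_price(2.0, 4)
low, high = price - margin, price + margin
print(f"predicted: {price:.2f}, rough 95% range: {low:.2f} to {high:.2f}")
```

For this hypothetical home the prediction is about 274.52, so the rough range runs from about $269,570 to $279,470.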
To explain R2 in layman's terms: it represents the proportion of the variation in the response variable that is explained by its regression relationship with the predictor variables.
R2 lies between 0 and 1, that is, between 0% and 100%. In our example the value of R2 is 0.9717, which means that 97.17% of the variation in home sale prices is explained by the linear regression relationship between sale price and (floor size, lot size).
The greater the value, the better the fit.
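As a rough illustration, R2 can be computed as 1 - SSE / TSS, where SSE is the sum of squared residuals and TSS is the total variation of the response around its mean. The function name and the four observed and predicted values below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / TSS."""
    mean_y = sum(observed) / len(observed)
    # SSE: variation left unexplained by the model (squared residuals).
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # TSS: total variation of the response around its mean.
    tss = sum((o - mean_y) ** 2 for o in observed)
    return 1 - sse / tss


# Hypothetical observed values and model predictions.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [10.5, 11.5, 14.5, 15.5]

print(round(r_squared(observed, predicted), 4))  # about 0.95
```

Here SSE is 1 and TSS is 20, so the model explains about 95% of the variation in these made-up values.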
Adjusted R2 (R square):
Unfortunately, R2 is not a reliable guide to model building, because if we add a predictor to a model, R2 either increases or stays the same.
Therefore a better way is to use adjusted R2, which rewards a good fit while penalizing over-fitting. It can be used to guide model building because it decreases when unimportant predictors are added to the model.
Coming back to our example, adjusted R2 is 0.9528, which means that, after accounting for the number of predictors, about 95.28% of the variation in home sale prices is explained by the linear regression relationship between sale price and (floor size, lot size). This is a more conservative measure of fit than the unadjusted R2 of 97.17%.
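The standard formula is adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies it to the R2 from the text. The sample size n = 6 is an assumption: the article does not state it, but it is the value that reproduces the reported 0.9528 with p = 2.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


r2 = 0.9717  # R2 reported in the text
n = 6        # ASSUMED sample size; not stated in the text
p = 2        # floor size and lot size category

print(round(adjusted_r_squared(r2, n, p), 4))  # about 0.9528

# If a third predictor left R2 unchanged, adjusted R2 would drop.
print(round(adjusted_r_squared(r2, n, p + 1), 4))
```

The second call shows the penalty at work: the same R2 with one more predictor gives a lower adjusted R2.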