# Data Mining


The multiple linear regression (MLR) model can be represented mathematically as an algebraic relationship between a response variable and one or more predictor variables.

MLR takes a group of random variables and tries to find a mathematical relationship between them. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points.

MLR is often used to determine how specific factors, such as the price of a commodity, interest rates, and particular industries or sectors, influence the price movement of an asset. For example, the current price of oil, lending rates, and the price movement of oil futures can all affect the price of an oil company's stock. MLR can be used to model the impact each of these variables has on the stock price.

- A data set consisting of the gender, height, and age of children between 5 and 10 years old. You could use multiple linear regression to predict the height of a child (dependent variable) using both age and gender as predictors (i.e., two independent variables).
- The current price of oil, lending rates, and the price movement of oil futures can all affect an oil company's stock price.
- A study conducted by an American university quantified the relationship between a student's final exam score and the hours spent studying and partying during the last week of the term. Here Y is the exam score, and the predictor variables are X1 = hours spent studying and X2 = hours spent partying.

The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.

The model for MLR, given n observations and p predictors, is:

y_i = β0 + β1·x_i1 + β2·x_i2 + … + βp·x_ip + ε_i,  for i = 1, …, n

where β0 is the intercept, β1, …, βp are the slope coefficients, and ε_i is the error term.
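The coefficients are typically estimated by least squares. As a sketch of how that works, here is a minimal standard-library-only fit via the normal equations (X'Xb = X'y); the data below are made up purely for illustration.

```python
# Minimal multiple linear regression via the normal equations,
# using only the Python standard library. Data are illustrative.

def fit_mlr(rows, ys):
    """rows: list of predictor tuples; ys: responses.
    Returns [b0, b1, ..., bp] solving the normal equations X'X b = X'y."""
    X = [[1.0, *r] for r in rows]  # prepend an intercept column of ones
    p = len(X[0])
    # Build X'X and X'y.
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * ys[i] for i in range(len(X))) for a in range(p)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    A = [XtX[a][:] + [Xty[a]] for a in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (A[r][p] - sum(A[r][c] * b[c] for c in range(r + 1, p))) / A[r][r]
    return b

# Exact linear data y = 2 + 3*x1 - 1.5*x2, so the fit recovers the coefficients.
rows = [(1, 2), (2, 1), (3, 5), (4, 2), (5, 7), (6, 3)]
ys = [2 + 3 * x1 - 1.5 * x2 for x1, x2 in rows]
print(fit_mlr(rows, ys))  # approximately [2.0, 3.0, -1.5]
```

In practice a library routine (e.g. a statistical package's OLS fit) would be used instead; the point here is only that the best-fit line comes from minimizing the sum of squared residuals.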

Let's take a real-world example: predicting the sale price of homes (sale price in $ thousands) based on two predictor variables:

- Floor size (in thousands of sq feet)
- Lot size (a category: a home built on a large amount of land will have a much higher price than a home with less land, all else being constant; therefore we can categorize 0-3k sq feet as category 1, 3-5k as category 2, and so on up to category 10)

After fitting, the best-fit model is Y = 122.36 + 61.9*X1 + 7.09*X2,

or: price of home = 122.36 + 61.9 × floor size + 7.09 × lot-size category.

From this we conclude that the sale price increases by about $6,190 for each 100 sq ft increase in floor size, holding lot size constant (61.9 is in $ thousands per thousand sq ft).

Similarly, the sale price increases by about $7,090 for each one-category increase in lot size, holding floor size constant.
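The fitted model above can be used directly for prediction. A small sketch (units as in the text: price in $ thousands, floor size in thousands of sq ft):

```python
# The fitted model from the example:
# price = 122.36 + 61.9 * floor_size + 7.09 * lot_category
# (price in $ thousands, floor size in thousands of sq ft).

def predicted_price(floor_size_ksqft, lot_category):
    return 122.36 + 61.9 * floor_size_ksqft + 7.09 * lot_category

# A 2,000 sq ft home on a category-3 lot:
print(round(predicted_price(2.0, 3), 2))  # → 267.43, i.e. about $267,430

# The slope interpretation: +100 sq ft of floor space adds ~6.19 ($6,190).
print(round(predicted_price(2.1, 3) - predicted_price(2.0, 3), 2))  # → 6.19
```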

Note that multiple regression identifies whether a change in one variable is associated with a change in another; it does NOT establish that changing one variable will cause the other to change.

The basic question is how to evaluate whether the model is a good fit. Three standard methods are generally used to numerically evaluate how well a regression model fits the sample data:

- The regression standard error (s)
- The coefficient of determination (R²)
- The slope parameters

Coming back to our last example (price of home = 122.36 + 61.9 × floor size + 7.09 × lot-size category), the regression standard error s (the root mean square error of the residuals) is 2.4752, in $ thousands.

Roughly 95% of prediction errors fall within two standard errors, so 2s = 2 × 2.4752 ≈ 4.95.

At the 95% level, we can therefore say the model predicts the home price to within about ±$4,950.
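The regression standard error is computed as s = √(SSE / (n − p − 1)), where SSE is the sum of squared residuals, n the number of observations, and p the number of predictors. A sketch, using made-up residuals (the post does not list the individual residuals behind s = 2.4752):

```python
import math

# Regression standard error: s = sqrt(SSE / (n - p - 1)).
# The residuals below are hypothetical, chosen only to illustrate the formula.

def regression_std_error(residuals, p):
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return math.sqrt(sse / (n - p - 1))

residuals = [1.8, -2.9, 2.2, -1.1, 3.0, -3.0]  # made-up residuals, $ thousands
s = regression_std_error(residuals, p=2)
print(round(2 * s, 2))  # ±2s band: roughly 95% of predictions fall within it
```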

In layman's terms, R² represents the percentage of the variation in the response variable that is explained by the regression relationship with the predictor variables.

R² lies between 0 and 1, i.e., 0% to 100%. In our example the value of R² is 0.9717, which means 97.17% of the variation in the sale price of homes is explained by the linear relationship with floor size and lot size.

The greater the value, the better the fit.

Unfortunately, R² is not a reliable guide for model building, because adding a predictor to a model always causes R² to increase or stay the same.

A better approach is adjusted R², which rewards a good fit without overfitting. It can be used to guide model building, since it decreases when unimportant extra predictors are added to the model.

Coming back to our example, adjusted R² is 0.9528, which means about 95.28% of the variation in sale price is explained after adjusting for the number of predictors.
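Adjusted R² is computed as 1 − (1 − R²)(n − 1)/(n − p − 1). The post does not state n, but with R² = 0.9717 and p = 2, n = 6 observations reproduces the quoted adjusted value (n = 6 is therefore an inference, not something the post states):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the
# number of observations and p the number of predictors.
# n = 6 below is an assumption inferred from the quoted values.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.9717, n=6, p=2), 4))  # → 0.9528
```

Notice how severe the penalty is at such a small n: the drop from 0.9717 to 0.9528 reflects how few degrees of freedom remain after fitting two slopes and an intercept.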

As per Inmon, the father of data warehousing:

The data warehouse is a basis for informational processing. It is defined as being

- subject oriented;
- integrated;
- nonvolatile;
- time variant;
- a collection of data in support of management's decisions.

Problems of traditional data warehouse models:

1. Active data warehouse - difficulty maintaining transaction integrity, capacity planning, processing conflicts, cost.

2. Federated data warehouse - very poor performance, no data integrity, no history of data, improper grains.

3. Star schema - resistant to change, limited optimization, useful only at the lowest grain.

4. Data mart - problems with data reconciliation, maintenance issues, a rigid design that is not flexible to change, and difficulty implementing future changes.

**Building the REAL Data Warehouse**

**Interactive Sector** - Only a modest amount of data is found in the Interactive Sector; the volumes of interactive data here are small. The interactive data almost always resides on disk storage. In addition to having fast performance, the transactions that run through the Interactive Sector are able to do updates: data can be added, deleted, or modified.

**Integrated Sector** - This is where data is organized into major subject areas and where detail is kept. The summary data found in the Integrated Sector is summary data that is used in many places and that doesn't change. The data is granular: there are a lot of atomic units of data to be collected and managed. The data is historical: there is often 3 to 5 years' worth of data. The data comes from a wide variety of sources.

**Near Line Sector** - Performance is enhanced by moving data with a low probability of access to the Near Line Sector. Because only such data is sent there, the data remaining on disk storage in the Integrated Sector is freed from the overhead of "bumping into" large amounts of data that is not going to be used.

**Archival Sector** - When data is sent to the Archival Sector, it may or may not be appropriate to preserve the structure that the data had in the integrated or near-line environments. There are advantages and disadvantages both to preserving the structure of the data and to not preserving it. One advantage of preserving the structure as the data passes into the Archival Sector is that it is easy to do.


The data warehouse is divided into four sectors:

- **Very Current (Interactive Sector)** - data as old as 2 seconds.
- **Current (Integrated Sector)** - data as old as 24 hours.
- **Near Line (Near Line Sector)** - data as old as 3-4 years.
- **Archival (Archival Sector)** - data older than 5 years.
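The sector assignment above is driven purely by data age. A hypothetical sketch of that routing rule (the thresholds are the ones given in the post; how the gap between 24 hours and ~3 years is handled is an assumption, here assigned to Near Line):

```python
from datetime import timedelta

# Hypothetical routing of a record to a warehouse sector by its age,
# following the four sectors described above. Boundary handling for the
# 24h-to-3yr gap is an assumption, not stated in the post.

def sector_for(age: timedelta) -> str:
    if age <= timedelta(seconds=2):
        return "Interactive"
    if age <= timedelta(hours=24):
        return "Integrated"
    if age <= timedelta(days=5 * 365):
        return "Near Line"
    return "Archival"

print(sector_for(timedelta(hours=3)))    # → Integrated
print(sector_for(timedelta(days=2000)))  # → Archival
```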

This architecture has several advantages:

- The infrastructure is held together by metadata.
- Data access is quick, since it is organized by sector.
- Data archiving is done automatically.
- There is less volume of data in each sector.


Upon leaving the Near Line Sector, data normally moves into the Archival Sector. Note that the Archival Sector may be fed data directly from the Integrated Sector without passing through the Near Line Sector; however, if the data has been moved into the Near Line Sector, it normally moves from there to the Archival Sector. Data is moved to the Archival Sector when the probability of accessing it drops significantly.

The data is simply read in one format and written out in the same format; that is about as simple as it gets. But there are some reasons this approach may not be optimal. One is that, once the data is archived, it may not be used the same way it was in the integrated environment.
