Friday, May 20, 2011

The UC San Diego Extension staff targets the following 14 niche careers as sectors to watch.

Data Mining

Data mining is an exploding industry, largely due to the massive amount of data generated by the population's use of technology and the Web, which can be used to predict trends and consumer behavior. A study out of UC Berkeley shows that the amount of data in the world doubles every three years. Career prospects include advertising technology, fraud detection, risk management and law enforcement. Data mining requires understanding of algorithms and advanced statistics as well as programming and computer management.

Saturday, May 7, 2011

Multiple Linear Regression

Multiple Linear Regression: a technique for analyzing certain types of multivariate data. It helps us understand the relationship between a response variable and one or more predictor variables, and can be used to estimate the value of the response variable given the values of the predictor variables.

Y, the response variable: also known as the dependent variable, the outcome, or simply the output variable. This variable should be quantitative, having meaningful numerical values.
X, the predictor variables: X1, X2, ..., Xn are the predictor variables, also known as input variables or covariates. These variables should also be quantitative.

The multiple linear regression model can be represented mathematically as an algebraic relationship between the response variable and one or more predictor variables.

Investopedia explains Multiple Linear Regression - MLR
MLR takes a group of random variables and tries to find a mathematical relationship between them. The model creates a relationship in the form of a straight line (linear) that best approximates all the individual data points. 

MLR is often used to determine how much specific factors, such as the price of a commodity, interest rates, and particular industries or sectors, influence the price movement of an asset. For example, the current price of oil, lending rates, and the price movement of oil futures can all affect an oil company's stock price. MLR could be used to model the impact that each of these variables has on the stock's price.

Some Examples :

  1. A data set consisting of the gender, height and age of children between 5 and 10 years old. You could use multiple linear regression to predict the height of a child (dependent variable) using both age and gender as predictors (i.e., two independent variables).
  2. The current price of oil, lending rates, and the price movement of oil futures can all have an effect on an oil company's stock price.
  3. An excellent example is a study conducted by an American university quantifying the relationship between a student's final exam score and the number of hours spent studying and partying during the last week of the term. Here Y is the exam score, and the predictor variables are X1 = hours spent studying and X2 = hours spent partying.

The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.

The model for MLR, given n observations, is:

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + εi,  where i = 1, 2, ..., n
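As a quick sketch of how such a model is fit, here is ordinary least squares on synthetic data using numpy (the variable names and the "true" coefficients are made up purely for illustration):

```python
import numpy as np

# Synthetic data: y depends linearly on two predictors plus a little noise.
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(0, 0.1, n)

# Design matrix with a leading column of ones for the intercept β0.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: solve for B = (β0, β1, β2).
B, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(B, 2))  # close to the true values [2.0, 3.0, -1.5]
```

With low noise and 50 observations, the recovered coefficients land very close to the values used to generate the data.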

Multiple Regression Model

Let's take a real-world example: predicting the sale price of homes (sale price in $ thousands)
based on two predictor variables:

  1. Floor size (in thousands of square feet)
  2. Lot size (a category: a home built on a large lot will have a much higher price than a home on less land, all else being constant. We can therefore categorize lots of 0-3k sq feet as category 1, 3-5k as category 2, and so on, up to category 10.)
After calculation we get the best-fit model Y = 122.36 + 61.9*X1 + 7.09*X2

or   Price of home = 122.36 + 61.9*floor size + 7.09*lot size category

Having said that, we conclude that the sale price will increase by about $6,190 for each 100 sq foot increase in floor size, with lot size held constant (price is in $ thousands and floor size in thousands of sq feet, so a 0.1 increase in X1 raises the price by 61.9 × 0.1 = 6.19 thousand dollars).
Similarly, the sale price will increase by about $7,090 for each one-category increase in lot size, with floor size held constant.
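The fitted equation can be coded up directly to check this interpretation (the function name and example inputs are illustrative):

```python
def predicted_price(floor_size, lot_category):
    """Best-fit model from the text: sale price in $ thousands."""
    return 122.36 + 61.9 * floor_size + 7.09 * lot_category

base = predicted_price(1.5, 3)        # 1,500 sq ft home, lot category 3
bigger = predicted_price(1.6, 3)      # add 100 sq ft, same lot category
print(round((bigger - base) * 1000))  # increase in dollars: about 6190
```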

The above calculation, based on multiple regression techniques, describes how to identify whether changing one variable is associated with a change in another variable; it does NOT establish that changing one variable will cause the other to change.

How to evaluate the MODEL
The basic question is: how do we evaluate whether the model is a good fit or not? We generally use 3 standard methods to numerically evaluate how well a regression model fits the sample data.
The methods are:

  1. The regression standard error (s).
  2. Coefficient of determination (R2).
  3. Slope parameters.
The regression standard error:

Coming back to our last example, Price of home = 122.36 + 61.9*floor size + 7.09*lot size category: using SAS as the statistical software to fit the regression, we find a root mean square error (s) of 2.4752, in $ thousands.
A rough 95% prediction interval is the estimate plus or minus two standard errors, i.e. 2s = 2 × 2.4752 = 4.95 thousand dollars.

So at roughly a 95% confidence level, we can expect a predicted home price to be accurate to within about ±$4,950.
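The ±2s rule can be sketched directly (s is the root mean square error reported in the text; the 230 below is just a hypothetical predicted price):

```python
S = 2.4752  # root mean square error from the SAS output, in $ thousands

def rough_interval(point_estimate):
    """Approximate 95% prediction interval: estimate +/- 2s."""
    half_width = 2 * S
    return point_estimate - half_width, point_estimate + half_width

lo, hi = rough_interval(230.0)  # hypothetical predicted price, $ thousands
print(round(lo, 4), round(hi, 4))  # 225.0496 234.9504
```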

Coefficient of determination (R2):
To explain the effect of R2 in layman's terms: it represents the percentage of variation in the response variable that is explained by the regression relationship with the predictor variables.
R2 lies between 0 and 1, i.e. 0% to 100%. In our example the value of R2 is 0.9717, which translates to 97.17% of the variation in the sale price of homes being explained by the linear regression on floor size and lot size.
The greater the value, the better the fit.

Adjusted R2 (R square):
Unfortunately, R2 alone is not a reliable guide for model building, because when we add a predictor to a model, R2 either increases or stays the same.
A better approach is to use adjusted R2, which rewards a good fit without overfitting. It can be used to guide model building, since it decreases when unimportant extra predictors are added to the model.
Coming back to our example, the adjusted R2 is 0.9528, meaning about 95.28% of the variation in the sale price of homes is explained after adjusting for the number of predictors. This is a more honest measure of fit than the raw 97.17%.
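Both measures can be computed by hand. A minimal sketch (the n = 6 used below is an assumption, chosen because it reproduces the adjusted value reported above):

```python
def r_squared(y, y_hat):
    """R2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R2 for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With R2 = 0.9717, two predictors, and an assumed n = 6 observations:
print(round(adjusted_r_squared(0.9717, 6, 2), 4))  # 0.9528
```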

Thursday, May 5, 2011

Data Warehouse 2.0

As per Inmon, the father of data warehousing:

The data warehouse is a basis for informational processing. It is
defined as being
■ subject oriented;
■ integrated;
■ nonvolatile;
■ time variant;
■ a collection of data in support of management's decisions.

Problems with traditional data warehouse models...
1. Active Data Warehouse - difficulty maintaining transaction integrity, capacity planning, processing conflicts, cost.
2. Federated Data Warehouse - very poor performance, no data integrity, no history of data, improper grains.
3. Star Schema - resistant to change, limited in optimization, useful only at the lowest grain.
4. Data Mart - problems with data reconciliation, maintenance issues, a rigid design that is not flexible to change; implementing future changes is difficult.

Building the REAL Data Warehouse

The data warehouse is divided into 4 sectors. These are...
  • Very current (aka the Interactive Sector) - data as little as 2 seconds old.
  • Current (aka the Integrated Sector) - data up to about 24 hours old.
  • Near Line (aka the Near Line Sector) - data around 3-4 years old.
  • Archival (aka the Archival Sector) - data older than 5 years.
The infrastructure is held together by metadata.
Data access is quick, since it is organized by sector.
Data archiving is done automatically.
There is a smaller volume of data in each sector.
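One way to read this split is as a simple routing rule keyed on data age. A sketch (the function name is illustrative, and the >4-year cutoff for Archival is an assumption, since the list above leaves a gap between 4 and 5 years):

```python
from datetime import timedelta

def sector_for(age: timedelta) -> str:
    """Route a record to a DW 2.0 sector based on how old it is."""
    if age <= timedelta(seconds=2):
        return "Interactive"
    if age <= timedelta(hours=24):
        return "Integrated"
    if age <= timedelta(days=4 * 365):
        return "Near Line"
    return "Archival"

print(sector_for(timedelta(hours=3)))    # Integrated
print(sector_for(timedelta(days=2000)))  # Archival
```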

Interactive Sector - There is only a modest amount of data in the Interactive Sector; the volumes of interactive data found here are small. Interactive data almost always resides on disk storage. In addition to providing fast performance, the transactions that run through the Interactive Sector are able to do updates: data can be added, deleted, or modified.

Integrated Sector - This is where data is organized into major subject areas and where detail is kept. The summary data found in the Integrated Sector is summary data that is used in many places and that doesn't change. The data is granular: there are a lot of atomic units of data to be collected and managed. The data is historical: there is often from 3 to 5 years' worth of data. The data comes from a wide variety of sources.

Near Line Sector - Performance is enhanced by downloading data with a low probability of access to the Near Line Sector. Because only data with a low probability of access is sent to the Near Line Sector, the data remaining in disk storage in the Integrated Sector is freed from the overhead of "bumping into" large amounts of data that is not going to be used.
Upon leaving the Near Line Sector, data normally moves into the Archival Sector. Note that the Archival Sector may be fed data directly from the Integrated Sector without passing through the Near Line Sector. However, if the data has been moved into the Near Line Sector, then it is normally moved from there to the Archival Sector. The movement of data to the Archival Sector is made when the probability of accessing the data drops significantly.
Archival Sector - When data is sent to the Archival Sector, it may or may not be appropriate to preserve the structure that the data had in the integrated or near-line environments. There are advantages and disadvantages both to preserving the structure of the data and to not preserving it. One advantage of preserving the structure of the data as it passes into the Archival Sector is that it is easy to do:
the data is simply read in one format and written out in the same format. That is about as simple as it gets. But there are some reasons this approach may not be optimal. One reason is that once the data becomes archived, it may not be used the same way it was in the integrated environment.