Thursday, May 5, 2011

Data Warehouse 2.0

According to Inmon, the father of data warehousing:

The data warehouse is a basis for informational processing. It is
defined as being
■ subject oriented;
■ integrated;
■ nonvolatile;
■ time variant;
■ a collection of data in support of management’s decisions.

Problems with traditional data warehouse models...
1. Active Data Warehouse - Difficulty maintaining transaction integrity, capacity planning, processing conflicts, cost.
2. Federated Data Warehouse - Very poor performance, no data integrity, no history of data, improper grains.
3. Star Schema - Resistant to change, limited scope for optimization, useful only at the lowest grain.
4. Data Mart - Problems with data reconciliation, maintenance issues, a rigid design that is not flexible to change, and difficulty implementing future changes.

Building the REAL Data Warehouse

The data warehouse is divided into 4 sectors. These are ...
  • Very Current (aka Interactive Sector) - Data that is up to 2 seconds old.
  • Current (aka Integrated Sector) - Data that is up to 24 hours old.
  • Near Line (aka Near Line Sector) - Data that is 3 to 4 years old.
  • Archival (aka Archival Sector) - Data that is older than 5 years.
The infrastructure is held together by metadata.
Data access is quick because it is based on sectors.
Data archiving is done automatically.
Each sector holds a smaller volume of data.
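
The age ranges above can be sketched as a small lookup. This is only an illustration, not Inmon's method: the function name sector_for_age is made up, each listed age is treated as an upper bound, and the Near Line Sector is assumed to cover everything up to the 5-year archival threshold, because the list leaves gaps between the stated ages. A year is approximated as 365 days.

```python
from datetime import timedelta

# Assumed cutoffs, taken loosely from the sector list above.
INTERACTIVE_MAX = timedelta(seconds=2)
INTEGRATED_MAX = timedelta(hours=24)
NEAR_LINE_MAX = timedelta(days=5 * 365)  # assumption: a year is 365 days


def sector_for_age(age: timedelta) -> str:
    """Return the name of the sector that data of the given age falls into."""
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    if age <= INTERACTIVE_MAX:
        return "Interactive"
    if age <= INTEGRATED_MAX:
        return "Integrated"
    if age <= NEAR_LINE_MAX:
        return "Near Line"
    return "Archival"
```

For example, sector_for_age(timedelta(hours=3)) falls in the Integrated Sector under these assumed cutoffs. In practice the movement between sectors is driven by probability of access rather than age alone, so the cutoffs should be read as rough guides.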

Interactive Sector - Only a modest volume of data is found in the Interactive Sector. Interactive data almost always resides on disk storage. In addition to having fast performance, the transactions that run through the Interactive Sector are able to do updates: data can be added, deleted, or modified.

Integrated Sector - This is where data is organized into major subject areas and where detail is kept. The summary data found in the Integrated Sector is summary data that is used in many places and that does not change. The data is granular: there are a lot of atomic units of data to be collected and managed. The data is historical: there is often 3 to 5 years’ worth of data. The data comes from a wide variety of sources.

Near Line Sector - Performance is enhanced by moving data with a low probability of access to the Near Line Sector. Because only data with a low probability of access is sent there, the data remaining on disk storage in the Integrated Sector is freed from the overhead of “bumping into” large amounts of data that is not going to be used.
Upon leaving the Near Line Sector, data normally moves into the Archival Sector. Note that the Archival Sector may also be fed data directly from the Integrated Sector without passing through the Near Line Sector. However, if the data has been moved into the Near Line Sector, then it is normally moved from there to the Archival Sector. The movement of data to the Archival Sector is made when the probability of accessing the data drops significantly.
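
The movement rules described above can be sketched as a small decision function. This is a rough illustration under stated assumptions: the name next_sector, the use_near_line flag, and both probability thresholds are hypothetical, since the text only says "low" and "drops significantly" without giving numbers.

```python
LOW_ACCESS = 0.10       # hypothetical threshold for "low probability of access"
VERY_LOW_ACCESS = 0.01  # hypothetical threshold for "drops significantly"


def next_sector(current: str, access_probability: float,
                use_near_line: bool = True) -> str:
    """Return the sector a data set should live in next.

    Only the movements mentioned in the text are modeled:
    Integrated -> Near Line, Near Line -> Archival, and
    Integrated -> Archival when the Near Line Sector is bypassed.
    """
    if current == "Integrated":
        if use_near_line and access_probability <= LOW_ACCESS:
            return "Near Line"
        if not use_near_line and access_probability <= VERY_LOW_ACCESS:
            return "Archival"  # direct feed, skipping the Near Line Sector
        return "Integrated"
    if current == "Near Line":
        if access_probability <= VERY_LOW_ACCESS:
            return "Archival"
        return "Near Line"
    return current  # other sectors are left where they are in this sketch
```

With these assumed thresholds, next_sector("Integrated", 0.05) yields "Near Line", while the same data set with use_near_line=False stays in the Integrated Sector until its access probability falls to the lower threshold.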

Archival Sector - When data is sent to the Archival Sector, it may or may not be appropriate to preserve the structure that the data had in the integrated or near-line environments. There are advantages and disadvantages to both approaches. One advantage of preserving the structure of the data as it passes into the Archival Sector is that it is easy to do: the data is simply read in one format and written out in the same format. That is about as simple as it gets. But there are some reasons this approach may not be optimal. One reason is that once the data is archived, it may not be used the same way it was in the Integrated Sector.
