What is a Data Warehouse?
Data Warehouse is a storage place for data. It is used to store current and historical information.
According to Ralph Kimball, “Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model”
According to BillInmon, “A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”
However, the definition by Bill Inmon is extensively accepted which is explained in detail below:
Subject-Oriented: Data Warehouse concentrates on a particular subject area.
For example, a data warehouse consisting the details of the sales as illustrated below
Integrated: Data warehouse consists of data that has been combined from numerous sources. The data can be obtained from mainframes, flat files, SQL Server database and various other heterogeneous sources as illustrated.
Time-variant: Data warehouse stores current and historical data. The data is used for decision-making and forecasting the business trends.
Non-volatile: The data stored in a data warehouse is retained and never deleted. A different version of data is stored in the data warehouse indicating if any inserts, updates or deletes have occurred. Hence, the original data is never altered.
Why do we need a data warehouse?
Data warehouses are used extensively in the largest and most complex businesses around the world. In demanding situations, good decision making becomes critical. Significant and relevant data is required to make decisions. This is possible only with the help of a well-designed data warehouse.
Enhancing the turnaround time for analysis and reporting: Data warehouse allows business users to access critical data from a single source enabling them to take quick decisions. They need not waste time retrieving data from multiple sources.
The business executives can query the data themselves with minimal or no support from IT which in turn saves money and time.
Improved Business Intelligence: Data warehouse helps in achieving the vision for the managers and business executives. Outcomes that affect the strategy and procedures of an organization will be based on reliable facts and supported with evidence and organizational data.
Benefit of historical data: Transactional data stores data on a day to day basis or for a very short period of duration without the inclusion of historical data. In comparison, a data warehouse stores large amounts of historical data which enables the business to include time-period analysis, trend analysis, and trend forecasts.
Standardization of data: The data from heterogeneous sources are available in a single format in a data warehouse. This simplifies the readability and accessibility of data.
For example, gender is denoted as Male/ Female in Source 1 and m/f in Source 2 but in a data warehouse the gender is stored in a format which is common across all the businesses i.e. M/F.
Immense ROI (Return On Investment): Return On Investment refers to the additional revenues or reduces expenses a business will be able to realize from any project. According to a 2002 International Data Corporation (IDC) study “The Financial Impact of Business Analytics”, analytics projects have been achieving a substantial impact on a business’ financial status.
What are the components of a Data warehouse?
The components of a data warehouse are depicted in the figure below
Data Sources
- A flat file database stores data in a normal text format. Contrary to a relational database where the data is stored in the form of tables, in a flat file database the data stored does not have a folders or paths related to them. No manipulations are performed on the data. Delimiters are used in flat files to separate the data columns.
- Excel spreadsheets are regularly used in data warehousing operations. They are impressive, low-priced, and flexible tolls that many decision-makers find convenient to use. Excel also provides graphing features that allow the end-user to present the required data in chart and graph formats. These formats can be easily integrated into MS Word and Power Point presentations.
- Operational systems of a business contain the day to day transactions of the data at a low-level. For example, the sales data, HR data, marketing data are used as input sources for a data warehouse.
- Legacy systems are the applications of the yesteryear. They mirror the requirements of a business that might be twenty to twenty five year old. They are use till date since over years these systems have captured the business knowledge and rules that are exceptionally difficult to translate to a new platform/application.
Staging Area
- The first part of the staging area is the most challenging process of extraction. Depending on how accurately the data is extracted the subsequent operations succeed or fail. The source systems might be complicated or poorly documented due to which the process becomes all the more difficult. The data may be extracted not only once but also periodically when changes occur at the source side.
- The second stage is the transformation where the data is converted from one format to another. Since data often exists in different locations and formats across the enterprises, data conversion is mandatory to ensure that data from one application is comprehensible to other applications and databases.
- The third stage is the loading where the extracted and transformed data is loaded into a data mart or a data warehouse depending on the business. The populated data is used for presentation applications by the end users.
Data Repository
The data is loaded into a data warehouse in the form of facts and dimensions
Users
The loaded data is accessed for reporting, analysis, and mining. The reporting tools like Business Objects and Cognos are used by users to generate reports. The data is also used for predicting trends
What is the difference between OLTP and a Data Warehouse?
OLTP
- Online Transactional Processing constitutes of a large number of brief on-line transactions. It focuses on quicker query processing
- The database queries are generally uncomplicated, need sub-second responses and return comparatively fewer records
For example, when we withdraw cash from an ATM we get a mini statement issuing the withdrawal amount, balance amount, data of withdrawal and transaction number. This transaction falls under an OLTP. The mini statement consist of fewer records and takes sub-seconds to process.
On the other hand, when we need a bank statement for a year we request the bank for an account statement. This statement falls under a data warehouse operation. We can view transactions occurred over a year with the dates and the updated amounts over each transaction and the closing balance.
OLTP
|
OLAP
| |
Origin of data
|
Transactional data
|
Data combined from several sources
|
Objective
|
Used for basis business tasks
|
Used for predicting future trends and analysis
|
Inserts and Update Operations
|
Data is inserted and updated regularly by the end users
|
Data is refreshed by long-running batch jobs
|
Nature of queries
|
Simple queries with lesser number of records
|
Complex queries required
|
Processing speed
|
Very fast
|
Takes a longer time
|