There are various areas within data warehousing where tools can be utilized.
Consider candidates in the following categories:
- Data Cleansing
- Data Transformation and Load
- Data Access and Analysis (Query)
- Report Writing
Data Cleansing Tools
Tools in this category typically analyze, standardize, and consolidate (i.e., “scrub”) data from disparate legacy systems. Techniques such as lexical analysis, statistical matching, and pattern processing are used to clean data before it is migrated to the data warehouse.
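A minimal sketch of this kind of scrubbing in Python, assuming hypothetical customer records with inconsistent name and phone formats (the field names and matching rule are illustrative, not taken from any particular tool):

```python
import re

# Hypothetical legacy records with inconsistent formatting (illustrative only).
records = [
    {"name": "  ACME Corp. ", "phone": "(212) 555-0100"},
    {"name": "Acme Corp",     "phone": "212.555.0100"},
    {"name": "Widgets Ltd",   "phone": "646 555 0199"},
]

def standardize(record):
    """Standardize one record: trim and case-fold the name, keep only phone digits."""
    name = re.sub(r"[^a-z0-9 ]", "", record["name"].strip().lower())
    phone = re.sub(r"\D", "", record["phone"])
    return {"name": name, "phone": phone}

def consolidate(records):
    """Consolidate duplicates by matching on the standardized phone number."""
    seen = {}
    for rec in map(standardize, records):
        seen.setdefault(rec["phone"], rec)   # keep the first occurrence per phone
    return list(seen.values())

print(consolidate(records))  # the two ACME variants collapse into a single record
```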
Data Transformation and Load Tools
These tools convert data from source-system formats into the formats used in the data warehouse: the various source formats and notations are transformed into a single standard notation to which all data can be translated.
The tools are also used to load integrated data to various types of target databases, including mainframe and client/server platforms.
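The sketch below illustrates the idea under simple assumptions: two hypothetical source systems use different date and gender notations, the rows are translated into one common notation, and SQLite stands in for the target warehouse database.

```python
import sqlite3
from datetime import datetime

# Hypothetical rows from two source systems with different date and gender notations.
source_a = [{"cust": "C1", "dob": "03/15/1980", "sex": "M"}]
source_b = [{"cust": "C2", "dob": "1975-11-02", "sex": "male"}]

def to_standard(row, date_format):
    """Translate a source row into the warehouse's common notation."""
    return (
        row["cust"],
        datetime.strptime(row["dob"], date_format).strftime("%Y-%m-%d"),  # ISO dates
        {"m": "M", "male": "M", "f": "F", "female": "F"}[row["sex"].lower()],
    )

rows = [to_standard(r, "%m/%d/%Y") for r in source_a] + \
       [to_standard(r, "%Y-%m-%d") for r in source_b]

# Load the integrated rows into a target table (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer_dim (cust_id TEXT, birth_date TEXT, gender TEXT)")
db.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", rows)
print(db.execute("SELECT * FROM customer_dim").fetchall())
```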
Data Access and Analysis (Query) Tools
Access and analysis tools can be further subdivided into the following three categories: data dipping, data analysis, and data mining.
Data dippers are basic business tools that allow for generation of standard reports and queries. Data dippers perform simple analysis to answer standard business questions such as how many sales occurred last week or how many new customers were added last month. These tools typically use metadata to distance users from data warehouse complexities and to provide a user-friendly interface.
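As an illustration only, the snippet below answers one such standard question (“how many sales occurred last week?”) through a small metadata layer that maps a friendly report name onto a query; the table, columns, and dates are hypothetical.

```python
import sqlite3

# A "standard business question" of the kind a data dipper answers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (sale_date TEXT, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?)",
               [("2024-06-03", 120.0), ("2024-06-04", 80.0), ("2024-05-20", 50.0)])

# Metadata maps a user-friendly label onto the underlying query, hiding the schema.
standard_reports = {
    "Sales last week": "SELECT COUNT(*), SUM(amount) FROM sales_fact "
                       "WHERE sale_date BETWEEN '2024-06-03' AND '2024-06-09'",
}
print(db.execute(standard_reports["Sales last week"]).fetchone())  # (2, 200.0)
```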
On-line analytical processing (OLAP) tools provide complex on-line analysis against live data. These tools are designed to process large amounts of data and perform business analysis such as budgeting and forecasting, profitability analysis, tax planning and financial consolidations. OLAP tools provide multi-dimensional analysis, the ability to calculate ratios and variances, and the ability to display results in various formats (e.g., 2D and 3D cross-tabs, charts, and graphs). There are three distinct categories of OLAP tools, each of which is used to solve different business problems.
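A rough sketch of this style of multi-dimensional analysis, using pandas on a hypothetical fact set with region and quarter dimensions; it builds a 2D cross-tab and computes budget variances:

```python
import pandas as pd

# Hypothetical fact data with two dimensions (region, quarter) and two measures.
facts = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "actual":  [100.0, 120.0, 90.0, 110.0],
    "budget":  [95.0, 125.0, 100.0, 100.0],
})

# A 2D cross-tab: regions down the side, quarters across the top.
crosstab = facts.pivot_table(index="region", columns="quarter",
                             values="actual", aggfunc="sum")
print(crosstab)

# Ratios and variances of the kind OLAP tools compute for budgeting analysis.
facts["variance"] = facts["actual"] - facts["budget"]
facts["variance_pct"] = facts["variance"] / facts["budget"] * 100
print(facts[["region", "quarter", "variance", "variance_pct"]])
```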
Multi-dimensional OLAP (MOLAP) tools were the first OLAP tools to be developed. The tools typically create data cubes (multi-dimensional views of data, as opposed to traditional relational views) that are used to analyze data efficiently. Large amounts of summarized data are used to provide efficient response times. Data (and metadata) administration and deployment are controlled from a central location.
Desktop OLAP (DOLAP) tools create portable data that can be accessed and analyzed from user desktops. Data is typically extracted from the warehouse, formatted (e.g., dimensional data is assembled into cubes), and then provided to users. DOLAP tools are also becoming more effective at distributing data using mechanisms such as e-mail and the web (Web OLAP, or WOLAP). These tools are very useful for providing data to users who are disconnected from the network (e.g., notebook computers used by travelling sales staff).
As data warehouses continued to grow, it became clear that MOLAP databases could not be used to maintain massive amounts of data. Relational OLAP (ROLAP) tools and databases have been developed to address this problem. Multi-dimensional front ends are used to process information in a relational format. Tools allow for the transformation of two-dimensional data into multi-dimensional information. ROLAP is frequently implemented when users need to analyze large amounts of current data.
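The following sketch shows the ROLAP idea in miniature: a multi-dimensional view (category by region) is produced on the fly from two-dimensional relational tables, with SQLite and a hypothetical star schema standing in for the warehouse:

```python
import sqlite3

# A tiny hypothetical star schema: a fact table plus a product dimension.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product_dim (product_id INTEGER, category TEXT)")
db.execute("CREATE TABLE sales_fact (product_id INTEGER, region TEXT, amount REAL)")
db.executemany("INSERT INTO product_dim VALUES (?, ?)", [(1, "Hardware"), (2, "Software")])
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [(1, "East", 100.0), (2, "East", 40.0), (1, "West", 70.0)])

# ROLAP-style query: the multi-dimensional view (category x region) is assembled
# on demand from two-dimensional relational tables.
cube = db.execute("""
    SELECT p.category, f.region, SUM(f.amount)
    FROM sales_fact f JOIN product_dim p ON f.product_id = p.product_id
    GROUP BY p.category, f.region
""").fetchall()
print(cube)
```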
Data mining tools are used to analyze data to find trends and patterns within data and relationships between data. Artificial intelligence and neural networks are frequently used to create competitive intelligence by finding connections in data that may not be readily obvious (e.g., people who buy product A from a company are more inclined to buy product B). Business analysts who use data mining tools are typically very familiar with the data being searched.
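As a highly simplified stand-in for such techniques (a co-occurrence count rather than a neural network), the sketch below estimates how often customers who buy a hypothetical product A also buy product B:

```python
from itertools import combinations
from collections import Counter

# Hypothetical purchase baskets; each set lists products bought together.
baskets = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# Confidence of the rule "buys A -> also buys B", i.e. P(B | A).
confidence = pair_counts[("A", "B")] / item_counts["A"]
print(f"Customers who buy A also buy B {confidence:.0%} of the time")  # 75%
```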
Report Writing Tools
These tools provide a permanent record of the business at a specific point in time, in standard formats. Report writers typically offer a variety of attractive formats that are often used to present information to executives and other decision makers. It is also standard for these tools to produce reports in HTML format so that reports can be viewed from web sites. A wide audience of users can then access reports using a web browser, without any special software needing to be installed on user workstations.
Reports are frequently produced by data specialists, so a commitment to building a report “library” will increase the benefits obtained from these tools.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities (a combined sketch of these steps follows the note below):
- Data Extraction - Involves gathering data from multiple heterogeneous sources.
- Data Cleaning - Involves finding and correcting the errors in data.
- Data Transformation - Involves converting the data from legacy format to warehouse format.
- Data Loading - Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
- Refreshing - Involves propagating updates from the data sources to the warehouse.
Note: Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
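The following is a minimal, illustrative pipeline that strings these functions together; the source layouts, table names, and cleaning rule are assumptions, and SQLite again stands in for the warehouse:

```python
import sqlite3

def extract(sources):
    """Extraction: gather rows from multiple heterogeneous sources (lists of dicts here)."""
    return [row for source in sources for row in source]

def clean(rows):
    """Cleaning: drop rows with missing or negative amounts."""
    return [row for row in rows if row.get("amount") is not None and row["amount"] >= 0]

def transform(rows):
    """Transformation: convert legacy field names into the warehouse row format."""
    return [(row["cust"], float(row["amount"])) for row in rows]

def load(db, rows):
    """Loading: sort, insert, and maintain an index to support later queries."""
    db.execute("CREATE TABLE IF NOT EXISTS sales (cust_id TEXT, amount REAL)")
    db.execute("CREATE INDEX IF NOT EXISTS ix_sales_cust ON sales (cust_id)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", sorted(rows))

def refresh(db, new_source):
    """Refreshing: propagate newly arrived source rows into the warehouse."""
    load(db, transform(clean(extract([new_source]))))

db = sqlite3.connect(":memory:")
load(db, transform(clean(extract([[{"cust": "C1", "amount": 10}],
                                  [{"cust": "C2", "amount": None}]]))))
refresh(db, [{"cust": "C3", "amount": 5}])
print(db.execute("SELECT * FROM sales ORDER BY cust_id").fetchall())
```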
What are some of the popular data warehouse tools available?
Data warehouse tools are therefore something that every company should evaluate going forward. Here are some of the most popular data warehouse tools that can help your company meet its growing and comprehensive data needs.
Ab Initio Software
Ab Initio Software's products are aimed at helping companies perform fourth-generation data analysis, batch processing, data manipulation, and graphical user interface (GUI) based parallel processing. (GUI-based software is commonly used to extract, transform, and load data.) Ab Initio Software specialises in high-volume data processing applications and was founded more than 20 years ago, giving it considerable expertise in this field. Products from the company include the Graphical Development Environment, the Co-operating System, and Enterprise Meta, among others. The company also introduced a free, feature-limited version known as Elementum in 2010, though it was only available to customers who held a commercial license from the company.
Amazon Redshift
Another hosted data warehouse product, Amazon Redshift is part of Amazon Web Services, a large cloud computing platform. Built on massively parallel processing (MPP) technology, Redshift differs from the other databases offered by Amazon in that it can handle large analytics workloads, relying on massively parallel processing to manage such volumes of data. Partners of Amazon Redshift that provide data integration tools include Alooma, Attunity, FlyData, Informatica, SnapLogic, Talend and Xplenty.
AnalytiX DS
A software vendor, AnalytiX DS provides specialised data mapping tools along with data integration, data management, enterprise application integration and big data software and services. With its main office in Virginia, the company has offices in Asia and North America and an international team of service partners and technical assistants. The founder of AnalytiX DS, Mike Boggs, coined the term pre-ETL mapping. The company launched AnalytiX Mapping Manager, a premier tool capable of automating the pre-ETL source-to-target mapping process. With an investment of 50-100 crore, AnalytiX DS may open a new development centre in Bangalore in the coming years.
CodeFutures
Founded in 2001 by Andy Grove, CodeFutures is based in the United States. The company's main software product is dbShards, a NewSQL platform based on database sharding. What sets dbShards apart from other SQL products is that it is designed to provide scalability while working with traditional database platforms such as MySQL and PostgreSQL: companies do not have to replace their existing database engine, since dbShards can be used alongside it.
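The sketch below illustrates the general sharding idea rather than dbShards' actual interface: a simple router hashes a key to pick one of several ordinary SQL databases (SQLite standing in for MySQL or PostgreSQL), so each shard remains a conventional database engine.

```python
import sqlite3

# Several ordinary databases act as shards (SQLite stands in for MySQL/PostgreSQL).
shards = [sqlite3.connect(":memory:") for _ in range(3)]
for shard in shards:
    shard.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")

def shard_for(order_id):
    """Route a key to one shard; each shard stays a normal SQL database."""
    return shards[order_id % len(shards)]

def insert_order(order_id, total):
    shard_for(order_id).execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))

for oid, total in [(1, 9.5), (2, 20.0), (3, 7.25), (4, 15.0)]:
    insert_order(oid, total)

# Reads go to the same shard the key was routed to on insert.
print(shard_for(4).execute("SELECT * FROM orders WHERE order_id = 4").fetchone())
```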
DATAllegro
Another data warehouse tool, DATAllegro specialised in providing companies with appliances that perform a wide range of data warehouse functions. Founded by Stuart Frost in 2003, it was a direct competitor to the data warehouse appliance created by Netezza. While Netezza used commodity PowerPC chips, DATAllegro was implemented on commodity hardware from vendors such as Dell, Cisco and EMC. Like Netezza, DATAllegro also used an open-source software stack. In 2008, Microsoft acquired the company, and SQL Server Parallel Data Warehouse, which uses a version of the SQL Server database engine, is the successor to DATAllegro.
Holistic Data Management
Holistic data management, implemented in the AHISDATA framework, is used for implementing software within a company network. The framework can also perform a range of functions that include data governance, data quality, data integration and master data management. The specifications of holistic data management are the following:
1. All data objects in the warehouse must be either a child data object or a parent data object.
2. The data network scope must have only one parent data object.
3. A data mapping link must be present within all child data objects.
4. In the data management modules, there must exist at least one data object relationship.
Informatica Corporation
A software development company, Informatica was founded in 1993 in California. Its product portfolio focusses on data integration, cloud data integration, B2B data exchange, ETL, information lifecycle management, data replication, data virtualisation and complex event processing, among other functions. Together these components provide data warehouse facilities to companies across sectors. Informatica PowerCenter has three main components: the PowerCenter client tools (installed at the developer end), the PowerCenter repository (where all the metadata for an application is stored) and the PowerCenter server (where all the data executions take place). With a customer base of over 5,000 companies, Informatica has also launched Informatica Marketplace to allow companies to share and leverage data integration solutions. This tool offers over 1,300 pre-built mappings, templates and connectors to help companies manage and use their data effectively.
ParAccel
A California-based software company, ParAccel provides database management systems for companies and organisations across all sectors; it was acquired in 2013 by Actian. Two of the products offered by ParAccel were Amigo and Maverick. Amigo was designed to speed up queries directed at an existing data warehouse, while Maverick was designed to be a stand-alone data store. Amigo was scrapped by ParAccel in favour of Maverick, which later evolved into the ParAccel Analytic Database. A parallel relational database system, the ParAccel Analytic Database uses a shared-nothing architecture with columnar orientation and a memory-centric design to provide comprehensive data analysis. In addition, ParAccel offers built-in analytic functions such as standard deviation, and two off-the-shelf analytics packages called the Base Package and the Advanced Package.
Teradata Corporation
A publicly held international company headquartered in Ohio, Teradata offers analytic data platforms and related services. Teradata's analytic products are intended to help companies consolidate data from numerous sources and infer unique and important insights from it. The company has two divisions, data analytics and marketing applications, which look after data analytics platforms and marketing software respectively. By providing a parallel processing system, Teradata allows companies to retrieve and analyse data in a simple and effective manner. One of the most important features of this data warehouse application is that it segregates data into hot and cold, where cold data is data that is accessed less frequently. Teradata is considered one of the most popular data warehouse applications.