The realities of data warehousing today are:
• Multiple, complex applications will use the data warehouse serving a variety of users
• Data size is exploding and will continue to explode
• Data warehouse data latency is becoming intolerable as real-time data is demanded
Applications
Data warehousing is important to a business to the degree that it can provide either short- or long-term value to the bottom line. Many of the applications from where the return on investment (ROI) is generated today are related to the customer and the ability to attract and retain high value and high potential customers on a marketing budget that must be spent wisely.
Otherwise, data is analyzed for initiatives around fraud detection, marketing automation, workforce management, customer analysis and management, planning and analyzing manufacturing, the supply chain and the contact center and analyzing the key performance indicators of a company’s performance and direction. Often, central data for key subject matter like customers, products and sales are applicable to a variety of the applications. This leads to high concurrency requirements. In summary, data warehousing is applicable across the board to a company. There is scarcely a strategic or tactical company objective that cannot be supported with the information generated from a data warehouse.
Data Size
Data warehouse size is booming due to a variety of factors, but mainly because success begets success with data warehousing. As the initial data generates profitable use and the platform proves able to handle the workload, it is a matter of time until new uses leverage the data and add their different data requirements to the warehouse. And using detailed data in conjunction with summary data is important for effective decision making, further contributing to data overload.
Seldom is it feasible to delete, or otherwise render inaccessible, older data. Plan on data simply accumulating ad infinitum in the warehouse. Plan on loading all the historical data you have to seed the warehouse with as well. Companies are also realizing the usefulness of data that is generated outside of their confines – so called third-party data. It is no longer difficult or untested to “subscribe”
to external data feeds to augment internally generated data. Marketing departments in particular have grown in their sophistication to deal with all kinds of data and the more, the better.
Real-Time
More and sooner is always better. Real-time is becoming in vogue for data warehouses. If overnight loading is good, intra-day loading is better and “real time” loading is best. Data warehouses are absorbing many “operational” functions today that operational systems are unable to provide. Often, it is easier to tackle issues with a modern, supportable data warehouse than it is to change operational systems. But that does not mean the added functionality can impede the system’s operation. Users demand summarized historical trends but they also demand real-time data. It is critical to offer business users access to critical information as business events occur, enabling them to better focus on driving performance across the business. Enable users to combine real-time data generated by transactional systems with historical time-slice information stored in the data warehouse for serious value. Do not build non-real-time extraction, transformation and loading routines for data warehouses today that cannot be easily altered for real-time. Your data warehouse must be able to address the real-time data integration challenge.
Architecture Requirements
The trends above are rather obvious to an astute observer of the data warehousing industry. However, acknowledging the industry direction and being able to support it are two different things. Look beyond vendor hype. Look behind the eye-catching headlines for true substance to see if there is real production applicability to you.
Of the major inputs to a data warehouse architecture specification and DBMS selection, eventual data size is one of the most important. At certain levels of data size (in the terabytes), data warehouse programs tend to gravitate towards similar levels of usage and complexity. There are programs in the multiple terabytes today with thousands of users and this represents the eventuality of many data warehouses starting out today. Make sure your DBMS selection can accommodate a true eventual production data warehouse environment.
Criteria for a Data Warehouse DBMS Selection
The data warehouse DBMS selection is critical and acts as a catalyst for all other technology decisions. The technology needs to support both the immediate as well as future, unspecified and unknown requirements. Ideally the DBMS selection should be the first technology decision made for a data warehouse project. Given the state of the marketplace, described above, the technical architecture should be:
• Scalable – In both performance capacity and incremental data volume growth. Make sure the proposed solution scales in a near-linear fashion and behaves consistently with growth in all of database size, number of concurrent users and complexity of queries. Understand additional hardware and software required for each of the incremental uses.
• Powerful – Designed for complex decision support activity in a multi-user mixed workload environment. Check on the maturity of the optimizer for supporting every type of query with good performance and determining the best execution plan based on changing data demographics. Check on conditional parallelism and what the causes are of variations in the parallelism deployed. Check on dynamic and controllable prioritization of resources for queries.
• Manageable – Through minimal support tasks requiring DBA/System Administrator intervention. It should provide a single point of control to simplify system administration. You should be able to create and implement new tables and indexes at will.
• Extensible – Provide flexible database design and system architecture that keeps pace with evolving business requirements and leverages existing investment in hardware and applications. What is required to add and delete columns? What is the impact of repartitioning tables?
• Available – Supports mission critical business applications with minimal down time (a fading concept). Check on “hot pluggable” components, understand system down time requirements and any issues that might deny or degrade service to end users. These can include batch load times, software/hardware upgrades, severe system performance issues and system maintenance outages.
• Interoperable – Integrated access to the web, internal networks, and corporate mainframes.
• Affordable – Proposed solution (hardware, software, services, required customer support) providing a low total cost of ownership (TCO) over multi-year period.
• Proven – You don’t want to risk a critical decision regarding a fundamental underpinning of the data warehouse environment on an unproven solution.
• Flexible – Provides optimal performance across the full range of normalized, star and hybrid data schemas with large numbers of tables. Look for proven ability to support multiple applications from different business units, leveraging data that is integrated across business functions and subject areas.