Wednesday, March 19, 2008

Trends in Data Warehousing

Last year I co-authored a book chapter with two other colleagues, Peter O'Donnell and David Arnott, on the use of data warehouses for decision support, and it's just recently been published. The book is called Handbook on Decision Support Systems edited by Frada Burstein (another Monash colleague) and Clyde Holsapple. One section of the chapter that I wrote looked at current trends in DW practice, and I thought, as I wrote it in late 2006, that it would probably be better as a blog post, than part of a chapter in a (hopefully long-lived) book. Here's the excerpt. I'd be interested to hear what other people think are the big trends in DW and where it's headed.

Current Trends and the Future of Data Warehousing Practice

Forecasting future trends in any area of technology is always an exercise in inaccuracy, but there are a number of noticeable trends which will have a significant impact in the short-to-medium term. Many of these are a result of improvements and innovations in the underlying hardware and database management system (DBMS) software. The most obvious of these is the steady increase in the size and speed of data warehouses connected to the steady increase in processing power of CPUs available today, improvements in parallel processing technologies for databases, and decreasing prices for data storage. This trend can be seen in the results of Winter Corporation's "Top Ten Program," which surveys companies and reports on the top ten transaction-processing and data warehouse databases, according to several different measures. Figure 11 depicts the increase in reported data warehouse sizes from the 2003 and 2005 surveys (2007 data has not yet been released):

Ten Largest Global Data Warehouses by Database Size, 2003/2005. From Winter Corporation.

The data warehousing industry has seen a number of recent changes that will continue to have an impact on data warehouse deployments in the short-to-medium term. One of these is the introduction by several vendors, such as Teradata, Netezza and DATAllegro, of the concept of a data warehouse 'appliance' (Russom, 2005). The idea of an appliance is a scalable, plug-and-play combination of hardware and DBMS that an organization can purchase and deploy with minimal configuration. The concept is not uncontroversial (see Gaskell, 2005 for instance), but is marketed heavily by some vendors never-the-less.

Another controversial current trend is the concept of 'active' data warehousing. Traditionally, the refresh of data in a data warehouse occurs at regular, fixed points of time in a batch-mode. This means that data in the data warehouse is always out of date by a small amount of time (since the last execution of the ETL process). Active data warehousing is an attempt to approach real-time, constant refreshing of the data in the warehouse: as transactions are processed in source systems, new data flows through immediately to the warehouse. To date, however, there has been very limited success in achieving this, as it depends on not just the warehouse itself, but performance and load on source systems to be able to handle the increased data handling. Many ETL processes are scheduled to execute at times of minimal load (eg. overnight or on weekends), but active warehousing shifts this processing to peak times for transaction-processing systems. Added to this are the minimal benefits that can be derived from having up-to-the-second data in the data warehouse, with most uses of the data not so time-sensitive that decisions made would be any different. As a result, the rhetoric of active data warehousing has shifted to "right-time" data warehousing (see Linstedt, 2006 for instance), which relaxes the real-time requirement for a more achievable 'data when it's needed' standard. How this right-time approach differs significantly in practice from standard scheduling of ETL processing is unclear.

Other than issues of hardware and software, a number of governance issues are introducing change to the industry. One of these is the prevalence of outsourcing information systems - in particular the transaction-processing systems that provide the source data for warehouse projects. With many of these systems operated by third party vendors, governed by service level agreements that do not cover extraction of data for warehouses, data warehouse developers are facing greater difficulties in getting access to source systems. Arnott (2006) describes one such project where the client organization had no IT staff at all, and all 13 source systems were operated off-site. The outsourcing issue is compounded by data quality problems, which is a common occurrence. Resolution of data quality problems is difficult even when source systems are operated in-house: political confrontations over who should pay for rectifying data quality problems, and even recognition of data quality as a problem (in many cases, it's only a problem for data warehouse developers, as the transaction processing system that provides the source data is able to cope with the prevailing level of data quality) can be difficult to overcome. When the system is operated off-site and in accordance with a contractual service level agreement that may not have anticipated the development of a data warehouse, they become even more difficult to resolve.

In addition to the issues of outsourcing, alternative software development and licensing approaches are becoming more commonplace. In particular, a number of open source vendors have released data warehousing products, such as Greenplum's Bizgres DBMS (also sold as an appliance) based on the Postgres relational DBMS. Other open source tools such as MySQL have also been used as the platform for data warehousing projects (Ashenfelter, 2006). The benefits of the open source model are not predominantly to do with the licensing costs (the most obvious difference to proprietary licensing models), but rather have more to do with increased flexibility, freedom from a relentless upgrade cycle, and varied support resources that are not deprecated when a new version of the software is released (Wheatley, 2004). Hand-in-hand with alternative licensing models is the use of new approaches to software development, such as Agile methodologies (see (Ashenfelter, 2006). The adaptive, prototyping oriented approaches of the Agile methods are probably well suited to the adaptive and changing requirements that drive data warehouse development.

The increased use of enterprise resource planning (ERP) systems is also having an impact on the data warehousing industry at present. Although ERP systems have quite different design requirements to data warehouses, vendors such as SAP are producing add-on modules (SAP Business Warehouse) that aim to provide business intelligence-style reporting and analysis services without the need for a separate data warehouse. The reasoning behind such systems is obvious: since an ERP system is an integrated tool capturing transaction data in a single location, the database resembles a data warehouse, insofar as it's a centralized, integrated repository. However, the design aims of a data warehouse that dictate the radically different approach to data design described above in Sections 3.1 and 4 mean that adequate support for management decision-making requires something other than simply adding a reporting module to an ERP system. Regardless, the increased usage of ERP systems means that data warehouses will need to interface with these tools more and more. This will further drive the market for employees with the requisite skill set to work with the underlying data models and databases driving common ERP systems.

Finally, Microsoft's continued development of their Microsoft SQL Server database engine has produced a major impact on Business Intelligence vendors. Because of Microsoft's domination of end-user's desktops, it is able to integrate its BI tools with other productivity applications such as Microsoft Excel, Microsoft Word and Microsoft PowerPoint with more ease than their competitors. The dominance of Microsoft on the desktop, combined with the pricing of SQL Server, and the bundling of BI tools with the DBMS means that many business users already have significant BI infrastructure available to them, without purchasing expensive software from other BI vendors. Although SQL Server has been traditionally regarded as a mid-range DBMS, not suitable for large-scale data warehouses, Microsoft is actively battling this perception. They recently announced a project to develop very large data warehouse applications for an external and an internal client, to handle data volumes up to 270 terabytes (Computerworld, 2006). If Microsoft are able to dispel the perception that SQL Server is only suited for mid-scale applications, it will put them into direct competition with large-scale vendors such as Oracle, IBM and Teradata, with significantly lower license fees. Even if this is not achieved, the effect that Microsoft has had on business intelligence vendors will flow through to data warehousing vendors, with many changes being driven by perceptions of what Microsoft will be doing with forthcoming product releases.