Wednesday, March 19, 2008

Trends in Data Warehousing


Last year I co-authored a book chapter with two colleagues, Peter O'Donnell and David Arnott, on the use of data warehouses for decision support, and it has just recently been published. The book is the Handbook on Decision Support Systems, edited by Frada Burstein (another Monash colleague) and Clyde Holsapple. One section of the chapter that I wrote looked at current trends in DW practice, and I thought, as I wrote it in late 2006, that it would probably work better as a blog post than as part of a chapter in a (hopefully long-lived) book. Here's the excerpt. I'd be interested to hear what other people think are the big trends in DW and where it's headed.

Current Trends and the Future of Data Warehousing Practice

Forecasting future trends in any area of technology is always an exercise in inaccuracy, but there are a number of noticeable trends that will have a significant impact in the short-to-medium term. Many of these are the result of improvements and innovations in the underlying hardware and database management system (DBMS) software. The most obvious is the steady increase in the size and speed of data warehouses, driven by the steady increase in the processing power of available CPUs, improvements in parallel processing technologies for databases, and decreasing prices for data storage. This trend can be seen in the results of Winter Corporation's "Top Ten Program," which surveys companies and reports on the top ten transaction-processing and data warehouse databases according to several different measures. Figure 11 depicts the increase in reported data warehouse sizes from the 2003 and 2005 surveys (2007 data has not yet been released):


Ten Largest Global Data Warehouses by Database Size, 2003/2005. From Winter Corporation.

The data warehousing industry has seen a number of recent changes that will continue to have an impact on data warehouse deployments in the short-to-medium term. One of these is the introduction by several vendors, such as Teradata, Netezza and DATAllegro, of the concept of a data warehouse 'appliance' (Russom, 2005). The idea behind an appliance is a scalable, plug-and-play combination of hardware and DBMS that an organization can purchase and deploy with minimal configuration. The concept is not uncontroversial (see Gaskell, 2005 for instance), but is nevertheless marketed heavily by some vendors.

Another controversial current trend is the concept of 'active' data warehousing. Traditionally, the refresh of data in a data warehouse occurs in batch mode at regular, fixed points in time. This means that data in the data warehouse is always out of date by a small amount of time (since the last execution of the ETL process). Active data warehousing is an attempt to approach real-time, constant refreshing of the data in the warehouse: as transactions are processed in source systems, new data flows through immediately to the warehouse. To date, however, there has been very limited success in achieving this, as it depends not just on the warehouse itself, but on the ability of source systems to handle the increased processing load. Many ETL processes are scheduled to execute at times of minimal load (e.g. overnight or on weekends), but active warehousing shifts this processing to peak times for transaction-processing systems. Added to this are the minimal benefits of having up-to-the-second data in the data warehouse: most uses of the data are not so time-sensitive that the decisions made would be any different. As a result, the rhetoric of active data warehousing has shifted to "right-time" data warehousing (see Linstedt, 2006 for instance), which relaxes the real-time requirement for a more achievable 'data when it's needed' standard. How this right-time approach differs significantly in practice from standard scheduling of ETL processing is unclear.
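To make the staleness trade-off concrete, here is a minimal sketch in Python (the figures and function are hypothetical, purely for illustration) of the worst-case age of warehouse data under a nightly batch versus a 30-minute "right-time" micro-batch. Note that both are the same scheduled ETL mechanism, just with a different interval, which is the crux of the scepticism above:

```python
from datetime import timedelta

def worst_case_staleness(refresh_interval: timedelta,
                         etl_duration: timedelta) -> timedelta:
    """Maximum age of warehouse data: a transaction arriving just after an
    extract begins waits a full refresh interval, plus the ETL run time."""
    return refresh_interval + etl_duration

# A traditional nightly batch with a two-hour overnight ETL window.
nightly = worst_case_staleness(timedelta(hours=24), timedelta(hours=2))

# A 'right-time' micro-batch every 30 minutes with a five-minute ETL run.
micro = worst_case_staleness(timedelta(minutes=30), timedelta(minutes=5))

print(nightly)  # 1 day, 2:00:00
print(micro)    # 0:35:00
```

Whether a business decision would actually change if the answer were 35 minutes old rather than 26 hours old is precisely the cost-benefit question the right-time rhetoric glosses over.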

Other than issues of hardware and software, a number of governance issues are introducing change to the industry. One of these is the prevalence of outsourced information systems, in particular the transaction-processing systems that provide the source data for warehouse projects. With many of these systems operated by third-party vendors, governed by service level agreements that do not cover extraction of data for warehouses, data warehouse developers are facing greater difficulties in getting access to source systems. Arnott (2006) describes one such project where the client organization had no IT staff at all, and all 13 source systems were operated off-site. The outsourcing issue is compounded by data quality problems, a common occurrence. Resolving data quality problems is difficult even when source systems are operated in-house: political confrontations over who should pay for rectifying them, and even recognition of data quality as a problem at all (in many cases it is only a problem for data warehouse developers, since the transaction-processing system that provides the source data can cope with the prevailing level of data quality), can be difficult to overcome. When the system is operated off-site, in accordance with a contractual service level agreement that may not have anticipated the development of a data warehouse, these problems become even more difficult to resolve.
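The point that data quality is often "only a problem for data warehouse developers" can be illustrated with a hypothetical validation pass over extracted source rows (the field names and rules here are invented for the sketch): the transaction system happily stores these records, but the warehouse load has to quarantine some of them as quality exceptions:

```python
def validate(rows):
    """Split extracted source rows into loadable records and quality
    exceptions that need resolution before they reach the warehouse."""
    clean, exceptions = [], []
    for row in rows:
        problems = []
        if not row.get("customer_id"):
            problems.append("missing customer_id")
        if row.get("amount") is None or row["amount"] < 0:
            problems.append("invalid amount")
        if problems:
            exceptions.append((row, problems))
        else:
            clean.append(row)
    return clean, exceptions

rows = [
    {"customer_id": "C001", "amount": 125.50},
    {"customer_id": "", "amount": 80.00},      # the source TPS copes with this
    {"customer_id": "C002", "amount": -5.00},  # ambiguous: reversal or error?
]
clean, exceptions = validate(rows)
print(len(clean), len(exceptions))  # 1 2
```

Every quarantined row then becomes a question for someone who owns the source system, which is exactly where the political and contractual difficulties described above begin.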

In addition to the issues of outsourcing, alternative software development and licensing approaches are becoming more commonplace. In particular, a number of open source vendors have released data warehousing products, such as Greenplum's Bizgres DBMS (also sold as an appliance), based on the Postgres relational DBMS. Other open source tools such as MySQL have also been used as the platform for data warehousing projects (Ashenfelter, 2006). The benefits of the open source model lie not predominantly in licensing costs (the most obvious difference from proprietary licensing models), but rather in increased flexibility, freedom from a relentless upgrade cycle, and varied support resources that are not deprecated when a new version of the software is released (Wheatley, 2004). Hand-in-hand with alternative licensing models is the use of new approaches to software development, such as Agile methodologies (see http://www.agilealliance.org) (Ashenfelter, 2006). The adaptive, prototyping-oriented approaches of the Agile methods are probably well suited to the changing and evolving requirements that drive data warehouse development.

The increased use of enterprise resource planning (ERP) systems is also having an impact on the data warehousing industry at present. Although ERP systems have quite different design requirements to data warehouses, vendors such as SAP are producing add-on modules (SAP Business Warehouse) that aim to provide business intelligence-style reporting and analysis services without the need for a separate data warehouse. The reasoning behind such systems is obvious: since an ERP system is an integrated tool capturing transaction data in a single location, the database resembles a data warehouse, insofar as it's a centralized, integrated repository. However, the design aims of a data warehouse that dictate the radically different approach to data design described above in Sections 3.1 and 4 mean that adequate support for management decision-making requires something other than simply adding a reporting module to an ERP system. Regardless, the increased usage of ERP systems means that data warehouses will need to interface with these tools more and more. This will further drive the market for employees with the requisite skill set to work with the underlying data models and databases driving common ERP systems.

Finally, Microsoft's continued development of its SQL Server database engine has had a major impact on business intelligence vendors. Because of Microsoft's domination of end-users' desktops, it is able to integrate its BI tools with productivity applications such as Microsoft Excel, Microsoft Word and Microsoft PowerPoint with more ease than its competitors. This dominance on the desktop, combined with the pricing of SQL Server and the bundling of BI tools with the DBMS, means that many business users already have significant BI infrastructure available to them without purchasing expensive software from other BI vendors. Although SQL Server has traditionally been regarded as a mid-range DBMS, not suitable for large-scale data warehouses, Microsoft is actively battling this perception. It recently announced a project to develop very large data warehouse applications for an external and an internal client, handling data volumes up to 270 terabytes (Computerworld, 2006). If Microsoft is able to dispel the perception that SQL Server is only suited to mid-scale applications, it will be in direct competition with large-scale vendors such as Oracle, IBM and Teradata, with significantly lower license fees. Even if this is not achieved, the effect that Microsoft has had on business intelligence vendors will flow through to data warehousing vendors, with many changes being driven by perceptions of what Microsoft will do with forthcoming product releases.

12 comments:

Bruce Fowler said...

Perhaps the question is "what is driving your BI and/or DW project"? In my own organisational context, we have spent a number of years buying/building/hybridising transaction processing systems which continue to amass volumes of transaction-level data to support (ultimately) accounting processes and audit requirements.

These systems are tooled with expansive reporting capabilities but are largely grouping and sorting variations on a fairly nominal theme. Functionally, the organisation operates as a collection of processes (rather than silos) with systems supporting the same, but the data almost invariably retains a silo characteristic - each process operating in its own bubble.

As users master the systems they manage and the systems themselves approach optimal operating stability and efficiency, they have time to consider "what if". The question dominates their thinking, their planning, and their strategy. The advances in technology have certainly contributed to the delivery of DW and BI tools, however I think it’s also contributed to our ability – our capacity – to shift focus from transaction-level aggregations to more dynamic “what-if” analyses.

The politics of data and the concept of process silos continue to remain a barrier to effective (and timely) design and implementation – though I’m not sure whether in-sourcing, out-sourcing or co-sourcing arrangements contribute (positively or negatively) to this issue. I would be interested in any thoughts on this.

With respect to the active data warehousing concept, I have to wonder whether this really has a place as a data warehousing approach – aren’t you really transforming data from the RDBMS supporting the TPS to enable presentation of the data in a specific format/context? Hardly a DW concept I would have thought. “Right-time data warehousing” as a concept sounds like a scheduling discussion for the ETL process rather than a new data warehousing concept. I look forward to the “wrong-time” variant to be announced in the near future.

The governance issue described is a complex one, and perhaps a “lesson learned” for others considering out-sourcing arrangements for core information systems or processes – and perhaps one more reason why organisations should reconsider before out-sourcing their core information systems. (The designation of a core information system could be an issue of timing or changing requirements, so forward assessment during the risk assessment phase of forming the out-sourcing agreement should perhaps be an area of particular focus.)

Doesn’t the integrated nature of ERP systems present an opportunity to deliver DW/BI as a natural extension of the base RDBMS platform – presumably with less interference from issues such as data quality and data politics? Though the RDBMS design principles are different, there is no reason why this could not be implemented as a data warehouse in a separate database within the RDBMS (or of course, a separate RDBMS altogether), and launched through a common portal (the ERP system) to maintain user continuity.

I note the recent inclusion of BIDS in the Microsoft SQL Express with Advanced Tools, and (I believe) the relevant extensions into the Visual Studio Express environment. All of these tools are freely available with limited constraints (see http://www.microsoft.com/sql/prodinfo/features/compare-features.mspx). The power and accessibility of these tools bodes well for Microsoft, and for data warehousing and business intelligence in general, and will no doubt contribute to the continued uptake and evolution of the industries – particularly in small-to-medium enterprises.

I wonder whether the feature transference principle of motor vehicles – the expectation that features present in luxury motor vehicles today (particularly the Mercedes S-Class according to Top Gear’s Jeremy Clarkson) will find their way to everyday motor vehicles within five-to-ten years – is now being reflected in the data warehouse and business intelligence spaces (i.e. large to small-to-medium enterprises)?

Rob Meredith said...

Hi Bruce,

I reckon sourcing is one of the big issues at the moment. Apart from the obvious problems that it raises with simply getting at the data (one DW project I know of couldn't get their hands on *any* transaction data for the first six months of the project), it becomes a major governance issue (what's not, it seems?) when you throw an evolutionary dynamic at a resource managed through legally binding SLAs.

It's all driven by that need to ask the 'what-if' question. The answer to that then raises a 'well, why?' question, leading to further 'if that, then why not this?' questions. Life would be so much easier if users could just satisfy themselves with standard reports - let *us* decide what's important!

Thanks again for the comment, Bruce.

Cheers,

Rob.

Noman said...

Just a quick comment in regard to Active Data Warehousing... while it might not be an absolute requirement for every company, or in fact every industry, there are instances where ADW can deliver information that the organization can market the daylights out of. The gaming industry is one such example.

In regard to future trends? Well, how about "in-memory" warehouses? :)

Noman Jaffery

Bruce Fowler said...

Noman, how frequently is the ETL process run? Again, is this a scheduling issue for the ETL process rather than a new DW approach?

Rob Meredith said...

Hi Noman,

With data warehouses, there's always an exception to anything, it seems. Sure, there are *some* applications for active warehousing. As a major transformative theme for the industry, though, I don't see it any more than as a marketing ploy by vendors trying to differentiate themselves. 18 months after writing about it in the original chapter, I reckon it's a view that's been vindicated.

As for in-memory DBMSs, this is just another approach to caching data. Whether you put that functionality in the delivery system (ala OLAP cubes, etc.) or in the DBMS doesn't seem to be that significant to me. It's just one more tool in the DBA's arsenal for tuning performance.

Cheers,

Rob.

Gabriele Bartolini said...

Hello,

I have just bumped into your interesting blog. I am an active member of the PostgreSQL community from Italy (I have actually studied at Monash too in 2001!) and we are organising the first European PostgreSQL conference, which will be held at the Monash University Prato Centre in October 2008.

The data warehousing capabilities in PostgreSQL have improved so much, but it would be interesting to get some of you to come here to Prato and talk about "data warehousing" needs for an open-source ORDBMS such as PostgreSQL. The main developers from all over the world will be there, so it would be an incredible chance to have the business needs meet the community of developers.

Please let me know what you think. Thanks a lot.

Ciao,
Gabriele

Rob Meredith said...

Hi Gabriele,

You'll have a great time in Prato! We ran a conference there in 2004 and the location was spectacular. Unfortunately I won't be able to make it over to Italy in October (I'm blowing my travel budget to present a paper in Toulouse in July, at the 2008 version of the conference we ran in 2004). One of the other guys may be able to head over though, so I'll mention your comment to them.

I reckon there are two big issues for open source data warehouses. The first is mind-share and perception: most DBAs and consultants seem to see open source DBMSs as equivalent to MySQL, which in turn doesn't equate to 'industrial, DW-strength technology'. There's also a bit of mistrust there about the open source model. This is all broad generalisation of course, but as with any open source app, the hardest place to break into is big corporate IT.

The second is that I don't think there's much that needs to be done on the DBMS end for open source in the DW area. The technology itself is pretty much there. So long as there is an embrace of standards (even if they come from the software old guard like Microsoft - e.g. MDX), then I don't see any technological hiccups (note that I'm not a computer scientist, though - I'm sure there's some aspect of DBMS DW arcana that could be tweaked with a better algorithm). No, the big hole at the moment for open source is the delivery system - the business intelligence bit. I've yet to see a good, solid, intuitive front-end reporting and analysis suite that I'd be comfortable sitting a C-level executive down in front of. The closed-source vendors have their issues in this regard too (see my various posts on that topic!). In many cases, the open-source community doesn't understand real BI users (I saw a jaw-dropping discussion on Slashdot once that was characterised by a severe misunderstanding of what spreadsheets do).

The problem's not terminal, though, and it will just take a few people who know the area well to collaborate to pull a project together. I think it's probably just a matter of time before that happens.

Dan Linstedt said...

Hi Rob,

It seems you pulled my quote about active data warehousing and real-time out of context. That article you referenced that I wrote for Teradata was obviously not directed at the technical levels of defining "active data warehousing".

Right-Time merely means that the business REQUIRES the data to be available at a specific time, and can justify it with ROI calculations. Whether it's ETL or not has no bearing on the latency of arrival, nor the latency of use.

Keep in mind that "REAL-TIME" doesn't exist. There's no way to ever get data into the EDW at the same time it's actually created. The correct term should be "near-real-time".

If you'd like to reference "ETL" as a technology, then keep in mind that it grew up under batch operations, and getting ETL to run properly in less than 1 second latencies across multiple streaming feeds is (today) a near impossibility due to the engine design for high volume batch operations.

Just some additional thoughts, I have more - but the comments here are not the place to leave them. I'll be happy to place them in my blog at: http://www.B-eye-Network.com/blogs/linstedt

Rob Meredith said...

Hi Dan,

Thanks for taking the time to clarify your thoughts. Note though that I haven't actually quoted you on anything - my citation of your blog post was simply a link to an example of the shift in emphasis that has taken place in industry from ADW to 'right-time' data warehousing. I would still hold that this shift has come about because of the unrealistic expectations that ADW raised, but I'm fairly sceptical about the actual benefits of both approaches. After all, what is non-'right-time' data warehousing?

Based on your definition, RTDW is: "If the business needs to answer a question at the end of the day, every day - then a RTDW would refresh on a daily basis. If the business needs to answer a question every 30 minutes, then an RTDW would refresh every 30 minutes - assuming the data is available."

How could you have a successful DW that doesn't at least meet this standard? If a DW doesn't, then, ipso facto, it's a DW that isn't meeting a clear business requirement, and some other system or process needs to be put in place to meet that requirement.

Dan Linstedt said...

Hi Rob,

You've got the quote source incorrect for "right-time data warehousing". This terminology is used by my good friend Stephen Brobst, CTO of Teradata. I also don't believe that it is rhetoric - why? Because we shouldn't really build anything that the business can't justify with a business reason.

True REAL-TIME is not only difficult, it's expensive. An example of such a system might be one that the Florida hurricane center uses to track hurricane movement from satellite feeds every millisecond. They can't get the data fast enough.

Please remember that "right-time" may mean delivering second-level latency data to one part of the business, while delivering half-hour incremental updates to another part of the business. It only makes sense.

Cheers,
Dan Linstedt
DanL@RapidACE.com

Rob Meredith said...

Hi Dan,

Again, as I said above, I haven't quoted you at all. The purpose of citing your 2006 post was to point to an example of the usage of the term 'real time data warehousing'. I'm not quoting you on definition, nor am I attributing to you credit for inventing the term. For clarity, here's how I cited you in the original post:

As a result, the rhetoric of active data warehousing has shifted to "right-time" data warehousing (see Linstedt, 2006 for instance), which relaxes the real-time requirement for a more achievable 'data when it's needed' standard.

Note the "for instance" in the citation.

As for the merits or otherwise of the label, I'm definitely a skeptic. In terms of the actual meaning attributed to the label, as you use it in the linked blog post and by others elsewhere, I have no problem. Of course requirements for data timeliness need to be explicitly justified and explained. People need data when they need it, and that forms a requirement that the data warehouse should be able to meet. If it can't, what's the point of building it?

My problem with the term is that it's meaningless. All successful data warehouses have to meet the criteria that have been attributed to right-time data warehousing (indeed, arguably any system has to meet those criteria). Unless a concept actually offers something new, somehow helping people to build better systems, then it's nothing more than a marketing term, putting a new spin on an old concept: in this case, properly eliciting and justifying business requirements, and designing a system accordingly.

When has this never been the case?

If anything, it highlights a major flaw in one of the premises of the active DW concept: the assumption that all users actually need real-time data all the time. To a certain extent, active DW is technology for technology's sake (I can understand why a vendor would want to go down that road - it sells bigger, faster and more expensive boxes - but clients should be wary of this mindset).

Sure, RTDW makes sense in this context, as a reaction to the often unattainable goals of ADW. However, it's nothing new, and in fact is what people have been doing all along if they run their projects (any systems project!) properly: considering carefully, from a cost-benefit perspective, each aspect of a system design, whether it's the frequency of delivery of data, right through to the kinds of data visualisation used.

The label is rhetoric. The concept itself is just proper systems design, which good analysts and designers have been doing right from the start.
