Tuesday, January 23, 2007

Finnish BI Video

Hot on the heels of the Digg BI Stories RSS feed I've added a feed for BI videos on YouTube. Not a lot there at the moment, but thought I'd link to this one. It's a Finnish promotional video with SAS at what I'm guessing is a business software conference, and it's here purely because of the way they say "decision support" and "business intelliggence". Those wacky Finns...

You're a BI What? The Myopic View of BI Vendors

An excellent article that I came across while adding the Digg functionality to this blog yesterday (notice the new Digg It! button below, and the list of BI stories on Digg on the right!) does a wonderful job of outlining some of the major shortcomings of the BI industry at the moment. Neil Raden, from consulting outfit Hired Brains and the fingertips behind the Addicted to BI blog, gets stuck in, criticising vendors for a myopic product/customer/sales view of organisations and for being too focused on the software/hardware tool rather than the all-important decision support. Shades of Peter Keen there, who made the same criticism of DSS developers and vendors back in the 1980s - it's the first 'S' in DSS that's the most important, i.e. support should take precedence over the system (since the latter is only a means to achieving the former).

Raden also talks about the potential of Web 2.0 (ugh) concepts that may be of benefit to decision-makers: collaboration, tagging, etc. Although I'm not a fan of the label, I am very supportive of Web 2.0 thinking that sees social interaction and bottom-up creation of content as the key to useful tools on the web. If we can overcome the problems we currently have in getting BI software users to contribute metadata, etc., then some interesting things might happen.

My favourite quote from Neil comes at the end, though, and goes directly to the issue of the provision of support to decision-makers. He makes the point that most business people are tech-savvy, often more aware of the latest tech trends than internal IT support staff. They just won't put up with crappy support:

"Look, I'm playing a 3-D video strategy game with four people in China I don't even know, while I'm downloading data to my iPod, while I'm answering messages in Yahoo messenger. Are you going to tell me I can't have a report for three months because it has to go through QA?"

Tuesday, January 16, 2007

Binary Search Broken

Still on the theme of ubiquitous bugs, I was chatting with a friend the other day and he mentioned that the canonical description of the binary search algorithm contained a bug. The Official Google Research Blog has all the details. The binary search algorithm, for those of you who don't have a CompSc background, is an amazingly efficient way of searching an ordered list of items. Elegant and simple, it's much more efficient than a straight-through linear search, which is order N/2 (on average, for a list of N elements, it takes N/2 comparisons to find a specific element); binary search is order log₂ N. The basic algorithm works like this:

  1. Find the midpoint of the ordered list.
  2. Compare the middle element to the value you are searching for; if they're equal, you're done.
  3. If the middle element is greater than the search key, discard the top half of the list.
  4. If the middle element is less than the search key, discard the bottom half of the list.
  5. Find the new midpoint of the remaining list and repeat, until either the search term is found or the list is empty (in which case the term isn't there).
For a list of 1000 elements, finding a specific element using a linear search would take, on average, 500 comparisons. The algorithm above, a little less than 10. I remember being blown away by this algorithm in an undergraduate lecture - it's so simple, and so powerful. In fact, this divide and conquer approach is used for a number of list-based operations: sorting, searching, and so on.
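To see that 500-versus-10 gap concretely, here's a quick sketch (the class and method names are mine, not from the post) that counts how many loop iterations binary search actually needs on a sorted 1000-element list:

```java
// Counts the comparisons binary search makes on a sorted array,
// versus the N/2 average a linear search would need.
public class SearchCost {
    // Binary search that returns the number of loop iterations
    // taken to locate key (assumes key is present in a).
    static int binarySearchSteps(int[] a, int key) {
        int low = 0, high = a.length - 1, steps = 0;
        while (low <= high) {
            steps++;
            int mid = (low + high) / 2;
            if (a[mid] < key)      low = mid + 1;
            else if (a[mid] > key) high = mid - 1;
            else                   return steps;
        }
        return steps; // not reached when key is present
    }

    public static void main(String[] args) {
        int[] a = new int[1000];
        for (int i = 0; i < a.length; i++) a[i] = i;

        // Worst case over every possible key in the list:
        int worst = 0;
        for (int key = 0; key < 1000; key++)
            worst = Math.max(worst, binarySearchSteps(a, key));

        // floor(log2(1000)) + 1 = 10, so no search takes more steps
        System.out.println("worst case: " + worst + " comparisons");
    }
}
```

A linear search over the same list averages 500 comparisons; the worst key here still falls out in 10.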

A typical implementation of the binary search algorithm - the one used in the Java Development Kit (JDK), among other code libraries - looks like this (taken from the blog post linked to above, direct from the JDK):

public static int binarySearch(int[] a, int key) {
    int low = 0;
    int high = a.length - 1;

    while (low <= high) {
        int mid = (low + high) / 2;
        int midVal = a[mid];

        if (midVal < key)
            low = mid + 1;
        else if (midVal > key)
            high = mid - 1;
        else
            return mid; // key found
    }
    return -(low + 1); // key not found.
}
The problem is in the line that finds the midpoint of the list: (low + high) / 2. For most applications this works fine, but as low and high get very large, their sum approaches the maximum value an int variable can hold (2^31 - 1, or about 2 billion, in Java). In other words, once the search list grows past about a billion elements, low + high can overflow to a negative value (since the topmost bit represents the sign of the number), giving a negative midpoint - and the attempt to index the array with it throws an ArrayIndexOutOfBoundsException.
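You don't need a billion-element array to watch it happen - feeding plausible large index values straight into the midpoint calculation is enough (a minimal sketch; the class and method names are mine):

```java
// Demonstrates how the classic midpoint calculation overflows
// once low + high exceeds Integer.MAX_VALUE (2^31 - 1).
public class MidpointOverflow {
    // The midpoint calculation exactly as written in the JDK's
    // binarySearch.
    static int brokenMid(int low, int high) {
        return (low + high) / 2;
    }

    public static void main(String[] args) {
        // Plausible values partway through a search of a
        // ~2-billion-element array:
        int low  = 1500000000;
        int high = 2000000000;

        // low + high = 3,500,000,000 won't fit in an int, so it
        // wraps around to a negative number - and a[mid] on this
        // "midpoint" throws ArrayIndexOutOfBoundsException.
        System.out.println(brokenMid(low, high)); // negative!
    }
}
```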

There are solutions of course (there are other ways to calculate the midpoint without adding two very large numbers together). But the bug is a timebomb for any application that needs to search or sort very large lists. Sure, 2 billion is a large number, but I'm sure there are a few data warehouses out there that would be dangerously close to that number in terms of fact table rows. Be sure that your DBMS vendor is all across this - it took 10 years for the bug to show up in Java.
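For what it's worth, the fixes the Google post describes avoid ever forming low + high as a signed sum: either add half the gap to low, or treat the wrapped sum as unsigned with Java's >>> operator. A quick sketch (the class and method names are mine):

```java
// Two overflow-safe ways to find the midpoint between two
// non-negative array indices low <= high.
public class SafeMidpoint {
    // Add half the distance to low; high - low can't overflow
    // when both are valid, non-negative indices.
    static int midViaGap(int low, int high) {
        return low + (high - low) / 2;
    }

    // >>> is Java's unsigned right shift: it shifts a zero into
    // the sign bit, so the wrapped-around sum is reinterpreted
    // and halved as an unsigned quantity.
    static int midViaShift(int low, int high) {
        return (low + high) >>> 1;
    }

    public static void main(String[] args) {
        int low = 1500000000, high = 2000000000;
        // (low + high) / 2 would be negative here; both safe
        // versions give the true midpoint, 1750000000.
        System.out.println(midViaGap(low, high));   // 1750000000
        System.out.println(midViaShift(low, high)); // 1750000000
    }
}
```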

Thursday, January 11, 2007

Excel Patch for Standard Deviation Bug

Just noticed a new update for the Mac version of Microsoft Office 2004 (11.3.3) - note that it's not yet showing up in the automatic update tool. One of the fixes included in the update is for:

an issue that causes standard deviation calculations to produce inaccurate results when the calculations are used in PivotTable reports.

For those of you running Macs (and there are a few, judging by our logs) and using StDev in your pivot tables (or using a tool that does), get updating.

It does make you think - reliance on any one tool always exposes us to the risk that bugs or kludges in implementation will give us incorrect results, and particularly so with Excel given its ubiquity. I couldn't find any more details on the bug in my quick hunt on Microsoft's site or via a Google search, but did turn up these papers critiquing Excel 97's implementation of a number of statistical functions (referred to and addressed by Microsoft in this KnowledgeBase article):
  • Knüsel, L., On the accuracy of statistical distributions in Microsoft Excel 97, Computational Statistics and Data Analysis, 26, 375-377, 1998.
  • McCullough, B.D. & Wilson, B., On the accuracy of statistical procedures in Microsoft Excel 97, Computational Statistics and Data Analysis, 31, 27-37, 1999.
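Microsoft hasn't said exactly what went wrong in the PivotTable case, but one criticism in the papers above is of spreadsheets using the single-pass "calculator" formula for variance, which cancels catastrophically when the values are large relative to their spread. Here's a sketch of that general failure mode (the data and method names are mine - this illustrates the class of bug, not the actual PivotTable defect):

```java
// Illustrates catastrophic cancellation in the one-pass
// "calculator" formula for sample variance, versus the
// numerically stable two-pass version.
public class VarianceDemo {
    // One-pass formula: (sum of x^2 - n * mean^2) / (n - 1).
    // Fast, but it subtracts two huge, nearly equal numbers.
    static double onePassVariance(double[] x) {
        double sum = 0, sumSq = 0;
        for (double v : x) { sum += v; sumSq += v * v; }
        double mean = sum / x.length;
        return (sumSq - x.length * mean * mean) / (x.length - 1);
    }

    // Two-pass formula: compute the mean first, then sum the
    // squared deviations from it.
    static double twoPassVariance(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        double mean = sum / x.length;
        double ss = 0;
        for (double v : x) ss += (v - mean) * (v - mean);
        return ss / (x.length - 1);
    }

    public static void main(String[] args) {
        // Large values, small spread: the true sample variance
        // is 30, which the two-pass version returns exactly.
        double[] x = {1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16};
        System.out.println("one-pass: " + onePassVariance(x));
        System.out.println("two-pass: " + twoPassVariance(x)); // 30.0
    }
}
```

The one-pass answer isn't just a little off - with data like this it can even come out negative, which is exactly the kind of quiet wrongness that makes spreadsheet bugs so dangerous.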
For those of you interested in the use of spreadsheets (in general, not just Excel) and the associated risks, check out the European Spreadsheet Risks Interest Group's site.