Tuesday, January 16, 2007

Binary Search Broken

Still on the theme of ubiquitous bugs, I was chatting with a friend the other day and he mentioned that the canonical description of the binary search algorithm contained a bug. The Official Google Research Blog has all the details. The binary search algorithm, for those of you who don't have a CompSc background, is an amazingly efficient algorithm for searching an ordered list of items - elegant and simple, it's much more efficient than a simple linear search: a straight through iterative search is order N/2 (on average, for a list of N elements, the search will take N/2 iterations to find a specific element); binary search is order log (2) N. The basic algorithm works like this:

  1. Find the midpoint in the ordered list
  2. Compare the middle element to the value you are searching for.
  3. If the element is greater than the search key, then discard the top half of the list.
  4. If the element is less than the search key, then discard the bottom half of the list.
  5. With the remaining list, find the new midpoint, and repeat until the search term is found.
For a list of 1000 elements, finding a specific element using a linear search would take, on average, 500 comparisons. The algorithm above, a little less than 10. I remember being blown away by this algorithm in an undergraduate lecture - it's so simple, and so powerful. In fact, this divide and conquer approach is used for a number of list-based operations: sorting, searching, and so on.

A typical implementation of the binary search algorithm, and the one used in the Java Developers' Kit, and other code libraries looks like this (taken from the blog post linked to above, and direct from the JDK):

1:     public static int binarySearch(int[] a, int key) {
2: int low = 0;
3: int high = a.length - 1;
4:
5: while (low <= high) {
6: int mid = (low + high) / 2;
7: int midVal = a[mid];
8:
9: if (midVal < low =" mid"> key)
10: low = mid + 1;
11: else if (midVal > key)
12: high = mid - 1;
13: else
14: return mid; // key found
15: }
16: return -(low + 1); // key not found.
17: }
The problem is in the line that finds the midpoint in the list: (low+high)/2. For most applications, this will work fine, but as low and high get very large, there's a danger that the maximum integer value for a variable is approached (that's 2^31-1, or about 2 billion for Java). In other words, if the search list contains billions of elements, the algorithm as implemented above will overflow to a negative value (since the topmost bit represents the sign of a number), and throw an error when you try to look up that element.

There are solutions of course (there are other ways to calculate the midpoint without adding two very large numbers together). But the bug is a timebomb for any application that needs to search or sort very large lists. Sure, 2 billion is a large number, but I'm sure there's a few data warehouses out there that would be dangerously close to that number in terms of fact table rows. Be sure that your DBMS vendor is all across this - it took 10 years for the bug to show up in Java.

No comments: