External Memory Algorithms: Dealing with Massive Data
Final progress rept. 1 Aug 2001-30 Jun 2003
DUKE UNIV DURHAM NC DEPT OF COMPUTER SCIENCE
The bottleneck in many applications that process massive amounts of data is the I/O communication between internal memory and external memory. This bottleneck is accentuated as processors get faster and as parallel processors are used. Parallel disk arrays are often used to increase I/O bandwidth. The goal of this proposal is to deepen our understanding of the limits of I/O systems and to construct external memory algorithms that are provably efficient. The three measures of performance are the number of I/Os, disk storage space, and CPU time. Even when the data fit entirely in memory, communication can still be the bottleneck, and the related issues of caching become important. Theoretical work involves the development and analysis of provably efficient external memory algorithms and cache-efficient algorithms for a variety of important application areas. We address several batched and on-line problems involving text databases, prefetching and streaming data from parallel disks, and database selectivity estimation. Our experimental validation uses our TPIE programming environment. Plans for the coming year are to address bottleneck issues in parallel disks, text databases, and XML databases.
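To make the "number of I/Os" measure concrete, the following is a minimal sketch of how I/O cost is typically counted in the standard external memory model, where data moves between memory and disk in fixed-size blocks. It is an illustration only and is not taken from the report: the function names and the parameters n_items, memory_size, and block_size are assumptions introduced here, and the sort shown is a textbook multiway external merge sort on a single disk, not any specific algorithm from this project or from TPIE.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b, avoiding floating-point rounding."""
    return -(-a // b)


def scan_ios(n_items: int, block_size: int) -> int:
    """I/Os needed to read n_items sequentially, block_size items per transfer."""
    return ceil_div(n_items, block_size)


def merge_sort_ios(n_items: int, memory_size: int, block_size: int) -> int:
    """I/Os for a textbook external merge sort on one disk (hypothetical model).

    Assumes memory holds memory_size items and each I/O moves block_size items.
    Every pass reads and writes the whole data set once.
    """
    # One block buffer per input run plus one buffer for the output.
    fan_in = memory_size // block_size - 1
    if fan_in < 2:
        raise ValueError("memory must hold at least three blocks")

    n_blocks = ceil_div(n_items, block_size)
    runs = ceil_div(n_items, memory_size)  # sorted runs after run formation

    passes = 1  # the run-formation pass
    while runs > 1:
        runs = ceil_div(runs, fan_in)  # each merge pass combines fan_in runs
        passes += 1

    return 2 * n_blocks * passes  # one read and one write per block per pass


if __name__ == "__main__":
    n, m, b = 1_000_000, 10_000, 100
    print("scan I/Os:", scan_ios(n, b))
    print("sort I/Os:", merge_sort_ios(n, m, b))
```

Under these assumptions, increasing block_size or memory_size reduces the I/O count even though the CPU work is unchanged, which is why the I/O count is tracked as a performance measure separate from CPU time.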
- Computer Programming and Software
- Computer Hardware