In Search of an API for Scalable File Systems: Under the Table or Above it?
CARNEGIE-MELLON UNIV PITTSBURGH PA PARALLEL DATA LABORATORY
Cluster file systems have been used by the high performance computing (HPC) community at even larger scales for more than a decade. These cluster file systems, including IBM GPFS, Panasas PanFS, PVFS, and Lustre, are required to meet the scalability demands of highly parallel I/O access patterns generated by scientific applications that execute simultaneously on tens to hundreds of thousands of nodes. Thus, given the importance of scalable storage to both the DISC and the HPC worlds, we take a step back and ask whether we are at a point where we can distill the key commonalities of these scalable file systems. This is not a paper about engineering yet another "right" file system or database, but rather about how to evolve the most dominant data storage API, the file system interface, to provide the right abstraction for both DISC and HPC applications. What structures should be added to the file system to enable highly scalable and highly concurrent storage? Our goal is not to define the API calls per se, but to identify the file system abstractions that should be exposed to programmers to make their applications more powerful and portable. This paper highlights two such abstractions. First, we show how commodity large-scale file systems can support distributed data processing enabled by the Hadoop/MapReduce style of parallel programming frameworks. Second, we argue for an abstraction that supports indexing and searching based on extensible attributes, by interpreting BigTable as a file system with a filtered directory scan interface.
- Computer Programming and Software