Accession Number:



Improving Efficiency of Indexing by Using a Hierarchical Merge Approach

Descriptive Note:

Technical rept.

Corporate Author:


Personal Author(s):

Report Date:


Pagination or Media Count:



As electronic collections become larger and more numerous, systems capable of working with very large collections will be more and more in demand. On the other hand, computer main memory is relatively limited and constrained compared to the amount of data in a large collection. The requirement for larger memory while building big databases can sometimes make this resource a bottleneck for an information indexing system. INQUERY is a state-of-the-art, widely used, full-text information retrieval system. The INQUERY document indexing system consists of two main operations parsing and merging. The subsystem responsible for parsing is called the Parser. It creates partial inverted lists by scanning, lexically analyzing, and inverting documents. A partial inverted list contains document entries for a subset of the documents in the collection. It must be combined with other partial inverted lists for the same term to create a final inverted list for the document collection. The Parser buffers partial inverted lists in main memory and flushes them to intermediate files when the buffer is full. The subsystem responsible for merging is called the Merger. After all of the documents have been parsed, the Merger combines the intermediate files to produce the final inverted lists for the collection. The aim of this project is to solve the efficiency and scalability problem of the Merger for the INQUERY indexing system. The speed performance of the old Merger mergebtl degrades significantly when used in building big databases under tight memory space limitations. The authors have found a better solution by using a hierarchical merge approach. Timing tests on the new Merger indicates that merge time can be significantly reduced. Hierarchical merge nicely solves the problem of scalability and greatly improves the efficiency of building very large databases for information retrieval systems.

Subject Categories:

  • Information Science
  • Linguistics
  • Cybernetics

Distribution Statement: