A Methodology for Empirical Performance Evaluation of Page Segmentation Algorithms
MARYLAND UNIV COLLEGE PARK CENTER FOR AUTOMATION RESEARCH
Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) systems. While numerous page segmentation algorithms have been proposed, there is relatively little literature on the comparative evaluation, empirical or theoretical, of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: (1) automatic training of algorithms with free parameters and (2) statistical and error analysis of experimental results. In this thesis, we use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: (1) we first create mutually exclusive training and test datasets with groundtruth, (2) we then select a meaningful and computable performance metric, (3) an optimization procedure is then used to search automatically for the optimal parameter values of the segmentation algorithms, (4) the segmentation algorithms are then evaluated on the test dataset, and finally (5) a statistical error analysis is performed to give the statistical significance of the experimental results. The automatic training of algorithms is posed as an optimization problem, and a direct search method, the simplex method, is used to search for a set of optimal parameter values. A paired-model statistical analysis and an error analysis are conducted to provide confidence intervals for the experimental results and to interpret the functionalities of the algorithms. This methodology is applied to the evaluation of five page segmentation algorithms, of which three are representative research algorithms and the other two are well-known commercial products, on 978 images from the University of Washington III dataset. It is found that the performances of the Voronoi, Docstrum, and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which in turn is significantly better than that of the X-Y cut algorithm.
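Steps 3 and 5 of the methodology can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that the "simplex method" refers to the Nelder-Mead downhill simplex direct search, it replaces the real training objective (segmentation error on the training set) with a hypothetical smooth function `toy_training_error`, and it uses a normal approximation for the paired confidence interval, which is a simplification chosen here and not something the abstract specifies. All function names and the sample scores are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def nelder_mead(f, x0, step=1.0, max_iter=500, ftol=1e-8, xtol=1e-6):
    """Textbook Nelder-Mead direct search; returns (best_params, best_value)."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex offset by `step` along each axis.
    simplex = [list(x0)] + [
        [x + (step if j == i else 0.0) for j, x in enumerate(x0)] for i in range(n)
    ]
    values = [f(v) for v in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=values.__getitem__)
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        size = max(abs(v[j] - simplex[0][j]) for v in simplex[1:] for j in range(n))
        if values[-1] - values[0] < ftol and size < xtol:
            break  # simplex is small and flat: stop
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line from the worst vertex through the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        f_reflected = f(reflected)
        if f_reflected < values[0]:
            expanded = along(2.0)  # reflection is the new best: try going further
            f_expanded = f(expanded)
            if f_expanded < f_reflected:
                simplex[-1], values[-1] = expanded, f_expanded
            else:
                simplex[-1], values[-1] = reflected, f_reflected
        elif f_reflected < values[-2]:
            simplex[-1], values[-1] = reflected, f_reflected
        else:
            contracted = along(-0.5)  # pull the worst vertex toward the centroid
            f_contracted = f(contracted)
            if f_contracted < values[-1]:
                simplex[-1], values[-1] = contracted, f_contracted
            else:
                best = simplex[0]  # shrink every vertex halfway toward the best
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)] for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]
    i_best = min(range(n + 1), key=values.__getitem__)
    return simplex[i_best], values[i_best]


def paired_confidence_interval(scores_a, scores_b, level=0.95):
    """Normal-approximation CI for the mean per-image difference (a minus b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired analysis needs one score per image for each algorithm")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = stdev(diffs) / sqrt(len(diffs))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(diffs)
    return centre - z * std_err, centre + z * std_err


def toy_training_error(params):
    # Hypothetical stand-in for "segmentation error on the training set".
    return 0.1 + (params[0] - 3.0) ** 2 + (params[1] - 1.5) ** 2


best_params, best_error = nelder_mead(toy_training_error, [1.0, 1.0], step=0.5)
print("trained parameters:", best_params, "training error:", best_error)

# Made-up per-image accuracies of two algorithms on the same six test images.
algo_a = [0.90, 0.85, 0.88, 0.92, 0.87, 0.91]
algo_b = [0.80, 0.78, 0.79, 0.85, 0.75, 0.83]
print("95% CI for mean difference:", paired_confidence_interval(algo_a, algo_b))
```

A confidence interval that excludes zero would suggest that the two algorithms differ significantly on the test set, while an interval that contains zero would correspond to a "not significantly different" finding. A direct search suits this kind of training because it only needs values of the objective, which matters when the segmentation error is not differentiable in the algorithm parameters.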
- Information Science
- Operations Research
- Test Facilities, Equipment and Methods