A Methodology for Empirical Performance Evaluation of Page Segmentation Algorithms

Mao, Song; Kanungo, Tapas

Active / Technical Report | Accession Number: ADA458685

Abstract:

Document page segmentation is a crucial preprocessing step in Optical Character Recognition (OCR) systems. While numerous page segmentation algorithms have been proposed, there is relatively little literature on comparative evaluation, empirical or theoretical, of these algorithms. In the existing performance evaluation methods, two crucial components are usually missing: (1) automatic training of algorithms with free parameters and (2) statistical and error analysis of experimental results. In this thesis, we use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: (1) we first create mutually exclusive training and test datasets with groundtruth, (2) we then select a meaningful and computable performance metric, (3) an optimization procedure is then used to search automatically for the optimal parameter values of the segmentation algorithms, (4) the segmentation algorithms are then evaluated on the test dataset, and finally (5) a statistical error analysis is performed to give the statistical significance of the experimental results. The automatic training of algorithms is posed as an optimization problem, and a direct search method, the simplex method, is used to search for a set of optimal parameter values. A paired-model statistical analysis and an error analysis are conducted to provide confidence intervals for the experimental results and to interpret the functionalities of the algorithms. This methodology is applied to the evaluation of five page segmentation algorithms, of which three are representative research algorithms and the other two are well-known commercial products, on 978 images from the University of Washington III dataset. It is found that the performances of the Voronoi, Docstrum and Caere segmentation algorithms are not significantly different from each other, but they are significantly better than that of ScanSoft's segmentation algorithm, which in turn is significantly better than that of the X-Y cut algorithm.
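
Illustrative sketch (not taken from the report): steps 3 and 5 of the methodology can be pictured roughly as follows. The abstract does not give the authors' objective function, simplex variant, or exact statistical procedure, so everything here is an assumption: `toy_error` is a hypothetical stand-in for a segmentation error metric averaged over training pages, `nelder_mead` is a simplified textbook simplex search (reflection, expansion, inside contraction, shrink), and `paired_confidence_interval` uses a normal approximation on per-image score differences.

```python
import math
from statistics import NormalDist, mean, stdev


def nelder_mead(f, x0, step=1.0, iters=300):
    """Minimise f with a simplified Nelder-Mead simplex search."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex displaced along each axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(v[i] for v in simplex[:-1]) / n for i in range(n)]

        def along(t):
            # Point on the line through the worst vertex and the centroid.
            return [centroid[i] + t * (centroid[i] - worst[i]) for i in range(n)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [
                    [best[i] + 0.5 * (v[i] - best[i]) for i in range(n)]
                    for v in simplex
                ]
    return min(simplex, key=f)


def paired_confidence_interval(scores_a, scores_b, level=0.95):
    """Normal-approximation confidence interval for the mean paired difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    centre = mean(diffs)
    return centre - half_width, centre + half_width


def toy_error(params):
    # Hypothetical error surface; a real one would run the segmenter
    # with these parameter values and score it against groundtruth.
    x, y = params
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


trained_params = nelder_mead(toy_error, [0.0, 0.0])

# Hypothetical per-image accuracies of two algorithms on the same test pages.
low, high = paired_confidence_interval(
    [0.90, 0.80, 0.85, 0.95], [0.70, 0.75, 0.60, 0.80]
)
```

If the interval returned for two algorithms excludes zero, their mean scores differ at the chosen confidence level under this approximation. The normal approximation is only reasonable for large samples, such as the 978 test images mentioned above; the four values here are purely for demonstration.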

Author(s):

Mao, Song; Kanungo, Tapas

Author Organization(s):

MARYLAND UNIV COLLEGE PARK CENTER FOR AUTOMATION RESEARCH

Descriptive Note:

Technical rept.

Supplementary Note:

Report no. CS-TR-4093. Sponsored in part by Army Research Laboratory. The original document contains color images.

Pagination:

0034

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Report Number(s):

LAMP-TR-87, CAR-TR-933

Contract/Grant Number(s):

MDA9049-6C-1250

Monitor Series:

DOD

Subject Terms

Joint Capability Areas:

JCA_5_Command and Control; JCA_1.2_Force Preparation; JCA_1_Force Support; JCA_5.3_Planning; JCA_8_Building Partnerships

Communities of Interest:

C4I

Descriptor(s):

*TEST AND EVALUATION, *ALGORITHMS, *COMPARISON, *OPTICAL CHARACTER RECOGNITION, OPTIMIZATION, STATISTICAL ANALYSIS, ERROR ANALYSIS, PERFORMANCE(ENGINEERING), PARAMETERS

Field(s)/Group(s):

Information Science, Operations Research, Cybernetics, Test Facilities, Equipment and Methods

Keyword(s):

*PAGE SEGMENTATION, *PERFORMANCE EVALUATION, GROUNDTRUTH, DATASETS

Report Date:

1999 Dec 01

Creation Date:

2006 Dec 27