Accession Number:



Portable Language-Independent Adaptive Translation from OCR

Descriptive Note:

Quarterly status rept. no. 8, 1 Jul-30 Sep 2009

Corporate Author:


Personal Author(s):

Report Date:


Pagination or Media Count:



This quarter, we redesigned the Shape-DNA based rule line cleaning algorithm to minimize degradation of the shapes of text characters. Recall that in the Shape-DNA based cleaning approach, projection onto the Shape-DNA space produces a rule line distance image that is used to clean the rule lines. However, this cleaning process can and does remove portions of legitimate text characters that resemble rule lines. Therefore, instead of using the rule line distance image to clean rule lines directly, we now use it to model the rule lines present in the document. Specifically, we apply the Hough transform to the rule line distance image to compute a set of rule line model parameters. In addition, we estimate the average thickness of the rule lines from the original input image. Finally, we use the rule line model parameters together with the thickness estimate, in a sliding window, to clean the rule lines. Figure 2 shows an example comparing the new rule line cleaning algorithm with the previous version of Shape-DNA cleaning. This reporting period, we also improved the restoration algorithm that removes the artifacts introduced by rule line cleaning. Like the rule line cleaning algorithm, the Shape-DNA based restoration algorithm includes an off-line training process, in which text character shapes are learned from about 100 handwritten text images with no rule lines and a Shape-DNA database is computed from the shape patterns. Restoration is then performed by projecting shape blocks from the input image onto the database and searching for the closest shape pattern in the database. Unlike our previous version, where Shape-DNA restoration was applied to the entire image, we now use the estimated rule line model parameters to constrain restoration to the local proximity of the detected rule lines.
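
The model-then-clean idea described above can be illustrated with a minimal sketch. It is not the report's implementation: it assumes a plain binary ink mask stands in for the rule line distance image, it takes the thickness from the Hough projection profile rather than from a separate pass over the original image, and it replaces the sliding window with a simple check that keeps band pixels where ink continues on both sides of the line (a likely text stroke crossing). The names make_demo_page, model_rule_line, and clean_rule_line are hypothetical.

```python
import numpy as np


def make_demo_page():
    """Synthetic binary page (1 = ink); purely illustrative data."""
    img = np.zeros((40, 80), dtype=np.uint8)
    img[20:23, 10:70] = 1   # horizontal rule line, 3 pixels thick
    img[12:30, 30:32] = 1   # text-like stroke crossing the rule line
    img[5:10, 50:53] = 1    # text-like blob away from the rule line
    return img


def model_rule_line(mask, n_theta=180):
    """Return (rho, theta, thickness) of the strongest line in a binary mask.

    Uses the normal form rho = x*cos(theta) + y*sin(theta), with theta
    sampled at whole degrees in [0, 180).
    """
    ys, xs = np.nonzero(mask)
    thetas = np.deg2rad(np.arange(n_theta))
    diag = int(np.ceil(np.hypot(*mask.shape)))
    n_rho = 2 * diag + 1
    # Vote: every ink pixel adds one count per angle; rho is shifted by
    # diag so that accumulator row indices are non-negative.
    rho_idx = np.rint(np.outer(xs, np.cos(thetas))
                      + np.outer(ys, np.sin(thetas))).astype(int) + diag
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)
    for t in range(n_theta):
        acc[:, t] = np.bincount(rho_idx[:, t], minlength=n_rho)
    r, t = np.unravel_index(np.argmax(acc), acc.shape)
    # Thickness: width of the contiguous run of rho bins around the peak
    # whose vote count is at least half the peak count.
    strong = acc[:, t] >= 0.5 * acc[r, t]
    lo = hi = r
    while lo > 0 and strong[lo - 1]:
        lo -= 1
    while hi < n_rho - 1 and strong[hi + 1]:
        hi += 1
    thickness = int(hi - lo + 1)
    rho = (lo + hi) / 2.0 - diag   # centre of the line band
    return float(rho), float(thetas[t]), thickness


def clean_rule_line(img, rho, theta, thickness):
    """Erase ink inside the modelled line band, keeping likely crossings."""
    ys, xs = np.nonzero(img)
    # Signed distance of each ink pixel from the line, and its position
    # along the line (rounded to an integer slot).
    d = xs * np.cos(theta) + ys * np.sin(theta) - rho
    s = np.rint(-xs * np.sin(theta) + ys * np.cos(theta)).astype(int)
    half = thickness / 2.0
    in_band = np.abs(d) <= half
    # Slots that have ink just outside the band on each side.
    side_a = set(s[(d > half) & (d <= half + 1.5)].tolist())
    side_b = set(s[(d < -half) & (d >= -half - 1.5)].tolist())
    crossing = side_a & side_b
    keep = np.array([slot in crossing for slot in s.tolist()], dtype=bool)
    remove = in_band & ~keep
    out = img.copy()
    out[ys[remove], xs[remove]] = 0
    return out


if __name__ == "__main__":
    page = make_demo_page()
    rho, theta, thickness = model_rule_line(page)
    cleaned = clean_rule_line(page, rho, theta, thickness)
    print(rho, np.rad2deg(theta), thickness, int(page.sum()), int(cleaned.sum()))
```

Because only pixels inside the modelled band are candidates for removal, text away from the rule line is left untouched, which is the motivation the abstract gives for modelling the lines instead of cleaning directly from the distance image.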

Subject Categories:

  • Linguistics
  • Cybernetics

Distribution Statement: