Accession Number:

ADA460572

Title:

Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day

Descriptive Note:

Corporate Author:

JOHNS HOPKINS UNIV BALTIMORE MD CENTER FOR LANGUAGE AND SPEECH PROCESSING (CLSP)

Personal Author(s):

Report Date:

2002-01-01

Pagination or Media Count:

8.0

Abstract:

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech POS tagger in a new language using only one person day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages 1 an online or hard-copy pocket-sized bilingual dictionary, 2 a basic library reference grammar, and 3 access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models are accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.

Subject Categories:

  • Linguistics
  • Anatomy and Physiology

Distribution Statement:

APPROVED FOR PUBLIC RELEASE