Deducing Linguistic Structure from the Statistics of Large Corpora
Abstract:
Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with less than a 3-4% error rate, when trained on moderate-sized 500K-word corpora of English text (e.g. Church, 1988; Hindle, 1989). The success of these techniques suggests that much of the grammatical structure of language may be derived automatically through distributional analysis, an approach attempted and abandoned in the 1950s. We describe here two experiments that test how far purely distributional techniques can be pushed to automatically provide both a set of part-of-speech tags for English and a grammatical analysis of free English text. We also discuss the status of a tagged natural-language corpus intended to aid such research, now amounting to 4 million words of hand-corrected part-of-speech tagging.
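The core idea of distributional analysis can be illustrated with a minimal sketch (hypothetical, not the code or corpus used in the experiments described here): represent each word by the counts of its immediate left and right neighbours, then compare words by the similarity of these context profiles. Words belonging to the same part of speech tend to occur in similar contexts, so their profiles score high against each other.

```python
from collections import Counter
from math import sqrt

# Toy corpus; the experiments described above would use a large untagged
# corpus of English text (this tiny example is purely illustrative).
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran on a mat . a dog ran on a rug .").split()

def context_vector(word, tokens):
    """Distributional profile: counts of immediate left/right neighbours."""
    vec = Counter()
    for i, w in enumerate(tokens):
        if w == word:
            if i > 0:
                vec[("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                vec[("R", tokens[i + 1])] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

cat, dog, sat = (context_vector(w, corpus) for w in ("cat", "dog", "sat"))

# Nouns "cat" and "dog" share contexts (the _ sat, a _ ran), so they score
# far higher with each other than with the verb "sat".
print(cosine(cat, dog), cosine(cat, sat))  # prints: 1.0 0.0
```

Clustering words by such context similarity is one way a tag set could be induced from raw text alone, without any hand-built lexicon.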