Accession Number:



BBN: Description of the PLUM System as Used for MUC-5

Descriptive Note:

Conference paper

Corporate Author:


Report Date:


Pagination or Media Count:



Traditional approaches to the problem of extracting data from texts have emphasized hand-crafted linguistic knowledge. In contrast, BBN's PLUM system (Probabilistic Language Understanding Model) was developed as part of an ARPA-funded research effort on integrating probabilistic language models with more traditional linguistic techniques. Our research and development goals are more rapid development of new applications, the ability to train and re-train systems based on user markings of correct and incorrect output, more accurate selection among interpretations when more than one is found, and more robust partial interpretation when no complete interpretation can be found. We began this research agenda approximately three years ago. During the past two years, we have evaluated much of our effort in porting our data extraction system (PLUM) to a new language (Japanese) and to two new domains. Three key design features distinguish PLUM: statistical language modeling, learning algorithms, and partial understanding. The first key feature is the use of statistical modeling to guide processing. For the version of PLUM used in MUC-5, part-of-speech information was determined by using well-known Markov modeling techniques embodied in BBN's part-of-speech tagger, POST [5]. We also used a correction model, AMED [3], for improving the Japanese segmentation and part-of-speech tags assigned by JUMAN. For the microelectronics domain, we used a probabilistic model to help identify the role of a company in a capability (whether it is a developer, user, etc.). Statistical modeling in PLUM contributes to portability, robustness, and trainability. The second key feature is our use of learning algorithms both to obtain the knowledge bases used by PLUM's processing modules and to train the probabilistic algorithms. The third key feature is partial understanding. All components of PLUM are designed to operate on partially interpretable input.
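The Markov modeling approach to part-of-speech tagging mentioned in the abstract can be illustrated with a minimal sketch. This is not the POST tagger itself: it assumes a plain bigram hidden Markov model decoded with the Viterbi algorithm, and the tag set, the probability tables, the UNSEEN floor, and the function name viterbi_tag are all hypothetical, chosen only to make the example runnable.

```python
import math

# Toy, hand-picked probabilities for illustration only.
# They are assumptions, not values taken from POST or PLUM.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"company": 0.5, "chips": 0.4, "plans": 0.1},
    "VERB": {"makes": 0.5, "plans": 0.4, "chips": 0.1},
}
UNSEEN = 1e-6  # hypothetical floor for words a tag was never seen to emit


def viterbi_tag(words):
    """Return the most probable tag sequence under a bigram HMM."""
    if not words:
        return []

    def emit(tag, word):
        return math.log(EMIT[tag].get(word, UNSEEN))

    # scores[i][t]: best log-probability of any tag path ending in t at word i
    scores = [{t: math.log(START[t]) + emit(t, words[0]) for t in TAGS}]
    back = []  # back[i][t]: best predecessor of tag t at word i + 1
    for word in words[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p] + math.log(TRANS[p][t]))
            col[t] = prev[best] + math.log(TRANS[best][t]) + emit(t, word)
            ptr[t] = best
        scores.append(col)
        back.append(ptr)

    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: scores[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


print(viterbi_tag(["the", "company", "plans", "chips"]))
```

With these toy tables the ambiguous words "plans" and "chips" are resolved by context rather than by their individual emission probabilities, which is the kind of selection among competing interpretations that the abstract attributes to statistical modeling.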

Subject Categories:

  • Information Science
  • Linguistics
  • Cybernetics

Distribution Statement: