Accession Number:



The Termolator: Terminology Recognition Based on Chunking, Statistical and Search Based Scores

Descriptive Note:

Journal Article - Open Access

Corporate Author:

New York University, Department of Computer Science, New York, United States

Report Date:


Pagination or Media Count:



The Termolator is an open-source, high-performing terminology extraction system, available on GitHub. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The distributional component ranks such term chunks according to several metrics, including: (a) a set of metrics that favors term chunks that are relatively more frequent in a foreground corpus about a single topic than they are in a background or multi-topic corpus; (b) a well-formedness score based on linguistic features; and (c) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these components and show that all modules contribute to the system's performance, both in terms of the number and quality of terms identified. This paper expands upon previous publications about this research and includes descriptions of some of the improvements made since its initial release. This study also includes a comparison with another terminology extraction system available on-line, Termostat (Drouin, 2003). We found that the systems get comparable results when applied to small amounts of data: about 50% precision for a single foreground file (Einstein's Theory of Relativity). However, when running the system with 500 patent files as foreground, Termolator performed significantly better than Termostat. For 500 refrigeration patents, Termolator got 70% precision vs. Termostat's 52%.
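
The foreground-versus-background idea behind the distributional component can be illustrated with a minimal sketch. This is not the Termolator's actual metric or code: the function name `relative_frequency_ratio`, the `smoothing` parameter, and the add-one style smoothing are all assumptions chosen for illustration. It assumes candidate term chunks have already been extracted from both corpora, and it ranks foreground terms by how much more frequent they are there than in the background.

```python
from collections import Counter


def relative_frequency_ratio(foreground_terms, background_terms, smoothing=1.0):
    """Rank foreground terms by foreground rate divided by background rate.

    Hypothetical illustration only; not the scoring used by the Termolator.
    Both arguments are lists of already-extracted candidate term strings.
    """
    fg = Counter(foreground_terms)
    bg = Counter(background_terms)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    # Vocabulary size over both corpora, used for additive smoothing so that
    # terms absent from the background do not cause division by zero.
    vocab_size = len(set(fg) | set(bg))

    scores = {}
    for term, count in fg.items():
        fg_rate = count / fg_total
        bg_rate = (bg[term] + smoothing) / (bg_total + smoothing * vocab_size)
        scores[term] = fg_rate / bg_rate

    # Highest ratio first: terms most characteristic of the foreground topic.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under these assumptions, a term that is common in a set of refrigeration patents but rare in a multi-topic background receives a high ratio, while a generic word that is frequent in both corpora is pushed toward the bottom of the ranking.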

Subject Categories:

Distribution Statement: