Inter-Rater Agreement Measures and the Refinement of Metrics in the PLATO MT Evaluation Paradigm
MITRE CORP MCLEAN VA
The PLATO machine translation (MT) evaluation (MTE) research program has as its goal the systematic development of a predictive relationship between discrete, well-defined MTE metrics and the specific information-processing tasks that can be reliably performed with MT output. Traditional measures of quality informed by the International Standards for Language Engineering (ISLE), namely clarity, coherence, morphology, syntax, general and domain-specific lexical robustness, and named-entity translation, as well as a DARPA-inspired measure of adequacy, are at the core of the program. For robust validation, which is indispensable for refining the tests and guidelines, we conduct tests of inter-rater reliability on the assessments. Here we discuss the development of our inter-rater reliability tests and report their results, focusing on Clarity and Coherence, the first two assessments in the PLATO suite, and we describe our method for iteratively refining our linguistic metrics and the guidelines for applying them within the PLATO evaluation paradigm. Finally, we discuss reasons why kappa may not be the best measure of inter-rater agreement for our purposes, and suggest directions for future investigation.
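The kappa statistic discussed in the abstract is commonly computed as Cohen's kappa, which corrects observed inter-rater agreement for the agreement expected by chance. The sketch below is illustrative only; the rating data and 3-point scale are invented for the example and are not drawn from the PLATO assessments.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same set of items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: from each rater's marginal label frequencies,
    # assuming the raters label independently.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of five MT output segments on a 3-point scale:
rater_1 = [1, 1, 2, 2, 3]
rater_2 = [1, 1, 2, 3, 3]
print(round(cohens_kappa(rater_1, rater_2), 3))  # prints 0.706
```

Here the raters agree on 4 of 5 items (observed agreement 0.8), but chance agreement is 0.32, giving kappa of about 0.71; kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance.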
- Computer Programming and Software