Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff's (2001) Approach
DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION EDINBURGH (AUSTRALIA) COMMAND CONTROL COMMUNICATIONS AND INTELLIGENCE DIV
There is a distinct lack of tools that provide a comprehensive measure of the similarity between corpora. Finding similar corpora is necessary for the design of certain user studies investigating text processing. It is also useful for ensuring comparability between studies on document analysis conducted across classified and unclassified domains. In this study, human judgements of corpora similarity were obtained as a gold standard. These were then compared to the values provided by Kilgarriff's (2001) chi-square (χ²) statistic. The findings indicated a high level of agreement between the participants, with 77% shared variance in overall similarity judgements. The results of the χ² measure also correlated well with the human results, with a correlation of approximately 0.66. Although there are complexities associated with the χ² technique that need to be examined in further research, this study provides promising results, suggesting that a statistical technique could provide results that are comparable to human judgements.
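To give a sense of the kind of statistic involved, the following is a minimal sketch of a chi-square comparison of two corpora based on word frequencies. It is an illustration only and does not reproduce the procedure used in this study: the function name `chi_square_distance`, the `top_n` cutoff, and the assumption of pre-tokenised input are all hypothetical choices made for the example. Lower values suggest more similar word-frequency profiles.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of two corpora.

    Hypothetical helper for illustration: tokens_a and tokens_b are lists of
    word tokens, and top_n is an assumed cutoff, not a value from the study.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b

    # Frequencies in the joint corpus decide which words are compared.
    joint = counts_a + counts_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora were samples of the same population,
        # scaled by each corpus's share of the combined size.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (counts_a[word] - expected_a) ** 2 / expected_a
        chi2 += (counts_b[word] - expected_b) ** 2 / expected_b
    return chi2
```

Under these assumptions, a corpus compared with itself scores zero, and corpora with no shared vocabulary score highest. Values such as these could then be correlated with human similarity ratings to see how well the statistic tracks human judgement.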
- Information Science