DISCRIMINANT ANALYSIS FOR CONTENT CLASSIFICATION
Final rept. 23 Nov 1964-22 Nov 1965
IBM FEDERAL SYSTEMS DIV GAITHERSBURG MD
A series of experiments was performed to investigate the effectiveness and utility of automatically classifying documents through the use of multiple discriminant functions. Classification is accomplished by computing the distance from the mean vector of each category to the vector of observed word frequencies of a document and assigning the document to the category having the highest probability. Data on the effect of the principal classification parameters on classification performance are reported, based on a data base of approximately 2700 abstracts from the field of solid state physics. The parameters studied were the number of sample documents required to define a category, the length of documents, the interrelationship of the number of sample documents and their lengths, the relation of the number of word types in a document to the number of categories assigned to it, levels in a structure, homogeneity of categories, and performance measures. A higher performance level was obtained when samples of 140 documents were used to define each category than with samples of 35 or 70 documents. Classification results obtained on independent test sets of documents ranged from 73 to 92 percent. The test sets contained 419 and 1333 documents. Results are also reported in terms of Swets' effectiveness measure and Cleverdon's ratios of relevance, recall, and precision.
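The assignment rule described in the abstract can be illustrated with a minimal sketch. It is not the report's actual procedure: it assumes equal prior probabilities and an identity covariance matrix, so that the nearest category mean is also the most probable category. The function names, category labels, and toy term counts are all hypothetical.

```python
import math

def category_means(samples):
    """samples: dict mapping category -> list of word-frequency vectors."""
    means = {}
    for category, vectors in samples.items():
        n = len(vectors)
        means[category] = [sum(col) / n for col in zip(*vectors)]
    return means

def classify(doc, means):
    """Return (best_category, probabilities) for one document vector."""
    # Squared Euclidean distance from the document to each category mean.
    d2 = {c: sum((x - m) ** 2 for x, m in zip(doc, mu))
          for c, mu in means.items()}
    # Assumption: equal priors and identity covariance, so the posterior is
    # proportional to exp(-d2 / 2) and the smallest distance wins.
    smallest = min(d2.values())  # subtracted for numerical stability
    weights = {c: math.exp(-(v - smallest) / 2) for c, v in d2.items()}
    total = sum(weights.values())
    probs = {c: w / total for c, w in weights.items()}
    return max(probs, key=probs.get), probs

# Toy data (hypothetical): counts of three index terms per abstract.
samples = {
    "semiconductors": [[5, 1, 0], [4, 2, 0], [6, 0, 1]],
    "magnetism": [[0, 1, 5], [1, 0, 4], [0, 2, 6]],
}
means = category_means(samples)
label, probs = classify([4, 1, 1], means)
print(label, probs)
```

A full discriminant analysis would normally estimate a covariance matrix from the sample documents and use it in the distance computation. The identity covariance here only keeps the sketch short.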
- Information Science
- Computer Programming and Software