Accession Number : ADA460898


Title : The Form is the Substance: Classification of Genres in Text


Corporate Author : DEPARTMENT OF DEFENSE WASHINGTON DC


Personal Author(s) : Dewdney, Nigel ; VanEss-Dykema, Carol ; MacMillan, Richard


Full Text : https://apps.dtic.mil/dtic/tr/fulltext/u2/a460898.pdf


Report Date : Jan 2001


Pagination or Media Count : 9


Abstract : Categorization of text in IR has traditionally focused on topic. As use of the Internet and e-mail increases, categorization has become a key area of research as users demand methods of prioritizing documents. This work investigates text classification by format style, i.e. genre, and demonstrates, by complementing topic classification, that it can significantly improve retrieval of information. The paper compares use of presentation features to word features and the combination thereof, using Naive Bayes, C4.5 and SVM classifiers. Results show use of combined feature sets with SVM yields 92% classification accuracy in sorting seven genres.


Descriptors : *DOCUMENTS , *FORMATS , *CLASSIFICATION , INFORMATION RETRIEVAL , WORDS(LANGUAGE) , INTERNET , ELECTRONIC MAIL


Subject Categories : Information Science


Distribution Statement : APPROVED FOR PUBLIC RELEASE