What's in a URL? Genre Classification from URLs

Abramson, Myriam; Aha, David W.

Active / Technical Report | Accession Number: ADA599843

Abstract:

The importance of URLs in the representation of a document cannot be overstated. Shorthand mnemonics such as "wiki" or "blog" are often embedded in a URL to convey its functional purpose or genre. Other mnemonics have evolved from use; for example, a WordPress particle is strongly suggestive of blogs. Can we leverage this predictive power to induce the genre of a document from the representation of its URL? This paper presents a methodology for webpage genre classification from URLs which, to our knowledge, has not been previously attempted. Experiments using machine learning techniques to evaluate this claim show promising results, and a novel algorithm for character n-gram decomposition is provided. Such a capability could be useful to improve personalized search results, disambiguate content, efficiently crawl the Web in search of relevant documents, and construct behavioral profiles from clickstream data without parsing the entire document.
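
As an illustration of the kind of feature the abstract refers to, the following is a minimal sketch of plain character n-gram extraction from a URL. It is not the authors' novel decomposition algorithm, which this record does not describe. The function names (`url_tokens`, `char_ngrams`, `url_ngram_features`), the n-gram range of 3 to 5, and the example URL are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host, path, and query into lowercase alphanumeric tokens.

    Hypothetical helper: the tokenization rule is an assumption, not the paper's.
    """
    parsed = urlparse(url)
    text = " ".join([parsed.netloc, parsed.path, parsed.query]).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


def char_ngrams(token, n_min=3, n_max=5):
    """Return every character n-gram of a token for n in [n_min, n_max]."""
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def url_ngram_features(url, n_min=3, n_max=5):
    """Count character n-grams over all tokens of a URL (bag-of-n-grams)."""
    counts = Counter()
    for token in url_tokens(url):
        counts.update(char_ngrams(token, n_min, n_max))
    return counts


if __name__ == "__main__":
    # Illustrative URL only; it is not taken from the report's data.
    features = url_ngram_features("http://example.wordpress.com/2012/blog-post")
    print(features.most_common(5))
```

Counts of this kind could serve as input features to a standard classifier. Mnemonics such as "blog" or "wiki" then surface as individual n-grams even when they are embedded in longer tokens.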

Author(s):

Abramson, Myriam ; Aha, David W.

Author Organization(s):

NAVAL RESEARCH LAB WASHINGTON DC

Descriptive Note:

Conference paper

Supplementary Note:

Presented at the Intelligent Techniques for Web Personalization Workshop at AAAI, 2012.

Pagination:

0009

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Monitor Series:

NRL

Subject Terms

Joint Capability Areas:

JCA_1.2_Force Preparation; JCA_1_Force Support; JCA_5_Command and Control; JCA_1.2.1_Training; JCA_1.2.7_Experimentation; JCA_6_Net Centric; JCA_5.3_Planning; JCA_6.1.2_Wireless Transmission; JCA_6.1_Information Transport; JCA_1.2.3_Educating; JCA_3.2_Engagement; JCA_3_Force Application; JCA_5.2.2_Develop Knowledge and Situational Awareness; JCA_5.2_Understand; JCA_5.5.2_Task; JCA_5.5_Direct; JCA_6.4.1_Secure Information Exchange; JCA_8_Building Partnerships

Modernization Areas:

AI and Machine Learning; Autonomy

Communities of Interest:

Autonomy

Descriptor(s):

*CLASSIFICATION, *FEATURE EXTRACTION, *INTERNET, *LEARNING MACHINES, ALGORITHMS, DATA MINING, EMBEDDING, MNEMONICS, PRECISION, PREDICTIONS, RECALL

Field(s)/Group(s):

Information Science, Computer Systems, Cybernetics

Keyword(s):

*GENRE CLASSIFICATION, URL(UNIFORM RESOURCE LOCATOR), WEBPAGES, TOPIC CLASSIFICATION

Report Date:

2011 Jan 01

Creation Date:

2014 Jun 02