Locality in Search Engine Queries and Its Implications for Caching

Technical Report | Accession Number: ADA458510

Abstract:

Caching is a popular technique for reducing both server load and user response time in distributed systems. In this paper, the authors ask whether caching can be effective for search engines as well. They study two real search engine traces, from Vivisimo and Excite, examining query locality and its implications for caching. Their trace analysis shows that queries have significant locality, with query frequency following a Zipf distribution. Very popular queries are shared among different users and can be cached at servers or proxies, while 16 to 22 percent of the queries are repeated by the same users and should be cached at the user side. Multiple-word queries are shared less often and should be cached mainly at the user side. If caching is done at the user side, short-term caching for hours is enough to cover query temporal locality, while server/proxy caching should be based on longer periods, such as days. Most users have small lexicons when submitting queries. Frequent users who submit many search requests tend to reuse a small subset of words to form their queries. Thus, with proxy or user-side caching, prefetching based on the user lexicon looks promising.
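
To illustrate the caching policy the abstract suggests (short-lived entries at the user side, longer-lived entries at a server or proxy), the sketch below shows a minimal least-recently-used query cache with a time-to-live, written in Python. The class name, capacities, and TTL values are illustrative assumptions and are not taken from the report; they only mirror its "hours at the user side, days at the proxy" recommendation.

    from collections import OrderedDict
    import time

    class TTLQueryCache:
        """LRU cache for query results with a time-to-live (TTL).

        A user-side instance would use a short TTL (hours); a server or
        proxy instance a longer one (days), following the trace findings
        summarized in the abstract above.
        """

        def __init__(self, capacity, ttl_seconds):
            self.capacity = capacity
            self.ttl = ttl_seconds
            self._entries = OrderedDict()  # query string -> (results, stored_at)

        def get(self, query):
            entry = self._entries.get(query)
            if entry is None:
                return None
            results, stored_at = entry
            if time.time() - stored_at > self.ttl:
                del self._entries[query]       # entry expired, drop it
                return None
            self._entries.move_to_end(query)   # mark as recently used
            return results

        def put(self, query, results):
            if query in self._entries:
                self._entries.move_to_end(query)
            elif len(self._entries) >= self.capacity:
                self._entries.popitem(last=False)  # evict least recently used
            self._entries[query] = (results, time.time())

    # Hypothetical TTLs reflecting the report's suggestion:
    # hours at the user side, days at the server or proxy.
    user_cache = TTLQueryCache(capacity=100, ttl_seconds=6 * 3600)
    proxy_cache = TTLQueryCache(capacity=100_000, ttl_seconds=3 * 86400)
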

Security Markings

Distribution Statement:
Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR