On a combination of probabilistic and Boolean IR models for WWW document retrieval

Authors:
Masaharu Yoshioka, Hokkaido University, Hokkaido, Japan
Makoto Haraguchi, Hokkaido University, Hokkaido, Japan

ACM Transactions on Asian Language Information Processing, Volume 4, Issue 3, pp. 340–356
https://doi.org/10.1145/1111667.1111674

Published: 01 September 2005

Abstract

Even though a Boolean query can express an information need precisely enough to select relevant documents, it is not easy to construct a Boolean query that covers all relevant documents. To use a Boolean query effectively, a mechanism that retrieves as many relevant documents as possible is therefore required. To meet this requirement, we propose a method for modifying a given Boolean query by using information from a relevant document set. The retrieval results, however, may deteriorate if some important query terms are removed by this reformulation. A further mechanism is thus required to make use of other query terms that are useful for finding more relevant documents but are not strictly required in relevant documents. For this purpose, we propose a new method that combines the probabilistic IR and the Boolean IR models. We also introduce a new IR system, called appropriate Boolean query reformulation for information retrieval (ABRIR), based on these two methods and the Okapi system. ABRIR uses both a word index and a phrase index formed from combinations of two adjacent noun words. The effectiveness of these two methods was confirmed using the NTCIR-4 Web test collection.
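
The idea of letting a probabilistic score and a Boolean condition work together can be illustrated with a small sketch. This is not the ABRIR algorithm: the abstract does not give its formulas, so the code below assumes a basic BM25 weighting as a stand-in for Okapi-style scoring, and the function names `bm25_scores` and `combined_ranking`, the `penalty` parameter, and the toy documents are all hypothetical. Documents that fail the Boolean AND query are kept but down-weighted, so optional terms can still surface them.

```python
import math
from collections import Counter


def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score each tokenized document against query_terms with a basic BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def combined_ranking(docs, required_terms, optional_terms, penalty=0.5):
    """Rank by BM25, down-weighting documents that fail the Boolean AND query.

    The multiplicative penalty is an illustrative assumption, not the
    combination rule used in the paper.
    """
    all_terms = list(required_terms) + list(optional_terms)
    scores = bm25_scores(docs, all_terms)
    ranked = []
    for i, (doc, s) in enumerate(zip(docs, scores)):
        satisfies = all(t in doc for t in required_terms)
        ranked.append((i, s if satisfies else s * penalty, satisfies))
    return sorted(ranked, key=lambda r: r[1], reverse=True)


# Toy collection: each document is a list of tokens.
docs = [
    ["boolean", "query", "retrieval"],
    ["probabilistic", "retrieval", "model"],
    ["boolean", "retrieval", "web", "search"],
    ["cooking", "recipes"],
]
ranking = combined_ranking(docs, ["boolean", "retrieval"], ["web"])
for doc_id, score, satisfies in ranking:
    print(doc_id, round(score, 3), satisfies)
```

A hard Boolean filter corresponds to `penalty=0`, and pure probabilistic ranking to `penalty=1`; values in between trade strictness for coverage.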


Index Terms

Information systems > Information retrieval

Reviews

Reviewer: Joseph S. Fulda

The authors presume that readers will have substantial knowledge of the details of query reformulation in the context of information retrieval (IR), including background material, acronyms, and significant familiarity with some of the mathematical formulae in common use in this specialized subfield. In addition, the paper is at once dense and yet sparse on some key details. In preparing this review, I found the works of Crouch et al. [1] and Azzopardi [2] very helpful in overcoming my own deficiencies in some of these respects. Thus, this paper can be recommended only to subspecialists.

The paper is based on the concept of query reformulation, which is defined in Crouch et al. [1] as: "a technique ... successfully used to enhance ... retrieval effectiveness of queries. [Using] [r]elevance feedback ... the query is automatically reformulated based on information contained in the original query and in retrieved documents judged relevant (and non-relevant) by the user. A similar approach is pseudo- (or pseudo-relevance) feedback, wherein a certain set of documents is assumed relevant ... and ... is then used to reformulate the query" (pages 1-2). "[This] requires no interaction with the user" and it turns out to work, however surprising that may be. This paper's contribution is a new method based on the general aims quoted above from Crouch et al. [1].

Methods are judged effective if they have high precision, "the proportion of retrieved information that is actually relevant" [2], and high recall, "the fraction of relevant documents that have been retrieved" [2]. A bit of thought will convince the reader that these two aims are naturally antagonistic in the absence of very long, specific queries. The assumption behind all these methods is, as put by Azzopardi [2], that the user's mind is a "noisy channel," meaning that the user is not really sure what he is looking for, although he recognizes it when he sees it. Moreover, different users have different ideal documents in mind, yet they express their needs with identical queries. The combination of these factors is what has led the IR community to accept pseudo-relevance as a valid concept; that, and the fact that it works.

The new method uses the five most relevant documents containing all the terms (meaning nouns, and only those nouns that are syntactically nouns, thereby excluding verbal uses that serve the semantic purpose of nouns) or phrases from the original query, and finds additional terms (but not phrases) that frequently co-occur in the relevant documents to generate new pseudo-relevant documents. Probabilistic analysis is applied to allow the presence of some documents in the final result set without all of the original terms or phrases, but with, instead, what are presumed to be complementary (synonymous) terms.

The authors have tested their method extensively and are convinced it works. Although I am unable to independently assess this claim (partly because the language used for terms was Japanese and partly because of the density and sparseness of the presentation), I certainly do believe it. Yet, I would hope that this is not the wave of the future. Search engines, for all their prominence and ubiquity, are still a new tool, and are not properly used. One of the key problems is that users rely on the first few results too often [1,2], do not formulate queries thoughtfully, and the like. Time will improve these behaviors, even as time has tempered the urge to send out email too quickly or without running a spelling check. In other words, it would be better for the user's mind to be less of a noisy channel than to second-guess it by automatic means, which will only weaken critical thinking.

Moreover, today's search engines are extremely primitive; their extraordinary usefulness comes almost exclusively from their vast size. Various proximity operators, sensitivity to special symbols, and a variety of other searching enhancements available in commercial online databases could easily be added to search engines, and would render research of this type far less necessary.

Online Computing Reviews Service
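
The precision and recall definitions quoted in the review, and the tension between them, can be made concrete with a minimal sketch. The helper name `precision_recall` and the document identifiers are hypothetical and chosen only for illustration; they do not come from the paper or from the NTCIR-4 collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = ["d1", "d2", "d3", "d4"]

# A strict query returns few documents, all of them relevant.
narrow = precision_recall(["d1", "d2"], relevant)

# A loose query returns every relevant document plus as many irrelevant ones.
broad = precision_recall(
    ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant
)

print("narrow:", narrow)
print("broad:", broad)
```

In this toy case the strict query reaches perfect precision at half the recall, and the loose query does the reverse, which is the trade-off the reviewer describes.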


Published in

ACM Transactions on Asian Language Information Processing, Volume 4, Issue 3
September 2005
138 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1111667

Copyright © 2005 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

Boolean IR model, probabilistic IR model