On a combination of probabilistic and Boolean IR models for WWW document retrieval

Published: 01 September 2005

Abstract

Even though a Boolean query can express an information need precisely enough to select relevant documents, it is not easy to construct an appropriate Boolean query that covers all relevant documents. To use a Boolean query effectively, a mechanism for retrieving as many relevant documents as possible is therefore required. In accordance with this requirement, we propose a method for modifying a given Boolean query by using information from a relevant document set. The retrieval results, however, may deteriorate if some important query terms are removed by this reformulation. A further mechanism is thus required in order to use other query terms that are useful for finding more relevant documents but are not strictly required in relevant documents. To meet this requirement, we propose a new method that combines the probabilistic IR and the Boolean IR models. We also introduce a new IR system, called appropriate Boolean query reformulation for information retrieval (ABRIR), based on these two methods and the Okapi system. ABRIR uses both a word index and a phrase index formed from combinations of two adjacent noun words. The effectiveness of these two methods was confirmed using the NTCIR-4 Web test collection.
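
The word index and the two-adjacent-noun phrase index mentioned in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not ABRIR's implementation: the function name `build_indexes`, the `(token, tag)` input format, and the `"NOUN"` tag label are all hypothetical, and the choice to put every token (not only nouns) in the word index is an assumption as well.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build a word index and an adjacent-noun phrase index.

    docs maps a document id to a list of (token, pos_tag) pairs.
    The input format and the "NOUN" tag are assumptions for this sketch.
    """
    word_index = defaultdict(set)    # token -> ids of documents containing it
    phrase_index = defaultdict(set)  # (noun, next noun) -> ids of documents
    for doc_id, tagged in docs.items():
        for i, (token, tag) in enumerate(tagged):
            word_index[token].add(doc_id)
            # A phrase is two nouns that are directly adjacent in the text.
            if tag == "NOUN" and i + 1 < len(tagged) and tagged[i + 1][1] == "NOUN":
                phrase_index[(token, tagged[i + 1][0])].add(doc_id)
    return word_index, phrase_index


# Hypothetical toy collection; tags would normally come from a morphological analyzer.
docs = {
    "d1": [("information", "NOUN"), ("retrieval", "NOUN"), ("works", "VERB")],
    "d2": [("retrieval", "NOUN"), ("of", "ADP"), ("information", "NOUN")],
}
word_index, phrase_index = build_indexes(docs)
```

In this toy collection both documents appear under the word `retrieval`, but only `d1` appears under the phrase (`information`, `retrieval`), because in `d2` the two nouns are not adjacent.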

        Reviews

        Joseph S. Fulda

        The authors presume that readers have substantial knowledge of the details of query reformulation in the context of information retrieval (IR), including background material, acronyms, and significant familiarity with some of the mathematical formulae in common use in this specialized subfield. In addition, the paper is at once dense and yet sparse on some key details. In preparing this review, I found the works of Crouch et al. [1] and Azzopardi [2] very helpful in overcoming my own deficiencies in some of these respects. Thus, this paper can be recommended only to subspecialists.

        The paper is based on the concept of query reformulation, which is defined in Crouch et al. [1] as:

        "a technique ... successfully used to enhance ... retrieval effectiveness of queries. [Using] [r]elevance feedback ... the query is automatically reformulated based on information contained in the original query and in retrieved documents judged relevant (and non-relevant) by the user. A similar approach is pseudo- (or pseudo-relevance) feedback, wherein a certain set of documents is assumed relevant ... and ... is then used to reformulate the query" (pages 1-2).

        "[This] requires no interaction with the user" and it turns out to work, however surprising that may be. This paper's contribution is a new method based on the general aims quoted above from Crouch et al. [1]. Methods are judged effective if they have high precision, "the proportion of retrieved information that is actually relevant" [2], and high recall, "the fraction of relevant documents that have been retrieved" [2]. A bit of thought will convince the reader that these two aims are naturally antagonistic in the absence of very long, specific queries.

        The assumption behind all these methods is, as put by Azzopardi [2], that the user's mind is a "noisy channel," meaning that the user is not really sure what he is looking for, although he recognizes it when he sees it. Moreover, different users have different ideal documents in mind, yet they express their needs with identical queries. The combination of these factors is what has led the IR community to accept pseudo-relevance as a valid concept; that, and the fact that it works.

        The new method uses the five most relevant documents containing all the terms (meaning nouns, and only those nouns that are syntactically nouns, thereby excluding verbal uses that serve the semantic purpose of nouns) or phrases from the original query, and finds additional terms (but not phrases) that frequently co-occur in the relevant documents to generate new pseudo-relevant documents. Probabilistic analysis is applied to allow the final result set to include some documents that lack some of the original terms or phrases but contain, instead, what are presumed to be complementary (synonymous) terms.

        The authors have tested their method extensively and are convinced it works. Although I am unable to independently assess this claim (partly because the language used for terms was Japanese and partly because of the density and sparseness of the presentation), I certainly do believe it. Yet, I would hope that this is not the wave of the future. Search engines, for all their prominence and ubiquity, are still a new tool, and are not properly used. One of the key problems is that users too often rely on the first few results [1,2], do not formulate queries thoughtfully, and so on. Time will improve these behaviors, even as time has tempered the urge to send out email too quickly or without running a spelling check. In other words, it would be better for the user's mind to be less of a noisy channel than to second-guess it by automatic means, which will only weaken critical thinking.

        Moreover, today's search engines are extremely primitive; their extraordinary usefulness comes almost exclusively from their vast size. Various proximity operators, sensitivity to special symbols, and a variety of other searching enhancements available in commercial online databases could easily be added to search engines, and would render research of this type far less necessary.

        Online Computing Reviews Service
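
        The precision and recall definitions quoted in the review, and the pseudo-relevance step it describes (take the top five documents that contain every query term, then look for terms that frequently co-occur in them), can be made concrete with a minimal sketch. It is not the paper's algorithm: the function names, the set-of-terms document representation, and the plain document-frequency count used to pick co-occurring terms are assumptions made for illustration, and the paper's probabilistic weighting is not reproduced.

```python
from collections import Counter


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def expansion_terms(ranked_docs, query_terms, k=5, n_terms=3):
    """Pick candidate expansion terms from pseudo-relevant documents.

    ranked_docs is a list of (doc_id, set_of_terms), best-ranked first.
    The top k documents containing every query term are assumed relevant;
    non-query terms are ranked by how many of those documents contain them.
    """
    query = set(query_terms)
    pseudo_relevant = [terms for _, terms in ranked_docs if query <= terms][:k]
    counts = Counter()
    for terms in pseudo_relevant:
        counts.update(terms - query)
    return [term for term, _ in counts.most_common(n_terms)]


# Hypothetical ranked result list for the query {"boolean", "query"}.
ranked = [
    ("d1", {"boolean", "query", "okapi", "index"}),
    ("d2", {"boolean", "query", "okapi"}),
    ("d3", {"boolean", "ranking"}),  # lacks "query", so it is skipped
    ("d4", {"boolean", "query", "index", "okapi"}),
]
candidates = expansion_terms(ranked, ["boolean", "query"], k=5, n_terms=2)
p, r = precision_recall(["d1", "d2", "d3"], ["d1", "d4"])
```

        Here `okapi` occurs in three pseudo-relevant documents and `index` in two, so they are returned in that order; retrieving `d1`, `d2`, `d3` when `d1` and `d4` are relevant gives a precision of 1/3 and a recall of 1/2.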

        • Published in

          ACM Transactions on Asian Language Information Processing, Volume 4, Issue 3
          September 2005
          138 pages
          ISSN: 1530-0226
          EISSN: 1558-3430
          DOI: 10.1145/1111667

          Copyright © 2005 ACM

          Publisher

          Association for Computing Machinery
          New York, NY, United States
