DOI: 10.1145/1963405.1963423
research-article

Query segmentation revisited

Published: 28 March 2011

ABSTRACT

We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases where possible. Previous approaches achieve reasonable segmentation performance but have been tested only against a small corpus of manually segmented queries. In addition, many of them are fairly intricate: they rely on expensive features and are difficult to reimplement.

The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and achieves a segmentation accuracy comparable to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. We also introduce a new evaluation corpus for query segmentation. With about 50,000 human-annotated queries, it is two orders of magnitude larger than the corpus used so far.
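
To make the idea of frequency-based segmentation concrete, the following is a minimal sketch, not the method of the paper. It enumerates every segmentation of a query and keeps the one with the highest score, where phrase frequencies are looked up in a hash table (a Python dict). The toy table `NGRAM_FREQ`, the function names, and the length-based weighting in `score` are illustrative assumptions; the abstract does not specify the scoring function or how Wikipedia titles enter it.

```python
from itertools import combinations

# Hypothetical toy frequency table (assumption). A real system would load
# web n-gram counts and Wikipedia titles into a hash table like this one.
NGRAM_FREQ = {
    "new york": 1000,
    "york times": 300,
    "new york times": 800,
    "times square": 500,
}


def segmentations(words):
    """Yield every split of `words` into contiguous segments (2^(n-1) splits)."""
    n = len(words)
    for k in range(n):
        # Choose k break positions among the n-1 gaps between words.
        for breaks in combinations(range(1, n), k):
            bounds = [0, *breaks, n]
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]


def score(segmentation, freq):
    """Sum a weighted frequency over all multi-word segments.

    The weight len**len is an assumed heuristic that favors longer phrases;
    single-word segments and unseen phrases contribute nothing.
    """
    total = 0
    for seg in segmentation:
        if len(seg) >= 2:
            total += len(seg) ** len(seg) * freq.get(" ".join(seg), 0)
    return total


def segment(query, freq=NGRAM_FREQ):
    """Return the highest-scoring segmentation of a keyword query."""
    words = query.split()
    if not words:
        return []
    return max(segmentations(words), key=lambda s: score(s, freq))


if __name__ == "__main__":
    print(segment("new york times square"))
```

With the toy table above, the query "new york times square" is grouped as "new york times" followed by "square", because the three-word phrase receives the largest weighted frequency. Exhaustive enumeration is exponential in the number of keywords, which is acceptable in this sketch only because keyword queries are typically short.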

    • Published in

      WWW '11: Proceedings of the 20th international conference on World wide web
      March 2011
      840 pages
      ISBN: 9781450306324
      DOI: 10.1145/1963405

      Copyright © 2011 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
