research-article

Query segmentation revisited

Authors:
Matthias Hagen, Bauhaus-Universität, Weimar, Germany
Martin Potthast, Bauhaus-Universität, Weimar, Germany
Benno Stein, Bauhaus-Universität, Weimar, Germany
Christof Bräutigam, Bauhaus-Universität, Weimar, Germany

WWW '11: Proceedings of the 20th international conference on World wide web, March 2011, Pages 97–106
https://doi.org/10.1145/1963405.1963423

Published: 28 March 2011

ABSTRACT

We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but are tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate: they use expensive features and are difficult to reimplement.

The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and achieves a segmentation accuracy comparable to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. At the same time, we introduce a new evaluation corpus for query segmentation. With about 50,000 human-annotated queries, it is two orders of magnitude larger than the corpus used up to now.
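To make the idea concrete, the following is a minimal sketch of frequency-based segmentation with hash-table lookups. It is an illustration under stated assumptions, not the method as published: the abstract only says that raw web n-gram frequencies and Wikipedia titles are used, so the scoring function (length weighting, title boost), the names `NGRAM_COUNTS`, `WIKIPEDIA_TITLES`, `segmentations`, `score`, and `segment`, and all counts below are hypothetical.

```python
from itertools import product

# Hypothetical toy data standing in for a web n-gram frequency table and a
# set of Wikipedia titles. All counts are invented for illustration.
NGRAM_COUNTS = {
    "new york": 1000,
    "york times": 300,
    "new york times": 500,
    "times square": 400,
}
WIKIPEDIA_TITLES = {"new york", "new york times", "times square"}


def segmentations(words):
    """Yield every split of a non-empty word list into contiguous segments."""
    # Each of the len(words) - 1 gaps is either a break or not,
    # giving 2 ** (len(words) - 1) candidate segmentations.
    for breaks in product([False, True], repeat=len(words) - 1):
        segs, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                segs.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segs.append(" ".join(current))
        yield segs


def score(segs, title_boost=2):
    """Sum weighted frequencies of all multi-word segments.

    Assumption: the weight length ** length and the title boost are
    illustrative choices, not taken from the abstract.
    """
    total = 0
    for seg in segs:
        length = len(seg.split())
        if length < 2:
            continue  # single keywords contribute nothing
        count = NGRAM_COUNTS.get(seg, 0)  # hash-table lookup
        if seg in WIKIPEDIA_TITLES:
            count *= title_boost
        total += (length ** length) * count
    return total


def segment(query):
    """Return the highest-scoring segmentation of a keyword query."""
    words = query.split()
    if not words:
        return []
    return max(segmentations(words), key=score)


print(segment("new york times square"))
```

Exhaustive enumeration is exponential in the query length, which is tolerable here only because keyword queries are typically short. Each candidate costs a handful of dictionary lookups, which is the kind of simplicity the abstract emphasizes.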

Index Terms

Query segmentation revisited
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published in
WWW '11: Proceedings of the 20th international conference on World wide web
March 2011
840 pages
ISBN:9781450306324
DOI:10.1145/1963405
General Chairs:
S. Sadagopan, IIIT-Bangalore, India
Krithi Ramamritham, IIT-Bombay, India
Arun Kumar, IBM Research, India
M. P. Ravindra, Infosys E & R, India
Program Chairs:
Elisa Bertino, Purdue University, USA
Ravi Kumar, Yahoo! Research, USA
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
corpus
query segmentation
web n-grams
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
Article Metrics
- Total Citations: 42
- Total Downloads: 583
- Downloads (Last 12 months): 9
- Downloads (Last 6 weeks): 1
