Abstract
As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third-party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or possibly maliciously altering click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop many different features that distinguish between queries generated by people searching for information and those generated by automated processes. We categorize these features into two classes: interpretations of the physical model of human interactions, and behavioral patterns of automated interactions. Using these detection features, we next classify the query stream using multiple binary classifiers. In addition, a multiclass classifier is developed to identify subclasses of both normal and automated traffic. An active learning algorithm is used to suggest which user sessions to label to improve the accuracy of the multiclass classifier, while also seeking to discover new classes of automated traffic. A performance analysis is then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Buehrer, G., Stokes, J.W., Chellapilla, K., Platt, J.C. (2009). Classification of Automated Search Traffic. In: King, I., Baeza-Yates, R. (eds) Weaving Services and People on the World Wide Web. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00570-1_1
DOI: https://doi.org/10.1007/978-3-642-00570-1_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00569-5
Online ISBN: 978-3-642-00570-1
eBook Packages: Computer Science, Computer Science (R0)