Classification of Automated Search Traffic

Chapter in: Weaving Services and People on the World Wide Web

Abstract

As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third-party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or maliciously altering click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop a set of features that distinguish queries generated by people searching for information from those generated by automated processes. We group these features into two classes: interpretations of the physical model of human interactions, and behavioral patterns of automated interactions. Using these detection features, we next classify the query stream with multiple binary classifiers. A multiclass classifier is then developed to identify subclasses of both normal and automated traffic. An active learning algorithm suggests which user sessions to label, both to improve the accuracy of the multiclass classifier and to discover new classes of automated traffic. A performance analysis is then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.
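To make the idea of session-level detection features concrete, the following is a minimal sketch and not the chapter's actual feature set or classifier. It assumes a session is a time-ordered list of (timestamp in seconds, query text) pairs. The feature names (`query_count`, `mean_gap`, `distinct_ratio`) and the thresholds in `looks_automated` are hypothetical, chosen only for illustration. The first two loosely correspond to the "physical model" class (how many queries a person can issue, and how fast); the third hints at a behavioral pattern.

```python
from statistics import mean


def session_features(events):
    """Compute illustrative features for one session.

    events: list of (timestamp_seconds, query_text) tuples, sorted by time.
    The feature names are hypothetical, not taken from the chapter.
    """
    times = [t for t, _ in events]
    queries = [q for _, q in events]
    # Time between consecutive queries in the session.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "query_count": len(events),
        "mean_gap": mean(gaps) if gaps else None,
        "distinct_ratio": len(set(queries)) / len(queries) if queries else 0.0,
    }


def looks_automated(features, max_queries=200, min_human_gap=1.0):
    """Flag a session with a simple threshold rule.

    Both thresholds are assumptions for illustration; a real system would
    learn a classifier from labeled sessions instead of hard-coding limits.
    """
    if features["query_count"] > max_queries:
        return True
    gap = features["mean_gap"]
    return gap is not None and gap < min_human_gap


# Example: a rapid scripted session versus a slow, human-paced one.
bot_session = [(i * 0.2, f"rank check {i}") for i in range(50)]
human_session = [(0, "weather"), (45, "weather seattle"), (130, "seattle forecast")]

bot_flag = looks_automated(session_features(bot_session))
human_flag = looks_automated(session_features(human_session))
```

A threshold rule like this stands in for a binary classifier only in the loosest sense. Its purpose is to show the shape of the input (per-session features) that binary and multiclass classifiers would consume.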

Author information

Correspondence to Greg Buehrer.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Buehrer, G., Stokes, J.W., Chellapilla, K., Platt, J.C. (2009). Classification of Automated Search Traffic. In: King, I., Baeza-Yates, R. (eds) Weaving Services and People on the World Wide Web. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00570-1_1

  • DOI: https://doi.org/10.1007/978-3-642-00570-1_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00569-5

  • Online ISBN: 978-3-642-00570-1

  • eBook Packages: Computer Science (R0)
