Power Law Distributions in Information Retrieval

Published: 16 February 2016

Abstract

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely assumed to follow an approximate power law. This common assumption highlights specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). The assumption, however, may not always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? (ii) What is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
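To make the idea of comparing a power law fit against alternative distributions concrete, the following is a minimal sketch and not the procedure used in the article. It assumes a continuous power law with a known lower bound `xmin`, only two alternatives (a shifted exponential and an untruncated lognormal) instead of 15, and the Akaike information criterion (AIC) as the comparison rule. All function names are hypothetical.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE for a continuous power law p(x) proportional to x^(-alpha), x >= xmin."""
    n = len(xs)
    log_sum = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / log_sum
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return {"alpha": alpha}, loglik, 1  # one free parameter (xmin assumed known)


def fit_exponential(xs, xmin):
    """MLE for an exponential distribution shifted to start at xmin."""
    n = len(xs)
    rate = n / sum(x - xmin for x in xs)
    loglik = n * math.log(rate) - n
    return {"rate": rate}, loglik, 1


def fit_lognormal(xs, xmin):
    """MLE for a lognormal; for simplicity it is NOT truncated at xmin."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    loglik = -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * n
    return {"mu": mu, "sigma": math.sqrt(var)}, loglik, 2


def compare_by_aic(xs, xmin):
    """Fit every candidate and return the name with the lowest AIC plus all results."""
    fitters = {
        "power_law": fit_power_law,
        "exponential": fit_exponential,
        "lognormal": fit_lognormal,
    }
    results = {}
    for name, fit in fitters.items():
        params, loglik, k = fit(xs, xmin)
        results[name] = {"params": params, "aic": 2 * k - 2 * loglik}
    best = min(results, key=lambda name: results[name]["aic"])
    return best, results


def sample_power_law(n, alpha, xmin, rng):
    """Draw n synthetic power law values by inverse transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic data only; real IR data (e.g., query frequencies) would replace this.
    data = sample_power_law(5000, alpha=2.5, xmin=1.0, rng=random.Random(0))
    best, results = compare_by_aic(data, xmin=1.0)
    print("lowest AIC:", best)
    for name, res in results.items():
        print(name, res["params"], round(res["aic"], 1))
```

A lower AIC only indicates the better of the candidates considered; it does not show that the winning distribution fits well in absolute terms, which would require a separate goodness-of-fit test.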

References

  1. Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, and Bernardo A. Huberman. 2001. Search in power-law networks. Physical Review E 64, 4 (2001), 046135.Google ScholarGoogle ScholarCross RefCross Ref
  2. Hirotugu Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automated Control 19, 6 (1974), 716--723.Google ScholarGoogle ScholarCross RefCross Ref
  3. Avi Arampatzis and Jaap Kamps. 2008. A study of query length. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 811--812. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Avi Arampatzis and Jaap Kamps. 2009. A signal-to-noise approach to score normalization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 797--806. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Taylor B. Arnold and John W. Emerson. 2011. Nonparametric goodness-of-fit tests for discrete null distributions. The R Journal 3, 2 (2011), 34--39.Google ScholarGoogle ScholarCross RefCross Ref
  6. Harshvardhan Asthana, Ruoxun Fu, and Ingemar J. Cox. 2011. On the feasibility of unstructured peer-to-peer information retrieval. In Advances in Information Retrieval Theory. Springer, 125--138. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Leif Azzopardi. 2009. Query side evaluation: An empirical analysis of effectiveness and effort. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). ACM, 556--563. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Harald Baayen. 2001. Word Frequency Distributions. Springer.Google ScholarGoogle Scholar
  9. Rohit Babbar, Ioannis Partalas, Eric Gaussier, and Massih-Reza Amini. 2014. Re-ranking approach to classification in large-scale power-law distributed category systems. In Proceedings of the 37th International ACM SIGIR Conference on Research (SIGIR 2014). ACM, 1059--1062. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. David F. Babbel, Vincent J. Strickler, and Ricki S. Dolan. 2009. Statistical string theory for courts: If the data don’t fit. Legal Technology Risk Management 4 (2009), 1.Google ScholarGoogle Scholar
  11. Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri. 2007. The impact of caching on search engines. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 183--190. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ricardo Baeza-Yates, Javier Ruiz-del Solar, Rodrigo Verschae, Carlos Castillo, and Carlos Hurtado. 2004. Content-based image retrieval and characterization on specific web collections. In Image and Video Retrieval. Springer, 189--198.Google ScholarGoogle Scholar
  13. Ricardo Baeza-Yates and Felipe Saint-Jean. 2003. A three level search engine index based in query log distribution. In String Processing and Information Retrieval. Springer, 56--65.Google ScholarGoogle Scholar
  14. Ricardo Baeza-Yates and Alessandro Tiberi. 2007. Extracting semantic relations from query logs. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 76--85. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Albert-László Barabási, Réka Albert, and Hawoong Jeong. 1999. Mean-field theory for scale-free random networks. Physica A: Statistical Mechanics and Its Applications 272, 1 (1999), 173--187.Google ScholarGoogle ScholarCross RefCross Ref
  16. Heiko Bauke. 2007. Parameter estimation for power-law distributions by maximum likelihood methods. The European Physical Journal B-Condensed Matter and Complex Systems 58, 2 (2007), 167--173.Google ScholarGoogle ScholarCross RefCross Ref
  17. Michael A. Bean. 2001. Probability: The Science of Uncertainty with Applications to Investments, Insurance, and Engineering. Vol. 6. American Mathematical Society.Google ScholarGoogle Scholar
  18. Luca Becchetti and Carlos Castillo. 2006. The distribution of pagerank follows a power-law only for particular values of the damping factor. In Proceedings of the 15th International Conference on World Wide Web. ACM, 941--942. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Casper Beckman. 1999. Chinese character frequencies. http://casper.beckman.uiuc.edu/∼c-tsai4/chinese/charfreq.html. (1999). No longer available.Google ScholarGoogle Scholar
  20. Jan Beirlant, Yuri Goegebeur, Johan Segers, and Jozef Teugels. 2006. Statistics of Extremes: Theory and Applications. John Wiley & Sons.Google ScholarGoogle Scholar
  21. Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. SpamRank--Fully automatic link spam detection work in progress. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  22. Kerstin Bischoff, Claudiu S. Firan, Wolfgang Nejdl, and Raluca Paiu. 2008. Can all tags be used for search?. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 193--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Paolo Bolettieri, Andrea Esuli, Fabrizio Falchi, Claudio Lucchese, Raffaele Perego, Tommaso Piccioli, and Fausto Rabitti. 2009. CoPhIR: A test collection for content-based image retrieval. arXiv preprint arXiv:0905.4627 (2009).Google ScholarGoogle Scholar
  24. Abraham Bookstein. 1990. Informetric distributions, part I: Unified overview. American Society for Information Science 41, 5 (1990), 368--375.Google ScholarGoogle ScholarCross RefCross Ref
  25. George E. P. Box and David R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological) (1964), 211--252.Google ScholarGoogle Scholar
  26. Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. 1999. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the 18th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’99), Vol. 1. IEEE, 126--134.Google ScholarGoogle ScholarCross RefCross Ref
  27. Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the web. Computer Networks 33, 1 (2000), 309--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Mark Buchanan. 2004. Power laws & the new science of complexity management. Strategy+ Business 34 (2004), 1--8.Google ScholarGoogle Scholar
  29. Kenneth P. Burnham and David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer.Google ScholarGoogle Scholar
  30. Pedro Cano, Oscar Celma, Markus Koppenberger, and Javier M. Buldu. 2006. Topology of music recommendation networks. Chaos: An Interdisciplinary Journal of Nonlinear Science 16, 1 (2006), 013107.Google ScholarGoogle ScholarCross RefCross Ref
  31. Domenico Cantone, Salvatore Cristofaro, Simone Faro, and Emanuele Giaquinta. 2009. Finite state models for the generation of large corpora of natural language texts. In Proceedings of the 7th International Workshop on Finite-state Methods and Natural Language Processing, Vol. 191. IOS Press, 175. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-aware query suggestion by mining click-through and session data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 875--883. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Deepayan Chakrabarti and Christos Faloutsos. 2006. Graph mining: Laws, generators, and algorithms. ACM Computing Surveys (CSUR) 38, 1 (2006), 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Michael Chau, Yan Lu, Xiao Fang, and Christopher C. Yang. 2009. Characteristics of character usage in Chinese Web searching. Information Processing & Management 45, 1 (2009), 115--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Surajit Chaudhuri, Kenneth Church, Arnd Christian König, and Liying Sui. 2007. Heavy-tailed distributions and multi-keyword queries. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 663--670. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Serena H. Chen and Carmel A. Pollino. 2012. Good practice in Bayesian network modelling. Environmental Modelling & Software 37 (2012), 134--145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Pasquale Cirillo. 2013. Are your data really pareto distributed? Physica A: Statistical Mechanics and its Applications 392, 23 (2013), 5947--5962.Google ScholarGoogle Scholar
  38. Kevin A. Clarke. 2003. Nonparametric model discrimination in international relations. Journal of Conflict Resolution 47, 1 (2003), 72--93.Google ScholarGoogle ScholarCross RefCross Ref
  39. Kevin A. Clarke. 2007. A simple distribution-free test for nonnested model selection. Political Analysis 15, 3 (2007), 347--363.Google ScholarGoogle ScholarCross RefCross Ref
  40. Aaron Clauset, Cosma R. Shalizi, and Mark E. J. Newman. 2007. Power-law distributions in empirical data. SIAM review 51, 4 (2007), 661--703. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Maarten Clements, Arjen P. de Vries, and Marcel J. T. Reinders. 2010. The influence of personalization on tag query length in social media search. Information Processing & Management 46, 4 (2010), 403--412. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Will Cook, Paul Ormerod, and Ellie Cooper. 2004. Scaling behaviour in the number of criminal acts committed by individuals. Journal of Statistical Mechanics: Theory and Experiment 2004, 7 (2004), P07003.Google ScholarGoogle ScholarCross RefCross Ref
  43. Gregory W. Corder and Dale I. Foreman. 2009. Nonparametric Statistics for Non-Statisticians: A Step-By-Step Approach. John Wiley & Sons.Google ScholarGoogle Scholar
  44. Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. 2011. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval 14, 5 (2011), 441--465. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Nick Craswell and Martin Szummer. 2007. Random walks on the click graph. In Proceedings of the 30th annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 239--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Mark E. Crovella and Murad S. Taqqu. 1999. Estimating the heavy tail index from scaling properties. Methodology and Computing in Applied Probability 1, 1 (1999), 55--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Wang Dahui, Li Menghui, and Di Zengru. 2005. True reason for Zipf’s law in language. Physica A: Statistical Mechanics and its Applications 358, 2 (2005), 545--550.Google ScholarGoogle Scholar
  48. Russell Davidson and James G. MacKinnon. 1981. Several tests for model specification in the presence of alternative hypotheses. Econometrica: Journal of the Econometric Society (1981), 781--793.Google ScholarGoogle Scholar
  49. Shuai Ding, Josh Attenberg, Ricardo Baeza-Yates, and Torsten Suel. 2011. Batch query processing for web search engines. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Sandor Dominich and Tamas Kiezer. 2005. Zipfs law, small world and Hungarian language. Alkalmazott Nyelvtudomány 1, 2 (2005), 5--24. In Hungarian.Google ScholarGoogle Scholar
  51. Joshua Drucker. 2007. Regional Dominance and Industrial Success: A Productivity-Based Analysis. ProQuest.Google ScholarGoogle Scholar
  52. Jan Eeckhout. 2004. Gibrat’s law for (all) cities. American Economic Review (2004), 1429--1451.Google ScholarGoogle Scholar
  53. Leo Egghe. 2000. The distribution of N-grams. Scientometrics 47, 2 (2000), 237--252.Google ScholarGoogle ScholarCross RefCross Ref
  54. Ramon Ferrer-i Cancho and Brita Elvevåg. 2010. Random texts do not exhibit the real Zipf’s law-like rank distribution. PLoS One 5, 3 (2010), e9411.Google ScholarGoogle ScholarCross RefCross Ref
  55. Andrey Feuerverger and Peter Hall. 1999. Estimating a tail exponent by modelling departure from a Pareto distribution. The Annals of Statistics 27, 2 (1999), 760--781.Google ScholarGoogle ScholarCross RefCross Ref
  56. Catherine Forbes, Merran Evans, Nicholas Hastings, and Brian Peacock. 2011. Statistical distributions. John Wiley & Sons.Google ScholarGoogle Scholar
  57. Xavier Gabaix. 2009. Power laws in economics and finance. Annual Review of Economics 1 (2009), 255--93.Google ScholarGoogle ScholarCross RefCross Ref
  58. Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic, and Wolfgang Kellerer. 2010. Outtweeting the Twitterers - Predicting information cascades in microblogs. In Proceedings of the 3rd Conference on Online Social Networks. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. David Garcia, Pavlin Mavrodiev, and Frank Schweitzer. 2013. Social resilience in online communities: The autopsy of friendster. In Proceedings of the First ACM Conference on Online Social Networks. ACM, 39--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Wolfgang Gatterbauer. 2011. Rules of thumb for information acquisition from large and redundant data. In Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18--21, 2011. 479--490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Natalie Glance, Matthew Hurst, and Takashi Tomokiyo. 2004. Blogpulse: Automated trend discovery for weblogs. In WWW 2004 Workshop on the Weblogging ecosystem: Aggregation, Analysis and Dynamics, Vol. 2004. ACM.Google ScholarGoogle Scholar
  62. Yoav Goldberg and Jon Orwant. 2013. A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. Technical Report. Google. http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html.Google ScholarGoogle Scholar
  63. Greg N. Gregoriou. 2009. Operational Risk Toward Basel III: Best Practices and Issues in Modeling, Management, and Regulation. Vol. 481. John Wiley & Sons.Google ScholarGoogle Scholar
  64. Peter Grünwald. 2007. The Minimum Description Length Principle. MIT press.Google ScholarGoogle Scholar
  65. Cathal Gurrin and Alan F. Smeaton. 2004. Replicating web structure in small-scale test collections. Information retrieval 7, 3--4 (2004), 239--263. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Matthias Hagen, Martin Potthast, Benno Stein, and Christof Braeutigam. 2010. The power of naive query segmentation. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 797--798. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Harry Halpin, Valentin Robu, and Hana Shepherd. 2007. The complex dynamics of collaborative tagging. In Proceedings of the 16th International Conference on World Wide Web. ACM, 211--220. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Robert K. Hammond and James E. Bickel. 2013. Reexamining discrete approximations to continuous distributions. Decision Analysis 10, 1 (2013), 6--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Claudia Hauff and Leif Azzopardi. 2005. Age dependent document priors in link structure analysis. In Advances in Information Retrieval. Springer, 552--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Harold S. Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., Orlando, FL, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Daniel Heesch and Stefan Rüger. 2004. NNk networks for content-based image retrieval. In Advances in Information Retrieval. Springer, 253--266.Google ScholarGoogle Scholar
  72. Joseph Hilbe. 2011. Negative Binomial Regression. Cambridge University Press.Google ScholarGoogle Scholar
  73. Bruce M. Hill. 1975. A simple general approach to inference about the tail of a distribution. The Annals of Statistics 3, 5 (1975), 1163--1174.Google ScholarGoogle ScholarCross RefCross Ref
  74. Andreas Hotho, Robert Jäschke, Christoph Schmitz, and Gerd Stumme. 2006. Information retrieval in folksonomies: Search and ranking. The Semantic Web: Research and Applications (2006), 411--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Bernardo A. Huberman and Lada A. Adamic. 1999. Evolutionary dynamics of the world wide web. arXiv Preprint Cond-Mat/9901071 (1999).Google ScholarGoogle Scholar
  76. Clifford M. Hurvich and Chih-Ling Tsai. 1989. Regression and time series model selection in small samples. Biometrika 76, 2 (1989), 297--307.Google ScholarGoogle ScholarCross RefCross Ref
  77. Hawoong Jeong, Bálint Tombor, Réka Albert, Zoltan N. Oltvai, and Albert-László Barabási. 2000. The large-scale organization of metabolic networks. Nature 407, 6804 (2000), 651--654.Google ScholarGoogle ScholarCross RefCross Ref
  78. Hai Jin, Xiaomin Ning, and Hanhua Chen. 2006. Efficient search for peer-to-peer information retrieval using semantic small world. In Proceedings of the 15th International Conference on World Wide Web. ACM, 1003--1004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Shudong Jin and Azer Bestavros. 2000. Sources and characteristics of web temporal locality. In Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2000. IEEE, 28--35. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Norman L. Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. 2002. Continuous Multivariate Distributions, Volume 1, Models and Applications. Vol. 59. New York: John Wiley & Sons.Google ScholarGoogle Scholar
  81. Jaeyeon Jung, Emil Sit, Hari Balakrishnan, and Robert Morris. 2002. DNS performance and the effectiveness of caching. IEEE/ACM Transactions on Networking 10, 5 (2002), 589--603. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Jaap Kamps and Marijn Koolen. 2008. The importance of link evidence in Wikipedia. In Advances in Information Retrieval. Springer, 270--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Noriaki Kawamae. 2014. Supervised N-gram topic model. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (Web Search and Data Mining’14). 473--482. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Noam Koenigstein, Yuval Shavitt, Ela Weinsberg, and Udi Weinsberg. 2010. On the applicability of peer-to-peer data in music information retrieval research. In International Society for Music Information Retrieval. 273--278.Google ScholarGoogle Scholar
  85. Leonid Kopylev. 2012. Constrained parameters in applications: Review of issues and approaches. International Scholarly Research Notices 2012 (2012).Google ScholarGoogle ScholarCross RefCross Ref
  86. Beate Krause, Robert Jäschke, Andreas Hotho, and Gerd Stumme. 2008. Logsonomy-social information retrieval with logdata. In Proceedings of the 19th ACM Conference on Hypertext and Hypermedia. ACM, 157--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Jérôme Kunegis and Julia Preusse. 2012. Fairness on the web: Alternatives to the power law. In Proceedings of the 4th Annual ACM Web Science Conference. ACM, 175--184. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web. ACM, 591--600. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Erich L. Lehmann and Joseph P. Romano. 2006. Testing Statistical Hypotheses. Springer.Google ScholarGoogle Scholar
  90. Mark Levy and Mark Sandler. 2009. Music information retrieval using social tags and audio. IEEE Transactions on Multimedia 11, 3 (2009), 383--395. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Christina Lioma. 2007. Part of Speech n-Grams for Information Retrieval. Ph.D. Dissertation. University of Glasgow.Google ScholarGoogle Scholar
  92. Christina Lioma and Iadh Ounis. 2007. Light syntactically-based index pruning for information retrieval. In Proceedings of the 29th European Conference on IR Research Advances in Information Retrieval (ECIR 2007), Rome, Italy, April 2--5, 2007, 88--100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Christina Lioma and Iadh Ounis. 2008. A syntactically-based query reformulation technique for information retrieval. Information Processing & Management 44 (2008), 143--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Christina Lioma and Cornelis Joost van Rijsbergen. 2008. Part of speech N-grams and information retrieval. Revue française De Linguistique Appliquée 13, 1 (2008), 9--22.Google ScholarGoogle Scholar
  95. Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma. 2005. Support vector machines classification with a very large-scale taxonomy. ACM Knowledge Discovery and Data Mining: Explorations Newsletter 7, 1 (2005), 36--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. Wuying Liu, Lin Wang, and Mianzhu Yi. 2013. Power law for text categorization. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, 131--143.Google ScholarGoogle Scholar
  97. Roger Lowenstein. 2000. When Genius Failed: The Rise and Fall of Long-Term Capital Management. Random House Trade Paperbacks.Google ScholarGoogle Scholar
  98. Hans P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development 2, 2 (1958), 159--165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Marianne Lykke, Birger Larsen, Haakon Lund, and Peter Ingwersen. 2010. Developing a test collection for the evaluation of integrated search. In Advances in Information Retrieval -32rd European Conference on IR Research (ECIR’10). Springer, 627--630. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Colin L. Mallows. 1973. Some comments on CP. Technometrics 15, 4 (1973), 661--675.Google ScholarGoogle Scholar
  101. Benoit Mandelbrot. 1953. An informational theory of the statistical structure of language. Communication Theory 84 (1953).Google ScholarGoogle Scholar
  102. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Vol. 1. Cambridge University Press. Google ScholarGoogle Scholar
  103. Yuqing Mao and Zhiyong Lu. 2013. Predicting clicks of PubMed articles. In Proceedings of the AMIA Annual Symposium, Vol. 2013. American Medical Informatics Association, 947.Google ScholarGoogle Scholar
  104. Alberto Maydeu-Olivares and Carlos Garca-Forero. 2010. Goodness-of-fit testing. In International Encyclopedia of Education (3 ed.), Baker E. Peterson, P. and B. McGaw (Eds.). Elsevier, 190--196.Google ScholarGoogle Scholar
  105. Alberto Medina, Ibrahim Matta, and John Byers. 2000. On the origin of power laws in internet topologies. ACM SIGCOMM Computer Communication Review 30, 2 (2000), 18--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Mark M. Meerschaert and Hans-Peter Scheffler. 2001. Limit Distributions for Sums of Independent Random vectors: Heavy Tails in Theory and Practice. Vol. 321. John Wiley & Sons.Google ScholarGoogle Scholar
  107. Edgar Meij and Maarten de Rijke. 2007. Using prior information derived from citations in literature search. In Recherche d’Information et ses Applications.Google ScholarGoogle Scholar
  108. George A. Miller. 1957. Some effects of intermittent silence. American Journal of Psychology (1957), 311--314.Google ScholarGoogle Scholar
  109. Staša Milojević. 2010. Power law distributions in information science: Making the case for logarithmic binning. Journal of the American Society for Information Science and Technology 61, 12 (2010), 2417--2425. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Gilad Mishne and Natalie Glance. 2006. Leave a reply: An analysis of weblog comments. In Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem.Google ScholarGoogle Scholar
  111. Michael Mitzenmacher. 2004. A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1, 2 (2004), 226--251.Google ScholarGoogle ScholarCross RefCross Ref
  112. Saeedeh Momtazi and Dietrich Klakow. 2010. Hierarchical Pitman-yor language model for information retrieval. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 793--794. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Fabrice Muhlenbach and Ricco Rakotomalala. 2005. Discretization of continuous attributes. Encyclopedia of Data Warehousing and Mining 1 (2005), 397--402.Google ScholarGoogle ScholarCross RefCross Ref
  114. Mark E. J. Newman. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 5 (2005), 323--351.Google ScholarGoogle ScholarCross RefCross Ref
  115. Christopher R. Palmer and Greg Steffan. 2000. Generating network topologies that obey power laws. In Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM’00),Vol. 1. IEEE, 434--438.Google ScholarGoogle Scholar
  116. Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems (InfoScale’06). Article 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  117. David M. Pennock, Gary William Flake, Steve Lawrence, Eric J. Glover, and Clyde L. Giles. 2002. Winners don’t take all: Characterizing the competition for links on the web. In Proceedings of the National Academy of Sciences 99, 8 (2002), 5207--5211.Google ScholarGoogle ScholarCross RefCross Ref
  118. Matjaž Perc. 2010. Zipfs law and log-normal distributions in measures of scientific output across fields and institutions: 40 years of Slovenias research as an example. Journal of Informetrics 4, 3 (2010), 358--364.Google ScholarGoogle ScholarCross RefCross Ref
  119. Isabella Peters and Wolfgang G. Stock. 2010. “Power tags” in information retrieval. Library Hi Tech 28, 1 (2010), 81--93.Google ScholarGoogle ScholarCross RefCross Ref
  120. Jim Pitman and Marc Yor. 1997. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability (1997), 855--900.Google ScholarGoogle Scholar
  121. David Posada and Thomas R. Buckley. 2004. Model selection and model averaging in phylogenetics: Advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Systematic Biology 53, 5 (2004), 793--808.Google ScholarGoogle ScholarCross RefCross Ref
  122. Le Quan Ha, Ji Ming, and Francis Jack Smith. 2003. Extension of Zipfs law to word and character n-grams for English and Chinese. Journal of Computational Linguistics and Chinese Language Processing 1, 77--102. Citeseer.Google ScholarGoogle Scholar
  123. Venugopalan Ramasubrama nian and Emin Gün Sirer. 2004. Beehive: Exploiting power law query distributions for O (1) lookup performance in peer to peer overlays. In Symposium on Networked Systems Design and Implementation. Usenix, San Francisco CA.Google ScholarGoogle Scholar
  124. Sidney Redner. 1998. How popular is your paper? An empirical study of the citation distribution. European Physical Journal B-Condensed Matter and Complex Systems 4, 2 (1998), 131--134.Google ScholarGoogle ScholarCross RefCross Ref
  125. William J. Reed. 2003. The Pareto law of incomes: An explanation and an extension. Physica A: Statistical Mechanics and Its Applications 319 (2003), 469--486.Google ScholarGoogle ScholarCross RefCross Ref
  126. William J. Reed and Murray Jorgensen. 2004. The double Pareto-lognormal distributiona new parametric model for size distributions. Communications in Statistics-Theory and Methods 33, 8 (2004), 1733--1753.Google ScholarGoogle ScholarCross RefCross Ref
  127. Matei Ripeanu and Ian T. Foster. 2002. Mapping the Gnutella network: Macroscopic properties of large-scale peer-to-peer systems. In IPTPS. Computing Research Repository, 85--93. Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. Seth Roberts and Harold Pashler. 2000. How persuasive is a good fit? A comment on theory testing. Psychological Review 107, 2 (2000), 358.Google ScholarGoogle ScholarCross RefCross Ref
  129. Issei Sato and Hiroshi Nakagawa. 2010. Topic models with power-law using Pitman-Yor process. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 673--682. Google ScholarGoogle ScholarDigital LibraryDigital Library
  130. Christian D. Schunn and Dieter Wallach. 2005. Evaluating Goodness-of-Fit in Comparison of Models to Data. University of Saarland Press, Saarbrueken, 115--154.Google ScholarGoogle Scholar
  131. Gideon Schwarz. 1978. Estimating the dimension of a model. Annals of Statistics 6, 2 (1978), 461--464.
  132. Ripunjai K. Shukla, Mohan Trivedi, and Manoj Kumar. 2010. On the proficient use of GEV distribution: A case study of subtropical monsoon region in India. Annals of Computer Science Series 8, 1 (2010).
  133. Börkur Sigurbjörnsson and Roelof van Zwol. 2008. Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th International Conference on World Wide Web. ACM, 327--336.
  134. Herbert A. Simon. 1955. On a class of skew distribution functions. Biometrika (1955), 425--440.
  135. Ian Soboroff. 2002. Does wt10g look like the web? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 423--424.
  136. Karen Spärck-Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11--21.
  137. Laura Spierdijk and Mark Voorneveld. 2009. Superstars without talent? The Yule distribution controversy. Review of Economics and Statistics 91, 3 (2009), 648--652.
  138. Kunwadee Sripanidkulchai, Bruce Maggs, and Hui Zhang. 2003. Efficient content location using interest-based locality in peer-to-peer systems. In Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’03), Vol. 3. IEEE, 2166--2176.
  139. Nitish Srivastava, Ruslan Salakhutdinov, and Geoffrey E. Hinton. 2013. Modeling documents with deep Boltzmann machines. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 616--625.
  140. Alexandru Tatar, Panayotis Antoniadis, Marcelo D. De Amorim, and Serge Fdida. 2014. From popularity prediction to ranking online news. Social Network Analysis and Mining 4, 1 (2014), 1--12.
  141. Jiancong Tong, Gang Wang, Douglas S. Stones, Shizhao Sun, Xiaoguang Liu, and Fan Zhang. 2013. Exploiting query term correlation for list caching in web search engines. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1817--1820.
  142. Yana Volkovich, Nelly Litvak, and Debora Donato. 2007. Determining factors behind the PageRank log-log plot. In Algorithms and Models for the Web-Graph. Springer, 108--123.
  143. Quang H. Vuong. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society (1989), 307--333.
  144. Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Yong Yu, Wei-Ying Ma, WenSi Xi, and WeiGuo Fan. 2004. Optimizing web search using web click-through data. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04). ACM, 118--126.
  145. Yiming Yang, Jian Zhang, and Bryan Kisiel. 2003. A scalability analysis of classifiers in text categorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 96--103.
  146. Emmanuel J. Yannakoudakis, Ioannis Tsomokos, and Paul J. Hutton. 1990. N-Grams and their implication to natural language understanding. Pattern Recognition 23, 5 (1990), 509--528.
  147. Mao Ye, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. 2011. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 325--334.
  148. Haizheng Zhang and Victor Lesser. 2006. Multi-agent based peer-to-peer information retrieval systems with concurrent search sessions. In Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems. ACM, 305--312.
  149. Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and Clyde L. Giles. 2008. Exploring social annotations for information retrieval. In Proceedings of the 17th International Conference on World Wide Web. ACM, 715--724.
  150. George K. Zipf. 1935. The Psycho-Biology of Language. Houghton, Mifflin.

    • Published in

      ACM Transactions on Information Systems  Volume 34, Issue 2
      April 2016
      220 pages
      ISSN:1046-8188
      EISSN:1558-2868
      DOI:10.1145/2891107

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 February 2016
      • Accepted: 1 August 2015
      • Revised: 1 June 2015
      • Received: 1 October 2014

      Qualifiers

      • research-article
      • Research
      • Refereed
