Power Law Distributions in Information Retrieval

Authors:
Casper Petersen, Jakob Grue Simonsen, and Christina Lioma
University of Copenhagen, Denmark

ACM Transactions on Information Systems, Volume 34, Issue 2, Article No. 8, pp. 1–37. https://doi.org/10.1145/2816815

Published: 16 February 2016

Abstract

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely assumed to be approximately power-law distributed. This assumption is typically made to highlight specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). The assumption, however, may not always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? (ii) What is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find that the overhead of replacing power law approximations with more informed distribution fitting is negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
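To make the idea of comparing a power law against an alternative distribution concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not specify the fitting method, so the sketch assumes maximum-likelihood fitting and an AIC comparison, uses a continuous power law with a known lower bound `xmin`, and uses the lognormal purely as a simple stand-in alternative (the abstract does not say whether it is among the 15 distributions studied). All function names are hypothetical, and the data are synthetic.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n samples from a continuous power law via inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_power_law(xs, xmin):
    """MLE for p(x) = ((alpha - 1) / xmin) * (x / xmin) ** (-alpha), x >= xmin.

    xmin is treated as known (an assumption of this sketch).
    Returns (alpha, log_likelihood).
    """
    n = len(xs)
    log_sum = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / log_sum
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def fit_lognormal(xs):
    """MLE for a lognormal distribution. Returns (mu, sigma, log_likelihood)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    # At the MLE, the squared-deviation term of the log-likelihood sums to n / 2.
    loglik = -sum(logs) - n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - 0.5 * n
    return mu, sigma, loglik


def aic(loglik, k):
    """Akaike information criterion for a model with k fitted parameters."""
    return 2.0 * k - 2.0 * loglik


# Synthetic data drawn from a power law with alpha = 2.5 (illustrative values).
rng = random.Random(0)
data = sample_power_law(5000, alpha=2.5, xmin=1.0, rng=rng)

alpha_hat, ll_pl = fit_power_law(data, xmin=1.0)
_, _, ll_ln = fit_lognormal(data)

aic_pl = aic(ll_pl, k=1)  # one fitted parameter: alpha
aic_ln = aic(ll_ln, k=2)  # two fitted parameters: mu and sigma
best = "power law" if aic_pl < aic_ln else "lognormal"
print(f"alpha_hat={alpha_hat:.3f} AIC(power law)={aic_pl:.1f} AIC(lognormal)={aic_ln:.1f} best={best}")
```

The lower AIC marks the preferred model. Real IR quantities such as term or query frequencies are discrete counts, so a careful analysis would use discrete distributions and would also estimate `xmin` rather than fix it.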

Supplemental Material

petersen.zip (13 MB): Supplemental movie, appendix, image, and software files for Power Law Distributions in Information Retrieval

Index Terms

Power Law Distributions in Information Retrieval
1. Information systems
  1. Information retrieval
    1. Document representation

Published in
ACM Transactions on Information Systems Volume 34, Issue 2
April 2016
220 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2891107
Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 February 2016
- Accepted: 1 August 2015
- Revised: 1 June 2015
- Received: 1 October 2014

Author Tags
Statistical model selection
power laws
Qualifiers
- research-article
- Research
- Refereed