Abstract
Several properties of information retrieval (IR) data, such as query frequency or document length, are widely assumed to approximately follow a power law. The assumption is typically made to highlight specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). It does not, however, always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) To what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? (ii) What is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to that of 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
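To make the idea of comparing a power law against an alternative distribution concrete, the following is a minimal sketch and not the procedure used in the article. It assumes a continuous power law with the lower cutoff fixed at the sample minimum, uses the Inverse Gaussian as the only alternative, and ranks the two by the Akaike information criterion. The function names (fit_power_law, fit_inverse_gaussian, aic, best_model) and the synthetic samples are hypothetical and chosen for illustration only.

```python
import math
import random


def fit_power_law(xs, xmin):
    """Fit p(x) = ((alpha - 1) / xmin) * (x / xmin) ** (-alpha) for x >= xmin."""
    n = len(xs)
    log_sum = sum(math.log(x / xmin) for x in xs)
    alpha = 1.0 + n / log_sum  # closed-form maximum-likelihood estimate
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def fit_inverse_gaussian(xs):
    """Fit an Inverse Gaussian (mean mu, shape lam) by maximum likelihood."""
    n = len(xs)
    mu = sum(xs) / n
    lam = n / sum(1.0 / x - 1.0 / mu for x in xs)
    loglik = sum(
        0.5 * math.log(lam / (2.0 * math.pi * x ** 3))
        - lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x)
        for x in xs
    )
    return (mu, lam), loglik


def aic(loglik, n_params):
    """Akaike information criterion; lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * loglik


def best_model(xs):
    """Return the name of the lower-AIC model and both AIC scores."""
    xmin = min(xs)  # assumption: cutoff fixed at the sample minimum, not estimated
    _, ll_pl = fit_power_law(xs, xmin)
    _, ll_ig = fit_inverse_gaussian(xs)
    scores = {"power_law": aic(ll_pl, 1), "inverse_gaussian": aic(ll_ig, 2)}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data only: one heavy-tailed sample, one unimodal skewed sample.
    heavy_tailed = [rng.paretovariate(1.5) for _ in range(5000)]
    unimodal = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]
    print(best_model(heavy_tailed)[0])
    print(best_model(unimodal)[0])
```

Both estimators are closed-form, so the cost of fitting is a few passes over the data. A fuller treatment would also estimate the lower cutoff, handle discrete data, and test goodness of fit rather than rely on AIC alone.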