More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis

Recchia, Gabriel; Jones, Michael N.

doi:10.3758/BRM.41.3.647

Society for Computers in Psychology
Published: 01 August 2009

Volume 41, pages 647–656 (2009)
Behavior Research Methods

Abstract

Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.
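The abstract does not spell out the metric, so the following is only a minimal sketch of the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), and not the authors' tool. It assumes probabilities are estimated from document-level co-occurrence counts; the function name `pmi_table` and the toy corpus are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Estimate PMI for every word pair that co-occurs in at least one document.

    Assumption: a "co-occurrence" means both words appear in the same document,
    and probabilities are simple document-frequency ratios.
    """
    n_docs = len(documents)
    word_counts = Counter()  # number of documents containing each word
    pair_counts = Counter()  # number of documents containing both words

    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        word_counts.update(vocab)
        # Sorting the vocabulary makes each pair key alphabetically ordered.
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n_docs
        p_w1 = word_counts[w1] / n_docs
        p_w2 = word_counts[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus: four very short "documents".
toy_corpus = [
    "doctor nurse hospital",
    "doctor nurse patient",
    "river bank water",
    "bank money loan",
]
scores = pmi_table(toy_corpus)
print(scores[("doctor", "nurse")])
print(scores[("river", "water")])
```

Pairs that never share a document get no entry at all, since their PMI would be the logarithm of zero. Because the sketch only accumulates counts, it scales to larger corpora by streaming documents through the same loop, which is in the spirit of the "simple and scalable" framing in the abstract.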

Author information

Authors and Affiliations

Cognitive Science Program, Indiana University, 819 Eigenmann, 1910 E. 10th St., Bloomington, IN 47406-7512
Gabriel Recchia & Michael N. Jones

Corresponding author

Correspondence to Gabriel Recchia.

Additional information

This work was presented at the 38th Meeting of the Society for Computers in Psychology, Chicago, IL. G.R.’s contribution received the John Castellan Award for best student paper. Correspondence concerning this article should be addressed to G. Recchia,

About this article

Cite this article

Recchia, G., Jones, M.N. More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods 41, 647–656 (2009). https://doi.org/10.3758/BRM.41.3.647

Received: 21 November 2008
Accepted: 11 December 2008
Published: 01 August 2009
Issue Date: August 2009
DOI: https://doi.org/10.3758/BRM.41.3.647
