More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis

  • Society for Computers in Psychology
Behavior Research Methods

Abstract

Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a tool for quickly and efficiently building simple, scalable models from large corpora.
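
For readers unfamiliar with the metric, pointwise mutual information in its standard form is PMI(x, y) = log2[ p(x, y) / (p(x) p(y)) ]. The sketch below estimates it from document-level co-occurrence counts. It is an illustration only: the counting unit (whole documents), the whitespace tokenization, and the function name `build_pmi` are assumptions made here, and the article's actual procedure is not visible in this preview.

```python
import math
from collections import Counter
from itertools import combinations


def build_pmi(documents):
    """Return a pmi(w1, w2) function estimated from document co-occurrence.

    Assumption: two words "co-occur" when they appear in the same document.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))  # naive whitespace tokens
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs stored in sorted order

    def pmi(w1, w2):
        a, b = sorted((w1, w2))
        # A word always co-occurs with itself in every document it appears in.
        joint = word_df[a] if a == b else pair_df[(a, b)]
        if joint == 0:
            return float("-inf")  # the pair was never observed together
        # log2( p(a, b) / (p(a) * p(b)) ) with probabilities as count / n_docs
        return math.log2(joint * n_docs / (word_df[a] * word_df[b]))

    return pmi


docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell sharply",
    "the market fell",
]
pmi = build_pmi(docs)
print(pmi("cat", "chased"))  # words that share documents score higher
print(pmi("cat", "stocks"))  # words that never share a document
```

Because the model is nothing more than counts, it can be updated one document at a time, which is what makes this kind of metric practical to train on very large corpora.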

Author information

Corresponding author

Correspondence to Gabriel Recchia.

Additional information

This work was presented at the 38th Meeting of the Society for Computers in Psychology, Chicago, IL. G.R.’s contribution received the John Castellan Award for best student paper. Correspondence concerning this article should be addressed to G. Recchia,

About this article

Cite this article

Recchia, G., Jones, M.N. More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods 41, 647–656 (2009). https://doi.org/10.3758/BRM.41.3.647

  • DOI: https://doi.org/10.3758/BRM.41.3.647
