Abstract
Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a tool for building simple, scalable models from large corpora quickly and efficiently.
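As a rough illustration of the kind of metric the abstract refers to, the sketch below computes pointwise mutual information for word pairs using the standard definition, PMI(x, y) = log2[p(x, y) / (p(x) p(y))]. The abstract does not say how the probabilities are estimated, so the document-level co-occurrence counting used here is an assumption, and the function name `pmi_scores` and the toy corpus are hypothetical. It is not the authors' tool.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return a dict mapping alphabetically ordered word pairs to PMI (log base 2).

    Assumption: p(x) is the fraction of documents containing x, and
    p(x, y) is the fraction of documents containing both x and y.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Count each word (and each pair) at most once per document.
        vocab = sorted(set(doc.lower().split()))
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n_docs
        p1 = word_counts[w1] / n_docs
        p2 = word_counts[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores


# Hypothetical toy corpus: each string stands in for one document.
toy_corpus = [
    "doctor nurse hospital",
    "doctor nurse",
    "doctor hospital",
    "bread butter",
]
scores = pmi_scores(toy_corpus)
```

Pairs that never co-occur are simply absent from the result, since their PMI would be undefined (the logarithm of zero). Because the computation reduces to counting, it can in principle be run over very large corpora, which is the property the abstract emphasizes.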
Additional information
This work was presented at the 38th Meeting of the Society for Computers in Psychology, Chicago, IL. G.R.’s contribution received the John Castellan Award for best student paper. Correspondence concerning this article should be addressed to G. Recchia,
About this article
Cite this article
Recchia, G., Jones, M.N. More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods 41, 647–656 (2009). https://doi.org/10.3758/BRM.41.3.647
DOI: https://doi.org/10.3758/BRM.41.3.647