ABSTRACT
With recent research interest in the confounding roles of homophily and contagion in studies of social influence, there is a strong need for reliable content-based measures of the similarity between people. In this paper, we investigate the use of text similarity measures as a way of predicting the similarity of prolific weblog authors. We describe a novel method of collecting human judgments of overall similarity between two authors, as well as demographic, political, cultural, religious, values, hobbies/interests, personality, and writing style similarity. We then apply a range of automated textual similarity measures based on word frequency counts, and calculate their statistical correlation with human judgments. Our findings indicate that commonly used text similarity measures do not correlate well with human judgments of author similarity. However, various measures that pay special attention to personal pronouns and their context correlate significantly with different facets of similarity.
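The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the word-frequency measure and Pearson correlation as the statistic, neither of which the abstract names, and the `PRONOUNS` set and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical pronoun list for illustration only; the abstract does not
# specify which pronouns or contexts the paper's measures use.
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your",
            "he", "she", "they", "them"}


def word_counts(text, vocab=None):
    """Count lowercase word tokens, optionally restricted to a vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if vocab is not None:
        tokens = [t for t in tokens if t in vocab]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation between automated scores and human ratings."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)
```

Under these assumptions, each author pair is scored once with `cosine(word_counts(a), word_counts(b))` and once with the pronoun-restricted variant `cosine(word_counts(a, PRONOUNS), word_counts(b, PRONOUNS))`. Each list of scores is then passed to `pearson` together with the matching human similarity ratings.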