DOI: 10.1145/2065023.2065035
research-article

Predicting age and gender in online social networks

Published: 28 October 2011

ABSTRACT

A common characteristic of communication on online social networks is that it happens via short messages, often using non-standard language variations. These characteristics make this type of text a challenging text genre for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender and location in order to hide one's true identity, providing criminals such as pedophiles with new possibilities to groom their victims. It would therefore be useful if user profiles can be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach for the prediction of age and gender on a corpus of chat texts, which we collected from the Belgian social networking site Netlog. We examine which types of features are most informative for a reliable prediction of age and gender on this difficult text type and perform experiments with different data set sizes in order to acquire more insight into the minimum data size requirements for this task.


    Published in

      SMUC '11: Proceedings of the 3rd international workshop on Search and mining user-generated contents
      October 2011
      100 pages
      ISBN: 9781450309493
      DOI: 10.1145/2065023

      Copyright © 2011 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 15 of 25 submissions, 60%
