Understanding the social evolution of the Java community in Stack Overflow: A 10-year study of developer interactions

https://doi.org/10.1016/j.future.2019.12.021

Highlights

  • Stack Overflow has accumulated a lot of development data and knowledge.

  • Java is one of today’s most used programming languages.

  • Textual contents of question–answer pairs are useful for understanding information flow.

  • Resource sharing is key to exchanging and reusing knowledge.

  • User reputation can be linked to the quality of their social activity.

Abstract

Today, Social Media is a key information source for a wide range of domains as a means to gain a better understanding of information flows and user communities. This work introduces a methodology combining machine learning and graph mining approaches to address relevant aspects of quality of service and user intrinsic motivation from the user’s perspective. The focus of the present analysis is set on the social interactions among software developers via Stack Overflow. Over the last 10 years, software developers have become intensively involved in knowledge sharing and platforms such as Stack Overflow have accumulated a lot of development data and knowledge. The proposed methodology is applied to explore the social dynamics of the Java programming language community and bring forward relevant, non-trivial knowledge about developer interests, information flows and user engagement and reputation. The ultimate aim is to improve question preparation towards better question routing and voting outcome.

Introduction

In recent decades, community-based question-and-answer (Q&A) sites have become very popular and have enabled knowledge sharing at unprecedented levels. Stack Overflow is the de facto Q&A website for topics in Computer Science. Current platform statistics report 10 million users, 18 million questions, and 27 million answers (71% of questions answered) [1]. Moreover, the number of programming languages in use has increased. In 2018, the Developer’s Survey of Stack Overflow listed 38 different programming languages among the most loved, dreaded, and wanted languages.

Recently, Stack Overflow released its posts as online archives (https://archive.org/download/stackexchange), which paved the way to streamline the analysis of these conversation threads in various useful ways.

The present work complements current literature by introducing a new methodology of analysis that tackles the quality of service and user reputation in Q&A platforms from the user’s perspective. More specifically, this methodology aims to improve the user experience by proposing practical recommendations on how users can point their questions in the right direction, i.e. reducing the number of questions with no answers, routing questions to the right answers, and promoting the content quality of the platform by identifying low-quality contents. As a meaningful case study, this paper explores the social evolution experienced by the Java developer community on Stack Overflow, i.e. an in-depth look into the topics that have motivated the most discussion over the years, the evolution of social dynamics, including user altruism and reputation, and the cross-referencing of internal contents as well as external sources.

To the best of our knowledge, such an integrative analysis has not been presented before, although a number of works address similar topics. The related work section describes some of these works while pinpointing the new contributions presented here.

Section snippets

Related work

Understanding the dynamics of participation in Q&A platforms is essential to improve the value of crowdsourced knowledge and the quality of service as well as to promote user engagement. Many works have focused on the platform’s needs and challenges, but few works address the user’s perspective, namely the duality of being information seekers and information producers.

The increasing number of low-value, unanswered questions has prompted the need to learn how to pose well-received questions [2].

Data retrieval and preparation

Conversation threads tagged as Java-related and posted from 2008 to 2018 were downloaded from the Stack Overflow archives (https://archive.org/details/stackexchange). This amounted to a total of 3.33 million posts, of which 1.8 million were questions and, within this set, approximately 0.9 million had answers (i.e. closed questions, referred to here as Q&A pairs). These communication threads were sustained by a total of 0.3 million unique users.
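As a rough illustration of this retrieval step, the following Python sketch streams a Posts.xml-style file and tallies Java-tagged questions, their answers, and the users involved. It is not the authors' pipeline. It assumes the dump layout in which each post is a `row` element with `PostTypeId`, `CreationDate`, `Tags`, `AnswerCount`, `ParentId` and `OwnerUserId` attributes, and in which tags are stored as `<java><xml>`; newer dumps may use a different tag delimiter. The function name `java_thread_stats`, the exact-match rule on the `java` tag, and the tiny `SAMPLE` document are hypothetical choices made only for this example.

```python
import io
import xml.etree.ElementTree as ET


def java_thread_stats(posts_xml, first_year=2008, last_year=2018):
    """Count Java-tagged questions, their answers and the users involved.

    Assumes rows are ordered so that a question appears before its answers.
    """
    java_questions = set()  # Ids of questions that pass the tag/date filter
    answered = 0            # filtered questions with at least one answer
    answers = 0             # answer posts attached to filtered questions
    users = set()           # owners of any counted post

    for _, row in ET.iterparse(posts_xml, events=("end",)):
        if row.tag != "row":
            continue
        post_type = row.get("PostTypeId")
        owner = row.get("OwnerUserId")
        if post_type == "1":  # assumed code for a question
            year = int(row.get("CreationDate", "0000")[:4])
            # Assumption: "Java-related" means carrying the exact java tag.
            if "<java>" in row.get("Tags", "") and first_year <= year <= last_year:
                java_questions.add(row.get("Id"))
                if int(row.get("AnswerCount", "0")) > 0:
                    answered += 1
                if owner:
                    users.add(owner)
        elif post_type == "2" and row.get("ParentId") in java_questions:
            # Assumed code for an answer; answers are not date-filtered here.
            answers += 1
            if owner:
                users.add(owner)
        row.clear()  # drop the row's attributes to limit memory use

    return {
        "questions": len(java_questions),
        "answered": answered,
        "answers": answers,
        "users": len(users),
    }


# Hypothetical five-row sample standing in for the real dump.
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2010-05-01T10:00:00.000" Tags="&lt;java&gt;&lt;xml&gt;" AnswerCount="1" OwnerUserId="7" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2010-05-01T11:00:00.000" OwnerUserId="8" />
  <row Id="3" PostTypeId="1" CreationDate="2012-03-02T09:00:00.000" Tags="&lt;java&gt;" AnswerCount="0" OwnerUserId="7" />
  <row Id="4" PostTypeId="1" CreationDate="2012-03-02T09:30:00.000" Tags="&lt;python&gt;" AnswerCount="1" OwnerUserId="9" />
  <row Id="5" PostTypeId="1" CreationDate="2019-01-01T00:00:00.000" Tags="&lt;java&gt;" AnswerCount="1" OwnerUserId="10" />
</posts>"""

stats = java_thread_stats(io.BytesIO(SAMPLE.encode("utf-8")))
print(stats)
```

Streaming with `iterparse` avoids loading the whole file at once, which matters for an archive of this size. Whether the study also counted tags such as `java-8` as Java-related is not stated in the text, so the filter above may be narrower than the one actually used.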

The textual information in Q&A pairs,

Results and discussion

The study of the social interplay of the Java community in Stack Overflow throughout the last decade enabled the evaluation of the proposed methodology, in terms of correctness and robustness, as well as scalability in practical domains. The next sections describe the community evolution in general terms and then, the modelling of Q&A contents and the modelling of user intrinsic motivation.

Contents modelling aims to bring forward valuable, actionable information towards improving question

Conclusions and future work

This paper presents a methodology that combines machine learning and graph mining techniques to analyse the quality of service and user reputation of communities in Q&A platforms. To be able to grasp how to formulate questions properly is beneficial not only for the information seekers, because it increases the likelihood of receiving support, but also for the whole community, since it enhances effective knowledge-sharing behaviour, and, most notably, the creation of long-lasting value pieces

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study was supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia), Spain under the scope of the strategic funding of ED431C2018/55-GRC Competitive Reference Group, and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2019 unit. SING group thanks CITI (Centro de Investigación, Transferencia e Innovación) from the University of Vigo for hosting its IT infrastructure.

Guillermo Blanco González is a Ph.D. student in Computer Science at the University of Vigo. He is currently developing advanced computational methods for modelling social dynamics, namely in biological ecosystems.

References (31)

  • Alreshedy K. et al. Predicting the programming language of questions and snippets of StackOverflow using natural language processing (2018)

  • Ragkhitwetsagul C. et al. Toxic code snippets on stack overflow. IEEE Trans. Softw. Eng. (2019)

  • Greco C. et al. StackInTheFlow: Behavior-driven recommendation system for stack overflow posts

  • Fumin S. et al. Recommendflow: Use topic model to automatically recommend stack overflow Q & A in IDE

  • Ponzanelli L. et al. Mining StackOverflow to turn the IDE into a self-confident programming prompter

    Roi Pérez López is a Master's student in Computer Science at the University of Vigo. His main research interests include text mining, sentiment analysis, and topic modelling.

    Florentino Fdez-Riverola is a Full Professor of the Department of Computer Science at the University of Vigo (Spain) and Coordinator of the New Generation Computer Systems group (SING, http://sing-group.org), which is dedicated to the research and development of cutting-edge computational methodologies and applications.

    Anália Maria Garcia Lourenço is a faculty member of the Department of Computer Science and a researcher affiliated to the Biomedical Research Centre (CINBIO), at the University of Vigo and the Centre of Biological Engineering, at the University of Minho. Her main research interests include computational intelligence, bioinformatics and systems biology.
