GitHub repositories with links to academic papers: Public access, traceability, and evolution

https://doi.org/10.1016/j.jss.2021.111117
Under a Creative Commons license
open access

Highlights

  • A study of 20,278 links to quantify the prevalence of links to academic papers in GitHub repositories.

  • A mixed-methods study to identify open access (OA), traceability, and evolutionary aspects of such links.

  • Availability of an online appendix, which contains our qualitative coding results in this study.

Abstract

Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of open-source scientific software that implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the current practice of establishing and maintaining such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conduct a large-scale study of 20 thousand GitHub repositories that make references to academic papers. We use a mixed-methods approach to identify public access, traceability, and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are public access. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. We find that academic papers from top-tier SE venues are not likely to reference a repository, but when they do, they usually link to a GitHub software repository. In a network of arXiv papers and referenced repositories, we find that the most referenced papers are (i) highly cited in academia and (ii) referenced by repositories written in different programming languages.

Keywords

Software documentation
Open science
Open access
Traceability

Supatsara Wattanakriengkrai is a Ph.D. student in the Software Engineering Laboratory at the Department of Information Science, Nara Institute of Science and Technology (NAIST), Japan. She received a Master’s degree from NAIST in 2021. Her main research interests include Empirical Software Engineering, Mining Software Repositories, and Software Ecosystems.

Bodin Chinthanet is a specially appointed assistant professor at Nara Institute of Science and Technology, Japan. He received a Ph.D. from Nara Institute of Science and Technology in 2021. His research interests include empirical software engineering and mining software repositories. In particular, his research focuses on security vulnerabilities in software ecosystems and on how developers react to vulnerabilities in their software projects. His ultimate goal is to mitigate the risk of security vulnerabilities in software ecosystems. More information at https://bchinthanet.com/

Hideaki Hata is an associate professor at Shinshu University. His research interests include software ecosystems, human capital in software engineering, and software economics. He received a Ph.D. in information science from Osaka University. More about Hideaki and his work is available online at https://hideakihata.github.io/

Raula Gaikovina Kula is an assistant professor at the Nara Institute of Science and Technology (NAIST), Japan. He received his Ph.D. degree from NAIST in 2013 and was a Research Assistant Professor at Osaka University. He is active in the Software Engineering community, serving as a PC member for premier SE venues, as an organising committee member for some of them, and as a reviewer for journals. His current research interests include library dependencies and security in the software ecosystem, program analysis such as code clones, and human aspects such as code reviews and coding proficiency. Find him at https://raux.github.io/ and @augaiko on Twitter.

Christoph Treude is a Senior Lecturer in Software Engineering in the School of Computing and Information Systems at the University of Melbourne. The goal of his research is to improve the quality of software and the productivity of those producing it, with a particular focus on getting information to software developers when and where they need it. He has authored more than 100 scientific articles with more than 200 co-authors, and his work has received an ARC Discovery Early Career Research Award (2018-2020), industry funding from Google, Facebook, and DST, as well as four best paper awards including two ACM SIGSOFT Distinguished Paper Awards. He currently serves on the Editorial Board of the Empirical Software Engineering journal and was general co-chair for the 36th IEEE International Conference on Software Maintenance and Evolution.

Jin L.C. Guo received the Ph.D. degree in computer science and engineering from the University of Notre Dame. She is an assistant professor with the School of Computer Science, McGill University. Her research interests include software traceability, software maintenance and evolution, and human aspects of software engineering. Her current research focuses on utilizing Natural Language Processing techniques to construct connections within and across heterogeneous software engineering data.

Kenichi Matsumoto is a professor in the Graduate School of Science and Technology at the Nara Institute of Science and Technology. His research interests include software measurement and software processes. He received a Ph.D. in information and computer sciences from Osaka University. He is a Senior Member of the IEEE and a member of the IEICE and the IPSJ.

Editor: Bram Adams.