ABSTRACT
Software engineers often resort to code search practices to support software maintenance and evolution tasks, in particular code reuse. An issue that affects code search is the vocabulary mismatch problem: while searching for a particular function, users have to guess the exact words that were chosen by the original developers to name code entities. In this paper we present an automatic query expansion (AQE) approach that uses word relations to increase the chances of finding relevant code. The approach is applied on top of Test-Driven Code Search (TDCS), a promising code retrieval technique that uses test cases as inputs to formulate the search query, but it can also be used with other techniques that handle interface definitions to produce queries (interface-driven code search). Since these techniques rely on keywords and types, the vocabulary mismatch problem is also relevant to them. AQE is carried out by leveraging three resources: WordNet, a type thesaurus for expanding types, and another thesaurus containing only software-related word relations. Our approach is general but was specifically designed for non-native English speakers, who are frequently unaware of the most common terms used to name functions in software. Our evaluation with 36 non-native subjects, including developers and senior Computer Science students, provides evidence that our approach can improve the chances of finding relevant functions by 41% (recall improvement of 30%, on average), without hurting precision.
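As a rough illustration of the expansion idea described in the abstract, the sketch below takes an interface definition (method name, return type, parameter types), splits the method name into terms, and attaches related words and types to each one. It is a minimal sketch under stated assumptions, not the paper's implementation: the two toy thesauri, their entries, the function names, and the shape of the returned query are all hypothetical, and the real approach draws on WordNet, a type thesaurus, and a software-specific thesaurus.

```python
import re

# Hypothetical toy thesauri (assumptions for illustration only). The resources
# named in the abstract are far larger than these hand-written entries.
WORD_THESAURUS = {
    "remove": ["delete", "erase"],
    "find": ["search", "lookup"],
}
TYPE_THESAURUS = {
    "int": ["long", "Integer"],
    "String": ["CharSequence"],
}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]


def expand(terms, thesaurus):
    """Return one list per term: the term itself, then its related entries."""
    return [[t] + thesaurus.get(t, []) for t in terms]


def expand_query(method_name, return_type, param_types):
    """Build an expanded query from an interface definition (hypothetical shape)."""
    return {
        "name_terms": expand(split_identifier(method_name), WORD_THESAURUS),
        "return_type": [return_type] + TYPE_THESAURUS.get(return_type, []),
        "param_types": expand(param_types, TYPE_THESAURUS),
    }


if __name__ == "__main__":
    # A user who writes "removeDuplicates" can now also match code named
    # "deleteDuplicates" or "eraseDuplicates".
    print(expand_query("removeDuplicates", "int", ["String"]))
```

Keeping each original term first in its list lets a search backend weight the user's own wording above the added alternatives, which is one plausible way to widen recall while limiting the effect on precision.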
Index Terms
- Thesaurus-based automatic query expansion for interface-driven code search