ABSTRACT
Due to the popularity and pervasive use of regular expressions, researchers have created tools to support their creation, validation, and use. However, little is known about the context in which regular expressions are used, the features that are most common, and how behaviorally similar regular expressions are to one another.
In this paper, we explore the context in which regular expressions are used through a combination of developer surveys and repository analysis. We survey 18 professional developers about their regular expression usage and pain points. Then, we analyze nearly 4,000 open source Python projects from GitHub and extract nearly 14,000 unique regular expression patterns. We map the most common features used in regular expressions to those features supported by four major regex research efforts from industry and academia: brics, Hampi, RE2, and Rex. Using similarity analysis of regular expressions across projects, we identify six common behavioral clusters that describe how regular expressions are often used in practice. This is the first rigorous examination of regex usage and it provides empirical evidence to support design decisions by regex tool builders. It also points to areas of needed future work, such as refactoring regular expressions to increase regex understandability and context-specific tool support for common regex usages.
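The repository-analysis step described above can be illustrated with a minimal sketch. It assumes patterns are collected statically from string literals passed to functions of Python's standard `re` module. The helper name `extract_patterns` and the set `RE_FUNCTIONS` are hypothetical, and this is not necessarily how the authors' tooling works.

```python
import ast
from typing import List

# Functions in Python's standard `re` module whose first argument is a pattern.
RE_FUNCTIONS = {"compile", "search", "match", "fullmatch", "split",
                "findall", "finditer", "sub", "subn"}


def extract_patterns(source: str) -> List[str]:
    """Return string-literal patterns passed to re.<function>(...) calls.

    Hypothetical helper for illustration: it only sees calls written as
    `re.<name>(...)`, so aliased imports and patterns built at run time
    are skipped.
    """
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_re_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "re"
            and func.attr in RE_FUNCTIONS
        )
        if not is_re_call or not node.args:
            continue
        first = node.args[0]
        # Only literal patterns can be recovered without running the code.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            patterns.append(first.value)
    return patterns
```

Running such a helper over every file in a project and de-duplicating the results would give a set of unique patterns per project, which is the kind of corpus that the feature and similarity analyses would then operate on.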
Index Terms
- Exploring regular expression usage and context in Python