research-article

Exploring regular expression usage and context in Python

Authors:
Carl Chapman, Iowa State University, USA
Kathryn T. Stolee, North Carolina State University, USA

ISSTA 2016: Proceedings of the 25th International Symposium on Software Testing and Analysis, July 2016, Pages 282–293. https://doi.org/10.1145/2931037.2931073

Published: 18 July 2016

ABSTRACT

Due to the popularity and pervasive use of regular expressions, researchers have created tools to support their creation, validation, and use. However, little is known about the context in which regular expressions are used, the features that are most common, and how behaviorally similar regular expressions are to one another.

In this paper, we explore the context in which regular expressions are used through a combination of developer surveys and repository analysis. We survey 18 professional developers about their regular expression usage and pain points. Then, we analyze nearly 4,000 open source Python projects from GitHub and extract nearly 14,000 unique regular expression patterns. We map the most common features used in regular expressions to those features supported by four major regex research efforts from industry and academia: brics, Hampi, RE2, and Rex. Using similarity analysis of regular expressions across projects, we identify six common behavioral clusters that describe how regular expressions are often used in practice. This is the first rigorous examination of regex usage and it provides empirical evidence to support design decisions by regex tool builders. It also points to areas of needed future work, such as refactoring regular expressions to increase regex understandability and context-specific tool support for common regex usages.
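The repository-analysis step described above, collecting regular expression patterns from Python source, can be illustrated with a minimal sketch. It is not the authors' tooling: it assumes patterns appear as string literals in the first argument of calls written as `re.<function>(...)`, and the names `RE_FUNCTIONS` and `extract_patterns` are hypothetical. Aliased imports, patterns built at run time, and flags are ignored.

```python
import ast
from typing import List

# Assumption: these are the `re` module functions whose first argument is a pattern.
RE_FUNCTIONS = {"compile", "search", "match", "fullmatch", "split",
                "findall", "finditer", "sub", "subn"}


def extract_patterns(source: str) -> List[str]:
    """Return string-literal patterns passed to re.<function>(...) calls."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only the literal `re.name(...)` call form is recognized here.
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "re"
                and func.attr in RE_FUNCTIONS
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)
    return patterns


# The sample is only parsed, never executed, so undefined names are harmless.
sample = r'''
import re
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
if re.search("^#", line):
    pass
words = re.split(r"\s+", text)
again = re.search("^#", other)
'''

# De-duplicate to obtain the unique patterns in this (hypothetical) file.
unique = sorted(set(extract_patterns(sample)))
print(unique)
```

Parsing the source with `ast` rather than searching it with text matching avoids counting patterns that only appear in comments or docstrings, at the cost of skipping files that do not parse under the interpreter running the analysis.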

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Software libraries and repositories

Published in
ISSTA 2016: Proceedings of the 25th International Symposium on Software Testing and Analysis
July 2016
452 pages
ISBN: 9781450343909
DOI: 10.1145/2931037
General Chair: Andreas Zeller, Saarland University, Germany
Program Chair: Abhik Roychoudhury, National University of Singapore, Singapore
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
developer survey
regular expressions
repository analysis
Acceptance Rates
Overall Acceptance Rate: 58 of 213 submissions, 27%