DOI: 10.1145/2931037.2931073
research-article

Exploring regular expression usage and context in Python

Published: 18 July 2016

ABSTRACT

Due to the popularity and pervasive use of regular expressions, researchers have created tools to support their creation, validation, and use. However, little is known about the context in which regular expressions are used, the features that are most common, and how behaviorally similar regular expressions are to one another.

In this paper, we explore the context in which regular expressions are used through a combination of developer surveys and repository analysis. We survey 18 professional developers about their regular expression usage and pain points. Then, we analyze nearly 4,000 open source Python projects from GitHub and extract nearly 14,000 unique regular expression patterns. We map the most common features used in regular expressions to those features supported by four major regex research efforts from industry and academia: brics, Hampi, RE2, and Rex. Using similarity analysis of regular expressions across projects, we identify six common behavioral clusters that describe how regular expressions are often used in practice. This is the first rigorous examination of regex usage and it provides empirical evidence to support design decisions by regex tool builders. It also points to areas of needed future work, such as refactoring regular expressions to increase regex understandability and context-specific tool support for common regex usages.
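The repository-analysis steps described above (extracting patterns from Python source, tallying features, and comparing behavior) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the function names, the three feature detectors, and the sample-based agreement score are all hypothetical stand-ins, and the extractor only sees string-literal patterns passed directly to functions called on the name `re`.

```python
import ast
import re
from collections import Counter

# re-module functions whose first argument is a pattern.
RE_FUNCTIONS = {"compile", "search", "match", "fullmatch", "split",
                "findall", "finditer", "sub", "subn"}

# Crude, illustrative feature detectors (assumption: not the paper's feature set).
# They ignore context, e.g. a "*" inside a character class is still counted.
FEATURE_DETECTORS = {
    "kleene star": re.compile(r"(?<!\\)\*"),
    "capture group": re.compile(r"(?<!\\)\((?!\?)"),
    "character class": re.compile(r"(?<!\\)\["),
}


def extract_patterns(source):
    """Return string-literal patterns passed to re.<function>(...) calls."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.func.attr in RE_FUNCTIONS
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)
    return patterns


def count_features(patterns):
    """Count how many patterns contain each (hypothetical) feature."""
    counts = Counter()
    for pattern in patterns:
        for name, detector in FEATURE_DETECTORS.items():
            if detector.search(pattern):
                counts[name] += 1
    return counts


def behavioral_similarity(pattern_a, pattern_b, samples):
    """Fraction of sample strings on which both patterns agree (match or no match).

    Assumption: a simple stand-in score, not the similarity measure used in the paper.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    regex_a, regex_b = re.compile(pattern_a), re.compile(pattern_b)
    agree = sum((regex_a.search(s) is None) == (regex_b.search(s) is None)
                for s in samples)
    return agree / len(samples)
```

Running `extract_patterns` over every file in a project and de-duplicating the results with a `set` would give a per-project collection of unique patterns; patterns built dynamically (variables, concatenation, aliased imports) are missed by this sketch.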


    • Published in

      ISSTA 2016: Proceedings of the 25th International Symposium on Software Testing and Analysis
      July 2016
      452 pages
      ISBN: 9781450343909
      DOI: 10.1145/2931037

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

      Overall Acceptance Rate: 58 of 213 submissions, 27%
