Ensuring the Correctness of Regular Expressions: A Review

Release Date: 2021-08-12

This paper reviews recent work on ensuring the correct usage of regular expressions. It classifies that work into categories: empirical study, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, the paper reviews the main results, compares the different approaches, and discusses their advantages and disadvantages. It also discusses some potential directions for future research.



Regular expressions are widely used within and even outside of computer science due to their expressiveness and flexibility. Their importance for constructing the scanners of compilers is well known. Their applications now extend to areas such as network protocol analysis, MySQL injection prevention, network intrusion detection, XML data specification, and database querying, as well as more diverse applications such as DNA sequence alignment. Regular expressions are commonly used in computer programs for pattern searching and string matching. They are a core component of almost all modern programming languages and frequently appear in software source code. Studies have shown that more than a third of JavaScript and Python projects contain at least one regular expression.


However, recent research has found that regular expressions are hard to understand, hard to compose, and error-prone. Their syntax is compact and rather tolerant, which makes even very short regular expressions hard to understand. For example, it is not easy for users to see immediately which strings the regular expression “([^([^]+)\]|([^([^]+)\)” specifies. The task becomes much more difficult for complex regular expressions that contain more than 100 characters or have more than ten levels of nesting. This is a real situation for software developers. For example, on the popular website stackoverflow.com, where developers learn and share their programming knowledge, more than 235,000 questions are tagged with “regex”.


Faulty regular expressions may cause failures in the applications that use them. Therefore, ensuring the correctness of regular expressions is a vital prerequisite for their use in practical applications. In fact, the importance of ensuring the correctness of regular expressions and other structural description models has already been recognized by some researchers. Klint et al. used the term “grammarware” to refer to all software that involves grammar knowledge in an essential manner. Here, grammar is meant in the sense of all structural descriptions or descriptions of structures used in software systems, including regular expressions, context-free grammars, etc. They noted that “In reality, grammarware is treated, to a large extent, in an ad-hoc manner with regard to design, implementation, transformation, recovery, testing, etc.” Take the testing of regular expressions as an example: a survey of professional developers reveals that developers test their regular expressions less than the rest of their code. Indeed, an empirical study shows that about 80% of the regular expressions used in practical projects are not tested, and among those that are tested, about half use only one test string, which is far from sufficient. Hence, sound and systematic methods and techniques are necessary to improve the quality of such software components.
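To illustrate why a single test string is far from sufficient, the following minimal Python sketch compares a faulty pattern with a corrected one. The patterns, the test strings, and the helper `failures` are all hypothetical and chosen only for illustration; they are not taken from the studies cited above. The faulty pattern is meant to accept decimal numbers but leaves the dot unescaped, so it accepts any character in that position.

```python
import re

# Hypothetical faulty pattern: intended as "digits, a literal dot, digits",
# but the unescaped "." matches any single character.
FAULTY = re.compile(r"\d+.\d+")

# Hypothetical corrected pattern with the dot escaped.
FIXED = re.compile(r"\d+\.\d+")

# Hypothetical test suite with both positive and negative test strings.
SHOULD_MATCH = ["3.14", "0.5", "100.001"]
SHOULD_REJECT = ["3x14", "314", ".5", "3.", "1.2.3"]


def failures(pattern):
    """Return the test strings on which the pattern disagrees with the expectation."""
    wrong = [s for s in SHOULD_MATCH if pattern.fullmatch(s) is None]
    wrong += [s for s in SHOULD_REJECT if pattern.fullmatch(s) is not None]
    return wrong


if __name__ == "__main__":
    # A single positive test string cannot tell the two patterns apart.
    print(FAULTY.fullmatch("3.14") is not None, FIXED.fullmatch("3.14") is not None)
    # The larger suite, including negative strings, exposes the fault.
    print("faulty pattern fails on:", failures(FAULTY))
    print("fixed pattern fails on:", failures(FIXED))
```

Both patterns accept the single string "3.14", so a test suite containing only that string would not reveal the fault. Only the negative test strings expose it, which suggests that test strings should cover rejected inputs as well as accepted ones.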


The importance and necessity of checking the correctness, and thus improving the quality, of regular expressions have attracted extensive attention from researchers and practitioners, especially in recent years. In this article, we review recent work on this issue. We classify the related work into categories: empirical study, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, we review the main results, compare the different approaches, and discuss their advantages and disadvantages.


The rest of this article is organized as follows: Section 2 introduces the preliminary knowledge on regular expressions, their different dialects, the meaning of correctness, and finite automata. Sections 3−8 review the relevant work on the correctness assurance of regular expressions, one category per section. Section 9 concludes with a summary and a discussion of future work.





Li-Xiao Zheng, Shuai Ma, Zu-Xi Chen, Xiang-Yu Luo





