This paper reviews recent work on ensuring the correct usage of regular expressions. It classifies this work into several categories: empirical studies, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, the paper reviews the main results, compares different approaches, and discusses their advantages and disadvantages. It also discusses potential directions for future research.
However, recent research has found that regular expressions are hard to understand, hard to compose, and error-prone. Indeed, regular expressions have a compact and rather tolerant syntax that makes even very short ones hard to understand. For example, it is not easy for users to grasp immediately what strings the regular expression “([^([^]+)\]|([^([^]+)\)” specifies. The task becomes much more difficult for complex regular expressions that contain more than 100 characters or have more than ten nesting levels. This is a real concern for software developers. For example, on the popular website stackoverflow.com, where developers learn and share their programming knowledge, more than 235,000 questions are tagged with “regex”.
Faulty regular expressions may cause failures in the applications that use them. Therefore, ensuring the correctness of regular expressions is a vital prerequisite for their use in practical applications. In fact, the importance of ensuring the correctness of regular expressions and other structural description models has already been recognized by some researchers. Klint et al. used the term “grammarware” to refer to all software that involves grammar knowledge in an essential manner. Here, grammar is meant in the sense of all structural descriptions, or descriptions of structures, used in software systems, including regular expressions, context-free grammars, etc. They noted that “In reality, grammarware is treated, to a large extent, in an ad-hoc manner with regard to design, implementation, transformation, recovery, testing, etc.” Take the testing of regular expressions as an example: a survey of professional developers reveals that developers test their regular expressions less than the rest of their code. Indeed, an empirical study shows that about 80% of the regular expressions used in practical projects are not tested, and among those that are tested, about half use only one test string, which is far from sufficient. Hence, sound and systematic methods and techniques are necessary to improve the quality of such software components.
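To illustrate why a single test string is rarely sufficient, the following minimal sketch checks a regular expression against both strings that should be accepted and strings that should be rejected. The file-name scenario, the two patterns, the test strings, and the helper `failing_cases` are all hypothetical and chosen only for illustration; they are not taken from the reviewed works.

```python
import re

# Hypothetical intent: accept file names such as "notes.txt".
FAULTY = re.compile(r"\w+.txt")   # fault: the unescaped "." matches any character
FIXED = re.compile(r"\w+\.txt")   # the dot is escaped and matches only "."

# Hypothetical test suite: strings to accept and strings to reject.
POSITIVE = ["notes.txt", "a_1.txt"]
NEGATIVE = ["notesXtxt", "notes.text", ".txt", ""]


def failing_cases(pattern, positive, negative):
    """Return the test strings on which the pattern disagrees with the intent."""
    missed = [s for s in positive if pattern.fullmatch(s) is None]
    wrongly_accepted = [s for s in negative if pattern.fullmatch(s) is not None]
    return missed + wrongly_accepted


if __name__ == "__main__":
    # A single positive test string does not reveal the fault.
    print(failing_cases(FAULTY, ["notes.txt"], []))
    # Adding negative test strings exposes it.
    print(failing_cases(FAULTY, POSITIVE, NEGATIVE))
    print(failing_cases(FIXED, POSITIVE, NEGATIVE))
```

In this sketch the faulty pattern passes the one-string test and is caught only when a string that should be rejected is included, which is the kind of gap that systematic test string generation aims to close.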
The importance and necessity of checking the correctness of regular expressions, and thereby improving their quality, have attracted extensive attention from researchers and practitioners, especially in recent years. In this article, we review recent work on this issue. We classify the related work into several categories: empirical studies, test string generation, automatic synthesis and learning, static checking and verification, visual representation and explanation, and repairing. For each category, we review the main results, compare different approaches, and discuss their advantages and disadvantages.
The rest of this article is organized as follows. Section 2 introduces preliminary knowledge on regular expressions, their different dialects, the meaning of correctness, and finite automata. Sections 3−8 review, category by category, the relevant works on the correctness assurance of regular expressions. Section 9 concludes with a summary and a discussion of future work.
Download full text:
Ensuring the Correctness of Regular Expressions: A Review
Li-Xiao Zheng, Shuai Ma, Zu-Xi Chen, Xiang-Yu Luo