Enhancing Language Model with Unit Test Techniques for Efficient Regular Expression Generation



Introduction
Regular expressions are an essential tool for processing text in an efficient, flexible, and powerful manner (Friedl, 2006). Consider, for instance, someone whose work involves reviewing the language used in an application to prevent the display of violent or pornographic content to underage users. Manually checking each line is time-consuming, so regular expressions can greatly streamline the process. Nevertheless, writing and debugging regular expressions can be daunting for those without expertise, as the syntax is often obscure and unintuitive (Karttunen et al., 1996).
The use of natural language to generate regular expressions has been explored in several works to bridge the gap for the public in utilizing regular expressions. For instance, Ranta (Ranta, 1998) developed a rule-based system that generates regular expressions from template input. Subsequently, Locascio et al. (Locascio et al., 2016) proposed LSTM-based sequence-to-sequence models that generate regular expressions from contextual inputs. Furthermore, with the advancement of large language models, researchers have found that performance can be improved by applying Supervised Fine-tuning (SFT) (Ouyang et al., 2022) to Large Language Models (LLMs). Nevertheless, regular expressions generated by these models often fail to compile and inadequately capture the intended functionality of the input requirements, which is a critical aspect in practical applications. To address this, researchers have explored the use of semantic correctness (Park et al., 2019) as a criterion. However, adopting such a method does not completely resolve the aforementioned issues. We posit that disregarding the functional significance of the input specification is a major contributing factor.

Figure 1: Pipeline of our work. The whole pipeline consists of three steps: the first step generates a prompt from the original context; SFT is then performed with the prompt generated in the first step; finally, Unit-Test Driven Reinforcement Learning is applied.
Therefore, this paper emphasizes the importance of functional correctness. To enhance the functional correctness of a generated regular expression, it is important to consider the practical context in which it will be used. Assessing its practical applicability generally requires conducting a "unit test". Specifically, if the generated regular expression accurately extracts the desired results from a given sequence of inputs, it can be considered to meet the user's functional requirements. Therefore, in this paper we propose Unit-Test Driven Reinforcement Learning (UTD-RL). This approach uses policy gradient techniques (Sutton et al., 1999) to learn from the feedback provided by unit test results, enabling the model to adjust its pattern generation to better align with the intended functionality. As a result, it shows promise in improving the effectiveness of regular expression generation in practical applications. Experimental results demonstrate that regular expressions generated by this method adhere better to the input requirements, yielding a significant improvement in unit test performance.
As mentioned earlier, we consider functional correctness the most crucial factor in this task. However, the previous evaluation method, which tests equivalence by converting each regular expression to a minimal deterministic finite automaton (DFA) and exploiting the fact that semantically equivalent regular expressions share the same minimal DFA, is inadequate for assessing the functional correctness of a generated regular expression with respect to the input requirements. Therefore, in this paper we propose adopting the "unit test" as an additional evaluation method alongside DFA equivalence.
To sum up, our contributions are:
1. We propose the UTD-RL approach, which uses the outcomes of unit tests to enhance the functional correctness of generated regular expressions in alignment with input specifications.
2. We propose unit tests for evaluation, as they better reflect the degree to which the input requirements are fulfilled.
3. We conduct several experiments to validate the efficacy of the UTD-RL approach.

Related Work
Recent research has focused on automating the generation of regular expressions from natural language, using both non-deep-learning and deep-learning approaches. Early work highlighted that regular expressions can be encoded as finite state networks (Karttunen et al., 1996). Ranta (Ranta, 1998) capitalized on this property and developed a rule-based technique for converting formatted language specifications into regular expressions. Subsequently, Locascio et al. (Locascio et al., 2016) introduced an LSTM-based sequence-to-sequence model (Deep Regex) that translates contextual information into regular expressions using a syntax-based objective, maximum likelihood estimation (MLE). Zhong and Bhatia (Zhong et al., 2018) improved performance by employing policy gradient techniques (Sutton et al., 1999) to train the model with a semantics-based objective. Similarly, Park et al. (Park et al., 2019) used semantic correctness as the reinforcement learning reward. However, experiments on these models revealed significant overfitting on public datasets, resulting in limited generalizability to other input requirements. We speculate that LSTMs lack the capacity for induction and deduction possessed by today's advanced large language models.
Recently, large language models (LLMs) trained on extensive text corpora from diverse domains have exhibited the capability to perform zero-shot tasks, including code generation. This zero-shot ability emerges once models reach an adequate scale (Brown et al., 2020). Researchers who take pre-trained LLMs and fine-tune them on pertinent datasets have achieved remarkable results. For example, Codex (Chen et al., 2021), fine-tuned from GPT-3 (Brown et al., 2020), outperforms prior state-of-the-art models on code generation. Copilot, a widely used code suggestion tool in the GitHub community, employs Codex as its foundational model. Furthermore, CodeGeeX (Zheng et al., 2023), a multilingual code generation model with 13 billion parameters, attains the highest average performance on publicly available datasets.

Language Model
We conducted experiments on large language models such as LLaMA, GPT-3, and text-davinci-003 to evaluate their performance on public regular expression problems. The results demonstrate that they can generate regular expressions, although their performance is not on par with prior work on public datasets. This finding is significant because these models are pretrained on a vast corpus rather than designed specifically for regular expression generation. Consequently, fine-tuning these language models for the task of regular expression generation is essential to improve their effectiveness. Ensuring functional correctness is a critical aspect of regular expressions. In practical applications, validating the correctness of a regular expression usually involves a unit test: if all intended patterns are successfully extracted from the test cases and every extracted pattern matches the desired one, the regular expression is considered valid. Unfortunately, previous research applying SFT to language models overlooked this aspect. As a solution, we propose the policy gradient method (Sutton et al., 1999), which optimizes parameterized policies through gradient descent based on the expected return (reward), to convert functional correctness into a differentiable training signal.
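A minimal sketch of what such a unit test looks like in practice (the pattern and test strings here are illustrative, not from the paper's dataset):

```python
import re

def passes_unit_test(pattern: str, test_cases: list[tuple[str, list[str]]]) -> bool:
    """Check whether `pattern` extracts exactly the desired matches from
    every test string. A pattern that fails to compile fails the test."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return False
    return all(compiled.findall(text) == expected
               for text, expected in test_cases)

# Illustrative test cases: extract non-zero positive integers.
cases = [
    ("order 42 shipped in 7 days", ["42", "7"]),
    ("no digits here", []),
]
print(passes_unit_test(r"[1-9][0-9]*", cases))  # True: all extractions match
```

A pattern such as `\d+` would fail the same test, since it also extracts leading zeros and the digit `0`.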

Unit-Test Driven Reinforcement Learning
Our approach aims to improve the functional correctness of the model by highlighting the unique functionality of regular expressions and encouraging the production of functionally correct ones, especially in challenging scenarios where generation fails to compile. The reinforcement phase helps the model learn to generate regular expressions that are both semantically and functionally correct, leading to improved performance on unit tests. Specifically, for a given problem context C_i, a desired ground truth regular expression R_i, and several valid test cases T_i, we want to maximize the expected reward r(y, R_i, T_i) for every regular expression y generated by the language model p_θ, i.e., to improve the ratio of generated regular expressions y that pass the unit test.
During training, it is still desirable for the regular expressions generated by the model to deviate minimally from the ground truth annotations. Therefore, we incorporate the supervised loss with respect to the ground truth regular expressions into the final objective function to mitigate this disparity.
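One plausible form of the combined objective, reconstructed from this description (β, γ, and D are defined in the following paragraph; L_SFT denotes the usual token-level supervised loss against the ground truth regex — the paper's exact formulation may differ):

```latex
\max_{\theta} \;
\mathbb{E}_{(C_i, R_i, T_i) \sim D}\,
\mathbb{E}_{y \sim p_{\theta}(\cdot \mid C_i)}
\big[\, \beta \, r(y, R_i, T_i) \,\big]
\;-\; \gamma \, \mathcal{L}_{\mathrm{SFT}}(\theta)
```

With γ = 0 the gradient depends only on the unit-test reward, consistent with the remark below.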
In this context, D is a set of regular expression problems. The reward coefficient β and the supervised loss coefficient γ control the relative importance of the reward and the supervised loss. Setting γ to 0 makes the gradient depend solely on the functional correctness of the generated regular expression.
Measurement of Functional Correctness. Since we use the policy gradient method (Sutton et al., 1999) to transform functional correctness into a differentiable signal, we must define a criterion for evaluating it. In practical terms, a regular expression is considered valid if it successfully extracts the desired string pattern from a provided set of inputs. This concept is similar to the pass@k metric used in code evaluation (Chen et al., 2021). To accomplish this, we employ dedicated unit tests designed for regular expressions, illustrated in Figure 2. The pseudocode in Algorithm 1 describes the reward function: if a generated regular expression passes the current test case t_j, a positive value is added to the reward; otherwise, a negative value is added.

Test Case Generation. Generating appropriate test cases is a crucial aspect of unit testing. Although manual generation is possible, it is often unnecessary thanks to automated tools such as rstr, which can generate test cases from a given regular expression. For thorough testing, it is essential to include both positive test cases {t_i^+}, which match the regular expression pattern, and negative test cases {t_i^-}, which produce no matches. Accordingly, we define the set of test cases as T_i = {t_1^+, t_2^+, ..., t_1^-, t_2^-, ...}, with positive cases generated using rstr and negative cases randomly sampled from pre-generated test case pools.
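A sketch of the reward function described by Algorithm 1, under the assumption that each passed test case contributes +1 and each failed one -1 (the paper does not specify the exact magnitudes; the test cases below are illustrative):

```python
import re

def reward(pattern: str, test_cases: list[tuple[str, bool]]) -> float:
    """Reward a generated regex: +1 per passed test case, -1 per failed one.
    Each test case is (string, is_positive): positive cases must be fully
    matched by the pattern, negative cases must not be."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        # A pattern that fails to compile fails every test case.
        return -float(len(test_cases))
    score = 0.0
    for text, is_positive in test_cases:
        matched = compiled.fullmatch(text) is not None
        score += 1.0 if matched == is_positive else -1.0
    return score

# 9 positive and 1 negative case, mirroring the ratio used in Section 4.
cases = [("7", True), ("42", True), ("105", True), ("9", True),
         ("13", True), ("88", True), ("3", True), ("650", True),
         ("21", True), ("0", False)]
print(reward(r"[1-9][0-9]*", cases))  # 10.0: all ten cases pass
```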

Evaluation
DFA Equivalence. We assessed the effectiveness of our approach by testing generated regular expressions with DFA equivalence, a method that converts a given regular expression into a minimal DFA. As noted by Karttunen et al. (Karttunen et al., 1996), regular expressions can be represented by finite state networks. This approach is grounded in the fact that two equivalent regular expressions have identical minimal DFAs, irrespective of their structural dissimilarities (Hopcroft et al., 2001).
However, DFA equivalence falls short when dealing with large and complex regular expressions. While it converts a regular expression into a deterministic finite automaton, its focus is strict language equivalence between the generated regular expression and the single reference solution, and functionally equivalent regular expressions may take different forms. For example, the regular expressions in Table 1 both capture non-zero positive integers, yet DFA equivalence fails to identify them as representing the same input specification. This limitation is especially significant in complex real-world scenarios where different experts may write distinct regular expressions for the same specification.
Unit Test.In Section 3.2, we introduced the use of unit test to capture functional correctness during the reinforcement learning process.At the evaluation stage, this technique can be employed to assess the functional correctness of the generated regular expression.For better clarity, we have created a dedicated test case pool for each regular expression problem, as depicted in Figure 2. The problem is considered solved only if the generated regular expression passes all the test cases.Therefore we can define the metric as the number of solved regular expression problems out of the total numbers.
Experimental Setup

In this section, we evaluate our work on different pre-trained language models to verify its effectiveness. Additionally, we conduct test case analysis and present case studies to provide further insights.

Model Configuration
We conducted experiments to evaluate the effectiveness of UTD-RL on large language models: GPT-3 (Brown et al., 2020) and LLaMA (Touvron et al., 2023). The pretrained GPT-3 models were provided by ModelScope, a platform developed by the Alibaba DAMO team. The pretrained LLaMA weights can be found on Hugging Face.

Reinforcement Learning Setup
We performed a hyper-parameter search to determine the best values: β and γ were set to 0.01 and 1.0, respectively. The number of test cases was set to 10: 9 positive cases and 1 negative case.

Dataset
Our experiments are conducted on the following datasets. NL-RX-ST: to avoid data leakage, the train/test split is based on the target regular expression. In addition, to test generalizability on public regular expression problems, we manually collected 100 regular expression problems from public resources including, but not limited to, GitHub, Wikipedia, and Stack Overflow. Note that this dataset is used only for testing.

Results and Analysis
We demonstrate the effectiveness of our approach by comparing it to existing approaches, including Deep Regex (Locascio et al., 2016) and SoftRegex (Park et al., 2019). Moreover, we fine-tune text-davinci-003 (via the SFT API provided by OpenAI) on the same data. We also conduct ablation experiments to compare the results obtained from different language models with and without UTD-RL.
Baseline Comparison. It is widely acknowledged that scaling up language models, e.g., increasing training compute and model parameters, can significantly improve performance and sample efficiency across various downstream NLP tasks (Wei et al., 2022). text-davinci-003, one of the current state-of-the-art large language models provided by OpenAI, shows promising performance across all datasets and even demonstrates some ability to generalize to unseen problems. However, the model treats the problem as a black box, leveraging only the syntactic similarity of regular expressions. Therefore, by better exploiting the inherent functionality of regular expressions, we can further enhance the model's effectiveness, as the subsequent ablation studies confirm.
In addition, the use of UTD-RL effectively improves the model's generalization to other regular expression problems.

Practical application
In our context, the app hosts numerous registered merchants. In compliance with market regulatory requirements, these merchants must undergo internal compliance reviews before publishing new advertisement landing pages or text content, to ensure the content contains no non-compliant elements.
Given the large number of merchants involved and the complexity of the rules, the conventional approach relied heavily on manually creating regular expressions to identify non-compliant text. For instance, one requirement for advertisement landing pages was the exclusion of promotional expressions. Unfortunately, this approach incurred significant time and labor costs for developing and testing regular expressions. We therefore introduced a new solution: an automated workflow built on the large language model trained with UTD-RL.
Specifically, this language model generates production-ready regular expressions and automatically conducts unit tests, enabling an automated workflow that greatly facilitates the public's use of regular expressions. The process is depicted in Figure 4.
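The generate-test-regenerate loop of this workflow can be sketched as follows. Here `generate_regex` is a hypothetical stand-in for the language model call, and the threshold and retry count are illustrative:

```python
import re

def run_tests(pattern, test_cases):
    """Return the failed test cases: positive cases the pattern misses,
    and negative cases it wrongly matches."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return list(test_cases)  # uncompilable: everything fails
    return [(text, pos) for text, pos in test_cases
            if (compiled.fullmatch(text) is not None) != pos]

def generate_valid_regex(generate_regex, request, test_cases,
                         threshold=1.0, max_rounds=3):
    """Generate a regex, unit-test it, and on failure regenerate with the
    failed cases appended to the prompt; give up after max_rounds."""
    prompt = request
    for _ in range(max_rounds):
        pattern = generate_regex(prompt)
        failed = run_tests(pattern, test_cases)
        if 1 - len(failed) / len(test_cases) >= threshold:
            return pattern
        prompt = f"{request}\nFailed cases: {failed}"
    return None
```

On failure, only the failed cases are fed back, so each regeneration round conditions on exactly the behavior that still needs fixing.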

Conclusion
In conclusion, ensuring the functional correctness of regular expressions is crucial in practical applications. This paper proposes UTD-RL, which uses the outcomes of unit tests as rewards for the model, thereby enhancing functional correctness. Furthermore, unit tests are employed to assess the functional correctness of the generated regular expressions. This paper focuses solely on regular expression generation; however, we believe the approach can be extended to generate any corpus that requires functional specifications (e.g., Python code generation, SQL generation). Future research will investigate the applicability of this method in these domains, and we encourage interested researchers to experiment with this approach.

Figure 2: Unit test. Unit tests are conducted on both the generated regular expression and the target regular expression. If the extracted outcomes are the same, the test case is considered passed; otherwise, it fails.

Figure 4: Pipeline for generating a valid regular expression in a practical application. The language model generates a regular expression based on the user's request. A unit test is then run to assess the validity of the regular expression. If the unit test outcome exceeds the threshold, the regular expression is considered valid; otherwise, the input prompt is concatenated with the failed cases to regenerate the regular expression.

Table 1: Example regular expressions for a common problem found on Stack Overflow: match a non-zero positive integer.