Smart “Chef”: Verifying the Effect of Role-based Paraphrasing for Aspect Term Extraction



Introduction
ATE is a natural language processing task which aims to extract aspect terms from sentences (Jakob and Gurevych, 2010). An aspect term is a word, phrase or named entity depicting a certain domain-specific attribute. For example, the text span "al di la" in (1) is specified as a restaurant-domain aspect term because it appears as the sign of an Italian trattoria.

(1) Love al di la.

(2) We take pride in every dish we serve at al di la (rewritten by ChatGPT with a role of "chef").
Current studies leverage PLMs as backbones to construct ATE models (extractors), including BERT (Devlin et al., 2019), BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), as mentioned in Section 2. They yield significant improvements compared to conventional neural networks such as CNN (LeCun et al., 1998). The advantage is primarily attributed to their strong ability to perceive noteworthy contexts, as well as to learn context-aware representations.
However, such extractors frequently suffer from uninformative or low-quality contexts. For example, the context "Love" in (1) is uninformative for recognizing "al di la". By contrast, a substitution containing a knowledge-rich context makes it easier to recognize aspect terms, as in (2). Accordingly, we propose a ChatGPT-based Edition Fictionalization (CHEF) method to assist current PLM-based extractors. CHEF acts as a domain-specific virtual expert with different roles to rewrite sentences, with the aim of refining the contexts of potential aspect terms. ChatGPT is utilized for both expert generation and sentence rewriting. A series of post-processing methods are coupled with CHEF, including redundancy elimination, synonym replacement and multi-role voting.
We experiment on the benchmark datasets L14 and R14-16 (Pontiki et al., 2014, 2015, 2016). The well-trained BERT-base and PST (Wang et al., 2021) (SoTA) extractors are adopted in the experiments. At test time, they are applied to the sentences rewritten by CHEF, without retraining or fine-tuning. The test results demonstrate the effectiveness of CHEF in recalling aspects and expanding the predictions.

Related Work
Context-aware encoding contributes to ATE. It brings domain-specific contextual information into token-level representations. Therefore, CNN (LeCun et al., 1998; Karimi et al., 2021) and BERT (Devlin et al., 2019; Karimi et al., 2021; Klein et al., 2022) are widely used for ATE due to their abilities to convolve local contexts or absorb attentive contextual information. Their expanded versions DECNN (Xu et al., 2018; Wei et al., 2020; Li et al., 2020; Chen and Qian, 2020) and BERT-PT (Xu et al., 2019; Wang et al., 2021; Chen et al., 2022b) are generally adopted as backbones in subsequent studies. In addition, BERT is utilized as the pedestal for context-aware encoding in a series of more complex tasks, including Aspect-Sentiment Triplet Extraction (ASTE) (Chen et al., 2022a; Zhang et al., 2022b; Chen et al., 2022c; Zhang et al., 2022a; Chen et al., 2022d; Zhao et al., 2022b) and MRC-based ASTE (Yang and Zhao, 2022; Zhai et al., 2022). Recently, the generative framework has been introduced into ASTE studies, and accordingly BART (Yan et al., 2021; Zhao et al., 2022a) and T5 (Zhang et al., 2021; Hu et al., 2022) are used. They are constructed with the transformer encoder-decoder architecture in the seq2seq paradigm, where attentive contextual information absorption is conducted on both sides.

Approach
We aim to provide knowledge-rich and high-quality sentences for ATE. Specifically, we use CHEF to rewrite sentences, and feed the rewritten sentences into an extractor to predict aspect terms. Finally, we combine the aspect terms respectively extracted from the original and rewritten sentences.

Extractors
We follow the common practice (Chernyshevich, 2014; Toh and Wang, 2014; San Vicente et al., 2015) of treating ATE as a sequence labeling task. B/I/O labels are used, which respectively signal Beginning, Inside and Outside tokens relative to aspect terms. An extractor therefore essentially classifies each token into one of the B/I/O labels based on its token-level hidden state. We use a PLM to compute the hidden states of tokens, and a fully-connected (FC) layer for classification.
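As an illustration, the decoding side of this labeling scheme can be sketched as follows. This is a minimal sketch, not the paper's released code: `bio_to_spans` is a hypothetical helper name, and real extractors operate on subword tokens rather than whitespace tokens.

```python
def bio_to_spans(tokens, labels):
    """Decode token-level B/I/O labels into aspect-term strings."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":                  # a new aspect term begins
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:    # continue the open aspect term
            current.append(tok)
        else:                           # "O" (or a stray "I") closes it
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```

For example, the label sequence O B I I over "Love al di la" yields the single aspect term "al di la".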
We consider two PLMs for hidden-state computation in the experiments: BERT-base and BERT-pt-based PST (Wang et al., 2021).

CHEF
CHEF comprises two stages: role generation of domain-specific virtual experts, and role-based rewriting. It is coupled with three post-processors: redundancy elimination, synonym replacement and multi-role voting.
Role Generation-CHEF induces ChatGPT to generate a series of virtual experts playing different roles. The generation is prompted by the target-domain name D_i, such as "Restaurant". The query we use is as follows: "Output the roles of experts in the domain of [D_i] according to the different responsibilities.". Table 1 shows the experts.
Sentence Rewriting-Given a sentence S_k in the ATE datasets, CHEF induces ChatGPT to rewrite the sentence from the perspective of a role-specific expert R_j. Zero-shot prompting (Wei et al., 2022) is used during generation; in other words, no example is provided to prompt ChatGPT. The query we use is as follows: "Rewrite the sentence [S_k] from the perspective of [R_j] in the domain of [D_i].". We rewrite all the instances in the test sets using Algorithm 1.
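The two queries above can be instantiated with simple string templates. The names `ROLE_QUERY`, `REWRITE_QUERY` and `build_rewrite_queries` below are illustrative, not part of any released code; the filled-in templates would then be sent to ChatGPT, one zero-shot query per expert role.

```python
# Templates mirroring the two prompts described in the text; the
# placeholders correspond to the domain name D_i, sentence S_k and role R_j.
ROLE_QUERY = ("Output the roles of experts in the domain of [{domain}] "
              "according to the different responsibilities.")
REWRITE_QUERY = ("Rewrite the sentence [{sentence}] from the perspective "
                 "of [{role}] in the domain of [{domain}].")

def build_rewrite_queries(sentence, roles, domain):
    """One zero-shot rewriting query per expert role (no in-context examples)."""
    return [REWRITE_QUERY.format(sentence=sentence, role=r, domain=domain)
            for r in roles]
```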
Post-Processing-We drive the extractors to predict aspect terms over the role-specific rewritten sentences. Redundant results may be obtained, which are specified as aspect terms that never occur in the original sentences. For example, although the token "dish" in the rewritten sentence in (2) is correctly predicted as an aspect term, it is redundant due to its non-occurrence in the original sentence in (1). We filter out the redundant results during test.
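A minimal sketch of this filtering step is given below. It assumes a simple case-insensitive substring check; the paper does not specify the exact matching criterion, and the helper name is illustrative.

```python
def eliminate_redundancy(predicted_terms, original_sentence):
    """Discard predicted aspect terms that never occur in the original sentence."""
    original = original_sentence.lower()
    return [t for t in predicted_terms if t.lower() in original]
```

For instance, "dish" predicted from the rewritten sentence in (2) is dropped because it does not appear in the original sentence in (1).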
In addition, we use a soft synonym replacement method to reduce false positive rates. Specifically, given an original sentence S_i and its rewritten counterpart S̃_i, we segment both into n-grams. Assuming a set U_i of n-grams (1 ≤ n ≤ 5) in S_i shares a part of the predicted aspect term with the set Ũ_i of n-grams in S̃_i, we calculate the similarity between each gram u_ij in U_i and every gram ũ_ik in Ũ_i. We rank all pairs {u_ij, ũ_ik} by their similarities, and select the top-1 ranked n-gram pair for synonym replacement, i.e., ũ_ik ⇐ u_ij. Meanwhile, the replaced n-gram is specified as the unabridged aspect term, as shown in the example in (3). Note that the cosine similarity is computed over the embeddings of each pair {u_ij, ũ_ik}. The embedding of each n-gram is obtained by mean pooling (Reimers and Gurevych, 2019) over the token-level hidden states.
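The ranking step can be sketched as follows. Here `embed` stands in for the mean-pooled token-level hidden states described above (the test uses toy 2-d vectors), and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math

def ngrams(tokens, max_n=5):
    """All n-grams (1 <= n <= max_n) of a token sequence, as strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top1_replacement(orig_grams, rew_grams, embed):
    """Rank all (u_ij, u~_ik) n-gram pairs by cosine similarity and return
    the best pair; the rewritten gram is then replaced by the original one."""
    best = max(((cosine(embed(u), embed(r)), u, r)
                for u in orig_grams for r in rew_grams),
               key=lambda t: t[0])
    return best[1], best[2]
```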
(3) Original: Best Indian food I have ever eaten.
Rewritten: We put a lot of effort into perfecting our Indian dishes [Aspect Term]. (Rewritten by CHEF with a role of "chef")
Replacement: Indian dishes ⇐ Indian food
Output: Indian food

Multi-role Voting-We conduct multi-role voting only if an extractor obtains controversial results from the original and rewritten sentences; it facilitates the combination of the extraction results. Assume an extractor refuses to extract a text span t_i as an aspect term from the original sentence, though it would do so from the sentences rewritten by different roles of experts; we then regard t_i as a controversial result. In this case, we define the behavior of extracting t_i as a vote for acceptance, and otherwise rejection. On this basis, we compute the acceptance rate v_i over the rewritten sentences of all experts: v_i = N_c / N_all, where N_c denotes the number of votes for acceptance, while N_all is the number of experts. N_all is set to 10 in our experiments. For the Restaurant domain, a controversial result t_i is finally adopted only if v_i is no less than a threshold of 0.7; for the Laptop domain, the threshold is set to 0.8.
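The voting rule can be sketched as below; `adopt_controversial` is a hypothetical helper name illustrating the acceptance rate v_i = N_c / N_all and the domain-specific thresholds.

```python
def adopt_controversial(votes, threshold):
    """votes: one boolean per expert role (True means that role's rewritten
    sentence led the extractor to extract the controversial span t_i).
    Returns whether t_i is finally adopted."""
    v = sum(votes) / len(votes)   # acceptance rate v_i = N_c / N_all
    return v >= threshold
```

With N_all = 10 experts and 7 votes for acceptance, a span is adopted under the Restaurant threshold (0.7) but rejected under the Laptop threshold (0.8).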

Datasets and Evaluation
We experiment on the SemEval datasets L14 and R14-16 (Pontiki et al., 2014, 2015, 2016). All the instances in L14 are drawn from the Laptop domain, while those in R14-16 derive from the Restaurant domain. We follow the common practice (Dai and Song, 2019; Wang et al., 2021) of splitting the datasets into training, validation and test sets. Their statistics are shown in Table 2.
It is noteworthy that only the extractors are trained and developed using the above datasets. CHEF is not involved in training or development; it operates only on the test sets, providing rewritten sentences for the extractors and post-processing the predictions. We evaluate all the models using the F1-score (Chernyshevich, 2014).
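For reference, exact-match span-level F1 can be computed as follows. This follows the common convention for ATE evaluation; the paper does not spell out its exact scorer, and the helper name is illustrative.

```python
def f1_score(gold_terms, pred_terms):
    """Exact-match span-level F1 over sets of gold and predicted aspect terms."""
    gold, pred = set(gold_terms), set(pred_terms)
    tp = len(gold & pred)                       # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```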

Hyperparameters
We respectively use BERT-base and BERT-pt-based PST (Wang et al., 2021) to construct the extractors, where PST has achieved the best performance so far.
For BERT-base, we set the maximum sequence length to 128 and train for 4 epochs. Model parameters are optimized using AdamW with a learning rate of 3e-5 and a weight decay of 0.01. We set the batch size to 10. For PST, we adopt its initial hyperparameters, setting the first round of training to 5 epochs and performing 4 rounds of self-training. The learning rate is set to 5e-5, and AdamW is used as the optimizer. All other hyperparameters remain consistent with the reported ones.

Comparison Result
In Table 3, we report the performance of PST enhanced by CHEF, along with other state-of-the-art ATE models. In this case, the extraction results are obtained by combining the predictions of PST on both original and rewritten sentences, where multi-role voting is used. It can be found that CHEF enables PST to achieve better performance, slightly widening the performance gap relative to other strong competitors.
In Table 4, we report the effects of CHEF on both the BERT-base and BERT-pt-based PST extractors, where multi-role voting is used for prediction combination. It can be observed that CHEF enables both extractors to achieve better performance on all the test sets, without retraining and fine-tuning.

Discussion
In two separate experiments, we demonstrate that CHEF contributes to the salvage of missed aspect terms, and yields more substantial improvements on short or long sentences.
Salvage Rate-We select all the incompletely-solved sentences from the test sets, each of which contains at least one aspect term neglected by the extractor. On this basis, we use CHEF to rewrite the sentences and drive the extractor to rework them. In this experiment, we consider the BERT-base extractor and verify the changes in recall rates. Figure 1 shows the experimental results, where only the performances yielded by the best and worst experts are provided. It can be observed that CHEF substantially improves the recall rates on all test sets, no matter whether it plays the role of the best expert or the worst. The most significant improvement in recall rate reaches about 9%. Besides, the changes in precision, recall and F1-score on the full test sets can be found in Appendix B.
Adaptability-CHEF applies better to short and long sentences. The former generally contain uninformative contexts, while the latter contain lower-quality contexts due to noise. We split each test set into subsets according to sentence length. Five subsets are obtained for each, covering sentences with lengths in the ranges [1,10], [11,15], [16,20], [21,30], and greater than 30. The statistics of the subsets are shown in Table 5. Figure 2 shows the ATE performance on original sentences of different lengths, and on the corresponding sentences rewritten by CHEF (without multi-role prediction combination). It can be observed that CHEF only yields improvements for relatively short or long sentences. The most significant improvement reaches about 4% F1-score on R16.
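The length bucketing can be sketched as follows. It assumes whitespace tokenization and a disjoint last bucket of length ≥ 31; both are assumptions for illustration, and the names are hypothetical.

```python
# The five length ranges used in the adaptability analysis; the last
# bucket is open-ended (length >= 31).
BUCKETS = [(1, 10), (11, 15), (16, 20), (21, 30), (31, float("inf"))]

def bucket_of(sentence):
    """Index of the length bucket a whitespace-tokenized sentence falls into."""
    n = len(sentence.split())
    for i, (lo, hi) in enumerate(BUCKETS):
        if lo <= n <= hi:
            return i
    raise ValueError("empty sentence")
```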

Ablation Study
CHEF consists of four components: Combination (Comb), Redundancy Elimination (RE), Synonym Replacement (SR) and Multi-role Voting, as presented in Section 3. To verify the effects of these components, we conduct an ablation experiment. Figure 3 illustrates the verification results over the best and worst experts, where ALL indicates the complete CHEF method. It can be observed that the simple combination (Comb) causes significant performance degradation, although many terms the baseline missed are salvaged. When redundancy elimination (RE) is used, performance comparable to (slightly worse than) the baseline is achieved. At this point, the precision is still lower because synonymous terms are regarded as negative examples during evaluation. When synonym replacement (SR) is used, the performance increases for some roles but not for others. When multi-role voting is used, the disagreement among the roles is resolved, and thus the performance rises to a relatively higher level.
We provide examples in Appendix C to facilitate the understanding of our ablation study.

Conclusion
We utilize ChatGPT to rewrite sentences with different roles of domain-specific experts, so as to provide informative and high-quality contexts for PLM-based ATE models. Experiments show that the proposed method contributes to the salvage of neglected aspect terms, and applies better to short and long sentences. In the future, we will use the rewritten sentences for contrastive learning. To reduce the reliance on ChatGPT, we will develop an offline context rewriting method via knowledge distillation and domain-specific pretraining.

Limitations
We propose to use ChatGPT as an auxiliary toolkit to produce informative and high-quality contexts for context-aware token-level encoding, so as to enhance PLM encoders for aspect term recognition. Our experiments show that the proposed method yields slight improvements when coupled with strong domain-specific models; it is not only effective in recalling neglected cases, but also performs better on short and long instances. Unavoidably, the proposed method is limited in building a self-contained model, due to the lack of training and fine-tuning. To overcome this problem, we will develop a lightweight generator comparable to ChatGPT via knowledge distillation and domain-specific pretraining. Furthermore, we will incorporate role-based rewritten sentences into the training process, with the support of contrastive learning.

Figure 1: Significant improvements in recall rates yielded by CHEF on the incompletely-solved sentences.

Table 1: Experts of the Restaurant and Laptop domains.

Table 2: Statistics of the ATE datasets. #Sentence and #Aspect denote the numbers of sentences and aspects.

Table 3: The performance (F1-score) of various methods. "ICL" refers to in-context learning, while "COT" stands for chain-of-thought. The sign "+" represents a certain method combined with the baseline model.