End-to-end Aspect-based Sentiment Analysis with Combinatory Categorial Grammar

End-to-end Aspect-based Sentiment Analysis (EASA) is a natural language processing (NLP) task that involves extracting aspect terms and identifying the sentiments towards them, which provides a fine-grained level of text analysis and thus requires a deep understanding of the running text. Many previous studies leverage advanced text encoders to extract contextual information and use syntactic information, e.g., the dependency structure of the input sentence, to improve model performance. However, such models may reach a bottleneck, since the dependency structure is not designed to provide the semantic information of the text, which is also important for identifying the sentiment, and thus leave room for further improvement. Considering that combinatory categorial grammar (CCG) is a formalism that expresses both the syntactic and semantic information of a sentence, it has the potential to be beneficial to EASA. In this paper, we propose a novel approach to improve EASA with CCG supertags, which carry the syntactic and semantic information of the associated words and serve as the most important part of the CCG derivation. Specifically, our approach employs a CCG supertag decoding process to learn the syntactic and semantic information carried by CCG supertags and uses this information to guide the attention over the input words so as to identify important contextual information for EASA. Furthermore, a gate mechanism is used to incorporate the weighted contextual information into the backbone EASA decoding process. We evaluate our approach on three publicly available English datasets for EASA, and show that it outperforms strong baselines and achieves state-of-the-art results on all datasets.


Introduction
End-to-end aspect-based sentiment analysis (EASA) is an important task that provides an understanding of the attitudes and emotions of individuals towards specific topics or entities at a fine-grained level. In general, EASA is required to identify aspect terms in the running text and then predict their sentiment polarities. Therefore, understanding sentence structure is of great importance in locating different parts of a sentence and thus finding the aspect and associated sentiment terms for EASA. For example, in Figure 1, the sentence "Total environment is fantastic although I hate bar service" contains two aspect terms, namely, "environment" and "bar service", and the sentiment polarities towards them are positive and negative, respectively.

† Corresponding author.
1 The code involved in this paper is released at https://github.com/synlp/ASA-CCG.

Figure 1: An example sentence for EASA where the sentiments of two aspect terms, namely, "environment" and "bar service", are positive and negative, respectively. The supertag and EASA label of each word are also presented for better illustration. The EASA labels aggregate the BIO labels for aspect term extraction with the positive/negative labels for sentiment polarity.
Most of the previous studies (Hu et al., 2019; He et al., 2019; Luo et al., 2019; Wang et al., 2021; Bie and Yang, 2021; Li et al., 2019a,b; Chen et al., 2020; Liang et al., 2021) on EASA are generally grouped into three categories, namely, pipeline, multi-task, and joint-label approaches, based on how they formalize the task. Among the three types of approaches, the joint-label approaches aggregate labels for the sub-tasks rather than conducting them directly, and they achieve the best performance. Figure 1 shows the aggregated EASA labels, where the first part indicates the position of a word in an aspect term following the BIO schema and the second part indicates the sentiment of the aspect term. In performing EASA, previous studies employ advanced encoders, such as LSTM, Transformer (Vaswani et al., 2017), and BERT (Devlin et al., 2019), to capture contextual information and achieve outstanding performance. Since most aspect terms are noun phrases and the syntactic structure of the text is able to provide additional clues about the sentiment expressed towards an aspect, previous studies leverage syntactic information, e.g., the dependency structure of the sentence, to further enhance sentiment analysis (Huang and Carley, 2019; Tian et al., 2021a,b; Wu et al., 2021; Liang et al., 2021). However, approaches enhanced by conventional syntactic information, especially dependencies, may reach a bottleneck, since such information is unable to provide semantic information about the sentence, which is also important for EASA.
Combinatory categorial grammar (CCG) offers an alternative to phrase structure grammar for describing syntax and building transparent connections between syntax and semantics (Steedman, 1987; Baldridge, 2002; Hockenmaier and Steedman, 2005). In addition to using combinatory rules over simple syntactic categories, a key feature of CCG is its use of type-logical semantics, which provides a systematic way of associating meanings with the syntactic structures generated by the grammar. This allows for a more precise and intuitive representation of the meaning of a sentence as well as of each word in it. Particularly, the lexical category (also known as the CCG supertag) associated with each word conveys both the syntactic and semantic information of that word. Therefore, learning the CCG supertagging process allows a model to learn the syntactic and semantic function of each word in the running text, and this has shown its effectiveness in many tasks (Lewis et al., 2015; Kasai et al., 2019; Tian and Song, 2022).
In this study, we hypothesize that CCG could be useful in enhancing model performance for EASA as well. We propose a joint-label approach following the encoding-decoding paradigm to enhance EASA with CCG supertags. In doing so, we enhance EASA with a CCG supertag decoding process to learn from the CCG supertags automatically annotated by an off-the-shelf CCG supertagger. An attention mechanism is performed over all input words to identify the ones that contribute to the EASA task, where the attention weights are guided by the supertag decoding process. This allows our model to learn the syntactic and semantic information carried by CCG supertags through the CCG decoding process rather than using them as additional input features, and thus makes our model run faster in inference. Furthermore, considering that there could be noise in the auto-generated CCG supertags, we introduce a gate mechanism to balance the contributions of the context information obtained from the text encoder and the attention module. Experimental results on three English benchmark datasets for EASA demonstrate the effectiveness of our approach, which outperforms strong baselines and achieves state-of-the-art performance.

Related Work
The EASA task has drawn much attention in recent years, and existing approaches can be categorized into three groups, namely, pipeline approaches, multi-task approaches, and joint-label approaches. Specifically, the pipeline approaches (Mitchell et al., 2013; Zhang et al., 2015; Hu et al., 2019) contain two steps, where the first performs aspect term extraction and the second predicts the sentiment polarities of the extracted aspect terms. The multi-task approaches (Ma et al., 2018; Luo et al., 2019; He et al., 2019; Wang et al., 2021) normally use a text encoder to model the input and employ multi-task learning with two separate decoding processes to extract the aspect terms and predict the sentiments. Joint-label approaches (Li and Lu, 2017; Li et al., 2019a; Chen et al., 2020) perform the aspect term extraction and the sentiment polarity prediction tasks simultaneously through a unified labeling scheme.
Most recent approaches to EASA apply advanced encoders (e.g., LSTM, Transformer, and BERT) and achieve promising performance. To further enhance EASA, Chen et al. (2020) proposed a joint-label model that leverages the dependencies between words through graph-based models; Wang et al. (2021) used a hierarchical architecture to perform multi-task learning and outperformed existing multi-task approaches. Overall, the pipeline approaches often suffer from error propagation across the cascaded steps, and the multi-task approaches have label mismatching problems, where the decoding results of the sub-tasks do not always match each other. In comparison, joint-label approaches are superior in avoiding both error propagation and label mismatch problems.
Compared with previous studies, this paper proposes a joint-label approach for EASA enhanced with CCG information (rather than the dependencies widely used in previous studies) through a CCG supertag decoding process, an attention mechanism, and a gate module.

Combinatory Categorial Grammar
Combinatory categorial grammar (CCG) is a linguistic framework that describes the syntax and semantics of natural language using function-like categories. It is based on the idea that the meaning of a sentence can be derived from the combination of its individual words, where each word is assigned a specific category (i.e., a supertag) based on its grammatical function. CCG uses a set of combinatory rules to combine these categories in order to derive the meaning of a sentence.
The advantage of CCG for NLP is its ability to handle the complex structure of natural language. CCG provides a systematic and rigorous way to analyze the syntactic and semantic structure of sentences, which is essential for building accurate and efficient NLP systems. It also allows for the integration of syntactic and semantic information, which is crucial for language understanding. For example, for the clause "I hate bar service" in Figure 1, its CCG derivation shows that "bar service" is a nominal phrase serving as the object, and thus the patient, of the predicate "hate", which implies a negative sentiment towards the aspect term "bar service".
Compared with the widely used phrase structure grammar (PSG) or dependency grammar, CCG is potentially useful for EASA in two aspects. First, CCG is normally lexicalized, where the syntactic categories of words and phrases are determined not only by their syntactic function, but also by their specific meaning and usage. For example, verbs can be divided into intransitive, transitive, and ditransitive verbs depending on the number of arguments they take. If a verb's supertag is "S\NP", the verb requires only an NP argument from the left (i.e., the subject) and thus is an intransitive verb; if its supertag is "(S\NP)/NP", it first requires an NP from the right (i.e., an object) and then an NP from the left (i.e., a subject), and thus is a transitive one. Therefore, the supertag of a word specifies not only the word's syntactic category but also other relevant information, such as its subcategorization frame and the types of its arguments, which is of great importance in analyzing semantic relations among words. Second, CCG is able to handle long-distance dependencies in an elegant way through combinatory rules, especially when it uses "type raising", a process that changes the syntactic category of a word or a phrase to represent the complex syntactic structure of a sentence, and thus captures important contextual information for EASA.
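Returning to the first point, the following minimal sketch makes the lexicalization concrete by decomposing a CCG category string into the arguments it takes; the parsing helpers and their names are our own illustration, not part of any CCG toolkit.

```python
def unwrap(cat: str) -> str:
    """Remove one pair of outer parentheses if they span the whole category."""
    if cat.startswith("(") and cat.endswith(")"):
        depth = 0
        for i, ch in enumerate(cat):
            depth += ch == "("
            depth -= ch == ")"
            if depth == 0 and i < len(cat) - 1:
                return cat  # outer parens close early, e.g. in "(S\NP)/(S\NP)"
        return cat[1:-1]
    return cat


def arguments(supertag: str):
    """List the (slash, argument) pairs a category takes, outermost first."""
    args = []
    while True:
        depth, split_at = 0, None
        # The outermost combinator is the rightmost slash at depth 0.
        for i in range(len(supertag) - 1, -1, -1):
            ch = supertag[i]
            depth += ch == ")"
            depth -= ch == "("
            if ch in "/\\" and depth == 0:
                split_at = i
                break
        if split_at is None:  # atomic category such as "NP" or "S"
            return args
        args.append((supertag[split_at], unwrap(supertag[split_at + 1:])))
        supertag = unwrap(supertag[:split_at])


# "S\NP" takes a single NP from the left: an intransitive verb (or a
# predicative adjective such as "fantastic" in Figure 1).
print(arguments(r"S\NP"))        # [('\\', 'NP')]
# "(S\NP)/NP" additionally takes an NP object from the right: transitive.
print(arguments(r"(S\NP)/NP"))   # [('/', 'NP'), ('\\', 'NP')]
```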

Joint-label Approaches for EASA
Previous joint-label approaches formalize EASA as a standard sequence labeling task, where an input sentence with $n$ words, namely, $\mathcal{X} = x_1 x_2 \cdots x_n$, is mapped to a label sequence $\mathcal{Y} = y_1 y_2 \cdots y_n$, with $y_i$ denoting the joint label for $x_i$. Herein, a joint label consists of two parts: the first part refers to the BIO label with respect to the aspect term boundary, and the second part indicates the sentiment polarity (i.e., positive, negative, or neutral) of that aspect.
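For illustration, here is a minimal sketch of such a joint labeling scheme applied to the sentence in Figure 1; the exact tag strings (e.g., "B-POS") are our naming and may differ from the datasets' original label inventory.

```python
def joint_labels(words, aspects):
    """Build joint BIO+sentiment labels for a tokenized sentence.

    `aspects` maps (start, end) word spans (end exclusive) to a polarity
    in {'POS', 'NEG', 'NEU'}.
    """
    labels = ["O"] * len(words)
    for (start, end), polarity in aspects.items():
        labels[start] = f"B-{polarity}"
        for i in range(start + 1, end):
            labels[i] = f"I-{polarity}"
    return labels


words = "Total environment is fantastic although I hate bar service".split()
aspects = {(1, 2): "POS", (7, 9): "NEG"}
print(joint_labels(words, aspects))
# ['O', 'B-POS', 'O', 'O', 'O', 'O', 'O', 'B-NEG', 'I-NEG']
```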

The Approach
Our approach for EASA follows the standard sequence labeling paradigm, whose architecture is illustrated in Figure 2. Specifically, our approach consists of four parts: (1) the backbone model to predict EASA labels, (2) the supertag decoding process to learn the syntactic and semantic information carried by supertags, (3) the attention module to weigh different contextual information, and (4) the gate mechanism to balance the contributions of the backbone model and the attention module to the EASA task. The process of our approach (which is denoted as $f$) for EASA can be formalized as

$$\widehat{\mathcal{Y}} = f(\mathcal{X}, \mathcal{Y}^S, G, A)$$

where $\mathcal{Y}^S = y^s_1 y^s_2 \cdots y^s_n$ stands for the supertag sequence obtained from the supertag decoding process (denoted as $S$), $G$ is the gate module, and $A$ refers to the attention module. In training, the predicted EASA joint labels $\widehat{\mathcal{Y}}$ are compared with the gold standard $\mathcal{Y}^*$ to obtain the EASA loss $\mathcal{L}_e$, and the predicted supertags $\widehat{\mathcal{Y}}^S$ are compared with the silver standard $\mathcal{Y}^{S*}$ obtained from a CCG supertagger to compute the supertag loss $\mathcal{L}_s$. The total loss is the sum of the EASA loss and the supertag loss, i.e., $\mathcal{L} = \mathcal{L}_e + \mathcal{L}_s$, and the model is updated accordingly. In this section, we first introduce the overall tagging process for EASA labels, next present the supertag decoding process, then elaborate on the attention module, and finally illustrate the gate mechanism.
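As a sketch of this training objective (a plain unweighted sum of two cross-entropy losses over batched logits of shape (batch, n, classes); masking of padding tokens is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def joint_loss(easa_logits: torch.Tensor, easa_gold: torch.Tensor,
               supertag_logits: torch.Tensor,
               supertag_silver: torch.Tensor) -> torch.Tensor:
    """L = L_e + L_s: cross-entropy against gold EASA labels plus
    cross-entropy against silver supertags from the off-the-shelf tagger."""
    # cross_entropy expects (batch, classes, n) logits and (batch, n) targets
    l_e = F.cross_entropy(easa_logits.transpose(1, 2), easa_gold)
    l_s = F.cross_entropy(supertag_logits.transpose(1, 2), supertag_silver)
    return l_e + l_s
```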

Overall Tagging Process
The overall tagging process follows the standard encoding-decoding paradigm. First, a text encoder $f_e$ (such as BiLSTM, Transformer, or BERT) is applied to the input sentence to obtain the hidden vector sequence $\mathcal{H} = h_1 h_2 \cdots h_n$, where $h_i$ is the hidden vector of $x_i$ that stores the contextual information learned by the encoder. Then, the hidden vector of each word is fed into a multi-layer perceptron (denoted by $\text{MLP}^e$) to obtain the hidden vector $h^e_i$ for EASA through $h^e_i = \text{MLP}^e(h_i)$. Afterwards, $h^e_i$ and the output $a_i$ from the attention module are fed into the gate module to obtain the output $o_i$ through $o_i = G(h^e_i, a_i)$. Next, $o_i$ is fed into a fully connected layer to produce $u_i = W \cdot o_i + b$, with $W$ and $b$ being its weight matrix and bias vector. Finally, a softmax classifier is applied to the resulting vector $u_i$ to predict the joint label $\hat{y}_i$ for EASA.
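The following minimal PyTorch sketch mirrors this decoding path; the one-hidden-layer shape of $\text{MLP}^e$ and the ReLU activation are our assumptions, and the gate module (defined below) is passed in as a callable.

```python
import torch
import torch.nn as nn

class EASAHead(nn.Module):
    """Backbone EASA head: h^e_i = MLP^e(h_i), o_i = G(h^e_i, a_i), and
    u_i = W·o_i + b; a softmax over u_i then predicts the joint label."""

    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.mlp_e = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # the gate outputs the concatenation of two gated streams
        # (see the gate mechanism below), hence the doubled input size
        self.fc = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, h: torch.Tensor, a: torch.Tensor, gate) -> torch.Tensor:
        h_e = self.mlp_e(h)   # (batch, n, hidden_dim)
        o = gate(h_e, a)      # (batch, n, 2 * hidden_dim)
        u = self.fc(o)        # (batch, n, num_labels)
        return u              # softmax/argmax over u yields ŷ_i at prediction time
```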

Supertag Decoding
To leverage the information carried by CCG supertags, one straightforward approach is to use an off-the-shelf CCG supertagger to tag the input sentence and then use the supertags as extra word-level features by concatenating them with the input words before sending them to the text encoder. However, such approaches require CCG supertagging as a pre-processing step in inference, which is inefficient, especially when the data to be processed is relatively large. Considering that combining several decoding processes is an effective approach to learning from different tasks without requiring the labels of those tasks as extra input, we propose to learn the CCG information through an additional CCG supertag decoding process and then use the CCG information to guide EASA through an attention mechanism over all input words.
For CCG supertag decoding, we take the hidden vector $h_i$ of the word $x_i$ obtained from the encoder and pass it through an MLP (denoted by $\text{MLP}^s$) to get $h^s_i = \text{MLP}^s(h_i)$. The resulting $h^s_i$ is then mapped to the vector $u^s_i = W^s \cdot h^s_i + b^s$ in the output space by a trainable matrix $W^s$ and a bias vector $b^s$. Finally, a softmax classifier is applied to $u^s_i$ to predict the supertag $\hat{y}^s_i$.
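A corresponding sketch of this head follows (assuming, as above, a one-hidden-layer MLP with ReLU for $\text{MLP}^s$):

```python
import torch
import torch.nn as nn

class SupertagDecoder(nn.Module):
    """CCG supertag head sharing the encoder: h^s_i = MLP^s(h_i) and
    u^s_i = W^s · h^s_i + b^s; a softmax over u^s_i predicts ŷ^s_i."""

    def __init__(self, hidden_dim: int, num_supertags: int):
        super().__init__()
        self.mlp_s = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim, num_supertags)  # W^s and b^s

    def forward(self, h: torch.Tensor):
        h_s = self.mlp_s(h)   # also consumed by the attention module below
        u_s = self.out(h_s)   # supertag logits for the auxiliary loss
        return h_s, u_s
```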

Supertag-driven Attentions
In the attention module, we use two trainable matrices $W^k$ and $W^v$ to map $h^s_j$ to the key vector $k_j$ and the value vector $v_j$, respectively:

$$k_j = W^k \cdot h^s_j, \qquad v_j = W^v \cdot h^s_j$$

Then, for each word $x_i$, we compute the attention weight $p_{i,j}$ assigned to the value $v_j$ through

$$p_{i,j} = \frac{\exp(h_i \cdot k_j)}{\sum_{j'=1}^{n} \exp(h_i \cdot k_{j'})}$$

Afterwards, we apply $p_{i,j}$ to the value vector $v_j$ and obtain the weighted sum vector $a_i$ via

$$a_i = \sum_{j=1}^{n} p_{i,j} \cdot v_j$$

Finally, $a_i$ is fed into the gate module.
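A minimal PyTorch sketch of this module follows; using the encoder state $h_i$ as the query and a scaled dot-product score is our assumption, since the paper's exact scoring function is not recoverable from the text here.

```python
import math
import torch
import torch.nn as nn

class SupertagAttention(nn.Module):
    """Keys and values come from the supertag hidden vectors h^s_j;
    a_i is the attention-weighted sum of the values."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_k = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W^k
        self.w_v = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W^v

    def forward(self, h: torch.Tensor, h_s: torch.Tensor) -> torch.Tensor:
        k = self.w_k(h_s)                                  # k_j = W^k · h^s_j
        v = self.w_v(h_s)                                  # v_j = W^v · h^s_j
        scores = h @ k.transpose(-2, -1) / math.sqrt(h.size(-1))
        p = torch.softmax(scores, dim=-1)                  # p_{i,j}
        return p @ v                                       # a_i = Σ_j p_{i,j} v_j
```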
In training, the model is optimized for both EASA and CCG supertagging, which allows our model to learn CCG information and use it to enhance the entity representation through the attention mechanism, with the attention weights assigned to different input words guided by the learned CCG information.

The Gate Mechanism
We observe that the contribution of the obtained contextual information to the EASA task can vary in different contexts, and a gate module (denoted by $G$) is naturally desired to weight such information in varying contexts. Thus, to improve the capability of EASA with the semantic information, we propose a gate module to aggregate such information into the backbone EASA model. Particularly, we use a reset gate to control the information flow by

$$r_i = \sigma(W_1 \cdot h^e_i + W_2 \cdot a_i + b)$$

where $\sigma$ is the sigmoid function, $W_1$ and $W_2$ are trainable matrices, and $b$ is the corresponding bias term. Afterwards, we use

$$o_i = (r_i \circ h^e_i) \oplus ((\mathbf{1} - r_i) \circ a_i)$$

to balance the information from the backbone model and the attention module, where $\oplus$ denotes the vector concatenation operation, $o_i$ is the derived output of the gate module, $\circ$ represents the element-wise multiplication operation, and $\mathbf{1}$ is a vector with all its elements equal to 1.
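A sketch of the gate under this description is given below; the sigmoid activation and the assignment of $r_i$ to the backbone stream and $\mathbf{1} - r_i$ to the attention stream follow our reconstruction of the formula above.

```python
import torch
import torch.nn as nn

class GateModule(nn.Module):
    """Reset gate balancing h^e_i and a_i, concatenating the two gated
    streams: o_i = (r_i ∘ h^e_i) ⊕ ((1 − r_i) ∘ a_i)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_1
        self.w2 = nn.Linear(hidden_dim, hidden_dim)              # W_2 and b

    def forward(self, h_e: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.w1(h_e) + self.w2(a))  # reset gate r_i
        return torch.cat([r * h_e, (1 - r) * a], dim=-1)
```

Together with the sketches above, the data flow in Figure 2 would read: `h_s, u_s = SupertagDecoder(...)(h)`, `a = SupertagAttention(...)(h, h_s)`, and `o = GateModule(...)(h_e, a)`, with `o` then passed to the fully connected EASA classifier.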
Experimental Settings

Datasets
In the experiments, we follow previous studies for EASA and evaluate models on three English benchmark datasets: the restaurant (REST) dataset from the SemEval ABSA challenges (Pontiki et al., 2014, 2015, 2016), the laptop (LPTP) dataset, and the Twitter (TWTR) dataset, where sentences are annotated with aspect terms and their sentiment polarities. Following the convention in previous studies for aspect-level sentiment analysis (Li et al., 2019a,b; He et al., 2019; Hu et al., 2019; Qin et al., 2021, 2022), we only consider three sentiment polarities, namely, positive, negative, and neutral, where all cases labeled with a conflict sentiment in the REST and LPTP datasets are filtered out. For all datasets, we report in Table 1 their numbers of sentences and aspect terms, as well as the numbers of aspect terms with positive, neutral, and negative sentiment polarities. The TWTR dataset does not have a standard train-test split and thus we only report its overall statistics.

Table 3: Experimental results (average F1 scores of five runs and the standard deviation) of different models on the three English benchmark datasets. The last two columns show the number of parameters and the inference speed (in terms of the number of sentences processed per second) of the models, respectively. The improvement of our approach over the baselines is significant under all settings with p < 0.005.

Baselines
To explore the effect of our approach in leveraging CCG information, we compare our approach with the following baselines:

Base: This baseline follows the standard encoding-decoding paradigm, which corresponds to the backbone model of our approach.
Concat: This baseline uses the auto-generated CCG supertags as additional input. Specifically, the embeddings of the supertags are concatenated with the hidden vectors of the corresponding words, and the resulting vectors are fed into a two-layer Transformer to further encode contextual information before passing through the decoder.
Attention + Gate: This baseline uses the same attention and gate module as our proposed model but does not use the CCG decoding process to learn the CCG supertag information.

Implementation
To obtain the auto-generated CCG supertags for the baselines and our approach, we use an off-the-shelf CCG supertagger called NeST-CCG (Tian et al., 2020) to supertag the text. Because a high-quality text representation is important for a model to obtain high performance in many tasks (Mikolov et al., 2013; Song et al., 2017; Bojanowski et al., 2017; Song et al., 2018; Song and Shi, 2018; Peters et al., 2018; Diao et al., 2020; Lewis et al., 2020; Song et al., 2021), in the experiments we choose two commonly used pre-trained language models, namely BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). We use the base and large versions of the models following the default hyper-parameter settings, i.e., 12 layers of self-attention with 768-dimensional hidden vectors for the base version and 24 layers of self-attention with 1024-dimensional hidden vectors for the large version. We randomly initialize all trainable parameters in our approach and update them during training. For REST and LPTP, which do not have an official development set, we tune hyper-parameters on a development set held out from the training data (the hyper-parameters tested are listed in Table 2). Once the hyper-parameters are tuned, we train our final model on the whole training set and evaluate it on the test set. For TWTR, we follow the convention in previous studies (Mitchell et al., 2013; Zhang et al., 2015; Li et al., 2019a; Luo et al., 2019; Hu et al., 2019) to use ten-fold cross-validation, where the hyper-parameter tuning process is the same as in the experiments on REST and LPTP. For evaluation, we follow previous studies (Li et al., 2019a,b; Luo et al., 2019; He et al., 2019; Hu et al., 2019) and evaluate all models with F1 scores.

Results and Analysis

Overall Results
We run our models and the baselines using the base and large versions of BERT and XLNet. Table 3 shows the average results (F1 scores) of five runs on the development and test sets of the three English datasets, where the number of parameters ("Para. #") and the inference speed ("Speed"), in terms of the number of sentences processed per second, are shown in the last two columns.
There are several observations from Table 3. First, our approach outperforms the "Base" model with the base and large versions of BERT and XLNet on all datasets without requiring many additional model parameters, which is promising given that the BERT and XLNet baselines have already achieved outstanding performance on these datasets. Second, compared with the "Concat" baselines that leverage CCG supertags by concatenating the supertag embeddings with the hidden vectors, our approach consistently achieves better performance under all settings, which demonstrates the effectiveness of the attention and gate modules, as well as the CCG supertag decoding process, in leveraging CCG information to improve EASA. Third, our model significantly outperforms the "Att+Gate" models that lack the CCG supertag decoding process to learn the CCG information, which confirms the effectiveness of the CCG supertags in providing the necessary information to guide the model to distinguish important context information.
Next, we compare our best-performing models using the large versions of BERT and XLNet with previous studies for EASA. The results in Table 4 show that our approach outperforms all previous studies and achieves state-of-the-art performance on the three benchmark datasets. Particularly, different from existing approaches that leverage dependencies for EASA, our approach proposes an alternative by incorporating the syntactic and semantic information from CCG supertags into the EASA task. Our model integrates a CCG supertag decoding process to automatically learn the syntactic and semantic information contained within CCG supertags, which eliminates the need for using CCG supertags as additional input features, a practice that is computationally expensive in inference.

Comparison with Dependencies
Since the dependency structure of the input sentence is widely used in previous studies for EASA, we compare our approach, which leverages CCG supertags, with approaches that leverage dependencies. Because graph attention networks (GAT) (Veličković et al., 2017) are a widely used architecture with an attention mechanism to encode dependencies and have been demonstrated to be effective in many NLP tasks, we use a GAT to encode the dependencies of the input sentence obtained from the dependency parser DMPar (Tian et al., 2022). The average F1 scores of the GAT model and our approach, as well as the "Base" model, with the base and large versions of BERT and XLNet are reported in Table 5, where our approach consistently outperforms the GAT model under all settings. This observation shows that, in addition to dependencies, CCG can be another type of linguistic information that is beneficial for EASA.

Ablation Study
To determine the effect of the attention module and the gate module in leveraging CCG supertag information, we conduct an ablation study where either the attention module or the gate module is ablated from our full model. Specifically, when the attention module is ablated, the attention weights are equal for all the words; when the gate module is ablated, the output of the attention module is directly concatenated with the hidden vector of each word. Table 6 summarizes the average performance of the different models, where the performance of the "Base" model with the base and large versions of BERT and XLNet is included for reference. The table shows that the ablation of either the attention module or the gate module hurts the model's performance. Furthermore, the performance decrease from ablating the attention module is much more severe than that from ablating the gate module, indicating the importance of the attention module in leveraging CCG supertags, as the attention allows the model to identify the important context features for EASA and leverage them accordingly to improve system performance.

The Effect of Supertags
To investigate the contribution of individual CCG supertags to the EASA task, for each word that is part of an aspect term, we collect the attention weights over the input words and their attached CCG supertags, and compute the average weight of each supertag. Figure 3 presents the top 5 ranking CCG supertags when using our model with the BERT-base encoder on the test sets of REST, LPTP, and TWTR. It is observed that "N/N", "S\NP", "(S\NP)/NP", and "(S\NP)/(S\NP)" are the top-ranking CCG supertags shared by all datasets. One possible explanation is the following. "N/N" is often the supertag of a noun modifier (e.g., the word "beautiful" in "beautiful view"), which could express sentiment toward the noun being modified. The supertag "S\NP" is used for both intransitive verbs and adjectives in the predicative position (e.g., "fantastic" in "Total environment is fantastic" shown in Figure 1); words in that position could provide sentiment information toward the subject of the clause. The supertag "(S\NP)/NP" is normally associated with transitive verbs, some of which (such as "like", "hate", and "prefer") may express sentiment toward the object of the clause. The supertag "(S\NP)/(S\NP)" is normally used for adverbs that modify verb phrases, some of which (e.g., "happily" and "not") can be important for identifying the sentiment.
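For reference, per-supertag averages of this kind can be computed with a few lines of code; the aggregation below is our own reading of the described analysis, not released code.

```python
from collections import defaultdict

def top_supertags_by_weight(aspect_rows, supertags, top_n=5):
    """`aspect_rows` holds, for each word inside an aspect term, its
    attention weight vector p_i over all input words; `supertags` holds
    the auto-assigned supertag of each input word."""
    total, count = defaultdict(float), defaultdict(int)
    for row in aspect_rows:
        for weight, tag in zip(row, supertags):
            total[tag] += weight
            count[tag] += 1
    # rank supertags by the average weight they receive
    averages = {tag: total[tag] / count[tag] for tag in total}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```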

Conclusion
In this paper, we propose to enhance joint-label EASA by leveraging CCG information, which contains both the syntactic and semantic information of the running text and thus shows its superiority over conventional phrase structure and dependency grammars. Specifically, we learn the CCG supertag information through a CCG supertag decoding process, where such information is used to guide the attention weights over the input words in the attention module, so that important context information is distinguished and leveraged accordingly. To further enhance model performance, we employ a gate module to balance the contributions of the context information obtained from the backbone text encoder and the attention module. Experimental results and further analyses on three English benchmark datasets demonstrate the effectiveness of the proposed model, which outperforms strong baselines and achieves state-of-the-art performance on all datasets. For future work, we plan to explore effective approaches to leverage the full CCG derivation of the input text to improve EASA.


Figure 2: The overall architecture of our model. The right-hand side illustrates the supertag decoding process that learns contextual information by predicting CCG supertags, and the attention mechanism that obtains important contextual information guided by the CCG supertags. The left-hand side presents the backbone model for EASA and the gate module that incorporates the contextual information from the attention module into the backbone model.

Figure 3: The average attention weight assigned to the top 5 ranking CCG supertags by our model with the BERT-base encoder on the test set of (a) REST, (b) LPTP, and (c) TWTR.

Table 1: The statistics of the three benchmark datasets, showing the numbers of sentences and aspect terms, as well as the numbers of aspect terms with positive (POS), negative (NEG), and neutral (NEU) sentiment polarities.

Table 2: The hyper-parameters tested in our experiments. The best ones are highlighted in boldface.

Table 4: Comparison of our best-performing models with previous studies.

Table 5: Comparison of the results (average F1 scores) of the models with GAT to encode dependency information and our models leveraging CCG supertags, where the performance of the "Base" model with BERT and XLNet encoders is also reported for reference.

Table 6: Experimental results (average F1 scores) of the ablation studies, where either the attention module ("− Att") or the gate module ("− Gate") is ablated from our full model. The performance of the "Base" model with the base and large versions of BERT and XLNet is also reported for reference.