A Sequence-to-Structure Approach to Document-level Targeted Sentiment Analysis



Introduction
Aspect-based sentiment analysis (ABSA) has received wide attention in NLP for nearly two decades (Hu and Liu, 2004). Most of the previous studies have focused on sentence-level ABSA. However, a review text often consists of multiple sentences, and the opinion targets expressed in these sentences are often interrelated. Conducting sentence-level ABSA on each individual sentence cannot capture the interrelated opinion targets in the entire document. In comparison, document-level ABSA is more suitable for practical applications, yet it has not received enough attention. Only a limited amount of work attempted to identify the sentiments towards all aspect categories in a document (e.g., the Food Quality category in a Restaurant domain), without involving explicit opinion target terms (e.g., entities or aspects) (Titov and McDonald, 2008; Pontiki et al., 2016; Li et al., 2018; Wang et al., 2019; Bu et al., 2021). Recently, Luo et al. (2022) introduced a new task called document-level Targeted Sentiment Analysis (document-level TSA), aiming to discover the opinion target consisting of multi-level entities (or aspects) in a review document, and predict the sentiment polarity label (Positive, Negative or Mixed) towards the target. In sentence-level ABSA, an opinion target is usually a single entity or aspect. In document-level TSA, by contrast, it often involves multiple entities and aspects throughout the review document, and their relation is affiliated rather than flat. As shown in Figure 1, "restaurant - lobster_roll - price" is an opinion target consisting of three-level entities, indicating the price of the lobster roll sold in the restaurant. Luo et al. (2022) accordingly proposed a sequence-to-sequence (Seq2Seq) framework to solve the task. Using BART as the backbone, they take the review document as the input and output a sequence encoding a set of target-sentiment tuples. For example, the tuple "<b> restaurant <i> lobster_roll <i> price <e> Negative <se>" denotes that the multi-level opinion target is "restaurant - lobster_roll - price" and its corresponding sentiment is Negative.
As shown in Figure 1(b), there is actually a hierarchical structure among the multiple opinion targets in the document: the first layer is the "restaurant" entity, the second layer contains three entities (or aspects) affiliated to "restaurant" ("service", "lobster_roll" and "staff"), and the third layer is the "price" aspect of "lobster_roll". Although the Seq2Seq method appears simple and straightforward, it is ill-suited to modeling such a complex hierarchical structure. On one hand, it outputs the structural information as a sequence of tuples, where the previous tuples affect the generation of subsequent ones. On the other hand, the inherent encoder-decoder architecture is less flexible and effective in modeling the long-distance dependencies among affiliated entities/aspects across sentences.
To address the aforementioned issues, we propose a Sequence-to-Structure (Seq2Struct) approach in this work for the document-level TSA task. Our approach still takes a document as the input, but the output is no longer a sequence; it is a structure, as shown in Figure 1(b). It is a hierarchical structure with multiple layers of related entities, where each entity is assigned a predicted sentiment polarity. Seq2Struct contains four main steps. We firstly identify flat opinion entities and their sentiments from the document. Secondly, we propose a multi-grain graphical model based on a graph convolutional network (GCN), to better learn the semantic relations between the document, sentences and entities. Thirdly, we employ a table-filling method to identify the affiliation relations among flat entities and consequently obtain the hierarchical opinion target structure. We finally incorporate the sentiments of the flat entities into the hierarchical structure and parse out the target-sentiment pairs as defined in (Luo et al., 2022).
We evaluate our approach on the document-level TSA dataset containing six domains. In addition to the existing approach, we further construct four strong baselines with different pretrained models. The experimental results show that our Seq2Struct approach significantly outperforms all the baselines on the average F1 of the target-sentiment pairs. Aside from the advantage in performance, our approach can explicitly display the hierarchical structure of the opinion targets in a document. We also make in-depth discussions from the perspectives of document length, the number of levels in an opinion target, etc., verifying the effectiveness of our approach in capturing the long-distance dependencies among across-sentence entities in document-level reviews.

Task Description
In the document-level TSA task, an opinion target often consists of multi-level entities with affiliated relations (e.g., "restaurant - lobster_roll - price"). We call it a "multi-level opinion target" in contrast to the opinion target at the sentence level. A document normally contains a set of multi-level opinion targets, which constitute a hierarchical structure as shown in Figure 1(b).
Similar to (Luo et al., 2022), we formulate document-level TSA as the task of detecting a set of target-sentiment pairs {(t, s)} from a document D = [x_1, x_2, ..., x_N] with N tokens, where t = e_1 - e_2 - ... - e_m is a multi-level opinion target with m denoting the number of its levels, and s ∈ {Positive, Negative, Mixed} is the corresponding sentiment.
The document-level TSA task is challenging, as the opinion target consists of multi-level entities, and a predicted opinion target is considered correct if and only if the entities at all levels exactly match the ground truth. For instance, "restaurant - lobster - price" is a correct target only if the first, second, and third levels are predicted as "restaurant", "lobster", and "price", respectively.

Approach
As shown in Figure 2, we propose a Sequence-to-Structure (Seq2Struct) approach to address the document-level TSA task, which consists of four main modules.

Flat Entity Extraction and Sentiment Classification
We adopt DeBERTa (He et al., 2021) as the encoder of the input document D = {x_1, x_2, ..., x_N}:

H = [h_1, h_2, ..., h_N] = DeBERTa(D),

where H denotes the hidden representations of the tokens. We then employ the BIO (Begin, Inside, Outside) tagging scheme to extract flat entities from D:

p(y^e_i) = softmax(W_e^T h_i),

where y^e_i ∈ {B, I, O}, W_e ∈ R^{d_model × 3} is the model parameter, and d_model is the dimension of the hidden representation of each token.
Let E = {e_i}_{i=1}^{|E|} represent the extracted entity set, where e_i = (x_start, ..., x_end). (Flat entities in this paper refer to all individual entities in the document, regardless of the hierarchy.) The representation of e_i is the mean pooling of its tokens: h_{e_i} = MeanPooling(h_start, ..., h_end). Furthermore, we perform sentiment prediction on the extracted flat entities. Specifically, for an entity e_i in E, we utilize an entity-context cross-attention module to capture the context information:

ĥ_{e_i} = Attention(Q, K, V), where Q = h_{e_i}, K = H, V = H.

Then, ĥ_{e_i} is fed into the softmax layer to predict the sentiment towards e_i:

p(y^s_i) = softmax(W_s^T ĥ_{e_i}),

where y^s_i ∈ {Positive, Negative, Mixed} and W_s ∈ R^{d_model × 3} is the model parameter.
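To make the tagging step concrete, a predicted B/I/O sequence can be decoded into entity spans with a single left-to-right scan. The sketch below is illustrative only; the function name and the inclusive (start, end) span convention are our own, not part of the paper's implementation:

```python
def decode_bio(tags):
    """Decode a BIO tag sequence into (start, end) token spans, end inclusive.

    Each span corresponds to one flat entity; a stray "I" without a
    preceding "B" is tolerated by treating it as the start of a span.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:        # close the previous entity
                spans.append((start, i - 1))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
        elif tag == "I" and start is None:
            start = i                    # tolerate an "I" that begins a span
    if start is not None:                # an entity running to the end
        spans.append((start, len(tags) - 1))
    return spans

# e.g. ["O", "B", "I", "O", "B", "I"] -> [(1, 2), (4, 5)]
```

The mean-pooled representation h_{e_i} is then computed over each returned span.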

Multi-grain Graphical Model
The affiliated entities in a multi-level opinion target often exist in multiple sentences. In Table 1, we report the number and proportion of across-sentence opinion targets in four domains of the dataset (Luo et al., 2022). It can be seen that the proportion of across-sentence opinion targets among all multi-level opinion targets reaches 80.0%, 72.5%, 88.2% and 77.5% in the four domains respectively. To better model the affiliated relations and long-distance dependencies between different entities, we construct a multi-grain graphical model to enhance entity representation learning.
The graph contains three different types of nodes: 1) a Document node, 2) Sentence nodes, and 3) Entity nodes. These nodes are linked with three types of edges: 1) Document-to-Sentence edges: the document node is linked to all sentence nodes; 2) Sentence-to-Entity edges: each sentence node is linked to all entity nodes it contains; 3) Entity-to-Entity edges: we maintain a global entity relation map to capture affiliated relations between entities. As long as entity e_k and entity e_l were annotated as adjacent upper-level and lower-level entities (i.e., e_l is affiliated with e_k) in any document of the training and validation corpus, e_k is linked to e_l.
On this basis, we construct the adjacency matrix A, where A_ij = 1 if the i-th node and the j-th node have an edge, and A_ij = 0 otherwise. We then employ the Graph Convolutional Network (Kipf and Welling, 2017) to update the representations of the nodes:

H^(l+1) = σ(A H^(l) W^(l)),

where l is the index of GCN sub-layers, W^(l) is the parameter matrix of the l-th sub-layer, and σ(·) is an activation function, i.e., ReLU.
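The graph construction and one GCN sub-layer can be sketched as follows. The node ordering (document first, then sentences, then entities), the mean normalization of the adjacency matrix, and the function names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def build_adjacency(num_sents, sent_entities, entity_edges, num_entities):
    """Build the multi-grain graph adjacency matrix.

    Node 0 is the document, nodes 1..num_sents are sentences, and the
    remaining nodes are entities. `sent_entities` maps a sentence index to
    the entity indices it contains; `entity_edges` lists (k, l) pairs where
    entity e_l is affiliated with entity e_k in the global relation map.
    """
    n = 1 + num_sents + num_entities
    A = np.eye(n)                                    # self-loops
    for s in range(num_sents):                       # document-to-sentence
        A[0, 1 + s] = A[1 + s, 0] = 1
    for s, ents in sent_entities.items():            # sentence-to-entity
        for e in ents:
            A[1 + s, 1 + num_sents + e] = A[1 + num_sents + e, 1 + s] = 1
    for k, l in entity_edges:                        # entity-to-entity
        A[1 + num_sents + k, 1 + num_sents + l] = 1
    return A

def gcn_layer(A, H, W):
    """One GCN sub-layer H' = ReLU(D^-1 A H W), with row-mean normalization."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return np.maximum(D_inv @ A @ H @ W, 0.0)
```

Stacking several such layers lets each entity node aggregate information from its sentence, the document, and the entities it is affiliated with.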

Hierarchical Opinion Target Structure Identification
Till now, we have extracted the flat entities E and learned better entity representations H̃_e. In this subsection, we further propose a table-filling based method to identify the hierarchical structure among multiple opinion targets in a document. Firstly, we construct an Affiliated Relation Table T, as shown in Figure 3(a), whose rows and columns are the flat entities from the input document. The row e_i represents the upper-level entity, and the column e_j represents the lower-level entity. We concatenate the representations of e_i and e_j as the representation of the cell and then send it to a binary classifier to predict the affiliated relation:

p(y^r_ij) = softmax(W_r^T [h̃_{e_i}; h̃_{e_j}]),

where W_r ∈ R^{2d_model × 2} is the model parameter, and y^r_ij ∈ {1, 0} indicates the affiliated relation between e_i and e_j. T_ij = 1 means that e_j is affiliated with e_i.
Secondly, based on the predictions on each cell of T, we can finally obtain a hierarchical opinion target structure (a directed acyclic graph G), as shown in Figure 3(b). Each cell whose value is 1 constitutes an affiliated two-level entity pair (e.g., "lobster_roll - price"), and after decoding all cells of the entire table, we obtain the hierarchical structure.
Note that when the predicted structure contains a cycle, we delete the edge with the smallest value of p(y^r_ij = 1 | (e_i, e_j)) to ensure the output is a directed acyclic graph.
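The decoding of the table into a DAG can be sketched as follows. The 0.5 probability threshold for keeping an edge and the helper names are our own assumptions; the cycle-removal loop mirrors the rule of deleting the weakest edge:

```python
def decode_structure(prob):
    """Decode the affiliated-relation table into a DAG edge list.

    `prob[i][j]` is p(y_ij = 1 | (e_i, e_j)); an edge i->j is kept when the
    probability exceeds 0.5, and while any cycle remains, the lowest-
    probability edge on that cycle is removed.
    """
    n = len(prob)
    edges = {(i, j): prob[i][j] for i in range(n) for j in range(n)
             if i != j and prob[i][j] > 0.5}

    def find_cycle():
        # DFS with colors: 0 = unvisited, 1 = on current path, 2 = done.
        color = {}
        def dfs(u, path):
            color[u] = 1
            for v in [v for (x, v) in edges if x == u]:
                if color.get(v, 0) == 1:             # back edge: cycle found
                    k = path.index(v)
                    cyc = path[k:] + [v]
                    return list(zip(cyc, cyc[1:]))   # edges along the cycle
                if color.get(v, 0) == 0:
                    res = dfs(v, path + [v])
                    if res:
                        return res
            color[u] = 2
            return None
        for u in range(n):
            if color.get(u, 0) == 0:
                res = dfs(u, [u])
                if res:
                    return res
        return None

    while True:
        cyc = find_cycle()
        if not cyc:
            return sorted(edges)
        edges.pop(min(cyc, key=lambda e: edges[e]))  # drop the weakest edge
```

For instance, three entities linked 0→1→2→0 with probabilities 0.9, 0.8 and 0.6 lose the 0.6 edge, leaving the chain 0→1→2.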

Target-Sentiment Pair Parsing
In this section, we introduce a set of rules to parse out the target-sentiment pairs based on the predicted sentiments of flat entities and the hierarchical opinion target structure.
As shown in Figure 4(a), we firstly incorporate the flat entity sentiments obtained in Equation (5) into the hierarchical structure G. Considering that the sentiment of a lower-level entity should be embodied in the upper-level one, as the lower-level entities are part of the upper-level entities, we then introduce Algorithm 1 to update the sentiments of entities in the hierarchical structure. Specifically, as shown in line 1, we traverse the hierarchical opinion target structure G to obtain the path set P from the root node entity to the leaf node entity. For each path p_i, if the sentiment of an upper entity conflicts with that of a lower entity, the sentiment of the upper entity is updated to "Mixed", as shown in lines 2 to 10. Finally, we traverse this structure to output a set of (multi-level) target-sentiment pairs, as shown in Figure 4(b).

Table 2: Dataset statistics. #D, #P and #Sentence respectively denote the number of documents, target-sentiment pairs, and average number of sentences in each domain. #1-T, #2-T, #3-T denote the number of single-level, two-level and three-level targets, respectively.
Algorithm 1 Multi-level Entity Sentiment Updating
Input: The predicted flat entity-sentiment pair set S and the hierarchical opinion target structure G
Output: The updated sentiment set for each entity in G, denoted as Ŝ
1: Traverse G to obtain the path set P
2: for path p_i = {e_u, ..., e_j, ..., e_{j+k}, ..., e_l} ∈ P do
3:   for the upper-level entity e_j ∈ p_i do
4:     for the lower-level entity e_{j+k} ∈ p_i do
5:       if S(e_j) is not equal to S(e_{j+k}) then
6:         Ŝ(e_j) = Mixed
7:       end if
8:     end for
9:   end for
10: end for
11: return Ŝ
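Algorithm 1 can be rendered compactly in code; representing the sentiment set S as a dictionary and the paths as lists of entity names is our own illustrative choice:

```python
def update_sentiments(S, paths):
    """Algorithm 1: set an entity to "Mixed" when its predicted sentiment
    conflicts with that of any lower-level entity on some root-to-leaf path.

    Comparisons use the original predictions S, as in lines 2-10 of the
    algorithm; the updated copy S_hat is returned.
    """
    S_hat = dict(S)
    for p in paths:                       # each root-to-leaf path in G
        for j, upper in enumerate(p):
            for lower in p[j + 1:]:
                if S[upper] != S[lower]:
                    S_hat[upper] = "Mixed"
    return S_hat
```

On the Figure 1 example, the Negative "price" below the Positive "lobster_roll" turns both "lobster_roll" and "restaurant" into "Mixed", while "price" itself is unchanged.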

Model Training
Our approach is a multi-task learning framework with three components. We employ the cross-entropy between the ground truth and the prediction as the loss function for each component, and learn them jointly.
The loss functions for 1) flat entity extraction, 2) flat entity sentiment classification, and 3) entity affiliated relation prediction are:

L_e = -Σ_T Σ_{i=1}^{N} p̄(y^e_i) log p(y^e_i),
L_s = -Σ_T Σ_{e_i ∈ E} p̄(y^s_i) log p(y^s_i),
L_r = -Σ_T Σ_{e_i, e_j ∈ E} p̄(y^r_ij) log p(y^r_ij),

where p̄ is the golden one-hot distribution, p is the predicted distribution, and T and E denote the example set and entity set, respectively. The joint training loss is the sum of the three parts: L = L_e + L_s + L_r.

Table 3: The main experimental results of our approach and five baselines on the six domains. Seq2Seq (Luo et al., 2022) represents the state-of-the-art approach in the document-level TSA task; we report the results from their paper. In addition, we construct the other four strong baselines as described in Subsection 4.1.2.
BART-Extraction and T5-Extraction are adapted from the extraction-style seq2seq approach in sentence-level ABSA (Zhang et al., 2021b), using BART and T5 as backbones respectively. BART-Paraphrase and T5-Paraphrase are adapted from the paraphrase-based seq2seq approach in sentence-level ABSA (Zhang et al., 2021a), using BART and T5 as backbones respectively.

Following (Luo et al., 2022), we evaluate the document-level TSA task based on the output set of target-sentiment pairs given the input document. The precision and recall scores are calculated based on exact match between the predicted target-sentiment pairs and the ground truth. The F1 score is taken as the final evaluation metric.
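The exact-match evaluation can be sketched as follows, representing each pair as a tuple of entity levels plus a sentiment label (this data layout is our own illustrative choice, not the official scorer):

```python
def pair_f1(pred, gold):
    """Exact-match precision/recall/F1 over target-sentiment pairs.

    A pair such as (("restaurant", "lobster_roll", "price"), "Negative")
    counts as correct only if every level of the target and the sentiment
    both match the ground truth.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)                            # exact matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A target with one wrong level (e.g. a missing upper entity) therefore contributes to neither precision nor recall, which is what makes multi-level targets hard to score.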

Implementation Details
We employ DeBERTaV3-base (He et al., 2021) as the backbone encoder of our approach, whose hidden size is 768 and whose maximum input length is 512. To keep the number of model parameters comparable, we chose T5-small (Raffel et al., 2020) as the backbone of the baselines designed in this paper. During training, the learning rate for fine-tuning the pre-trained language model is set to 3e-5, the other learning rates are set to 5e-5, and the dropout rate is 0.1. We set the batch size to 8 and the number of training epochs to 30. We save the model parameters with the highest F1 score on the validation set. During testing, we report the F1 score for each domain averaged over five different random seeds.

Main Results
In Table 3, we report the results of our approach and five baselines on the six domains. It can be observed that our method outperforms all the baselines on average F1. Furthermore, in comparison with the four strong baselines we designed, our approach still achieves an average F1 improvement larger than 2.1%. Specifically, the improvements are 4.58%, 1.94%, 3.05%, 0.39% and 0.51% on Books, Clothing, Restaurant, Hotel and News, respectively, compared with the best results among all baselines. All the improvements are significant based on a paired t-test.
Furthermore, it can be seen that our approach gains more improvement in the Books and Restaurant domains, which have longer documents and more across-sentence targets. This indicates the strength of our approach in identifying complex opinion targets from long documents.
An exception is that our approach is not optimal on the PhraseBank domain (slightly lower than T5-Extraction and T5-Paraphrase). According to Table 2, the document length of PhraseBank is the shortest of the six domains and its average number of sentences is only 1.02. It is reasonable that our approach does not show a significant advantage in this case.

The Impact of Document Length
We further investigate the performance of our approach on test subsets with different document lengths. Based on the number of sentences in the document, we divide the test set of each domain into subsets of short documents (1 or 2 sentences), medium documents (3 or 4 sentences), and long documents (5 or more sentences).
As shown in Figure 5, in the Books, Clothing, Restaurant, and Hotel domains, the corresponding F1 scores consistently decrease as the document length increases. This illustrates the challenge of the document-level TSA task: the longer the document, the more difficult it is to accurately extract the multi-level opinion targets and sentiments from it.

The Impact of Levels in Opinion Target
Table 4 reports the performance of our approach in extracting opinion targets with different levels. As in Table 2, 1-T, 2-T, and 3-T represent the number of levels in an opinion target. We report the results on test subsets divided into 1-T, 2-T, and 3-T, respectively. We do not report results on News and PhraseBank, as they contain relatively few multi-level opinion targets.
It can be seen that the F1 score decreases significantly as the number of levels increases. This indicates the challenge of the document-level TSA task from another angle: the more levels of entities an opinion target has, the more difficult the task becomes.
In comparison with the BART-Extraction seq2seq method, our approach achieves consistent and stable improvements at different levels. The average improvements are 7.13%, 6.86%, and 3.88% on 1-T, 2-T, and 3-T, respectively.

Table 6: The performance of our approach with and without the Multi-grain Graphical Model (on opinion targets with multi-level entities).

The Effect of the Multi-grain Graphical Model
In this part, we conduct an ablation study on the GCN to examine the effect of the multi-grain graphical model. Firstly, in Table 5 we report the performance of our approach with and without the Multi-grain Graphical Model on the entire test set. It shows that removing the multi-grain graphical model causes an average decrease of 0.84% across the six domains. The decrease is significant according to a paired t-test.
Secondly, to analyze the effect of GCN on opinion targets with multi-level entities, we divide the multi-level opinion targets in the test set into a Within-Sentence subset and an Across-Sentence subset, where Within-Sentence denotes that the multi-level entities are within one sentence, and Across-Sentence denotes that they span multiple sentences. The results in Table 6 show that removing the GCN causes a 2.59% and 6.16% drop on Within-Sentence and Across-Sentence respectively. This confirms the advantage of the multi-grain graphical model in capturing long-distance dependencies among affiliated entities, especially the across-sentence ones.
Finally, in Figure 6 we display the performance of our approach with and without the GCN as the number of sentences in the document increases. It can be observed that the performance of our approach with GCN is consistently higher than that without GCN. As the number of sentences increases, the improvement becomes larger in general. Both observations suggest the effectiveness of the multi-grain graphical model in modeling long documents.

Discussion on the Place of Sentiment Classification
In our approach, we perform sentiment classification at the stage of flat entity extraction. A natural question is then raised: which stage is the most suitable for sentiment classification?
To answer this question, we design a variant of our approach, Seq2Struct_variant, which predicts the sentiment after obtaining the hierarchical opinion target structure, and report its performance in Table 7. It can be seen that the F1 score of Seq2Struct_variant decreases by 0.7% on average. We speculate that the possible reason is that the opinion expression often appears near the entity and has little relation to the structure of opinion targets. It may hence be more effective to perform sentiment classification on the flat entities.

Case Study
In Figure 7, we conduct a case study by displaying the outputs of our approach (Seq2Struct) and Seq2Seq. In comparison, Seq2Struct can more explicitly display the hierarchical structure of opinion targets in a document and more accurately predict the corresponding sentiments, across different document lengths. For example, in the short document 1, Seq2Struct can predict the upper entity "Danskin" of "quality". In document 2, which is slightly longer, Seq2Struct can predict the pair that Seq2Seq misses ("read-personalities", Positive). In the longer document 3, Seq2Seq predicts the wrong pair ("tights-size B", Negative), while Seq2Struct predicts all pairs correctly. Furthermore, Seq2Struct can accurately predict distant hierarchical entities. For example, in document 5, Seq2Struct predicts the pair ("Alexander Cipher-character", Positive), where "Alexander Cipher" and "character" are separated by four long sentences.
In addition, Seq2Struct can recall more entities. For example, in document 4, the predicted entities of Seq2Struct are "Heel color", "Front toes area", and "appearance".

Related Work
ABSA is a broad research area which includes various tasks. Schouten and Frasincar (2016) and Zhang et al. (2022a) provided comprehensive surveys of these subtasks. In this paper, due to space limitations, we only review the related tasks.
End-to-end ABSA, the task of joint aspect extraction and aspect-based sentiment classification (also called targeted sentiment analysis in some references), has received wide attention (Mitchell et al., 2013; Zhang et al., 2015; Poria et al., 2016; Hu et al., 2019; Li et al., 2019a,b; Jiang et al., 2019; Chen and Qian, 2020; Yu et al., 2021b; Hamborg and Donnay, 2021). However, all these studies were performed at the sentence level. In this paper, we focus on targeted sentiment analysis at the document level. Unlike the opinion target at the sentence level, which is normally a single entity or aspect, the opinion target studied in this work often contains multi-level entities.
Among the massive number of ABSA studies, only a few focused on the document level. Titov and McDonald (2008) proposed a statistical model to extract textual evidence for aspect categories and predict sentiment ratings for different categories in a review document. Lei et al. (2016) proposed an encoder-generator framework to extract rationales for aspect categories and predict aspect category sentiment ratings. Yin et al. (2017) modelled aspect category sentiment rating as a machine comprehension problem. Li et al. (2018) designed a hierarchical network for aspect category sentiment considering both user preference and overall rating. Wang et al. (2019) proposed a hierarchical reinforcement learning approach to interpretably predict aspect category sentiment ratings. However, all these studies focused on identifying the sentiments of aspect categories in a document. By contrast, we extract the explicit opinion entity terms in a document and organize them in a hierarchical structure. From the perspective of methodology, Graph Convolutional Networks (GCNs) have been widely used in ABSA (Zhang et al., 2019; Sun et al., 2019; Cai et al., 2020; Liang et al., 2021; Hou et al., 2021; Tian et al., 2021; Zhang et al., 2022b; Chen et al., 2022). However, most of these studies employed GCNs to model the relations between different entities within a single sentence. Different from that, in this work a multi-grain graphical model is proposed to learn the affiliated relations among entities across multiple sentences in a document.
Table filling, a method that predicts the relation between any two targets by filling a table, has received much attention in the entity and relation extraction task (Miwa and Sasaki, 2014; Gupta et al., 2016; Wang and Lu, 2020) and the open information extraction task (Yu et al., 2021a). In the ABSA task, Wu et al. (2020) and Jing et al. (2021) proposed to use table filling to tag aspect terms, opinion terms, and the relations between them. In contrast, in this work we use table filling to model the affiliated relation between two entities.

Conclusion
In this work, we focus on a document-level ABSA task, called document-level TSA, which aims to extract opinion targets consisting of multi-level entities from a review document and predict the corresponding sentiments. Different from the existing Seq2Seq methodology, we propose a Sequence-to-Structure (Seq2Struct) approach to address this task, modeling the hierarchical structure among multiple opinion targets and capturing the long-distance dependencies among affiliated entities. Experiments have verified the advantages of our Seq2Struct approach in more accurately extracting multi-level opinion targets and predicting their sentiments, and in more explicitly displaying the hierarchical structure of the opinion targets in a document.

Limitations
This paper focuses on addressing the task of document-level TSA, which, along with its dataset, has been recently introduced. Our approach is primarily designed to tackle the challenge of extracting the affiliated relations among entities in an in-domain setting. Nevertheless, this task remains challenging, particularly in aspects such as long-document encoding, the coreference problem, and the open-domain setting. We welcome more researchers to explore this task.

Figure 1 :
Figure 1: Comparison of two different approaches for the document-level TSA task. Text chunks in blue represent flat entities, and multi-level entities are connected with "-" to form an opinion target. Text chunks in red indicate the sentiment polarities of opinion targets.

Figure 3 :
Figure 3: An illustration of Hierarchical Opinion Target Structure Identification.

Figure 4 :
Figure 4: The process of Target-Sentiment Pair Parsing.For the left figure, each connected pair of blue and red nodes represents a flat entity and its corresponding sentiment, respectively.For the right figure, multi-level entities within the dashed box constitute an opinion target, and the red node indicates the updated sentiment polarities of opinion target.

Figure 5 :
Figure 5: The performance of our approach on three types of document length.

Figure 6 :
Figure 6: The average performance of six domains on different document lengths.

Figure 7 :
Figure 7: The case study comparing the Seq2Seq approach and our Seq2Struct approach.

Table 4 :
The performance on different levels of targets, where "−" means that the corresponding domain does not have corresponding targets.

Table 5 :
The performance of our approach with and without the Multi-grain Graphical Model (on the entire test set).

Table 7
[Example review documents from the case study in Figure 7 omitted.]