A Self-enhancement Multitask Framework for Unsupervised Aspect Category Detection

Our work addresses the problem of unsupervised Aspect Category Detection using a small set of seed words. Recent works have focused on learning embedding spaces for seed words and sentences to establish similarities between sentences and aspects. However, aspect representations are limited by the quality of the initial seed words, and model performance is compromised by noise. To mitigate these limitations, we propose a simple framework that automatically enhances the quality of the initial seed words and selects high-quality sentences for training instead of using the entire dataset. Our main ideas are to add new seed words to the initial set and to treat noise resolution as a data-augmentation problem for a low-resource task. In addition, we jointly train Aspect Category Detection with Aspect Term Extraction and Aspect Term Polarity to further enhance performance. This approach facilitates shared representation learning, allowing Aspect Category Detection to benefit from the additional guidance offered by the other tasks. Extensive experiments demonstrate that our framework surpasses strong baselines on standard datasets.


Introduction
Aspect Category Detection (ACD), Aspect Term Extraction (ATE), and Aspect Term Polarity (ATP) are three challenging sub-tasks of Aspect-Based Sentiment Analysis, which aim, respectively, to identify the aspect categories discussed (e.g., FOOD), extract all aspect terms present (e.g., 'fish', 'rolls'), and determine the polarity of each aspect term (e.g., 'positive', 'negative') in each sentence. To better understand these problems, consider the example in Table 1.
Unsupervised ACD has mainly been tackled by clustering sentences and manually mapping these clusters to corresponding golden aspects using top representative words (He et al., 2017; Luo et al., 2019; Tulkens and van Cranenburgh, 2020; Shi et al., 2021). However, this approach faces a major problem with the mapping process, which requires manual labeling and a strategy to resolve aspect label inconsistency within the same cluster. Recent works have introduced seed words to tackle this problem (Karamanolakis et al., 2019; Nguyen et al., 2021; Huang et al., 2020). These works direct their attention to learning the embedding space for sentences and seed words to establish similarities between sentences and aspects. As such, aspect representations are limited by a fixed small number of initial seed words. Li et al. (2022) is one of the few works that aims to expand the seed word set from the vocabulary of a pretrained model. However, there is overlap among the additional seed words across different aspects, resulting in reduced discriminability between the aspects. Another challenge for the unsupervised ACD task is the presence of noise, which comes from out-of-domain sentences and incorrect pseudo labels. This occurs because data is often collected from various sources, and the pseudo labels are usually generated using a greedy strategy. Current methods handle noisy sentences by either treating them as having a GENERAL aspect (He et al., 2017; Shi et al., 2021) or forcing them to have one of the desired aspects (Tulkens and van Cranenburgh, 2020; Huang et al., 2020; Nguyen et al., 2021). To address incorrect pseudo labels, Huang et al. (2020) and Nguyen et al. (2021) attempt to assign lower weights to uncertain pseudo-labels. However, these approaches might still hinder model performance, as models are learned from a large amount of noisy data.
In this paper, we propose A Self-enhancement Multitask (ASeM) framework to address these limitations. Firstly, to enhance the understanding of aspects and reduce reliance on the quality of initial seed words, we propose a Seedword Enhancement Component (SEC) to construct a high-quality set of seed words from the initial set. The main idea is to add keywords that have limited connections with the initial seed words. Connections are determined by the similarity between a keyword's context (the sentence containing the keyword) and the initial seed words. In this way, our task is simply to find sentences with low similarity to the initial seed words and then extract their important keywords to add to the seed word set. Since pseudo-label generation relies on the similarity between sentences and seed words, we expect the enhanced seed word set to provide sufficient knowledge for our framework to generate highly confident pseudo-labels for as many sentences as possible. Secondly, to reduce noise in the training data, instead of treating noisy sentences as having a GENERAL aspect (He et al., 2017; Shi et al., 2021) or forcing them to have one of the desired aspects (Tulkens and van Cranenburgh, 2020; Huang et al., 2020; Nguyen et al., 2021), we propose to leverage a retrieval-based data augmentation technique to automatically search for high-quality data from a large training dataset. To achieve this, we leverage a paraphrastic encoder (e.g., Arora et al. (2017); Ethayarajh (2018); Du et al. (2021)), trained to output similar representations for sentences with similar meanings, to generate sentence representations that are independent of the target task (ACD). Then, we generate task embeddings by passing the prior knowledge (e.g., seed words) about the target task to the encoder. These embeddings are used as queries to retrieve similar sentences from the large training dataset (data bank). In this way, our framework aims to effectively extract domain-specific sentences that can be well understood based on seed words, regardless of the quality of the training dataset.
Another contribution of our work concerns the use of multi-task learning for unsupervised ACD, ATE, and ATP in a neural network. Intuitively, employing multi-task learning enables ACD to leverage the benefits of ATE and ATP. Referring to the example in Figure 1, ATE extracts Opinion Target Expressions (OTEs): 'fish' and 'rolls', which requires understanding the emotion 'positive' (expressed as 'unquestionably fresh') for 'fish' and 'negative' (expressed as 'inexplicably bland') for 'rolls'. Meanwhile, ACD aims to detect the aspect category of the sentence, which requires awareness of the presence of 'fish' and 'rolls' (terms related to the aspect FOOD) within the sentence. Despite the usefulness of these relationships for ACD, the majority of existing works do not utilize this information. Huang et al. (2020) is one of the few attempts to combine learning the representations of aspect and polarity at the sentence level before passing them through separate classifiers.
Our contributions are summarized as follows: • We propose a simple approach to enhance aspect comprehension and reduce the reliance on the quality of initial seed words. Our framework achieves this by expanding the seed word set with keywords, based on their uncertain connection with the initial seed words.
• A retrieval-based data augmentation technique is proposed to tackle training data noise. In this way, only data that connects well with the prior knowledge (e.g., seed words) about the target task is used during training. As a result, the model automatically filters out out-of-domain data and low-quality pseudo labels.
• We propose to leverage a multi-task learning approach for unsupervised ACD, ATE, and ATP, aiming to improve ACD through the utilization of additional guidance from other tasks during the shared representation learning.
• Comprehensive experiments are conducted to demonstrate that our framework outperforms previous methods on standard datasets.

Related Works
Topic models were once the dominant approach for unsupervised Aspect Category Detection (Brody and Elhadad, 2010; Mukherjee and Liu, 2012; Chen et al., 2014). However, they can produce incoherent aspects. Recently, neural network-based methods have been developed to address this challenge.
Cluster Mapping-based resolvers: These methods utilize neural networks to cluster effectively and manually map (many-to-many mapping) the clusters to their corresponding aspects. They employ attention-based autoencoders (He et al., 2017; Luo et al., 2019) or a contrastive learning approach (Shi et al., 2021) for clustering. Shi et al. (2021) further enhance performance by using knowledge distillation to learn labels generated after clustering.
Seed words-based resolvers: These approaches automate the aspect category mapping process by utilizing seed words that indicate aspect appearance. Angelidis and Lapata (2018) use the weighted sum of seed word representations as aspect representations, allowing one-to-one mapping in the auto-encoder model. Recent works focus on learning embedding spaces for sentences and seed words, generating pseudo labels for weakly supervised learning. They use Skip-gram (Mikolov et al., 2013) for embedding space learning and convolutional neural networks or linear layers for classification (Huang et al., 2020; Nguyen et al., 2021). Huang et al. (2020) jointly learn ACD with Sentence-level ATP, while Nguyen et al. (2021) consider the uncertainty of the initial embedding space. Without any human supervision, Tulkens and van Cranenburgh (2020) and Li et al. (2022) rely solely on label names, which act similarly to seed words. Tulkens and van Cranenburgh (2020) detect aspects using cosine similarity between pre-trained aspect and label name representations, while Li et al. (2022) train the clustering model with instance-level and concept-level constraints.

Method
Our framework addresses three tasks for which no annotated data is available: Aspect Category Detection (ACD), Aspect Term Extraction (ATE), and Aspect Term Polarity (ATP). ACD involves assigning a given text to one of K pre-defined aspects of interest. ATE extracts OTEs in the text. ATP assigns a sentiment to each OTE. Note that, during training, we do not use any human-annotated samples, but rather rely on a small set of seed words to provide supervision signals.
Our framework, called ASeM (short for A Self-enhancement Multitask framework), consists of three key components: (i) pseudo-label generation, (ii) retrieval-based data augmentation, and (iii) classification. Figure 1 presents an overview of the framework. Initially, we extract a small subset of the training data to serve as the task-specific in-domain data. Based on the quality of the initial seed words on this dataset, we utilize SEC to expand the set of seed words in order to enhance its quality. By feeding the task-specific in-domain data and enhanced seed words to the pseudo-label generation component, we obtain high-quality pseudo labels for the task-specific in-domain data. Then, we leverage retrieval-based augmentation to increase the number of training samples from the data bank (the remaining part of the training data), based on our prior knowledge of the target task (seed words and task-specific in-domain data with high-quality pseudo labels). Finally, the high-quality pseudo labels and augmented data are passed through a multitask classifier to predict the task outputs.

Pseudo-label generation
The first step in our framework is to generate pseudo-labels for the three subtasks ACD, ATE, and ATP on a small unannotated in-domain dataset. In detail, the pseudo-labels for the tasks are created as follows: Aspect Category Detection: First, we map dictionary words into an embedding space by training CBOW (Mikolov et al., 2013) on the unlabeled training data. Second, we embed each sentence from the task-specific in-domain data as s = sum(w_1, w_2, ..., w_n), in which w_i is the representation of the i-th word and n is the sentence length. Similarly, the aspect category representation is a_i = sum(w_{i1}, w_{i2}, ..., w_{il_i}), in which w_{ij} is the representation of the j-th seed word of the i-th aspect and l_i is the number of seed words of the i-th aspect; the seed words of aspect a_i are the union of the given set of initial seed words and the set T_{a_i} of additional seed words. The aspect category pseudo-label of a sentence s is then defined as ŷ(s) = argmax_{1 ≤ i ≤ K} sim(s, a_i), where sim(s, a_i) is the similarity between sentence s and aspect a_i, computed from the sentence and aspect representations s and a_i together with the representations of the words w in the word set s′ of the sentence.
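The pseudo-label assignment described above can be sketched in a few lines. This is a minimal illustration with toy vectors, not the paper's implementation: it assumes plain cosine similarity between the summed sentence vector and each summed aspect (seed word) vector, whereas the paper's sim(·) may also incorporate word-level scores.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vec_sum(vectors):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(col) for col in zip(*vectors)]

def pseudo_label(sentence, seed_words, embed):
    """Assign the aspect whose summed seed-word vector is most
    similar to the summed sentence vector.

    sentence: list of tokens; seed_words: {aspect: [seed, ...]};
    embed: {word: vector} (e.g., from a trained CBOW model)."""
    s = vec_sum([embed[w] for w in sentence if w in embed])
    scores = {}
    for aspect, seeds in seed_words.items():
        a = vec_sum([embed[w] for w in seeds if w in embed])
        scores[aspect] = cosine(s, a)
    return max(scores, key=scores.get), scores
```

With a toy embedding table, a sentence about fresh fish is assigned FOOD rather than SERVICE because its summed vector lies closer to the FOOD seed vectors.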
As discussed in the introduction, our framework proposes SEC, described in Algorithm 1, to obtain T_{a_i}. To begin, we generate temporary pseudo labels for all given sentences using the initial seed words. Based on the obtained pseudo-labels, we extract the nouns and adjectives (called keywords) in the sentences of each aspect label, then extract the keywords that appear in multiple aspects (called boundary keywords) to obtain T_b. At line 6, we calculate the connection between a sentence and the initial seed words as the difference between the similarity of the sentence with its two most similar aspects. Note that if Connection(s) ≥ γ, in which γ is a hyper-parameter, the sentence s is considered to have a certain connection with the seed words, and if Connection(s) < γ, the connection is uncertain. At line 12, we extract keywords from the sentences with uncertain connections to obtain T_u. Finally, the intersection of T_b and T_u is mapped to the relevant aspects. We utilize a variant of the clarity scoring function (Cronen-Townsend et al., 2002) for the automatic mapping. Clarity measures the likelihood of observing a word w in the subset of sentences related to aspect a_i, as compared to a_j. A higher score indicates a greater likelihood of word w being related to aspect a_i.
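A simplified sketch of the SEC selection step follows. The function names are hypothetical, and the real Algorithm 1 additionally maps each selected keyword to its aspect via the clarity score; here we only show how boundary keywords (extracted under more than one aspect) intersect with keywords from uncertainly connected sentences.

```python
def connection(scores):
    # Margin between the two most similar aspects for one sentence;
    # assumes at least two aspects were scored.
    top = sorted(scores.values(), reverse=True)
    return top[0] - top[1]

def seedword_candidates(sent_keywords, sent_scores, gamma):
    """Return candidate additional seed words: T_b ∩ T_u.

    sent_keywords: list of (keyword_set, pseudo_aspect) per sentence;
    sent_scores:   list of {aspect: similarity} per sentence;
    gamma:         connection threshold (hyper-parameter)."""
    by_aspect = {}
    uncertain = set()  # T_u: keywords of uncertainly connected sentences
    for (kws, aspect), scores in zip(sent_keywords, sent_scores):
        by_aspect.setdefault(aspect, set()).update(kws)
        if connection(scores) < gamma:
            uncertain.update(kws)
    # T_b: boundary keywords appearing under more than one aspect
    boundary = {w for a, ws in by_aspect.items()
                for w in ws
                if any(w in ws2 for a2, ws2 in by_aspect.items() if a2 != a)}
    return boundary & uncertain
```

In this toy setting, 'fish' is a boundary keyword (it appears under both FOOD and SERVICE pseudo-labels) and also comes from an uncertain sentence, so it becomes a candidate for seed word expansion.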
Concretely, the standard clarity-style score has the form clarity_{i,j}(w) = t_{a_i}(w) · log( t_{a_i}(w) / t_{a_j}(w) ), where t_{a_i}(w) and t_{a_j}(w) correspond to the l1-normalized TF-IDF scores of w in the sentences pseudo-labeled with aspects a_i and a_j, respectively.
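As a hedged illustration, the clarity-style comparison between two aspects could be coded as follows; the exact variant used in the paper may differ (e.g., in the log base or smoothing), so treat the constants here as assumptions.

```python
import math

def clarity(w, tfidf_i, tfidf_j, eps=1e-9):
    """Clarity-style score of word w for aspect a_i versus a_j.

    tfidf_i / tfidf_j: {word: l1-normalized TF-IDF} over the sentences
    pseudo-labeled with a_i and a_j. A higher (positive) score means w
    is more indicative of a_i; eps avoids log(0) for unseen words."""
    ti = tfidf_i.get(w, eps)
    tj = tfidf_j.get(w, eps)
    return ti * math.log2(ti / tj)
```

A word that is frequent under a_i but rare under a_j gets a positive score, and vice versa, which is what drives the automatic keyword-to-aspect mapping.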
In the training process, after obtaining the pseudo labels, SEC recalculates the certainty of the connections in a manner similar to lines 5 to 10 of Algorithm 1, then removes the set S_u of sentences with uncertain connections from S.
Aspect Term Extraction: We extract aspect terms by considering all nouns that appear more than m times in the corpus.
Aspect Term Polarity: After generating aspect term pseudo-labels, we find polarity pseudo-labels of terms based on the context window around them.In detail, the generation will be carried out similarly to the ACD subtask with the input being the context window and polarity seed words.

Retrieval-based Data Augmentation
To select high-quality data from an unannotated data bank containing noise, we select sentences whose content is similar to the reliable knowledge we have about the target task, for example, the seed words and a small in-domain dataset having certain connections with the seed words. To do this, we first utilize a paraphrastic sentence encoder to create representations for sentences in the data bank and for the target task. The task embedding is used as a query to find high-quality sentences in the sentence bank. Figure 2 illustrates our retrieval-based data augmentation. The sentence encoder, task embedding, and data retrieval process are as follows: Sentence Encoder: We leverage a paraphrastic encoder to generate similar representations for semantically similar sentences. In detail, the encoder is a Transformer pre-trained with masked language modeling (Kenton and Toutanova, 2019; Conneau and Lample, 2019). It is fine-tuned with a triplet loss L(x, y) = max(0, α − cos(x, y) + cos(x, y_n)) on paraphrases from Natural Language Inference entailment pairs (Williams et al., 2018), Quora Question Pairs, round-trip translation (Wieting and Gimpel, 2018), web paraphrases (Creutz, 2018), OpenSubtitles (Lison et al., 2018), and Europarl (Koehn, 2005) to maximize the cosine similarity between similar sentences. Positive pairs (x, y) are either paraphrases or parallel sentences (Wieting et al., 2019), and the negative y_n is selected to be the hardest in the batch.
Task embedding: The task embedding shares the paraphrastic encoder with the sentence embeddings and embeds the prior knowledge about the target task (seed words and task-specific in-domain data with high-quality pseudo labels). Given this prior knowledge, each representation of a seed word or sentence in it is considered a representation of the target task.
Unsupervised Data Retrieval: We use the task embeddings as queries to retrieve a subset of the large sentence bank. For each task embedding, we select the k nearest neighbors based on cosine similarity.
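A minimal sketch of this retrieval step, assuming precomputed embedding vectors and exact cosine similarity; a production system would typically use an approximate nearest-neighbor index instead of the brute-force scan shown here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(task_embeddings, bank, k):
    """For each task embedding (query), take the k nearest bank
    sentences by cosine similarity; return the union of indices."""
    selected = set()
    for q in task_embeddings:
        sims = [(cosine(q, b), i) for i, b in enumerate(bank)]
        sims.sort(reverse=True)            # most similar first
        selected.update(i for _, i in sims[:k])
    return sorted(selected)
```

Because the selected set is the union over all queries, sentences close to any piece of prior knowledge (a seed word or a confidently labeled in-domain sentence) are kept, while out-of-domain sentences are never retrieved.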

Classification
In this component, we train a neural network with multitask learning of ACD, ATE, and ATP. We expect that multitask learning can provide additional guidance for ACD from the other tasks during the shared representation learning, resulting in improved ACD performance. In detail, given an input sentence W = [w_1, ..., w_n] with n words, we employ a pretrained language model, e.g., RoBERTa (Liu et al., 2019), to generate a shared sequence of contextualized embeddings for the words (using the average of the hidden vectors of the word-pieces in the last layer of the pretrained language model). Then, for each task t_i ∈ {ACD, ATE, ATP}, we feed the vector sequence into a feed-forward network to obtain the corresponding probability scores p^{t_i}_W. Note that we treat the ACD problem as a sentence classification problem, while ATE and ATP are treated as BIO sequence labeling problems.
The model is trained on the dataset X_l, which consists of the task-specific in-domain data and the augmented data, along with their corresponding pseudo labels y^{t_i}_l generated by the pseudo-label generation component. The loss function is the weighted sum of the per-task cross-entropy losses, L = Σ_{t_i ∈ {ACD, ATE, ATP}} λ_i · L_{t_i}, in which L_{t_i} is the cross-entropy loss of task t_i computed over its label set C_{t_i}, and λ_i is a hyperparameter determining the weight of task t_i.
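The weighted multitask objective can be illustrated as follows. This toy version operates on already-softmaxed probability lists rather than model logits, and the batch layout is a hypothetical simplification of the paper's training loop.

```python
import math

def cross_entropy(probs, label):
    # Negative log-likelihood of the gold (pseudo) label.
    return -math.log(probs[label])

def multitask_loss(batch, lambdas):
    """Weighted sum of per-task mean cross-entropy losses.

    batch:   {task: [(prob_distribution, pseudo_label), ...]}
    lambdas: {task: weight λ_i}  (e.g., 1, 0.8, 0.6 for ACD/ATE/ATP)"""
    total = 0.0
    for task, examples in batch.items():
        task_loss = sum(cross_entropy(p, y) for p, y in examples) / len(examples)
        total += lambdas[task] * task_loss
    return total
```

Down-weighting the auxiliary tasks (λ < 1 for ATE and ATP) keeps ACD as the primary objective while still letting the shared encoder benefit from the extra supervision.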

Datasets
Following previous works (Huang et al., 2020; Shi et al., 2021; Nguyen et al., 2021), we conduct experiments on three benchmark datasets from different domains, as described in Table 2.
Restaurant/Laptop contain reviews about restaurants/laptops (Huang et al., 2020). We use the training dataset as the sentence bank and split the SemEval training set (Pontiki et al., 2015, 2016) into task-specific in-domain data and a dev set with a ratio of 0.85. The test set is taken from the SemEval test set (Pontiki et al., 2015, 2016). Restaurant contains labeled data for all three tasks, where ACD has five aspect category types and ATP has two polarity types. Laptop only has labels for ACD, with eight aspect category types, and Sentence-level ATP, with two polarity types. For initial seed words, following Huang et al. (2020) and Nguyen et al. (2021), we have five manual seed words and five automatic seed words for each label of ACD and Sentence-level ATP.
CitySearch contains reviews about restaurants (Ganu et al., 2009); its test set only contains labeled ACD data with three aspect category types. Similarly to Restaurant/Laptop, we split the SemEval training set (Pontiki et al., 2014, 2015, 2016) into task-specific in-domain data and a dev set with a ratio of 0.85. For seed words, similar to Tulkens and van Cranenburgh (2020), we use the aspect label words food, ambience, and staff as seed words for CitySearch.
Following previous works, we remove the multi-aspect sentences from all datasets and only evaluate ATE and ATP on sentences with at least one OTE; multi-polarity sentences are removed when evaluating Sentence-level ATP.


Hyper-Parameters
For the learning framework, the best parameters and hyperparameters are selected based on the validation sets. For pseudo-label generation, we use nltk for tokenization and Gensim to train CBOW with an embedding size of 200, 10 epochs, a window of 10, and a negative sample size of 5. The connection threshold γ is tuned within the range [0, 700], the ATP context window size is tuned within the range [20, 100], and m is set to 2. For the ATE task, we use nltk for POS tagging and textblob to extract nouns, noun phrases, and adjectives. For retrieval-based data augmentation, we use the paraphrastic encoder pretrained by Du et al. (2021) and tune the number of nearest neighbors k in the range [1, 20] with step 1. For the classification network, we use RoBERTa-large (Liu et al., 2019) with batch size 16, learning rate 1e-5, the AdamW optimizer, and weight decay 1e-5; the task weights λ_{t_i} are set to 1, 0.8, and 0.6 for ACD, ATE, and ATP, respectively. We report the average performance of five runs with different random seeds.
Concerning the utilization of seed words to guide inference for large generative LLMs, such as GPT-3.5, we conducted an experiment using the gpt-3.5-turbo model (from OpenAI) to infer aspect labels. The prompt used is described in detail in section A.

Evaluation
First, we report the results of our framework on the ACD task in Table 3, and on ATP and ATE in Table 4. During training, we utilize multitask learning for ACD, ATP, and ATE. However, due to the limited labeled tasks in the test datasets, we only evaluate the tasks with labels. Further details can be found in subsection 4.1. The results of prior methods were collected from the respective works. For initial seed words, we employ the seed words recommended by Huang et al. (2020) for Restaurant and Laptop, as well as the seed words suggested by Tulkens and van Cranenburgh (2020) for CitySearch. Following Huang et al. (2020), we evaluate ACD performance by accuracy and macro-F1. Similarly, we evaluate ATP(s) (short for Sentence-level ATP) and ATP (short for Term-level ATP) by accuracy and macro-F1, and ATE by F1-score. As can be seen, despite using less human supervision than manual mapping-based methods, seed word-based methods yield competitive results. The results compared with GPT-3.5 are reported in Table 5. In detail, while GPT-3.5 performs well on the food, drinks, and service aspects and demonstrates superior accuracy, our model outperforms it on the location and ambience aspects as well as in overall macro-F1 score. It is not surprising that GPT-3.5 demonstrates strong performance on this dataset due to its immense underlying knowledge base. However, when considering factors like scalability, computational demands, complexity, and inference time, our method exhibits competitive results.

Overall, ASeM demonstrates state-of-the-art performance in Aspect Category Detection across various domains, providing clear evidence of the effectiveness of the proposed framework.
Ablation Study: To investigate the impact of each proposed component of ASeM, we evaluate our ablated framework on the Restaurant dataset. Table 6 allows us to assess the contribution of each proposed component to the overall performance of ASeM. W/o Seedword Enhancement Component. Ablating the SEC eliminates the search for additional seed words T_a, i.e., we set T_a = ∅. It can be observed that removing SEC significantly impairs the accuracy of ACD compared to the other tasks, clearly demonstrating the benefits of expanding the seed words for ACD. W/o Retrieval-based Data Augmentation. We ignore data augmentation and follow previous works (Huang et al., 2020; Nguyen et al., 2021) by training our classifier on the entire training data. As presented in the table, the -aug model exhibits a significant performance decrease compared to ASeM; in 3 out of 4 tasks it performs the worst among the ablated versions. This clearly demonstrates the adverse impact of noise on the framework's accuracy and the limitations it imposes on the contributions of the other components, as well as the effectiveness of the proposed method in addressing noise. W/o Multi-task Learning. Next, we ablate multi-task learning and train the three tasks (ACD, ATP, and ATE) separately.

Figure 3: ACD performance of seed words-based methods with different initial seed word sets. Label names is derived from the aspect name (e.g., 'food'), Auto extraction is automatically extracted from a small labeled dataset (Angelidis and Lapata, 2018), and Manual extraction is manually extracted by experts (Huang et al., 2020).

Analysis
In this section, we thoroughly analyze the performance of our framework concerning the quality of seed words and training data.

The quality of seed words
Firstly, we examine the effect of the initial seed words on the performance of our framework. Figure 3 shows the accuracy of our framework using different initial seed word sets on the Restaurant dataset, compared with other state-of-the-art seed word-based methods. As can be seen, both JASen and UCE exhibit significant variations in performance when the initial seed word sets change, highlighting their strong dependence on the quality of the initial seed words. Meanwhile, ASeM demonstrates good adaptation to the quality of the initial seed words, delivering promising results across all three sets. In addition, SEC aids in identifying previously unextracted aspect terms (e.g., martinis, hot dogs), as evidenced by the improved ATE performance, providing insight into how SEC enhances the performance of ACD.

Keyword distinction
In this section, we carry out experiments on adding weights to distinguish keywords/sentences based on uncertain connections, whereas ASeM treats seed words and uncertain keywords as equal contributors to the aspect category representation. Table 7 shows that our approach is simpler and more efficient than prior works, and that optimally setting weights to preserve the inherent properties of the data is challenging. In detail, our model accurately associates the sentiment of a sentence with the corresponding aspect, thereby enhancing ACD performance, as demonstrated by rows 3 and 4 of Table 8.

Figure 4: ACD and ATE performances with different seed word addition strategies. Our baselines: random adds seed words randomly from the vocabulary, mapped to their aspects by Eq. 2, while popularity selects the most frequently occurring important words (nouns/adjectives) of each aspect based on a small labeled dataset (the dev set). PCCT is a set of additional seed words proposed by Li et al. (2022), extracted from the vocabulary of a pre-trained model.

Retrieval-based Augmentation
In this subsection, we examine the impact of transforming the seed word-based unsupervised learning problem into a data augmentation task for a low-resource task. As observed in Figure 5, our framework's performance shows a substantial improvement in the initial phase but gradually declines afterward. This decline can be attributed to an excessive increase in neighbors, which leads to the inclusion of misaligned data that does not connect well with the target task's prior knowledge (e.g., seed words). Consequently, the pseudo-label generation becomes insufficient to provide accurate predictions, resulting in compromised classification and decreased performance.

Conclusion
In this work, we propose a novel framework for ACD that achieves three main goals: (1) enhancing aspect understanding and reducing reliance on the initial seed words, (2) effectively handling noise in the training data, and (3) self-boosting supervision signals through multi-task learning of three unsupervised tasks (ACD, ATE, ATP) to improve performance. The experimental results demonstrate that our model outperforms the baselines and achieves state-of-the-art performance on three benchmark datasets. In the future, we plan to extend our framework to address other unsupervised problems.

Limitations
Although our experiments have proven the effectiveness of our proposed method, there are still some limitations that can be improved in future work. First, our process of assigning keywords to their relevant aspects is not entirely accurate. Future work may explore alternatives to make this process more precise. Second, through analysis of the results, we notice that our framework predicts the aspect categories of sentences with implicit aspect terms less accurately than sentences with explicit aspect terms. This is because we prioritize the presence of aspect terms in sentences when predicting their aspect categories, as can be seen in the pseudo-label generation. However, sentences with implicit aspect terms do not contain aspect terms, or even contain terms of other aspects, leading to incorrect predictions. For example, 'the only beverage we did receive was water in dirty glasses' was predicted as DRINKS instead of the golden aspect label SERVICE. Future work may focus more on the context of sentences to make better predictions for sentences with implicit aspect terms.

Figure 2 :
Figure 2: Retrieval-based augmentation (k = 3). Information for each aspect is derived from prior knowledge, including seed words and task-specific in-domain sentences having certain connections with the seed words.

Figure 5 :
Figure 5: Performance of ACD with an increasing number of neighbors.More neighbors result in a larger training data set.

Table 1 :
Aspect Category Detection, Aspect Term Extraction, and Aspect Term Polarity example.

Table 2 :
Statistics of the datasets.

Table 3 :
The performance of ASeM on Aspect Category Detection.

Table 4 :
The performance of ASeM on Aspect Term Polarity (ATP), Sentence-level ATP (ATP(s)), and Aspect Term Extraction (ATE). In our work, Sentence-level ATP is not learned directly; rather, its labels are inferred from the polarity labels of the terms in the sentence.

Table 8 :
Examples of improved accuracy by ASeM