A Review of Datasets for Aspect-based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) is a natural language processing problem that analyzes user-generated reviews to determine a) the target entity being reviewed, b) the high-level aspect to which it belongs, and c) the sentiment expressed toward the targets and the aspects. The numerous yet scattered corpora for ABSA make it difficult for researchers to quickly identify the corpus best suited for a specific ABSA subtask. This study presents a database of corpora that can be used to train and evaluate autonomous ABSA systems. Additionally, we provide an overview of the major corpora for ABSA and its subtasks and highlight several features that researchers should consider when selecting a corpus. Finally, we discuss the advantages and disadvantages of existing dataset collection approaches and make recommendations for future corpora creation. This survey examines 98 publicly available ABSA datasets covering over 25 domains, including 77 datasets in English and 21 in other languages ( https://github.com/RiTUAL-UH/ABSA-Datasets-Info ).


Introduction
Consumers and product makers/service providers alike benefit from user-generated reviews on e-commerce platforms. Reading about previous customer experiences can assist future customers in making informed decisions, and learning which characteristics elicit user feedback may help manufacturers and merchants develop measures to enhance customer satisfaction. As this data grows, we need to automatically recognize and extract sentiment or opinions from text reviews. Opinion mining (Pang et al., 2002; Turney, 2002) is a research area that combines computational linguistics and natural language processing. Document-level sentiment analysis seeks to determine a document's overall opinion, whereas sentence-level sentiment analysis focuses on individual sentences, assuming a single opinion towards an entity. In many circumstances, however, an entity such as a restaurant, a mobile phone, a laptop, or any other object might have hundreds of ratings addressing various aspects and opinions. An aspect can be a feature, a trait, or a behavior of a product or an entity, like the atmosphere of a restaurant, the performance of a laptop, the display of a phone, and so on.
To this end, the analysis focuses on a finer degree of granularity, namely aspect-level sentiment analysis (Hu and Liu, 2004a), in which sentiment is determined for each entity as well as its aspects (Poria et al., 2020). ABSA has been studied for two decades, and various sub-tasks have emerged to overcome several challenges. Many systems, metrics, and subtasks, along with various corpora, have been created to solve the task. All the analysis, methods, and subtasks are based on the opinion elements, and each subtask gets its name from the subset of elements identified in that study. For example, the task of detecting the sentiment polarity of the aspect terms (given the aspect terms) is called aspect-term sentiment analysis (ATSA). Although there has been a significant amount of research on ABSA in the last two decades, it became more popular after its formal introduction as a task in SemEval-2014. SemEval-2015 consolidated its subtasks into a single framework in which all detected elements of expressed opinions (i.e., aspects, opinion target expressions, and sentiment polarities) comply with a set of criteria and are related via sentence-level tuples. However, a user may be interested in a text's overall rating on a particular aspect. These ratings may be used to calculate the average sentiment for each aspect based on several sentences of a single review. Thus, in addition to sentence-level ABSA annotations, SemEval-2016 (SE-16) included text-level ABSA annotations and associated training and testing data in various languages for various domains. Therefore, ABSA can be performed at two levels: 1) sentence-level and 2) review-level, as mentioned in (Chebolu et al., 2022).
Given the range of ABSA subtasks and techniques, researchers may find it challenging to establish which corpora are optimal for a specific research task. We address this difficulty by providing an overview of available corpora and evaluating their applicability for fundamental ABSA tasks. The primary difference between this survey and previous ones on ABSA (Laskari and Sanampudi, 2016; Schouten and Frasincar, 2016; Suresh and Raghavi, 2016; Sethi and Bhattacharyya, 2017; Sabeeh and Dewang, 2018; Do et al., 2019; Ahmet and Abdullah, 2020; Nazir et al., 2020; Brauwers and Frasincar, 2021) is that the latter focus primarily on the tasks, conduct a critical analysis of the techniques, and offer ideas and future directions for enhancing the performance of the tasks and addressing unresolved issues. In contrast, this survey aims to review and summarize the literature on collecting text reviews and categorical values for ABSA elements, explain what has been learned to date, and give recommendations for constructing future datasets. Consequently, this study complements previous and more recent ABSA surveys and critical retrospectives (Poria et al., 2020) that focus on definitions, methodology, and evaluations, but not on datasets.
We review 65 publicly available ABSA datasets in this survey that cover more than 25 domains, with 45 datasets in English and 20 in other languages, which help solve 12 different subtasks. We provide an overview of existing sub-tasks and current datasets, followed by a live version of the tables as a website1 that allows community additions. Following that, we look at what can be learned from current data collection approaches. We emphasize two aspects in Section 2.4 that we believe will be particularly essential to present ABSA research.

Tasks and Datasets Overview
This section will discuss the various tasks and subtasks associated with ABSA and the different datasets that help solve one or more of its subtasks independently or jointly.

1 URL hidden for the anonymous review.

Tasks Overview
This problem has two sub-problems: 1) aspect extraction (for example, sushi, pasta, and well-behaved staff) and 2) identifying the polarity toward each aspect. Aspect extraction comprises two sub-tasks: a) extracting aspect terms and b) categorizing/normalizing the extracted aspect terms into aspect categories. Furthermore, there are three subtasks in polarity detection: a) determine the polarity of each category, b) identify the polarity of each aspect term, and c) determine the joint polarity for aspect terms/targets and aspect categories. For example, from the review "Highly recommend this as great value for excellent dumplings, sushi, and service.", we have a positive sentiment polarity for the aspect terms value, dumplings, sushi, and service and the respective aspect categories Price, Food, and Service. The opinion phrases that are useful in determining the polarity are great and excellent. Therefore, the four main elements that we can identify from given data for different ABSA tasks and sub-tasks are 1) aspect terms/targets, 2) aspect categories, 3) opinion phrases, and 4) sentiment polarity.
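These four elements can be represented as opinion tuples, from which each subtask consumes a projection; the following is a minimal sketch (the class and field names are our own illustration, not taken from any dataset schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpinionTuple:
    """One opinion expressed in a review sentence. Any of the four
    elements may be absent for a given ABSA subtask (e.g. TOWE
    provides no aspect category)."""
    aspect_term: Optional[str]      # explicit target phrase, e.g. "dumplings"
    aspect_category: Optional[str]  # normalized category, e.g. "Food"
    opinion_phrase: Optional[str]   # sentiment-bearing phrase, e.g. "excellent"
    polarity: Optional[str]         # "positive" | "negative" | "neutral"

sentence = ("Highly recommend this as great value for "
            "excellent dumplings, sushi, and service.")
annotations = [
    OpinionTuple("value", "Price", "great", "positive"),
    OpinionTuple("dumplings", "Food", "excellent", "positive"),
    OpinionTuple("sushi", "Food", "excellent", "positive"),
    OpinionTuple("service", "Service", "excellent", "positive"),
]

# Different subtasks consume different projections of the same tuples:
atsa_input = [(t.aspect_term, t.polarity) for t in annotations]      # ATSA
acsa_input = [(t.aspect_category, t.polarity) for t in annotations]  # ACSA
```

A dataset that annotates all four fields can serve any subtask in Table 1 by projection; a dataset that omits a field (as most do) cannot.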
We present an overview of all the sub-tasks that stemmed from ABSA as rows, with their identified subset of elements as columns, in Table 1. We also give the common name of each sub-task with its respective identified elements, optionally the sentiment polarities and opinion phrases tied to the aspect terms, aspect categories, or targets. For example, Aspect-Category Sentiment Analysis (ACSA) aims to identify the polarity of a given aspect category, whereas the Target-Aspect-Sentiment Detection (TASD) task jointly identifies the targets, aspect categories, and the polarity expressed towards each target-category pair. The last three rows in the table are recently created tasks that include identifying opinion phrases in the given text that align with the sentiment polarity assigned to a respective element, such as the target or aspect.

Task Challenges
[Table 1: Common sub-tasks of ABSA and their relation with the identified elements (Aspect Categories, Aspect Terms, and Targets). sent.: review sentence; single: the task is independent; joint: the task is a multi-task setting or a joint identification task. Example restaurant review (sent.): "The pasta was very yummy but the place has some weird smell." (Hu and Liu, 2004a)]

There are a few major challenges for ABSA and its sub-tasks. Firstly, the elements described above are not independent; rather, each depends on the detection of the other elements. For example, aspect term extraction and aspect category detection can be used in tandem to find terms and categories in a review (Wan et al., 2020; Xue et al., 2017; Chebolu et al., 2022). The aspect term extractor may extract related aspect terms, and vice versa, if it knows which aspect categories a review belongs to. In the review "However, it's the service that leaves a sour taste in my mouth.", the term service is explicitly used, indicating the aspect Service. If the aspect term extractor is aware of the aspect category Service, it gives the word service in the review greater weight. Similarly, if the term service is given higher weight in the review, the aspect category detector can identify the Service category. We also need to detect the implicit phrase "sour taste in my mouth" as a sentiment indicator to know that the review conveys a negative sentiment polarity towards service. This phrase is an idiom with a negative connotation for service; its literal meaning should not be considered here because the criticism is not directed toward any of the restaurant's food or beverages. There is some dedicated research on implicit aspect and sentiment detection (Cruz et al., 2014), which could be leveraged to improve overall detection performance.
Another issue is ABSA's relevance to and reliance on several other NLP tasks. It is worth noting that not every entity described in a text is an aspect; the entities toward which an opinion is expressed are referred to as aspects. We require a sophisticated NER system to identify the names of foods, beverages, restaurants, computers, processors, hotels, and other items that may be possible targets/aspect terms in the provided review sentence. We must also address the common-opinion issue: finding opinion phrases in the supplied review and linking each of them to all of the proper entities. This issue is closely connected to the NLP entity-linking problem (Daiber et al., 2013; Kolitsas et al., 2018). A single opinion word can apply to multiple entities: excellent is used for both food and margaritas in the review "the food was excellent, the margaritas too." On the other hand, there is no explicit reference to an entity in the review "creative and tasty but pricey," so it should be assumed that the opinion is conveyed about FOOD. However, we must conclude that the aspect is RESTAURANT rather than FOOD in the restaurant review "Good and affordable." To address these issues, we must model the ABSA problem jointly with the related sub-problems such as aspect-term identification (NER), polarity and opinion detection (opinion phrases), and syntactic simplification (Siddharthan, 2006; Scarton et al., 2017) to get separate sentences for each opinion-entity pair and thus solve the common-opinion problem.

Datasets Overview
All the publicly available datasets for ABSA are presented in Table 2 and Table 3. We give the details of the paper in which each dataset was introduced, with its citation, the dataset's source and domain, the number of reviews and the respective number of sentences, and other statistics such as the number of sentences annotated with positive, negative, and neutral sentiment polarity for the aspect terms/targets or aspect categories. In Table 4, we show which ABSA sub-tasks from Table 1 could be solved using the datasets from Table 2 and Table 3.
The SemEval challenge datasets and the recently published SentiHood and MAMS corpora are the most extensively used corpora for aspect-based sentiment analysis. The SemEval corpora were made public as part of a shared task held during the International Workshop on Semantic Evaluation, held annually from 2014 to 2016. The datasets are described in full in (Pontiki et al., 2014, 2015, 2016). Historically, ABSA was primarily concerned with aspect term extraction and sentiment analysis (Hu and Liu, 2004a). Before 2014, there had been very little research into aspect category detection and sentiment analysis. However, once aspect category detection (ACD) was formally presented at SemEval-2014, a slew of new challenges arose. ABSA has witnessed positive outcomes across all tasks thanks to the emergence of artificial neural networks.
Despite the popularity of SemEval datasets for this work, most sentences include only one aspect, or many aspects with the same sentiment polarity, reducing the ABSA task to sentence-level sentiment analysis. Jiang et al. (2019) published a new large-scale Multi-Aspect Multi-Sentiment (MAMS) dataset, in which each sentence has at least two independent aspects with different sentiment polarities. Although Jiang et al. (2019) claimed that each sentence has more than one aspect-sentiment tuple, the approach they followed is not realistic. When there is only one opinion tuple in a sentence, they introduce either a "miscellaneous" category or another category with neutral sentiment as a second opinion tuple, even though no such opinion appears in the review. For instance, in the review I like the smaller portion size for dinner., there is only one opinion, which is about the food's portion size. However, the actual annotation has two opinion tuples: one for the food, and the other a neutral opinion on the restaurant's miscellaneous aspect category. We do not dispute the legitimacy of this strategy, but we do not find it practical in the real world. Another drawback of the SemEval corpora is that they contain reviews about a single target entity, such as a laptop or restaurant. To overcome this, Saeidi et al. (2016) created the SentiHood dataset to identify the sentiment towards each aspect of one or more entities.
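The MAMS selection criterion described above can be expressed as a simple filter; the following is a minimal sketch (the `(aspect, polarity)` pair representation is our own simplification of the annotation format):

```python
def is_mams_style(opinions):
    """Return True if a sentence satisfies the MAMS criterion:
    at least two aspects carrying *different* sentiment polarities.
    `opinions` is a list of (aspect, polarity) pairs."""
    polarities = {polarity for _, polarity in opinions}
    return len(opinions) >= 2 and len(polarities) >= 2

# A sentence that qualifies for MAMS...
print(is_mams_style([("food", "positive"), ("service", "negative")]))  # True
# ...and the kind SemEval data is full of: same polarity throughout,
# which reduces to sentence-level sentiment analysis.
print(is_mams_style([("food", "positive"), ("service", "positive")]))  # False
```

A filter like this selects genuinely multi-sentiment sentences; MAMS instead pads single-opinion sentences with an artificial second tuple, which is the practice questioned above.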
As we discussed in the previous section, an opinion phrase is critical in determining the sentiment polarity towards an aspect or a target and, sometimes, in determining to which target/aspect that opinion belongs. Fan et al. (2019) and Peng et al. (2020) modified the SemEval datasets to account for the missing opinion phrases that lead to a specific sentiment polarity for the target, aspect term, or aspect category. However, this left only a few instances per dataset, coercing them to merge all the reviews from SemEval-2014 through SemEval-2016 into a single dataset.

Annotation Procedure and Dataset Source
Even though researchers use various annotation methods when building ABSA datasets, we explain the most frequent method here. One annotator (A) initially annotates a portion of the data, which is then checked by another annotator (B) for any corrections. The remainder of the sentences in the dataset are annotated by annotator A, with additional instructions based on the nature of the earlier disagreements. When A lacks confidence, a decision is made in collaboration with B. When A and B differ, they reach a judgment together with a third expert annotator. Another conflict resolution method is to take the vote of the majority and consider that the correct annotation. For instance, the SemEval-2014 (SE-14) dataset was annotated in two stages. The first stage consisted of tagging and detecting the polarity of all single- and multi-word expressions that designated certain aspects of the target item. The second stage involved identifying the aspect categories and polarity of the sentences. Most datasets that include annotations only for aspect terms/targets and their polarity, such as the Customer Review datasets (Hu and Liu, 2004b; Ding et al., 2008; Liu et al., 2015), TOWE (Fan et al., 2019), and ASTE (Peng et al., 2020), follow the first stage of this process. The second stage is only implemented for datasets including aspect categories, such as the SemEval and FiQA datasets.
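The majority-vote conflict resolution described above can be sketched in a few lines; the escalation-to-expert behavior on ties is our reading of the procedure, not code from any annotation toolkit:

```python
from collections import Counter

def adjudicate(labels):
    """Resolve one item's label from multiple annotators by majority
    vote. Return None on a tie so the item can be escalated to an
    expert annotator, mirroring the procedure described above."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie -> needs expert adjudication
    return counts[0][0]

print(adjudicate(["positive", "positive", "negative"]))  # positive
print(adjudicate(["positive", "negative"]))              # None (tie)
```

In practice such a script would run per annotated element (term span, category, polarity), with the `None` cases routed to the third annotator.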
A few datasets, such as MAMS, give distinct annotations for the ATSA and ACSA tasks, in which there is no one-to-one correspondence between aspect terms, aspect categories, and their polarities. The restaurant datasets from SemEval (Tables 2 and 3) are a subset of the dataset published by Ganu et al. (2009), which had annotations for only six aspect categories. A typical approach is taking existing datasets and annotating them for missing items for an existing subtask, or proposing a new subtask for ABSA. The TOWE and ASTE datasets (Table 2) are derived from the SemEval restaurant and laptop datasets; the authors included the opinion phrase information for the existing opinion tuples, making this more apparent. The most common disagreements were noticed when annotating multi-word aspect term boundaries, aspect term vs. reference to the target entity, neutral polarity ambiguity, and the problem of distinguishing aspect terms when they appear in conjunctions or disjunctions. The last one was resolved by using the maximal phrase as the aspect term.
Most of the English restaurant datasets, such as SemEval, MAMS, TOWE, and ASTE, are obtained from citysearch.com reviews of New York restaurants, and the majority of the laptop data were derived from laptop reviews on Amazon.com. Aspect category detection was formally introduced in SemEval-2014; all preceding datasets include annotations only for aspect terms and their polarity.

Discussion and Future Directions
We explore several characteristics of the corpora in this section, including the formats, the distribution of aspect categories across sentences, and the need for bigger and joint datasets with opinion phrase annotations.

Dataset Formats
The definition and format of ABSA components vary greatly depending on the dataset's source. SemEval-2014, for example, published a dataset with explicit and independent aspect categories, aspect terms, and corresponding sentiment polarity.
Because there is no one-to-one correspondence between the terms and the categories, using aspect terms and categories in a joint detection scenario is problematic. Due to this, the researcher is forced to work on either the ATE and ASTE tasks or the ACD and ACSA tasks. However, in SemEval-2015 and SemEval-2016, the dataset structure is less ambiguous, establishing a link between the targets and the aspect categories, with the sentiment polarity linked to the target-aspect category combination. This allowed the community to better recognize a text's stated sentiment using terms and categories. In addition, in SemEval-2015 and SemEval-2016, the aspect category is divided into 1) Entity and 2) Attribute. An Entity can be the reviewed entity itself, such as the RESTAURANT, a part/component of it, such as AMBIENCE, or another relevant entity, such as DRINKS. Attributes are facets of an Entity, such as PRICE or QUALITY. A review can consist of multiple entities, and each entity may have multiple attributes. Jiang et al. (2019), Regatte et al. (2020), and a few others followed SemEval-2014's XML format and released new datasets in the recent past. However, Fan et al. (2019) and Peng et al. (2020) modified the datasets from all three SemEval shared tasks into another format to include the opinion phrase, releasing the datasets in both XML and the BIO (beginning, inside, outside) format used for NER tasks. On the other hand, the SentiHood dataset used the JSON format to provide the annotations for targets, aspects, and sentiments. However, the definition of an aspect category in the SentiHood dataset is the combination of the target and the aspect, which corresponds to the ACD and ACSA tasks in Table 1. As mentioned in Section 2.3, aspect categories were formally introduced in SemEval-2014; all prior datasets annotate only the aspect terms and their sentiment polarity to solve the ATE and ATSA tasks.
Therefore, for more robust ABSA systems, we urge the community to use an already established structure and criteria for future datasets rather than introducing a new format or structure.
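To make the SE-14 format limitation concrete, the following sketch parses a SemEval-2014-style XML fragment with the standard library; the tag and attribute names follow the published SE-14 schema, while the sentence content is our own illustration:

```python
import xml.etree.ElementTree as ET

# A minimal SemEval-2014-style fragment.
SNIPPET = """
<sentences>
  <sentence id="1">
    <text>The pasta was very yummy but the place has some weird smell.</text>
    <aspectTerms>
      <aspectTerm term="pasta" polarity="positive" from="4" to="9"/>
      <aspectTerm term="place" polarity="negative" from="33" to="38"/>
    </aspectTerms>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="ambience" polarity="negative"/>
    </aspectCategories>
  </sentence>
</sentences>
"""

def parse_se14(xml_string):
    """Return (text, aspect_terms, aspect_categories) per sentence."""
    out = []
    for sent in ET.fromstring(xml_string).iter("sentence"):
        text = sent.findtext("text")
        terms = [(t.get("term"), t.get("polarity"))
                 for t in sent.iter("aspectTerm")]
        cats = [(c.get("category"), c.get("polarity"))
                for c in sent.iter("aspectCategory")]
        out.append((text, terms, cats))
    return out

text, terms, cats = parse_se14(SNIPPET)[0]
# Note: the term list and the category list are independent -- nothing
# links "pasta" to "food". This is exactly the SE-14 limitation that
# the SE-15/SE-16 target-category link later resolved.
```

The `from`/`to` attributes are character offsets into `<text>`, which is what makes span-level term extraction evaluable.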

Opinion Phrases Annotations and Merging Datasets
As previously explained in Section 2.2, we must identify the opinion words in a given text to determine the sentiment polarity and the entities on which the opinion is conveyed, i.e., the aspect terms. It is evident from Table 4 that most recent tasks, such as ASTE, TOWE, and ASQP, annotated the opinion words in the existing SemEval shared task datasets to enhance the ABSA task overall.
In the original datasets of the SemEval challenge, the opinion targets (aspect terms) are annotated, but the opinion words and their correspondence with targets are not provided. In addition, most of the available benchmark corpora are small, so we can combine or merge datasets with similar characteristics. For instance, Fan et al. (2019) annotated the corresponding opinion words for the annotated targets; sentences without targets or with implicit opinion expressions were not included. The original ASTE dataset does not contain cases where one opinion span is associated with multiple targets, so Peng et al. (2020) refined the dataset with these additional missing triplets and expanded the corpora.
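Target and opinion-span annotations such as TOWE's are commonly distributed as token-level BIO tags; the conversion from spans can be sketched as follows (the tokenization and label names here are illustrative assumptions, not a specific release's format):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token-index spans into
    BIO tags: B- marks the first token of a span, I- its continuation,
    and O everything outside any span."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# "the food was excellent , the margaritas too": two targets sharing
# one opinion span -- the multi-target case discussed in Section 2.2.
tokens = ["the", "food", "was", "excellent", ",", "the", "margaritas", "too"]
spans = [(1, 2, "TARGET"), (3, 4, "OPINION"), (6, 7, "TARGET")]
print(spans_to_bio(tokens, spans))
# ['O', 'B-TARGET', 'O', 'B-OPINION', 'O', 'O', 'B-TARGET', 'O']
```

Multi-token spans yield `B-`/`I-` sequences, e.g. the target "wine list" becomes `B-TARGET I-TARGET`.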
The community could focus on merging the existing datasets to obtain better quality corpora with increased sizes. Furthermore, researchers could annotate for the opinion words missing in most of the current datasets, which could greatly improve the overall performance of ABSA.

Large and Joint Datasets for Unified Models
The majority of the recent ABSA datasets come from the SemEval shared challenges, with additional data processing and task-specific annotations. However, comparing models with precision, especially Transformer-based models with millions of parameters, is difficult because of the small amount of data (e.g., hundreds of sentences). Researchers currently evaluate a model's accuracy by averaging the results of numerous runs, but larger datasets would allow for more precise comparisons. More challenging datasets are also still needed to reflect real-world scenarios; datasets that include reviews from many domains or languages, for example, can help evaluate multi-domain and multi-lingual ABSA systems. Recently, unified models built using generative frameworks (Chebolu et al., 2021; Zhang et al., 2021b) have yielded SOTA performance on all the subtasks of ABSA by jointly solving for all the elements. Therefore, we need to build more datasets similar to (Zhang et al., 2021a) with annotations for all the elements and encourage such unified models to handle multiple tasks together. This is also computationally beneficial in practice, since we do not always want to modify the model architecture and retrain it every time we get new data with different opinion annotations.
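A unified generative model casts all annotated elements as one output sequence to be decoded; the linearization template below is our own sketch of the general idea, not the exact target format used by Zhang et al. (2021a) or any other cited work:

```python
def quads_to_target(quads):
    """Linearize (category, aspect_term, opinion_phrase, polarity)
    quadruples into a single output string for a seq2seq model.
    The separator and parenthesized template are illustrative choices."""
    parts = [f"({category}, {term}, {opinion}, {polarity})"
             for category, term, opinion, polarity in quads]
    return " ; ".join(parts)

quads = [("Food", "dumplings", "excellent", "positive"),
         ("Service", "service", "excellent", "positive")]
print(quads_to_target(quads))
# (Food, dumplings, excellent, positive) ; (Service, service, excellent, positive)
```

Because the same model decodes every element, new annotation types only change the target string, not the architecture, which is the computational benefit noted above.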

Skewed Data Distribution
Another major issue with current ABSA datasets is the unbalanced distribution of aspect categories, which results in poor performance on underrepresented categories. Each plot in Figure 1 shows the ratio of reviews in which an aspect category is present to the total number of reviews in each dataset split. As Figure 1 shows, the distribution of the number of instances of each category is non-uniform in all datasets. Because of this skew, tasks involving underrepresented entities and categories, such as drinks in SE-16 Restaurants and price in the MAMS and SE-14 datasets, are more complex; in the restaurant domain, by contrast, it is simpler to identify the food-related entities and categories, which have the most reviews. For future datasets, we suggest aiming for a more balanced distribution of aspect categories. Our underlying point is that the current situation is more challenging than it would be if balanced datasets were available. Any effort to annotate data begins with some sort of data selection process, and we propose that certain heuristics can be employed to achieve a more balanced representation across classes, even if only for the training set.
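The per-split ratios plotted in Figure 1 can be computed in a few lines; the input representation (one set of categories per review) is our own assumption, since each dataset stores its annotations differently:

```python
from collections import Counter

def category_distribution(reviews):
    """Fraction of reviews mentioning each aspect category, mirroring
    the per-split ratios plotted in Figure 1. `reviews` is a list of
    per-review category sets."""
    counts = Counter(cat for cats in reviews for cat in cats)
    n = len(reviews)
    return {cat: count / n for cat, count in counts.items()}

# Toy split: food dominates, price is underrepresented -- the kind of
# skew that makes rare categories harder to learn.
reviews = [{"food"}, {"food", "service"}, {"food"}, {"price"}]
print(category_distribution(reviews))
# {'food': 0.75, 'service': 0.25, 'price': 0.25}
```

A selection heuristic for annotation could use such ratios directly, e.g. preferentially sampling candidate reviews that mention the currently rarest categories.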

Conclusion
In this work, we summarize the key corpora for aspect-based sentiment analysis, the relationships between the various ABSA subtasks, and the annotated components in each dataset. Many corpora exist; however, the majority are small and cannot be used to determine the efficacy of identifying the sentiment polarity of aspects/targets using neural network models. More large-scale ABSA corpora are needed to effectively detect the various elements, including opinion phrases, and to construct robust systems for determining sentiment polarity toward aspects and targets. Furthermore, each corpus has its own data structure; establishing a data standard for ABSA corpora would streamline research. In addition, researchers should attempt to incorporate opinion phrase annotations in future datasets, as these are critical for ABSA.