Emotive or Non-emotive: That is The Question

In this research we focus on discriminating between emotive (emotionally loaded) and non-emotive sentences. We define the problem from a linguistic point of view assuming that emotive sentences stand out both lexically and grammatically. We verify this assumption experimentally by comparing two sets of such sentences in Japanese. The comparison is based on words, longer n-grams as well as more sophisticated patterns. In the classification we use a novel unsupervised learning algorithm based on the idea of language combinatorics. The method reached results comparable to the state of the art, while the fact that it is fully automatic makes it more efficient and language independent.


Introduction
Recently the field of sentiment analysis has attracted great interest. It has become popular to try different methods to distinguish between sentences loaded with positive and negative sentiments. However, few studies have focused on a more generic task, namely, discriminating whether a sentence is loaded with emotional content at all. The difficulty of the task is indicated by three facts. Firstly, the task has not been widely undertaken. Secondly, in research which addresses the challenge, the definition of the task is usually based on subjective ad hoc assumptions. Thirdly, in research which does tackle the problem in a systematic way, the results are usually unsatisfactory, and satisfactory results can be obtained only with a large workload.
We decided to tackle the problem in a standardized and systematic way. We defined emotionally loaded sentences as those which in linguistics are described as fulfilling the emotive function of language. We assumed that there are repetitive patterns which appear uniquely in emotive sentences. We performed experiments using a novel unsupervised clustering algorithm based on the idea of language combinatorics. By using this method we were also able to minimize human effort and achieve an F-score comparable to the state of the art with a much higher Recall rate.
The outline of the paper is as follows. We present the background for this research in Section 2. Section 3 describes the language combinatorics approach which we used to compare emotive and non-emotive sentences. In Section 4 we describe our dataset and experiment settings. The results of the experiment are presented in Section 5. Finally, the paper is concluded in Section 6.

Background
There are different linguistic means used to inform interlocutors of emotional states in everyday communication. The emotive meaning is conveyed verbally and lexically through exclamations (Beijer, 2002; Ono, 2002), hypocoristics (endearments) (Kamei et al., 1996), vulgarities (Crystal, 1989) or, for example in Japanese, through mimetic expressions (gitaigo) (Baba, 2003). The function of language realized by such elements conveying emotive meaning is called the emotive function of language. It was first distinguished by Bühler (1934-1990) in his Sprachtheorie as one of three basic functions of language, the other two being descriptive and impressive. Bühler's theory was picked up later by Jakobson (1960), who by distinguishing three other functions laid the grounds for structural linguistics and communication studies.

Previous Research
Detecting whether sentences are loaded with emotional content has been undertaken by a number of researchers, most often as an additional task in either sentiment analysis (SA) or affect analysis (AA). SA, in great simplification, focuses on determining whether a language entity (sentence, document) was written with a positive or negative attitude toward its topic. AA, on the other hand, focuses on specifying exactly which emotion type (joy, anger, etc.) has been conveyed. The fact that the task was usually undertaken as a subtask influences the way it was formulated. Below we present some of the most influential works on the topic, each formulating it in slightly different terms.
Emotional vs. Neutral: Discriminating whether a sentence is emotional or neutral is to answer the question of whether it can be interpreted as produced in an emotional state. This way the task was studied by Minato et al. (2006), Aman and Szpakowicz (2007) or Neviarouskaya et al. (2011).
Subjective vs. Objective: Discriminating between subjective and objective sentences is to say whether the speaker presented the sentence contents from a first-person-centric perspective or from no specific perspective. Research formulating the problem this way includes, e.g., Wiebe et al. (1999), who classified subjectivity of sentences using a naive Bayes classifier, and later Wilson and Wiebe (2005). In other research Yu and Hatzivassiloglou (2003) used supervised learning to detect subjectivity and Hatzivassiloglou and Wiebe (2012) studied the effect of gradable adjectives on sentence subjectivity.
Emotive vs. Non-emotive: Saying that a sentence is emotive means specifying the linguistic features of language which were used to produce a sentence uttered with emphasis. Research that formulated and tackled the problem this way was done by, e.g., Ptaszynski et al. (2009).
Each of the above nomenclatures implies similar, though slightly different assumptions. For example, a sentence produced without any emotive characteristics (non-emotive) could still imply an emotional state in some situations. Also Bing and Zhang (2012) notice that "not all subjective sentences express opinions and those that do are a subgroup of opinionated sentences." A comparison of the scopes and overlaps of the different nomenclatures is represented in Figure 1. In this research we formulate the problem similarly to Ptaszynski et al. (2009), and therefore we used their system as a baseline for comparison with our method.

Language Combinatorics
The idea of language combinatorics (LC) assumes that patterns with disjoint elements provide better results than the usual bag-of-words or n-gram approach (Ptaszynski et al., 2011). Such patterns are defined as ordered non-repeated combinations of sentence elements. They are automatically extracted by generating all ordered combinations of sentence elements and verifying their occurrences within a corpus.
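The extraction step described above can be illustrated with a minimal sketch. It is not the SPEC implementation. The function names, the whitespace tokenization, and the choice to count each pattern at most once per sentence are assumptions made for illustration. Adjacent elements are joined with a space and non-adjacent ones with an asterisk, following the notation used for patterns later in this paper.

```python
from collections import Counter
from itertools import combinations


def extract_patterns(tokens, max_len=6):
    """Return all ordered combinations of sentence elements up to max_len.

    Adjacent elements are joined with a space, disjoint (non-adjacent)
    elements with "*". Returning a set keeps each pattern only once.
    """
    patterns = set()
    for k in range(1, min(max_len, len(tokens)) + 1):
        # combinations() yields index tuples in increasing order, so every
        # pattern preserves the word order of the sentence.
        for idx in combinations(range(len(tokens)), k):
            parts = [tokens[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                parts.append(" " if cur == prev + 1 else "*")
                parts.append(tokens[cur])
            patterns.add("".join(parts))
    return patterns


def count_patterns(sentences, max_len=6):
    """Count in how many sentences of a corpus each pattern occurs.

    Assumption: sentences are already tokenized and space-separated.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(extract_patterns(sentence.split(), max_len))
    return counts


patterns = extract_patterns("butai ga mienai yo".split())
counts = count_patterns(["butai ga mienai yo", "onaka ga suita yo"])
```

For the four-element sentence this yields 15 patterns, among them the disjoint pattern ga*yo, which occurs in both sentences of the toy corpus.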
In particular, in every n-element sentence there are k-element combination clusters, such that 1 ≤ k ≤ n, where each cluster contains all k-element combinations of the n sentence elements. The number of combinations generated for one k-element cluster is equal to the binomial coefficient, as in eq. 1. Thus the number of all possible combinations generated for all values of k from the range {1, ..., n} is equal to the sum of combinations from all k-element clusters, as in eq. 2.
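Equations 1 and 2 are not reproduced in this text. Based on the description above they presumably take the standard form below. This is a reconstruction from the prose, not the authors' typeset equations.

```latex
\begin{equation}
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
\tag{1}
\end{equation}

\begin{equation}
\sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1
\tag{2}
\end{equation}
```

For example, a sentence of n = 4 elements gives 4 + 6 + 4 + 1 = 15 = 2^4 - 1 combinations.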
One problem with the combinatorial approach is the phenomenon of exponential and rapid growth of function values during combinatorial manipulations, called combinatorial explosion (Krippendorff, 1986). Since this phenomenon causes long processing time, combinatorial approaches have often been disregarded. We assumed, however, that it could be dealt with when the algorithm is optimized to the requirements of the task. In preliminary experiments Ptaszynski et al. (2011) used a generic sentence pattern extraction architecture SPEC to compare the amounts of generated sophisticated patterns with n-grams, and noticed that it is not necessary to generate patterns of all lengths, since the most useful ones usually appear in the group of 2 to 5 element patterns. Following their experience we limit the pattern length in our research to 6 elements. All non-subsequent elements are also separated with an asterisk ("*") to mark disjoint elements.
Table 1: Some examples from the dataset representing emotive and non-emotive sentences close in content, but differing in emotional load expressed in the sentence (Romanized Japanese / Translation).
Non-emotive: Kono karē wa karai. / This curry is hot.
Emotive: Nanto ano hito, kekkon suru rashii yo! / Have you heard? She's getting married!
Non-emotive: Ano hito ga kekkon suru rashii desu. / They say she is getting married.
Emotive: Chō ha ga itee / Oh, how my tooth aches!
Non-emotive: Ha ga itai / A tooth aches
Emotive: Sugoku kirei na umi da naaa / Oh, what a beautiful sea!
Non-emotive: Kirei na umi desu / This is a beautiful sea
The weight w_j of each pattern generated this way is calculated, according to equation 3, as a ratio of all occurrences of a pattern in one corpus, O_pos, to the sum of occurrences in the two compared corpora, O_pos + O_neg. The weights are also normalized to fit in the range from +1 (representing purely emotive patterns) to -1 (representing purely non-emotive patterns). The normalization is achieved by subtracting 0.5 from the initial score and multiplying this intermediate result by 2. The score of one sentence is calculated as a sum of weights of patterns found in the sentence, as in eq. 4.
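Equations 3 and 4 are not reproduced in this text. Following the verbal description above, they can be reconstructed as below. This is a reconstruction from the prose, and the summation over "patterns found in the sentence" is written here without explicit index bounds because the text does not give them.

```latex
\begin{equation}
w_j = \left( \frac{O_{pos}}{O_{pos} + O_{neg}} - 0.5 \right) \cdot 2
\tag{3}
\end{equation}

\begin{equation}
score = \sum_{j} w_j , \qquad -1 \le w_j \le 1
\tag{4}
\end{equation}
```

As a worked example, a pattern occurring 3 times in the emotive corpus and once in the non-emotive corpus gets w_j = (3/4 - 0.5) * 2 = 0.5, while a pattern occurring equally often on both sides gets w_j = 0.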
The weight can be further modified by either
• awarding length k, or
• awarding length k and occurrence O.
The list of generated frequent patterns can also be further modified. When two collections of sentences of opposite features (such as "emotive vs. non-emotive") are compared, a generated list will contain patterns appearing uniquely on only one of the sides (e.g. uniquely emotive patterns and uniquely non-emotive patterns) or on both (ambiguous patterns). Therefore the pattern list can be modified by deleting
• all ambiguous patterns, or
• only ambiguous patterns appearing in the same number on both sides (later called "zero patterns", since their weight is equal to 0).
Moreover, since a list of patterns will contain both the sophisticated patterns as well as the usual n-grams, the experiments were performed separately for all patterns and for n-grams only. Also, if the initial collection was biased toward one of the sides (sentences of one kind were longer or more numerous), there will be more patterns of a certain sort. To mitigate this bias, instead of applying a rule of thumb, the threshold was optimized automatically.
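The weighting, pattern-list filtering and sentence scoring described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the occurrence counts are hypothetical, and the length and occurrence awards are left out because their exact form is not given in the text.

```python
from collections import Counter


def pattern_weights(pos_counts, neg_counts):
    """Normalized weight in [-1, +1] for every pattern seen in either corpus.

    Assumes all stored counts are positive, so the denominator is never 0.
    """
    weights = {}
    for pattern in set(pos_counts) | set(neg_counts):
        o_pos = pos_counts.get(pattern, 0)
        o_neg = neg_counts.get(pattern, 0)
        weights[pattern] = (o_pos / (o_pos + o_neg) - 0.5) * 2
    return weights


def drop_zero_patterns(weights):
    """Delete ambiguous patterns occurring equally often on both sides."""
    return {p: w for p, w in weights.items() if w != 0}


def drop_ambiguous_patterns(weights):
    """Keep only patterns unique to one side (weight exactly +1 or -1)."""
    return {p: w for p, w in weights.items() if abs(w) == 1}


def sentence_score(sentence_patterns, weights):
    """Sum the weights of all known patterns found in the sentence."""
    return sum(weights.get(p, 0.0) for p in sentence_patterns)


# Hypothetical occurrence counts, for illustration only.
emotive = Counter({"yo": 3, "!": 2, "ga": 2})
non_emotive = Counter({"desu": 4, "ga": 2, "yo": 1})

weights = pattern_weights(emotive, non_emotive)
no_zero = drop_zero_patterns(weights)
unique_only = drop_ambiguous_patterns(weights)
score = sentence_score({"ga", "yo", "!"}, weights)
```

With these toy counts "!" is purely emotive (+1), desu purely non-emotive (-1), yo ambiguous (0.5), and ga a zero pattern.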

Dataset Preparation
In the experiments we used a dataset developed by Ptaszynski et al. (2009) for the needs of evaluating their affect analysis system ML-Ask for the Japanese language. The dataset contains 50 emotive and 41 non-emotive sentences. It was created as follows.
Thirty people of different age and social groups participated in an anonymous survey. Each participant was to imagine or remember a conversation with any person they know and write three sentences from that conversation: one free, one emotive, and one non-emotive. Additionally, the participants were asked to make the emotive and non-emotive sentences as close in content as possible, so that the only difference was whether a sentence was loaded with emotion or not. The participants also annotated their own free utterances as emotive or non-emotive. Some examples from the dataset are shown in Table 1.
In our research the above dataset was further preprocessed to make the sentences separable into elements. We did this in three ways to check how the preprocessing influences the results. We used MeCab, a morphological analyzer for Japanese, to preprocess the sentences from the dataset in the three following ways:
• Tokenization: All words, punctuation marks, etc. are separated by spaces.
• Parts of speech (POS): Words are replaced with their POS tags.
• Tokens with POS: Each element carries both the word and its POS tag.
Examples of all three kinds of preprocessing are shown in Table 2.
Table 2: Three kinds of preprocessing of a sentence in Japanese; N = noun, TOP = topic marker, ADV = adverbial particle, ADJ = adjective, COP = copula, EXCL = exclamation mark.
In theory, the more generalized a sentence is, the fewer unique patterns it will produce, but the produced patterns will be more frequent. This can be explained by comparing a tokenized sentence with its POS representation. For example, in the sentence from Table 2 we can see that a simple phrase kimochi ii ("feeling good") can be represented by a POS pattern N ADJ. We can easily assume that there will be more N ADJ patterns than kimochi ii, because many word combinations can be represented as N ADJ. Therefore POS patterns will come in less variety but with higher occurrence frequency. By comparing the results of classification using different preprocessing methods we can find out whether it is better to represent sentences as more generalized or as more specific.
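The three representations can be illustrated with a short sketch. It does not call MeCab. It assumes the morphological analysis has already produced (word, POS) pairs, and the bracket notation for the combined token-POS elements is a hypothetical format chosen for this example.

```python
def preprocess(tagged, mode):
    """Render a tagged sentence in one of the three representations.

    tagged: list of (word, pos) pairs, assumed to come from a
            morphological analyzer.
    mode:   "tokens", "pos" or "tokens+pos".
    """
    if mode == "tokens":
        elements = [word for word, _ in tagged]
    elif mode == "pos":
        elements = [pos for _, pos in tagged]
    elif mode == "tokens+pos":
        # Hypothetical element format: word followed by its tag in brackets.
        elements = [f"{word}[{pos}]" for word, pos in tagged]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return " ".join(elements)


# The phrase discussed in the text, with its POS tags.
tagged = [("kimochi", "N"), ("ii", "ADJ")]
as_tokens = preprocess(tagged, "tokens")
as_pos = preprocess(tagged, "pos")
as_both = preprocess(tagged, "tokens+pos")
```

Any other noun-adjective phrase would map to the same N ADJ string in the POS representation, which is why POS patterns are fewer in variety but more frequent.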

Experiment Setup
The experiment was performed three times, once for each kind of preprocessing. Each time 10-fold cross validation was performed and the results were calculated using Precision (P), Recall (R) and balanced F-score (F) for each threshold. We verified which version of the algorithm achieves the top score within the threshold span. However, an algorithm could achieve the best score for one certain threshold, while for others it could perform poorly. Therefore we also looked at which version achieves high scores for the longest threshold span. This shows which algorithm is more balanced. Finally, we checked the statistical significance of the results. We used a paired t-test because the classification results could represent only one of two classes (emotive or non-emotive). We also compared the performance to the state of the art, namely the affect analysis system ML-Ask developed by Ptaszynski et al. (2009).

Results and Discussion
The overall F-score results were generally the best for the datasets containing, in order: both tokens and POS, tokens only, and POS only. The F-scores for POS-preprocessed sentences revealed the least constancy. In many cases n-grams scored higher than all patterns, but almost none of the results reached statistical significance. The F-score results for the tokenized dataset were also not unequivocal. For higher thresholds patterns scored higher, while for lower thresholds the results were similar. The scores were rarely significant, at most at the 5% level (p < 0.05); however, in all situations where n-grams visibly scored higher, the differences were not statistically significant. Finally, for the preprocessing including both tokens and POS information, the pattern-based approach achieved significantly better results (p-value < 0.01 or < 0.001). The algorithm reached its plateau at an F-score of around 0.73-0.74 for tokens and POS separately, and 0.75-0.76 for tokens with POS together. In the POS dataset the elements were more abstracted, while in the token-POS dataset the elements were more specific, producing a larger number of less frequent patterns. Lower scores for the POS dataset could suggest that the algorithm works better with less abstracted preprocessing. Examples of F-score comparison between n-grams and patterns for the tokenized and token-POS datasets are represented in Figures 2 and 3, respectively.
Results for Precision showed similar tendencies. They were the most ambiguous for POS preprocessing. For the tokenized dataset, although there were always one or two thresholds for which n-grams scored higher, scores for patterns were more balanced, starting with a high score and decreasing slowly. As for the token-POS preprocessing, patterns achieved higher Precision for most of the threshold span. The highest Precision of all was achieved in this dataset by patterns, with P = 0.87 for R = 0.50.
As for Recall, the scores were consistent for all kinds of preprocessing, with higher scores for patterns within most of the threshold span and the two converging as the threshold decreases. The highest scores achieved for each preprocessing for n-grams and patterns are represented in Table 3. The affect analysis system ML-Ask (Ptaszynski et al., 2009) on the same dataset reached F = 0.79, P = 0.8 and R = 0.78. The results were generally comparable, though slightly higher for ML-Ask when it comes to P and F-score. R was always better for the proposed method. However, ML-Ask is a system requiring handcrafted lexicons, while our method is fully automatic, learning the patterns from data and not needing any particular preparations, which makes it more efficient.

Detailed Analysis of Learned Patterns
Among the most frequently appearing emotive patterns there were, for example: ! (exclamation mark), n*yo, cha (emotive verb modification), yo (exclamative sentence ending particle), ga*yo, n*! or naa (interjection). Some examples of sentences containing those patterns are below (patterns underlined). Interestingly, most elements of those patterns appear in ML-Ask handcrafted databases, which suggests it could be possible to improve ML-Ask performance by extracting additional patterns with SPEC.
Ex. 1. Megane, soko ni atta nda yo. (The glasses were over there!)
Ex. 2. Uuun, butai ga mienai yo. (Ohh, I cannot see the stage!)
Ex. 3. Aaa, onaka ga suita yo. (Ohh, I'm so hungry)
Another advantage of our method is the fact that it can mark both emotive and non-emotive elements in a sentence, while ML-Ask is designed to annotate only emotive elements. Examples of extracted non-emotive patterns include desu, wa*desu, mashi ta, and te*masu. All of them were patterns described in linguistic literature as typically non-emotive, consisting of copulas (desu) and verb endings (masu, mashi ta). Some sentence examples with those patterns include:

Conclusions and Future Work
We presented a method for automatic extraction of patterns from emotive sentences. We assumed emotive sentences are distinguishable both lexically and grammatically and performed experiments to verify this assumption. In the experiments we used a set of emotive and non-emotive sentences preprocessed in different ways (tokens, POS, token-POS). The patterns extracted from sentences were applied to recognize emotionally loaded sentences. The algorithm reached its plateau for F-score around 0.75-0.76 for patterns containing both tokens and POS information. Precision for patterns was balanced, while for n-grams, although occasionally achieving high scores, it decreased quickly. Recall scores were almost always better for patterns. The generally lower results for POS-represented sentences suggest that the algorithm works better with less abstracted elements.
The results of the proposed method and the affect analysis system ML-Ask were comparable. ML-Ask achieved better Precision, but lower Recall. However, our method is more efficient as it does not require handcrafted lexicons. Moreover, automatically extracted patterns overlap with handcrafted databases of ML-Ask, which suggests it could be possible to improve ML-Ask performance with our method. In the near future we plan to perform experiments on larger datasets, also in other languages, such as English or Chinese.
Figure panels: (h) Precision comparison for tokenized dataset with POS tags. (i) Recall comparison for tokenized dataset with POS tags.