2024
Context vs. Human Disagreement in Sarcasm Detection
Hyewon Jang | Moritz Jakob | Diego Frassinelli
Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)
Prior work has highlighted the importance of context in the identification of sarcasm by humans and language models. This work examines how much context is required for better identification of sarcasm by both. We collect textual responses to dialogical prompts, along with sarcasm judgments of those responses presented after a long context, a short context, or no context. We find that, for both humans and language models, the presence of context is generally important for identifying sarcasm in the response, but increasing the amount of context provides no added benefit to humans (long = short > none). The same holds for language models, but only on sentences that human evaluators readily agree on; for sentences on which human evaluators disagree, different models behave differently. We also show that sarcasm detection patterns stay consistent as the amount of context is manipulated, despite the low agreement in human evaluation.
Generalizable Sarcasm Detection is Just Around the Corner, of Course!
Hyewon Jang | Diego Frassinelli
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We tested the robustness of sarcasm detection models by examining their behavior when fine-tuned on four sarcasm datasets that vary in their characteristics of sarcasm: label source (authors vs. third-party), domain (social media/online vs. offline conversations/dialogues), and style (aggressive vs. humorous mocking). We tested their prediction performance on the same dataset (intra-dataset) and across different datasets (cross-dataset). For intra-dataset predictions, models consistently performed better when fine-tuned with third-party labels rather than with author labels. For cross-dataset predictions, most models failed to generalize well to the other datasets, implying that one type of dataset cannot represent all sorts of sarcasm with different styles and domains. Compared to the existing datasets, models fine-tuned on the new dataset we release in this work showed the highest generalizability to other datasets. With a manual inspection of the datasets and post-hoc analysis, we attributed the difficulty in generalization to the fact that sarcasm actually comes in different domains and styles. We argue that future sarcasm research should take the broad scope of sarcasm into account.
2023
Figurative Language Processing: A Linguistically Informed Feature Analysis of the Behavior of Language Models and Humans
Hyewon Jang | Qi Yu | Diego Frassinelli
Findings of the Association for Computational Linguistics: ACL 2023
Recent years have witnessed a growing interest in investigating what Transformer-based language models (TLMs) actually learn from the training data. This is especially relevant for complex tasks such as the understanding of non-literal meaning. In this work, we probe the performance of three black-box TLMs and two intrinsically transparent white-box models on figurative language classification of sarcasm, similes, idioms, and metaphors. We conduct two studies on the classification results to provide insights into the inner workings of such models. With our first analysis on feature importance, we identify crucial differences in model behavior. With our second analysis using an online experiment with human participants, we inspect different linguistic characteristics of the four figurative language types.
2022
Capturing Changes in Mood Over Time in Longitudinal Data Using Ensemble Methodologies
Ana-Maria Bucur | Hyewon Jang | Farhana Ferdousi Liza
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology
This paper presents the system description of team BLUE for Task A of the CLPsych 2022 Shared Task on identifying changes in mood and behaviour in longitudinal textual data. These moments of change are signals that can be used to screen and prevent suicide attempts. To detect these changes, we experimented with several text representation methods, such as TF-IDF, sentence embeddings, and emotion-informed embeddings, and with several classical machine learning classifiers. We chose to submit three runs of ensemble systems based on maximum voting on the predictions from the best performing models. Of the nine participating teams in Task A, our team ranked second in the Precision-oriented Coverage-based Evaluation, with a score of 0.499. Our best system was an ensemble of Support Vector Machine, Logistic Regression, and Adaptive Boosting classifiers using emotion-informed embeddings as input representation that can model both the linguistic and emotional information found in users' posts.
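
As an illustration of the maximum-voting step mentioned in the abstract above, the following is a minimal sketch, not the team's actual system. The function name `majority_vote`, the labels `"change"` and `"none"`, and the tie-breaking rule (prefer the earliest-listed model) are all assumptions made for this example; the abstract does not specify how ties were resolved or how predictions were encoded.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per item by voting.

    predictions: list of equal-length label sequences, one per model.
    Tie-breaking is an assumption of this sketch: when several labels
    share the top count, the label from the earliest-listed model wins.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all models must label the same number of items")

    combined = []
    for votes in zip(*predictions):  # one tuple of votes per item
        counts = Counter(votes)
        top = max(counts.values())
        # First label, in model order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Hypothetical predictions from three classifiers on four posts.
svm_preds = ["none", "change", "change", "none"]
logreg_preds = ["none", "change", "none", "none"]
ada_preds = ["change", "change", "none", "change"]

ensemble = majority_vote([svm_preds, logreg_preds, ada_preds])
print(ensemble)
```

With an odd number of models and two labels, ties cannot occur, so the tie-breaking rule only matters for multi-class labels or an even number of models.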