2024
LLaMA-Based Models for Aspect-Based Sentiment Analysis
Jakub Šmíd | Pavel Přibáň | Pavel Král
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models.
UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection
Jakub Šmíd | Pavel Přibáň | Pavel Král
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca 2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.
COMICORDA: Dialogue Act Recognition in Comic Books
Jiri Martinek | Pavel Král | Ladislav Lenc | Josef Baloun
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Dialogue act (DA) recognition is usually realized from a speech signal that is transcribed and segmented into text. However, only a little work in DA recognition from images exists. Therefore, this paper concentrates on this modality and presents a novel DA recognition approach for image documents, namely comic books. To the best of our knowledge, this is the first study investigating dialogue acts from comic books and represents the first steps to building a model for comic book understanding. The proposed method is composed of the following steps: speech balloon segmentation, optical character recognition (OCR), and DA recognition itself. We use YOLOv8 for balloon segmentation, Google Vision for OCR, and Transformer-based models for DA classification. The experiments are performed on a newly created dataset comprising 1,438 annotated comic panels. It contains bounding boxes, transcriptions, and dialogue act annotation. We have achieved nearly 98% average precision for speech balloon segmentation and exceeded the accuracy of 70% for the DA recognition task. We also present an analysis of dialogue structure in the comics domain and compare it with the standard DA datasets, representing another contribution of this paper.
Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks
Jakub Šmíd | Pavel Přibáň | Ondrej Prazak | Pavel Král
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed to allow its usage for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.
UWBA at SemEval-2024 Task 3: Dialogue Representation and Multimodal Fusion for Emotion Cause Analysis
Josef Baloun | Jiri Martinek | Ladislav Lenc | Pavel Král | Matěj Zeman | Lukáš Vlček
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In this paper, we present an approach for solving SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations. The task includes two subtasks that focus on emotion-cause pair extraction using text, video, and audio modalities. Our approach is composed of encoding all modalities (MFCC and Wav2Vec for audio, 3D-CNN for video, and transformer-based models for text) and combining them in an utterance-level fusion module. The model is then optimized for link and emotion prediction simultaneously. Our approach achieved 6th place in both subtasks. The full leaderboard can be found at https://codalab.lisn.upsaclay.fr/competitions/16141#results
2021
Evaluation Datasets for Cross-lingual Semantic Textual Similarity
Tomáš Hercig | Pavel Král
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Semantic textual similarity (STS) systems estimate the degree of meaning similarity between two sentences. Cross-lingual STS systems estimate the degree of meaning similarity between two sentences, each in a different language. State-of-the-art algorithms usually employ a strongly supervised, resource-rich approach that is difficult to use for poorly-resourced languages. However, any approach needs evaluation data to confirm the results. In order to simplify the evaluation process for poorly-resourced languages (in terms of STS evaluation datasets), we present new datasets for cross-lingual and monolingual STS for languages without this evaluation data. We also present the results of several state-of-the-art methods on these data, which can be used as a baseline for further research. We believe that this article will not only extend the current STS research to other languages, but will also encourage competition on this new evaluation data.
Transfer Learning for Czech Historical Named Entity Recognition
Helena Hubková | Pavel Král
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Named entity recognition (NER) now achieves excellent results on standard corpora. However, major issues emerge when it must be applied in a specific domain, because this requires a suitable annotated corpus with an adapted NE tag set. This is particularly evident in the field of historical document processing. The main goal of this paper is to propose and evaluate several transfer learning methods to increase the score of Czech historical NER. We study several information sources, and we use two neural nets for NE modeling and recognition. We employ two corpora for the evaluation of our transfer learning methods, namely the Czech named entity corpus and the Czech historical named entity corpus. We show that a BERT representation with fine-tuning and only a simple classifier trained on the union of the corpora achieves excellent results.
2020
Czech Historical Named Entity Corpus v 1.0
Helena Hubková | Pavel Král | Eva Pettersson
Proceedings of the Twelfth Language Resources and Evaluation Conference
As the number of digitized archival documents increases very rapidly, named entity recognition (NER) in historical documents has become very important for information extraction and data mining. This task requires an annotated corpus, which has so far been missing for Czech. In this paper, we present a new annotated data collection for historical NER, composed of Czech historical newspapers. This corpus is freely available for research purposes. For this corpus, we have defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We further conducted experiments on this corpus using recurrent neural networks. We experimented with randomly initialized embeddings and with static and dynamic fastText word embeddings. We achieved an F1 score of 0.73 with a bidirectional LSTM model using static fastText embeddings.
UWB@FinTOC-2020 Shared Task: Financial Document Title Detection
Tomáš Hercig | Pavel Král
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
This paper describes our system created for the Financial Document Structure Extraction Shared Task (FinTOC-2020): Title Detection. We rely on the Apache PDFBox library to extract text and additional information, e.g., font type and font size, from the financial prospectuses. Our constrained system uses only the provided training data without any additional external resources. Our system is based on the Maximum Entropy classifier and various features, including font type and font size. Our system achieves an F1 score of 81% and first place in the French track, and an F1 score of 77% and second place among 5 participating teams in the English track.
2018
Czech Text Document Corpus v 2.0
Pavel Král | Ladislav Lenc
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Word Embeddings for Multi-label Document Classification
Ladislav Lenc | Pavel Král
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
In this paper, we analyze and evaluate word embeddings for representation of longer texts in the multi-label classification scenario. The embeddings are used in three convolutional neural network topologies. The experiments are realized on the Czech ČTK and English Reuters-21578 standard corpora. We compare the results of word2vec static and trainable embeddings with randomly initialized word vectors. We conclude that initialization does not play an important role for classification. However, learning of word vectors is crucial to obtain good results.
Unsupervised Dialogue Act Induction using Gaussian Mixtures
Tomáš Brychcín | Pavel Král
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
This paper introduces a new unsupervised approach for dialogue act induction. Given a sequence of dialogue utterances, the task is to assign them labels representing their function in the dialogue. Utterances are represented as real-valued vectors encoding their meaning. We model the dialogue as a hidden Markov model with emission probabilities estimated by Gaussian mixtures. We use Gibbs sampling for posterior inference. We present the results on the standard Switchboard-DAMSL corpus. Our algorithm achieves promising results compared with strong supervised baselines and outperforms other unsupervised algorithms.
2016
UWB at SemEval-2016 Task 7: Novel Method for Automatic Sentiment Intensity Determination
Ladislav Lenc | Pavel Král | Václav Rajtmajer
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)