Chris Biemann - ACL Anthology

Chris Biemann

Also published as: Christian Biemann

2026

T²-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation
Jan Strich | Enes Kutay Isgorur | Maximilian Trescher | Chris Biemann | Martin Semmann
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Since many real-world documents combine textual and tabular data, robust Retrieval Augmented Generation (RAG) systems are essential for effectively accessing and analyzing such content to support complex reasoning tasks. Therefore, this paper introduces T²-RAGBench, a benchmark comprising 23,088 question-context-answer triples, designed to evaluate RAG methods on real-world text-and-table data. Unlike typical QA datasets that operate under Oracle Context settings, T²-RAGBench challenges models to first retrieve the correct context before conducting numerical reasoning. Existing QA datasets containing text-and-table data typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context. To address this, we transform SOTA datasets into a context-independent format, validated by experts as 91.3% context-independent questions, enabling reliable RAG evaluation. Our comprehensive evaluation identifies Hybrid BM25, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, results demonstrate that T²-RAGBench remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance. T²-RAGBench provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online: https://github.com/uhh-hcds/g4kmu-paper
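
The abstract describes Hybrid BM25 only as a combination of dense and sparse vectors and does not state the fusion rule. The following is a minimal sketch of one common way to fuse the two signals, assuming min-max normalization followed by a weighted sum. The function names, the weight `alpha`, and the toy scores are hypothetical and not taken from the paper.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse.

    This weighted sum is an assumed fusion rule for illustration only.
    Documents missing from one score table contribute 0.0 for that signal.
    """
    sparse_n, dense_n = min_max(sparse_scores), min_max(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy scores (hypothetical): BM25-style sparse scores and cosine-style dense scores.
bm25 = {"d1": 12.0, "d2": 3.0, "d3": 7.5}
dense = {"d1": 0.20, "d2": 0.90, "d3": 0.80}
ranking = hybrid_rank(bm25, dense, alpha=0.5)
```

In this toy case `d3` ranks first because it scores reasonably well under both signals, while `d1` and `d2` each do well under only one.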

Narrative in Short German Prose: A Multi-Phenomenon Dataset for Computational Literary Analysis
Hans Ole Hatzel | Haimo Stiemer | Evelyn Gius | Chris Biemann
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

We present the novel dataset GermAnProse, an annotated corpus consisting of four German short prose texts accompanied by an extensive set of narrative-focused annotations. As part of this dataset, we contribute an annotation scheme for mentions, speech, and character agency: Characters in Action (ChiA). GermAnProse also contains information on narrative phenomena: narrativity, semantic verb classes, and plot keyness. Moreover, we include reader reception data in the form of timing information for audiobook performances, indicating pauses between sentences and the time taken to read a specific sentence in a performance. We release the dataset, which contains more than 18,000 manually created standoff annotations in JSON format, enabling researchers to utilize this resource for further exploratory applications.

Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA
Klejda Alushi | Jan Strich | Chris Biemann | Martin Semmann
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used. [GitHub Repository]

LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
Narges Baba Ahmadi | Jan Strich | Martin Semmann | Chris Biemann
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We further propose the Lexical Content Score (LCS), a language-agnostic metric that quantifies the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions. Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code [GitHub Repository] and data [Hugging Face Dataset].

2025

HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation
Naquee Rizwan | Seid Muhie Yimam | Daryna Dementieva | Dr. Florian Skupin | Tim Fischer | Daniil Moskovskiy | Aarushi Ajay Borkar | Robert Geislinger | Punyajoy Saha | Sarthak Roy | Martin Semmann | Alexander Panchenko | Chris Biemann | Animesh Mukherjee
Findings of the Association for Computational Linguistics: ACL 2025

Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measures like detoxification and counterspeech. In our work, which we call HATEPRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research directions for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.

MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Fabian David Schmidt | Florian Schneider | Chris Biemann | Goran Glavaš
Findings of the Association for Computational Linguistics: ACL 2025

Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages – over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N’Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
Gregor Geigle | Florian Schneider | Carolin Holtermann | Chris Biemann | Radu Timofte | Anne Lauscher | Goran Glavaš
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.

Semi-automatic Sequential Sentence Classification in the Discourse Analysis Tool Suite
Tim Fischer | Chris Biemann
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

This paper explores an AI-assisted approach to sequential sentence annotation designed to enhance qualitative data analysis (QDA) workflows within the open-source Discourse Analysis Tool Suite (DATS) developed at our university. We introduce a three-phase Annotation Assistant that leverages the capabilities of large language models (LLMs) to assist researchers during annotation. Based on the number of annotations, the assistant employs zero-shot prompting, few-shot prompting, or fine-tuned models to provide the best suggestions. To evaluate this approach, we construct a benchmark with five diverse datasets. We assess the performance of three prominent open-source LLMs (Llama 3.1, Gemma 2, and Mistral NeMo) and a sequence tagging model based on SentenceTransformers. Our findings demonstrate the effectiveness of our approach, with performance improving as the number of annotated examples increases. Consequently, we implemented the Annotation Assistant within DATS and report the implementation details. With this, we hope to contribute to a novel AI-assisted workflow and further democratize access to AI for qualitative data analysis.
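
The abstract states that the assistant switches between zero-shot prompting, few-shot prompting, and fine-tuned models depending on the number of annotations, but gives no thresholds. The sketch below only illustrates that kind of dispatch. The function name `choose_strategy` and both threshold values are hypothetical placeholders, not values from the paper or from DATS.

```python
def choose_strategy(n_annotations, few_shot_min=1, fine_tune_min=100):
    """Pick a suggestion strategy from the number of available annotations.

    Both thresholds are hypothetical placeholders chosen for illustration.
    """
    if n_annotations < 0:
        raise ValueError("n_annotations must be non-negative")
    if n_annotations >= fine_tune_min:
        return "fine-tuned model"      # enough labeled data to train on
    if n_annotations >= few_shot_min:
        return "few-shot prompting"    # some examples to place in the prompt
    return "zero-shot prompting"       # no examples available yet


# Strategy chosen at three different stages of an annotation project.
stages = [choose_strategy(n) for n in (0, 12, 500)]
```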

Efficient and Effective Coreference Resolution for German
Fynn Petersen-Frey | Hans Ole Hatzel | Chris Biemann
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Florian Schneider | Carolin Holtermann | Chris Biemann | Anne Lauscher
Findings of the Association for Computational Linguistics: ACL 2025

Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.

FASCIST-O-METER: Classifier for Neo-fascist Discourse Online
Rudy Alexandro Garrido Veliz | Martin Semmann | Chris Biemann | Seid Muhie Yimam
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
Anshul Singh | Chris Biemann | Jan Strich
Findings of the Association for Computational Linguistics: EMNLP 2025

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in interpreting visual layouts and text. However, a significant challenge remains in their ability to interpret robustly and reason over multi-tabular data presented as images, a common occurrence in real-world scenarios like web pages and digital documents. Existing benchmarks typically address single tables or non-visual data (text/structured). This leaves a critical gap: they don’t assess the ability to parse diverse table images, correlate information across them, and perform multi-hop reasoning on the combined visual data. To bridge this evaluation gap, we introduce MTabVQA, a novel benchmark specifically designed for multi-tabular visual question answering. MTabVQA comprises 3,745 complex question-answer pairs that necessitate multi-hop reasoning across several visually rendered table images. We provide extensive benchmark results for state-of-the-art VLMs on MTabVQA, revealing significant performance limitations. We further investigate post-training techniques to enhance these reasoning abilities and release MTabVQA-Instruct, a large-scale instruction-tuning dataset. Our experiments show that fine-tuning VLMs with MTabVQA-Instruct substantially improves their performance on visual multi-tabular reasoning. Code and dataset are available online: .

CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
Xintong Wang | Jingheng Pan | Liang Ding | Longyue Wang | Longqin Jiang | Xingshan Li | Chris Biemann
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data. This enables efficient adaptation to diverse downstream tasks. However, the lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications. In this work, we investigate the intrinsic mechanisms of LLMs from a cognitive perspective using eye movement measures. Specifically, we analyze the layer-wise correlation between human cognitive indicators and LLM representations. Building on these insights, we propose a heuristic approach for selecting the optimal steering layer to modulate LLM semantics. To this end, we introduce an efficient selective layer intervention based on prominent parameter-efficient fine-tuning methods, which conventionally adjust either all layers or only the final layer. Additionally, we present an implicit layer contrastive intervention during inference to steer LLMs away from toxic outputs. Extensive experiments on natural language understanding, reasoning, and generation tasks, conducted on GPT-2, LLaMa2-7B, and Mixtral-7B, demonstrate the effectiveness and efficiency of our approach. As a model-agnostic framework, it enhances the interpretability of LLMs while improving efficiency for safe deployment.

CompUGE-Bench: Comparative Understanding and Generation Evaluation Benchmark for Comparative Question Answering
Ahmad Shallouf | Irina Nikishina | Chris Biemann
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

This paper presents CompUGE, a comprehensive benchmark designed to evaluate Comparative Question Answering (CompQA) systems. The benchmark is structured around four core tasks: Comparative Question Identification, Object and Aspect Identification, Stance Classification, and Answer Generation. It unifies multiple datasets and provides a robust evaluation platform to compare various models across these sub-tasks. We also create additional all-encompassing CompUGE datasets by filtering and merging the existing ones. The benchmark for comparative question answering sub-tasks is designed as a web application available on HuggingFace Spaces: https://huggingface.co/spaces/uhhlt/CompUGE-Bench

Large Language Models Are Overparameterized Text Encoders
Thennal D K | Tim Fischer | Chris Biemann
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last % layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L3Prune, a novel layer-pruning strategy based on the model’s initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a performance drop, and the small variant only suffers from a decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
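
The abstract notes that pruning the last layers takes only a few lines in an existing pipeline. As a rough illustration of the idea, the sketch below drops a trailing fraction of a layer stack, using a plain Python list as a stand-in for a model's layer container. The function name `prune_last_layers` and the `fraction` parameter are hypothetical; with a real model the same slice would be applied to the layer container, whose attribute name depends on the library, and this is not the authors' code.

```python
def prune_last_layers(layers, fraction):
    """Return the layer stack with the final `fraction` of layers dropped.

    `layers` is any sequence standing in for a model's layer list
    (an assumption made for illustration).
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_drop = int(len(layers) * fraction)  # number of trailing layers to remove
    n_keep = len(layers) - n_drop         # layers that remain, counted from the input side
    return layers[:n_keep]


# Toy 32-layer stack; dropping 25% keeps the first 24 layers.
stack = [f"layer_{i}" for i in range(32)]
pruned = prune_last_layers(stack, 0.25)
```

Because only trailing layers are removed, the remaining layers keep their original order and weights, which is why a short supervised training run afterwards can be enough to adapt the truncated model.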

Visual Question Answering on Scientific Charts Using Fine-Tuned Vision-Language Models
Florian Schleid | Jan Strich | Chris Biemann
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Scientific charts often encapsulate the core findings of research papers, making the ability to answer questions about these charts highly valuable. This paper explores recent advancements in scientific chart visual question answering (VQA) enabled by large Vision Language Models (VLMs) and newly curated datasets. As part of the SciVQA shared task from the 5th Workshop on Scholarly Document Processing, we develop and evaluate multimodal systems capable of answering diverse question types - including multiple-choice, yes/no, unanswerable, and infinite answer set questions - based on chart images extracted from scientific literature. We investigate the effects of zero-shot and one-shot prompting, as well as supervised fine-tuning (SFT), on the performance of Qwen2.5-VL models (7B and 32B variants). We also tried to include more training data from domain-specific datasets (SpiQA and ArXivQA). Our fine-tuned Qwen2.5-VL 32B model achieves a substantial improvement over the GPT-4o-mini baseline and reaches 4th place in the shared task, highlighting the effectiveness of domain-specific fine-tuning. We published the code for the experiments.

Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites
Xintong Wang | Yixiao Liu | Jingheng Pan | Liang Ding | Longyue Wang | Chris Biemann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Detoxifying offensive language while preserving the speaker’s original intent is a challenging yet critical goal for improving the quality of online interactions. Although large language models (LLMs) show promise in rewriting toxic content, they often default to overly polite rewrites, distorting the emotional tone and communicative intent. This problem is especially acute in Chinese, where toxicity often arises implicitly through emojis, homophones, or discourse context. We present ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity. The dataset comprises 1,556 carefully annotated triplets, each containing a toxic sentence, a sentiment-aligned non-toxic rewrite, and labeled toxic spans. It covers five real-world scenarios: standard expressions, emoji-induced and homophonic toxicity, as well as single-turn and multi-turn dialogues. We evaluate 17 LLMs, including commercial and open-source models with variant architectures, across four dimensions: detoxification accuracy, fluency, content preservation, and sentiment polarity. Results show that while commercial and MoE models perform best overall, all models struggle to balance safety with emotional fidelity in more subtle or context-heavy settings such as emoji, homophone, and dialogue-based inputs. We release ToxiRewriteCN to support future research on controllable, sentiment-aware detoxification for Chinese.

How to Compare Things Properly? A Study of Argument Relevance in Comparative Question Answering
Irina Nikishina | Saba Anwar | Nikolay Dolgov | Maria Manina | Daria Ignatenko | Artem Shelmanov | Chris Biemann
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Comparative Question Answering (CQA) lies at the intersection of Question Answering, Argument Mining, and Summarization. It poses unique challenges due to the inherently subjective nature of many questions and the need to integrate diverse perspectives. Although the CQA task can be addressed using recently emerged instruction-following Large Language Models (LLMs), challenges such as hallucinations in their outputs and the lack of transparent argument provenance remain significant limitations. To address these challenges, we construct a manually curated dataset comprising arguments annotated with their relevance. These arguments are further used to answer comparative questions, enabling precise traceability and faithfulness. Furthermore, we define explicit criteria for an “ideal” comparison and introduce a benchmark for evaluating the outputs of various Retrieval-Augmented Generation (RAG) models with respect to argument relevance. All code and data are publicly released to support further research.

CollEX – A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections
Florian Schneider | Narges Baba Ahmadi | Niloufar Baba Ahmadi | Iris Vogel | Martin Semmann | Chris Biemann
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)

In this paper, we introduce CollEx, an innovative multimodal agentic Retrieval-Augmented Generation (RAG) system designed to enhance interactive exploration of extensive scientific collections. Given the overwhelming volume and inherent complexity of scientific collections, conventional search systems often lack necessary intuitiveness and interactivity, presenting substantial barriers for learners, educators, and researchers. CollEx addresses these limitations by employing state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents accessible through an intuitive chat interface. By abstracting complex interactions via specialized agents equipped with advanced tools, CollEx facilitates curiosity-driven exploration, significantly simplifying access to diverse scientific collections and records therein. Our system integrates textual and visual modalities, supporting educational scenarios that are helpful for teachers, pupils, students, and researchers by fostering independent exploration as well as scientific excitement and curiosity. Furthermore, CollEx serves the research community by discovering interdisciplinary connections and complementing visual data. We illustrate the effectiveness of our system through a proof-of-concept application containing over 64,000 unique records across 32 collections from a local scientific collection from a public university.

2024

Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse
Abinew Ali Ayele | Esubalew Alemneh Jalew | Adem Chanie Ali | Seid Muhie Yimam | Chris Biemann
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hate intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia’s sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performance, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignments.

Extending the Discourse Analysis Tool Suite with Whiteboards for Visual Qualitative Analysis
Tim Fischer | Florian Schneider | Fynn Petersen-Frey | Anja Silvia Mollah Haque | Isabel Eiser | Gertraud Koch | Chris Biemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this system demonstration paper, we describe the Whiteboards extension for an existing web-based platform for digital qualitative discourse analysis. Whiteboards comprise interactive graph-based interfaces to organize and manipulate objects, which can be qualitative research data, such as documents, images, etc., and analyses of these research data, such as annotations, tags, and code structures. The proposed extension offers a customizable view of the material and a wide range of actions that enable new ways of interacting and working with such resources. We show that the visualizations facilitate various use cases of qualitative data analysis, including reflection of the research process through sampling maps, creation of actor networks, and refining code taxonomies.

Detecting Hate Speech in Amharic Using Multimodal Analysis of Social Media Memes
Melese Ayichlie Jigar | Abinew Ali Ayele | Seid Muhie Yimam | Chris Biemann
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

In contemporary society, the proliferation of hate speech is increasingly prevalent across various social media platforms, with a notable trend of incorporating memes to amplify its visual impact and reach. The conventional text-based detection approaches frequently fail to address the complexities introduced by memes, thereby aggravating the challenges, particularly in low-resource languages such as Amharic. We develop Amharic meme hate speech detection models using 2,000 memes collected from Facebook, Twitter, and Telegram over four months. We employ native Amharic speakers to annotate each meme using a web-based tool, yielding a Fleiss’ kappa score of 0.50. We utilize different feature extraction techniques, namely VGG16 for images and word2vec for textual content, and build unimodal and multimodal models such as LSTM, BiLSTM, and CNN. The BiLSTM model shows the best performance, achieving 63% accuracy for text and 75% for multimodal features. In image-only experiments, the CNN model achieves 69% accuracy. Multimodal models demonstrate superior performance in detecting Amharic hate speech in memes, showcasing their potential to address the unique challenges posed by meme-based hate speech on social media.
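
The abstract reports inter-annotator agreement as a Fleiss' kappa score. For readers unfamiliar with the statistic, here is a small sketch of the standard Fleiss' kappa computation. The toy rating table (four memes, three annotators, two labels) is invented for illustration and is unrelated to the paper's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters for every item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of ratings in each category.
    n_cats = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Invented toy table: columns are (hate, not hate), three annotators per meme.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy)
```

The sketch divides by `1 - p_exp`, so it is undefined when all ratings fall into a single category.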

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding
Xintong Wang | Jingheng Pan | Liang Ding | Chris Biemann
Findings of the Association for Computational Linguistics: ACL 2024

Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts distributions from standard and instruction disturbance, thereby increasing alignment uncertainty and effectively subtracting hallucinated concepts from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.
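
The abstract says ICD contrasts the distribution under the standard instruction with the one under a disturbance instruction, but does not give the formula. The sketch below uses the generic contrastive-decoding form `(1 + alpha) * standard - alpha * disturbed` on toy logits; this exact form, the weight `alpha`, and the numbers are assumptions for illustration rather than the paper's specification.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(standard, disturbed, alpha=1.0):
    """Amplify the standard logits and subtract the disturbed ones.

    The (1 + alpha) / alpha weighting is an assumed, generic
    contrastive-decoding form, not taken from the abstract.
    """
    return [(1 + alpha) * s - alpha * d for s, d in zip(standard, disturbed)]


# Toy vocabulary of three tokens. Token 1 stands in for a hallucinated
# concept whose logit rises under the disturbance instruction.
standard = [2.0, 1.9, 0.1]
disturbed = [1.0, 2.5, 0.1]
probs_before = softmax(standard)
probs_after = softmax(contrast_logits(standard, disturbed, alpha=1.0))
```

In this toy case the probability of token 1 falls after contrasting, because the disturbance instruction raised its logit more than it did for the other tokens.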

Probing Large Language Models from a Human Behavioral Perspective
Xintong Wang | Xiaoyu Li | Xingshan Li | Chris Biemann
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024

Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP. However, their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA), remain largely unexplored. In this work, we probe LLMs from a human behavioral perspective, correlating values from LLMs with eye-tracking measures, which are widely recognized as meaningful indicators of human reading patterns. Our findings reveal that LLMs exhibit a prediction pattern similar to that of humans but distinct from that of Shallow Language Models (SLMs). Moreover, as layer depth increases from the middle layers onward, the correlation coefficients also increase in FFN and MHSA, indicating that the logits within FFN increasingly encapsulate word semantics suitable for predicting tokens from the vocabulary.

Exploring Large Language Models for Qualitative Data Analysis
Tim Fischer | Chris Biemann
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper explores the potential of Large Language Models (LLMs) to enhance qualitative data analysis (QDA) workflows within the open-source QDA platform developed at our university. We identify several opportunities within a typical QDA workflow where AI assistance can boost researcher productivity and translate these opportunities into corresponding NLP tasks: document classification, information extraction, span classification, and text generation. A benchmark tailored to these QDA activities is constructed, utilizing English and German datasets that align with relevant use cases. Focusing on efficiency and accessibility, we evaluate the performance of three prominent open-source LLMs - Llama 3.1, Gemma 2, and Mistral NeMo - on this benchmark. Our findings reveal the promise of LLM integration for streamlining QDA workflows, particularly for English-language projects. Consequently, we have implemented the LLM Assistant as an opt-in feature within our platform and report the implementation details. With this, we hope to further democratize access to AI capabilities for qualitative data analysis.

Dataset of Quotation Attribution in German News Articles
Fynn Petersen-Frey | Chris Biemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Extracting who says what to whom is a crucial part of analyzing human communication in today’s abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context, and to whom, and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.

On Improving Repository-Level Code QA for Large Language Models
Jan Strich | Florian Schneider | Irina Nikishina | Chris Biemann
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large Language Models (LLMs) such as ChatGPT, GitHub Copilot, Llama, or Mistral assist programmers as copilots and knowledge sources to make the coding process faster and more efficient. This paper aims to improve the copilot performance by implementing different self-alignment processes and retrieval-augmented generation (RAG) pipelines, as well as their combination. To test the effectiveness of all approaches, we create a dataset and apply a model-based evaluation, using an LLM as a judge. The dataset is designed to check the model’s abilities to understand the source code semantics, the dependency between files, and the overall meta-information about the repository. We also compare our approach with other existing solutions, e.g. ChatGPT-3.5, and evaluate on the existing benchmarks. Code and dataset are available online (https://anonymous.4open.science/r/ma_llm-382D).

Story Embeddings — Narrative-Focused Representations of Fictional Stories
Hans Ole Hatzel | Chris Biemann
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present a novel approach to modeling fictional narratives. The proposed model creates embeddings that represent a story such that similar narratives, that is, reformulations of the same story, will result in similar embeddings. We showcase the prowess of our narrative-focused embeddings on various datasets, exhibiting state-of-the-art performance on multiple retrieval tasks. The embeddings also show promising results on a narrative understanding task. Additionally, we perform an annotation-based evaluation to validate that our introduced computational notion of narrative similarity aligns with human perception. The approach can help to explore vast datasets of stories, with potential applications in recommender systems and in the computational analysis of literature.

Fine-grained quotation detection and attribution in German news articles
Fynn Petersen-Frey | Chris Biemann
Proceedings of the 20th Conference on Natural Language Processing (KONVENS 2024)

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

Low-Resource Machine Translation through the Lens of Personalized Federated Learning
Viktor Moskvoretskii | Nazarii Tupitsa | Chris Biemann | Samuel Horváth | Eduard Gorbunov | Irina Nikishina
Findings of the Association for Computational Linguistics: EMNLP 2024

We present a new approach called MeritOpt based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the datasets of South East Asian and Finno-Ugric languages. In addition to its effectiveness, MeritOpt is also highly interpretable, as it can be applied to track the impact of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply with a few lines of code, and we provide scripts for reproducing the experiments (https://github.com/VityaVitalich/MeritOpt).

UHH at AVeriTeC: RAG for Fact-Checking with Real-World Claims
Özge Sevgili | Irina Nikishina | Seid Muhie Yimam | Martin Semmann | Chris Biemann
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

This paper presents UHH’s approach developed for the AVeriTeC shared task. The goal of the challenge is to verify given real-world claims with evidences from the Web. In this shared task, we investigate a Retrieval-Augmented Generation (RAG) model, which mainly contains retrieval, generation, and augmentation components. We start with the selection of the top 10k evidences via BM25 scores, and continue with two approaches to retrieve the most similar evidences: (1) to retrieve top 10 evidences through vector similarity, generate questions for them, and rerank them or (2) to generate questions for the claim and retrieve the most similar evidence, again, through vector similarity. After retrieving the top evidences, a Large Language Model (LLM) is prompted using the claim along with either all evidences or individual evidence to predict the label. Our system submission, UHH, using the first approach and individual evidence prompts, ranks 6th out of 23 systems.

VIDA: The Visual Incel Data Archive. A Theory-oriented Annotated Dataset To Enhance Hate Detection Through Visual Culture
Selenia Anastasi | Florian Schneider | Chris Biemann | Tim Fischer
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Images increasingly constitute a larger portion of internet content, encoding even more complex meanings. Recent studies have highlighted the pivotal role of visual communication in the spread of extremist content, particularly that associated with right-wing political ideologies. However, the capability of machine learning systems to recognize such meanings, sometimes implicit, remains limited. To enable future research in this area, we introduce and release VIDA, the Visual Incel Data Archive, a multimodal dataset comprising visual material and internet memes collected from two main Incel communities (Italian and Anglophone) known for their extremist misogynistic content. Following the analytical framework of Shifman (2014), we propose a new taxonomy for annotation across three main levels of analysis: content, form, and stance (hate). This allows for the association of images with fine-grained contextual information that helps to identify the presence of offensiveness and a broader set of cultural references, enhancing the understanding of more nuanced aspects in visual communication. In this work, we present a statistical analysis of the annotated dataset and discuss annotation examples and future lines of research.

WISMIR3: A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
Florian Schneider | Chris Biemann
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

This paper presents WISMIR3, a multi-modal dataset comprising roughly 300K text-image pairs from Wikipedia. With a sophisticated automatic ETL pipeline, we scraped, filtered, and transformed the data so that WISMIR3 intrinsically differs from other popular text-image datasets like COCO and Flickr30k. We prove this difference by comparing various linguistic statistics between the three datasets computed using the pipeline. The primary purpose of WISMIR3 is to use it as a benchmark to challenge state-of-the-art text-image retrieval approaches, which already reach around 90% Recall@5 scores on the mentioned popular datasets. Therefore, we ran several text-image retrieval experiments on our dataset using current models, which show that the models, in fact, perform significantly worse compared to evaluation results on COCO and Flickr30k. In addition, for each text-image pair, we release features computed by Faster-R-CNN and CLIP models. With this, we want to ease and motivate the use of the dataset for other researchers.

Tell Me Again! a Large-Scale Dataset of Multiple Summaries for the Same Story
Hans Ole Hatzel | Chris Biemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A wide body of research is concerned with the semantics of narratives, both in terms of understanding narratives and generating fictional narratives and stories. We provide a dataset of summaries to be used as a proxy for entire stories or for the analysis of the summaries themselves. Our dataset consists of a total of 96,831 individual summaries across 29,505 stories. We intend for the dataset to be used for training and evaluation of embedding representations for stories, specifically the stories’ narratives. The summary data is harvested from five different language versions of Wikipedia. Our dataset comes with rich metadata, which we extract from Wikidata, enabling a wide range of applications that operate on story summaries in conjunction with metadata. To set baseline results, we run retrieval experiments on the dataset, exploring the capability of similarity models in retrieving summaries of the same story. For this retrieval, a crucial element is to not place too much emphasis on the named entities, as this can enable retrieval of other summaries for the same work without taking the narrative into account.

Coreference in Long Documents using Hierarchical Entity Merging
Talika Gupta | Hans Ole Hatzel | Chris Biemann
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

Current top-performing coreference resolution approaches are limited with regard to the maximum length of texts they can accept. We explore a recursive merging technique of entities that allows us to apply coreference models to texts of arbitrary length, as found in many narrative genres. In experiments on established datasets, we quantify the drop in resolution quality caused by this approach. Finally, we use an under-explored resource in the form of a fully coreference-annotated novel to illustrate our model’s performance for long documents in practice. Here, we achieve state-of-the-art performance, outperforming previous systems capable of handling long documents.

On Zero-Shot Counterspeech Generation by LLMs
Punyajoy Saha | Aalok Agrawal | Abhik Jana | Chris Biemann | Animesh Mukherjee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performances of four LLMs, namely GPT-2, DialoGPT, ChatGPT and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. On the other hand, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that there is an improvement in generation quality for two datasets (17%); however, the toxicity increases (25%) with increase in model size. Considering the type of model, GPT-2 and FlanT5 models are significantly better in terms of counterspeech quality but also have high toxicity as compared to DialoGPT. ChatGPT is much better at generating counterspeech than other models across all metrics. In terms of prompting, we find that our proposed strategies help in improving counterspeech generation across all the models.

Concept Over Time Analysis: Unveiling Temporal Patterns for Qualitative Data Analysis
Tim Fischer | Florian Schneider | Robert Geislinger | Florian Helfer | Gertraud Koch | Chris Biemann
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

In this system demonstration paper, we present the Concept Over Time Analysis extension for the Discourse Analysis Tool Suite. The proposed tool empowers users to define, refine, and visualize their concepts of interest within an interactive interface. Adhering to the Human-in-the-loop paradigm, users can give feedback through sentence annotations. Utilizing few-shot sentence classification, the system employs Sentence Transformers to compute representations of sentences and concepts. Through an iterative process involving semantic similarity searches, sentence annotation, and fine-tuning with contrastive data, the model continuously refines, providing users with enhanced analysis outcomes. The final output is a timeline visualization of sentences classified to concepts. Especially suited for the Digital Humanities, Concept Over Time Analysis serves as a valuable tool for qualitative data analysis within extensive datasets. The chronological overview of concepts enables researchers to uncover patterns, trends, and shifts in discourse over time.

Sövereign at The Perspective Argument Retrieval Shared Task 2024: Using LLMs with Argument Mining
Robert Günzler | Özge Sevgili | Steffen Remus | Chris Biemann | Irina Nikishina
Proceedings of the 11th Workshop on Argument Mining (ArgMining 2024)

This paper presents the Sövereign submission for the shared task on perspective argument retrieval for the Argument Mining Workshop 2024. The main challenge is to perform argument retrieval considering socio-cultural aspects such as political interests, occupation, age, and gender. To address the challenge, we apply open-access Large Language Models (Mistral-7b) in a zero-shot fashion for re-ranking and explicit similarity scoring. Additionally, we combine different features in an ensemble setup using logistic regression. Our system ranks second in the competition for all test set rounds on average for the logistic regression approach using LLM similarity scores as a feature. In addition to the description of the approach, we also provide further results of our ablation study. Our code will be open-sourced upon acceptance.

CAM 2.0: End-to-End Open Domain Comparative Question Answering System
Ahmad Shallouf | Hanna Herasimchyk | Mikhail Salnikov | Rudy Alexandro Garrido Veliz | Natia Mestvirishvili | Alexander Panchenko | Chris Biemann | Irina Nikishina
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Comparative Question Answering (CompQA) is a Natural Language Processing task that combines Question Answering and Argument Mining approaches to answer subjective comparative questions in an efficient argumentative manner. In this paper, we present an end-to-end (full pipeline) system for answering comparative questions called CAM 2.0 as well as a public leaderboard called CompUGE that unifies the existing datasets under a single easy-to-use evaluation suite. As compared to previous web-form-based CompQA systems, it features question identification, object and aspect labeling, stance classification, and summarization using up-to-date models. We also select the most time- and memory-effective pipeline by comparing separately fine-tuned Transformer Encoder models which show state-of-the-art performance on the subtasks with Generative LLMs in few-shot and LoRA setups. We also conduct a user study for a whole-system evaluation.

2023

CodeAnno: Extending WebAnno with Hierarchical Document Level Annotation and Automation
Florian Schneider | Seid Muhie Yimam | Fynn Petersen-Frey | Gerret von Nordheim | Katharina Kleinen-von Königslöw | Chris Biemann
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

WebAnno is one of the most popular annotation tools that supports generic annotation types and distributive annotation with multiple user roles. However, WebAnno focuses on annotating span-level mentions and relations among them, making document-level annotation complicated. When it comes to the annotation and analysis of social science materials, it usually involves the creation of codes to categorize a given document. The codes, which are known as codebooks, are typically hierarchical, which makes it possible to code the document either with a general category or more fine-grained subcategories. CodeAnno is forked from WebAnno and designed to solve the coding problems faced by many social science researchers with the following main functionalities: 1) creation of hierarchical codebooks, with functionality to move and sort categories in the hierarchy; 2) an interactive UI for codebook annotation; 3) import and export of annotations in CSV format, hence being compatible with existing annotations conducted using spreadsheet applications; 4) integration of an external automation component to facilitate coding using machine learning; 5) project templating that allows duplicating a project structure without copying the actual documents. We present different use-cases to demonstrate the capability of CodeAnno. A short demonstration video of the system is available here: https://www.youtube.com/watch?v=RmCdTghBe-s

Predicting Terms in IS-A Relations with Pre-trained Transformers
Irina Nikishina | Polina Chernomorchenko | Anastasiia Demidova | Alexander Panchenko | Chris Biemann
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing
Debayan Banerjee | Pranav Nair | Ricardo Usbeck | Chris Biemann
Findings of the Association for Computational Linguistics: ACL 2023

In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are predominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.

The D-WISE Tool Suite: Multi-Modal Machine-Learning-Powered Tools Supporting and Enhancing Digital Discourse Analysis
Florian Schneider | Tim Fischer | Fynn Petersen-Frey | Isabel Eiser | Gertraud Koch | Chris Biemann
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

This work introduces the D-WISE Tool Suite (DWTS), a novel working environment for digital qualitative discourse analysis in the Digital Humanities (DH). The DWTS addresses limitations of current DH tools induced by the ever-increasing amount of heterogeneous, unstructured, and multi-modal data in which the discourses of contemporary societies are encoded. To provide meaningful insights from such data, our system leverages and combines state-of-the-art machine learning technologies from Natural Language Processing and Computer Vision. Further, the DWTS is conceived and developed by an interdisciplinary team of cultural anthropologists and computer scientists to ensure the tool’s usability for modern DH research. Central features of the DWTS are: a) import of multi-modal data like text, image, audio, and video; b) preprocessing pipelines for automatic annotations; c) lexical and semantic search of documents; d) manual span, bounding box, time-span, and frame annotations; e) documentation of the research process.

Multilingual Racial Hate Speech Detection Using Transfer Learning
Abinew Ali Ayele | Skadi Dinter | Seid Muhie Yimam | Chris Biemann
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The rise of social media eases the spread of hateful content, especially racist content with severe consequences. In this paper, we analyze the tweets targeting the death of George Floyd in May 2020 as the event accelerated debates on racism globally. We focus on the tweets published in French for a period of one month since the death of Floyd. Using the Yandex Toloka platform, we annotate the tweets into categories as hate, offensive or normal. Tweets that are offensive or hateful are further annotated as racial or non-racial. We build French hate speech detection models based on the multilingual BERT and CamemBERT and apply transfer learning by fine-tuning the HateXplain model. We compare different approaches to resolve annotation ties and find that the detection model based on CamemBERT yields the best results in our experiments.

Multi-Modal Learning Application – Support Language Learners with NLP Techniques and Eye-Tracking
Robert Geislinger | Ali Ebrahimi Pourasad | Deniz Gül | Daniel Djahangir | Seid Muhie Yimam | Steffen Remus | Chris Biemann
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

Using Wikidata for Enhancing Compositionality in Pretrained Language Models
Meriem Beloucif | Mihir Bansal | Chris Biemann
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

One of the many advantages of pre-trained language models (PLMs) such as BERT and RoBERTa is their flexibility and contextual nature. These features give PLMs strong capabilities for representing lexical semantics. However, PLMs seem incapable of capturing high-level semantics in terms of compositionality. We show that when augmented with the relevant semantic knowledge, PLMs learn to capture a higher degree of lexical compositionality. We annotate a large dataset from Wikidata highlighting a type of semantic inference that is easy for humans to understand but difficult for PLMs, like the correlation between age and date of birth. We use this resource for finetuning DistilBERT, BERT large and RoBERTa. Our results show that the performance of PLMs against the test data continuously improves when augmented with such a rich resource. Our results are corroborated by a consistent improvement over most GLUE benchmark natural language understanding tasks.

From Qualitative to Quantitative Research: Semi-Automatic Annotation Scaling in the Digital Humanities
Fynn Petersen-Frey | Tim Fischer | Florian Schneider | Isabel Eiser | Gertraud Koch | Chris Biemann
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

LT at SemEval-2023 Task 1: Effective Zero-Shot Visual Word Sense Disambiguation Approaches using External Knowledge Sources
Florian Schneider | Chris Biemann
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The objective of the SemEval-2023 Task 1: Visual Word Sense Disambiguation (VWSD) is to identify the image illustrating the intended meaning of a target word and some minimal additional context. The omnipresence of textual and visual data in the task strongly suggests the utilization of the recent advances in multi-modal machine learning, i.e., pretrained visiolinguistic models (VLMs). Often referred to as foundation models due to their strong performance on many vision-language downstream tasks, these models further demonstrate powerful zero-shot capabilities. In this work, we utilize various pretrained VLMs in a zero-shot fashion for multiple approaches using external knowledge sources to enrich the contextual information. Further, we evaluate our methods on the final test data and extensively analyze the suitability of different knowledge sources, the influence of training data, model sizes, multi-linguality, and different textual prompting strategies. Although we are not among the best-performing systems (rank 20 of 56), our experiments described in this work show competitive results. Moreover, we aim to contribute meaningful insights and propel multi-modal machine learning tasks like VWSD.

Narrative Cloze as a Training Objective: Towards Modeling Stories Using Narrative Chain Embeddings
Hans Ole Hatzel | Chris Biemann
Proceedings of the 5th Workshop on Narrative Understanding

We present a novel approach to modeling narratives using narrative chain embeddings. A new dataset of narrative chains extracted from German news texts is presented. With neural methods, we produce models for both German and English that achieve state-of-the-art performance on the Multiple Choice Narrative Cloze task. Subsequently, we perform an extrinsic evaluation of the embeddings our models produce and show that they perform rather poorly in identifying narratively similar texts. We explore some of the reasons for this underperformance and discuss the upsides of our approach. We provide an outlook on alternative ways to model narratives, as well as techniques for evaluating such models.

Exploring Amharic Hate Speech Data Collection and Classification Approaches
Abinew Ali Ayele | Seid Muhie Yimam | Tadesse Destaw Belay | Tesfa Asfaw | Chris Biemann
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

In this paper, we present a study of efficient data selection and annotation strategies for Amharic hate speech. We also build various classification models and investigate the challenges of hate speech data selection, annotation, and classification for the Amharic language. From a total of over 18 million tweets in our Twitter corpus, 15.1k tweets are annotated by two independent native speakers, and a Cohen’s kappa score of 0.48 is achieved. A third annotator, a curator, is also employed to decide on the final gold labels. We employ both classical machine learning and deep learning approaches, which include fine-tuning AmFLAIR and AmRoBERTa contextual embedding models. Among all the models, AmFLAIR achieves the best performance with an F1-score of 72%. We publicly release the annotation guidelines, keywords/lexicon entries, datasets, models, and associated scripts with a permissive license.

2022

Modeling Referential Gaze in Task-oriented Settings of Varying Referential Complexity
Özge Alaçam | Eugen Ruppert | Ganeshan Malhotra | Chris Biemann | Sina Zarrieß
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Referential gaze is a fundamental phenomenon for psycholinguistics and human-human communication. However, modeling referential gaze for real-world scenarios, e.g. for task-oriented communication, has not received the attention it deserves from the NLP community. In this paper, we address this challenging issue by proposing a novel multimodal NLP task, namely predicting when the gaze is referential. We further investigate how to model referential gaze and transfer gaze features to adapt to unseen situated settings that target different referential complexities than the training environment. We train (i) a sequential attention-based LSTM model and (ii) a multivariate transformer encoder architecture to predict whether the gaze is on a referent object. The models are evaluated on the three complexity datasets. The results indicate that the gaze features can be transferred not only among various similar tasks and scenes but also across various complexity levels. Taking the referential complexity of a scene into account is important for successful target prediction using gaze parameters, especially when there is not much data for fine-tuning.

MOTIF: Contextualized Images for Complex Words to Improve Human Reading
Xintong Wang | Florian Schneider | Özge Alacam | Prateek Chaudhury | Chris Biemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

MOTIF (MultimOdal ConTextualized Images For Language Learners) is a multimodal dataset that consists of 1125 comprehension texts retrieved from Wikipedia Simple Corpus. Allowing multimodal processing or enriching the context with multimodal information has proven imperative for many learning tasks, specifically for second language (L2) learning. In this respect, several traditional NLP approaches can assist L2 readers in text comprehension processes, such as simplifying text or giving dictionary descriptions for complex words. As nicely stated in the well-known proverb, sometimes “a picture is worth a thousand words” and an image can successfully complement the verbal message by enriching the representation, like in Pictionary books. This multimodal support can also assist on-the-fly text reading experience by providing a multimodal tool that chooses and displays the most relevant images for the difficult words, given the text context. This study mainly focuses on one of the key components to achieving this goal; collecting a multimodal dataset enriched with complex word annotation and validated image match.

Elvis vs. M. Jackson: Who has More Albums? Classification and Identification of Elements in Comparative Questions
Meriem Beloucif | Seid Muhie Yimam | Steffen Stahlhacke | Chris Biemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Comparative Question Answering (cQA) is the task of providing concrete and accurate responses to queries such as: “Is Lyft cheaper than a regular taxi?” or “What makes a mortgage different from a regular loan?”. In this paper, we propose two new open-domain real-world datasets for identifying and labeling comparative questions. While the first dataset contains instances of English questions labeled as comparative vs. non-comparative, the second dataset provides additional labels including the objects and the aspects of comparison. We conduct several experiments that evaluate the soundness of our datasets. The evaluation of our datasets using various classifiers shows promising results that reach close-to-human results on a binary classification task with a neural model using ALBERT embeddings. When approaching the unsupervised sequence labeling task, some headroom remains.

Improved Open Source Automatic Subtitling for Lecture Videos
Robert Geislinger | Benjamin Milde | Chris Biemann
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

Question Answering Classification for Amharic Social Media Community Based Questions
Tadesse Destaw Belay | Seid Muhie Yimam | Abinew Ayele | Chris Biemann
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

In this work, we build a Question Answering (QA) classification dataset from a social media platform, namely the Telegram public channel called @AskAnythingEthiopia. The channel has more than 78k subscribers and has existed since May 31, 2019. The platform allows asking questions that belong to various domains, like politics, economics, health, education, and so on. Since the questions are posed in mixed code, we apply different strategies to pre-process the dataset. Questions are posted in Amharic, English, or Amharic written in Latin script. As part of the pre-processing tools, we build a Latin to Ethiopic Script transliteration tool. We collect 8k Amharic and 24k transliterated questions and develop deep learning-based question answering classifiers that attain an F-score as high as 57.29 across 20 different question classes or categories. The datasets and pre-processing scripts are open-sourced to facilitate further research on Amharic community-based question answering.

Language over Labels: Contrastive Language Supervision Exceeds Purely Label-Supervised Classification Performance on Chest X-Rays
Anton Wiehe | Florian Schneider | Sebastian Blank | Xintong Wang | Hans-Peter Zorn | Christian Biemann
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop

The multi-modal foundation model CLIP computes representations from texts and images that achieved unprecedented performance on tasks such as zero-shot image classification. However, CLIP was pretrained on public internet data. Thus it lacks highly domain-specific knowledge. We investigate the adaptation of CLIP-based models to the chest radiography domain using the MIMIC-CXR dataset. We show that the features of the pretrained CLIP models do not transfer to this domain. We adapt CLIP to the chest radiography domain using contrastive language supervision and show that this approach yields a model that outperforms supervised learning on labels on the MIMIC-CXR dataset while also generalizing to the CheXpert and RSNA Pneumonia datasets. Furthermore, we do a detailed ablation study of the batch and dataset size. Finally, we show that language supervision allows for better explainability by using the multi-modal model to generate images from texts such that experts can inspect what the model has learned.

More Like This: Semantic Retrieval with Linguistic Information
Steffen Remus | Gregor Wiedemann | Saba Anwar | Fynn Petersen-Frey | Seid Muhie Yimam | Chris Biemann
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

Classification of German Jungian Extraversion and Introversion Texts with Assessment of Changes During the COVID-19 Pandemic
Dirk Johannßen | Chris Biemann | David Scheffer
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

The corona pandemic and countermeasures such as social distancing and lockdowns have confronted individuals with new challenges for their mental health and well-being. It can be assumed that the Jungian psychology types of extraverts and introverts react differently to these challenges. We propose a Bi-LSTM model with an attention mechanism for classifying introversion and extraversion from German tweets, which is trained on hand-labeled data created by 335 participants. With this work, we provide this novel dataset for free use and validation. The proposed model achieves solid performance with F1 = .72. Furthermore, we created a feature engineered logistic model tree (LMT) trained on hand-labeled tweets, to which the data is also made available with this work. With this second model, German tweets before and during the pandemic have been investigated. Extraverts display more positive emotions, whilst introverts show more insight and higher rates of anxiety. Even though such a model can not replace proper psychological diagnostics, it can help shed light on linguistic markers and to help understand introversion and extraversion better for a variety of applications and investigations.

Dataset of Student Solutions to Algorithm and Data Structure Programming Assignments
Fynn Petersen-Frey | Marcus Soll | Louis Kobras | Melf Johannsen | Peter Kling | Chris Biemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a dataset containing source code solutions to algorithmic programming exercises solved by hundreds of Bachelor-level students at the University of Hamburg. These solutions were collected during the winter semesters 2019/2020, 2020/2021 and 2021/2022. The dataset contains a set of solutions to a total of 21 tasks written in Java as well as Python and a total of over 1500 individual solutions. All solutions were submitted through Moodle and the Coderunner plugin and passed a number of test cases (including randomized tests), such that they can be considered as working correctly. All students whose solutions are included in the dataset gave their consent into publishing their solutions. The solutions are pseudonymized with a random solution ID. Included in this paper is a short analysis of the dataset containing statistical data and highlighting a few anomalies (e.g. the number of solutions per task decreases for the last few tasks due to grading rules). We plan to extend the dataset with tasks and solutions from upcoming courses.

Measuring Faithfulness of Abstractive Summaries
Tim Fischer | Steffen Remus | Chris Biemann
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

2021

Towards Multi-Modal Text-Image Retrieval to improve Human Reading
Florian Schneider | Özge Alaçam | Xintong Wang | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In primary school, children’s books, as well as in modern language learning apps, multi-modal learning strategies like illustrations of terms and phrases are used to support reading comprehension. Also, several studies in educational psychology suggest that integrating cross-modal information will improve reading comprehension. We claim that state-of-the-art multi-modal transformers, which could be used in a language learner context to improve human reading, will perform poorly because of the short and relatively simple textual data those models are trained with. To prove our hypotheses, we collected a new multi-modal image-retrieval dataset based on data from Wikipedia. In an in-depth data analysis, we highlight the differences between our dataset and other popular datasets. Additionally, we evaluate several state-of-the-art multi-modal transformers on text-image retrieval on our dataset and analyze their meager results, which verify our claims.

An Investigation towards Differentially Private Sequence Tagging in a Federated Framework
Abhik Jana | Chris Biemann
Proceedings of the Third Workshop on Privacy in Natural Language Processing

To build machine learning-based applications for sensitive domains like medical, legal, etc. where the digitized text contains private information, anonymization of text is required for preserving privacy. Sequence tagging, e.g. as done in Named Entity Recognition (NER) can help to detect private information. However, to train sequence tagging models, a sufficient amount of labeled data are required but for privacy-sensitive domains, such labeled data also can not be shared directly. In this paper, we investigate the applicability of a privacy-preserving framework for sequence tagging tasks, specifically NER. Hence, we analyze a framework for the NER task, which incorporates two levels of privacy protection. Firstly, we deploy a federated learning (FL) framework where the labeled data are not shared with the centralized server as well as the peer clients. Secondly, we apply differential privacy (DP) while the models are being trained in each client instance. While both privacy measures are suitable for privacy-aware models, their combination results in unstable models. To our knowledge, this is the first study of its kind on privacy-aware sequence tagging models.

Word Complexity is in the Eye of the Beholder
Sian Gooding | Ekaterina Kochmar | Seid Muhie Yimam | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Lexical complexity is a highly subjective notion, yet this factor is often neglected in lexical simplification and readability systems which use a ”one-size-fits-all” approach. In this paper, we investigate which aspects contribute to the notion of lexical complexity in various groups of readers, focusing on native and non-native speakers of English, and how the notion of complexity changes depending on the proficiency level of a non-native reader. To facilitate reproducibility of our approach and foster further research into these aspects, we release a dataset of complex words annotated by readers with different backgrounds.

Forum 4.0: An Open-Source User Comment Analysis Framework
Marlo Haering | Jakob Smedegaard Andersen | Chris Biemann | Wiebke Loosen | Benjamin Milde | Tim Pietz | Christian Stöcker | Gregor Wiedemann | Olaf Zukunft | Walid Maalej
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

With the increasing number of user comments in diverse domains, including comments on online journalism and e-commerce websites, the manual content analysis of these comments becomes time-consuming and challenging. However, research showed that user comments contain useful information for different domain experts, which is thus worth finding and utilizing. This paper introduces Forum 4.0, an open-source framework to semi-automatically analyze, aggregate, and visualize user comments based on labels defined by domain experts. We demonstrate the applicability of Forum 4.0 with comments analytics scenarios within the domains of online journalism and app stores. We outline the underlying container architecture, including the web-based user interface, the machine learning component, and the task manager for time-consuming tasks. We finally conduct machine learning experiments with simulated annotations and different sampling strategies on existing datasets from both domains to evaluate Forum 4.0’s performance. Forum 4.0 achieves promising classification results (ROC-AUC ≥ 0.9 with 100 annotated samples), utilizing transformer-based embeddings with a lightweight logistic regression model. We explain how Forum 4.0’s architecture is applicable for millions of user comments in real-time, yet at feasible training and classification costs.

SCoT: Sense Clustering over Time: a tool for the analysis of lexical change
Christian Haase | Saba Anwar | Seid Muhie Yimam | Alexander Friedrich | Chris Biemann
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We present Sense Clustering over Time (SCoT), a novel network-based tool for analysing lexical change. SCoT represents the meanings of a word as clusters of similar words. It visualises their formation, change, and demise. There are two main approaches to the exploration of dynamic networks: the discrete one compares a series of clustered graphs from separate points in time. The continuous one analyses the changes of one dynamic network over a time-span. SCoT offers a new hybrid solution. First, it aggregates time-stamped documents into intervals and calculates one sense graph per discrete interval. Then, it merges the static graphs to a new type of dynamic semantic neighbourhood graph over time. The resulting sense clusters offer uniquely detailed insights into lexical change over continuous intervals with model transparency and provenance. SCoT has been successfully used in a European study on the changing meaning of ‘crisis’.

ActiveAnno: General-Purpose Document-Level Annotation Tool with Active Learning Integration
Max Wiechmann | Seid Muhie Yimam | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

ActiveAnno is an annotation tool focused on document-level annotation tasks developed both for industry and research settings. It is designed to be a general-purpose tool with a wide variety of use cases. It features a modern and responsive web UI for creating annotation projects, conducting annotations, adjudicating disagreements, and analyzing annotation results. ActiveAnno embeds a highly configurable and interactive user interface. The tool also integrates a RESTful API that enables integration into other software systems, including an API for machine learning integration. ActiveAnno is built with extensible design and easy deployment in mind, all to enable users to perform annotation tasks with high efficiency and high-quality annotation results.

Probing Pre-trained Language Models for Semantic Attributes and their Values
Meriem Beloucif | Chris Biemann
Findings of the Association for Computational Linguistics: EMNLP 2021

Pretrained language models (PTLMs) yield state-of-the-art performance on many natural language processing tasks, including syntax, semantics and commonsense. In this paper, we focus on identifying to what extent PTLMs capture semantic attributes and their values, e.g., the correlation between rich and high net worth. We use PTLMs to predict masked tokens using patterns and lists of items from Wikidata in order to verify how likely PTLMs are to encode semantic attributes along with their values. Such inferences based on semantics are intuitive for humans as part of our language understanding. Since PTLMs are trained on large amounts of Wikipedia data, we would assume that they can generate similar predictions, yet our findings reveal that PTLMs are still much worse than humans on this task. We show evidence and analysis explaining how to exploit our methodology to integrate better context and semantics into PTLMs using knowledge bases.

Error Analysis of using BART for Multi-Document Summarization: A Study for English and German Language
Timo Johner | Abhik Jana | Chris Biemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Recent research using pre-trained language models for the multi-document summarization task lacks deep investigation of potential erroneous cases and their possible application to other languages. In this work, we apply a pre-trained language model (BART) to the multi-document summarization (MDS) task, both with and without fine-tuning. We use two English datasets and one German dataset for this study. First, we reproduce the multi-document summaries for the English language by following one of the recent studies. Next, we show the applicability of the model to the German language by achieving state-of-the-art performance on German MDS. We perform an in-depth error analysis of the followed approach for both languages, which leads us to identify the most notable errors, from made-up facts to topic delimitation, and to quantify the amount of extractiveness.

Towards Layered Events and Schema Representations in Long Documents
Hans Ole Hatzel | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In this thesis proposal, we explore the application of event extraction to literary texts. Considering the lengths of literary documents, modeling events in different granularities may be more adequate to extract meaningful information, as individual elements contribute little to the overall semantics. We adapt the concept of schemas as sequences of events all describing a single process, connected through shared participants, extending it to multiple schemas in a document. Segmentation of event sequences into schemas is approached by modeling event sequences on tasks such as the narrative cloze task, the prediction of missing events in sequences. We propose building on sequences of event embeddings to form schema embeddings, thereby summarizing sections of documents using a single representation. This approach will allow for the comparison of different sections of documents and entire literary works. Literature is a challenging domain based on its variety of genres, yet the representation of literary content has received relatively little attention.

Neural End-to-end Coreference Resolution for German in Different Domains
Fynn Schröder | Hans Ole Hatzel | Chris Biemann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

How Hateful are Movies? A Study and Prediction on Movie Subtitles
Niklas von Boguszewski | Sana Moin | Anirban Bhowmick | Seid Muhie Yimam | Chris Biemann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

Which is Better for Deep Learning: Python or MATLAB? Answering Comparative Questions in Natural Language
Viktoriia Chekalina | Alexander Bondarenko | Chris Biemann | Meriem Beloucif | Varvara Logacheva | Alexander Panchenko
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We present a system for answering comparative questions (Is X better than Y with respect to Z?) in natural language. Answering such questions is important for assisting humans in making informed decisions. The key component of our system is a natural language interface for comparative QA that can be used in personal assistants, chatbots, and similar NLP devices. Comparative QA is a challenging NLP task, since it requires collecting support evidence from many different sources, and direct comparisons of rare objects may not be available even on the entire Web. We take the first step towards a solution for such a task, offering a testbed for comparative QA in natural language by probing several methods and making the three best ones available as an online demo.

2020

Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System
Seid Muhie Yimam | Gopalakrishnan Venkatesh | John Lee | Chris Biemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the first approach to automatically building resources for academic writing. The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing. On top of existing academic resources, such as the Corpus of Contemporary American English (COCA) academic Word List, the New Academic Word List, and the Academic Collocation List, we also explore how to dynamically build such resources that would be used to automatically identify informal or non-academic words or phrases. The resources are compiled using different generic approaches that can be extended for different domains and languages. We describe the evaluation of resources with a system implementation. The system consists of an informal word identification (IWI), academic candidate paraphrase generation, and paraphrase ranking components. To generate candidates and rank them in context, we have used the PPDB and WordNet paraphrase resources. We use the Concepts in Context (CoInCO) “All-Words” lexical substitution dataset both for the informal word identification and paraphrase generation experiments. Our informal word identification component achieves an F-1 score of 82%, significantly outperforming a stratified classifier baseline. The main contribution of this work is a domain-independent methodology to build targeted resources for writing aids.

Generating Lexical Representations of Frames using Lexical Substitution
Saba Anwar | Artem Shelmanov | Alexander Panchenko | Chris Biemann
Proceedings of the Probability and Meaning Conference (PaM 2020)

Semantic frames are formal linguistic structures describing situations/actions/events, e.g. Commercial transfer of goods. Each frame provides a set of roles corresponding to the situation participants, e.g. Buyer and Goods, and lexical units (LUs) – words and phrases that can evoke this particular frame in texts, e.g. Sell. The scarcity of annotated resources hinders wider adoption of frame semantics across languages and domains. We investigate a simple yet effective method, lexical substitution with word representation models, to automatically expand a small set of frame-annotated sentences with new words for their respective roles and LUs. We evaluate the expansion quality using FrameNet. Contextualized models demonstrate overall superior performance compared to the non-contextualized ones on roles. However, the latter show comparable performance on the task of LU expansion.

UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Gregor Wiedemann | Seid Muhie Yimam | Chris Biemann
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training on the masked language modeling (MLM) task. Here, in-domain data for unsupervised MLM resembling the actual classification target dataset allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks with and without MLM fine-tuning on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in the SemEval 2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.

Individual corpora predict fast memory retrieval during reading
Markus J. Hofmann | Lara Müller | Andre Rölke | Ralph Radach | Chris Biemann
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

The corpus, from which a predictive language model is trained, can be considered the experience of a semantic system. We recorded everyday reading of two participants for two months on a tablet, generating individual corpus samples of 300/500K tokens. Then we trained word2vec models from individual corpora and a 70 million-sentence newspaper corpus to obtain individual and norm-based long-term memory structure. To test whether individual corpora can make better predictions for a cognitive task of long-term memory retrieval, we generated stimulus materials consisting of 134 sentences with uncorrelated individual and norm-based word probabilities. For the subsequent eye tracking study 1-2 months later, our regression analyses revealed that individual, but not norm-corpus-based word probabilities can account for first-fixation duration and first-pass gaze duration. Word length additionally affected gaze duration and total viewing duration. The results suggest that corpora representative for an individual’s long-term memory structure can better explain reading performance than a norm corpus, and that recently acquired information is lexically accessed rapidly.

Social Media Unrest Prediction during the COVID-19 Pandemic: Neural Implicit Motive Pattern Recognition as Psychometric Signs of Severe Crises
Dirk Johannßen | Chris Biemann
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

The COVID-19 pandemic has caused international social tension and unrest. Besides the crisis itself, there are growing signs of rising conflict potential in societies around the world. Indicators of global mood changes are hard to detect, and direct questionnaires suffer from social desirability biases. However, so-called implicit methods can reveal humans' intrinsic desires from, e.g., social media texts. We present psychologically validated social unrest predictors and replicate scalable and automated predictions, setting a new state of the art on a recent German shared task dataset. We employ this model to investigate a change of language towards social unrest during the COVID-19 pandemic by comparing established psychological predictors on samples of tweets from spring 2019 with spring 2020. The results show a significant increase in the conflict-indicating psychometrics. With this work, we demonstrate the applicability of automated NLP-based approaches to quantitative psychological research.

Word Sense Disambiguation for 158 Languages using Word Embeddings Only
Varvara Logacheva | Denis Teslenko | Artem Shelmanov | Steffen Remus | Dmitry Ustalov | Andrey Kutuzov | Ekaterina Artemova | Chris Biemann | Simone Paolo Ponzetto | Alexander Panchenko
Proceedings of the Twelfth Language Resources and Evaluation Conference

Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised and/or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al., (2018), enabling WSD in these languages. Models and system are available online.

Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models
Seid Muhie Yimam | Hizkiel Mitiku Alemayehu | Abinew Ayele | Chris Biemann
Proceedings of the 28th International Conference on Computational Linguistics

This paper presents a study of sentiment analysis for Amharic social media texts. As the number of social media users is ever-increasing, social media platforms would like to understand the latent meaning and sentiments of a text to enhance decision-making procedures. However, low-resource languages such as Amharic have received less attention due to several reasons such as lack of well-annotated datasets, unavailability of computing resources, and fewer or no expert researchers in the area. This research addresses three main research questions. We first explore the suitability of existing tools for the sentiment analysis task. Annotation tools that support large-scale annotation tasks in Amharic are scarce. Also, the existing crowdsourcing platforms do not support Amharic text annotation. Hence, we build a social-network-friendly annotation tool called ‘ASAB’ using the Telegram bot. We collect 9.4k tweets, where each tweet is annotated by three Telegram users. Moreover, we explore the suitability of machine learning approaches for Amharic sentiment analysis. The FLAIR deep learning text classifier, based on network embeddings that are computed from a distributional thesaurus, outperforms other supervised classifiers. We further investigate the challenges in building a sentiment analysis system for Amharic and find that the widespread usage of sarcasm and figurative speech is the main issue in dealing with the problem. To advance the sentiment analysis research in Amharic and other related low-resource languages, we release the dataset, the annotation tool, source code, and models publicly under a permissive license.

Estimating the influence of auxiliary tasks for multi-task learning of sequence tagging tasks
Fynn Schröder | Chris Biemann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Multi-task learning (MTL) and transfer learning (TL) are techniques to overcome the issue of data scarcity when training state-of-the-art neural networks. However, finding beneficial auxiliary datasets for MTL or TL is a time- and resource-consuming trial-and-error approach. We propose new methods to automatically assess the similarity of sequence tagging datasets to identify beneficial auxiliary data for MTL or TL setups. Our methods can compute the similarity between any two sequence tagging datasets; they do not need to be annotated with the same tagset or multiple labels in parallel. Additionally, our methods take tokens and their labels into account, which is more robust than only using either of them as an information source, as conducted in prior work. We empirically show that our similarity measures correlate with the change in test score of neural networks that use the auxiliary dataset for MTL to increase the main task performance. We provide an efficient, open-source implementation.

2019

Learning Graph Embeddings from WordNet-based Similarity Measures
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.

UHH-LT at SemEval-2019 Task 6: Supervised vs. Unsupervised Transfer Learning for Offensive Language Detection
Gregor Wiedemann | Eugen Ruppert | Chris Biemann
Proceedings of the 13th International Workshop on Semantic Evaluation

We present a neural network based approach of transfer learning for offensive language detection. For our system, we compare two types of knowledge transfer: supervised and unsupervised pre-training. Supervised pre-training of our bidirectional GRU-3-CNN architecture is performed as multi-task learning of parallel training of five different tasks. The selected tasks are supervised classification problems from public NLP resources with some overlap to offensive language such as sentiment detection, emoji classification, and aggressive language classification. Unsupervised transfer learning is performed with a thematic clustering of 40M unlabeled tweets via LDA. Based on this dataset, pre-training is performed by predicting the main topic of a tweet. Results indicate that unsupervised transfer from large datasets performs slightly better than supervised training on small ‘near target category’ datasets. In the SemEval Task, our system ranks 14 out of 103 participants.

Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction
Dmitry Ustalov | Alexander Panchenko | Chris Biemann | Simone Paolo Ponzetto
Computational Linguistics, Volume 45, Issue 3 - September 2019

We present a detailed theoretical and computational analysis of the Watset meta-algorithm for fuzzy graph clustering, which has been found to be widely applicable in a variety of domains. This algorithm creates an intermediate representation of the input graph, which reflects the “ambiguity” of its nodes. Then, it uses hard clustering to discover clusters in this “disambiguated” intermediate graph. After outlining the approach and analyzing its computational complexity, we demonstrate that Watset shows competitive results in three applications: unsupervised synset induction from a synonymy graph, unsupervised semantic frame induction from dependency triples, and unsupervised semantic class induction from a distributional thesaurus. Our algorithm is generic and can also be applied to other networks of linguistic data.

TARGER: Neural Argument Mining at Your Fingertips
Artem Chernodub | Oleksiy Oliynyk | Philipp Heidenreich | Alexander Bondarenko | Matthias Hagen | Chris Biemann | Alexander Panchenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present TARGER, an open source neural argument mining framework for tagging arguments in free input texts and for keyword-based retrieval of arguments from an argument-tagged web-scale corpus. The currently available models are pre-trained on three recent argument mining datasets and enable the use of neural argument mining without any reproducibility effort on the user’s side. The open source code ensures portability to other domains and use cases.

Categorizing Comparative Sentences
Alexander Panchenko | Alexander Bondarenko | Mirco Franzek | Matthias Hagen | Chris Biemann
Proceedings of the 6th Workshop on Argument Mining

We tackle the tasks of automatically identifying comparative sentences and categorizing the intended preference (e.g., “Python has better NLP libraries than MATLAB” → Python, better, MATLAB). To this end, we manually annotate 7,199 sentences for 217 distinct target item pairs from several domains (27% of the sentences contain an oriented comparison in the sense of “better” or “worse”). A gradient boosting model based on pre-trained sentence embeddings reaches an F1 score of 85% in our experimental evaluation. The model can be used to extract comparative sentences for pro/con argumentation in comparative / argument search engines or debating technologies.

Language-Agnostic Model for Aspect-Based Sentiment Analysis
Md Shad Akhtar | Abhishek Kumar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

In this paper, we propose a language-agnostic deep neural network architecture for aspect-based sentiment analysis. The proposed approach is based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network, which is further assisted with extra hand-crafted features. We define three different architectures for the successful combination of word embeddings and hand-crafted features. We evaluate the proposed approach for six languages (i.e. English, Spanish, French, Dutch, German and Hindi) and two problems (i.e. aspect term extraction and aspect sentiment classification). Experiments show that the proposed model attains state-of-the-art performance in most of the settings.

Making Fast Graph-based Algorithms with Graph Metric Embeddings
Andrey Kutuzov | Mohammad Dorgham | Oleksiy Oliynyk | Chris Biemann | Alexander Panchenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Graph measures, such as node distances, are inefficient to compute. We explore dense vector representations as an effective way to approximate the same information. We introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks.

LT Expertfinder: An Evaluation Framework for Expert Finding Methods
Tim Fischer | Steffen Remus | Chris Biemann
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Expert finding is the task of ranking persons for a predefined topic or search query. Finding experts for a specified area is an important task and has attracted much attention in the information retrieval community. Most approaches for this task are evaluated in a supervised fashion, which depends on predefined topics of interest as well as gold standard expert rankings. Famous representatives of such datasets are enriched versions of DBLP provided by the ArnetMiner project or the W3C Corpus of TREC. However, manually ranking experts can be considered highly subjective and detailed rankings are hardly distinguishable. Evaluating on these datasets does not necessarily guarantee a good or bad performance of the system. Particularly for dynamic systems, where topics are not predefined but formulated as a search query, we believe a more informative approach is to perform user studies for directly comparing different methods in the same view. In order to accomplish this in a user-friendly way, we present the LT Expert Finder web-application, which is equipped with various query-based expert finding methods that can be easily extended, a detailed expert profile view, detailed evidence in the form of relevant documents and statistics, and an evaluation component that allows the qualitative comparison between different rankings.

Hierarchical Multi-label Classification of Text with Capsule Networks
Rami Aly | Steffen Remus | Chris Biemann
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Capsule networks have been shown to perform well on structured data in the area of visual inference. In this paper we apply and compare simple shallow capsule networks for hierarchical multi-label text classification and show that they can outperform other neural networks, such as CNNs and LSTMs, and non-neural network architectures such as SVMs. For our experiments, we use the established Web of Science (WOS) dataset and introduce a new real-world scenario dataset, the BlurbGenreCollection (BGC). Our results confirm the hypothesis that capsule networks are especially advantageous for rare events and structurally diverse categories, which we attribute to their ability to combine latent encoded information.

Every Child Should Have Parents: A Taxonomy Refinement Algorithm Based on Hyperbolic Term Embeddings
Rami Aly | Shantanu Acharya | Alexander Ossa | Arne Köhn | Chris Biemann | Alexander Panchenko
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce the use of Poincaré embeddings to improve existing state-of-the-art approaches to domain-specific taxonomy induction from text as a signal for both relocating wrong hyponym terms within a (pre-induced) taxonomy as well as for attaching disconnected terms in a taxonomy. This method substantially improves previous state-of-the-art results on the SemEval-2016 Task 13 on taxonomy extraction. We demonstrate the superiority of Poincaré embeddings over distributional semantic representations, supporting the hypothesis that they can better capture hierarchical lexical-semantic relationships than embeddings in the Euclidean space.

Improving Neural Entity Disambiguation with Graph Embeddings
Özge Sevgili | Alexander Panchenko | Chris Biemann
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Entity Disambiguation (ED) is the task of linking an ambiguous entity mention to a corresponding entry in a knowledge base. Current methods have mostly focused on unstructured text data to learn representations of entities; however, there is structured information in the knowledge base itself that should be useful to disambiguate entities. In this work, we propose a method that uses graph embeddings for integrating structured information from the knowledge base with unstructured information from text-based representations. Our experiments confirm that graph embeddings trained on a graph of hyperlinks between Wikipedia articles improve the performance of a simple feed-forward neural ED model and a state-of-the-art neural ED system.

HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings
Saba Anwar | Dmitry Ustalov | Nikolay Arefyev | Simone Paolo Ponzetto | Chris Biemann | Alexander Panchenko
Proceedings of the 13th International Workshop on Semantic Evaluation

We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (Qasem-iZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word and their context embeddings and role labeling by combining these embeddings with syntactical features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages.

On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings
Abhik Jana | Dima Puzyrev | Alexander Panchenko | Pawan Goyal | Chris Biemann | Animesh Mukherjee
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.
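As a rough illustration of the blending idea in the abstract above, the following Python sketch combines a distributional similarity score with a similarity derived from the standard Poincaré-ball distance. The conversion from distance to similarity (`1 / (1 + d)`), the function names, and the weight `alpha` are assumptions made for this example; they are not taken from the paper.

```python
import math

def poincare_distance(u, v):
    """Standard distance between two points inside the unit Poincaré ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def poincare_similarity(u, v):
    # Assumption: map distance to a similarity in (0, 1]; the paper may use another function.
    return 1.0 / (1.0 + poincare_distance(u, v))

def blended_score(distributional_sim, poincare_sim, alpha=0.5):
    # Weighted average of the two signals; alpha is a hypothetical tuning weight.
    return alpha * distributional_sim + (1.0 - alpha) * poincare_sim
```

For instance, `blended_score(0.8, poincare_similarity(phrase_vec, head_vec), alpha=0.7)` would weight the distributional signal more heavily, where `phrase_vec` and `head_vec` stand for hypothetical embeddings of a noun phrase and one of its constituents.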

Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records
Max Friedrich | Arne Köhn | Gregor Wiedemann | Chris Biemann
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHR) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly.

Reviving a psychometric measure: Classification and prediction of the Operant Motive Test
Dirk Johannßen | Chris Biemann | David Scheffer
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

Implicit motives allow for the characterization of behavior, subsequent success and long-term development. While this has been operationalized in the operant motive test, research on motives has declined mainly due to labor-intensive and costly human annotation. In this study, we analyze over 200,000 labeled data items from 40,000 participants and utilize them for engineering features for training a logistic model tree machine learning model. It captures manually assigned motives well with an F-score of 80%, coming close to the pairwise annotator intraclass correlation coefficient of r = .85. In addition, we found a significant correlation of r = .2 between subsequent academic success and data automatically labeled with our model in an extrinsic evaluation.

2018

Demonstrating Par4Sem - A Semantic Writing Aid with Adaptive Paraphrasing
Seid Muhie Yimam | Chris Biemann
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

In this paper, we present Par4Sem, a semantic writing aid tool based on adaptive paraphrasing. Unlike many annotation tools that are primarily used to collect training examples, Par4Sem is integrated into a real-world application, in this case a writing aid tool, in order to collect training examples from usage data. Par4Sem supports an adaptive, iterative, and interactive process where the underlying machine learning models are updated for each iteration using new training examples from usage data. After motivating the use of ever-learning tools in NLP applications, we evaluate Par4Sem by adapting it to a text simplification task through mere usage.

A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

Improving Hypernymy Extraction with Distributional Semantic Classes
Alexander Panchenko | Dmitry Ustalov | Stefano Faralli | Simone P. Ponzetto | Chris Biemann
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes
Enrico Santus | Chris Biemann | Emmanuele Chersoni
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes BomJi, a supervised system for capturing discriminative attributes in word pairs (e.g. yellow as discriminative for banana over watermelon). The system relies on an XGB classifier trained on carefully engineered graph-, pattern- and word embedding-based features. It participated in the SemEval-2018 Task 10 on Capturing Discriminative Attributes, achieving an F1 score of 0.73 and ranking 2nd out of 26 participant systems.

Unsupervised Semantic Frame Induction using Triclustering
Dmitry Ustalov | Alexander Panchenko | Andrey Kutuzov | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task.

Retrofitting Word Representations for Unsupervised Sense Aware Word Similarities
Steffen Remus | Chris Biemann
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl
Alexander Panchenko | Eugen Ruppert | Stefano Faralli | Simone P. Ponzetto | Chris Biemann
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Document-based Recommender System for Job Postings using Dense Representations
Ahmed Elsafty | Martin Riedl | Chris Biemann
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

Job boards and professional social networks heavily use recommender systems in order to better support users in exploring job advertisements. Detecting the similarity between job advertisements is important for job recommendation systems as it allows, for example, the application of item-to-item based recommendations. In this work, we research the usage of dense vector representations to enhance a large-scale job recommendation system and to rank German job advertisements regarding their similarity. We follow a two-fold evaluation scheme: (1) We exploit historic user interactions to automatically create a dataset of similar jobs that enables an offline evaluation. (2) In addition, we conduct an online A/B test and evaluate the best performing method on our platform reaching more than 1 million users. We achieve the best results by combining job titles with full-text job descriptions. In particular, this method builds dense document representations using words of the titles to weigh the importance of words of the full-text description. In the online evaluation, this approach allows us to increase the click-through rate on job recommendations for active users by 8.0%.

Enriching Frame Representations with Distributionally Induced Senses
Stefano Faralli | Alexander Panchenko | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Multilingual Information Extraction Pipeline for Investigative Journalism
Gregor Wiedemann | Seid Muhie Yimam | Chris Biemann
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files, up to several gigabytes in size, with unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks.

Par4Sim – Adaptive Paraphrasing for Text Simplification
Seid Muhie Yimam | Chris Biemann
Proceedings of the 27th International Conference on Computational Linguistics

Learning from a real-world data stream and continuously updating the model without explicit supervision is a new challenge for NLP applications with machine learning components. In this work, we have developed an adaptive learning system for text simplification, which improves the underlying learning-to-rank model from usage data, i.e. how users have employed the system for the task of simplification. Our experimental results show that, over a period of time, the performance of the embedded paraphrase ranking model increases steadily, improving from a score of 62.88% up to 75.70% based on the NDCG@10 evaluation metric. To our knowledge, this is the first study where an NLP component is adaptively improved through usage.
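For readers unfamiliar with the NDCG@10 metric mentioned in the abstract above, here is a minimal Python sketch of one common formulation. It assumes linear gain and a log2 rank discount; the paper's exact variant and its relevance labels are not specified in the abstract, so the function names and inputs below are illustrative only.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` is a list of graded relevance labels in the order the system ranked the candidates; a ranking that already places the most relevant candidates first scores 1.0.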

An Unsupervised Word Sense Disambiguation System for Under-Resourced Languages
Dmitry Ustalov | Denis Teslenko | Alexander Panchenko | Mikhail Chernoskutov | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Using Semantics for Granularities of Tokenization
Martin Riedl | Chris Biemann
Computational Linguistics, Volume 44, Issue 3 - September 2018

Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), as well as splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that our methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, which is a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably to previous methods not utilizing distributional information. Second, we present SECOS, an algorithm for decompounding close compounds. In an evaluation of four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.

2017

IITPB at SemEval-2017 Task 5: Sentiment Prediction in Financial Text
Abhishek Kumar | Abhishek Sethi | Md Shad Akhtar | Asif Ekbal | Chris Biemann | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper reports team IITPB’s participation in the SemEval 2017 Task 5 on ‘Fine-grained sentiment analysis on financial microblogs and news’. We developed 2 systems for the two tracks. One system was based on an ensemble of Support Vector Classifier and Logistic Regression. This system relied on Distributional Thesaurus (DT), word embeddings and lexicon features to predict a floating sentiment value between -1 and +1. The other system was based on Support Vector Regression using word embeddings, lexicon features, and PMI scores as features. The system was ranked 5th in track 1 and 8th in track 2.

Using Pseudowords for Algorithm Comparison: An Evaluation Framework for Graph-based Word Sense Induction
Flavio Massimiliano Cecchini | Chris Biemann | Martin Riedl
Proceedings of the 21st Nordic Conference on Computational Linguistics

Multilingual and Cross-Lingual Complex Word Identification
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtful quality and limited to English. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it to the other languages) is comparable to the performance of monolingual CWI systems.

Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction and Disambiguation
Alexander Panchenko | Eugen Ruppert | Stefano Faralli | Simone Paolo Ponzetto | Chris Biemann
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The current trend in NLP is the use of highly opaque models, e.g. neural networks and word embeddings. While these models yield state-of-the-art results on a range of tasks, their drawback is poor interpretability. Using word sense induction and disambiguation (WSID) as an example, we show that it is possible to develop an interpretable model that matches the state-of-the-art models in accuracy. Namely, we present an unsupervised, knowledge-free WSID approach, which is interpretable at three levels: word sense inventory, sense feature representations, and disambiguation procedure. Experiments show that our model performs on par with state-of-the-art word sense embeddings and other unsupervised systems while offering the possibility to justify its decisions in human-readable form.

Negative Sampling Improves Hypernymy Extraction Based on Projection Learning
Dmitry Ustalov | Nikolay Arefyev | Chris Biemann | Alexander Panchenko
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a new approach to extraction of hypernyms based on projection learning and word embeddings. In contrast to classification-based approaches, projection-based methods require no candidate hyponym-hypernym pairs. While it is natural to use both positive and negative training examples in supervised relation extraction, the impact of negative examples on hypernym prediction has not been studied so far. In this paper, we show that explicit negative examples used for regularization of the model significantly improve performance compared to the state-of-the-art approach of Fu et al. (2014) on three datasets from different languages.

Watset: Automatic Induction of Synsets from a Graph of Synonyms
Dmitry Ustalov | Alexander Panchenko | Chris Biemann
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents a new graph-based approach that induces synsets using synonymy dictionaries and word embeddings. First, we build a weighted graph of synonyms extracted from commonly available resources, such as Wiktionary. Second, we apply word sense induction to deal with ambiguous words. Finally, we cluster the disambiguated version of the ambiguous input graph into synsets. Our meta-clustering approach lets us use an efficient hard clustering algorithm to perform a fuzzy clustering of the graph. Despite its simplicity, our approach shows excellent results, outperforming five competitive state-of-the-art methods in terms of F-score on three gold standard datasets for English and Russian derived from large-scale manually constructed lexical resources.

Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko | Stefano Faralli | Simone Paolo Ponzetto | Chris Biemann
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

We introduce a new method for unsupervised knowledge-based word sense disambiguation (WSD) based on a resource that links two types of sense-aware lexical networks: one is induced from a corpus using distributional semantics, the other is manually constructed. The combination of two networks reduces the sparsity of sense representations used for WSD. We evaluate these enriched representations within two lexical sample sense disambiguation benchmarks. Our results indicate that (1) features extracted from the corpus-based resource help to significantly outperform a model based solely on the lexical resource; (2) our method achieves results comparable to or better than four state-of-the-art unsupervised knowledge-based WSD systems including three hybrid systems that also rely on text corpora. In contrast to these hybrid methods, our approach does not require access to web search engines, texts mapped to a sense inventory, or machine translation systems.

STS-UHH at SemEval-2017 Task 1: Scoring Semantic Textual Similarity Using Supervised and Unsupervised Ensemble
Sarah Kohail | Amr Rekaby Salama | Chris Biemann
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper reports the STS-UHH participation in the SemEval 2017 shared Task 1 of Semantic Textual Similarity (STS). Overall, we submitted 3 runs covering monolingual and cross-lingual STS tracks. Our participation involves two approaches: an unsupervised approach, which estimates a word alignment-based similarity score, and a supervised approach, which combines dependency graph similarity and coverage features with lexical similarity measures using regression methods. We also present a way of ensembling both models. Out of 84 submitted runs, our team's best multi-lingual run was ranked 12th in overall performance with a correlation of 0.61, 7th among 31 participating teams.

IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Question Answering and Implicit Dialogue Identification
Titas Nandi | Chris Biemann | Seid Muhie Yimam | Deepak Gupta | Sarah Kohail | Asif Ekbal | Pushpak Bhattacharyya
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present the system for Answer Selection and Ranking in Community Question Answering, which we built as part of our participation in SemEval-2017 Task 3. We develop a Support Vector Machine (SVM) based system that makes use of textual, domain-specific, word-embedding and topic-modeling features. In addition, we propose a novel method for dialogue chain identification in comment threads. Our primary submission won subtask C, outperforming other systems in all the primary evaluation metrics. We performed well in other English subtasks, ranking third in subtask A and eighth in subtask B. We also developed open source toolkits for all three English subtasks under the name cQARank [https://github.com/TitasNandi/cQARank].

Entity-Centric Information Access with Human in the Loop for the Biomedical Domain
Seid Muhie Yimam | Steffen Remus | Alexander Panchenko | Andreas Holzinger | Chris Biemann
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

In this paper, we describe the concept of entity-centric information access for the biomedical domain. With entity recognition technologies approaching acceptable levels of accuracy, we put forward a paradigm of document browsing and searching where the entities of the domain and their relations are explicitly modeled to provide users the possibility of collecting exhaustive information on relations of interest. We describe three working prototypes along these lines: NEW/S/LEAK, which was developed for investigative journalists who need a quick overview of large leaked document collections; STORYFINDER, which is a personalized organizer for information found in web pages that allows adding entities as well as relations, and is capable of personalized information management; and adaptive annotation capabilities of WEBANNO, which is a general-purpose linguistic annotation tool. We will discuss future steps towards the adaptation of these tools to biomedical data, which is the subject of a recently started project on biomedical knowledge acquisition. A key difference from other approaches is the centering around the user in a Human-in-the-Loop machine learning approach, where users define and extend categories and enable the system to improve via feedback and interaction.

There’s no ‘Count or Predict’ but task-based selection for distributional models
Martin Riedl | Chris Biemann
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
Alexander Panchenko | Fide Marten | Eugen Ruppert | Stefano Faralli | Dmitry Ustalov | Simone Paolo Ponzetto | Chris Biemann
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-based system while it remains completely unsupervised and knowledge-free. The presented tool features a Web interface for all-word disambiguation of texts that makes the sense predictions human readable by providing interpretable word sense inventories, sense representations, and disambiguation results. We provide a public API, enabling seamless integration.

The ContrastMedium Algorithm: Taxonomy Induction From Noisy Knowledge Graphs With Just A Few Links
Stefano Faralli | Alexander Panchenko | Chris Biemann | Simone Paolo Ponzetto
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we present ContrastMedium, an algorithm that transforms noisy semantic networks into full-fledged, clean taxonomies. ContrastMedium is able to identify the embedded taxonomy structure from a noisy knowledge graph without explicit human supervision such as, for instance, a set of manually selected input root and leaf concepts. This is achieved by leveraging structural information from a companion reference taxonomy, to which the input knowledge graph is linked (either automatically or manually). When used in conjunction with methods for hypernym acquisition and knowledge base linking, our methodology provides a complete solution for end-to-end taxonomy induction. We conduct experiments using automatically acquired knowledge graphs, as well as a SemEval benchmark, and show that our method is able to achieve high performance on the task of taxonomy induction.

Replacing OOV Words For Dependency Parsing With Distributional Semantics
Prasanth Kolachina | Martin Riedl | Chris Biemann
Proceedings of the 21st Nordic Conference on Computational Linguistics

CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Complex word identification (CWI) is an important task in text accessibility. However, due to the scarcity of CWI datasets, previous studies have only addressed this problem on Wikipedia sentences and have solely taken into account the needs of non-native English speakers. We collect a new CWI dataset (CWIG3G2) covering three text genres (News, WikiNews, and Wikipedia) annotated by both native and non-native English speakers. Unlike previous datasets, we cover single words, as well as complex phrases, and present them for judgment in a paragraph context. We present the first study on cross-genre and cross-group CWI, showing measurable influences in native language and genre types.

2016

new/s/leak – Information Extraction and Visualization for Investigative Data Journalists
Seid Muhie Yimam | Heiner Ulrich | Tatiana von Landesberger | Marcel Rosenbach | Michaela Regneri | Alexander Panchenko | Franziska Lehmann | Uli Fahrer | Chris Biemann | Kathrin Ballweg
Proceedings of ACL-2016 System Demonstrations

IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis
Ayush Kumar | Sarah Kohail | Amit Kumar | Asif Ekbal | Chris Biemann
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Learning Paraphrasing for Multiword Expressions
Seid Muhie Yimam | Héctor Martínez Alonso | Martin Riedl | Chris Biemann
Proceedings of the 12th Workshop on Multiword Expressions

Impact of MWE Resources on Multiword Recognition
Martin Riedl | Chris Biemann
Proceedings of the 12th Workshop on Multiword Expressions

Ambient Search: A Document Retrieval System for Speech Streams
Benjamin Milde | Jonas Wacker | Stefan Radomski | Max Mühlhäuser | Chris Biemann
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present Ambient Search, an open source system for displaying and retrieving relevant documents in real time for speech input. The system works ambiently, that is, it unobtrusively listens to speech streams in the background, identifies keywords and keyphrases for query construction and continuously serves relevant documents from its index. Query terms are ranked with Word2Vec and TF-IDF and are continuously updated to allow for ongoing querying of a document collection. The retrieved documents, in our case Wikipedia articles, are visualized in real time in a browser interface. Our evaluation shows that Ambient Search compares favorably to another implicit information retrieval system on speech streams. Furthermore, we extrinsically evaluate multiword keyphrase generation, showing positive impact for manual transcriptions.

TAXI at SemEval-2016 Task 13: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling
Alexander Panchenko | Stefano Faralli | Eugen Ruppert | Steffen Remus | Hubert Naets | Cédrick Fairon | Simone Paolo Ponzetto | Chris Biemann
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines
Darina Benikova | Chris Biemann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Semantic relations play an important role in linguistic knowledge representation. Although their role is relevant in the context of written text, there is no approach or dataset that makes use of contextuality of classic semantic relations beyond the boundary of one sentence. We present the SemRelData dataset that contains annotations of semantic relations between nominals in the context of one paragraph. To be able to analyse the universality of this context notion, the annotation was performed on a multi-lingual and multi-genre corpus. To evaluate the dataset, it is compared to large, manually created knowledge resources in the respective languages. The comparison shows that knowledge bases not only have coverage gaps; they also do not account for semantic relations that are manifested in particular contexts only, yet still play an important role for text cohesion.

Language Transfer Learning for Supervised Lexical Substitution
Gerold Hintz | Chris Biemann
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Towards a resource based on users’ knowledge to overcome the Tip of the Tongue problem.
Michael Zock | Chris Biemann
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

Language production is largely a matter of words which, in the case of access problems, can be searched for in an external resource (lexicon, thesaurus). In this kind of dialogue the user provides the momentarily available knowledge concerning the target and the system responds with the best guess(es) it can make given this input. As tip-of-the-tongue (ToT)-studies have shown, people always have some knowledge concerning the target (meaning fragments, number of syllables, ...) even if its complete form is eluding them. We will show here how to tap into this knowledge to build a resource likely to help authors (speakers/writers) to overcome the ToT-problem. Yet, before doing so we need a better understanding of the various kinds of knowledge people have when looking for a word. To this end, we asked crowdworkers to provide some cues to describe a given target and then to specify how each one of them relates to the target, in the hope that this could help others to find the elusive word. Next, we checked how well a given search strategy worked when being applied to differently built lexical networks. The results showed quite dramatic differences, which is not really surprising. After all, different networks are built for different purposes; hence each one of them is more or less suited for a given task. What was more surprising though is the fact that the relational information given by the users did not allow us to find the elusive word in WordNet better than without it.

Making Sense of Word Embeddings
Maria Pelevina | Nikolay Arefiev | Chris Biemann | Alexander Panchenko
Proceedings of the 1st Workshop on Representation Learning for NLP

Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods
Martin Riedl | Chris Biemann
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Vectors or Graphs? On Differences of Representations for Distributional Semantic Models
Chris Biemann
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

Distributional Semantic Models (DSMs) have recently received increased attention, together with the rise of neural architectures for scalable training of dense vector embeddings. While some of the literature even includes terms like ‘vectors’ and ‘dimensionality’ in the definition of DSMs, there are some good reasons why we should consider alternative formulations of distributional models. As an instance, I present a scalable graph-based solution to distributional semantics. The model belongs to the family of ‘count-based’ DSMs, keeps its representation sparse and explicit, and thus fully interpretable. I will highlight some important differences between sparse graph-based and dense vector approaches to DSMs: while dense vector-based models are computationally easier to handle and provide a nice uniform representation that can be compared and combined in many ways, they lack interpretability, provenance and robustness. On the other hand, graph-based sparse models have a more straightforward interpretation, handle sense distinctions more naturally and can straightforwardly be linked to knowledge bases, while lacking the ability to compare arbitrary lexical units and a compositionality operation. Since both representations have their merits, I opt for exploring their combination in the outlook.

A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures
Richard Eckart de Castilho | Éva Mújdricza-Maydt | Seid Muhie Yimam | Silvana Hartmann | Iryna Gurevych | Anette Frank | Chris Biemann
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We introduce the third major release of WebAnno, a generic web-based annotation tool for distributed teams. New features in this release focus on semantic annotation tasks (e.g. semantic role labelling or event annotation) and allow the tight integration of semantic annotations with syntactic annotations. In particular, we introduce the concept of slot features, a novel constraint mechanism that allows modelling the interaction between semantic and syntactic annotations, as well as a new annotation user interface. The new features were developed and used in an annotation project for semantic roles on German texts. The paper briefly introduces this project and reports on experiences performing annotations with the new tool. In a comparative evaluation, our tool reaches significant speedups over WebAnno 2 for a semantic annotation task.

EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres
Steffen Remus | Gerold Hintz | Chris Biemann | Christian M. Meyer | Darina Benikova | Judith Eckle-Kohler | Margot Mieskes | Thomas Arnold
Proceedings of the 10th Web as Corpus Workshop

Domain-Specific Corpus Expansion with Focused Webcrawling
Steffen Remus | Chris Biemann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain, the second experiment demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software.
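As a rough illustration of how an N-gram language model can score the relatedness of a candidate text to a domain, the sketch below trains a bigram model with add-one smoothing and uses perplexity as the score (lower means closer to the domain). This is not the crawler described in the paper: the function names `train_bigram` and `perplexity`, the choice of bigrams, the smoothing scheme, and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and bigrams in a tokenized in-domain text."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))
    return unigrams, bigrams, vocab_size


def perplexity(model, tokens):
    """Perplexity of `tokens` under the bigram model (lower = more in-domain)."""
    unigrams, bigrams, vocab_size = model
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("inf")
    log_prob = 0.0
    for prev, word in pairs:
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size + 1)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


# Hypothetical toy data: a tiny "domain" text and two candidate snippets.
domain = "the patient received the drug and the patient recovered".split()
model = train_bigram(domain)
in_domain = "the patient received the drug".split()
off_domain = "stock markets fell sharply today".split()
```

In a crawler, a score of this kind could be used to prioritize which links to follow; how the actual system combines document and link scores is not specified here.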

Demonstrating Ambient Search: Implicit Document Retrieval for Speech Streams
Benjamin Milde | Jonas Wacker | Stefan Radomski | Max Mühlhäuser | Chris Biemann
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this demonstration paper we describe Ambient Search, a system that displays and retrieves documents in real time based on speech input. The system operates continuously in ambient mode, i.e. it generates speech transcriptions and identifies main keywords and keyphrases, while also querying its index to display relevant documents without explicit query. Without user intervention, the results are dynamically updated; users can choose to interact with the system at any time, employing a conversation protocol that is enriched with the ambient information gathered continuously. Our evaluation shows that Ambient Search outperforms another implicit speech-based information retrieval system. Ambient search is available as open source software.

2015

Distributional Semantics for Resolving Bridging Mentions
Tim Feuerbach | Martin Riedl | Chris Biemann
Proceedings of the International Conference Recent Advances in Natural Language Processing

A Single Word is not Enough: Ranking Multiword Expressions Using Distributional Semantics
Martin Riedl | Chris Biemann
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Do Supervised Distributional Methods Really Learn Lexical Inference Relations?
Omer Levy | Steffen Remus | Chris Biemann | Ido Dagan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Book Reviews: Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger and John McCrae
Chris Biemann
Computational Linguistics, Volume 41, Issue 2 - June 2015

JoBimViz: A Web-based Visualization for Graph-based Distributional Semantic Models
Eugen Ruppert | Manuel Kaufmann | Martin Riedl | Chris Biemann
Proceedings of ACL-IJCNLP 2015 System Demonstrations

2014

Multiobjective Optimization and Unsupervised Lexical Acquisition for Named Entity Recognition and Classification
Govind | Asif Ekbal | Chris Biemann
Proceedings of the 11th International Conference on Natural Language Processing

Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno
Seid Muhie Yimam | Chris Biemann | Richard Eckart de Castilho | Iryna Gurevych
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Distributed Distributional Similarities of Google Books Over the Centuries
Martin Riedl | Richard Steuer | Chris Biemann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can unveil sense change of terms across different decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.

That’s sick dude!: Automatic identification of word sense change across different timescales
Sunny Mitra | Ritwik Mitra | Martin Riedl | Chris Biemann | Animesh Mukherjee | Pawan Goyal
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Combining Supervised and Unsupervised Parsing for Distributional Similarity
Martin Riedl | Irina Alles | Chris Biemann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Lexical Substitution Dataset for German
Kostadin Cholakov | Chris Biemann | Judith Eckle-Kohler | Iryna Gurevych
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article describes a lexical substitution dataset for German. The whole dataset contains 2,040 sentences from the German Wikipedia, with one target word in each sentence. There are 51 target nouns, 51 adjectives, and 51 verbs randomly selected from 3 frequency groups based on the lemma frequency list of the German WaCKy corpus. 200 sentences have been annotated by 4 professional annotators and the remaining sentences by 1 professional annotator and 5 additional annotators who have been recruited via crowdsourcing. The resulting dataset can be used to evaluate not only lexical substitution systems, but also different sense inventories and word sense disambiguation systems.

NoSta-D Named Entity Annotation for German: Guidelines and Dataset
Darina Benikova | Chris Biemann | Marc Reznicek
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe the annotation of a new dataset for German Named Entity Recognition (NER). The need for this dataset is motivated by licensing issues and consistency issues of existing datasets. We describe our approach to creating annotation guidelines based on linguistic and semantic considerations, and how we iteratively refined and tested them in the early stages of annotation in order to arrive at the largest publicly available dataset for German NER, consisting of over 31,000 manually annotated sentences (over 591,000 tokens) from German Wikipedia and German online news. We provide a number of statistics on the dataset, which indicate its high quality, and discuss legal aspects of distributing the data as a compilation of citations. The data is released under the permissive CC-BY license, and will be fully available for download in September 2014 after it has been used for the GermEval 2014 shared task on NER. We further provide the full annotation guidelines and links to the annotation tool used for the creation of this resource.

2013

Supervised All-Words Lexical Substitution using Delexicalized Features
György Szarvas | Chris Biemann | Iryna Gurevych
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

JoBimText Visualizer: A Graph-based Approach to Contextualizing Distributional Similarity
Chris Biemann | Bonaventura Coppola | Michael R. Glass | Alfio Gliozzo | Matthew Hatem | Martin Riedl
Proceedings of TextGraphs-8 Graph-based Methods for Natural Language Processing

Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri
Martin Riedl | Chris Biemann
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

From Global to Local Similarities: A Graph-Based Contextualization Method using Distributional Thesauri
Martin Riedl | Chris Biemann
Proceedings of TextGraphs-8 Graph-based Methods for Natural Language Processing

Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis
Janneke Rauscher | Leonard Swiezinski | Martin Riedl | Chris Biemann
Proceedings of the Workshop on Computational Linguistics for Literature

SemEval-2013 Task 5: Evaluating Phrasal Semantics
Ioannis Korkontzelos | Torsten Zesch | Fabio Massimo Zanzotto | Chris Biemann
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
Seid Muhie Yimam | Iryna Gurevych | Richard Eckart de Castilho | Chris Biemann
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Three Knowledge-Free Methods for Automatic Lexical Chain Extraction
Steffen Remus | Chris Biemann
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Book Review: Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev
Chris Biemann
Computational Linguistics, Volume 38, Issue 1 - March 2012

UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures
Daniel Bär | Chris Biemann | Iryna Gurevych | Torsten Zesch
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Sweeping through the Topic Space: Bad luck? Roll again!
Martin Riedl | Chris Biemann
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

Quantifying Semantics using Complex Network Analysis
Chris Biemann | Stefanie Roos | Karsten Weihe
Proceedings of COLING 2012

Turk Bootstrap Word Sense Inventory 2.0: A Large-Scale Resource for Lexical Substitution
Chris Biemann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the Turk Bootstrap Word Sense Inventory (TWSI) 2.0. This lexical resource, created by a crowdsourcing process using Amazon Mechanical Turk (http://www.mturk.com), encompasses a sense inventory for lexical substitution for 1,012 highly frequent English common nouns. Along with each sense, a large number of sense-annotated occurrences in context are given, as well as a weighted list of substitutions. Sense distinctions are not motivated by lexicographic considerations, but driven by substitutability: two usages belong to the same sense if their substitutions overlap considerably. After laying out the need for such a resource, the data is characterized in terms of organization and quantity. Then, we briefly describe how this data was used to create a system for lexical substitutions. Training a supervised lexical substitution system on a smaller version of the resource resulted in well over 90% acceptability for lexical substitutions provided by the system. Thus, this resource can be used to set up reliable, enabling technologies for semantic natural language processing (NLP), some of which we discuss briefly.
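The substitutability criterion above (two usages share a sense if their substitutions overlap considerably) can be sketched as a set-overlap test. The Jaccard measure, the 0.5 threshold, and the function names `substitution_overlap` and `same_sense` are assumptions chosen for illustration; the abstract does not state how overlap was measured in TWSI.

```python
def substitution_overlap(subs_a, subs_b):
    """Jaccard overlap between two collections of substitute words."""
    a, b = set(subs_a), set(subs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_sense(subs_a, subs_b, threshold=0.5):
    """Treat two usages as one sense if their substitutes overlap enough.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return substitution_overlap(subs_a, subs_b) >= threshold


# Hypothetical substitutes collected for three usages of "bank".
usage_1 = ["lender", "institution", "financial institution"]
usage_2 = ["institution", "lender", "credit union"]
usage_3 = ["shore", "riverside", "embankment"]
```

With these toy lists, `same_sense(usage_1, usage_2)` holds because two of the four distinct substitutes are shared, while `usage_3` shares none with either.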

How Text Segmentation Algorithms Gain from Topic Models
Martin Riedl | Chris Biemann
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP
Omri Abend | Chris Biemann | Anna Korhonen | Ari Rappoport | Roi Reichart | Anders Søgaard
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

TopicTiling: A Text Segmentation Algorithm based on LDA
Martin Riedl | Chris Biemann
Proceedings of ACL 2012 Student Research Workshop

Using Distributional Similarity for Lexical Expansion in Knowledge-based Word Sense Disambiguation
Tristan Miller | Chris Biemann | Torsten Zesch | Iryna Gurevych
Proceedings of COLING 2012

2011

Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing
Chris Biemann | Anders Søgaard
Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing

Proceedings of the Workshop on Distributional Semantics and Compositionality
Chris Biemann | Eugenie Giesbrecht
Proceedings of the Workshop on Distributional Semantics and Compositionality

Distributional Semantics and Compositionality 2011: Shared Task Description and Results
Chris Biemann | Eugenie Giesbrecht
Proceedings of the Workshop on Distributional Semantics and Compositionality

2010

Co-Occurrence Cluster Features for Lexical Substitutions in Context
Chris Biemann
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing

2009

Syntax is from Mars while Semantics from Venus! Insights from Spectral Analysis of Distributional Similarity Networks
Chris Biemann | Monojit Choudhury | Animesh Mukherjee
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

ASV Toolbox: a Modular Collection of Language Exploration Tools
Chris Biemann | Uwe Quasthoff | Gerhard Heyer | Florian Holz
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

ASV Toolbox is a modular collection of tools for the exploration of written language data both for scientific and educational purposes. It includes modules that operate on word lists or texts and allow users to perform various linguistic annotation, classification and clustering tasks, including language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern-based and statistical approaches. The collection can be used to work on large real-world data sets as well as for studying the underlying algorithms. Each module of the ASV Toolbox is designed to work either on plain text files or with a connection to a MySQL database. While it is especially designed to work with corpora of the Leipzig Corpora Collection, it can easily be adapted to other sources.

Unsupervised Parts-of-Speech Induction for Bengali
Joydeep Nath | Monojit Choudhury | Animesh Mukherjee | Christian Biemann | Niloy Ganguly
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a study of the word interaction networks of Bengali in the framework of complex networks. The topological properties of these networks reveal interesting insights into the morpho-syntax of the language, whereas clustering helps in the induction of the natural word classes leading to a principled way of designing POS tagsets. We compare different network construction techniques and clustering algorithms based on the cohesiveness of the word clusters. Cohesiveness is measured against two gold-standard tagsets by means of the novel metric of tag-entropy. The approach presented here is a generic one that can be easily extended to any language.

Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing
Irina Matveeva | Chris Biemann | Monojit Choudhury | Mona Diab
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

2007

A Random Text Model for the Generation of Statistical Language Invariants
Chris Biemann
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

Unsupervised Natural Language Processing Using Graph Models
Chris Biemann
Proceedings of the NAACL-HLT 2007 Doctoral Consortium

Combining Contexts in Lexicon Learning for Semantic Parsing
Richard Socher | Chris Biemann | Rainer Osswald
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing
Chris Biemann | Irina Matveeva | Rada Mihalcea | Dragomir Radev
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

Proceedings of the ACL 2007 Student Research Workshop
Chris Biemann | Violeta Seretan | Ellen Riloff
Proceedings of the ACL 2007 Student Research Workshop

Íslenskur Orðasjóður – Building a Large Icelandic Corpus
Erla Hallsteinsdóttir | Thomas Eckart | Chris Biemann | Uwe Quasthoff | Matthias Richter
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems
Chris Biemann
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering
Chris Biemann
Proceedings of the COLING/ACL 2006 Student Research Workshop

Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization
Hans Friedrich Witschel | Chris Biemann
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

Corpus Portal for Search in Monolingual Corpora
Uwe Quasthoff | Matthias Richter | Christian Biemann
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A simple and flexible schema for storing and presenting monolingual language resources is proposed. In this format, data for 18 different languages is already available in various sizes. The data is provided free of charge for online use and download. The main target is to ease the application of algorithms for monolingual and interlingual studies.

Dictionary acquisition using parallel text and co-occurrence statistics
Chris Biemann | Uwe Quasthoff
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

2004

Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences
Chris Biemann | Sa-Im Shin | Key-Sun Choi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Web Services for Language Resources and Language Technology Applications
Christian Biemann | Stefan Bordag | Uwe Quasthoff | Christian Wolff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Linguistic Corpus Search
Christian Biemann | Uwe Quasthoff | Christian Wolff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Automatic Acquisition of Paradigmatic Relations Using Iterated Co-occurrences
Chris Biemann | Stefan Bordag | Uwe Quasthoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Named Entity Learning and Verification: Expectation Maximization in Large Corpora
Uwe Quasthoff | Christian Biemann | Christian Wolff
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

Co-authors

Simone Paolo Ponzetto 13

Fynn Petersen-Frey 9

Dmitry Ustalov 9

Stefano Faralli 8

Hans Ole Hatzel 8

Irina Nikishina 8

Uwe Quasthoff 8

Abinew Ali Ayele 7

Iryna Gurevych 7

Eugen Ruppert 7

Martin Semmann 7

Animesh Mukherjee 6

Gregor Wiedemann 6

Meriem Beloucif 5

Robert Geislinger 4

Gertraud Koch 4

Andrey Kutuzov 4

Benjamin Milde 4

Nikolay Arefyev 3

Darina Benikova 3

Pushpak Bhattacharyya 3

Alexander Bondarenko 3

Monojit Choudhury 3

Richard Eckart De Castilho 3

Dirk Johannßen 3

Oleksiy Oliynyk 3

Özge Sevgili 3

Artem Shelmanov 3

Christian Wolff 3

Torsten Zesch 3

Sanja Štajner 3

Narges Baba Ahmadi 2

Md. Shad Akhtar 2

Tadesse Destaw Belay 2

Stefan Bordag 2

Mohammad Dorgham 2

Judith Eckle-Kohler 2

Rudy Alexandro Garrido Veliz 2

Eugenie Giesbrecht 2

Goran Glavaš 2

Matthias Hagen 2

Carolin Holtermann 2

Abhishek Kumar 2

Anne Lauscher 2

Varvara Logacheva 2

Irina Matveeva 2

Max Mühlhäuser 2

Stefan Radomski 2

Matthias Richter 2

Punyajoy Saha 2

David Scheffer 2

Fynn Schröder 2

Ahmad Shallouf 2

Anders Søgaard 2

Denis Teslenko 2

Mohamed Abdalla 1

Idris Abdulmumin 1

Shantanu Acharya 1

Aalok Agrawal 1

Ibrahim Said Ahmad 1

Sanchit Ahuja 1

Hizkiel Mitiku Alemayehu 1

Adem Chanie Ali 1

Klejda Alushi 1

Selenia Anastasi 1

Jakob Smedegaard Andersen 1

Vladimir Araujo 1

Thomas Arnold 1

Ekaterina Artemova 1

Niloufar Baba Ahmadi 1

Kathrin Ballweg 1

Debayan Banerjee 1

Pavan Baswani 1

Anirban Bhowmick 1

Sebastian Blank 1

Aarushi Ajay Borkar 1

Sofia Bourhim 1

Prateek Chaudhury 1

Viktoriia Chekalina 1

Artem Chernodub 1

Polina Chernomorchenko 1

Mikhail Chernoskutov 1

Emmanuele Chersoni 1

Kostadin Cholakov 1

Bonaventura Coppola 1

Daryna Dementieva 1

Anastasiia Demidova 1

Daniel Djahangir 1

Nikolay Dolgov 1

Thomas Eckart 1

Ahmed Elsafty 1

Cédrick Fairon 1

Tim Feuerbach 1

Mirco Franzek 1

Alexander Friedrich 1

Max Friedrich 1

Niloy Ganguly 1

Gregor Geigle 1

Michael Glass 1

Alfio Gliozzo 1

Eduard Gorbunov 1

Robert Günzler 1

Christian Haase 1

Marlo Haering 1

Erla Hallsteinsdóttir 1

Anja Silvia Mollah Haque 1

Silvana Hartmann 1

Matthew Hatem 1

Philipp Heidenreich 1

Florian Helfer 1

Hanna Herasimchyk 1

Gerhard Heyer 1

Markus J. Hofmann 1

Andreas Holzinger 1

Samuel Horváth 1

Oumaima Hourrane 1

Daria Ignatenko 1

Enes Kutay Isgorur 1

Esubalew Alemneh Jalew 1

Longqin Jiang 1

Melese Ayichlie Jigar 1

Melf Johannsen 1

Gopichand Kanumolu 1

Manuel Kaufmann 1

Katharina Kleinen-von Königslöw 1

Ekaterina Kochmar 1

Prasanth Kolachina 1

Anna Korhonen 1

Ioannis Korkontzelos 1

Govind Kothari 1

John S. Y. Lee 1

Franziska Lehmann 1

Wiebke Loosen 1

Lokesh Madasu 1

Ganeshan Malhotra 1

Shervin Malmasi 1

Héctor Martínez Alonso 1

Flavio Massimiliano Cecchini 1

Natia Mestvirishvili 1

Christian M. Meyer 1

Margot Mieskes 1

Rada Mihalcea 1

Tristan Miller 1

Saif Mohammad 1

Daniil Moskovskiy 1

Viktor Moskvoretskii 1

Shamsuddeen Hassan Muhammad 1

Éva Mújdricza-Maydt 1

Alexander Ossa 1

Rainer Osswald 1

Nedjma Ousidhoum 1

Gustavo Paetzold 1

Maria Pelevina 1

Ali Ebrahimi Pourasad 1

Dragomir Radev 1

Ari Rappoport 1

Janneke Rauscher 1

Michaela Regneri 1

Marc Reznicek 1

Naquee Rizwan 1

Stefanie Roos 1

Marcel Rosenbach 1

Samuel Rutunda 1

Amr Rekaby Salama 1

Mikhail Salnikov 1

Enrico Santus 1

Florian Schleid 1

Fabian David Schmidt 1

Violeta Seretan 1

Abhishek Sethi 1

Manish Shrivastava 1

Dr. Florian Skupin 1

Richard Socher 1

Thamar Solorio 1

Steffen Stahlhacke 1

Richard Steuer 1

Haimo Stiemer 1

Christian Stöcker 1

Nirmal Surange 1

Leonard Swiezinski 1

György Szarvas 1

Hailegnaw Tilaye 1

Maximilian Trescher 1

Nazarii Tupitsa 1

Heiner Ulrich 1

Ricardo Usbeck 1

Gopalakrishnan Venkatesh 1

Krishnapriya Vishnubhotla 1

Karsten Weihe 1

Max Wiechmann 1

Genta Indra Winata 1

Hans Friedrich Witschel 1

Marcos Zampieri 1

Fabio Massimo Zanzotto 1

Sina Zarrieß 1

Hans-Peter Zorn 1

Christine de Kock 1

Niklas von Boguszewski 1

Tatiana von Landesberger 1

Gerret von Nordheim 1

Venues