Gerard De Melo - ACL Anthology

Gerard De Melo

Also published as: Gerard de Melo

2026

PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
Zetian Ouyang | Linlin Wang | Gerard de Melo | Liang He
Findings of the Association for Computational Linguistics: ACL 2026

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 27,215 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs’ performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs’ numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Yindong Wang | Martin Preiß | Margarita Bugueño | Jan Vincent Hoffbauer | Abdullatif Ghajar | Tolga Buz | Gerard de Melo
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT, a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit’s r/AskScience. Evaluating 9 state-of-the-art LLMs reveals two critical limitations. First, models exhibit a dominant salient distractor failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors. Crucially, this pattern persists across all model scales (1B to 70B), indicating a fundamental semantic grounding deficit that scaling alone fails to resolve. Second, we find that comparative judgment is paradoxically harder than independent detection: even GPT-4o’s F1 score drops from 0.67 to 0.53 when comparing answers side-by-side. These findings directly challenge the reliability of LLM-as-Judge paradigms for scientific factuality. Code and data are released at https://github.com/ddz5431/ReFACT.

2025

The Hidden Cost of Structure: How Constrained Decoding Affects Language Model Performance
Maximilian Schall | Gerard de Melo
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large Language Models excel at generating fluent text, but real-world applications increasingly demand structured outputs like JSON that can be programmatically processed. While prior work examines either task performance or format compliance in isolation, we investigate their interaction through comprehensive experiments across 11 models and multiple benchmarks. We uncover a fundamental divergence between base and instruction-tuned models under structural constraints. Base models often benefit from constrained decoding, producing more precise outputs, while instruction-tuned models frequently suffer performance degradation on generation tasks despite maintaining stability on classification tasks. Our log probability analysis reveals the underlying mechanism: constrained decoding forces models away from their preferred natural language patterns into lower-confidence structured alternatives. We demonstrate that successful constrained generation requires both adapted prompts and sufficient few-shot examples, with constrained models showing steeper performance gains from additional demonstrations compared to unconstrained generation. Notably, we find that base model performance under constraints can serve as an early indicator of post-training structured output capabilities, offering a practical evaluation tool for model development. These findings suggest that current instruction-tuning practices may inadvertently reduce models’ structured output capabilities and highlight the need for training-time integration of structural constraints in future model development.
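The mechanism named in this abstract (constraints pushing a model toward lower-confidence tokens) can be illustrated with a toy sketch. It is not the authors' experimental setup: the vocabulary, the logits, and the helper names `softmax` and `greedy_pick` are all hypothetical, chosen only to show how masking disallowed tokens changes a greedy choice.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy_pick(logits, allowed=None):
    """Pick the highest-logit token, optionally restricted to an allowed set."""
    if allowed is None:
        candidates = logits
    else:
        candidates = {tok: val for tok, val in logits.items() if tok in allowed}
    if not candidates:
        raise ValueError("no token satisfies the constraint")
    return max(candidates, key=candidates.get)

# Hypothetical next-token logits: the model prefers to start with prose.
toy_logits = {"The": 3.0, "Answer": 2.0, "{": 0.5, "[": 0.1}
# Hypothetical constraint: a JSON value must start with "{" or "[".
json_start = {"{", "["}

probs = softmax(toy_logits)
free_tok = greedy_pick(toy_logits)                # unconstrained choice
forced_tok = greedy_pick(toy_logits, json_start)  # constrained choice

print(free_tok, round(probs[free_tok], 3))
print(forced_tok, round(probs[forced_tok], 3))
```

In this toy case the constrained choice is a token to which the unconstrained distribution assigns much less probability, which is the kind of gap a log probability analysis would surface.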

GraphLSS: Integrating Lexical, Structural, and Semantic Features for Long Document Extractive Summarization
Margarita Bugueño | Hazem Abou Hamdan | Gerard De Melo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Heterogeneous graph neural networks have recently gained attention for long document summarization, modeling the extraction as a node classification task. Although effective, these models often require external tools or additional machine learning models to define graph components, producing highly complex and less intuitive structures. We present GraphLSS, a heterogeneous graph construction for long document extractive summarization, incorporating Lexical, Structural, and Semantic features. It defines two levels of information (words and sentences) and four types of edges (sentence semantic similarity, sentence occurrence order, word in sentence, and word semantic similarity) without any need for auxiliary learning models. Experiments on two benchmark datasets show that GraphLSS is competitive with top-performing graph-based methods, outperforming recent non-graph models. We release our code on GitHub.

InFact: Informativeness Alignment for Improved LLM Factuality
Roi Cohen | Russa Biswas | Gerard de Melo
Findings of the Association for Computational Linguistics: EMNLP 2025

Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the sentence “Barack Obama was born in the United States” is factually correct, though less informative than the sentence “Barack Obama was born in Honolulu, Hawaii, United States”. Beyond their known tendency to hallucinate and generate factually incorrect text, LLMs may also tend to generate text that is factually correct yet less informative than other available choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual informativeness benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. We find that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.

Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
Yue Li | Xin Yi | Dongsheng Shi | Gerard De Melo | Xiaoling Wang | Linlin Wang
Findings of the Association for Computational Linguistics: ACL 2025

With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

SLlama: Parameter-Efficient Language Model Architecture for Enhanced Linguistic Competence Under Strict Data Constraints
Victor Adelakun Omolaoye | Babajide Alamu Owoyele | Gerard de Melo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Scaling data and model size has driven recent advances in language modeling, but this strategy falters under scenarios with strict data constraints, as in the BabyLM Challenge. However, insights from Chinchilla highlight that smaller models trained on more data outperform larger counterparts trained inadequately, emphasizing the need for compact architectures. Furthermore, while embedding weight tying is a common parameter-saving technique, we find it significantly diminishes linguistic competence in compact models. In response, we explore alternative architectural strategies that preserve the parameter efficiency of tied models without sacrificing the representational benefits of untied embeddings. Consequently, we introduce SLlama, a Llama3 architecture variant that incorporates four targeted modifications to compress Transformer components: Repeated Reduced Hidden Size and Projection (RRHP), Permutated Weight Attention (PWA), Shared Projection Multi-Layer Perceptron (SPMLP), and Layer Weight Sharing. Without relying on distillation, SLlama achieves a 31.72% improvement in linguistic knowledge acquisition over the BabyLlama baseline, with a comparable GLUE score and significantly lower parameter count. These results demonstrate that well-designed, compact models can rival larger ones under strict data constraints.

ACE-M³: Automatic Capability Evaluator for Multimodal Medical Models
Xiechi Zhang | Shunfan Zheng | Linlin Wang | Gerard de Melo | Zhu Cao | Xiaoling Wang | Liang He
Proceedings of the 31st International Conference on Computational Linguistics

As multimodal large language models (MLLMs) gain prominence in the medical field, the need for precise evaluation methods to assess their effectiveness has become critical. While benchmarks provide a reliable means to evaluate the capabilities of MLLMs, traditional metrics like ROUGE and BLEU employed for open domain evaluation only focus on token overlap and may not align with human judgment. While human evaluation is more reliable, it is labor-intensive, costly, and not scalable. LLM-based evaluation methods have proven promising, but to date, there is still an urgent need for open-source multimodal LLM-based evaluators in the medical field. To address this issue, we introduce ACE-M³, an open-sourced Automatic Capability Evaluator for Multimodal Medical Models that is specifically designed to assess the question answering abilities of medical MLLMs. It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria. Subsequently, a reward token-based direct preference optimization (RTDPO) strategy is incorporated to save training time without compromising the performance of our model. Extensive experiments have demonstrated the effectiveness of our ACE-M³ model in evaluating the capabilities of medical MLLMs.

AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Xiechi Zhang | Zetian Ouyang | Linlin Wang | Gerard De Melo | Zhu Cao | Xiaoling Wang | Ya Zhang | Yanfeng Wang | Liang He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.

2024

Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit’s Showerthoughts
Tolga Buz | Benjamin Frost | Nikola Genchev | Moritz Schneider | Lucie-Aimée Kaffee | Gerard de Melo
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa-based classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide the dataset for creative, witty text generation based on Reddit Showerthoughts posts.

GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages
Jonathan Janetzki | Gerard De Melo | Joshua Nemecek | Daniel Whitenack
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Over 7,000 of the world’s 7,168 living languages are still low-resourced. This paper aims to narrow the language documentation gap by creating multiparallel dictionaries, clustered by SIL’s semantic domains. This task is new for machine learning and has previously been done manually by native speakers. We propose GUIDE, a language-agnostic tool that uses a GNN to create and populate semantic domain dictionaries, using seed dictionaries and Bible translations as a parallel text corpus. Our work sets a new benchmark, achieving an exemplary average precision of 60% in eight zero-shot evaluation languages and predicting an average of 2,400 dictionary entries. We share the code, model, multilingual evaluation data, and new dictionaries with the research community: https://github.com/janetzki/GUIDE

Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians
Jan Hoffbauer | Sylwester Sawicki | Marc Ulrich | Tolga Buz | Konstantin Dobler | Moritz Schneider | Gerard De Melo
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)

Powerful LLMs like ChatGPT are adopted rapidly for a wide array of tasks, but their limitations in domain-specific areas become apparent, particularly when prompted to recite facts. This is critical especially for knowledge workers, who are adopting LLM-based tools rapidly. While there are various techniques that can help ingest knowledge into LLMs such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different model sizes from 1.3B to 7B parameters and other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining. In addition, we measure linguistic metrics and human and LLM-based preference. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining – unless the model is overfit to the tested knowledge. Fine-tuning on our Reddit dataset introduces less complex, but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with a company’s datasets.

Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)
Russa Biswas | Lucie-Aimée Kaffee | Oshin Agarwal | Pasquale Minervini | Sameer Singh | Gerard de Melo
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

LLMs Cannot (Yet) Match the Specificity and Simplicity of Online Communities in Long Form Question Answering
Kris-Fillip Kahl | Tolga Buz | Russa Biswas | Gerard De Melo
Findings of the Association for Computational Linguistics: EMNLP 2024

Retail investing is on the rise, and a growing number of users are relying on online finance communities to educate themselves. However, recent years have positioned Large Language Models (LLMs) as powerful question answering (QA) tools, shifting users away from interacting in communities towards discourse with AI-driven conversational interfaces. These AI tools are currently limited by the availability of labelled data containing domain-specific financial knowledge. Therefore, in this work, we curate a QA preference dataset SocialFinanceQA for fine-tuning and aligning LLMs, extracted from more than 7.4 million submissions and 82 million comments from 2008 to 2022 in Reddit’s 15 largest finance communities. Additionally, we propose a novel framework called SocialQA-Eval as a generally-applicable method to evaluate generated QA responses. We evaluate various LLMs fine-tuned on this dataset, using traditional metrics, LLM-based evaluation, and human annotation. Our results demonstrate the value of high-quality Reddit data, with even state-of-the-art LLMs improving on producing simpler and more specific responses.

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios
Zetian Ouyang | Yishuai Qiu | Linlin Wang | Gerard De Melo | Ya Zhang | Yanfeng Wang | Liang He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

With the proliferation of Large Language Models (LLMs) in diverse domains, there is a particular need for unified evaluation standards in clinical medical scenarios, where models need to be examined very thoroughly. We present CliMedBench, a comprehensive benchmark with 14 expert-guided core clinical scenarios specifically designed to assess the medical ability of LLMs across 7 pivot dimensions. It comprises 33,735 questions derived from real-world medical reports of top-tier tertiary hospitals and authentic examination exercises. The reliability of this benchmark has been confirmed in several ways. Subsequent experiments with existing LLMs have led to the following findings: (i) Chinese medical LLMs underperform on this benchmark, especially where medical reasoning and factual consistency are vital, underscoring the need for advances in clinical knowledge and diagnostic accuracy. (ii) Several general-domain LLMs demonstrate substantial potential in medical clinics, while the limited input capacity of many medical LLMs hinders their practical use. These findings reveal both the strengths and limitations of LLMs in clinical scenarios and offer critical insights for medical research.

Wiki-VEL: Visual Entity Linking for Structured Data on Wikimedia Commons
Philipp Bielefeld | Jasmin Geppert | Necdet Güven | Melna John | Adrian Ziupka | Lucie-Aimée Kaffee | Russa Biswas | Gerard De Melo
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Describing Wikimedia Commons images using Wikidata’s structured data enables a wide range of automation tasks, such as search and organization, as well as downstream tasks, such as labeling images or training machine learning models. However, there is currently a lack of structured data-labelled images on Wikimedia Commons. To close this gap, we propose the task of Visual Entity Linking (VEL) for Wikimedia Commons, in which we create new labels for Wikimedia Commons images from Wikidata items. VEL is a crucial tool for improving information retrieval, search, content understanding, cross-modal applications, and various machine-learning tasks. In this paper, we propose a method to create new labels for Wikimedia Commons images from Wikidata items. To this end, we create a novel dataset leveraging community-created structured data on Wikimedia Commons and fine-tuning pre-trained models based on the CLIP architecture. Although the best-performing models show promising results, the study also identifies key challenges of the data and the task.

NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents
Tamara Czinczoll | Christoph Hönes | Maximilian Schall | Gerard De Melo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While (large) language models have significantly improved over the last years, they still struggle to sensibly process long sequences found, e.g., in books, due to the quadratic scaling of the underlying attention mechanism. To address this, we propose NextLevelBERT, a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings. We pretrain NextLevelBERT to predict the vector representation of entire masked text chunks and evaluate the effectiveness of the resulting document vectors on three types of tasks: 1) Semantic Textual Similarity via zero-shot document embeddings, 2) Long document classification, 3) Multiple-choice question answering. We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperform much larger embedding models as long as the required level of detail of semantic information is not too fine. Our models and code are publicly available online.

2023

A Closer Look at Transformer Attention for Multilingual Translation
Jingyi Zhang | Gerard de Melo | Hongfei Xu | Kehai Chen
Proceedings of the Eighth Conference on Machine Translation

Transformers are the predominant model for machine translation. Recent works also showed that a single Transformer model can be trained to learn translation for multiple different language pairs, achieving promising results. In this work, we investigate how the multilingual Transformer model pays attention for translating different language pairs. We first performed automatic pruning to eliminate a large number of noisy heads and then analyzed the functions and behaviors of the remaining heads in both self-attention and cross-attention. We find that different language pairs, in spite of having different syntax and word orders, tended to share the same heads for the same functions, such as syntax heads and reordering heads. However, the different characteristics of different language pairs clearly caused interference in function heads and affected head accuracies. Additionally, we reveal an interesting behavior of the Transformer cross-attention: the deep-layer cross-attention heads work in a clear cooperative way to learn different options for word reordering, which can be caused by the nature of translation tasks having multiple different gold translations in the target language for the same source sentence.

NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Kaustubh Dhole | Varun Gangal | Sebastian Gehrmann | Aadesh Gupta | Zhenhao Li | Saad Mahamood | Abinaya Mahadiran | Simon Mille | Ashish Shrivastava | Samson Tan | Tongshang Wu | Jascha Sohl-Dickstein | Jinho Choi | Eduard Hovy | Ondřej Dušek | Sebastian Ruder | Sajant Anand | Nagender Aneja | Rabin Banjade | Lisa Barthe | Hanna Behnke | Ian Berlot-Attwell | Connor Boyle | Caroline Brun | Marco Antonio Sobrevilla Cabezudo | Samuel Cahyawijaya | Emile Chapuis | Wanxiang Che | Mukund Choudhary | Christian Clauss | Pierre Colombo | Filip Cornell | Gautier Dagan | Mayukh Das | Tanay Dixit | Thomas Dopierre | Paul-Alexis Dray | Suchitra Dubey | Tatiana Ekeinhor | Marco Di Giovanni | Tanya Goyal | Rishabh Gupta | Louanes Hamla | Sang Han | Fabrice Harel-Canada | Antoine Honoré | Ishan Jindal | Przemysław Joniak | Denis Kleyko | Venelin Kovatchev | Kalpesh Krishna | Ashutosh Kumar | Stefan Langer | Seungjae Ryan Lee | Corey James Levinson | Hualou Liang | Kaizhao Liang | Zhexiong Liu | Andrey Lukyanenko | Vukosi Marivate | Gerard de Melo | Simon Meoni | Maxine Meyer | Afnan Mir | Nafise Sadat Moosavi | Niklas Meunnighoff | Timothy Sum Hon Mun | Kenton Murray | Marcin Namysl | Maria Obedkova | Priti Oli | Nivranshu Pasricha | Jan Pfister | Richard Plant | Vinay Prabhu | Vasile Pais | Libo Qin | Shahab Raji | Pawan Kumar Rajpoot | Vikas Raunak | Roy Rinberg | Nicholas Roberts | Juan Diego Rodriguez | Claude Roux | Vasconcellos Samus | Ananya Sai | Robin Schmidt | Thomas Scialom | Tshephisho Sefara | Saqib Shamsi | Xudong Shen | Yiwen Shi | Haoyue Shi | Anna Shvets | Nick Siegel | Damien Sileo | Jamie Simon | Chandan Singh | Roman Sitelew | Priyank Soni | Taylor Sorensen | William Soto | Aman Srivastava | Aditya Srivatsa | Tony Sun | Mukund Varma | A Tabassum | Fiona Tan | Ryan Teehan | Mo Tiwari | Marie Tolkiehn | Athena Wang | Zijian Wang | Zijie Wang | Gloria Wang | Fuxuan Wei | Bryan Wilie | Genta Indra Winata | Xinyu Wu | Witold Wydmanski | Tianbao Xie | Usama Yaseen | Michael Yee | Jing Zhang | Yue Zhang
Northern European Journal of Language Technology, Volume 9

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.

Connecting the Dots: What Graph-Based Text Representations Work Best for Text Classification using Graph Neural Networks?
Margarita Bugueño | Gerard de Melo
Findings of the Association for Computational Linguistics: EMNLP 2023

Given the success of Graph Neural Networks (GNNs) for structure-aware machine learning, many studies have explored their use for text classification, but mostly in specific domains with limited data characteristics. Moreover, some strategies prior to GNNs relied on graph mining and classical machine learning, making it difficult to assess their effectiveness in modern settings. This work extensively investigates graph representation methods for text classification, identifying practical implications and open challenges. We compare different graph construction schemes using a variety of GNN architectures and setups across five datasets, encompassing short and long documents as well as unbalanced scenarios in diverse domains. Two Transformer-based large language models are also included to complement the study. The results show that i) although the effectiveness of graphs depends on the textual input features and domain, simple graph constructions perform better the longer the documents are, ii) graph representations are especially beneficial for longer documents, outperforming Transformer-based models, iii) graph methods are particularly efficient for solving the task.

PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain?
Sedigheh Eslami | Christoph Meinel | Gerard de Melo
Findings of the Association for Computational Linguistics: EACL 2023

Contrastive Language–Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image–text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. In this work, we evaluate the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). We present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles. Our experiments conducted on two MedVQA benchmark datasets illustrate that PubMedCLIP achieves superior results, improving the overall accuracy by up to 3% in comparison to the state-of-the-art Model-Agnostic Meta-Learning (MAML) networks pre-trained only on visual data. The PubMedCLIP model with different back-ends and the source code for pre-training them and reproducing our MedVQA pipeline are publicly available at https://github.com/sarahESL/PubMedCLIP.

ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities
Terry Yue Zhuo | Yaqing Liao | Yuecheng Lei | Lizhen Qu | Gerard de Melo | Xiaojun Chang | Yazhou Ren | Zenglin Xu
Findings of the Association for Computational Linguistics: EACL 2023

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.

Model-Agnostic Bias Measurement in Link Prediction
Lena Schwertmann | Manoj Prabhakar Kannan Ravi | Gerard de Melo
Findings of the Association for Computational Linguistics: EACL 2023

Link prediction models based on factual knowledge graphs are commonly used in applications such as search and question answering. However, work investigating social bias in these models has been limited. Previous work focused on knowledge graph embeddings, so more recent classes of models achieving superior results by fine-tuning Transformers have not yet been investigated. We therefore present a model-agnostic approach for bias measurement leveraging fairness metrics to compare bias in knowledge graph embedding-based predictions (KG only) with models that use pre-trained, Transformer-based language models (KG+LM). We further create a dataset to measure gender bias in occupation predictions and assess whether the KG+LM models are more or less biased than KG only models. We find that gender bias tends to be higher for the KG+LM models and analyze potential connections to the accuracy of the models and the data bias inherent in our dataset. Finally, we discuss the limitations and ethical considerations of our work. The repository containing the source code and the data set is publicly available at https://github.com/lena-schwert/comparing-bias-in-KG-models.

FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models
Konstantin Dobler | Gerard de Melo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Using model weights pretrained on a high-resource language as a warm start can reduce the need for data and compute to obtain high-quality language models for other, especially low-resource, languages. However, if we want to use a new tokenizer specialized for the target language, we cannot transfer the source model’s embedding matrix. In this paper, we propose FOCUS - **F**ast **O**verlapping Token **C**ombinations **U**sing **S**parsemax, a novel embedding initialization method that effectively initializes the embedding matrix for a new tokenizer based on information in the source model’s embedding matrix. FOCUS represents newly added tokens as combinations of tokens in the overlap of the source and target vocabularies. The overlapping tokens are selected based on semantic similarity in an auxiliary static token embedding space. We focus our study on using the multilingual XLM-R as a source model and empirically show that FOCUS outperforms random initialization and previous work on language modeling and on a range of downstream tasks (NLI, QA, and NER). We publish our model checkpoints and code on GitHub.
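
The initialization idea described in this abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of cosine similarity, and the dictionary-based data layout are assumptions made for the example. Only the general recipe (sparsemax-weighted combinations of overlapping tokens, selected by similarity in an auxiliary embedding space) comes from the abstract.

```python
import math


def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the probability simplex."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(z, start=1):
        cumsum += z_k
        if 1.0 + k * z_k > cumsum:  # z_k is still inside the support
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]


def cosine(u, v):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_embedding(aux_new, aux_overlap, src_overlap):
    """Initialize one new token as a sparse combination of overlapping tokens.

    aux_new:     auxiliary static embedding of the new token
    aux_overlap: {token: auxiliary embedding} for tokens in both vocabularies
    src_overlap: {token: source-model embedding} for the same tokens
    (All three names are hypothetical and used only for this sketch.)
    """
    tokens = list(aux_overlap)
    weights = sparsemax([cosine(aux_new, aux_overlap[t]) for t in tokens])
    dim = len(next(iter(src_overlap.values())))
    vec = [0.0] * dim
    for t, w in zip(tokens, weights):
        if w > 0.0:  # sparsemax assigns exactly zero weight to most tokens
            for i, x in enumerate(src_overlap[t]):
                vec[i] += w * x
    return vec
```

Because sparsemax produces exact zeros, each new token depends on only a handful of overlapping tokens, which is what makes the combination sparse.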

Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models
Sepehr Janghorbani | Gerard De Melo
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Recent breakthroughs in self-supervised training have led to a new class of pretrained vision–language models. While there have been investigations of bias in multimodal models, they have mostly focused on gender and racial bias, giving much less attention to other relevant groups, such as minorities with regard to religion, nationality, sexual orientation, or disabilities. This is mainly due to lack of suitable benchmarks for such groups. We seek to address this gap by providing a visual and textual bias benchmark called MMBias, consisting of around 3,800 images and phrases covering 14 population subgroups. We utilize this dataset to assess bias in several prominent self-supervised multimodal models, including CLIP, ALBEF, and ViLT. Our results show that these models demonstrate meaningful bias favoring certain groups. Finally, we introduce a debiasing method designed specifically for such large pretrained models that can be applied as a post-processing step to mitigate bias, while preserving the remaining accuracy of the model.

Resolving Elliptical Compounds in German Medical Text
Niklas Kammer | Florian Borchert | Silvia Winkler | Gerard de Melo | Matthieu-P. Schapranow
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Elliptical coordinated compound noun phrases (ECCNPs), a special kind of coordination ellipsis, are a common phenomenon in German medical texts. As their presence is known to affect the performance in downstream tasks such as entity extraction and disambiguation, their resolution can be a useful preprocessing step in information extraction pipelines. In this work, we present a new comprehensive dataset of more than 4,000 manually annotated ECCNPs in German medical text, along with the respective ground truth resolutions. Based on this data, we propose a generative encoder-decoder Transformer model, allowing for a simple end-to-end resolution of ECCNPs from raw input strings with very high accuracy (90.5% exact match score). We compare our approach to an elaborate rule-based baseline, which the generative model outperforms by a large margin. We further investigate different scenarios for prompting large language models (LLM) to resolve ECCNPs. In a zero-shot setting, performance is remarkably poor (21.6% exact matches), as the LLM tends to apply complex changes to the inputs unrelated to our specific task. We also find no improvement over the generative model when using the LLM for post-filtering of generated candidate resolutions.

2022

Multi-Scale Distribution Deep Variational Autoencoder for Explanation Generation
ZeFeng Cai | Linlin Wang | Gerard de Melo | Fei Sun | Liang He
Findings of the Association for Computational Linguistics: ACL 2022

Generating explanations for recommender systems is essential for improving their transparency, as users often wish to understand the reason for receiving a specified recommendation. Previous methods mainly focus on improving the generation quality, but often produce generic explanations that fail to incorporate user- and item-specific details. To resolve this problem, we present Multi-Scale Distribution Deep Variational Autoencoders (MVAE). These are deep hierarchical VAEs with a prior network that eliminates noise while retaining meaningful signals in the input, coupled with a recognition network serving as the source of information to guide the learning of the prior network. Further, the Multi-scale distribution Learning Framework (MLF) along with a Target Tracking Kullback-Leibler divergence (TKL) mechanism are proposed to employ multiple KL divergences at different scales for more effective learning. Extensive empirical experiments demonstrate that our methods can generate explanations with concrete input-specific contents.

Curriculum Prompt Learning with Self-Training for Abstractive Dialogue Summarization
Changqun Li | Linlin Wang | Xin Lin | Gerard de Melo | Liang He
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Succinctly summarizing dialogue is a task of growing interest, but inherent challenges, such as insufficient training data and low information density, impede our ability to train abstractive models. In this work, we propose a novel curriculum-based prompt learning method with self-training to address these problems. Specifically, prompts are learned using a curriculum learning strategy that gradually increases the degree of prompt perturbation, thereby improving the dialogue understanding and modeling capabilities of our model. Unlabeled dialogue is incorporated by means of self-training so as to reduce the dependency on labeled data. We further investigate topic-aware prompts to better plan for the generation of summaries. Experiments confirm that our model substantially outperforms strong baselines and achieves new state-of-the-art results on the AMI and ICSI datasets. Human evaluations also show the superiority of our model with regard to the summary generation quality.

Fast-R2D2: A Pretrained Recursive Neural Network based on Pruned CKY for Grammar Induction and Text Representation
Xiang Hu | Haitao Mi | Liang Li | Gerard de Melo
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Chart-based models have shown great potential in unsupervised grammar induction, running recursively and hierarchically, but requiring O(n³) time-complexity. The Recursive Transformer based on Differentiable Trees (R2D2) makes it possible to scale to large language model pretraining even with a complex tree encoder, by introducing a heuristic pruning method. However, its rule-based pruning process suffers from local optima and slow inference. In this paper, we propose a unified R2D2 method that overcomes these issues. We use a top-down unsupervised parser as a model-guided pruning method, which also enables parallel encoding during inference. Our parser casts parsing as a split point scoring task by first scoring all split points for a given sentence and then using the highest-scoring one to recursively split a span into two parts. The reverse order of the splits is considered as the order of pruning in the encoder. We optimize the unsupervised parser by minimizing the Kullback–Leibler distance between tree probabilities from the parser and the R2D2 model. Our experiments show that our Fast-R2D2 significantly improves the grammar induction quality and achieves competitive results in downstream tasks.
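
The split-point parsing step described in this abstract can be illustrated with a small sketch. It assumes the split scores are already given as a list (in the paper they come from a learned parser), and the function name and the depth-first order in which splits are recorded are assumptions made for this example, not details confirmed by the abstract.

```python
def top_down_parse(tokens, split_scores):
    """Greedy top-down binary parsing from precomputed split-point scores.

    split_scores[i] scores the boundary between tokens[i] and tokens[i + 1],
    so there must be len(tokens) - 1 scores. Returns the binary tree as nested
    tuples together with the reversed split order, which the abstract describes
    as the pruning order used in the encoder.
    """
    if not tokens or len(split_scores) != len(tokens) - 1:
        raise ValueError("need at least one token and len(tokens) - 1 scores")

    splits = []  # split points in the order they are chosen (depth-first here)

    def build(lo, hi):
        # Parse the half-open span tokens[lo:hi].
        if hi - lo == 1:
            return tokens[lo]
        # Pick the highest-scoring boundary inside the span.
        best = max(range(lo, hi - 1), key=lambda i: split_scores[i])
        splits.append(best)
        return (build(lo, best + 1), build(best + 1, hi))

    tree = build(0, len(tokens))
    return tree, list(reversed(splits))


tree, prune_order = top_down_parse(["the", "cat", "sat", "down"], [0.2, 0.9, 0.4])
```

Since all split points are scored once up front, the recursion itself needs no further model calls.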

Improving Personalized Explanation Generation through Visualization
Shijie Geng | Zuohui Fu | Yingqiang Ge | Lei Li | Gerard de Melo | Yongfeng Zhang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In modern recommender systems, there are usually comments or reviews from users that justify their ratings for different items. Trained on such textual corpora, explainable recommendation models learn to discover user interests and generate personalized explanations. Though able to provide plausible explanations, existing models tend to generate repeated sentences for different items or empty sentences with insufficient details. This begs an interesting question: can we immerse the models in a multimodal environment to gain proper awareness of real-world concepts and alleviate the above shortcomings? To this end, we propose a visually-enhanced approach named METER with the help of visualization generation and text–image matching discrimination: the explainable recommendation model is encouraged to visualize what it refers to while incurring a penalty if the visualization is incongruent with the textual explanation. Experimental results and a manual assessment demonstrate that our approach can improve not only the text quality but also the diversity and explainability of the generated explanations.

Assessing Combinational Generalization of Language Models in Biased Scenarios
Yanbo Fang | Zuohui Fu | Xin Dong | Yongfeng Zhang | Gerard de Melo
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In light of the prominence of Pre-trained Language Models (PLMs) across numerous downstream tasks, shedding light on what they learn is an important endeavor. Whereas previous work focuses on assessing in-domain knowledge, we evaluate the generalization ability in biased scenarios through component combinations where it could be easy for the PLMs to learn shortcuts from the training corpus. This would lead to poor performance on the testing corpus, which is combinationally reconstructed from the training components. The results show that PLMs are able to overcome such distribution shifts for specific tasks and with sufficient data. We further find that overfitting can lead the models to depend more on biases for prediction, thus hurting the combinational generalization ability of PLMs.

2021

Personality Predictive Lexical Cues and Their Correlations
Xiaoli He | Gerard de Melo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In recent years, a number of studies have used linear models for personality prediction based on text. In this paper, we empirically analyze and compare the lexical signals captured in such models. We identify lexical cues for each dimension of the MBTI personality scheme in several different ways, considering different datasets, feature sets, and learning algorithms. We conduct a series of correlation analyses between the resulting MBTI data and explore their connection to other signals, such as for Big-5 traits, emotion, sentiment, age, and gender. The analysis shows intriguing correlation patterns between different personality dimensions and other traits, and also provides evidence for the robustness of the data.

Faithfully Explainable Recommendation via Neural Logic Reasoning
Yaxin Zhu | Yikun Xian | Zuohui Fu | Gerard de Melo | Yongfeng Zhang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Knowledge graphs (KG) have become increasingly important to endow modern recommender systems with the ability to generate traceable reasoning paths to explain the recommendation process. However, prior research rarely considers the faithfulness of the derived explanations to justify the decision-making process. To the best of our knowledge, this is the first work that models and evaluates faithfully explainable recommendation under the framework of KG reasoning. Specifically, we propose neural logic reasoning for explainable recommendation (LOGER) by drawing on interpretable logical rules to guide the path-reasoning process for explanation generation. We experiment on three large-scale datasets in the e-commerce domain, demonstrating the effectiveness of our method in delivering high-quality recommendations as well as ascertaining the faithfulness of the derived explanation.

Fast and Effective Biomedical Entity Linking Using a Dual Encoder
Rajarshi Bhowmik | Karl Stratos | Gerard de Melo
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Biomedical entity linking is the task of identifying mentions of biomedical concepts in text documents and mapping them to canonical entities in a target thesaurus. Recent advancements in entity linking using BERT-based models follow a retrieve and rerank paradigm, where the candidate entities are first selected using a retriever model, and then the retrieved candidates are ranked by a reranker model. While this paradigm produces state-of-the-art results, such models are slow both at training and test time, as they can process only one mention at a time. To mitigate these issues, we propose a BERT-based dual encoder model that resolves multiple mentions in a document in one shot. We show that our proposed model is multiple times faster than existing BERT-based models while being competitive in accuracy for biomedical entity linking. Additionally, we modify our dual encoder model for end-to-end biomedical entity linking that performs both mention span detection and entity disambiguation and out-performs two recently proposed models.

Exploiting Image–Text Synergy for Contextual Image Captioning
Sreyasi Nag Chowdhury | Rajarshi Bhowmik | Hareesh Ravi | Gerard de Melo | Simon Razniewski | Gerhard Weikum
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Modern web content - news articles, blog posts, educational resources, marketing brochures - is predominantly multimodal. A notable trait is the inclusion of media such as images placed at meaningful locations within a textual narrative. Most often, such images are accompanied by captions - either factual or stylistic (humorous, metaphorical, etc.) - making the narrative more engaging to the reader. While standalone image captioning has been extensively studied, captioning an image based on external knowledge such as its surrounding text remains under-explored. In this paper, we study this new task: given an image and an associated unstructured knowledge snippet, the goal is to generate a contextual caption for the image.

Guilt by Association: Emotion Intensities in Lexical Representations
Shahab Raji | Gerard de Melo
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

What do linguistic models reveal about the emotions associated with words? In this study, we consider the task of estimating word-level emotion intensity scores for specific emotions, exploring unsupervised, supervised, and finally a self-supervised method of extracting emotional associations from pretrained vectors and models. Overall, we find that linguistic models carry substantial potential for inducing fine-grained emotion intensity scores, showing a far higher correlation with human ground truth ratings than state-of-the-art emotion lexicons based on labeled data.

Context-Aware Interaction Network for Question Matching
Zhe Hu | Zuohui Fu | Yu Yin | Gerard de Melo
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Impressive milestones have been achieved in text matching by adopting a cross-attention mechanism to capture pertinent semantic connections between two sentence representations. However, regular cross-attention focuses on word-level links between the two input sequences, neglecting the importance of contextual information. We propose a context-aware interaction network (COIN) to properly align two sequences and infer their semantic relationship. Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information when aligning two sequences, and (2) a gate fusion layer to flexibly interpolate aligned representations. We apply multiple stacked interaction blocks to produce alignments at different levels and gradually refine the attention results. Experiments on two question matching datasets and detailed analyses demonstrate the effectiveness of our model.
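
To make the notion of a gate fusion layer concrete, the sketch below shows one common way to interpolate an original representation with its aligned counterpart using a sigmoid gate. The abstract does not give the exact formulation, so the element-wise gate parameters (`w_x`, `w_a`, `bias`) and the function name are assumptions chosen for simplicity. A real layer would typically use full weight matrices.

```python
import math


def gate_fusion(x, aligned, w_x, w_a, bias):
    """Interpolate x and its aligned representation with a per-dimension gate.

    g = sigmoid(w_x * x + w_a * aligned + bias)
    out = g * x + (1 - g) * aligned
    All parameters are plain lists of floats of equal length (a simplification).
    """
    fused = []
    for xi, ai, wx, wa, b in zip(x, aligned, w_x, w_a, bias):
        gate = 1.0 / (1.0 + math.exp(-(wx * xi + wa * ai + b)))
        fused.append(gate * xi + (1.0 - gate) * ai)
    return fused


# With all-zero gate parameters the gate is 0.5, so the output is the midpoint.
midpoint = gate_fusion([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

Each output value always lies between the two inputs, so the gate can only shift emphasis from one representation to the other.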

Data Augmentation with Adversarial Training for Cross-Lingual NLI
Xin Dong | Yaxin Zhu | Zuohui Fu | Dongkuan Xu | Gerard de Melo
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Due to recent pretrained multilingual representation models, it has become feasible to exploit labeled data from one language to train a cross-lingual model that can then be applied to multiple new languages. In practice, however, we still face the problem of scarce labeled data, leading to subpar results. In this paper, we propose a novel data augmentation strategy for better cross-lingual natural language inference by enriching the data to reflect more diversity in a semantically faithful way. To this end, we propose two methods of training a generative model to induce synthesized examples, and then leverage the resulting data using an adversarial training regimen for more robustness. In a series of detailed experiments, we show that this fruitful combination leads to substantial gains in cross-lingual inference.

R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling
Xiang Hu | Haitao Mi | Zujie Wen | Yafang Wang | Yi Su | Jing Zheng | Gerard de Melo
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. In this paper, we propose a recursive Transformer model based on differentiable CKY style binary trees to emulate this composition process, and we extend the bidirectional language model pre-training objective to this architecture, attempting to predict each word given its left and right abstraction nodes. To scale up our approach, we also introduce an efficient pruning and growing algorithm to reduce the time complexity and enable encoding in linear time. Experimental results on language modeling and unsupervised parsing show the effectiveness of our approach.

Assessing Emoji Use in Modern Text Processing Tools
Abu Awal Md Shoeb | Gerard de Melo
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Emojis have become ubiquitous in digital communication, due to their visual appeal as well as their ability to vividly convey human emotion, among other factors. This also leads to an increased need for systems and tools to operate on text containing emojis. In this study, we assess this support by considering test sets of tweets with emojis, based on which we perform a series of experiments investigating the ability of prominent NLP and text processing tools to adequately process them. In particular, we consider tokenization, part-of-speech tagging, dependency parsing, as well as sentiment analysis. Our findings show that many systems still have notable shortcomings when operating on text containing emojis.

2020

Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation
Kshitij Shah | Gerard de Melo
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. The generation methodology allows us to generate particularly challenging errors that require context-aware error detection. We use it to create a set of English language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data.
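
The two-step procedure outlined in this abstract (estimate error statistics from a small annotated set, then use them to inject errors into larger corpora) might look roughly like the sketch below. It is a simplified illustration under stated assumptions: it models only single-character substitutions, and the function names and example word pairs are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter, defaultdict


def learn_substitutions(pairs):
    """Count character substitutions in same-length (correct, misspelled) pairs."""
    stats = defaultdict(Counter)
    for correct, typo in pairs:
        if len(correct) != len(typo):
            continue  # this sketch ignores insertions and deletions
        for c, t in zip(correct, typo):
            if c != t:
                stats[c][t] += 1
    return stats


def inject_errors(text, stats, rate, rng):
    """Replace characters with observed confusions at the given per-character rate."""
    out = []
    for ch in text:
        options = stats.get(ch)
        if options and rng.random() < rate:
            chars, counts = zip(*options.items())
            # Sample a replacement in proportion to how often it was observed.
            out.append(rng.choices(chars, weights=counts, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


stats = learn_substitutions([("their", "thier"), ("cat", "cst")])
noisy = inject_errors("the cat sat", stats, rate=0.3, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated corpus reproducible across runs.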

Inducing Universal Semantic Tag Vectors
Da Huo | Gerard de Melo
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the well-established usefulness of part-of-speech tag annotations in many syntactically oriented downstream NLP tasks, the recently proposed notion of semantic tagging (Bjerva et al. 2016) aims at tagging words with tags informed by semantic distinctions, which are likely to be useful across a range of semantic tasks. To this end, their annotation scheme distinguishes, for instance, privative attributes from subsective ones. While annotated corpora exist, their size is limited and thus many words are out-of-vocabulary words. In this paper, we study to what extent we can automatically predict the tags associated with unseen words. We draw on large-scale word representation data to derive a large new Semantic Tag lexicon. Our experiments show that we can infer semantic tags for words with high accuracy both monolingually and cross-lingually.

EmoTag1200: Understanding the Association between Emojis and Emotions
Abu Awal Md Shoeb | Gerard de Melo
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Given the growing ubiquity of emojis in language, there is a need for methods and resources that shed light on their meaning and communicative role. One conspicuous aspect of emojis is their use to convey affect in ways that may otherwise be non-trivial to achieve. In this paper, we seek to explore the connection between emojis and emotions by means of a new dataset consisting of human-solicited association ratings. We additionally conduct experiments to assess to what extent such associations can be inferred from existing data in an unsupervised manner. Our experiments show that this succeeds when high-quality word-level information is available.

Domain-Specific Sentiment Lexicons Induced from Labeled Documents
SM Mazharul Islam | Xin Dong | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Sentiment analysis is an area of substantial relevance both in industry and in academia, including for instance in social studies. Although supervised learning algorithms have advanced considerably in recent years, in many settings it remains more practical to apply an unsupervised technique. The latter are oftentimes based on sentiment lexicons. However, existing sentiment lexicons reflect an abstract notion of polarity and do not do justice to the substantial differences of word polarities between different domains. In this work, we draw on a collection of domain-specific data to induce a set of 24 domain-specific sentiment lexicons. We rely on initial linear models to induce initial word intensity scores, and then train new deep models based on word vector representations to overcome the scarcity of the original seed data. Our analysis shows substantial differences between domains, which make domain-specific sentiment lexicons a promising form of lexical resource in downstream tasks, and the predicted lexicons indeed perform effectively on tasks such as review classification and cross-lingual word sentiment prediction.

Cross-Lingual Emotion Lexicon Induction using Representation Alignment in Low-Resource Settings
Arun Ramachandran | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Emotion lexicons provide information about associations between words and emotions. They have proven useful in analyses of reviews, literary texts, and posts on social media, among other things. We evaluate the feasibility of deriving emotion lexicons cross-lingually, especially for low-resource languages, from existing emotion lexicons in resource-rich languages. For this, we start out from very small corpora to induce cross-lingually aligned vector spaces. Our study empirically analyses the effectiveness of the induced emotion lexicons by measuring translation precision and correlations with existing emotion lexicons, along with measurements on a downstream task of sentence emotion prediction.

Data Augmentation for Multiclass Utterance Classification – A Systematic Study
Binxia Xu | Siyuan Qiu | Jie Zhang | Yafang Wang | Xiaoyu Shen | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

Utterance classification is a key component in many conversational systems. However, classifying real-world user utterances is challenging, as people may express their ideas and thoughts in manifold ways, and the amount of training data for some categories may be fairly limited, resulting in imbalanced data distributions. To alleviate these issues, we conduct a comprehensive survey regarding data augmentation approaches for text classification, including simple random resampling, word-level transformations, and neural text generation to cope with imbalanced data. Our experiments focus on multi-class datasets with a large number of data samples, which has not been systematically studied in previous work. The results show that the effectiveness of different data augmentation schemes depends on the nature of the dataset under consideration.

Sentence Analogies: Linguistic Regularities in Sentence Embeddings
Xunjie Zhu | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics

While important properties of word vector representations have been studied extensively, far less is known about the properties of sentence vector representations. Word vectors are often evaluated by assessing to what degree they exhibit regularities with regard to relationships of the sort considered in word analogies. In this paper, we investigate to what extent commonly used sentence vector representation spaces also reflect certain kinds of regularities. We propose a number of schemes to induce evaluation data, based on lexical analogy data as well as semantic relationships between sentences. Our experiments consider a wide range of sentence embedding methods, including ones based on BERT-style contextual embeddings. We find that different models differ substantially in their ability to reflect such regularities.

Interactive Question Clarification in Dialogue via Reinforcement Learning
Xiang Hu | Zujie Wen | Yafang Wang | Xiaolong Li | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Coping with ambiguous questions has been a perennial problem in real-world dialogue systems. Although clarification by asking questions is a common form of human interaction, it is hard to define appropriate questions to elicit more specific intents from a user. In this work, we propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query. We first formulate a collection partitioning problem to select a set of labels enabling us to distinguish potential unambiguous intents. We list the chosen labels as intent phrases to the user for further confirmation. The selected label along with the original user query then serves as a refined query, for which a suitable response can more easily be identified. The model is trained using reinforcement learning with a deep policy network. We evaluate our model based on real-world user clicks and demonstrate significant improvements across several different experiments.

Query Distillation: BERT-based Distillation for Ensemble Ranking
Wangshu Zhang | Junhong Liu | Zujie Wen | Yafang Wang | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Recent years have witnessed substantial progress in the development of neural ranking networks, but also an increasingly heavy computational burden due to growing numbers of parameters and the adoption of model ensembles. Knowledge Distillation (KD) is a common solution to balance the effectiveness and efficiency. However, it is not straightforward to apply KD to ranking problems. Ranking Distillation (RD) has been proposed to address this issue, but only shows effectiveness on recommendation tasks. We present a novel two-stage distillation method for ranking problems that allows a smaller student model to be trained while benefitting from the better performance of the teacher model, providing better control of the inference latency and computational burden. We design a novel BERT-based ranking model structure for list-wise ranking to serve as our student model. All ranking candidates are fed to the BERT model simultaneously, such that the self-attention mechanism can enable joint inference to rank the document list. Our experiments confirm the advantages of our method, not just with regard to the inference latency but also in terms of higher-quality rankings compared to the original teacher model.

2019

Using Multi-Sense Vector Embeddings for Reverse Dictionaries
Michael A. Hedderich | Andrew Yates | Dietrich Klakow | Gerard de Melo
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Popular word embedding methods such as word2vec and GloVe assign a single vector representation to each word, even if a word has multiple distinct meanings. Multi-sense embeddings instead provide different vectors for each sense of a word. However, they typically cannot serve as a drop-in replacement for conventional single-sense embeddings, because the correct sense vector needs to be selected for each word. In this work, we study the effect of multi-sense embeddings on the task of reverse dictionaries. We propose a technique to easily integrate them into an existing neural network architecture using an attention mechanism. Our experiments demonstrate that large improvements can be obtained when employing multi-sense embeddings both in the input sequence as well as for the target representation. An analysis of the sense distributions and of the learned attention is provided as well.

EmoTag – Towards an Emotion-Based Analysis of Emojis
Abu Awal Md Shoeb | Shahab Raji | Gerard de Melo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Despite being a fairly recent phenomenon, emojis have quickly become ubiquitous. Besides their extensive use in social media, they are now also invoked in customer surveys and feedback forms. Hence, there is a need for techniques to understand their sentiment and emotion. In this work, we provide a method to quantify the emotional association of basic emotions such as anger, fear, joy, and sadness for a set of emojis. We collect and process a unique corpus of 20 million emoji-centric tweets, such that we can capture rich emoji semantics using a comparably small dataset. We evaluate the induced emotion profiles of emojis with regard to their ability to predict word affect intensities as well as sentiment scores.

Rhetorically Controlled Encoder-Decoder for Modern Chinese Poetry Generation
Zhiqiang Liu | Zuohui Fu | Jie Cao | Gerard de Melo | Yik-Cheung Tam | Cheng Niu | Jie Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Rhetoric is a vital element in modern poetry, and plays an essential role in improving its aesthetics. However, to date, it has not been considered in research on automatic poetry generation. In this paper, we propose a rhetorically controlled encoder-decoder for modern Chinese poetry generation. Our model relies on a continuous latent variable as a rhetoric controller to capture various rhetorical patterns in an encoder, and then incorporates rhetoric-based mixtures while generating modern Chinese poetry. For metaphor and personification, an automated evaluation shows that our model outperforms state-of-the-art baselines by a substantial margin, while human evaluation shows that our model generates better poems than baseline methods in terms of fluency, coherence, meaningfulness, and rhetorical aesthetics.

CITE: A Corpus of Image-Text Discourse Relations
Malihe Alikhani | Sreyasi Nag Chowdhury | Gerard de Melo | Matthew Stone
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper presents a novel crowd-sourced resource for multimodal discourse: our resource characterizes inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations. Like previous corpora annotating discourse structure between text arguments, such as the Penn Discourse Treebank, our new corpus aids in establishing a better understanding of natural communication and common-sense reasoning, while our findings have implications for a wide range of applications, such as understanding and generation of multimodal documents.

A Robust Self-Learning Framework for Cross-Lingual Text Classification
Xin Dong | Gerard de Melo
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Based on massive amounts of data, recent pretrained contextual representation models have made significant strides in advancing a number of different English NLP tasks. However, for other languages, relevant training data may be lacking, while state-of-the-art deep learning methods are known to be data-hungry. In this paper, we present an elegantly simple robust self-learning framework to include unlabeled non-English samples in the fine-tuning process of pretrained multilingual representation models. We leverage a multilingual model’s own predictions on unlabeled non-English data in order to obtain additional information that can be used during further fine-tuning. Compared with original multilingual models and other cross-lingual classification models, we observe significant gains in effectiveness on document and sentiment classification for a range of diverse languages.

2018

Video Captioning with Multi-Faceted Attention
Xiang Long | Chuang Gan | Gerard de Melo
Transactions of the Association for Computational Linguistics, Volume 6

Video captioning has attracted an increasing amount of interest, due in part to its potential for improved accessibility and information retrieval. While existing methods rely on different kinds of visual features and model architectures, they do not make full use of pertinent semantic cues. We present a unified and extensible framework to jointly leverage multiple sorts of visual features and semantic attributes. Our novel architecture builds on LSTMs with two multi-faceted attention layers. These first learn to automatically select the most salient visual features or semantic attributes, and then yield overall representations for the input and output of the sentence generation component via custom feature scaling operations. Experimental results on the challenging MSVD and MSR-VTT datasets show that our framework outperforms previous work and performs robustly even in the presence of added noise to the features and attributes.

Exploring Semantic Properties of Sentence Embeddings
Xunjie Zhu | Tingfeng Li | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Neural vector representations are ubiquitous throughout all subfields of NLP. While word vectors have been studied in much detail, thus far only little light has been shed on the properties of sentence embeddings. In this paper, we assess to what extent prominent sentence embedding methods exhibit select semantic properties. We propose a framework that generates triplets of sentences to explore how changes in the syntactic structure or semantics of a given sentence affect the similarities obtained between their sentence embeddings.

A Helping Hand: Transfer Learning for Deep Sentiment Analysis
Xin Dong | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deep convolutional neural networks excel at sentiment polarity classification, but tend to require substantial amounts of training data, which moreover differs quite significantly between domains. In this work, we present an approach to feed generic cues into the training process of such networks, leading to better generalization abilities given limited training data. We propose to induce sentiment embeddings via supervision on extrinsic data, which are then fed into the model via a dedicated memory-based component. We observe significant gains in effectiveness on a range of different datasets in seven different languages.

Generating Fine-Grained Open Vocabulary Entity Type Descriptions
Rajarshi Bhowmik | Gerard de Melo
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While large-scale knowledge graphs provide vast amounts of structured facts about entities, a short textual description can often be useful to succinctly characterize an entity and its type. Unfortunately, many knowledge graph entities lack such textual descriptions. In this paper, we introduce a dynamic memory-based network that generates a short open vocabulary description of an entity by jointly leveraging induced fact embeddings as well as the dynamic context of the generated sequence of words. We demonstrate the ability of our architecture to discern relevant information for more accurate generation of type descriptions by pitting the system against several strong baselines.

Metaphor Suggestions based on a Semantic Metaphor Repository
Gerard de Melo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

FontLex: A Typographical Lexicon based on Affective Associations
Tugba Kulahcioglu | Gerard de Melo
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

WebChild 2.0: Fine-Grained Commonsense Knowledge Distillation
Niket Tandon | Gerard de Melo | Gerhard Weikum
Proceedings of ACL 2017, System Demonstrations

Multilingual Vector Representations of Words, Sentences, and Documents
Gerard de Melo
Proceedings of the IJCNLP 2017, Tutorial Abstracts

Neural vector representations are now ubiquitous in all subfields of natural language processing and text mining. While methods such as word2vec and GloVe are well-known, this tutorial focuses on multilingual and cross-lingual vector representations of words, but also of sentences and documents.

PACRR: A Position-Aware Neural IR Model for Relevance Matching
Kai Hui | Andrew Yates | Klaus Berberich | Gerard de Melo
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous works have successfully captured unigram term matches, how to fully employ position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR aiming at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years’ TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.

2016

Detecting Cross-Cultural Differences Using a Multilingual Topic Model
E.D. Gutiérrez | Ekaterina Shutova | Patricia Lichtenstein | Gerard de Melo | Luca Gilardi
Transactions of the Association for Computational Linguistics, Volume 4

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.

Visualizing and Curating Knowledge Graphs over Time and Space
Tong Ge | Yafang Wang | Gerard de Melo | Haofeng Li | Baoquan Chen
Proceedings of ACL-2016 System Demonstrations

Relation Classification via Multi-Level Attention CNNs
Linlin Wang | Zhu Cao | Gerard de Melo | Zhiyuan Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Medical Concept Embeddings via Labeled Background Corpora
Eneldo Loza Mencía | Gerard de Melo | Jinseok Nam
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In recent years, we have seen an increasing amount of interest in low-dimensional vector representations of words. Among other things, these facilitate computing word similarity and relatedness scores. The best-known examples of algorithms that produce representations of this sort are the word2vec approaches. In this paper, we investigate a new model to induce such vector spaces for medical concepts, based on a joint objective that exploits not only word co-occurrences but also manually labeled documents, as available from sources such as PubMed. Our extensive experimental analysis shows that our embeddings lead to significantly higher correlations with human similarity and relatedness assessments than previous work. Due to the simplicity and versatility of vector representations, these findings suggest that our resource can easily be used as a drop-in replacement to improve any systems relying on medical concept similarity measures.

The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud
John P. McCrae | Christian Chiarcos | Francis Bond | Philipp Cimiano | Thierry Declerck | Gerard de Melo | Jorge Gracia | Sebastian Hellmann | Bettina Klimek | Steven Moran | Petya Osenova | Antonio Pareja-Lora | Jonathan Pool
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.

2015

Semantic Information Extraction for Improved Word Embeddings
Jiaqiang Chen | Gerard de Melo
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

Perceptually Grounded Selectional Preferences
Ekaterina Shutova | Niket Tandon | Gerard de Melo
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Sentiment-Aspect Extraction based on Restricted Boltzmann Machines
Linlin Wang | Kang Liu | Zhu Cao | Jun Zhao | Gerard de Melo
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Wiktionary-based word embeddings
Gerard de Melo
Proceedings of Machine Translation Summit XV: Papers

2014

Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)
Miriam R. L. Petruck | Gerard de Melo
Proceedings of Frame Semantics in NLP: A Workshop in Honor of Chuck Fillmore (1929-2014)

OpenWordNet-PT: A Project Report
Alexandre Rademaker | Valeria de Paiva | Gerard de Melo | Livy Real | Maira Gatti
Proceedings of the Seventh Global Wordnet Conference

Embedding NomLex-BR nominalizations into OpenWordnet-PT
Alexandre Rademaker | Valeria de Paiva | Gerard de Melo | Livy Maria Real Coelho
Proceedings of the Seventh Global Wordnet Conference

Structured Learning for Taxonomy Induction with Belief Propagation
Mohit Bansal | David Burkett | Gerard de Melo | Dan Klein
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Bring vs. MTRoget: Evaluating automatic thesaurus translation
Lars Borin | Jens Allwood | Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Evaluation of automatic language-independent methods for language technology resource creation is difficult, and confounded by a largely unknown quantity, viz. to what extent typological differences among languages are significant for results achieved for one language or language pair to be applicable across languages generally. In the work presented here, as a simplifying assumption, language-independence is taken as axiomatic within certain specified bounds. We evaluate the automatic translation of Roget’s “Thesaurus” from English into Swedish using an independently compiled Roget-style Swedish thesaurus, S.C. Bring’s “Swedish vocabulary arranged into conceptual classes” (1930). Our expectation is that this explicit evaluation of one of the thesauruses created in the MTRoget project will provide a good estimate of the quality of the other thesauruses created using similar methods.

Etymological Wordnet: Tracing The History of Words
Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Research on the history of words has led to remarkable insights about language and also about the history of human civilization more generally. This paper presents the Etymological Wordnet, the first database that aims at making word origin information available as a large, machine-readable network of words in many languages. The information in this resource is obtained from Wiktionary. Extracting a network of etymological information from Wiktionary requires significant effort, as much of the etymological information is only given in prose. We rely on custom pattern matching techniques and mine a large network with over 500,000 word origin links as well as over 2 million derivational/compositional links.

NomLex-PT: A Lexicon of Portuguese Nominalizations
Valeria de Paiva | Livy Real | Alexandre Rademaker | Gerard de Melo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents NomLex-PT, a lexical resource describing Portuguese nominalizations. NomLex-PT connects verbs to their nominalizations, thereby enabling NLP systems to observe the potential semantic relationships between the two words when analysing a text. NomLex-PT is freely available and encoded in RDF for easy integration with other resources. Most notably, we have integrated NomLex-PT with OpenWordNet-PT, an open Portuguese Wordnet.

2013

Good, Great, Excellent: Global Inference of Semantic Intensities
Gerard de Melo | Mohit Bansal
Transactions of the Association for Computational Linguistics, Volume 1

Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages. We also make our code and data freely available.

2012

UWN: A Large Multilingual Lexical Knowledge Base
Gerard de Melo | Gerhard Weikum
Proceedings of the ACL 2012 System Demonstrations

Empirical Comparisons of MASC Word Sense Annotations
Gerard de Melo | Collin F. Baker | Nancy Ide | Rebecca J. Passonneau | Christiane Fellbaum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We analyze how different conceptions of lexical semantics affect sense annotations and how multiple sense inventories can be compared empirically, based on annotated text. Our study focuses on the MASC project, where data has been annotated using WordNet sense identifiers on the one hand, and FrameNet lexical units on the other. This allows us to compare the sense inventories of these lexical resources empirically rather than just theoretically, based on their glosses, leading to new insights. In particular, we compute contingency matrices and develop a novel measure, the Expected Jaccard Index, that quantifies the agreement between annotations of the same data based on two different resources even when they have different sets of categories.

Markov Chains for Robust Graph-Based Commonsense Information Extraction
Niket Tandon | Dheeraj Rajagopal | Gerard de Melo
Proceedings of COLING 2012: Demonstration Papers

OpenWordNet-PT: An Open Brazilian Wordnet for Reasoning
Valeria de Paiva | Alexandre Rademaker | Gerard de Melo
Proceedings of COLING 2012: Demonstration Papers

2010

Untangling the Cross-Lingual Link Structure of Wikipedia
Gerard de Melo | Gerhard Weikum
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Providing Multilingual, Multimodal Answers to Lexical Database Queries
Gerard de Melo | Gerhard Weikum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Language users are increasingly turning to electronic resources to address their lexical information needs, due to their convenience and their ability to simultaneously capture different facets of lexical knowledge in a single interface. In this paper, we discuss techniques to respond to a user's lexical queries by providing multilingual and multimodal information, and facilitating navigation along different types of links. To this end, structured information from sources like WordNet, Wikipedia, Wiktionary, as well as Web services is linked and integrated to provide a multi-faceted yet consistent response to user queries. The meanings of words in many different languages are characterized by mapping them to appropriate WordNet sense identifiers and adding multilingual gloss descriptions as well as example sentences. Relationships are derived from WordNet and Wiktionary to allow users to discover semantically related words, etymologically related words, alternative spellings, as well as misspellings. Last but not least, images, audio recordings, and geographical maps extracted from Wikipedia and Wiktionary allow for a multimodal experience.

2009

Extracting Sense-Disambiguated Example Sentences From Parallel Corpora
Gerard de Melo | Gerhard Weikum
Proceedings of the 1st Workshop on Definition Extraction

2008

Mapping Roget’s Thesaurus and WordNet to French
Gerard de Melo | Gerhard Weikum
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Roget’s Thesaurus and WordNet are very widely used lexical reference works. We describe an automatic mapping procedure that effectively produces French translations of the terms in these two resources. Our approach to the challenging task of disambiguation is based on structural statistics as well as measures of semantic relatedness that are utilized to learn a classification model for associations between entries in the thesaurus and French terms taken from bilingual dictionaries. By building and applying such models, we have produced French versions of Roget’s Thesaurus and WordNet with a considerable level of accuracy, which can be used for a variety of different purposes, by humans as well as in computational applications.

Co-authors

Alexandre Rademaker 4

Valeria de Paiva 4

Rajarshi Bhowmik 3

Margarita Bugueño 3

Lucie-Aimée Kaffee 3

Zetian Ouyang 3

Abu Awal Md Shoeb 3

Xiaoling Wang 3

Yongfeng Zhang 3

Konstantin Dobler 2

Sreyasi Nag Chowdhury 2

Maximilian Schall 2

Moritz Schneider 2

Ekaterina Shutova 2

Oshin Agarwal 1

Malihe Alikhani 1

Nagender Aneja 1

Collin F. Baker 1

Rabin Banjade 1

Klaus Berberich 1

Ian Berlot-Attwell 1

Philipp Bielefeld 1

Florian Borchert 1

Caroline Brun 1

David Burkett 1

Samuel Cahyawijaya 1

Xiaojun Chang 1

Emile Chapuis 1

Wanxiang Che (车万翔) 1

Jiaqiang Chen 1

Kehai Chen (陈科海) 1

Christian Chiarcos 1

Jinho D. Choi 1

Mukund Choudhary 1

Philipp Cimiano 1

Christian Clauss 1

Livy Maria Real Coelho 1

Pierre Colombo 1

Filip Cornell 1

Tamara Czinczoll 1

Gautier Dagan 1

Thierry Declerck 1

Kaustubh Dhole 1

Marco Di Giovanni 1

Thomas Dopierre 1

Paul-Alexis Dray 1

Suchitra Dubey 1

Ondřej Dušek 1

Tatiana Ekeinhor 1

Sedigheh Eslami 1

Christiane Fellbaum 1

Benjamin Frost 1

Sebastian Gehrmann 1

Nikola Genchev 1

Jasmin Geppert 1

Abdullatif Ghajar 1

Rishabh Gupta 1

E.D. Gutiérrez 1

Necdet Güven 1

Hazem Abou Hamdan 1

Louanes Hamla 1

Fabrice Harel-Canada 1

Michael A. Hedderich 1

Sebastian Hellmann 1

Jan Hoffbauer 1

Jan Vincent Hoffbauer 1

Antoine Honoré 1

Christoph Hönes 1

SM Mazharul Islam 1

Jonathan Janetzki 1

Sepehr Janghorbani 1

Przemysław Joniak 1

Kris-Fillip Kahl 1

Niklas Kammer 1

Manoj Prabhakar Kannan Ravi 1

Dietrich Klakow 1

Bettina Klimek 1

Venelin Kovatchev 1

Kalpesh Krishna 1

Tugba Kulahcioglu 1

Ashutosh Kumar 1

Stefan Langer 1

Seungjae Ryan Lee 1

Corey James Levinson 1

Kaizhao Liang 1

Patricia Lichtenstein 1

Zhiqiang Liu (刘志强) 1

Eneldo Loza Mencía 1

Andrey Lukyanenko 1

Abinaya Mahadiran 1

Saad Mahamood 1

Vukosi Marivate 1

John Philip McCrae 1

Christoph Meinel 1

Niklas Meunnighoff 1

Pasquale Minervini 1

Nafise Sadat Moosavi 1

Timothy Sum Hon Mun 1

Kenton Murray 1

Marcin Namysl 1

Joshua Nemecek 1

Maria Obedkova 1

Victor Adelakun Omolaoye 1

Petya Osenova 1

Babajide Alamu Owoyele 1

Antonio Pareja Lora 1

Nivranshu Pasricha 1

Rebecca J. Passonneau 1

Miriam R. L. Petruck 1

Richard Plant 1

Jonathan Pool 1

Martin Preiß 1

Dheeraj Rajagopal 1

Pawan Kumar Rajpoot 1

Arun Ramachandran 1

Simon Razniewski 1

Nicholas Roberts 1

Juan Diego Rodriguez 1

Sebastian Ruder 1

Vasconcellos Samus 1

Sylwester Sawicki 1

Matthieu-P. Schapranow 1

Robin Schmidt 1

Lena Schwertmann 1

Thomas Scialom 1

Tshephisho Sefara 1

Dongsheng Shi 1

Ashish Shrivastava 1

Chandan Singh 1

Roman Sitelew 1

Marco Antonio Sobrevilla Cabezudo 1

Jascha Sohl-Dickstein 1

Taylor Sorensen 1

William Soto Martinez 1

Aman Srivastava 1

Aditya Srivatsa 1

Matthew Stone 1

Yik-Cheung Tam 1

Marie Tolkiehn 1

Daniel Whitenack 1

Genta Indra Winata 1

Silvia Winkler 1

Witold Wydmanski 1

Hongfei Xu (许鸿飞) 1

Wangshu Zhang 1

Shunfan Zheng 1

Terry Yue Zhuo 1

Adrian Ziupka 1

Venues