Nigel Collier


2022

Incorporating Stock Market Signals for Twitter Stance Detection
Costanza Conforti | Jakob Berndt | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Research in stance detection has so far focused on models which leverage purely textual input. In this paper, we investigate the integration of textual and financial signals for stance detection in the financial domain. Specifically, we propose a robust multi-task neural architecture that combines textual input with high-frequency intra-day time series from stock market prices. Moreover, we extend wt–wt, an existing stance detection dataset which collects tweets discussing Mergers and Acquisitions operations, with the relevant financial signal. Importantly, the obtained dataset aligns with Stander, an existing news stance detection dataset, thus resulting in a unique multimodal, multi-genre stance detection resource. We show experimentally and through detailed result analysis that our stance detection system benefits from financial information, and achieves state-of-the-art results on the wt–wt dataset: this demonstrates that the combination of multiple input signals is effective for cross-target stance detection, and opens interesting research directions for future work.

Improving Word Translation via Two-Stage Contrastive Learning
Yaoyiran Li | Fangyu Liu | Nigel Collier | Anna Korhonen | Ivan Vulić
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word translation or bilingual lexicon induction (BLI) is a key cross-lingual task, aiming to bridge the lexical gap between different languages. In this work, we propose a robust and effective two-stage contrastive learning framework for the BLI task. At Stage C1, we propose to refine standard cross-lingual linear maps between static word embeddings (WEs) via a contrastive learning objective; we also show how to integrate it into the self-learning procedure for even more refined cross-lingual maps. In Stage C2, we conduct BLI-oriented contrastive fine-tuning of mBERT, unlocking its word translation capability. We also show that static WEs induced from the ‘C2-tuned’ mBERT complement static WEs from Stage C1. Comprehensive experiments on standard BLI datasets for diverse languages and different experimental setups demonstrate substantial gains achieved by our framework. While the BLI method from Stage C1 already yields substantial gains over all state-of-the-art BLI methods in our comparison, even stronger improvements are met with the full two-stage framework: e.g., we report gains for 112/112 BLI setups, spanning 28 language pairs.

Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models
Zaiqiao Meng | Fangyu Liu | Ehsan Shareghi | Yixuan Su | Charlotte Collins | Nigel Collier
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge probing is crucial for understanding the knowledge transfer mechanism behind the pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To facilitate this, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, constructed based on the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% of acc@10. While highlighting various sources of domain-specific challenges that amount to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To achieve this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes the acc@10 to 28%, the performance gap still remains notable. Our human expert evaluation suggests that the probing performance of our Contrastive-Probe is still under-estimated as UMLS still does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further developments of more suited probing techniques for this domain. Our code and dataset are publicly available at https://github.com/cambridgeltl/medlama.

Prix-LM: Pretraining for Multilingual Knowledge Base Construction
Wenxuan Zhou | Fangyu Liu | Ivan Vulić | Nigel Collier | Muhao Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge bases (KBs) contain plenty of structured world and commonsense knowledge. As such, they often complement distributional text-based information and facilitate various downstream tasks. Since their manual construction is resource- and time-intensive, recent efforts have tried leveraging large pretrained language models (PLMs) to generate additional monolingual knowledge facts for KBs. However, such methods have not been attempted for building and enriching multilingual KBs. Besides wider application, such multilingual KBs can provide richer combined knowledge than monolingual (e.g., English) KBs. Knowledge expressed in different languages may be complementary and unequally distributed: this implies that the knowledge available in high-resource languages can be transferred to low-resource ones. To achieve this, it is crucial to represent multilingual knowledge in a shared/unified space. To this end, we propose a unified representation model, Prix-LM, for multilingual KB construction and completion. We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs, and tune a multilingual language encoder XLM-R via a causal language modeling objective. Prix-LM integrates useful multilingual and KB-based factual knowledge into a single model. Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness, with gains reported over strong task-specialised baselines.

TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning
Yixuan Su | Fangyu Liu | Zaiqiao Meng | Tian Lan | Lei Shu | Ehsan Shareghi | Nigel Collier
Findings of the Association for Computational Linguistics: NAACL 2022

Masked language models (MLMs) such as BERT have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct detailed analysis to reveal the merits and inner-workings of our approach.

2021

Learning Sparse Sentence Encoding without Supervision: An Exploration of Sparsity in Variational Autoencoders
Victor Prokhorov | Yingzhen Li | Ehsan Shareghi | Nigel Collier
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

It has long been known that sparsity is an effective inductive bias for learning efficient representations of data in vectors with fixed dimensionality, and it has been explored in many areas of representation learning. Of particular interest to this work is the investigation of the sparsity within the VAE framework which has been explored a lot in the image domain, but has been lacking even a basic level of exploration in NLP. Additionally, NLP is also lagging behind in terms of learning sparse representations of large units of text, e.g., sentences. We use the VAEs that induce sparse latent representations of large units of text to address the aforementioned shortcomings. First, we move in this direction by measuring the success of unsupervised state-of-the-art (SOTA) and other strong VAE-based sparsification baselines for text and propose a hierarchical sparse VAE model to address the stability issue of SOTA. Then, we look at the implications of sparsity on text classification across 3 datasets, and highlight a link between performance of sparse latent representations on downstream tasks and their ability to encode task-related information.

Self-Alignment Pretraining for Biomedical Entity Representations
Fangyu Liu | Ehsan Shareghi | Zaiqiao Meng | Marco Basaldella | Nigel Collier
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.

Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
Fangyu Liu | Ivan Vulić | Anna Korhonen | Nigel Collier
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous work has indicated that pretrained Masked Language Models (MLMs) are not effective as universal lexical and sentence encoders off-the-shelf, i.e., without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective lexical and sentence encoders even without any additional data, relying simply on self-supervision. We propose an extremely simple, fast, and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in 20-30 seconds with no access to additional external knowledge. Mirror-BERT relies on identical and slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during “identity fine-tuning”. We report huge gains over off-the-shelf MLMs with Mirror-BERT both in lexical-level and in sentence-level tasks, across different domains and different languages. Notably, in sentence similarity (STS) and question-answer entailment (QNLI) tasks, our self-supervised Mirror-BERT model even matches the performance of the Sentence-BERT models from prior work which rely on annotated task data. Finally, we delve deeper into the inner workings of MLMs, and suggest some evidence on why this simple Mirror-BERT fine-tuning approach can yield effective universal lexical and sentence encoders.

Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT
Zaiqiao Meng | Fangyu Liu | Thomas Clark | Ehsan Shareghi | Nigel Collier
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Infusing factual knowledge into pre-trained models is fundamental for many knowledge-intensive tasks. In this paper, we propose Mixture-of-Partitions (MoP), an infusion approach that can handle a very large knowledge graph (KG) by partitioning it into smaller sub-graphs and infusing their specific knowledge into various BERT models using lightweight adapters. To leverage the overall factual knowledge for a target task, these sub-graph adapters are further fine-tuned along with the underlying BERT through a mixture layer. We evaluate our MoP with three biomedical BERTs (SciBERT, BioBERT, PubMedBERT) on six downstream tasks (inc. NLI, QA, Classification), and the results show that our MoP consistently enhances the underlying BERTs in task performance, and achieves new SOTA performances on five evaluated datasets.

Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu | Emanuele Bugliarello | Edoardo Maria Ponti | Siva Reddy | Nigel Collier | Desmond Elliott
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.

Keep the Primary, Rewrite the Secondary: A Two-Stage Approach for Paraphrase Generation
Yixuan Su | David Vandyke | Simon Baker | Yan Wang | Nigel Collier
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Plan-then-Generate: Controlled Data-to-Text Generation via Planning
Yixuan Su | David Vandyke | Sihui Wang | Yimai Fang | Nigel Collier
Findings of the Association for Computational Linguistics: EMNLP 2021

Recent developments in neural networks have led to the advance in data-to-text generation. However, the lack of ability of neural models to control the structure of generated output can be limiting in certain real-world applications. In this study, we propose a novel Plan-then-Generate (PlanGen) framework to improve the controllability of neural data-to-text models. Extensive experiments and analyses are conducted on two benchmark datasets, ToTTo and WebNLG. The results show that our model is able to control both the intra-sentence and inter-sentence structure of the generated output. Furthermore, empirical comparisons against previous state-of-the-art methods show that our model improves the generation quality as well as the output diversity as judged by human and automatic evaluations.

Few-Shot Table-to-Text Generation with Prototype Memory
Yixuan Su | Zaiqiao Meng | Simon Baker | Nigel Collier
Findings of the Association for Computational Linguistics: EMNLP 2021

Neural table-to-text generation models have achieved remarkable progress on an array of tasks. However, due to the data-hungry nature of neural models, their performances strongly rely on large-scale training examples, limiting their applicability in real-world applications. To address this, we propose a new framework: Prototype-to-Generate (P2G), for table-to-text generation under the few-shot scenario. The proposed framework utilizes the retrieved prototypes, which are jointly selected by an IR system and a novel prototype selector, to help the model bridge the structural gap between tables and texts. Experimental results on three benchmark datasets with three state-of-the-art models demonstrate that the proposed framework significantly improves the model performance across various evaluation metrics.

Adversarial Training for News Stance Detection: Leveraging Signals from a Multi-Genre Corpus
Costanza Conforti | Jakob Berndt | Marco Basaldella | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Cross-target generalization constitutes an important issue for news Stance Detection (SD). In this short paper, we investigate adversarial cross-genre SD, where knowledge from annotated user-generated data is leveraged to improve news SD on targets unseen during training. We implement a BERT-based adversarial network and show experimental performance improvements over a set of strong baselines. Given the abundance of user-generated data, which are considerably less expensive to retrieve and annotate than news articles, this constitutes a promising research direction.

Synthetic Examples Improve Cross-Target Generalization: A Study on Stance Detection on a Twitter corpus
Costanza Conforti | Jakob Berndt | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Cross-target generalization is a known problem in stance detection (SD), where systems tend to perform poorly when exposed to targets unseen during training. Given that data annotation is expensive and time-consuming, finding ways to leverage abundant unlabeled in-domain data can offer great benefits. In this paper, we apply a weakly supervised framework to enhance cross-target generalization through synthetically annotated data. We focus on Twitter SD and show experimentally that integrating synthetic data is helpful for cross-target generalization, leading to significant improvements in performance, with gains in F1 scores ranging from +3.4 to +5.1.

MirrorWiC: On Eliciting Word-in-Context Representations from Pretrained Language Models
Qianchu Liu | Fangyu Liu | Nigel Collier | Anna Korhonen | Ivan Vulić
Proceedings of the 25th Conference on Computational Natural Language Learning

Recent work indicated that pretrained language models (PLMs) such as BERT and RoBERTa can be transformed into effective sentence and word encoders even via simple self-supervised techniques. Inspired by this line of work, in this paper we propose a fully unsupervised approach to improving word-in-context (WiC) representations in PLMs, achieved via a simple and efficient WiC-targeted fine-tuning procedure: MirrorWiC. The proposed method leverages only raw texts sampled from Wikipedia, assuming no sense-annotated data, and learns context-aware word representations within a standard contrastive learning setup. We experiment with a series of standard and comprehensive WiC benchmarks across multiple languages. Our proposed fully unsupervised MirrorWiC models obtain substantial gains over off-the-shelf PLMs across all monolingual, multilingual and cross-lingual setups. Moreover, on some standard WiC benchmarks, MirrorWiC is even on-par with supervised models fine-tuned with in-task data and sense labels.

Integrating Transformers and Knowledge Graphs for Twitter Stance Detection
Thomas Clark | Costanza Conforti | Fangyu Liu | Zaiqiao Meng | Ehsan Shareghi | Nigel Collier
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Stance detection (SD) entails classifying the sentiment of a text towards a given target, and is a relevant sub-task for opinion mining and social media analysis. Recent works have explored knowledge infusion supplementing the linguistic competence and latent knowledge of large pre-trained language models with structured knowledge graphs (KGs), yet few works have applied such methods to the SD task. In this work, we first perform stance-relevant knowledge probing on Transformers-based pre-trained models in a zero-shot setting, showing these models’ latent real-world knowledge about SD targets and their sensitivity to context. We then train and evaluate new knowledge-enriched stance detection models on two Twitter stance datasets, achieving state-of-the-art performance on both.

Dialogue Response Selection with Hierarchical Curriculum Learning
Yixuan Su | Deng Cai | Qingyu Zhou | Zibo Lin | Simon Baker | Yunbo Cao | Shuming Shi | Nigel Collier | Yan Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that models trained with random negative samples are not ideal in real-world scenarios, we propose a hierarchical curriculum learning framework that trains the matching model in an “easy-to-difficult” scheme. Our learning framework consists of two complementary curricula: (1) corpus-level curriculum (CC); and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability in finding the matching clues between the dialogue context and a response candidate. As for IC, it progressively strengthens the model’s ability in identifying the mismatching information between the dialogue context and a response candidate. Empirical studies on three benchmark datasets with three state-of-the-art matching models demonstrate that the proposed learning framework significantly improves the model performance across various evaluation metrics.

Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking
Fangyu Liu | Ivan Vulić | Anna Korhonen | Nigel Collier
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Injecting external domain-specific knowledge (e.g., UMLS) into pretrained language models (LMs) advances their capability to handle specialised in-domain tasks such as biomedical entity linking (BEL). However, such abundant expert knowledge is available only for a handful of languages (e.g., English). In this work, by proposing a novel cross-lingual biomedical entity linking task (XL-BEL) and establishing a new XL-BEL benchmark spanning 10 typologically diverse languages, we first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. The scores indicate large gaps to English performance. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones. To this end, we propose and evaluate a series of cross-lingual transfer methods for the XL-BEL task, and demonstrate that general-domain bitext helps propagate the available English knowledge to languages with little to no in-domain data. Remarkably, we show that our proposed domain-specific transfer methods yield consistent gains across all target languages, sometimes up to 20 Precision@1 points, without any in-domain knowledge in the target language, and without any in-domain parallel data.

Non-Autoregressive Text Generation with Pre-trained Language Models
Yixuan Su | Deng Cai | Yan Wang | David Vandyke | Simon Baker | Piji Li | Nigel Collier
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Non-autoregressive generation (NAG) has recently attracted great attention due to its fast inference speed. However, the generation quality of existing NAG models still lags behind their autoregressive counterparts. In this work, we show that BERT can be employed as the backbone of a NAG model for a greatly improved performance. Additionally, we devise two mechanisms to alleviate the two common problems of vanilla NAG models: the inflexibility of prefixed output length and the conditional independence of individual token predictions. To further strengthen the speed advantage of the proposed model, we propose a new decoding strategy, ratio-first, for applications where the output lengths can be approximately estimated beforehand. For a comprehensive evaluation, we test the proposed model on three text generation tasks, including text summarization, sentence compression and machine translation. Experimental results show that our model significantly outperforms existing non-autoregressive baselines and achieves competitive performance with many strong autoregressive models. In addition, we also conduct extensive analysis experiments to reveal the effect of each proposed component.

2020

STANDER: An Expert-Annotated Dataset for News Stance Detection and Evidence Retrieval
Costanza Conforti | Jakob Berndt | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Findings of the Association for Computational Linguistics: EMNLP 2020

We present a new challenging news dataset that targets both stance detection (SD) and fine-grained evidence retrieval (ER). With its 3,291 expert-annotated articles, the dataset constitutes a high-quality benchmark for future research in SD and multi-task learning. We provide a detailed description of the corpus collection methodology and carry out an extensive analysis on the sources of disagreement between annotators, observing a correlation between their disagreement and the diffusion of uncertainty around a target in the real world. Our experiments show that the dataset poses a strong challenge to recent state-of-the-art models. Notably, our dataset aligns with an existing Twitter SD dataset: their union thus addresses a key shortcoming of previous works, by providing the first dedicated resource to study multi-genre SD as well as the interplay of signals from social media and news sources in rumour verification.

Will-They-Won’t-They: A Very Large Dataset for Stance Detection on Twitter
Costanza Conforti | Jakob Berndt | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a new challenging stance detection dataset, called Will-They-Won’t-They (WT–WT), which contains 51,284 tweets in English, making it by far the largest available dataset of the type. All the annotations are carried out by experts; therefore, the dataset constitutes a high-quality and reliable benchmark for future research in stance detection. Our experiments with a wide range of recent state-of-the-art stance detection systems show that the dataset poses a strong challenge to existing models in this domain.

COMETA: A Corpus for Medical Entity Linking in the Social Media
Marco Basaldella | Fangyu Liu | Ehsan Shareghi | Nigel Collier
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman’s language. Meanwhile, there is a growing need for applications that can understand the public’s voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.

2019

Generating Knowledge Graph Paths from Textual Definitions using Sequence-to-Sequence Models
Victor Prokhorov | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a novel method for mapping unrestricted text to knowledge graph entities by framing the task as a sequence-to-sequence problem. Specifically, given the encoded state of an input text, our decoder directly predicts paths in the knowledge graph, starting from the root and ending at the target node following hypernym-hyponym relationships. In this way, and in contrast to other text-to-entity mapping systems, our model outputs hierarchically structured predictions that are fully interpretable in the context of the underlying ontology, in an end-to-end manner. We present a proof-of-concept experiment with encouraging results, comparable to those of state-of-the-art systems.

A Richer-but-Smarter Shortest Dependency Path with Attentive Augmentation for Relation Extraction
Duy-Cat Can | Hoang-Quynh Le | Quang-Thuy Ha | Nigel Collier
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

To extract the relationship between two entities in a sentence, two common approaches are (1) using their shortest dependency path (SDP) and (2) using an attention model to capture a context-based representation of the sentence. Each approach suffers from its own disadvantage of either missing or redundant information. In this work, we propose a novel model that combines the advantages of these two approaches. This is based on the basic information in the SDP enhanced with information selected by several attention mechanisms with kernel filters, namely RbSP (Richer-but-Smarter SDP). To exploit the representation behind the RbSP structure effectively, we develop a combined deep neural model with an LSTM network on word sequences and a CNN on RbSP. Experimental results on the SemEval-2010 dataset demonstrate improved performance over competitive baselines. The data and source code are available at https://github.com/catcd/RbSP.

On the Importance of the Kullback-Leibler Divergence Term in Variational Autoencoders for Text Generation
Victor Prokhorov | Ehsan Shareghi | Yingzhen Li | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 3rd Workshop on Neural Generation and Translation

Variational Autoencoders (VAEs) are known to suffer from learning uninformative latent representations of the input due to issues such as approximated posterior collapse, or entanglement of the latent space. We impose an explicit constraint on the Kullback-Leibler (KL) divergence term inside the VAE objective function. While the explicit constraint naturally avoids posterior collapse, we use it to further understand the significance of the KL term in controlling the information transmitted through the VAE channel. Within this framework, we explore different properties of the estimated posterior distribution, and highlight the trade-off between the amount of information encoded in a latent code during training, and the generative capacity of the model.

pdf bib
BioReddit: Word Embeddings for User-Generated Biomedical NLP
Marco Basaldella | Nigel Collier
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Word embeddings, in their different shapes and iterations, have changed the natural language processing research landscape in recent years. The biomedical text processing field is no stranger to this revolution; however, scholars in the field largely trained their embeddings on scientific documents only, even when working on user-generated data. In this paper we show how training embeddings on a corpus of user-generated text collected from medical forums heavily influences the performance on downstream tasks, outperforming embeddings trained either on general-purpose data or on scientific papers when applied to user-generated content.

2018

pdf bib
Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles
Costanza Conforti | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

In this paper, we propose to adapt the four-staged pipeline proposed by Zubiaga et al. (2018) for the Rumor Verification task to the problem of Fake News Detection. We show that the recently released FNC-1 corpus covers two of its steps, namely the Tracking and the Stance Detection task. We identify asymmetry in length in the input to be a key characteristic of the latter step, when adapted to the framework of Fake News Detection, and propose to handle it as a specific type of Cross-Level Stance Detection. Inspired by theories from the field of Journalism Studies, we implement and test two architectures to successfully model the internal structure of an article and its interactions with a claim.

pdf bib
Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models
Mohammad Taher Pilehvar | Dimitri Kartsaklis | Victor Prokhorov | Nigel Collier
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Rare word representation has recently enjoyed a surge of interest, owing to the crucial role that effective handling of infrequent words can play in accurate semantic understanding. However, there is a paucity of reliable benchmarks for evaluation and comparison of these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. In order to fill this evaluation gap, we propose the Cambridge Rare Word Dataset (Card-660), an expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques. Through a set of experiments we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve performances higher than 0.43 (Pearson correlation) on the dataset, compared to a human-level upper bound of 0.90. We release the dataset and the annotation materials at https://pilehvar.github.io/card-660/.

pdf bib
Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs
Dimitri Kartsaklis | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper addresses the problem of mapping natural language text to knowledge base entities. The mapping process is approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model is an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM), addressing polysemy issues. Further, the knowledge base space is prepared by collecting random walks from a graph enhanced with textual features, which act as a set of semantic bridges between text and knowledge base entities. The ideas of this work are demonstrated on large-scale text-to-entity mapping and entity classification tasks, with state-of-the-art results.

pdf bib
Large-scale Exploration of Neural Relation Classification Architectures
Hoang-Quynh Le | Duy-Cat Can | Sinh T. Vu | Thanh Hai Dang | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Experimental performance on the task of relation classification has generally improved using deep neural network architectures. One major drawback of reported studies is that individual models have been evaluated on a very narrow range of datasets, raising questions about the adaptability of the architectures, while making comparisons between approaches difficult. In this work, we present a systematic large-scale analysis of neural relation classification architectures on six benchmark datasets with widely varying characteristics. We propose a novel multi-channel LSTM model combined with a CNN that takes advantage of all currently popular linguistic and architectural features. Our ‘Man for All Seasons’ approach achieves state-of-the-art performance on two datasets. More importantly, in our view, the model allowed us to obtain direct insights into the continued challenges faced by neural language models on this task.

pdf bib
Which Melbourne? Augmenting Geocoding with Maps
Milan Gritta | Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The purpose of text geolocation is to associate geographic information contained in a document with a set (or sets) of coordinates, either implicitly by using linguistic features and/or explicitly by using geographic metadata combined with heuristics. We introduce a geocoder (location mention disambiguator) that achieves state-of-the-art (SOTA) results on three diverse datasets by exploiting the implicit lexical clues. Moreover, we propose a new method for systematic encoding of geographic metadata to generate two distinct views of the same text. To that end, we introduce the Map Vector (MapVec), a sparse representation obtained by plotting prior geographic probabilities, derived from population figures, on a World Map. We then integrate the implicit (language) and explicit (map) features to significantly improve a range of metrics. We also introduce an open-source dataset for geoparsing of news events covering global disease outbreaks and epidemics to help future evaluation in geoparsing.

2017

pdf bib
SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
Jose Camacho-Collados | Mohammad Taher Pilehvar | Nigel Collier | Roberto Navigli
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper introduces a new task on Multilingual and Cross-lingual Semantic Word Similarity which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. High quality datasets were manually curated for the five languages with high inter-annotator agreements (consistently in the 0.9 ballpark). These were used for semi-automatic construction of ten cross-lingual datasets. 17 teams participated in the task, submitting 24 systems in subtask 1 and 14 systems in subtask 2. Results show that systems that combine statistical knowledge from text corpora, in the form of word embeddings, and external knowledge from lexical resources are best performers in both subtasks. More information can be found on the task website: http://alt.qcri.org/semeval2017/task2/

pdf bib
Inducing Embeddings for Rare and Unseen Words by Leveraging Lexical Resources
Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We put forward an approach that exploits the knowledge encoded in lexical resources in order to induce representations for words that were not encountered frequently during training. Our approach provides an advantage over the past work in that it enables vocabulary expansion not only for morphological variations, but also for infrequent domain specific terms. We performed evaluations in different settings, showing that the technique can provide consistent improvements on multiple benchmarks across domains.

pdf bib
Vancouver Welcomes You! Minimalist Location Metonymy Resolution
Milan Gritta | Mohammad Taher Pilehvar | Nut Limsopatham | Nigel Collier
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Named entities are frequently used in a metonymic manner. They serve as references to related entities such as people and organisations. Accurate identification and interpretation of metonymy can be directly beneficial to various NLP applications, such as Named Entity Recognition and Geographical Parsing. Until now, metonymy resolution (MR) methods mainly relied on parsers, taggers, dictionaries, external word lists and other handcrafted lexical resources. We show how a minimalist neural approach combined with a novel predicate window method can achieve competitive results on the SemEval 2007 task on Metonymy Resolution. Additionally, we contribute a new Wikipedia-based MR dataset called RelocaR, which is tailored towards locations and addresses previous deficiencies in annotation guidelines.

pdf bib
Towards a Seamless Integration of Word Senses into Downstream NLP Applications
Mohammad Taher Pilehvar | Jose Camacho-Collados | Roberto Navigli | Nigel Collier
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.

2016

pdf bib
Improved Semantic Representation for Domain-Specific Entities
Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

pdf bib
Modelling the Combination of Generic and Target Domain Embeddings in a Convolutional Neural Network for Sentence Classification
Nut Limsopatham | Nigel Collier
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

pdf bib
Bidirectional LSTM for Named Entity Recognition in Twitter Messages
Nut Limsopatham | Nigel Collier
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

In this paper, we present our approach for named entity recognition in Twitter messages that we used in our participation in the Named Entity Recognition in Twitter shared task at the COLING 2016 Workshop on Noisy User-generated Text (WNUT). The main challenge that we aim to tackle in our participation is the short, noisy and colloquial nature of tweets, which makes named entity recognition in Twitter messages a challenging task. In particular, we investigate an approach for dealing with this problem by enabling bidirectional long short-term memory (LSTM) to automatically learn orthographic features without requiring feature engineering. In comparison with other systems participating in the shared task, our system achieved the most effective performance on both the ‘segmentation and categorisation’ and the ‘segmentation only’ sub-tasks.

pdf bib
Learning Orthographic Features in Bi-directional LSTM for Biomedical Named Entity Recognition
Nut Limsopatham | Nigel Collier
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

End-to-end neural network models for named entity recognition (NER) have been shown to achieve effective performance on general domain datasets (e.g. newswire), without requiring additional hand-crafted features. However, in the biomedical domain, recent studies have shown that hand-engineered features (e.g. orthographic features) should be used to attain effective performance, due to the complexity of biomedical terminology (e.g. the use of acronyms and complex gene names). In this work, we propose a novel approach that allows a neural network model based on a long short-term memory (LSTM) to automatically learn orthographic features and incorporate them into a model for biomedical NER. Importantly, our bi-directional LSTM model learns and leverages orthographic features on an end-to-end basis. We evaluate our approach by comparing against existing neural network models for NER using three well-established biomedical datasets. Our experimental results show that the proposed approach consistently outperforms these strong baselines across all three datasets.

pdf bib
NLP and Online Health Reports: What do we say and what do we mean?
Nigel Collier
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation
Nut Limsopatham | Nigel Collier
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
De-Conflated Semantic Representations
Mohammad Taher Pilehvar | Nigel Collier
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

pdf bib
Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages
Nut Limsopatham | Nigel Collier
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

pdf bib
The impact of near domain transfer on biomedical named entity recognition
Nigel Collier | Mai-vu Tran | Ferdinand Paster
Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)

pdf bib
Discriminating Rhetorical Analogies in Social Media
Christoph Lofi | Christian Nieke | Nigel Collier
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Exploring a Probabilistic Earley Parser for Event Composition in Biomedical Texts
Mai-Vu Tran | Nigel Collier | Hoang-Quynh Le | Van-Thuy Phi | Thanh-Binh Pham
Proceedings of the BioNLP Shared Task 2013 Workshop

2012

pdf bib
A Hybrid Approach to Finding Phenotype Candidates in Genetic Texts
Nigel Collier | Mai-Vu Tran | Hoang-Quynh Le | Anika Oellrich | Ai Kawazoe | Martin Hall-May | Dietrich Rebholz-Schuhmann
Proceedings of COLING 2012

pdf bib
On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online
Jey Han Lau | Nigel Collier | Timothy Baldwin
Proceedings of COLING 2012

pdf bib
An Experiment in Integrating Sentiment Features for Tech Stock Prediction in Twitter
Tien Thanh Vu | Shu Chang | Quang Thuy Ha | Nigel Collier
Proceedings of the Workshop on Information Extraction and Entity Analytics on Social Media Data

2010

pdf bib
An ontology-driven system for detecting global health events
Nigel Collier | Reiko Matsuda Goodwin | John McCrae | Son Doan | Ai Kawazoe | Mike Conway | Asanee Kawtrakul | Koichi Takeuchi | Dinh Dien
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
Using Hedges to Enhance a Disease Outbreak Report Text Mining System
Mike Conway | Son Doan | Nigel Collier
Proceedings of the BioNLP 2009 Workshop

2008

pdf bib
Global Health Monitor - A Web-based System for Detecting and Mapping Infectious Diseases
Son Doan | Quoc Hung Ngo | Ai Kawazoe | Nigel Collier
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib
The Choice of Features for Classification of Verbs in Biomedical Texts
Anna Korhonen | Yuval Krymolowski | Nigel Collier
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
The Role of Roles in Classifying Annotated Biomedical Text
Son Doan | Ai Kawazoe | Nigel Collier
Biological, translational, and clinical language processing

2006

pdf bib
Automatic Classification of Verbs in Biomedical Texts
Anna Korhonen | Yuval Krymolowski | Nigel Collier
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2004

pdf bib
Annotation of Coreference Relations Among Linguistic Expressions and Images in Biological Articles
Ai Kawazoe | Asanobu Kitamoto | Nigel Collier
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

In this paper, we propose an annotation scheme which can be used not only for annotating coreference relations between linguistic expressions, but also those among linguistic expressions and images, in scientific texts such as biomedical articles. Images in the biomedical domain often contain important information for analyses and diagnoses, and we consider that linking images to textual descriptions of their semantic contents in terms of coreference relations is useful for multimodal access to the information. We present our annotation scheme and the concept of a "coreference pool," which plays a central role in the scheme. We also introduce a support tool for text annotation named Open Ontology Forge, which we have already developed, and additional functions for the software to cover image annotations (ImageOF), which are now being developed.

pdf bib
An Annotation Scheme for a Rhetorical Analysis of Biology Articles
Yoko Mizuta | Nigel Collier
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Zone Identification in Biology Articles as a Basis for Information Extraction
Yoko Mizuta | Nigel Collier
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)

pdf bib
Introduction to the Bio-entity Recognition Task at JNLPBA
Nigel Collier | Jin-Dong Kim
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)

pdf bib
Sentiment Analysis using Support Vector Machines with Diverse Information Sources
Tony Mullen | Nigel Collier
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
Incorporating topic information into semantic analysis models
Tony Mullen | Nigel Collier
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2003

pdf bib
Bio-Medical Entity Extraction using Support Vector Machines
Koichi Takeuchi | Nigel Collier
Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine

2002

pdf bib
PIA-Core: Semantic Annotation through Example-based Learning
Nigel Collier | Koichi Takeuchi
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Progress on Multi-lingual Named Entity Annotation Guidelines using RDF (S)
Nigel Collier | Koichi Takeuchi | Chikashi Nobata | Junichi Fukumoto | Norihiro Ogata
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Use of Support Vector Machines in Extended Named Entity Recognition
Koichi Takeuchi | Nigel Collier
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

2000

pdf bib
Extracting the Names of Genes and Gene Products with a Hidden Markov Model
Nigel Collier | Chikashi Nobata | Jun-ichi Tsujii
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Comparison between Tagged Corpora for the Named Entity Task
Chikashi Nobata | Nigel Collier | Jun’ichi Tsujii
The Workshop on Comparing Corpora

pdf bib
Building an Annotated Corpus in the Molecular-Biology Domain
Yuka Tateisi | Tomoko Ohta | Nigel Collier | Chikashi Nobata | Jun-ichi Tsujii
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

1999

pdf bib
The GENIA project: corpus-based knowledge acquisition and information extraction from genome research papers
Nigel Collier | Hyun Seok Park | Norihiro Ogata | Yuka Tateishi | Chikashi Nobata | Tomoko Ohta | Tateshi Sekimizu | Hisao Imai | Katsutoshi Ibushi | Jun-ichi Tsujii
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

pdf bib
Machine Translation vs. Dictionary Term Translation - a Comparison for English-Japanese News Article Alignment
Nigel Collier | Hideki Hirakawa | Akira Kumano
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
An Experiment in Hybrid Dictionary and Statistical Sentence Alignment
Nigel Collier | Kenji Ono | Hideki Hirakawa
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
Machine Translation vs. Dictionary Term Translation - a Comparison for English-Japanese News Article Alignment
Nigel Collier | Hideki Hirakawa | Akira Kumano
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
An Experiment in Hybrid Dictionary and Statistical Sentence Alignment
Nigel Collier | Kenji Ono | Hideki Hirakawa
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1