2024
Language Concept Erasure for Language-invariant Dense Retrieval
Zhiqi Huang | Puxuan Yu | Shauli Ravfogel | James Allan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multilingual models aim for language-invariant representations but still prominently encode language identity. This, along with the scarcity of high-quality parallel retrieval data, limits their retrieval performance. We introduce LANCER, a multi-task learning framework that improves language-invariant dense retrieval by reducing language-specific signals in the embedding space. Leveraging the notion of linear concept erasure, we design a loss function that penalizes cross-correlation between representations and their language labels. LANCER uses only English retrieval data and general multilingual corpora, training models to focus on language-invariant retrieval by semantic similarity without requiring a vast parallel corpus. Experimental results on various datasets show our method consistently improves over baselines, with extensive analyses demonstrating greater language agnosticism.
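The cross-correlation penalty described in this abstract can be illustrated with a small sketch. This is not the paper's exact loss: the function name and the standardize-then-correlate formulation are assumptions chosen for illustration.

```python
import numpy as np

def language_decorrelation_penalty(embeddings, language_labels, num_languages):
    """Toy sketch of penalizing cross-correlation between representations
    and language identity: standardize each embedding dimension across the
    batch, build one-hot language indicators, and sum the squared
    correlations of every (dimension, language) pair."""
    X = np.asarray(embeddings, dtype=float)        # (batch, dim)
    Y = np.eye(num_languages)[language_labels]     # (batch, n_lang) one-hot
    # Standardize both views across the batch dimension.
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    Yc = (Y - Y.mean(axis=0)) / (Y.std(axis=0) + 1e-8)
    # Cross-correlation between embedding dims and language labels.
    C = Xc.T @ Yc / X.shape[0]                     # (dim, n_lang)
    return float((C ** 2).sum())
```

Embeddings whose dimensions track language identity get a large penalty; language-invariant embeddings drive it toward zero.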
Discovering Biases in Information Retrieval Models Using Relevance Thesaurus as Global Explanation
Youngwoo Kim | Razieh Rahimi | James Allan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Most of the efforts in interpreting neural relevance models have been on local explanations, which explain the relevance of a document to a query. However, local explanations are not effective in predicting the model’s behavior on unseen texts. We aim at explaining a neural relevance model by providing lexical explanations that can be globally generalized. Specifically, we construct a relevance thesaurus containing semantically relevant query term and document term pairs, which can augment BM25 scoring functions to better approximate the neural model’s predictions. We propose a novel method for relevance thesaurus construction. Our method involves training a neural relevance model that can score the relevance of partial segments of queries and documents. The trained model is used to identify relevant terms over the vocabulary space. The resulting thesaurus explanation is evaluated based on ranking effectiveness and fidelity to the targeted neural ranking model. Finally, our thesaurus reveals the existence of brand name bias in ranking models, which further supports the utility of our explanation method.
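The idea of augmenting a BM25 scoring function with thesaurus term pairs can be sketched as follows. This is a toy illustration, not the paper's construction: the `THESAURUS` contents are hand-picked, and folding thesaurus matches into the term frequency is an assumed design.

```python
import math

# Toy relevance thesaurus: (query term, document term) -> match weight.
# In the paper such pairs are mined from a neural relevance model;
# here they are hand-picked for illustration.
THESAURUS = {("laptop", "notebook"): 0.8, ("cheap", "affordable"): 0.6}

def bm25_with_thesaurus(query, doc, doc_freq, num_docs, avg_len,
                        k1=1.2, b=0.75):
    """BM25 where a thesaurus pair lets a document term count as a
    down-weighted occurrence of the query term."""
    doc_len = len(doc)
    score = 0.0
    for q in query:
        # Effective term frequency: exact matches plus weighted
        # thesaurus matches.
        tf = sum(1.0 if t == q else THESAURUS.get((q, t), 0.0) for t in doc)
        if tf == 0.0:
            continue
        df = doc_freq.get(q, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score
```

A document containing "notebook" now earns partial credit for the query term "laptop", approximating the neural model's semantic matching inside a lexical scorer.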
2023
Conditional Natural Language Inference
Youngwoo Kim | Razieh Rahimi | James Allan
Findings of the Association for Computational Linguistics: EMNLP 2023
To properly explain sentence pairs that provide contradictory (different) information for different conditions, we introduce the task of conditional natural language inference (Cond-NLI) and focus on automatically extracting contradictory aspects and their conditions from a sentence pair. Cond-NLI can help to provide a full spectrum of information, such as when there are multiple answers to a question, each addressing a specific condition, or reviews with different opinions for different conditions. We show that widely-used feature-attribution explanation models are not suitable for finding conditions, especially when sentences are long and are written independently. We propose a simple yet effective model for the original NLI task that can successfully extract conditions while not requiring token-level annotations. Our model enhances the interpretability of the NLI task while maintaining comparable accuracy. To evaluate models for Cond-NLI, we build and release a token-level annotated dataset, BioClaim, which contains potentially contradictory claims from the biomedical domain. Our experiments show that our proposed model outperforms the full cross-encoder and other baselines in extracting conditions. It also performs on par with GPT-3, which has an order of magnitude more parameters and was trained on a huge amount of data.
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction
Dong-Ho Lee | Ravi Kiran Selvam | Sheikh Muhammad Sarwar | Bill Yuchen Lin | Fred Morstatter | Jay Pujara | Elizabeth Boschee | James Allan | Xiang Ren
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging “entity triggers” which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model’s prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.
2019
A Multi-Task Architecture on Relevance-based Neural Query Translation
Sheikh Muhammad Sarwar | Hamed Bonab | James Allan
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
We describe a multi-task learning approach to train a Neural Machine Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search query translation. The translation process for the Cross-lingual Information Retrieval (CLIR) task is usually treated as a black box and performed as an independent step. However, an NMT model trained on sentence-level parallel data is not aware of the vocabulary distribution of the retrieval corpus. We address this problem and propose a multi-task learning architecture that achieves a 16% improvement over a strong baseline on an Italian-English query-document dataset. We show, using both quantitative and qualitative analyses, that our model generates balanced and precise translations through the regularization effect it achieves from the multi-task learning paradigm.
FEVER Breaker’s Run of Team NbAuzDrLqg
Youngwoo Kim | James Allan
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)
We describe our submission for the Breaker phase of the second Fact Extraction and VERification (FEVER) Shared Task. Our adversarial data can be explained from two perspectives. First, we aimed at testing a model’s ability to retrieve evidence when appropriate query terms could not be easily generated from the claim. Second, we test a model’s ability to precisely understand the implications of the texts, which we expect to be rare in the FEVER 1.0 dataset. Overall, we suggested six types of adversarial attacks. The evaluation of the submitted systems showed that the systems were only able to get both the evidence and label correct in 20% of the data. We also demonstrate our adversarial run analysis in the data development process.
2017
Improving Document Clustering by Removing Unnatural Language
Myungha Jang | Jinho D. Choi | James Allan
Proceedings of the 3rd Workshop on Noisy User-generated Text
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important factor in confusing existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that identifies unnatural language components in four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML), where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
2008
Proceedings of ACL-08: HLT
Johanna D. Moore | Simone Teufel | James Allan | Sadaoki Furui
Proceedings of ACL-08: HLT
Proceedings of ACL-08: HLT, Short Papers
Johanna D. Moore | Simone Teufel | James Allan | Sadaoki Furui
Proceedings of ACL-08: HLT, Short Papers
2007
Information Retrieval On Empty Fields
Victor Lavrenko | Xing Yi | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
A Case For Shorter Queries, and Helping Users Create Them
Giridhar Kumaran | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Question Answering Using Integrated Information Retrieval and Information Extraction
Barry Schiffman | Kathleen McKeown | Ralph Grishman | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts
Marti Hearst | Gina-Anne Levow | James Allan
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts
2005
Using Names and Topics for New Event Detection
Giridhar Kumaran | James Allan
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
Matching Inconsistently Spelled Names in Automatic Speech Recognizer Output for Information Retrieval
Hema Raghavan | James Allan
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
2004
Using Soundex Codes for Indexing Names in ASR Documents
Hema Raghavan | James Allan
Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004
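The Soundex codes this title refers to are the classic American Soundex phonetic keys: keep the first letter, encode remaining consonants by sound class, and pad or truncate to four characters. A minimal sketch of the standard algorithm follows; the paper's indexing pipeline around it is not shown.

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits
    encoding consonant classes; vowels reset the previous code,
    while 'h' and 'w' do not."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    first = name[0].upper()
    result = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w are transparent separators
            continue
        code = codes.get(ch, "")
        if code and code != prev:      # collapse adjacent duplicates
            result.append(code)
        prev = code                    # vowels ('' code) reset prev
    return (first + "".join(result) + "000")[:4]
```

Names that sound alike but are spelled differently map to the same key, e.g. "Robert" and "Rupert" both become R163, so an index over Soundex keys can match inconsistently recognized names.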
Cross-Document Coreference on a Large Scale Corpus
Chung Heong Gooi | James Allan
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004
2001
An Evaluation Corpus For Temporal Summarization
Vikash Khandelwal | Rahul Gupta | James Allan
Proceedings of the First International Conference on Human Language Technology Research
Monitoring the News: a TDT demonstration system
David Frey | Rahul Gupta | Vikas Khandelwal | Victor Lavrenko | Anton Leuski | James Allan
Proceedings of the First International Conference on Human Language Technology Research
1993
The SMART Information Retrieval Project
C. Buckley | G. Salton | J. Allan
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993