2024
Language Concept Erasure for Language-invariant Dense Retrieval
Zhiqi Huang | Puxuan Yu | Shauli Ravfogel | James Allan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multilingual models aim for language-invariant representations but still prominently encode language identity. This, along with the scarcity of high-quality parallel retrieval data, limits their retrieval performance. We introduce LANCER, a multi-task learning framework that improves language-invariant dense retrieval by reducing language-specific signals in the embedding space. Leveraging the notion of linear concept erasure, we design a loss function that penalizes cross-correlation between representations and their language labels. LANCER uses only English retrieval data and general multilingual corpora, training models to focus on language-invariant retrieval by semantic similarity without requiring a vast parallel corpus. Experimental results on various datasets show our method consistently improves over baselines, with extensive analyses demonstrating greater language agnosticism.
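The cross-correlation penalty described in this abstract can be illustrated with a small sketch. This is not the paper's exact loss: the function name and the standardize-then-correlate formulation are assumptions chosen for illustration.

```python
import numpy as np

def language_decorrelation_penalty(embeddings, language_labels, num_languages):
    """Toy sketch of penalizing cross-correlation between representations
    and language identity: standardize each embedding dimension across the
    batch, build one-hot language indicators, and sum the squared
    correlations of every (dimension, language) pair."""
    X = np.asarray(embeddings, dtype=float)        # (batch, dim)
    Y = np.eye(num_languages)[language_labels]     # (batch, n_lang) one-hot
    # Standardize both views across the batch dimension.
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    Yc = (Y - Y.mean(axis=0)) / (Y.std(axis=0) + 1e-8)
    # Cross-correlation between embedding dims and language labels.
    C = Xc.T @ Yc / X.shape[0]                     # (dim, n_lang)
    return float((C ** 2).sum())
```

Embeddings whose dimensions track language identity get a large penalty; language-invariant embeddings drive it toward zero.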
Discovering Biases in Information Retrieval Models Using Relevance Thesaurus as Global Explanation
Youngwoo Kim | Razieh Rahimi | James Allan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Most of the efforts in interpreting neural relevance models have been on local explanations, which explain the relevance of a document to a query. However, local explanations are not effective in predicting the model’s behavior on unseen texts. We aim at explaining a neural relevance model by providing lexical explanations that can be globally generalized. Specifically, we construct a relevance thesaurus containing semantically relevant query term and document term pairs, which can augment BM25 scoring functions to better approximate the neural model’s predictions. We propose a novel method for relevance thesaurus construction. Our method involves training a neural relevance model that can score the relevance of partial segments of queries and documents. The trained model is used to identify relevant terms over the vocabulary space. The resulting thesaurus explanation is evaluated based on ranking effectiveness and fidelity to the targeted neural ranking model. Finally, our thesaurus reveals the existence of brand name bias in ranking models, which further supports the utility of our explanation method.
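The idea of augmenting a BM25 scoring function with thesaurus term pairs can be sketched as follows. This is a toy illustration, not the paper's construction: the `THESAURUS` contents are hand-picked, and folding thesaurus matches into the term frequency is an assumed design.

```python
import math

# Toy relevance thesaurus: (query term, document term) -> match weight.
# In the paper such pairs are mined from a neural relevance model;
# here they are hand-picked for illustration.
THESAURUS = {("laptop", "notebook"): 0.8, ("cheap", "affordable"): 0.6}

def bm25_with_thesaurus(query, doc, doc_freq, num_docs, avg_len,
                        k1=1.2, b=0.75):
    """BM25 where a thesaurus pair lets a document term count as a
    down-weighted occurrence of the query term."""
    doc_len = len(doc)
    score = 0.0
    for q in query:
        # Effective term frequency: exact matches plus weighted
        # thesaurus matches.
        tf = sum(1.0 if t == q else THESAURUS.get((q, t), 0.0) for t in doc)
        if tf == 0.0:
            continue
        df = doc_freq.get(q, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score
```

A document containing "notebook" now earns partial credit for the query term "laptop", approximating the neural model's semantic matching inside a lexical scorer.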
2023
Conditional Natural Language Inference
Youngwoo Kim | Razieh Rahimi | James Allan
Findings of the Association for Computational Linguistics: EMNLP 2023
To properly explain sentence pairs that provide contradictory (different) information for different conditions, we introduce the task of conditional natural language inference (Cond-NLI) and focus on automatically extracting contradictory aspects and their conditions from a sentence pair. Cond-NLI can help to provide a full spectrum of information, such as when there are multiple answers to a question, each addressing a specific condition, or reviews with different opinions for different conditions. We show that widely-used feature-attribution explanation models are not suitable for finding conditions, especially when sentences are long and are written independently. We propose a simple yet effective model for the original NLI task that can successfully extract conditions while not requiring token-level annotations. Our model enhances the interpretability of the NLI task while maintaining comparable accuracy. To evaluate models for Cond-NLI, we build and release a token-level annotated dataset, BioClaim, which contains potentially contradictory claims from the biomedical domain. Our experiments show that our proposed model outperforms the full cross-encoder and other baselines in extracting conditions. It also performs on par with GPT-3, which has an order of magnitude more parameters and was trained on a huge amount of data.
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction
Dong-Ho Lee | Ravi Kiran Selvam | Sheikh Muhammad Sarwar | Bill Yuchen Lin | Fred Morstatter | Jay Pujara | Elizabeth Boschee | James Allan | Xiang Ren
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging “entity triggers” which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model’s prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.
2019
A Multi-Task Architecture on Relevance-based Neural Query Translation
Sheikh Muhammad Sarwar | Hamed Bonab | James Allan
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
We describe a multi-task learning approach to train a Neural Machine Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search query translation. The translation process for the Cross-lingual Information Retrieval (CLIR) task is usually treated as a black box and performed as an independent step. However, an NMT model trained on sentence-level parallel data is not aware of the vocabulary distribution of the retrieval corpus. We address this problem and propose a multi-task learning architecture that achieves a 16% improvement over a strong baseline on an Italian-English query-document dataset. We show, using both quantitative and qualitative analyses, that our model generates balanced and precise translations through the regularization effect it achieves from the multi-task learning paradigm.
FEVER Breaker’s Run of Team NbAuzDrLqg
Youngwoo Kim | James Allan
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)
We describe our submission for the Breaker phase of the second Fact Extraction and VERification (FEVER) Shared Task. Our adversarial data can be explained from two perspectives. First, we aimed at testing a model’s ability to retrieve evidence when appropriate query terms could not be easily generated from the claim. Second, we test a model’s ability to precisely understand the implications of the texts, which we expect to be rare in the FEVER 1.0 dataset. Overall, we suggested six types of adversarial attacks. The evaluation of the submitted systems showed that the systems were only able to get both the evidence and label correct in 20% of the data. We also demonstrate our adversarial run analysis in the data development process.
2017
Improving Document Clustering by Removing Unnatural Language
Myungha Jang | Jinho D. Choi | James Allan
Proceedings of the 3rd Workshop on Noisy User-generated Text
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important factor in confusing existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that identifies unnatural language components in four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML), where unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
2008
Proceedings of ACL-08: HLT
Johanna D. Moore | Simone Teufel | James Allan | Sadaoki Furui
Proceedings of ACL-08: HLT
Proceedings of ACL-08: HLT, Short Papers
Johanna D. Moore | Simone Teufel | James Allan | Sadaoki Furui
Proceedings of ACL-08: HLT, Short Papers
2007
Information Retrieval On Empty Fields
Victor Lavrenko | Xing Yi | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
A Case For Shorter Queries, and Helping Users Create Them
Giridhar Kumaran | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Question Answering Using Integrated Information Retrieval and Information Extraction
Barry Schiffman | Kathleen McKeown | Ralph Grishman | James Allan
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts
Marti Hearst | Gina-Anne Levow | James Allan
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts
2005
Using Names and Topics for New Event Detection
Giridhar Kumaran | James Allan
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
Matching Inconsistently Spelled Names in Automatic Speech Recognizer Output for Information Retrieval
Hema Raghavan | James Allan
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing
2004
Using Soundex Codes for Indexing Names in ASR Documents
Hema Raghavan | James Allan
Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004
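The Soundex codes this title refers to are the classic American Soundex phonetic keys: keep the first letter, encode remaining consonants by sound class, and pad or truncate to four characters. A minimal sketch of the standard algorithm follows; the paper's indexing pipeline around it is not shown.

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits
    encoding consonant classes; vowels reset the previous code,
    while 'h' and 'w' do not."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    first = name[0].upper()
    result = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w are transparent separators
            continue
        code = codes.get(ch, "")
        if code and code != prev:      # collapse adjacent duplicates
            result.append(code)
        prev = code                    # vowels ('' code) reset prev
    return (first + "".join(result) + "000")[:4]
```

Names that sound alike but are spelled differently map to the same key, e.g. "Robert" and "Rupert" both become R163, so an index over Soundex keys can match inconsistently recognized names.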
Cross-Document Coreference on a Large Scale Corpus
Chung Heong Gooi | James Allan
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004
2001
An Evaluation Corpus For Temporal Summarization
Vikash Khandelwal | Rahul Gupta | James Allan
Proceedings of the First International Conference on Human Language Technology Research
Monitoring the News: a TDT demonstration system
David Frey | Rahul Gupta | Vikas Khandelwal | Victor Lavrenko | Anton Leuski | James Allan
Proceedings of the First International Conference on Human Language Technology Research
1993
The SMART Information Retrieval Project
C. Buckley | G. Salton | J. Allan
Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993