Towards Robust Comparisons of NLP Models: A Case Study
Vicente Ivan Sanchez Carmona
Shanshan Jiang
Bin Dong
Proceedings of the 31st International Conference on Computational Linguistics
Comparing the test scores of different NLP models across downstream datasets to determine which model leads to the most accurate results is the ultimate step in any experimental work. Doing so via a single mean score may not accurately quantify the real capabilities of the models. Previous works have proposed diverse statistical tests to improve the comparison of NLP models; however, a key statistical phenomenon remains understudied: variability in test scores. We propose a type of regression analysis which better explains this phenomenon by isolating the effect of both nuisance factors (such as random seeds) and datasets from the effects of the models’ capabilities. We showcase our approach via a case study of some of the most popular biomedical NLP models: after isolating nuisance factors and datasets, our results show that the difference between BioLinkBERT and MSR BiomedBERT is, actually, 7 times smaller than previously reported.
Multilevel Analysis of Biomedical Domain Adaptation of Llama 2: What Matters the Most? A Case Study
Vicente Ivan Sanchez Carmona
Shanshan Jiang
Takeshi Suzuki
Bin Dong
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Domain adaptation of Large Language Models (LLMs) leads to models better suited for a particular domain by capturing patterns from domain text, which leads to improvements in downstream tasks. To the naked eye, these improvements are visible; however, the patterns are not. How can we know which patterns matter and how much they contribute to changes in downstream scores? Through a Multilevel Analysis we discover and quantify the effect of text patterns on downstream scores of domain-adapted Llama 2 for the task of sentence similarity (BIOSSES dataset). We show that text patterns from PubMed abstracts such as clear writing and simplicity, as well as the amount of biomedical information, are key to improving downstream scores. Also, we show how another factor not usually quantified contributes equally to downstream scores: choice of hyperparameters for both domain adaptation and fine-tuning.
A Multilevel Analysis of PubMed-only BERT-based Biomedical Models
Vicente Sanchez Carmona
Shanshan Jiang
Bin Dong
Proceedings of the 6th Clinical Natural Language Processing Workshop
Biomedical NLP models play a big role in the automatic extraction of information from biomedical documents, such as COVID research papers. Three landmark models have led the way in this area: BioBERT, MSR BiomedBERT, and BioLinkBERT. However, their shallow evaluation (a single mean score) prevents us from better understanding how the contributions proposed in each model advance the Biomedical NLP field. We show through a Multilevel Analysis how we can assess these contributions. Our analyses across 5000 fine-tuned models show that, actually, BiomedBERT’s true effect is bigger than BioLinkBERT’s effect, and the success of BioLinkBERT does not seem to be due to its contribution (the Link function) but to an unknown factor.
A Semantic Search Engine for Mathlib4
Guoxiong Gao
Haocheng Ju
Jiedong Jiang
Zihan Qin
Bin Dong
Findings of the Association for Computational Linguistics: EMNLP 2024
The interactive theorem prover Lean enables the verification of formal mathematical proofs and is backed by an expanding community. Central to this ecosystem is its mathematical library, mathlib4, which lays the groundwork for the formalization of an expanding range of mathematical theories. However, searching for theorems in mathlib4 can be challenging. To successfully search in mathlib4, users often need to be familiar with its naming conventions or documentation strings. Therefore, creating a semantic search engine that can be used easily by individuals with varying familiarity with mathlib4 is very important. In this paper, we present a semantic search engine for mathlib4 that accepts informal queries and finds the relevant theorems. We also establish a benchmark for assessing the performance of various search engines for mathlib4.
How Well Can a Genetic Algorithm Fine-tune Transformer Encoders? A First Approach
Vicente Ivan Sanchez Carmona
Shanshan Jiang
Bin Dong
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP
Genetic Algorithms (GAs) have been studied across different fields such as engineering or medicine to optimize diverse problems such as network routing or medical image segmentation. Moreover, they have been used to automatically find optimal architectures for deep neural networks. However, to our knowledge, they have not been applied as a weight optimizer for the Transformer model. While gradient descent has been the main paradigm for this task, we believe that GAs have advantages to bring to the table. In this paper, we will show that even though GAs are capable of fine-tuning Transformer encoders, their generalization ability is considerably poorer than that of Adam; however, on a closer look, GAs’ ability to exploit knowledge from 2 different pretraining datasets surpasses Adam’s ability to do so.
ChatGPT Is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models
Ning Bian
Xianpei Han
Le Sun
Hongyu Lin
Yaojie Lu
Ben He
Shanshan Jiang
Bin Dong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) have made significant progress in NLP. However, their ability to memorize, represent, and leverage commonsense knowledge has been a well-known pain point. In this paper, we specifically focus on ChatGPT, a widely used and easily accessible LLM, and ask the following questions: (1) Can ChatGPT effectively answer commonsense questions? (2) Is ChatGPT aware of the underlying commonsense knowledge for answering a specific question? (3) Is ChatGPT knowledgeable in commonsense? (4) Can ChatGPT effectively leverage commonsense for answering questions? We conduct a series of experiments on 11 datasets to evaluate ChatGPT’s commonsense abilities, including answering commonsense questions, identifying necessary knowledge, generating knowledge descriptions, and using knowledge descriptions to answer questions again. Experimental results show that: (1) ChatGPT can achieve good QA accuracies in commonsense tasks, while still struggling with certain domains of datasets. (2) ChatGPT is knowledgeable, and can accurately generate most of the commonsense knowledge using knowledge prompts. (3) Despite its knowledge, ChatGPT is an inexperienced commonsense problem solver, which cannot precisely identify the needed commonsense for answering a specific question. These findings raise the need to explore improved mechanisms for effectively incorporating commonsense into LLMs like ChatGPT, such as better instruction following and commonsense guidance.
Few-shot Named Entity Recognition via Superposition Concept Discrimination
Jiawei Chen
Hongyu Lin
Xianpei Han
Yaojie Lu
Shanshan Jiang
Bin Dong
Le Sun
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Few-shot NER aims to identify entities of target types with only a limited number of illustrative instances. Unfortunately, few-shot NER is severely challenged by the intrinsic precise generalization problem, i.e., it is hard to accurately determine the desired target type due to the ambiguity stemming from information deficiency. In this paper, we propose Superposition Concept Discriminator (SuperCD), which resolves the above challenge via an active learning paradigm. Specifically, a concept extractor is first introduced to identify superposition concepts from illustrative instances, with each concept corresponding to a possible generalization boundary. Then a superposition instance retriever is applied to retrieve corresponding instances of these superposition concepts from a large-scale text corpus. Finally, annotators are asked to annotate the retrieved instances, and these annotated instances together with the original illustrative instances are used to learn FS-NER models. To this end, we learn a universal concept extractor and superposition instance retriever using large-scale, openly available knowledge bases. Experiments show that SuperCD can effectively identify superposition concepts from illustrative instances, retrieve superposition instances from a large-scale corpus, and significantly improve few-shot NER performance with minimal additional effort.
Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models
Boxi Cao
Qiaoyu Tang
Hongyu Lin
Shanshan Jiang
Bin Dong
Xianpei Han
Jiawei Chen
Tianshu Wang
Le Sun
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Memory is one of the most essential cognitive functions, serving as a repository of world knowledge and episodes of activities. In recent years, large-scale pre-trained language models have shown remarkable memorizing ability. In contrast, vanilla neural networks without pre-training have long been observed to suffer from the catastrophic forgetting problem. To investigate this retentive-forgetful contradiction and understand the memorizing dynamic mechanism of language models, we conduct thorough experiments by controlling the target knowledge types, the learning strategies and the learning schedules. We find that: 1) Vanilla language models without pre-training are forgetful; 2) Pre-training leads to retentive language models; 3) Knowledge relevance and diversification significantly influence the memory formation. These conclusions are useful for understanding the abilities of pre-trained language models and shed light on designing and evaluating new learning and inference algorithms for language models.
SRCB at #SMM4H 2024: Making Full Use of LLM-based Data Augmentation in Adverse Drug Event Extraction and Normalization
Hongyu Li
Yuming Zhang
Yongwei Zhang
Shanshan Jiang
Bin Dong
Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
This paper reports on the performance of SRCB’s system in the Social Media Mining for Health (#SMM4H) 2024 Shared Task 1: extraction and normalization of adverse drug events (ADEs) in English tweets. We develop a system composed of an ADE extraction module and an ADE normalization module, which further includes a retrieval module and a filtering module. To alleviate the data imbalance and other issues introduced by the dataset, we employ 4 data augmentation techniques based on Large Language Models (LLMs) across both modules. Our best submission achieves an F1 score of 53.6 (49.4 on the unseen subset) on the ADE normalization task and an F1 score of 52.1 on the ADE extraction task.
SRCB at SemEval-2023 Task 2: A System of Complex Named Entity Recognition with External Knowledge
Yuming Zhang
Hongyu Li
Yongwei Zhang
Shanshan Jiang
Bin Dong
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
The MultiCoNER II shared task aims at detecting semantically ambiguous and complex named entities in short and low-context settings for multiple languages. The lack of context makes the recognition of ambiguous named entities challenging. To alleviate this issue, our team SRCB proposes an external-knowledge-based system, in which we utilize 3 different types of external knowledge retrieved in different ways. Given an original text, our system retrieves the possible labels and the descriptions for each potential entity detected by a mention detection model. We also retrieve a related document from Wikipedia as extra context for each original text. We concatenate the original text with the external knowledge as the input of NER models. The informative contextual representations with external knowledge significantly improve the NER performance in both the Chinese and English tracks. Our system won 3rd place in the Chinese track and 6th place in the English track.
Gazetteer-Enhanced Attentive Neural Networks for Named Entity Recognition
Hongyu Lin
Yaojie Lu
Xianpei Han
Le Sun
Bin Dong
Shanshan Jiang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Current region-based NER models rely only on fully-annotated training data to learn an effective region encoder, and therefore often face a training data bottleneck. To alleviate this problem, this paper proposes Gazetteer-Enhanced Attentive Neural Networks, which can enhance region-based NER by learning name knowledge of entity mentions from easily-obtainable gazetteers, rather than only from fully-annotated data. Specifically, we first propose an attentive neural network (ANN), which explicitly models the mention-context association and therefore is convenient for integrating externally-learned knowledge. Then we design an auxiliary gazetteer network, which can effectively encode name regularity of mentions using only gazetteers. Finally, the learned gazetteer network is incorporated into ANN for better NER. Experiments show that our ANN can achieve state-of-the-art performance on the ACE2005 named entity recognition benchmark. Besides, incorporating the gazetteer network can further improve the performance and significantly reduce the requirement for training data.
Supervised neural machine translation based on data augmentation and improved training & inference process
Yixuan Tong
Liang Liang
Boyan Liu
Shanshan Jiang
Bin Dong
Proceedings of the 6th Workshop on Asian Translation
This is the second time SRCB has participated in WAT. This paper describes the neural machine translation systems for the shared translation tasks of WAT 2019. We participated in the ASPEC tasks and submitted results for four language pairs: English-Japanese, Japanese-English, Chinese-Japanese, and Japanese-Chinese. We employed the Transformer model as the baseline and experimented with relative position representation, data augmentation, deep-layer models, and ensembling. Experiments show that all these methods can yield substantial improvements.
SRCB Neural Machine Translation Systems in WAT 2018
Yihan Li
Boyan Liu
Yixuan Tong
Shanshan Jiang
Bin Dong
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation