Lis Pereira


2024

Prior Knowledge-Guided Adversarial Training
Lis Pereira | Fei Cheng | Wan Jou She | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

We introduce a simple yet effective Prior Knowledge-Guided ADVersarial Training (PKG-ADV) algorithm to improve adversarial training for natural language understanding. Our method uses the task-specific label distribution to guide the training process: by exploiting prior knowledge of the labels, we aim to generate more informative adversarial perturbations. We apply our model to several challenging temporal reasoning tasks. Compared with randomized adversarial perturbation, our method enables a more reliable and controllable training process. Albeit simple, it achieved significant improvements on these tasks. To facilitate further research, we will release the code and models.
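
The abstract leaves the guidance rule unspecified; below is a minimal PyTorch sketch of one plausible reading, in which the perturbation applied to each example's embeddings is scaled by the prior probability of its gold label. `model`, `label_prior`, and the inverse-prior weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pkg_adv_loss(model, embeds, labels, label_prior, eps=1e-3):
    """One adversarial step guided by a task-specific label prior.
    `model` is assumed to map embeddings of shape (B, T, D) to logits,
    and `label_prior` (hypothetical) holds the empirical frequency of
    each label; here rarer labels receive larger perturbations, one
    plausible reading of "prior knowledge-guided"."""
    embeds = embeds.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)
    # Per-example scale from the inverse prior of the gold label.
    scale = (1.0 / label_prior[labels]).view(-1, 1, 1)
    delta = eps * scale * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return F.cross_entropy(model(embeds + delta), labels)
```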

QA-based Event Start-Points Ordering for Clinical Temporal Relation Annotation
Seiji Shimizu | Lis Pereira | Shuntaro Yada | Eiji Aramaki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Temporal relation annotation in the clinical domain is crucial yet challenging due to its workload and the medical expertise required. In this paper, we propose a novel annotation method that integrates event start-points ordering and question-answering (QA) as the annotation format. By focusing only on two points on a timeline, start-points ordering reduces ambiguity and simplifies the relation set to be considered during annotation. QA as annotation recasts temporal relation annotation into a reading comprehension task, allowing annotators to use natural language instead of the formalisms commonly adopted in temporal relation annotation. With our method, most of the relations in a document are inferable from a significantly smaller number of explicitly annotated relations, demonstrating its efficiency. Using these inferred relations, we develop a temporal relation classification model that achieves a 0.72 F1 score. Moreover, by decomposing the annotation process into QA generation and QA validation, our method enables collaboration among medical experts and non-experts. We obtained high inter-annotator agreement (IAA) scores, which indicate the positive prospect of such collaboration in the annotation process. Our annotated corpus, annotation tool, and trained model are publicly available: https://github.com/seiji-shimizu/qa-start-ordering.
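
To make the inference step concrete, here is a small sketch of how relations could be propagated from a handful of explicitly annotated start-point orderings; the relation labels and composition rules are an illustrative subset, not the paper's exact scheme.

```python
from itertools import product

# Composition table over start-point orderings; an illustrative
# subset only, the paper's relation set and rules may differ.
COMPOSE = {
    ("BEFORE", "BEFORE"): "BEFORE",
    ("BEFORE", "EQUAL"): "BEFORE",
    ("EQUAL", "BEFORE"): "BEFORE",
    ("EQUAL", "EQUAL"): "EQUAL",
}

def infer_relations(annotated):
    """Expand a small set of explicitly annotated orderings, given as
    {(event_a, event_b): label}, by transitive composition."""
    rel = dict(annotated)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(rel), repeat=2):
            if b == c and a != d and (a, d) not in rel:
                label = COMPOSE.get((rel[(a, b)], rel[(c, d)]))
                if label is not None:
                    rel[(a, d)] = label
                    changed = True
    return rel

# e.g. {("e1", "e2"): "BEFORE", ("e2", "e3"): "BEFORE"} also yields
# ("e1", "e3"): "BEFORE" without annotating it explicitly.
```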

2022

OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection
Lis Pereira | Ichiro Kobayashi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression. Given that a key challenge of this task is the limited size of annotated data, our model relies on pre-trained contextual representations from different state-of-the-art multilingual transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa) and on adversarial training, a training method for further enhancing model generalization and robustness. Without relying on any human-crafted features, knowledge base, or additional datasets other than the target datasets, our model achieved competitive results, ranking 6th in the SubTask A zero-shot setting and 15th in the SubTask A one-shot setting.
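
For orientation, a minimal version of the underlying classifier can be set up with the transformers library as follows; the adversarial training and the combination of encoders described above are omitted, and the checkpoint is one of the two models the abstract names.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal binary classifier (idiomatic vs. literal) on top of
# XLM-RoBERTa; adversarial fine-tuning is omitted in this sketch.
name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tokenizer(["He kicked the bucket last night."],
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**batch).logits.softmax(-1)
print(probs)  # the classification head is random until fine-tuned
```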

2021

OCHADAI at SMM4H-2021 Task 5: Classifying self-reporting tweets on potential cases of COVID-19 by ensembling pre-trained language models
Ying Luo | Lis Pereira | Kobayashi Ichiro
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

Since the outbreak of the coronavirus at the end of 2019, there have been numerous studies on it in the NLP arena. Meanwhile, Twitter has been a valuable source of news and a public medium for the conveyance of information and personal expression. This paper describes the system developed by the Ochadai team for the Social Media Mining for Health Applications (SMM4H) 2021 Task 5, which aims to automatically distinguish English tweets that self-report potential cases of COVID-19 from those that do not. We proposed a model ensemble that leverages pre-trained representations from COVID-Twitter-BERT (Müller et al., 2020), RoBERTa (Liu et al., 2019), and Twitter-RoBERTa (Glazkova et al., 2021). Our model obtained F1-scores of 76% on the test set in the evaluation phase and 77.5% in the post-evaluation phase.
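
The combination rule is not detailed in the abstract; the sketch below shows plain probability averaging, a common ensembling recipe, assuming `models` are HuggingFace-style sequence classifiers fine-tuned on the task and `batch` is a tokenized input.

```python
import torch

def ensemble_predict(models, batch):
    """Average the class probabilities of several fine-tuned models
    (e.g., COVID-Twitter-BERT, RoBERTa, Twitter-RoBERTa) and take the
    argmax. The authors' exact combination rule may differ."""
    with torch.no_grad():
        probs = [m(**batch).logits.softmax(-1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
```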

Posterior Differential Regularization with f-divergence for Improving Model Robustness
Hao Cheng | Xiaodong Liu | Lis Pereira | Yaoliang Yu | Jianfeng Gao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We address the problem of enhancing model robustness through regularization. Specifically, we focus on methods that regularize the model posterior difference between clean and noisy inputs. Theoretically, we establish a connection between two recent methods, Jacobian Regularization and Virtual Adversarial Training, under this framework. Additionally, we generalize posterior differential regularization to the family of f-divergences and characterize the overall framework in terms of the Jacobian matrix. Empirically, we compare these regularizations with standard BERT training on a diverse set of tasks to provide a comprehensive profile of their effect on model generalization. For both fully supervised and semi-supervised settings, we show that regularizing the posterior difference with an f-divergence can notably improve model robustness. In particular, with a proper f-divergence, a BERT-base model can achieve generalization comparable to its BERT-large counterpart in in-domain, adversarial, and domain-shift scenarios, indicating the great potential of the proposed framework for enhancing NLP model robustness.
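
Schematically, posterior differential regularization augments the task loss with an f-divergence between the model posteriors on a clean input x and a noisy input x̃; the paper derives the precise objective and its Jacobian characterization, so the form below is a simplified sketch:

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)}\!\left[
  \ell\big(p_{\theta}(\cdot \mid x),\, y\big)
  \;+\; \lambda\, D_{f}\big(p_{\theta}(\cdot \mid x)\,\big\|\,p_{\theta}(\cdot \mid \tilde{x})\big)
\right],
\qquad
D_{f}(P \,\|\, Q) \;=\; \sum_{k} Q_{k}\, f\!\left(\frac{P_{k}}{Q_{k}}\right)
```

Here f is convex with f(1) = 0; choosing f(t) = t log t recovers the KL divergence used in Virtual Adversarial Training.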

Targeted Adversarial Training for Natural Language Understanding
Lis Pereira | Xiaodong Liu | Hao Cheng | Hoifung Poon | Jianfeng Gao | Ichiro Kobayashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a simple yet effective Targeted Adversarial Training (TAT) algorithm to improve adversarial training for natural language understanding. The key idea is to introspect current mistakes and direct adversarial training steps to where the model errs the most. Experiments show that TAT can significantly improve accuracy over standard adversarial training on GLUE and attain new state-of-the-art zero-shot results on XNLI. Our code will be released upon acceptance of the paper.
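
As a rough illustration of the "introspect current mistakes" step, one could weight each example by how wrong the model currently is on it; the exact targeting criterion in TAT may differ.

```python
import torch

def targeting_weights(logits, labels):
    """Illustrative introspection step: weight each example by how far
    the model's probability of its gold label is from 1, so adversarial
    effort concentrates where the model currently errs most."""
    p_gold = logits.softmax(-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return (1.0 - p_gold).detach()  # higher weight = bigger mistake
```

Such weights could then scale per-example perturbation sizes or adversarial loss terms.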

ALICE++: Adversarial Training for Robust and Effective Temporal Reasoning
Lis Pereira | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.
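
A sketch of the two perturbation estimates follows, assuming `model` maps embeddings directly to logits. Estimate 2 probes with a small random direction, in the spirit of virtual adversarial training, because the divergence gradient vanishes exactly at the clean point.

```python
import torch
import torch.nn.functional as F

def perturbation_estimates(model, embeds, labels, eps=1e-3):
    """Two gradient-based estimates of the embedding perturbation:
    1) from the supervised loss against the true label, and
    2) from the divergence to the model's own prediction."""
    embeds = embeds.detach().requires_grad_(True)
    logits = model(embeds)
    pred = logits.softmax(-1).detach()  # model's own prediction

    # 1) true-label estimate: gradient of the supervised loss.
    g_sup, = torch.autograd.grad(F.cross_entropy(logits, labels), embeds)

    # 2) model-prediction estimate: gradient of the divergence between
    # clean predictions and predictions under a small random probe.
    d = (1e-3 * torch.randn_like(embeds)).requires_grad_(True)
    kl = F.kl_div(model(embeds.detach() + d).log_softmax(-1),
                  pred, reduction="batchmean")
    g_vat, = torch.autograd.grad(kl, d)

    unit = lambda g: g / (g.norm(dim=-1, keepdim=True) + 1e-12)
    return eps * unit(g_sup), eps * unit(g_vat)
```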

2017

Lexical Simplification with the Deep Structured Similarity Model
Lis Pereira | Xiaodong Liu | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification. Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the state-of-the-art on two standard datasets used for the task.
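
The core DSSM matching step reduces to cosine similarity in a learned semantic space; a minimal sketch, with `encode` standing in for the trained DSSM encoder (an assumption, not the paper's API):

```python
import torch
import torch.nn.functional as F

def rank_candidates(encode, context, candidates):
    """Rank substitution candidates by cosine similarity to the
    context in a shared semantic space. `encode` (hypothetical)
    maps a string to a 1-D embedding tensor."""
    query = encode(context)
    scores = torch.stack(
        [F.cosine_similarity(query, encode(c), dim=0) for c in candidates])
    order = scores.argsort(descending=True)
    return [candidates[i] for i in order]
```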

2015

Collocation Assistant for Learners of Japanese as a Second Language
Lis Pereira | Yuji Matsumoto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

2014

Collocation or Free Combination? — Applying Machine Translation Techniques to identify collocations in Japanese
Lis Pereira | Elga Strafella | Yuji Matsumoto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This work presents an initial investigation into how to distinguish collocations from free combinations. The assumption is that, while free combinations can be translated literally, the overall meaning of a collocation differs from the sum of the translations of its parts. Based on this assumption, we verify whether a machine translation system can help make this distinction. Results show that it improves precision compared with standard methods of collocation identification based on statistical association measures.
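
A hedged sketch of this idea: translate the expression as a whole and word by word, and flag low lexical overlap between the two as evidence of a collocation. Both translation functions and the threshold are hypothetical placeholders, not the paper's setup.

```python
def looks_like_collocation(pair, mt_translate, word_translate, thresh=0.5):
    """If the machine translation of the whole expression shares little
    vocabulary with the word-by-word translation of its parts, the
    expression is more likely a collocation than a free combination.
    `mt_translate` and `word_translate` are hypothetical functions."""
    literal = {w for token in pair for w in word_translate(token)}
    mt_output = set(mt_translate(" ".join(pair)).split())
    overlap = len(literal & mt_output) / max(len(mt_output), 1)
    return overlap < thresh  # threshold is illustrative
```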

Identifying collocations using cross-lingual association measures
Lis Pereira | Elga Strafella | Kevin Duh | Yuji Matsumoto
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

2013

Automated Collocation Suggestion for Japanese Second Language Learners
Lis Pereira | Erlyn Manguilimotan | Yuji Matsumoto
Proceedings of the Student Research Workshop at the 51st Annual Meeting of the Association for Computational Linguistics