Lis Pereira


pdf bib
OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection
Lis Pereira | Ichiro Kobayashi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression. Given that a key challenge with this task is the limited size of annotated data, our model relies on pre-trained contextual representations from different multi-lingual state-of-the-art transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa), and on adversarial training, a training method for further enhancing model generalization and robustness. Without relying on any human-crafted features, knowledgebase, or additional datasets other than the target datasets, our model achieved competitive results and ranked 6thplace in SubTask A (zero-shot) setting and 15thplace in SubTask A (one-shot) setting


pdf bib
ALICE++: Adversarial Training for Robust and Effective Temporal Reasoning
Lis Pereira | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

pdf bib
OCHADAI at SMM4H-2021 Task 5: Classifying self-reporting tweets on potential cases of COVID-19 by ensembling pre-trained language models
Ying Luo | Lis Pereira | Kobayashi Ichiro
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

Since the outbreak of coronavirus at the end of 2019, there have been numerous studies on coro- navirus in the NLP arena. Meanwhile, Twitter has been a valuable source of news and a pub- lic medium for the conveyance of information and personal expression. This paper describes the system developed by the Ochadai team for the Social Media Mining for Health Appli- cations (SMM4H) 2021 Task 5, which aims to automatically distinguish English tweets that self-report potential cases of COVID-19 from those that do not. We proposed a model ensemble that leverages pre-trained represen- tations from COVID-Twitter-BERT (Müller et al., 2020), RoBERTa (Liu et al., 2019), and Twitter-RoBERTa (Glazkova et al., 2021). Our model obtained F1-scores of 76% on the test set in the evaluation phase, and 77.5% in the post-evaluation phase.

pdf bib
Posterior Differential Regularization with f-divergence for Improving Model Robustness
Hao Cheng | Xiaodong Liu | Lis Pereira | Yaoliang Yu | Jianfeng Gao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We address the problem of enhancing model robustness through regularization. Specifically, we focus on methods that regularize the model posterior difference between clean and noisy inputs. Theoretically, we provide a connection of two recent methods, Jacobian Regularization and Virtual Adversarial Training, under this framework. Additionally, we generalize the posterior differential regularization to the family of f-divergences and characterize the overall framework in terms of the Jacobian matrix. Empirically, we compare those regularizations and standard BERT training on a diverse set of tasks to provide a comprehensive profile of their effect on model generalization. For both fully supervised and semi-supervised settings, we show that regularizing the posterior difference with f-divergence can result in well-improved model robustness. In particular, with a proper f-divergence, a BERT-base model can achieve comparable generalization as its BERT-large counterpart for in-domain, adversarial and domain shift scenarios, indicating the great potential of the proposed framework for enhancing NLP model robustness.

pdf bib
Targeted Adversarial Training for Natural Language Understanding
Lis Pereira | Xiaodong Liu | Hao Cheng | Hoifung Poon | Jianfeng Gao | Ichiro Kobayashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a simple yet effective Targeted Adversarial Training (TAT) algorithm to improve adversarial training for natural language understanding. The key idea is to introspect current mistakes and prioritize adversarial training steps to where the model errs the most. Experiments show that TAT can significantly improve accuracy over standard adversarial training on GLUE and attain new state-of-the-art zero-shot results on XNLI. Our code will be released upon acceptance of the paper.


pdf bib
Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.


pdf bib
Lexical Simplification with the Deep Structured Similarity Model
Lis Pereira | Xiaodong Liu | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification. Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the state-of-the-art on two standard datasets used for the task.


pdf bib
Collocation Assistant for Learners of Japanese as a Second Language
Lis Pereira | Yuji Matsumoto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications


pdf bib
Collocation or Free Combination? — Applying Machine Translation Techniques to identify collocations in Japanese
Lis Pereira | Elga Strafella | Yuji Matsumoto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This work presents an initial investigation on how to distinguish collocations from free combinations. The assumption is that, while free combinations can be literally translated, the overall meaning of collocations is different from the sum of the translation of its parts. Based on that, we verify whether a machine translation system can help us perform such distinction. Results show that it improves the precision compared with standard methods of collocation identification through statistical association measures.

pdf bib
Identifying collocations using cross-lingual association measures
Lis Pereira | Elga Strafella | Kevin Duh | Yuji Matsumoto
Proceedings of the 10th Workshop on Multiword Expressions (MWE)


pdf bib
Automated Collocation Suggestion for Japanese Second Language Learners
Lis Pereira | Erlyn Manguilimotan | Yuji Matsumoto
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop