Hikaru Tomonari

2026

Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks
Masaya Tsunokake | Yuta Koreeda | Terufumi Morishita | Koichi Nagatsuka | Hikaru Tomonari | Yasuhiro Sogawa
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations (micro domains).A previous study shows micro domain-adaptive pre-training (mDAPT) with fewer documents is effective, similarly to DAPT in larger domains.However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown.We aim to reveal the potential and bottlenecks of mDAPT for generative tasks.To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) eliciting facts relevant to questions from an LLM’s own knowledge, (2) reasoning over the facts to obtain conclusions, and (3) composing long-form answers based on the conclusions.We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks.This clarifies mDAPT’s effectiveness in the knowledge aspect and its bottlenecks in other aspects.Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.

2024

pdf bib abs

JFLD: A Japanese Benchmark for Deductive Reasoning Based on Formal Logic
Terufumi Morishita | Atsuki Yamaguchi | Gaku Morio | Hikaru Tomonari | Osamu Imaichi | Yasuhiro Sogawa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) have proficiently solved a broad range of tasks with their rich knowledge but often struggle with logical reasoning. To foster the research on logical reasoning, many benchmarks have been proposed so far. However, most of these benchmarks are limited to English, hindering the evaluation of LLMs specialized for each language. To address this, we propose **JFLD** (**J**apanese **F**ormal **L**ogic **D**eduction), a deductive reasoning benchmark for Japanese. JFLD assess whether LLMs can generate logical steps to (dis-)prove a given hypothesis based on a given set of facts. Its key features are assessing pure logical reasoning abilities isolated from knowledge and assessing various reasoning rules. We evaluate various Japanese LLMs and see that they are still poor at logical reasoning, thus highlighting a substantial need for future research.

2022

pdf bib abs

Robustness Evaluation of Text Classification Models Using Mathematical Optimization and Its Application to Adversarial Training
Hikaru Tomonari | Masaaki Nishino | Akihiro Yamamoto
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Neural networks are known to be vulnerable to adversarial examples due to slightly perturbed input data. In practical applications of neural network models, the robustness of the models against perturbations must be evaluated. However, no method can strictly evaluate their robustness in natural language domains. We therefore propose a method that evaluates the robustness of text classification models using an integer linear programming (ILP) solver by an optimization problem that identifies a minimum synonym swap that changes the classification result. Our method allows us to compare the robustness of various models in realistic time. It can also be used for obtaining adversarial examples. Because of the minimal impact on the altered sentences, adversarial examples with our method obtained high scores in human evaluations of grammatical correctness and semantic similarity for an IMDb dataset. In addition, we implemented adversarial training with the IMDb and SST2 datasets and found that our adversarial training method makes the model robust.

Co-authors

Venues

Fix author