Jun Suzuki - ACL Anthology

Jun Suzuki

2025

Batch-wise Convergent Pre-training: Step-by-Step Learning Inspired by Child Language Development
Ko Yoshida | Daiki Shiono | Kai Sato | Toko Miura | Momoka Furuhashi | Jun Suzuki
Proceedings of the First BabyLM Workshop

Human children acquire language from a substantially smaller amount of linguistic input than that typically required for training large language models (LLMs). This gap motivates the search for more efficient pre-training methods. Inspired by child development, curriculum learning, which progresses from simple to complex data, has been widely adopted. In this study, we propose a pre-training framework that mirrors child language acquisition, advancing step by step from words to sentences while retaining prior knowledge. We investigate whether this improves retention and efficiency under limited resources. Our approach is implemented through four components: (i) a curriculum-aligned dataset, (ii) a batch-wise convergence loop, (iii) a distance-controlled loss to mitigate forgetting, and (iv) a constraint-controlled optimizer for stability. Experiments on the BabyLM benchmark show that the proposed method performs slightly below the official baselines in overall accuracy, with larger gaps on grammar-oriented evaluations such as BLiMP. Nonetheless, it yields small but consistent gains on morphology- and discourse-related tasks (e.g., WUG-ADJ, Entity Tracking), suggesting that the approach affects different linguistic aspects unevenly under limited data conditions.

Can Language Neuron Intervention Reduce Non-Target Language Output?
Suchun Xie | Hwichan Kim | Shota Sasaki | Kosuke Yamada | Jun Suzuki
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Large language models (LLMs) often fail to generate text in the intended target language, particularly in non-English interactions. Concurrently, recent work has explored Language Neuron Intervention (LNI) as a promising technique for steering output language. In this paper, we re-evaluate LNI in more practical scenarios—specifically with instruction-tuned models and prompts that explicitly specify the target language. Our experiments show that while LNI also shows potential in such practical scenarios, its average effect is limited and unstable across models and tasks, with a 0.83% reduction in undesired language output and a 0.1% improvement in performance. Our further analysis identifies two key factors for LNI’s limitation: (1) LNI affects both the output language and the content semantics, making it hard to control one without affecting the other, which explains the weak performance gains. (2) LNI increases the target language token probabilities, but they often remain below the top-1generation threshold, resulting in failure to generate the target language in most cases. Our results highlight both the potential and limitations of LNI, paving the way for future improvements.

MQM-Chat: Multidimensional Quality Metrics for Chat Translation
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Kentaro Inui
Proceedings of the 31st International Conference on Computational Linguistics

The complexities of chats, such as the stylized contents specific to source segments and dialogue consistency, pose significant challenges for machine translation. Recognizing the need for a precise evaluation metric to address the issues associated with chat translation, this study introduces Multidimensional Quality Metrics for Chat Translation (MQM-Chat), which encompasses seven error types, including three specifically designed for chat translations: ambiguity and disambiguation, buzzword or loanword issues, and dialogue inconsistency. In this study, human annotations were applied to the translations of chat data generated by five translation models. Based on the error distribution of MQM-Chat and the performance of relabeling errors into chat-specific types, we concluded that MQM-Chat effectively classified the errors while highlighting chat-specific issues explicitly. The results demonstrate that MQM-Chat can qualify both the lexical accuracy and semantical accuracy of translation models in chat translation tasks.

Evaluating Model Alignment with Human Perception: A Study on Shitsukan in LLMs and LVLMs
Daiki Shiono | Ana Brassard | Yukiko Ishizuki | Jun Suzuki
Proceedings of the 31st International Conference on Computational Linguistics

We evaluate the alignment of large language models (LLMs) and large vision-language models (LVLMs) with human perception, focusing on the Japanese concept of *shitsukan*, which reflects the sensory experience of perceiving objects. We created a dataset of *shitsukan* terms elicited from individuals in response to object images. With it, we designed benchmark tasks for three dimensions of understanding *shitsukan*: (1) accurate perception in object images, (2) commonsense knowledge of typical *shitsukan* terms for objects, and (3) distinction of valid *shitsukan* terms. Models demonstrated mixed accuracy across benchmark tasks, with limited overlap between model- and human-generated terms. However, manual evaluations revealed that the model-generated terms were still natural to humans. This work identifies gaps in culture-specific understanding and contributes to aligning models with human sensory perception. We publicly release the dataset to encourage further research in this area.

Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
Toshiki Kuramoto | Jun Suzuki
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

This paper proposes a method to estimate the performance of pretrained models fine-tuned with a larger dataset from the result with a smaller dataset. Specifically, we demonstrate that when a pretrained model is fine-tuned, its classification performance increases at the same overall rate, regardless of the original dataset size, as the number of epochs increases. Subsequently, we verify that an approximate formula based on this trend can be used to predict the performance when the model is trained with ten times or more training data, even when the initial training dataset is limited. Our results show that this approach can help resource-limited companies develop machine-learning models.

Enhancing Persuasive Dialogue Agents by Synthesizing Cross‐Disciplinary Communication Strategies
Shinnosuke Nozue | Yuto Nakano | Yotaro Watanabe | Meguru Takasaki | Shoji Moriya | Reina Akama | Jun Suzuki
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.

Can Input Attributions Explain Inductive Reasoning in In-Context Learning?
Mengyu Ye | Tatsuki Kuribayashi | Goro Kobayashi | Jun Suzuki
Findings of the Association for Computational Linguistics: ACL 2025

Interpreting the internal process of neural models has long been a challenge. This challenge remains relevant in the era of large language models (LLMs) and in-context learning (ICL); for example, ICL poses a new issue of interpreting which example in the few-shot examples contributed to identifying/solving the task. To this end, in this paper, we design synthetic diagnostic tasks of inductive reasoning, inspired by the generalization tests in linguistics; here, most in-context examples are ambiguous w.r.t. their underlying rule, and one critical example disambiguates the task demonstrated. The question is whether conventional input attribution (IA) methods can track such a reasoning process, i.e., identify the influential example, in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works the best, and the larger the model, the generally harder it is to interpret the ICL with gradient-based IA methods.

STEP: Staged Parameter-Efficient Pre-training for Large Language Models
Kazuki Yano | Takumi Ito | Jun Suzuki
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model weights. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP achieves up to a 53.9% reduction in maximum memory requirements compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that the model by STEP performs comparably to vanilla pre-trained models on downstream tasks after instruction tuning.

How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations
Ikumi Numaya | Shoji Moriya | Shiki Sato | Reina Akama | Jun Suzuki
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent advancements in dialogue generation have broadened the scope of human–bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

KIKIS at WMT 2025 General Translation Task
Koichi Iwakawa | Keito Kudo | Subaru Kimura | Takumi Ito | Jun Suzuki
Proceedings of the Tenth Conference on Machine Translation

We participated in the constrained English–Japanese track of the WMT 2025 General Machine Translation Task.Our system collected the outputs produced by multiple subsystems, each of which consisted of LLM-based translation and reranking models configured differently (e.g., prompting strategies and context sizes), and reranked those outputs.Each subsystem generated multiple segment-level candidates and iteratively selected the most probable one to construct the document translation.We then reranked the document-level outputs from all subsystems to obtain the final translation.For reranking, we adopted a text-based LLM reranking approach with a reasoning model to take long contexts into account.Additionally, we built a bilingual dictionary on the fly from parallel corpora to make the system more robust to rare words.

2024

Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond
Masato Mita | Keisuke Sakaguchi | Masato Hagiwara | Tomoya Mizumoto | Jun Suzuki | Kentaro Inui
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

Natural language processing (NLP) technology has rapidly improved automated grammatical error correction (GEC) tasks, and the GEC community has begun to explore document-level revision. However, there are two major obstacles to going beyond automated sentence-level GEC to NLP-based document-level revision support: (1) there are few public corpora with document-level revisions annotated by professional editors, and (2) it is infeasible to obtain all possible references and evaluate revision quality using such references because there are infinite revision possibilities. To address these challenges, this paper proposes a new document revision corpus, Text Revision of ACL papers (TETRA), in which multiple professional editors have revised academic papers sampled from the ACL anthology. This corpus enables us to focus on document-level and paragraph-level edits, such as edits related to coherence and consistency. Additionally, as a case study using the TETRA corpus, we investigate reference-less and interpretable methods for meta-evaluation to detect quality improvements according to document revisions. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle.

The Impact of Integration Step on Integrated Gradients
Masahiro Makino | Yuya Asazuma | Shota Sasaki | Jun Suzuki
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Integrated Gradients (IG) serve as a potent tool for explaining the internal structure of a language model. The calculation of IG requires numerical integration, wherein the number of steps serves as a critical hyperparameter. The step count can drastically alter the results, inducing considerable errors in interpretability. To scrutinize the effect of step variation on IG, we measured the difference between theoretical and observed IG totals for each step amount.Our findings indicate that the ideal number of steps to maintain minimal error varies from instance to instance. Consequently, we advocate for customizing the step count for each instance. Our study is the first to quantitatively analyze the variation of IG values with the number of steps.

A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems
Shiki Sato | Reina Akama | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: ACL 2024

Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, having access to large contradiction data enables a comprehensive examination of their characteristics. Second, data-driven methods to mitigate contradictions may be enhanced with large-scale contradiction data for training. Nevertheless, no attempt has been made to build an extensive collection of model-generated contradictory responses. In this paper, we build a large dataset of response generation models’ contradictions for the first time. Then, we acquire valuable insights into the characteristics of model-generated contradictions through an extensive analysis of the collected responses. Lastly, we also demonstrate how this dataset substantially enhances the performance of data-driven contradiction suppression methods.

Pruning Multilingual Large Language Models for Multilingual Inference
Hwichan Kim | Jun Suzuki | Tosho Hirasawa | Mamoru Komachi
Findings of the Association for Computational Linguistics: EMNLP 2024

Multilingual large language models (MLLMs), trained on multilingual balanced data, demonstrate better zero-shot learning performance in non-English languages compared to large language models trained on English-dominant data. However, the disparity in performance between English and non-English languages remains a challenge yet to be fully addressed. This study introduces a promising direction for enhancing non-English performance through a specialized pruning approach. Specifically, we prune MLLMs using bilingual sentence pairs from English and other languages and empirically demonstrate that this pruning strategy can enhance the MLLMs’ performance in non-English language.

Detecting Response Generation Not Requiring Factual Judgment
Ryohei Kamei | Daiki Shiono | Reina Akama | Jun Suzuki
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge.However, having all the contents of the response with given knowledge or facts is not necessarily a good thing in dialogues.This study aimed to achieve both attractiveness and factuality in a dialogue response for which a task was set to predict sentences that do not require factual correctness judgment such as agreeing, or personal opinions/feelings.We created a dataset, dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and classification tasks were performed on several models using this dataset.The model with the highest classification accuracy could yield about 88% accurate classification results.

Document-level Translation with LLM Reranking: Team-J at WMT 2024 General Translation Task
Keito Kudo | Hiroyuki Deguchi | Makoto Morishita | Ryo Fujii | Takumi Ito | Shintaro Ozaki | Koki Natsumi | Kai Sato | Kazuki Yano | Ryosuke Takahashi | Subaru Kimura | Tomomasa Hara | Yusuke Sakai | Jun Suzuki
Proceedings of the Ninth Conference on Machine Translation

We participated in the constrained track for English-Japanese and Japanese-Chinese translations at the WMT 2024 General Machine Translation Task. Our approach was to generate a large number of sentence-level translation candidates and select the most probable translation using minimum Bayes risk (MBR) decoding and document-level large language model (LLM) re-ranking. We first generated hundreds of translation candidates from multiple translation models and retained the top 30 candidates using MBR decoding. In addition, we continually pre-trained LLMs on the target language corpora to leverage document-level information. We utilized LLMs to select the most probable sentence sequentially in context from the beginning of the document.

2023

Hunt for Buried Treasures: Extracting Unclaimed Embodiments from Patent Specifications
Chikara Hashimoto | Gautam Kumar | Shuichiro Hashimoto | Jun Suzuki
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Patent applicants write patent specificationsthat describe embodiments of inventions. Some embodiments are claimed for a patent,while others may be unclaimeddue to strategic considerations. Unclaimed embodiments may be extracted byapplicants later and claimed incontinuing applications togain advantages over competitors. Despite being essential for corporate intellectual property (IP) strategies,unclaimed embodiment extraction is conducted manually,and little research has been conducted on its automation. This paper presents a novel task ofunclaimed embodiment extraction (UEE)and a novel dataset for the task. Our experiments with Transformer-based modelsdemonstratedthat the task was challenging as it requiredconducting natural language inference onpatent specifications, which consisted oftechnical, long, syntactically and semanticallyinvolved sentences. We release the dataset and code to foster this new area of research.

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
Keito Kudo | Haruki Nagasawa | Jun Suzuki | Nobuyuki Shimizu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly. This task aims to extract crucial scenes from the video in the form of images (keyframes) and generate corresponding captions explaining each keyframe’s situation. This task is useful as a practical application and presents a highly challenging problem worthy of study. Specifically, achieving simultaneous optimization of the keyframe selection performance and caption quality necessitates careful consideration of the mutual dependence on both preceding and subsequent keyframes and captions. To facilitate subsequent research in this field, we also construct a dataset by expanding upon existing datasets and propose an evaluation framework. Furthermore, we develop two baseline systems and report their respective performance.

Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study on Syllogism
Mengyu Ye | Tatsuki Kuribayashi | Jun Suzuki | Goro Kobayashi | Hiroaki Funayama
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) take advantage of step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their ability to perform CoT-style reasoning robustly is of interest from a probing perspective. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation, which is a core linguistic phenomenon that is difficult to process. In particular, we introduce several controlled settings (e.g., reasoning in case of fictional entities) to evaluate the logical reasoning abilities of the models. We observed that dozens of modern LLMs were not robust against lexical negation (e.g., plausible→implausible) when performing CoT-style reasoning, and the results highlight unique limitations in each LLM family.

B2T Connection: Serving Stability and Performance in Deep Transformers
Sho Takase | Shun Kiyono | Sosuke Kobayashi | Jun Suzuki
Findings of the Association for Computational Linguistics: ACL 2023

In the perspective of a layer normalization (LN) position, the architecture of Transformers can be categorized into two types: Post-LN and Pre-LN.Recent Transformers prefer to select Pre-LN because the training in Post-LN with deep Transformers, e.g., ten or more layers, often becomes unstable, resulting in useless models. However, in contrast, Post-LN has also consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., six or fewer layers. This study first investigates the reason for these discrepant observations empirically and theoretically and discovers 1, the LN in Post-LN is the source of the vanishing gradient problem that mainly leads the unstable training whereas Pre-LN prevents it, and 2, Post-LN tends to preserve larger gradient norms in higher layers during the back-propagation that may lead an effective training. Exploiting the new findings, we propose a method that can equip both higher stability and effective training by a simple modification from Post-LN.We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN, and stable training regardless of the shallow or deep layer settings.

Investigating the Effectiveness of Multiple Expert Models Collaboration
Ikumi Ito | Takumi Ito | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2023

This paper aims to investigate the effectiveness of several machine translation (MT) models and aggregation methods in a multi-domain setting under fair conditions and explore a direction for tackling multi-domain MT. We mainly compare the performance of the single model approach by jointly training all domains and the multi-expert models approach with a particular aggregation strategy. We conduct experiments on multiple domain datasets and demonstrate that a combination of smaller domain expert models can outperform a larger model trained for all domain data.

An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Kentaro Inui
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

SKIM at WMT 2023 General Translation Task
Keito Kudo | Takumi Ito | Makoto Morishita | Jun Suzuki
Proceedings of the Eighth Conference on Machine Translation

The SKIM team’s submission used a standard procedure to build ensemble Transformer models, including base-model training, back-translation of base models for data augmentation, and retraining of several final models using back-translated training data. Each final model had its own architecture and configuration, including up to 10.5B parameters, and substituted self- and cross-sublayers in the decoder with a cross+self-attention sub-layer. We selected the best candidate from a large candidate pool, namely 70 translations generated from 13 distinct models for each sentence, using an MBR reranking method using COMET and COMET-QE. We also applied data augmentation and selection techniques to the training data of the Transformer models.

2022

Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems
Shiki Sato | Yosuke Kishinami | Hiroaki Sugiyama | Reina Akama | Ryoko Tokuhisa | Jun Suzuki
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop

Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the automatic evaluation using the bipartite-play method mitigates these two drawbacks and correlates as strongly with human subjectivity as existing methods.

Scene-Text Aware Image and Text Retrieval with Dual-Encoder
Shumpei Miyawaki | Taku Hasegawa | Kyosuke Nishida | Takuma Kato | Jun Suzuki
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We tackle the tasks of image and text retrieval using a dual-encoder model in which images and text are encoded independently. This model has attracted attention as an approach that enables efficient offline inferences by connecting both vision and language in the same semantic space; however, whether an image encoder as part of a dual-encoder model can interpret scene-text (i.e., the textual information in images) is unclear. We propose pre-training methods that encourage a joint understanding of the scene-text and surrounding visual information. The experimental results demonstrate that our methods improve the retrieval performances of the dual-encoder models.

Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model
Sosuke Kobayashi | Shun Kiyono | Jun Suzuki | Kentaro Inui
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks and their ensemble outperformed the standard ensemble in some tasks when accurate lottery tickets are found on the tasks.

Target-Guided Open-Domain Conversation Planning
Yosuke Kishinami | Reina Akama | Shiki Sato | Ryoko Tokuhisa | Jun Suzuki | Kentaro Inui
Proceedings of the 29th International Conference on Computational Linguistics

Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the task of Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. Using the TGCP task, we investigate the conversation planning abilities of existing retrieval models and recent strong generative models. The experimental results reveal the challenges facing current technology.

Domain Adaptation of Machine Translation with Crowdworkers
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Although a machine translation model trained with a large in-domain parallel corpus achieves remarkable results, it still works poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain’s data are limited. However, there is great demand for high-quality domain-specific machine translation models for many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers.With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data over a few days at a reasonable cost. We tested it with five domains, and the domain-adapted model improved the BLEU scores to +19.7 by an average of +7.8 points compared to a general-purpose translation model.

Chat Translation Error Detection for Assisting Cross-lingual Communications
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Ryoko Tokuhisa | Ana Brassard | Kentaro Inui
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus
Makoto Morishita | Katsuki Chousa | Jun Suzuki | Masaaki Nagata
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Most current machine translation models are mainly trained with parallel corpora, and their translation accuracy largely depends on the quality and quantity of the corpora. Although there are billions of parallel sentences for a few language pairs, effectively dealing with most language pairs is difficult due to a lack of publicly available parallel corpora. This paper creates a large parallel corpus for English-Japanese, a language pair for which only limited resources are available, compared to such resource-rich languages as English-German. It introduces a new web-based English-Japanese parallel corpus named JParaCrawl v3.0. Our new corpus contains more than 21 million unique parallel sentence pairs, which is more than twice as many as the previous JParaCrawl v2.0 corpus. Through experiments, we empirically show how our new corpus boosts the accuracy of machine translation models on various domains. The JParaCrawl v3.0 corpus will eventually be publicly available online for research purposes.

N-best Response-based Analysis of Contradiction-awareness in Neural Response Generation Models
Shiki Sato | Reina Akama | Hiroki Ouchi | Ryoko Tokuhisa | Jun Suzuki | Kentaro Inui
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen from this n-best list. This study quantitatively analyzes the contextual contradiction-awareness of neural response generation models using the consistency of the n-best lists. Particularly, we used polar questions as stimulus inputs for concise and quantitative analyses. Our tests illustrate the contradiction-awareness of recent neural response generation models and methodologies, followed by a discussion of their properties and limitations.

NT5 at WMT 2022 General Translation Task
Makoto Morishita | Keito Kudo | Yui Oka | Katsuki Chousa | Shun Kiyono | Sho Takase | Jun Suzuki
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the NTT-Tohoku-TokyoTech-RIKEN (NT5) team’s submission system for the WMT’22 general translation task. This year, we focused on the English-to-Japanese and Japanese-to-English translation tracks. Our submission system consists of an ensemble of Transformer models with several extensions. We also applied data augmentation and selection techniques to obtain potentially effective training data for training individual Transformer models in the pre-training and fine-tuning scheme. Additionally, we report our trial of incorporating a reranking module and the reevaluated results of several techniques that have been recently developed and published.

2021

Context-aware Neural Machine Translation with Mini-batch Embedding
Makoto Morishita | Jun Suzuki | Tomoharu Iwata | Masaaki Nagata
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

It is crucial to provide an inter-sentence context in Neural Machine Translation (NMT) models for higher-quality translation. With the aim of using a simple approach to incorporate inter-sentence information, we propose mini-batch embedding (MBE) as a way to represent the features of sentences in a mini-batch. We construct a mini-batch by choosing sentences from the same document, and thus the MBE is expected to have contextual information across sentences. Here, we incorporate MBE in an NMT model, and our experiments show that the proposed method consistently outperforms the translation capabilities of strong baselines and improves writing style or terminology to fit the document’s context.

SHAPE: Shifted Absolute Position Embedding for Transformers
Shun Kiyono | Sosuke Kobayashi | Jun Suzuki | Kentaro Inui
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster.

Instance-Based Neural Dependency Parsing
Hiroki Ouchi | Jun Suzuki | Sosuke Kobayashi | Sho Yokoi | Tatsuki Kuribayashi | Masashi Yoshikawa | Kentaro Inui
Transactions of the Association for Computational Linguistics, Volume 9

Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations.

2020

Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese
Tatsuki Kuribayashi | Takumi Ito | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We examine a methodology using neural language models (LMs) for analyzing the word order of language. This LM-based method has the potential to overcome the difficulties existing methods face, such as the propagation of preprocessor errors in count-based methods. In this study, we explore whether the LM-based method is valid for analyzing the word order. As a case study, this study focuses on Japanese due to its complex and flexible word order. To validate the LM-based method, we test (i) parallels between LMs and human word order preference, and (ii) consistency of the results obtained using the LM-based method with previous linguistic studies. Through our experiments, we tentatively conclude that LMs display sufficient word order knowledge for usage as an analysis tool. Finally, using the LM-based method, we demonstrate the relationship between the canonical word order and topicalization, which had yet to be analyzed by large-scale experiments.

Evaluating Dialogue Generation Systems via Response Selection
Shiki Sato | Reina Akama | Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets filtering out some types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test set developed by our method correlates more strongly with human evaluation, compared with widely used automatic evaluation metrics such as BLEU.

Single Model Ensemble using Pseudo-Tags and Distinct Vectors
Ryosuke Kuwabara | Jun Suzuki | Hideki Nakayama
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Model ensemble techniques often increase task performance in neural networks; however, they require increased time, memory, and management effort. In this study, we propose a novel method that replicates the effects of a model ensemble with a single model. Our approach creates K-virtual models within a single parameter space using K-distinct pseudo-tags and K-distinct vectors. Experiments on text classification and sequence labeling tasks on several datasets demonstrate that our method emulates or outperforms a traditional model ensemble with 1/K-times fewer parameters.

Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction
Masahiro Kaneko | Masato Mita | Shun Kiyono | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect because the previous common methods for incorporating a MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can be considerably different (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed in the previous methods. Our experiments show that our proposed method, where we first fine-tune a MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performances on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: https://github.com/kanekomasahiro/bert-gec.

Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition
Hiroki Ouchi | Jun Suzuki | Sosuke Kobayashi | Sho Yokoi | Tatsuki Kuribayashi | Ryuto Konno | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models possessing interpretable inference process for structured prediction. Specifically, we present a method of instance-based learning that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, where it is easy to understand how much each training instance contributes to the predictions. Through empirical analysis on named entity recognition, we demonstrate that our method enables to build models that have high interpretability without sacrificing performance.

Embeddings of Label Components for Sequence Labeling: A Case Study of Fine-grained Named Entity Recognition
Takuma Kato | Kaori Abe | Hiroki Ouchi | Shumpei Miyawaki | Jun Suzuki | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In general, the labels used in sequence labeling consist of different types of elements. For example, IOB-format entity labels, such as B-Person and I-Person, can be decomposed into span (B and I) and type information (Person). However, while most sequence labeling models do not consider such label components, the shared components across labels, such as Person, can be beneficial for label prediction. In this work, we propose to integrate label component information as embeddings into models. Through experiments on English and Japanese fine-grained named entity recognition, we demonstrate that the proposed method improves performance, especially for instances with low-frequency labels.

Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation
Hiroaki Funayama | Shota Sasaki | Yuichiroh Matsubayashi | Tomoya Mizumoto | Jun Suzuki | Masato Mita | Kentaro Inui
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Many recent Short Answer Scoring (SAS) systems have employed Quadratic Weighted Kappa (QWK) as the evaluation measure of their systems. However, we hypothesize that QWK is unsatisfactory for the evaluation of the SAS systems when we consider measuring their effectiveness in actual usage. We introduce a new task formulation of SAS that matches the actual usage. In our formulation, the SAS systems should extract as many scoring predictions that are not critical scoring errors (CSEs). We conduct the experiments in our new task formulation and demonstrate that a typical SAS system can predict scores with zero CSE for approximately 50% of test data at maximum by filtering out low-reliablility predictions on the basis of a certain confidence estimation. This result directly indicates the possibility of reducing half the scoring cost of human raters, which is more preferable for the evaluation of SAS systems.

PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
Ryo Fujii | Masato Mita | Kaori Abe | Kazuaki Hanawa | Makoto Morishita | Jun Suzuki | Kentaro Inui
Proceedings of the 28th International Conference on Computational Linguistics

Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.

Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness
Reina Akama | Sho Yokoi | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large-scale dialogue datasets have recently become available for training neural dialogue agents. However, these datasets have been reported to contain a non-negligible number of unacceptable utterance pairs. In this paper, we propose a method for scoring the quality of utterance pairs in terms of their connectivity and relatedness. The proposed scoring method is designed based on findings widely shared in the dialogue and linguistics research communities. We demonstrate that it has a relatively good correlation with the human judgment of dialogue quality. Furthermore, the method is applied to filter out potentially unacceptable utterance pairs from a large-scale noisy dialogue corpus to ensure its quality. We experimentally confirm that training data filtered by the proposed method improves the quality of neural dialogue agents in response generation.

Word Rotator’s Distance
Sho Yokoi | Ryo Takahashi | Reina Akama | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

One key principle for assessing textual similarity is measuring the degree of semantic overlap between texts by considering the word alignment. Such alignment-based approaches are both intuitive and interpretable; however, they are empirically inferior to the simple cosine similarity between general-purpose sentence vectors. We focus on the fact that the norm of word vectors is a good proxy for word importance, and the angle of them is a good proxy for word similarity. However, alignment-based approaches do not distinguish the norm and direction, whereas sentence-vector approaches automatically use the norm as the word importance. Accordingly, we propose decoupling word vectors into their norm and direction then computing the alignment-based similarity with the help of earth mover’s distance (optimal transport), which we refer to as word rotator’s distance. Furthermore, we demonstrate how to grow the norm and direction of word vectors (vector converter); this is a new systematic approach derived from the sentence-vector estimation methods, which can significantly improve the performance of the proposed method. On several STS benchmarks, the proposed methods outperform not only alignment-based approaches but also strong baselines. The source code is avaliable at https://github.com/eumesy/wrd

Langsmith: An Interactive Academic Text Revision System
Takumi Ito | Tatsuki Kuribayashi | Masatoshi Hidaka | Jun Suzuki | Kentaro Inui
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Despite the current diversity and inclusion initiatives in the academic community, researchers with a non-native command of English still face significant obstacles when writing papers in English. This paper presents the Langsmith editor, which assists inexperienced, non-native researchers to write English papers, especially in the natural language processing (NLP) field. Our system can suggest fluent, academic-style sentences to writers based on their rough, incomplete phrases or sentences. The system also encourages interaction between human writers and the computerized revision system. The experimental results demonstrated that Langsmith helps non-native English-speaker students write papers in English. The system is available at https://emnlp-demo.editor.langsmith.co.jp/.

A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction
Masato Mita | Shun Kiyono | Masahiro Kaneko | Jun Suzuki | Kentaro Inui
Findings of the Association for Computational Linguistics: EMNLP 2020

Existing approaches for grammatical error correction (GEC) largely rely on supervised learning with manually created GEC datasets. However, there has been little focus on verifying and ensuring the quality of the datasets, and on how lower-quality data might affect GEC performance. We indeed found that there is a non-negligible amount of “noise” where errors were inappropriately edited or left uncorrected. To address this, we designed a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models, and outperformed strong denoising baseline methods. We further applied task-specific techniques and achieved state-of-the-art performance on the CoNLL-2014, JFLEG, and BEA-2019 benchmarks. We then analyzed the effect of the proposed denoising method, and found that our approach leads to improved coverage of corrections and facilitated fluency edits which are reflected in higher recall and overall performance.

Seeing the World through Text: Evaluating Image Descriptions for Commonsense Reasoning in Machine Reading Comprehension
Diana Galvan-Sosa | Jun Suzuki | Kyosuke Nishida | Koji Matsuda | Kentaro Inui
Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Despite recent achievements in natural language understanding, reasoning over commonsense knowledge still represents a big challenge to AI systems. As the name suggests, common sense is related to perception and as such, humans derive it from experience rather than from literary education. Recent works in the NLP and the computer vision field have made the effort of making such knowledge explicit using written language and visual inputs, respectively. Our premise is that the latter source fits better with the characteristics of commonsense acquisition. In this work, we explore to what extent the descriptions of real-world scenes are sufficient to learn common sense about different daily situations, drawing upon visual information to answer script knowledge questions.

JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approaches achieved or surpassed performance comparable to model training from the initial state and reduced the training time. Additionally, we trained the model with an in-domain dataset and JParaCrawl to show how we achieved the best performance with them. JParaCrawl and the pre-trained models are freely available online for research purposes.

Efficient Estimation of Influence of a Training Instance
Sosuke Kobayashi | Sho Yokoi | Jun Suzuki | Kentaro Inui
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model’s prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization.

Tohoku-AIP-NTT at WMT 2020 News Translation Task
Shun Kiyono | Takumi Ito | Ryuto Konno | Makoto Morishita | Jun Suzuki
Proceedings of the Fifth Conference on Machine Translation

In this paper, we describe the submission of Tohoku-AIP-NTT to the WMT’20 news translation task. We participated in this task in two language pairs and four language directions: English <–> German and English <–> Japanese. Our system consists of techniques such as back-translation and fine-tuning, which are already widely adopted in translation tasks. We attempted to develop new methods for both synthetic data filtering and reranking. However, the methods turned out to be ineffective, and they provided us with no significant improvement over the baseline. We analyze these negative results to provide insights for future studies.

2019

ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper
Hirofumi Inaguma | Shun Kiyono | Nelson Enrique Yalta Soplin | Jun Suzuki | Kevin Duh | Shinji Watanabe
Proceedings of the 16th International Conference on Spoken Language Translation

This paper describes the ESPnet submissions to the How2 Speech Translation task at IWSLT2019. In this year, we mainly build our systems based on Transformer architectures in all tasks and focus on the end-to-end speech translation (E2E-ST). We first compare RNN-based models and Transformer, and then confirm Transformer models significantly and consistently outperform RNN models in all tasks and corpora. Next, we investigate pre-training of E2E-ST models with the ASR and MT tasks. On top of the pre-training, we further explore knowledge distillation from the NMT model and the deeper speech encoder, and confirm drastic improvements over the baseline model. All of our codes are publicly available in ESPnet.

Select and Attend: Towards Controllable Content Selection in Text Generation
Xiaoyu Shen | Jun Suzuki | Kentaro Inui | Hui Su | Dietrich Klakow | Satoshi Sekine
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the decoder. The decoupled content selection is human interpretable, whose value can be manually manipulated to control the content of generated text. The model can be trained end-to-end without human annotations by maximizing a lower bound of the marginal likelihood. We further propose an effective way to trade-off between performance and controllability with a single adjustable hyperparameter. In both data-to-text and headline generation tasks, our model achieves promising results, paving the way for controllable content selection in text generation.

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Shun Kiyono | Jun Suzuki | Masato Mita | Tomoya Mizumoto | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performance is achieved on the CoNLL-2014 test set (F0.5=65.0) and the official test set of the BEA-2019 shared task (F0.5=70.2) without making any modifications to the model architecture.

Transductive Learning of Neural Language Models for Syntactic and Semantic Analysis
Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In transductive learning, an unlabeled test set is used for model training. Although this setting deviates from the common assumption of a completely unseen test set, it is applicable in many real-world scenarios, wherein the texts to be processed are known in advance. However, despite its practical advantages, transductive learning is underexplored in natural language processing. Here we conduct an empirical study of transductive learning for neural models and demonstrate its utility in syntactic and semantic tasks. Specifically, we fine-tune language models (LMs) on an unlabeled test set to obtain test-set-specific word representations. Through extensive experiments, we demonstrate that despite its simplicity, transductive LM fine-tuning consistently improves state-of-the-art neural models in in-domain and out-of-domain settings.

TEASPN: Framework and Protocol for Integrated Writing Assistance Environments
Masato Hagiwara | Takumi Ito | Tatsuki Kuribayashi | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Language technologies play a key role in assisting people with their writing. Although there has been steady progress in e.g., grammatical error correction (GEC), human writers are yet to benefit from this progress due to the high development cost of integrating with writing software. We propose TEASPN, a protocol and an open-source framework for achieving integrated writing assistance environments. The protocol standardizes the way writing software communicates with servers that implement such technologies, allowing developers and researchers to integrate the latest developments in natural language processing (NLP) with low cost. As a result, users can enjoy the integrated experience in their favorite writing software. The results from experiments with human participants show that users use a wide range of technologies and rate their writing experience favorably, allowing them to write more fluent text.

NTT Neural Machine Translation Systems at WAT 2019
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 6th Workshop on Asian Translation

In this paper, we describe our systems that were submitted to the translation shared tasks at WAT 2019. This year, we participated in two distinct types of subtasks, a scientific paper subtask and a timely disclosure subtask, where we only considered English-to-Japanese and Japanese-to-English translation directions. We submitted two systems (En-Ja and Ja-En) for the scientific paper subtask and two systems (Ja-En, texts, items) for the timely disclosure subtask. Three of our four systems obtained the best human evaluation performances. We also confirmed that our new additional web-crawled parallel corpus improves the performance in unconstrained settings.

Subword-based Compact Reconstruction of Word Embeddings
Shota Sasaki | Jun Suzuki | Kentaro Inui
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The idea of subword-based word embeddings has been proposed in the literature, mainly for solving the out-of-vocabulary (OOV) word problem observed in standard word-based word embeddings. In this paper, we propose a method of reconstructing pre-trained word embeddings using subword information that can effectively represent a large number of subword embeddings in a considerably small fixed space. The key techniques of our method are twofold: memory-shared embeddings and a variant of the key-value-query self-attention mechanism. Our experiments show that our reconstructed subword-based embeddings can successfully imitate well-trained word embeddings in a small fixed space while preventing quality degradation across several linguistic benchmark datasets, and can simultaneously predict effective embeddings of OOV words. We also demonstrate the effectiveness of our reconstruction method when we apply them to downstream tasks.

Effective Adversarial Regularization for Neural Machine Translation
Motoki Sato | Jun Suzuki | Shun Kiyono
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A regularization technique based on adversarial perturbation, which was initially developed in the field of image processing, has been successfully applied to text classification tasks and has yielded attractive improvements. We aim to further leverage this promising methodology into more sophisticated and critical neural models in the natural language processing field, i.e., neural machine translation (NMT) models. However, it is not trivial to apply this methodology to such models. Thus, this paper investigates the effectiveness of several possible configurations of applying the adversarial perturbation and reveals that the adversarial regularization technique can significantly and consistently improve the performance of widely used NMT models, such as LSTM-based and Transformer-based models.

An Empirical Study of Span Representations in Argumentation Structure Parsing
Tatsuki Kuribayashi | Hiroki Ouchi | Naoya Inoue | Paul Reisert | Toshinori Miyoshi | Jun Suzuki | Kentaro Inui
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

For several natural language processing (NLP) tasks, span representation design is attracting considerable attention as a promising new technique; a common basis for an effective design has been established. With such basis, exploring task-dependent extensions for argumentation structure parsing (ASP) becomes an interesting research direction. This study investigates (i) span representation originally developed for other NLP tasks and (ii) a simple task-dependent extension for ASP. Our extensive experiments and analysis show that these representations yield high performance for ASP and provide some challenging types of instances to be parsed.

The Sally Smedley Hyperpartisan News Detector at SemEval-2019 Task 4
Kazuaki Hanawa | Shota Sasaki | Hiroki Ouchi | Jun Suzuki | Kentaro Inui
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system submitted to the formal run of SemEval-2019 Task 4: Hyperpartisan news detection. Our system is based on a linear classifier using several features, i.e., 1) embedding features based on the pre-trained BERT embeddings, 2) article length features, and 3) embedding features of informative phrases extracted from by-publisher dataset. Our system achieved 80.9% accuracy on the test set for the formal run and got the 3rd place out of 42 teams.

Annotating with Pros and Cons of Technologies in Computer Science Papers
Hono Shirai | Naoya Inoue | Jun Suzuki | Kentaro Inui
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

This paper explores a task for extracting a technological expression and its pros/cons from computer science papers. We report ongoing efforts on an annotated corpus of pros/cons and an analysis of the nature of the automatic extraction task. Specifically, we show how to adapt the targeted sentiment analysis task for pros/cons extraction in computer science papers and conduct an annotation study. In order to identify the challenges of the automatic extraction task, we construct a strong baseline model and conduct an error analysis. The experiments show that pros/cons can be consistently annotated by several annotators, and that the task is challenging due to domain-specific knowledge. The annotated dataset is made publicly available for research purposes.

The AIP-Tohoku System at the BEA-2019 Shared Task
Hiroki Asano | Masato Mita | Tomoya Mizumoto | Jun Suzuki
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We introduce the AIP-Tohoku grammatical error correction (GEC) system for the BEA-2019 shared task in Track 1 (Restricted Track) and Track 2 (Unrestricted Track) using the same system architecture. Our system comprises two key components: error generation and sentence-level error detection. In particular, GEC with sentence-level grammatical error detection is a novel and versatile approach, and we experimentally demonstrate that it significantly improves the precision of the base model. Our system is ranked 9th in Track 1 and 2nd in Track 2.

Diamonds in the Rough: Generating Fluent Sentences from Early-Stage Drafts for Academic Writing Assistance
Takumi Ito | Tatsuki Kuribayashi | Hayato Kobayashi | Ana Brassard | Masato Hagiwara | Jun Suzuki | Kentaro Inui
Proceedings of the 12th International Conference on Natural Language Generation

The writing process consists of several stages such as drafting, revising, editing, and proofreading. Studies on writing assistance, such as grammatical error correction (GEC), have mainly focused on sentence editing and proofreading, where surface-level issues such as typographical errors, spelling errors, or grammatical errors should be corrected. We broaden this focus to include the earlier revising stage, where sentences require adjustment to the information included or major rewriting and propose Sentence-level Revision (SentRev) as a new writing assistance task. Well-performing systems in this task can help inexperienced authors by producing fluent, complete sentences given their rough, incomplete drafts. We build a new freely available crowdsourced evaluation dataset consisting of incomplete sentences authored by non-native writers paired with their final versions extracted from published academic papers for developing and evaluating SentRev models. We also establish baseline performance on SentRev using our newly built evaluation dataset.

2018

Improving Neural Machine Translation by Incorporating Hierarchical Subword Features
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 27th International Conference on Computational Linguistics

This paper focuses on subword-based Neural Machine Translation (NMT). We hypothesize that in the NMT model, the appropriate subword units for the following three modules (layers) can differ: (1) the encoder embedding layer, (2) the decoder embedding layer, and (3) the decoder output layer. We find the subword based on Sennrich et al. (2016) has a feature that a large vocabulary is a superset of a small vocabulary and modify the NMT model enables the incorporation of several different subword units in a single embedding layer. We refer these small subword features as hierarchical subword features. To empirically investigate our assumption, we compare the performance of several different subword units and hierarchical subword features for both the encoder and decoder embedding layers. We confirmed that incorporating hierarchical subword features in the encoder consistently improves BLEU scores on the IWSLT evaluation datasets.

Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions
Sho Yokoi | Sosuke Kobayashi | Kenji Fukumizu | Jun Suzuki | Kentaro Inui
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). As well as deriving PMI from mutual information, we derive this new measure from the Hilbert–Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than an RNN-based PMI while outperforming PMI in accuracy. In addition, we also demonstrate that PHSIC is beneficial as a criterion of a data selection task for machine translation owing to its ability to give high (low) scores to a consistent (inconsistent) pair with other pairs.

Direct Output Connection for a High-Rank Language Model
Sho Takase | Jun Suzuki | Masaaki Nagata
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from a final RNN layer but also middle layers. This method raises the expressive power of a language model based on the matrix factorization interpretation of language modeling introduced by Yang et al. (2018). Our proposed method improves the current state-of-the-art language model and achieves the best score on the Penn Treebank and WikiText-2, which are the standard benchmark datasets. Moreover, we indicate our proposed method contributes to application tasks: machine translation and headline generation.

An Empirical Study of Building a Strong Baseline for Constituency Parsing
Jun Suzuki | Sho Takase | Hidetaka Kamigaito | Makoto Morishita | Masaaki Nagata
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper investigates the construction of a strong baseline based on general purpose sequence-to-sequence models for constituency parsing. We incorporate several techniques that were mainly developed in natural language generation tasks, e.g., machine translation and summarization, and demonstrate that the sequence-to-sequence model achieves the current top-notch parsers’ performance (almost) without requiring any explicit task-specific knowledge or architecture of constituent parsing.

Unsupervised Token-wise Alignment to Improve Interpretation of Encoder-Decoder Models
Shun Kiyono | Sho Takase | Jun Suzuki | Naoaki Okazaki | Kentaro Inui | Masaaki Nagata
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Developing a method for understanding the inner workings of black-box neural methods is an important research endeavor. Conventionally, many studies have used an attention matrix to interpret how Encoder-Decoder-based models translate a given source sentence to the corresponding target sentence. However, recent studies have empirically revealed that an attention matrix is not optimal for token-wise translation analyses. We propose a method that explicitly models the token-wise alignment between the source and target sequences to provide a better analysis. Experiments show that our method can acquire token-wise alignments that are superior to those of an attention mechanism.

NTT’s Neural Machine Translation Systems for WMT 2018
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes NTT’s neural machine translation systems submitted to the WMT 2018 English-German and German-English news translation tasks. Our submission has three main components: the Transformer model, corpus cleaning, and right-to-left n-best re-ranking techniques. Through our experiments, we identified two keys for improving accuracy: filtering noisy training sentences and right-to-left re-ranking. We also found that the Transformer model requires more training data than the RNN-based model, and the RNN-based model sometimes achieves better accuracy than the Transformer model when the corpus is small.

Reducing Odd Generation from Neural Headline Generation
Shun Kiyono | Sho Takase | Jun Suzuki | Naoaki Okazaki | Kentaro Inui | Masaaki Nagata
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

Enumeration of Extractive Oracle Summaries
Tsutomu Hirao | Masaaki Nishino | Jun Suzuki | Masaaki Nagata
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

To analyze the limitations and the future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries to exploit F-measures that evaluate which system summaries contain how many sentences that are extracted as an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

Cutting-off Redundant Repeating Generations for Neural Abstractive Summarization
Jun Suzuki | Masaaki Nagata
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper tackles the reduction of redundant repeating generation that is often observed in RNN-based encoder-decoder models. Our basic idea is to jointly estimate the upper-bound frequency of each target vocabulary in the encoder and control the output words based on the estimation in the decoder. Our method shows significant improvement over a strong RNN-based encoder-decoder baseline and achieved its best results on an abstractive summarization benchmark.

Input-to-Output Gate to Improve RNN Language Models
Sho Takase | Jun Suzuki | Masaaki Nagata
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

This paper proposes a reinforcing method that refines the output layers of existing Recurrent Neural Network (RNN) language models. We refer to our proposed method as Input-to-Output Gate (IOG). IOG has an extremely simple structure, and thus, can be easily combined with any RNN language models. Our experiments on the Penn Treebank and WikiText-2 datasets demonstrate that IOG consistently boosts the performance of several different types of current topline RNN language models.

Improving Neural Text Normalization with Data Augmentation at Character- and Morphological Levels
Itsumi Saito | Jun Suzuki | Kyosuke Nishida | Kugatsu Sadamitsu | Satoshi Kobashikawa | Ryo Masumura | Yuji Matsumoto | Junji Tomita
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this study, we investigated the effectiveness of augmented data for encoder-decoder-based neural normalization models. Attention based encoder-decoder models are greatly effective in generating many natural languages. % such as machine translation or machine summarization. In general, we have to prepare for a large amount of training data to train an encoder-decoder model. Unlike machine translation, there are few training data for text-normalization tasks. In this paper, we propose two methods for generating augmented data. The experimental results with Japanese dialect normalization indicate that our methods are effective for an encoder-decoder model and achieve higher BLEU score than that of baselines. We also investigated the oracle performance and revealed that there is sufficient room for improving an encoder-decoder model.

NTT Neural Machine Translation Systems at WAT 2017
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

In this year, we participated in four translation subtasks at WAT 2017. Our model structure is quite simple but we used it with well-tuned hyper-parameters, leading to a significant improvement compared to the previous state-of-the-art system. We also tried to make use of the unreliable part of the provided parallel corpus by back-translating and making a synthetic corpus. Our submitted system achieved the new state-of-the-art performance in terms of the BLEU score, as well as human evaluation.

2016

Neural Headline Generation on Abstract Meaning Representation
Sho Takase | Jun Suzuki | Naoaki Okazaki | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Right-truncatable Neural Word Embeddings
Jun Suzuki | Masaaki Nagata
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Phrase Table Pruning via Submodular Function Maximization
Masaaki Nishino | Jun Suzuki | Masaaki Nagata
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

A Unified Learning Framework of Skip-Grams and Global Vectors
Jun Suzuki | Masaaki Nagata
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Dependency-based Discourse Parser for Single-Document Summarization
Yasuhisa Yoshida | Jun Suzuki | Tsutomu Hirao | Masaaki Nagata
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Shift-Reduce Word Reordering for Machine Translation
Katsuhiko Hayashi | Katsuhito Sudoh | Hajime Tsukada | Jun Suzuki | Masaaki Nagata
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Supervised Model Learning with Feature Grouping based on a Discrete Constraint
Jun Suzuki | Masaaki Nagata
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Distributed Minimum Error Rate Training of SMT using Particle Swarm Optimization
Jun Suzuki | Kevin Duh | Masaaki Nagata
Proceedings of 5th International Joint Conference on Natural Language Processing

Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning
Jun Suzuki | Hideki Isozaki | Masaaki Nagata
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2009

An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing
Jun Suzuki | Hideki Isozaki | Xavier Carreras | Michael Collins
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

A Syntax-Free Approach to Japanese Sentence Compression
Tsutomu Hirao | Jun Suzuki | Hideki Isozaki
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

NTT statistical machine translation system for IWSLT 2008.
Katsuhito Sudoh | Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

The NTT Statistical Machine Translation System consists of two primary components: a statistical machine translation decoder and a reranker. The decoder generates k-best translation canditates using a hierarchical phrase-based translation based on synchronous context-free grammar. The decoder employs a linear feature combination among several real-valued scores on translation and language models. The reranker reorders the k-best translation candidates using Ranking SVMs with a large number of sparse features. This paper describes the two components and presents the results for the evaluation campaign of IWSLT 2008.

Multi-label Text Categorization with Model Combination based on F1-score Maximization
Akinori Fujino | Hideki Isozaki | Jun Suzuki
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data
Jun Suzuki | Hideki Isozaki
Proceedings of ACL-08: HLT

2007

Larger feature set approach for machine translation in IWSLT 2007
Taro Watanabe | Jun Suzuki | Katsuhito Sudoh | Hajime Tsukada | Hideki Isozaki
Proceedings of the Fourth International Workshop on Spoken Language Translation

The NTT Statistical Machine Translation System employs a large number of feature functions. First, k-best translation candidates are generated by an efficient decoding method of hierarchical phrase-based translation. Second, the k-best translations are reranked. In both steps, sparse binary features — of the order of millions — are integrated during the search. This paper gives the details of the two steps and shows the results for the Evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2007.

Online Large-Margin Training for Statistical Machine Translation
Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Semi-Supervised Structured Output Learning Based on a Hybrid Generative and Discriminative Approach
Jun Suzuki | Akinori Fujino | Hideki Isozaki
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

NTT statistical machine translation for IWSLT 2006
Taro Watanabe | Jun Suzuki | Hajime Tsukada | Hideki Isozaki
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

Training Conditional Random Fields with Multivariate Evaluation Measures
Jun Suzuki | Erik McDermott | Hideki Isozaki
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

The NTT Statistical Machine Translation System for IWSLT2005
Hajime Tsukada | Taro Watanabe | Jun Suzuki | Hideto Kazawa | Hideki Isozaki
Proceedings of the Second International Workshop on Spoken Language Translation

Boosting-based Parse Reranking with Subtree Features
Taku Kudo | Jun Suzuki | Hideki Isozaki
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Dependency-based Sentence Alignment for Multiple Document Summarization
Tsutomu Hirao | Jun Suzuki | Hideki Isozaki | Eisaku Maeda
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Convolution Kernels with Feature Selection for Natural Language Processing Tasks
Jun Suzuki | Hideki Isozaki | Eisaku Maeda
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

Hierarchical Directed Acyclic Graph Kernel: Methods for Structured Natural Language Data
Jun Suzuki | Tsutomu Hirao | Yutaka Sasaki | Eisaku Maeda
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Question Classification using HDAG Kernel
Jun Suzuki | Hirotoshi Taira | Yutaka Sasaki | Eisaku Maeda
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

2002

SVM Answer Selection for Open-Domain Question Answering
Jun Suzuki | Yutaka Sasaki | Eisaku Maeda
COLING 2002: The 19th International Conference on Computational Linguistics

Co-authors

Tatsuki Kuribayashi 9

Sosuke Kobayashi 7

Tsutomu Hirao 6

Hajime Tsukada 6

Taro Watanabe 5

Tomoya Mizumoto 4

Ryoko Tokuhisa 4

Masato Hagiwara 3

Kyosuke Nishida 3

Naoaki Okazaki 3

Yutaka Sasaki 3

Katsuhito Sudoh 3

Katsuki Chousa 2

Akinori Fujino 2

Hiroaki Funayama 2

Kazuaki Hanawa 2

Masahiro Kaneko 2

Subaru Kimura 2

Yosuke Kishinami 2

Goro Kobayashi 2

Shumpei Miyawaki 2

Masaaki Nishino 2

Eleftherios Avramidis 1

Rachel Bawden 1

Ondřej Bojar 1

Xavier Carreras 1

Michael Collins 1

Hiroyuki Deguchi 1

Anton Dvorkovich 1

Christian Federmann 1

Markus Freitag 1

Kenji Fukumizu 1

Momoka Furuhashi 1

Diana Galván-Sosa 1

Roman Grundkiewicz 1

Tomomasa Hara 1

Taku Hasegawa 1

Chikara Hashimoto 1

Shuichiro Hashimoto 1

Katsuhiko Hayashi 1

Masatoshi Hidaka 1

Tosho Hirasawa 1

Hirofumi Inaguma 1

Yukiko Ishizuki 1

Koichi Iwakawa 1

Tomoharu Iwata 1

Hidetaka Kamigaito 1

Hideto Kazawa 1

Dietrich Klakow 1

Satoshi Kobashikawa 1

Hayato Kobayashi 1

Philipp Koehn 1

Mamoru Komachi 1

Toshiki Kuramoto 1

Ryosuke Kuwabara 1

Masahiro Makino 1

Benjamin Marie 1

Yuichiroh Matsubayashi 1

Yuji Matsumoto 1

Erik McDermott 1

Toshinori Miyoshi 1

Christof Monz 1

Kenton Murray 1

Haruki Nagasawa 1

Hideki Nakayama 1

Toshiaki Nakazawa 1

Shinnosuke Nozue 1

Shintaro Ozaki 1

Maja Popović 1

Kugatsu Sadamitsu 1

Keisuke Sakaguchi 1

Satoshi Sekine 1

Nobuyuki Shimizu 1

Mariya Shmatova 1

Nelson Enrique Yalta Soplin 1

Hiroaki Sugiyama 1

Hirotoshi Taira 1

Ryo Takahashi 1

Ryosuke Takahashi 1

Meguru Takasaki 1

Shinji Watanabe 1

Yotaro Watanabe 1

Kosuke Yamada 1

Yasuhisa Yoshida 1

Masashi Yoshikawa 1

Venues