Jiyi Li

2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Lijia Liu | Takumi Kondo | Kyohei Atarashi | Koh Takeuchi | Jiyi Li | Shigeru Saito | Hisashi Kashima
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

This paper investigates defenses in LLM-based evaluation, where prompt injection attacks can manipulate scores by deceiving the evaluation system. We formalize blind attacks as a class in which candidate answers are crafted independently of the true answer. To counter such attacks, we propose an evaluation framework that combines standard and counterfactual evaluation. Experiments show it significantly improves attack detection with minimal performance trade-offs for recent models.

2024

pdf bib abs

Evaluating Saliency Explanations in NLP by Crowdsourcing
Xiaotian Lu | Jiyi Li | Zhen Wan | Xiaofeng Lin | Koh Takeuchi | Hisashi Kashima
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Deep learning models have performed well on many NLP tasks. However, their internal mechanisms are typically difficult for humans to understand. The development of methods to explain models has become a key issue in the reliability of deep learning models in many important applications. Various saliency explanation methods, which give each feature of input a score proportional to the contribution of output, have been proposed to determine the part of the input which a model values most. Despite a considerable body of work on the evaluation of saliency methods, whether the results of various evaluation metrics agree with human cognition remains an open question. In this study, we propose a new human-based method to evaluate saliency methods in NLP by crowdsourcing. We recruited 800 crowd workers and empirically evaluated seven saliency methods on two datasets with the proposed method. We analyzed the performance of saliency methods, compared our results with existing automated evaluation methods, and identified notable differences between NLP and computer vision (CV) fields when using saliency methods. The instance-level data of our crowdsourced experiments and the code to reproduce the explanations are available at https://github.com/xtlu/lreccoling_evaluation.

pdf bib abs

Enhanced Coherence-Aware Network with Hierarchical Disentanglement for Aspect-Category Sentiment Analysis
Jin Cui | Fumiyo Fukumoto | Xinfeng Wang | Yoshimi Suzuki | Jiyi Li | Noriko Tomuro | Wanzeng Kong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Aspect-category-based sentiment analysis (ACSA), which aims to identify aspect categories and predict their sentiments has been intensively studied due to its wide range of NLP applications. Most approaches mainly utilize intrasentential features. However, a review often includes multiple different aspect categories, and some of them do not explicitly appear in the review. Even in a sentence, there is more than one aspect category with its sentiments, and they are entangled intra-sentence, which makes the model fail to discriminately preserve all sentiment characteristics. In this paper, we propose an enhanced coherence-aware network with hierarchical disentanglement (ECAN) for ACSA tasks. Specifically, we explore coherence modeling to capture the contexts across the whole review and to help the implicit aspect and sentiment identification. To address the issue of multiple aspect categories and sentiment entanglement, we propose a hierarchical disentanglement module to extract distinct categories and sentiment features. Extensive experimental and visualization results show that our ECAN effectively decouples multiple categories and sentiments entangled in the coherence representations and achieves state-of-the-art (SOTA) performance. Our codes and data are available online: https://github.com/cuijin-23/ECAN.

pdf bib abs

AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
Xiaotian Lu | Jiyi Li | Koh Takeuchi | Hisashi Kashima
Findings of the Association for Computational Linguistics: EMNLP 2024

Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or incorrect, unlike close-ended questions with definitive answers. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they exhibit relatively weaker performance in evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic hierarchy process (AHP) to assess answers to open-ended questions. We utilized LLMs to generate multiple evaluation criteria for a question. Subsequently, answers were subjected to pairwise comparisons under each criterion with LLMs, and scores for each answer were calculated in the AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach more closely aligns with human judgment compared to the four baselines. Additionally, we explored the impact of the number of criteria, variations in models, and differences in datasets on the results.

pdf bib abs

Human-LLM Hybrid Text Answer Aggregation for Crowd Annotations
Jiyi Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The quality is a crucial issue for crowd annotations. Answer aggregation is an important type of solution. The aggregated answers estimated from multiple crowd answers to the same instance are the eventually collected annotations, rather than the individual crowd answers themselves. Recently, the capability of Large Language Models (LLMs) on data annotation tasks has attracted interest from researchers. Most of the existing studies mainly focus on the average performance of individual crowd workers; several recent works studied the scenarios of aggregation on categorical labels and LLMs used as label creators. However, the scenario of aggregation on text answers and the role of LLMs as aggregators are not yet well-studied. In this paper, we investigate the capability of LLMs as aggregators in the scenario of close-ended crowd text answer aggregation. We propose a human-LLM hybrid text answer aggregation method with a Creator-Aggregator Multi-Stage (CAMS) crowdsourcing framework. We make the experiments based on public crowdsourcing datasets. The results show the effectiveness of our approach based on the collaboration of crowd workers and LLMs.

2023

pdf bib abs

Dialogue State Tracking with Sparse Local Slot Attention
Longfei Yang | Jiyi Li | Sheng Li | Takahiro Shinozaki
Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023)

Dialogue state tracking (DST) is designed to track the dialogue state during the conversations between users and systems, which is the core of task-oriented dialogue systems. Mainstream models predict the values for each slot with fully token-wise slot attention from dialogue history. However, such operations may result in overlooking the neighboring relationship. Moreover, it may lead the model to assign probability mass to irrelevant parts, while these parts contribute little. It becomes severe with the increase in dialogue length. Therefore, we investigate sparse local slot attention for DST in this work. Slot-specific local semantic information is obtained at a sub-sampled temporal resolution capturing local dependencies for each slot. Then these local representations are attended with sparse attention weights to guide the model to pay attention to relevant parts of local information for subsequent state value prediction. The experimental results on MultiWOZ 2.0 and 2.4 datasets show that the proposed approach effectively improves the performance of ontology-based dialogue state tracking, and performs better than token-wise attention for long dialogues.

pdf bib abs

Learning Disentangled Meaning and Style Representations for Positive Text Reframing
Xu Sheng | Fumiyo Fukumoto | Jiyi Li | Go Kentaro | Yoshimi Suzuki
Proceedings of the 16th International Natural Language Generation Conference

The positive text reframing (PTR) task which generates a text giving a positive perspective with preserving the sense of the input text, has attracted considerable attention as one of the NLP applications. Due to the significant representation capability of the pre-trained language model (PLM), a beneficial baseline can be easily obtained by just fine-tuning the PLM. However, how to interpret a diversity of contexts to give a positive perspective is still an open problem. Especially, it is more serious when the size of the training data is limited. In this paper, we present a PTR framework, that learns representations where the meaning and style of text are structurally disentangled. The method utilizes pseudo-positive reframing datasets which are generated with two augmentation strategies. A simple but effective multi-task learning-based model is learned to fuse the generation capabilities from these datasets. Experimental results on Positive Psychology Frames (PPF) dataset, show that our approach outperforms the baselines, BART by five and T5 by six evaluation metrics. Our source codes and data are available online.

pdf bib

Intermediate-Task Transfer Learning for Peer Review Score Prediction
Panitan Muangkammuen | Fumiyo Fukumoto | Jiyi Li | Yoshimi Suzuki
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib abs

Aspect-based sentiment analysis (ABSA) has been widely studied since the explosive growth of social networking services. However, the recognition of implicit sentiments that do not contain obvious opinion words remains less explored. In this paper, we propose aspect-category enhanced learning with a neural coherence model (ELCoM). It captures document-level coherence by using contrastive learning, and sentence-level by a hypergraph to mine opinions from explicit sentences to aid implicit sentiment classification. To address the issue of sentences with different sentiment polarities in the same category, we perform cross-category enhancement to offset the impact of anomalous nodes in the hypergraph and obtain sentence representations with enhanced aspect-category. Extensive experiments on benchmark datasets show that the ELCoM achieves state-of-the-art performance. Our source codes and data are released at https://github.com/cuijin-23/ELCoM.

pdf bib abs

Multi-Domain Dialogue State Tracking with Disentangled Domain-Slot Attention
Longfei Yang | Jiyi Li | Sheng Li | Takahiro Shinozaki
Findings of the Association for Computational Linguistics: ACL 2023

As the core of task-oriented dialogue systems, dialogue state tracking (DST) is designed to track the dialogue state through the conversation between users and systems. Multi-domain DST has been an important challenge in which the dialogue states across multiple domains need to consider. In recent mainstream approaches, each domain and slot are aggregated and regarded as a single query feeding into attention with the dialogue history to obtain domain-slot specific representations. In this work, we propose disentangled domain-slot attention for multi-domain dialogue state tracking. The proposed approach disentangles the domain-slot specific information extraction in a flexible and context-dependent manner by separating the query about domains and slots in the attention component. Through a series of experiments on MultiWOZ 2.0 and MultiWOZ 2.4 datasets, we demonstrate that our proposed approach outperforms the standard multi-head attention with aggregated domain-slot query.

2022

pdf bib abs

Multi-Domain Dialogue State Tracking with Top-K Slot Self Attention
Longfei Yang | Jiyi Li | Sheng Li | Takahiro Shinozaki
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

As an important component of task-oriented dialogue systems, dialogue state tracking is designed to track the dialogue state through the conversations between users and systems. Multi-domain dialogue state tracking is a challenging task, in which the correlation among different domains and slots needs to consider. Recently, slot self-attention is proposed to provide a data-driven manner to handle it. However, a full-support slot self-attention may involve redundant information interchange. In this paper, we propose a top-k attention-based slot self-attention for multi-domain dialogue state tracking. In the slot self-attention layers, we force each slot to involve information from the other k prominent slots and mask the rest out. The experimental results on two mainstream multi-domain task-oriented dialogue datasets, MultiWOZ 2.0 and MultiWOZ 2.4, present that our proposed approach is effective to improve the performance of multi-domain dialogue state tracking. We also find that the best result is obtained when each slot interchanges information with only a few slots.

pdf bib abs

Adversarial Speech Generation and Natural Speech Recovery for Speech Content Protection
Sheng Li | Jiyi Li | Qianying Liu | Zhuo Gong
Proceedings of the Thirteenth Language Resources and Evaluation Conference

With the advent of the General Data Protection Regulation (GDPR) and increasing privacy concerns, the sharing of speech data is faced with significant challenges. Protecting the sensitive content of speech is the same important as the voiceprint. This paper proposes an effective speech content protection method by constructing a frame-by-frame adversarial speech generation system. We revisited the adversarial examples generating method in the recent machine learning field and selected the phonetic state sequence of sensitive speech for the adversarial examples generation. We build an adversarial speech collection. Moreover, based on the speech collection, we proposed a neural network-based frame-by-frame mapping method to recover the speech content by converting from the adversarial speech to the human speech. Experiment shows our proposed method can encode and recover any sensitive audio, and our method is easy to be conducted with publicly available resources of speech recognition technology.

pdf bib abs

Exploiting Labeled and Unlabeled Data via Transformer Fine-tuning for Peer-Review Score Prediction
Panitan Muangkammuen | Fumiyo Fukumoto | Jiyi Li | Yoshimi Suzuki
Findings of the Association for Computational Linguistics: EMNLP 2022

Automatic Peer-review Aspect Score Prediction (PASP) of academic papers can be a helpful assistant tool for both reviewers and authors. Most existing works on PASP utilize supervised learning techniques. However, the limited number of peer-review data deteriorates the performance of PASP. This paper presents a novel semi-supervised learning (SSL) method that incorporates the Transformer fine-tuning into the Γ-model, a variant of the Ladder network, to leverage contextual features from unlabeled data. Backpropagation simultaneously minimizes the sum of supervised and unsupervised cost functions, avoiding the need for layer-wise pre-training. The experimental results show that our model outperforms the supervised and naive semi-supervised learning baselines. Our source codes are available online.

2021

pdf bib abs

Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification
Zhiwei Zhang | Jiyi Li | Fumiyo Fukumoto | Yanming Ye
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Scientific claim verification can help the researchers to easily find the target scientific papers with the sentence evidence from a large corpus for the given claim. Some existing works propose pipeline models on the three tasks of abstract retrieval, rationale selection and stance prediction. Such works have the problems of error propagation among the modules in the pipeline and lack of sharing valuable information among modules. We thus propose an approach, named as ARSJoint, that jointly learns the modules for the three tasks with a machine reading comprehension framework by including claim information. In addition, we enhance the information exchanges and constraints among tasks by proposing a regularization term between the sentence attention scores of abstract retrieval and the estimated outputs of rational selection. The experimental results on the benchmark dataset SciFact show that our approach outperforms the existing works.

2020

pdf bib abs

Multi-task Peer-Review Score Prediction
Jiyi Li | Ayaka Sato | Kazuya Shimura | Fumiyo Fukumoto
Proceedings of the First Workshop on Scholarly Document Processing

Automatic prediction on the peer-review aspect scores of academic papers can be a useful assistant tool for both reviewers and authors. To handle the small size of published datasets on the target aspect of scores, we propose a multi-task approach to leverage additional information from other aspects of scores for improving the performance of the target. Because one of the problems of building multi-task models is how to select the proper resources of auxiliary tasks and how to select the proper shared structures. We propose a multi-task shared structure encoding approach which automatically selects good shared network structures as well as good auxiliary resources. The experiments based on peer-review datasets show that our approach is effective and has better performance on the target scores than the single-task method and naive multi-task methods.

pdf bib abs

Machine metaphor understanding is one of the major topics in NLP. Most of the recent attempts consider it as classification or sequence tagging task. However, few types of research introduce the rich linguistic information into the field of computational metaphor by leveraging powerful pre-training language models. We focus a novel reading comprehension paradigm for solving the token-level metaphor detection task which provides an innovative type of solution for this task. We propose an end-to-end deep metaphor detection model named DeepMet based on this paradigm. The proposed approach encodes the global text context (whole sentence), local text context (sentence fragments), and question (query word) information as well as incorporating two types of part-of-speech (POS) features by making use of the advanced pre-training language model. The experimental results by using several metaphor datasets show that our model achieves competitive results in the second shared task on metaphor detection.

pdf bib abs

HSCNN: A Hybrid-Siamese Convolutional Neural Network for Extremely Imbalanced Multi-label Text Classification
Wenshuo Yang | Jiyi Li | Fumiyo Fukumoto | Yanming Ye
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The data imbalance problem is a crucial issue for the multi-label text classification. Some existing works tackle it by proposing imbalanced loss objectives instead of the vanilla cross-entropy loss, but their performances remain limited in the cases of extremely imbalanced data. We propose a hybrid solution which adapts general networks for the head categories, and few-shot techniques for the tail categories. We propose a Hybrid-Siamese Convolutional Neural Network (HSCNN) with additional technical attributes, i.e., a multi-task architecture based on Single and Siamese networks; a category-specific similarity in the Siamese structure; a specific sampling method for training HSCNN. The results using two benchmark datasets and three loss objectives show that our method can improve the performance of Single networks with diverse loss objectives on the tail or entire categories.

pdf bib abs

A Neural Local Coherence Analysis Model for Clarity Text Scoring
Panitan Muangkammuen | Sheng Xu | Fumiyo Fukumoto | Kanda Runapongsa Saikaew | Jiyi Li
Proceedings of the 28th International Conference on Computational Linguistics

Local coherence relation between two phrases/sentences such as cause-effect and contrast gives a strong influence of whether a text is well-structured or not. This paper follows the assumption and presents a method for scoring text clarity by utilizing local coherence between adjacent sentences. We hypothesize that the contextual features of coherence relations learned by utilizing different data from the target training data are also possible to discriminate well-structured of the target text and thus help to score the text clarity. We propose a text clarity scoring method that utilizes local coherence analysis with an out-domain setting, i.e. the training data for the source and target tasks are different from each other. The method with language model pre-training BERT firstly trains the local coherence model as an auxiliary manner and then re-trains it together with clarity text scoring model. The experimental results by using the PeerRead benchmark dataset show the improvement compared with a single model, scoring text clarity model. Our source codes are available online.

2019

pdf bib abs

Text Categorization by Learning Predominant Sense of Words as Auxiliary Task
Kazuya Shimura | Jiyi Li | Fumiyo Fukumoto
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Distributions of the senses of words are often highly skewed and give a strong influence of the domain of a document. This paper follows the assumption and presents a method for text categorization by leveraging the predominant sense of words depending on the domain, i.e., domain-specific senses. The key idea is that the features learned from predominant senses are possible to discriminate the domain of the document and thus improve the overall performance of text categorization. We propose multi-task learning framework based on the neural network model, transformer, which trains a model to simultaneously categorize documents and predicts a predominant sense for each word. The experimental results using four benchmark datasets show that our method is comparable to the state-of-the-art categorization approach, especially our model works well for categorization of multi-label documents.

pdf bib abs

A Dataset of Crowdsourced Word Sequences: Collections and Answer Aggregation for Ground Truth Creation
Jiyi Li | Fumiyo Fukumoto
Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP

The target outputs of many NLP tasks are word sequences. To collect the data for training and evaluating models, the crowd is a cheaper and easier to access than the oracle. To ensure the quality of the crowdsourced data, people can assign multiple workers to one question and then aggregate the multiple answers with diverse quality into a golden one. How to aggregate multiple crowdsourced word sequences with diverse quality is a curious and challenging problem. People need a dataset for addressing this problem. We thus create a dataset (CrowdWSA2019) which contains the translated sentences generated from multiple workers. We provide three approaches as the baselines on the task of extractive word sequence aggregation. Specially, one of them is an original one we propose which models the reliability of workers. We also discuss some issues on ground truth creation of word sequences which can be addressed based on this dataset.

2018

pdf bib abs

HFT-CNN: Learning Hierarchical Category Structure for Multi-label Short Text Categorization
Kazuya Shimura | Jiyi Li | Fumiyo Fukumoto
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We focus on the multi-label categorization task for short texts and explore the use of a hierarchical structure (HS) of categories. In contrast to the existing work using non-hierarchical flat model, the method leverages the hierarchical relations between the pre-defined categories to tackle the data sparsity problem. The lower the HS level, the less the categorization performance. Because the number of training data per category in a lower level is much smaller than that in an upper level. We propose an approach which can effectively utilize the data in the upper levels to contribute the categorization in the lower levels by applying the Convolutional Neural Network (CNN) with a fine-tuning technique. The results using two benchmark datasets show that proposed method, Hierarchical Fine-Tuning based CNN (HFT-CNN) is competitive with the state-of-the-art CNN based methods.

Co-authors

Venues

ACL1

sdp1

WS1

Fix author