Lei Chen - ACL Anthology

Lei Chen

2025

ArchiDocGen: Multi-Agent Framework for Expository Document Generation in the Architectural Industry
Junjie Jiang | Haodong Wu | Yongqi Zhang | Songyue Guo | Bingcen Liu | Caleb Chen Cao | Ruizhe Shao | Chao Guan | Peng Xu | Lei Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

The architectural industry produces extensive documents, including method statements, which are expository documents that integrate multi-source data into actionable guidance. Manual drafting, however, is labor-intensive and time-consuming. This paper introduces ArchiDocGen, a multi-agent framework automating method statement generation. Unlike traditional approaches relying on static templates or single-pass generation, ArchiDocGen decomposes the task into three steps: outline generation, section-based content generation, and polishing, each handled by specialized agents. To provide domain expertise, ArchiDocGen employs a section-based retriever to fetch and synthesize relevant documents from its custom knowledge base. Each section is generated through iterative reasoning of a section-based chain-of-thought (SeCoT) scheme, followed by refinement to meet professional standards. To evaluate the generated method statements, we partner with the industry to establish a multi-dimensional evaluation system by combining automatic and empirical methods. Experiments show that ArchiDocGen achieves 4.38 ContentScore, excelling in specialization, completeness, organization, and clarity. Additionally, a web-based application for ArchiDocGen is developed and deployed with industry partners.

Jinan Smart Education at BEA 2025 Shared Task: Dual Encoder Architecture for Tutor Identification via Semantic Understanding of Pedagogical Conversations
Lei Chen
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

With the rapid development of smart education, educational conversation systems have become an important means to support personalized learning. Identifying tutors and understanding their unique teaching style are crucial to optimizing teaching quality. However, accurately identifying tutors from multi-round educational conversations faces great challenges due to complex contextual semantics, long-term dependencies, and implicit pragmatic relationships. This paper proposes a dual-tower encoding architecture to model the conversation history and tutor responses separately, and enhances semantic fusion through four feature interaction mechanisms. To further improve robustness, this paper adopts a model ensemble voting strategy based on five-fold cross-validation. Experiments on the BEA 2025 shared task dataset show that our method achieves 89.65% Macro-F1 in tutor identification and ranks fourth among all teams (4/20), demonstrating its effectiveness and potential in educational AI applications. We have made the corresponding code publicly accessible at https://github.com/leibnizchen/Dual-Encoder.
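
The ensemble voting and Macro-F1 evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each of the five fold models outputs one hard label per example, and the function names `majority_vote` and `macro_f1` are hypothetical.

```python
from collections import Counter


def majority_vote(fold_predictions):
    """Combine per-model label lists (one list per fold model) by majority vote.

    Ties are broken in favor of the label seen first among the models,
    which is an arbitrary choice made for this sketch.
    """
    voted = []
    for labels in zip(*fold_predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy predictions from five hypothetical fold models on three examples.
fold_preds = [
    ["A", "B", "A"],
    ["A", "B", "B"],
    ["B", "B", "A"],
    ["A", "A", "A"],
    ["A", "B", "B"],
]
ensemble = majority_vote(fold_preds)
score = macro_f1(["A", "B", "B"], ensemble)
```

Macro-F1 averages class-level F1 without weighting by class frequency, so rare classes count as much as frequent ones.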

Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Haoyang Li | Xuejia Chen | Zhanchao Xu | Darian Li | Nicole Hu | Fei Teng | Yiming Li | Luyu Qiu | Chen Jason Zhang | Li Qing | Lei Chen
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and multi-step reasoning. NumericBench includes datasets ranging from synthetic number lists to crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released in: https://github.com/TreeAI-Lab/NumericBench.

KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering
Yushi Sun | Kai Sun | Yifan Ethan Xu | Xiao Yang | Xin Luna Dong | Nan Tang | Lei Chen
Findings of the Association for Computational Linguistics: EMNLP 2025

Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves only the knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.

Automate Strategy Finding with LLM in Quant Investment
Zhizhuo Kou | Holam Yu | Junyu Luo | Jingshu Peng | Xujia Li | Chengzhong Liu | Juntao Dai | Lei Chen | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: EMNLP 2025

We present a novel three-stage framework leveraging Large Language Models (LLMs) within a risk-aware multi-agent system for automated strategy finding in quantitative finance. Our approach addresses the brittleness of traditional deep learning models in financial applications by: employing prompt-engineered LLMs to generate executable alpha factor candidates across diverse financial data, implementing multimodal agent-based evaluation that filters factors based on market status and predictive quality while maintaining category balance, and deploying dynamic weight optimization that adapts to market conditions. Experimental results demonstrate the robust performance of the strategy in Chinese and US market regimes compared to established benchmarks. Our work extends LLMs' capabilities to quantitative trading, providing a scalable architecture for financial signal extraction and portfolio construction. The overall framework significantly outperforms all benchmarks with 53.17% cumulative return on SSE50 (Jan 2023 to Jan 2024), demonstrating superior risk-adjusted performance and downside protection on the market.

AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios
Xinyi Mou | Jingcong Liang | Jiayu Lin | Xinnong Zhang | Xiawei Liu | Shiyue Yang | Rong Ye | Lei Chen | Haoyu Kuang | Xuanjing Huang | Zhongyu Wei
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) are increasingly leveraged to empower autonomous agents to simulate human beings in various fields of behavioral research. However, evaluating their capacity to navigate complex social interactions remains a challenge. Previous studies face limitations due to insufficient scenario diversity, complexity, and a single-perspective focus. To this end, we introduce AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios. Drawing on Dramaturgical Theory, AgentSense employs a bottom-up approach to create 1,225 diverse social scenarios constructed from extensive scripts. We evaluate LLM-driven agents through multi-turn interactions, emphasizing both goal completion and implicit reasoning. We analyze goals using ERG theory and conduct comprehensive experiments. Our findings highlight that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and even GPT-4o requires improvement in private information reasoning.

2024

SMARTCAL: An Approach to Self-Aware Tool-Use Evaluation and Calibration
Yuanhao Shen | Xiaodan Zhu | Lei Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

The tool-use ability of Large Language Models (LLMs) has a profound impact on a wide range of applications. However, LLMs’ self-awareness and self-control capability in appropriately using tools remain understudied. The problem is consequential as it signals a potential risk of degraded performance and poses a threat to the trustworthiness of the models. In this paper, we conduct a study on a family of state-of-the-art LLMs on three datasets with two mainstream tool-use frameworks. Our study reveals the tool-abuse behavior of LLMs, a tendency for models to misuse tools along with models’ frequent overconfidence in tool choice. We also find that this is a common issue regardless of model capability. Accordingly, we propose a novel framework, SMARTCAL, to mitigate the observed issues, and our results show an average 8.6 percent increase in QA performance on three test datasets and 21.6 percent lower Expected Calibration Error (ECE) than existing methods.
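
For readers unfamiliar with the Expected Calibration Error (ECE) reported above, the following is a minimal sketch of the standard binned estimate. It assumes equal-width confidence bins and is not taken from the SMARTCAL paper; the function name and the choice of 10 bins are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    Bins are equal-width and half-open, [lo, hi), with 1.0 placed in the last bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Two confident correct answers and two less confident answers, one of them wrong.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

A lower ECE means the model's stated confidence tracks its actual accuracy more closely; an overconfident model has mean confidence above accuracy in its high-confidence bins.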

Refining and Synthesis: A Simple yet Effective Data Augmentation Framework for Cross-Domain Aspect-based Sentiment Analysis
Haining Wang | Kang He | Bobo Li | Lei Chen | Fei Li | Xu Han | Chong Teng | Donghong Ji
Findings of the Association for Computational Linguistics: ACL 2024

Aspect-based Sentiment Analysis (ABSA) is extensively researched in the NLP community, yet related models face challenges due to data sparsity when shifting to a new domain. Hence, data augmentation for cross-domain ABSA has attracted increasing attention in recent years. However, two key points have been neglected in prior studies: First, target domain unlabeled data are labeled with pseudo labels by the model trained in the source domain with little quality control, leading to inaccuracy and error propagation. Second, the label and text patterns of generated labeled data are monotonous, thus limiting the robustness and generalization ability of trained ABSA models. In this paper, we aim to design a simple yet effective framework to address the above shortcomings in ABSA data augmentation, called Refining and Synthesis Data Augmentation (RSDA). Our framework includes two main steps: First, it refines generated labeled data using a natural language inference (NLI) filter to control data quality. Second, it synthesizes diverse labeled data via novel label composition and paraphrase approaches. We conduct experiments on 4 kinds of ABSA subtasks, and our framework outperforms 7 strong baselines, demonstrating its effectiveness.

R²AG: Incorporating Retrieval Information into Retrieval Augmented Generation
Fuda Ye | Shuangyin Li | Yongqi Zhang | Lei Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R²AG, a novel enhanced RAG framework to fill this gap by incorporating **R**etrieval information into **R**etrieval **A**ugmented **G**eneration. Specifically, R²AG utilizes the nuanced features from the retrievers and employs an R²-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs’ generation. Notably, R²AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R²AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.

What Factors Influence LLMs’ Judgments? A Case Study on Question Answering
Lei Chen | Bobo Li | Li Zheng | Haining Wang | Zixiang Meng | Runfeng Shi | Hao Fei | Jun Zhou | Fei Li | Chong Teng | Donghong Ji
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) are now being considered as judges of high efficiency to evaluate the quality of answers generated by candidate models. However, their judgments may be influenced by complex scenarios and inherent biases, raising concerns about their reliability. This study aims to bridge this gap by introducing four unexplored factors and examining the performance of LLMs as judges, namely answer quantity, inducing statements, judging strategy, and judging style. Additionally, we introduce a new dimension of question difficulty to provide a more comprehensive understanding of LLMs’ judgments across varying question intricacies. We employ ChatGPT, GPT-4, Gemini, and Claude-2 as judges and conduct experiments on Vicuna Benchmark and MT-bench. Our study reveals that LLMs’ judging abilities are susceptible to the influence of these four factors, and analyzing from the newly proposed dimension of question difficulty is highly necessary. We also provide valuable insights into optimizing LLMs’ performance as judges, enhancing their reliability and adaptability across diverse evaluation scenarios.

Who Responded to Whom: The Joint Effects of Latent Topics and Discourse in Conversation Structure
Lu Ji | Lei Chen | Jing Li | Zhongyu Wei | Qi Zhang | Xuanjing Huang
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

A vast number of online conversations are produced on a daily basis, resulting in a pressing need for automatic conversation understanding. As a basis to structure a discussion, we identify the responding relations in the conversation discourse, which link response utterances to their initiations. To figure out who responded to whom, here we explore how the consistency of topic contents and dependency of discourse roles indicate such interactions, whereas most prior work ignores the effects of latent factors underlying word occurrences. We propose a neural model to learn latent topics and discourse in word distributions, and predict pairwise initiation-response links via exploiting topic consistency and discourse dependency. Experimental results on both English and Chinese conversations show that our model significantly outperforms the previous state of the art.

2023

CAT: A Contextualized Conceptualization and Instantiation Framework for Commonsense Reasoning
Weiqi Wang | Tianqing Fang | Baixuan Xu | Chun Yi Louis Bo | Yangqiu Song | Lei Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Commonsense reasoning, aiming at endowing machines with a human-like ability to make situational presumptions, is extremely challenging to generalize. Someone who barely knows about “meditation” but is knowledgeable about “singing” can still infer that “meditation makes people relaxed” from the existing knowledge that “singing makes people relaxed” by first conceptualizing “singing” as a “relaxing event” and then instantiating that event to “meditation.” This process, known as conceptual induction and deduction, is fundamental to commonsense reasoning, yet it lacks both labeled data and methodologies to enhance commonsense modeling. To fill such a research gap, we propose CAT (Contextualized ConceptuAlization and InsTantiation), a semi-supervised learning framework that integrates event conceptualization and instantiation to conceptualize commonsense knowledge bases at scale. Extensive experiments show that our framework achieves state-of-the-art performances on two conceptualization tasks, and the acquired abstract commonsense knowledge can significantly improve commonsense inference modeling. Our code, data, and fine-tuned models are publicly available at https://github.com/HKUST-KnowComp/CAT.

A Simple and Effective Framework for Strict Zero-Shot Hierarchical Classification
Rohan Bhambhoria | Lei Chen | Xiaodan Zhu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In recent years, large language models (LLMs) have achieved strong performance on benchmark tasks, especially in zero or few-shot settings. However, these benchmarks often do not adequately address the challenges posed in the real world, such as that of hierarchical classification. In order to address this challenge, we propose refactoring conventional tasks on hierarchical datasets into a more indicative long-tail prediction task. We observe LLMs are more prone to failure in these cases. To address these limitations, we propose the use of entailment-contradiction prediction in conjunction with LLMs, which allows for strong performance in a strict zero-shot setting. Importantly, our method does not require any parameter updates, a resource-intensive process, and achieves strong performance across multiple datasets.

Weighted Contrastive Learning With False Negative Control to Help Long-tailed Product Classification
Tianqi Wang | Lei Chen | Xiaodan Zhu | Younghun Lee | Jing Gao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Item categorization (IC) aims to classify product descriptions into leaf nodes in a categorical taxonomy, which is a key technology used in a wide range of applications. Because most datasets have a long-tailed distribution, classification performance on tail labels tends to be poor due to scarce supervision, causing many issues in real-life applications. To address the IC task’s long-tail issue, K-positive contrastive loss (KCL), originally proposed for image classification, can be applied to the IC task when using text-based contrastive learning, e.g., SimCSE. However, one shortcoming of using KCL has been neglected in previous research: false negative (FN) instances may harm the KCL’s representation learning. To address the FN issue in the KCL, we propose to re-weight the positive pairs in the KCL loss with a regularization that constrains the sum of weights to be as close as possible to K+1. After controlling FN instances with the proposed method, IC performance is further improved and is superior to that of other methods addressing the long-tail (LT) issue.

Towards a Unified Conversational Recommendation System: Multi-task Learning via Contextualized Knowledge Distillation
Yeongseo Jung | Eunseo Jung | Lei Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In a Conversational Recommendation System (CRS), an agent is asked to recommend a set of items to users within natural language conversations. To address the need for both conversational capability and personalized recommendations, prior works have utilized separate recommendation and dialogue modules. However, such an approach inevitably results in a discrepancy between recommendation results and generated responses. To bridge the gap, we propose multi-task learning for a unified CRS, where a single model jointly learns both tasks via Contextualized Knowledge Distillation (ConKD). We introduce two versions of ConKD: hard gate and soft gate. The former selectively gates between two task-specific teachers, while the latter integrates knowledge from both teachers. Our gates are computed on-the-fly in a context-specific manner, facilitating flexible integration of relevant knowledge. Extensive experiments demonstrate that our single model significantly improves recommendation performance while enhancing fluency, and achieves comparable results in terms of diversity.

Learning to Describe for Predicting Zero-shot Drug-Drug Interactions
Fangqi Zhu | Yongqi Zhang | Lei Chen | Bing Qin | Ruifeng Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Adverse drug-drug interactions (DDIs) can compromise the effectiveness of concurrent drug administration, posing a significant challenge in healthcare. As the development of new drugs continues, the potential for unknown adverse effects resulting from DDIs becomes a growing concern. Traditional computational methods for DDI prediction may fail to capture interactions for new drugs due to the lack of knowledge. In this paper, we introduce a new problem setup as zero-shot DDI prediction that deals with the case of new drugs. Leveraging textual information from online databases like DrugBank and PubChem, we propose an innovative approach TextDDI with a language model-based DDI predictor and a reinforcement learning (RL)-based information selector, enabling the selection of concise and pertinent text for accurate DDI prediction on new drugs. Empirical results show the benefits of the proposed approach on several settings including zero-shot and few-shot DDI prediction, and the selected texts are semantically relevant. Our code and data are available at https://github.com/zhufq00/DDIs-Prediction.

Topic-DPR: Topic-based Prompts for Dense Passage Retrieval
Qingfa Xiao | Shuangyin Li | Lei Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Prompt-based learning’s efficacy across numerous natural language processing tasks has led to its integration into dense passage retrieval. Prior research has mainly focused on enhancing the semantic understanding of pre-trained language models by optimizing a single vector as a continuous prompt. This approach, however, leads to a semantic space collapse; identical semantic information seeps into all representations, causing their distributions to converge in a restricted region. This hinders differentiation between relevant and irrelevant passages during dense retrieval. To tackle this issue, we present Topic-DPR, a dense passage retrieval model that uses topic-based prompts. Unlike the single prompt method, multiple topic-based prompts are established over a probabilistic simplex and optimized simultaneously through contrastive learning. This encourages representations to align with their topic distributions, improving space uniformity. Furthermore, we introduce a novel positive and negative sampling strategy, leveraging semi-structured data to boost dense retrieval efficiency. Experimental results from two datasets affirm that our method surpasses previous state-of-the-art retrieval techniques.

2022

To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.

Automatically Detecting Reduced-formed English Pronunciations by Using Deep Learning
Lei Chen | Chenglin Jiang | Yiwei Gu | Yang Liu | Jiahong Yuan
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

Reduced form pronunciations are widely used by native English speakers, especially in casual conversations. Second language (L2) learners have difficulty in processing reduced form pronunciations in listening comprehension and face challenges in production too. Meanwhile, training applications dedicated to reduced forms are still few. To solve this issue, we report on our first effort of using deep learning to evaluate L2 learners’ reduced form pronunciations. Compared with a baseline solution that uses an ASR to determine regular or reduced-formed pronunciations, a classifier that learns representative features via a convolutional neural network (CNN) on low-level acoustic features yields higher detection performance. The F-1 metric increases from 0.690 to 0.757 on the reduction task. Furthermore, adding word entities to compute attention weights to better adjust the features learned by the CNN model helps increase F-1 to 0.763.

A Progressive Framework for Role-Aware Rumor Resolution
Lei Chen | Guanying Li | Zhongyu Wei | Yang Yang | Baohua Zhou | Qi Zhang | Xuanjing Huang
Proceedings of the 29th International Conference on Computational Linguistics

Existing works on rumor resolution have shown great potential in recognizing word appearance and user participation. However, they ignore the intrinsic propagation mechanisms of rumors and present poor adaptive ability when unprecedented news emerges. To exploit the fine-grained rumor diffusion patterns and generalize rumor resolution methods, we formulate a predecessor task to identify triggering posts, and then exploit their characteristics to facilitate rumor verification. We design a tree-structured annotation interface and extend PHEME dataset with labels on the message level. Data analysis shows that triggers play a critical role in verifying rumors and present similar lingual patterns across irrelevant events. We propose a graph-based model considering the direction and interaction of information flow to implement role-aware rumor resolution. Experimental results demonstrate the effectiveness of our proposed model and progressive scheme.

Utilizing Cross-Modal Contrastive Learning to Improve Item Categorization BERT Model
Lei Chen | Hou Wei Chou
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

Item categorization (IC) is a core natural language processing (NLP) task in e-commerce. As a special text classification task, fine-tuning pre-trained models, e.g., BERT, has become a mainstream solution. To improve IC performance further, other product metadata, e.g., product images, have been used. Although multimodal IC (MIC) systems show higher performance, expanding from processing text to more resource-demanding images brings large engineering impacts and hinders the deployment of such dual-input MIC systems. In this paper, we propose a new way of using product images to improve a text-only IC model: leveraging cross-modal signals between products’ titles and associated images to adapt BERT models in a self-supervised learning (SSL) way. Our experiments on the three genres in the public Amazon product dataset show that the proposed method yields higher prediction accuracy and macro-F1 values than simply using the original BERT. Moreover, the proposed method is able to keep using the existing text-only IC inference implementation and shows a resource advantage over the deployment of a dual-input MIC system.

Developing Prefix-Tuning Models for Hierarchical Text Classification
Lei Chen | Houwei Chou | Xiaodan Zhu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Hierarchical text classification (HTC) is a key problem and task in many industrial applications, which aims to predict labels organized in a hierarchy for given input text. For example, HTC can group the descriptions of online products into a taxonomy or organize customer reviews into a hierarchy of categories. In real-life applications, while Pre-trained Language Models (PLMs) have dominated many NLP tasks, they face significant challenges too: the conventional fine-tuning process needs to modify and save models with a huge number of parameters. This is becoming more critical for HTC in both global and local modelling, since the latter needs to learn multiple classifiers at different levels/nodes in a hierarchy. The concern will be even more serious since PLM sizes are continuing to increase in order to attain more competitive performances. Most recently, prefix tuning has become a very attractive technology by only tuning and saving a tiny set of parameters. Exploring prefix tuning for HTC is hence highly desirable and has timely impact. In this paper, we investigate prefix tuning on HTC in two typical setups: local and global HTC. Our experiment shows that the prefix-tuning model only needs less than 1% of parameters and can achieve performance comparable to regular full fine-tuning. We demonstrate that using contrastive learning in learning prefix vectors can further improve HTC performance.

RIT Boston at SemEval-2022 Task 5: Multimedia Misogyny Detection By Using Coherent Visual and Language Features from CLIP Model and Data-centric AI Principle
Lei Chen | Hou Wei Chou
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Detecting whether MEME images are misogynous or not is an application useful for curbing online hateful information against women. In the SemEval-2022 Multimedia Automatic Misogyny Identification (MAMI) challenge, we designed a system using two simple but effective principles. First, we leverage recently emerging Transformer models pre-trained (mostly in a self-supervised learning way) on massive data sets to obtain very effective visual (V) and language (L) features. In particular, we used the CLIP model provided by OpenAI to obtain coherent V and L features and then simply used a logistic regression model to make binary predictions. Second, we placed more emphasis on data than on tweaking models by following the data-centric AI principle. These principles proved to be useful: our final macro-F1 is 0.778 for MAMI task A, ranking third among participant teams.

2021

Align Voting Behavior with Public Statements for Legislator Representation Learning
Xinyi Mou | Zhongyu Wei | Lei Chen | Shangyi Ning | Yancheng He | Changjian Jiang | Xuanjing Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The ideology of legislators is typically estimated by ideal point models from historical records of votes. It represents legislators and legislation as points in a latent space and shows promising results for modeling voting behavior. However, it fails to capture more specific attitudes of legislators toward emerging issues and is unable to model newly-elected legislators without voting histories. In order to mitigate these two problems, we explore incorporating both voting behavior and public statements on Twitter to jointly model legislators. In addition, we propose a novel task, namely hashtag usage prediction, to model the ideology of legislators on Twitter. In practice, we construct a heterogeneous graph for the legislative context and use relational graph neural networks to learn the representation of legislators with the guidance of historical records of their voting and hashtag usage. Experiment results indicate that our model yields significant improvements for the task of roll call vote prediction. Further analysis demonstrates that the legislator representation we learned captures nuances in statements.

Multimodal Item Categorization Fully Based on Transformer
Lei Chen | Houwei Chou | Yandi Xia | Hirokazu Miyake
Proceedings of the 4th Workshop on e-Commerce and NLP

The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.

Label-Guided Learning for Item Categorization in e-Commerce
Lei Chen | Hirokazu Miyake
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Item categorization is an important application of text classification in e-commerce due to its impact on the online shopping experience of users. One class of text classification techniques that has gained attention recently is using the semantic information of the labels to guide the classification task. We have conducted a systematic investigation of the potential benefits of these methods on a real data set from a major e-commerce company in Japan. Furthermore, using a hyperbolic space to embed product labels that are organized in a hierarchical structure led to better performance compared to using a conventional Euclidean space embedding. These findings demonstrate how label-guided learning can improve item categorization systems in the e-commerce domain.

2020

PG-GSQL: Pointer-Generator Network with Guide Decoding for Cross-Domain Context-Dependent Text-to-SQL Generation
Huajie Wang | Mei Li | Lei Chen
Proceedings of the 28th International Conference on Computational Linguistics

Text-to-SQL is the task of translating utterances to SQL queries, and most existing neural approaches to text-to-SQL focus on the cross-domain context-independent generation task. We focus on the cross-domain context-dependent text-to-SQL generation task, which requires a model to depend on the interaction history and current utterance to generate the SQL query. In this paper, we present an encoder-decoder model called PG-GSQL, based on an interaction-level encoder and with two effective innovations in the decoder, to solve the cross-domain context-dependent text-to-SQL task. 1) To effectively capture historical information of the SQL query and reuse the previous SQL query tokens, we use a hybrid pointer-generator network as the decoder: the pointer copies tokens from the previous SQL query, while the generator part generates new tokens. 2) We propose a guide component that limits the prediction space of the vocabulary to avoid table-column dependency and foreign key dependency errors during the decoding phase. In addition, we design a column-table linking mechanism to improve the prediction accuracy of tables. On the challenging cross-domain context-dependent text-to-SQL benchmark SParC, PG-GSQL achieves 34.0% question matching accuracy and 19.0% interaction matching accuracy on the dev set. With BERT augmentation, PG-GSQL obtains 53.1% question matching accuracy and 34.7% interaction matching accuracy on the dev set, outperforming the previous state-of-the-art model by 5.9% question matching accuracy and 5.2% interaction matching accuracy. Our code is publicly available.

Modeling Evolution of Message Interaction for Rumor Resolution
Lei Chen | Zhongyu Wei | Jing Li | Baohua Zhou | Qi Zhang | Xuanjing Huang
Proceedings of the 28th International Conference on Computational Linguistics

Previous work on rumor resolution concentrates on exploiting time-series characteristics or modeling topology structure separately. However, how local interactive patterns affect global information assemblage has not been explored. In this paper, we attempt to address the problem by learning the evolution of message interaction. We model confrontation and reciprocity between message pairs via discrete variational autoencoders, which effectively reflect the diversified opinion interactivity. Moreover, we capture the variation of message interaction using a hierarchical framework to better integrate the information flow of a rumor cascade. Experiments on the PHEME dataset demonstrate that our proposed model achieves higher accuracy than existing methods.

2019

A Challenge Dataset and Effective Models for Aspect-Based Sentiment Analysis
Qingnan Jiang | Lei Chen | Ruifeng Xu | Xiang Ao | Min Yang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Aspect-based sentiment analysis (ABSA) has attracted increasing attention recently due to its broad applications. In existing ABSA datasets, most sentences contain only one aspect or multiple aspects with the same sentiment polarity, which makes the ABSA task degenerate to sentence-level sentiment analysis. In this paper, we present a new large-scale Multi-Aspect Multi-Sentiment (MAMS) dataset, in which each sentence contains at least two different aspects with different sentiment polarities. The release of this dataset would push forward the research in this field. In addition, we propose simple yet effective CapsNet and CapsNet-BERT models which combine the strengths of recent NLP advances. Experiments on our new dataset show that the proposed model significantly outperforms the state-of-the-art baseline methods.

2018

CONDUCT: An Expressive Conducting Gesture Dataset for Sound Control
Lei Chen | Sylvie Gibet | Camille Marteau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Predicting Audience’s Laughter During Presentations Using Convolutional Neural Network
Lei Chen | Chong Min Lee
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Public speaking plays an important role in schools and workplaces, and properly using humor contributes to effective presentations. For the purpose of automatically evaluating speakers’ humor usage, we build a presentation corpus containing humorous utterances based on TED talks. Compared to previous data resources supporting humor recognition research, ours has several advantages, including (a) both positive and negative instances coming from a homogeneous data set, (b) containing a large number of speakers, and (c) being open. Focusing on using lexical cues for humor recognition, we systematically compare a newly emerging text classification method based on Convolutional Neural Networks (CNNs) with a well-established conventional method using linguistic knowledge. The advantages of the CNN method are that it achieves higher detection accuracy and is able to learn essential features automatically.

2016

Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests
Takakazu Imada | Yusuke Inoue | Lei Chen | Syunya Doi | Tian Nie | Chen Zhao | Takehito Utsuro | Yasuhide Kawada
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper proposes a way to utilize a search engine in order to predict market shares. Given a specific product domain, we propose to compare, among several companies that supply products, the rates of concerns of those who search for Web pages. We measure the concerns of those who search for Web pages through search engine suggests. Then, we analyze whether these rates of concerns have a certain correlation with actual market share. We show that those statistics have certain correlations. Finally, we propose a way to predict the market share of a specific product genre based on the rates of concerns of those who search for Web pages.

Can We Make Computers Laugh at Talks?
Chong Min Lee | Su-Youn Yoon | Lei Chen
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

Considering the importance of public speaking skills, a system that predicts where audiences laugh in a talk can be helpful to a person preparing for a talk. We investigated the possibility that a state-of-the-art humor recognition system can be used to detect sentences inducing laughter in talks. In this study, we used TED talks and the laughter in the talks as data. Our results showed that the state-of-the-art system needs to be improved in order to be used in a practical application. In addition, our analysis showed that classifying humorous sentences in talks is very challenging due to the close distance between humorous and non-humorous sentences.

2015

Feature selection for automated speech scoring
Anastassia Loukina | Klaus Zechner | Lei Chen | Michael Heilman
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications

2014

Automatic evaluation of spoken summaries: the case of language assessment
Anastassia Loukina | Klaus Zechner | Lei Chen
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

Automated scoring of speaking items in an assessment for teachers of English as a Foreign Language
Klaus Zechner | Keelan Evanini | Su-Youn Yoon | Lawrence Davis | Xinhao Wang | Lei Chen | Chong Min Lee | Chee Wee Leong
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications

2013

Building Comparable Corpora Based on Bilingual LDA Model
Zede Zhu | Miao Li | Lei Chen | Zhenxin Yang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Applying Unsupervised Learning To Support Vector Space Model Based Speaking Assessment
Lei Chen
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Evaluating Unsupervised Language Model Adaptation Methods for Speaking Assessment
Shasha Xie | Lei Chen
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

2012

Utilizing Cumulative Logit Model and Human Computation on Automated Speech Assessment
Lei Chen
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Scoring Spoken Responses Based on Content Accuracy
Fei Huang | Lei Chen | Jana Sukkarieh
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

2011

Detecting Structural Events for Assessing Non-Native Speech
Lei Chen | Su-Youn Yoon
Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications

2010

Towards Using Structural Events To Assess Non-native Speech
Lei Chen | Joel Tetreault | Xiaoming Xi
Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications

2009

Improved pronunciation features for construct-driven assessment of non-native spontaneous speech
Lei Chen | Klaus Zechner | Xiaoming Xi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Automatic Scoring of Children’s Read-Aloud Text Passages and Word Lists
Klaus Zechner | John Sabatini | Lei Chen
Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications

2006

An Open Source Prosodic Feature Extraction Tool
Zhongqiang Huang | Lei Chen | Mary Harper
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

There has been an increasing interest in utilizing a wide variety of knowledge sources in order to perform automatic tagging of speech events, such as sentence boundaries and dialogue acts. In addition to the words spoken, the prosodic content of the speech has proved quite valuable in a variety of spoken language processing tasks such as sentence segmentation and tagging, disfluency detection, dialog act segmentation and tagging, and speaker recognition. In this paper, we report on an open source prosodic feature extraction tool based on Praat, with a description of the prosodic features and the implementation details, as well as a discussion of its extension capability. We also evaluate our tool on a sentence boundary detection task and report the system performance on the NIST RT04 CTS data.

Incorporating Gesture and Gaze into Multimodal Models of Human-to-Human Communication
Lei Chen
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Doctoral Consortium

2004

Evaluating Factors Impacting the Accuracy of Forced Alignments in a Multimodal Corpus
Lei Chen | Yang Liu | Mary Harper | Eduardo Maia | Susan McRoy
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

People, when processing human-to-human communication, utilize everything they can in order to understand that communication, including speech and information such as the time and location of an interlocutor's gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments to support the rapid production of these multimodal corpora including the acoustic model, the match between the speech used for training the system and that to be force aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.

Co-authors

Yang Liu (刘扬) 2

Anastassia Loukina 2

Hirokazu Miyake 2

Ruifeng Xu (徐睿峰) 2

Rohan Bhambhoria 1

Chun Yi Louis Bo 1

Caleb Chen Cao 1

Lawrence Davis 1

Xin Luna Dong 1

Keelan Evanini 1

Tianqing Fang 1

Xu Han (韩旭) 1

Michael Heilman 1

Zhongqiang Huang 1

Takakazu Imada 1

Changjian Jiang 1

Chenglin Jiang 1

Qingnan Jiang 1

Yeongseo Jung 1

Yasuhide Kawada 1

Chee Wee Leong 1

Jingcong Liang 1

Chengzhong Liu 1

Camille Marteau 1

Susan W. McRoy 1

Bing Qin (秦兵) 1

John Sabatini 1

Jana Sukkarieh 1

Fei Teng (滕飞) 1

Joel Tetreault 1

Takehito Utsuro 1

Yifan Ethan Xu 1

Chen Jason Zhang 1

Xinnong Zhang 1

Venues