Zichao Li

2025

Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.

Automated Alignment refers to a set of algorithms designed to align Large Language Models (LLMs) with human intentions and values while minimizing manual intervention. However, it faces challenges such as algorithmic diversity and excessively convoluted workflows. We present AutoAlign, an open-source toolkit that offers: (1) a unified framework integrating mainstream automated algorithms through a consistent interface, and (2) an accessible workflow supporting one-click execution for prompt synthesis, automatic alignment signal construction, and iterative model training. Our toolkit enables easy reproduction of existing results through extensive benchmarks and facilitates the development of novel approaches via modular components. It includes implementations for both highly efficient inference and training, as well as low-resource training. By standardizing automated alignment methodologies and providing accessible implementations, AutoAlign lowers the barriers to building customized aligned models and supports academic research.

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models’ abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general Vision-Language Models, we identify the gap between current work and the unified, realistic DSE objective for the first time. We hope that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.

Knowledge-Grounded Detection of Cryptocurrency Scams with Retrieval-Augmented LMs
Zichao Li
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

This paper presents a knowledge-grounded framework for cryptocurrency scam detection using retrieval-augmented language models. We address three key limitations of existing approaches: static knowledge bases, unreliable LM outputs, and fixed classification thresholds. Our method combines (1) temporally-weighted retrieval from scam databases, (2) confidence-aware fusion of parametric and external knowledge, and (3) adaptive threshold optimization via gradient ascent. Experiments on CryptoScams and Twitter Financial Scams datasets demonstrate state-of-the-art performance, with 22% higher recall at equivalent precision compared to fixed thresholds, 4.3× lower hallucination rates than pure LMs, and 89% temporal performance retention on emerging scam types. The system achieves real-time operation (45ms/query) while maintaining interpretability through evidence grounding. Ablation studies confirm each component’s necessity, with confidence fusion proving most critical (12.1% performance drop when removed). These advances enable more robust monitoring of evolving cryptocurrency threats while addressing fundamental challenges in knowledgeable foundation models.

Cross-Modal Augmentation for Low-Resource Language Understanding and Generation
Zichao Li | Zong Ke
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)

This paper introduces a multimodal retrieval-augmented generation (RAG) system designed to enhance language understanding and generation for low-resource languages. By integrating textual, visual, and geospatial data, the system leverages cross-lingual adaptation and multimodal augmentation to bridge the gap between high-resource and low-resource languages. Evaluated on the MM-COVID and LORELEI datasets, the system demonstrates superior performance in retrieval (precision: 85%, recall: 82%) and generation (BLEU: 28.4) tasks compared to baselines. Case studies in public health communication and disaster response highlight its practical utility. The results underscore the potential of multimodal AI to democratize access to technology and address global challenges in low-resource settings.

Formula-Text Cross-Retrieval: A Benchmarking Study of Dense Embedding Methods for Mathematical Information Retrieval
Zichao Li
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)

Mathematical information retrieval requires understanding the complex relationship between natural language and formulae. This paper presents a benchmarking study on Formula-Text Cross-Retrieval, comparing a sparse baseline (BM25), off-the-shelf dense embeddings (OpenAI, BGE), and a fine-tuned dual-encoder model. Our model, trained with a contrastive objective on the ARQAR dataset, significantly outperforms all baselines, achieving state-of-the-art results. Ablation studies confirm the importance of linearization, a shared-weight architecture, and the Multiple Negatives Ranking loss. The work provides a strong foundation for mathematical NLP applications.
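As a rough illustration of the Multiple Negatives Ranking loss named above, the sketch below computes an in-batch softmax cross-entropy over a similarity matrix in which entry (i, j) scores query i against candidate j and the diagonal holds the true pairs. The function name, the `scale` parameter, and the toy scores are assumptions made for illustration, not details taken from the paper.

```python
import math

def multiple_negatives_ranking_loss(sim, scale=1.0):
    """Mean cross-entropy where row i's positive is column i.

    sim[i][j] is a similarity score between query i and candidate j;
    every off-diagonal candidate in a row acts as an in-batch negative.
    `scale` is a hypothetical temperature-like multiplier.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(sim)

# Toy 2x2 batches (made-up scores): aligned pairs vs. no signal at all.
well_aligned = [[5.0, 0.0], [0.0, 5.0]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]
loss_good = multiple_negatives_ranking_loss(well_aligned)
loss_flat = multiple_negatives_ranking_loss(uninformative)
```

With no signal, the loss equals the log of the batch size; it shrinks toward zero as matched pairs score higher than mismatched ones.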

Domain Meets Typology: Predicting Verb-Final Order from Universal Dependencies for Financial and Blockchain NLP
Zichao Li | Zong Ke
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

This paper introduces a domain-adapted approach for verb-order prediction across general and specialized texts (financial/blockchain), combining Universal Dependencies syntax with novel features (AVAR, DLV) and dynamic threshold calibration. We evaluate on 53 languages from UD v2.11, 12K financial sentences (FinBench), and 1,845 blockchain whitepapers (CryptoUD), outperforming four baselines by 6-19% F1. Key findings include: (1) 62% SOV prevalence in SEC filings (+51% over general English), (2) 88% technical whitepaper alignment with Solidity’s SOV patterns, and (3) 9% gains from adaptive thresholds. The system processes 1,150 sentences/second, 2.4× faster than XLM-T, while maintaining higher accuracy, demonstrating that lightweight feature-based methods can surpass neural approaches for domain-specific syntactic analysis.

Retrieval-Augmented Forecasting with Tabular Time Series Data
Zichao Li
Proceedings of the 4th Table Representation Learning Workshop

This paper presents Retrieval-Augmented Forecasting (RAF), a novel framework for tabular time-series prediction that dynamically retrieves and integrates relevant historical table slices. RAF addresses three key limitations of existing methods: 1) schema rigidity through dynamic hashing of column metadata, 2) temporal myopia via cross-attention with learned decay, and 3) pipeline sub-optimality via end-to-end retriever-forecaster co-training. Experiments across macroeconomic (FRED-MD), financial (Yahoo Finance), and development (WorldBank) benchmarks demonstrate RAF’s superiority over six baselines, reducing sMAPE by 19.1-26.5% while maintaining robustness to schema changes (+3.2% sMAPE increase vs. +6.7-12.7% for alternatives). The architecture’s computational overhead (1.8 hours/epoch vs. 1.2 for TFT) is justified by significant accuracy gains in critical scenarios like market shocks (61.7% vs. 55.1% directional accuracy).
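For readers unfamiliar with the sMAPE metric reported above, here is a minimal sketch of one common definition (absolute error divided by the mean of the absolute actual and forecast values, averaged and expressed in percent). Several sMAPE variants exist and the abstract does not say which one the paper uses, so the formula, the function name, and the zero-handling rule below are assumptions.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent, using the mean-of-magnitudes denominator.

    Pairs where both values are zero contribute zero error
    (an assumption of this sketch; conventions differ).
    """
    if len(actual) != len(forecast):
        raise ValueError("series must have the same length")
    if not actual:
        raise ValueError("series must be non-empty")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        if denom > 0:
            total += abs(f - a) / denom
    return 100.0 * total / len(actual)

# Made-up numbers for illustration only.
perfect = smape([100.0, 200.0], [100.0, 200.0])
off = smape([100.0], [110.0])
```

Under this definition the metric is symmetric in its two arguments, so swapping actual and forecast values leaves the score unchanged.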

Injecting Structured Knowledge into LLMs via Graph Neural Networks
Zichao Li | Zong Ke | Puning Zhao
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)

Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), but they often struggle to capture explicit linguistic structures and world knowledge. To address this limitation, we propose a hybrid model that integrates LLMs with graph neural networks (GNNs) to inject structured knowledge into NLP tasks. Our approach leverages the strengths of both components: LLMs provide rich contextual representations, while GNNs encode explicit structural priors from sources such as dependency trees, Abstract Meaning Representations (AMRs), and knowledge graphs. We evaluate the hybrid model on a diverse set of tasks, including semantic parsing, multi-hop question answering, text summarization, commonsense reasoning, and dependency parsing. Experimental results demonstrate consistent improvements over both standalone baselines and state-of-the-art methods, with relative gains of up to 2.3% in Exact Match scores for multi-hop QA and 1.7% in accuracy for commonsense reasoning. Ablation studies and sensitivity analyses further highlight the importance of balancing contextual and structural information. By bridging the gap between unstructured textual data and structured knowledge, our work advances the state of the art in NLP and paves the way for more interpretable and robust language models.

2024

Document logical structuring aims to extract the underlying hierarchical structure of documents, which is crucial for document intelligence. Traditional approaches often fall short in handling the complexity and the variability of lengthy documents. To address these issues, we introduce Seg2Act, an end-to-end, generation-based method for document logical structuring, revisiting logical structure extraction as an action generation task. Specifically, given the text segments of a document, Seg2Act iteratively generates the action sequence via a global context-aware generative model, and simultaneously updates its global context and current logical structure based on the generated actions. Experiments on ChCatExt and HierDoc datasets demonstrate the superior performance of Seg2Act in both supervised and transfer learning settings.

2023

f-Divergence Minimization for Sequence-Level Knowledge Distillation
Yuqiao Wen | Zichao Li | Wenyu Du | Lili Mou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an FDISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive step-wise decomposition for our FDISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
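For reference, the standard textbook definition of an f-divergence between a teacher distribution and a student distribution over output sequences can be written as follows. The symbols p (teacher), q_θ (student), and y (output sequence) are notation chosen here for illustration and may differ from the paper's.

```latex
D_f(p \,\|\, q_\theta)
  = \sum_{\mathbf{y}} q_\theta(\mathbf{y}) \,
    f\!\left( \frac{p(\mathbf{y})}{q_\theta(\mathbf{y})} \right),
\qquad f \text{ convex},\; f(1) = 0 .
```

For example, choosing f(t) = t log t recovers the Kullback-Leibler divergence KL(p ‖ q_θ). The sum ranges over all possible output sequences, which is what makes the sequence-level quantity intractable to compute directly.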

Evaluating Dependencies in Fact Editing for Language Models: Specificity and Implication Awareness
Zichao Li | Ines Arous | Siva Reddy | Jackie Cheung
Findings of the Association for Computational Linguistics: EMNLP 2023

The potential of using a large language model (LLM) as a knowledge base (KB) has sparked significant interest. To maintain the knowledge acquired by LLMs, we need to ensure that the editing of learned facts respects internal logical constraints, which are known as dependency of knowledge. Existing work on editing LLMs has partially addressed the issue of dependency, where the editing of a fact should apply to its lexical variations without disrupting irrelevant ones. However, it neglects the dependency between a fact and its logical implications. We propose an evaluation protocol with an accompanying question-answering dataset, StandUp, that provides a comprehensive assessment of the editing process considering the above notions of dependency. Our protocol involves setting up a controlled environment in which we edit facts and monitor their impact on LLMs, along with their implications based on If-Then rules. Extensive experiments on StandUp show that existing knowledge editing methods are sensitive to the surface form of knowledge, and that they have limited performance in inferring the implications of edited facts.

2022

Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment
Zichao Li | Prakhar Sharma | Xing Han Lu | Jackie Cheung | Siva Reddy
Findings of the Association for Computational Linguistics: ACL 2022

Most research on question answering focuses on the pre-deployment stage; i.e., building an accurate model for deployment. In this paper, we ask the question: Can we improve QA systems further post-deployment based on user interactions? We focus on two kinds of improvements: 1) improving the QA system’s performance itself, and 2) providing the model with the ability to explain the correctness or incorrectness of an answer. We collect a retrieval-based QA dataset, FeedbackQA, which contains interactive feedback from users. We collect this dataset by deploying a base QA system to crowdworkers who then engage with the system and provide feedback on the quality of its answers. The feedback contains both structured ratings and unstructured natural language explanations. We train a neural model with this feedback data that can generate explanations and re-score answer candidates. We show that feedback data improves the accuracy not only of the deployed QA system but also of other, stronger non-deployed systems. The generated explanations also help users make informed decisions about the correctness of answers.

Text Revision by On-the-Fly Representation Optimization
Jingjing Li | Zichao Li | Tao Ge | Irwin King | Michael Lyu
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)

Text revision refers to a family of natural language generation tasks in which the source and target sequences share moderate resemblance in surface form but differ in attributes, such as text formality and simplicity. Current state-of-the-art methods formulate these tasks as sequence-to-sequence learning problems, which rely on large-scale parallel training corpora. In this paper, we present an iterative in-place editing approach for text revision, which requires no parallel data. In this approach, we simply fine-tune a pre-trained Transformer with masked language modeling and attribute classification. During inference, the editing at each iteration is realized by two-step span replacement. In the first step, the distributed representation of the text is optimized on the fly towards an attribute function. In the second step, a text span is masked and a new one is proposed conditioned on the optimized representation. Empirical experiments on two typical and important text revision tasks, text formalization and text simplification, show the effectiveness of our approach. It achieves competitive and even better performance than state-of-the-art supervised methods on text simplification, and better performance than strong unsupervised methods on text formalization.

2021

Codewithzichao@DravidianLangTech-EACL2021: Exploring Multilingual Transformers for Offensive Language Identification on Code Mixing Text
Zichao Li
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper describes our solution submitted to the shared task on Offensive Language Identification in Dravidian Languages. We participated in all three offensive language identification tasks. To address the task, we explored multilingual models based on XLM-RoBERTa and multilingual BERT trained on mixed data from three code-mixed languages. In addition, we addressed the class-imbalance problem in the training data through class combination, class weights, and focal loss. Our model achieved weighted average F1 scores of 0.75 (ranked 4th), 0.94 (ranked 4th) and 0.72 (ranked 3rd) in the Tamil-English, Malayalam-English and Kannada-English tasks, respectively.
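To make the class-weight and focal-loss remedies for class imbalance concrete, the sketch below computes a class-weighted focal loss for a single example. It follows the commonly used form, weight × (1 − p_t)^γ × (−log p_t); the function name, the default γ = 2, and the example probabilities are assumptions for illustration, not the configuration used in the paper.

```python
import math

def focal_loss(probs, target, class_weights=None, gamma=2.0):
    """Focal loss for one example.

    probs: predicted class probabilities (expected to sum to 1).
    target: index of the true class.
    class_weights: optional per-class weights (hypothetical values).
    gamma: focusing parameter; gamma = 0 gives weighted cross-entropy.
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    weight = 1.0 if class_weights is None else class_weights[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# Made-up predictions: a confident correct one and a confident wrong one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.1, 0.9], target=0)
plain_ce = focal_loss([0.9, 0.1], target=0, gamma=0.0)
```

The focusing term shrinks the contribution of well-classified examples, so training signal concentrates on hard cases, while the per-class weight can raise the cost of errors on rare classes.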

Codewithzichao@DravidianLangTech-EACL2021: Exploring Multimodal Transformers for Meme Classification in Tamil Language
Zichao Li
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper describes our submission to the shared task on Meme Classification for Tamil Language. To address this task, we explore a multimodal transformer for meme classification in the Tamil language. Because images and text have different characteristics, we use different pretrained models to encode each and obtain better representations of both. In addition, we design a multimodal attention layer based on cross-attention so that the text and its corresponding image interact fully. Our model achieved a weighted average F1 score of 0.55 and ranked first in this task.

BFClass: A Backdoor-free Text Classification Framework
Zichao Li | Dheeraj Mekala | Chengyu Dong | Jingbo Shang
Findings of the Association for Computational Linguistics: EMNLP 2021

A backdoor attack introduces artificial vulnerabilities into a model by poisoning a subset of the training data through injecting triggers and modifying labels. Various trigger design strategies have been explored to attack text classifiers; however, defending against such attacks remains an open problem. In this work, we propose BFClass, a novel, efficient backdoor-free training framework for text classification. The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model. To identify triggers, we utilize this discriminator to locate the most suspicious token from each training sample and then distill a concise set by considering their association strengths with particular labels. To recognize the poisoned subset, we examine the training samples with these identified triggers as the most suspicious token, and check if removing the trigger will change the poisoned model’s prediction. Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% of poisoned training samples with very limited false alarms, and achieve almost the same performance as models trained on the benign training data.

2019

EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing
Yue Dong | Zichao Li | Mehdi Rezagholizadeh | Jackie Chi Kit Cheung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the first sentence simplification model that learns explicit edit operations (ADD, DELETE, and KEEP) via a neural programmer-interpreter approach. Most current neural sentence simplification systems are variants of sequence-to-sequence models adopted from machine translation. These methods learn to simplify sentences as a byproduct of the fact that they are trained on complex-simple sentence pairs. By contrast, our neural programmer-interpreter is directly trained to predict explicit edit operations on targeted parts of the input sentence, resembling the way that humans perform simplification and revision. Our model outperforms previous state-of-the-art neural sentence simplification models (without external knowledge) by large margins on three benchmark text simplification corpora in terms of SARI (+0.95 WikiLarge, +1.89 WikiSmall, +1.41 Newsela), and is judged by humans to produce overall better and simpler output sentences.
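The explicit edit operations named above (ADD, DELETE, KEEP) can be pictured with a tiny interpreter that executes an edit program over a tokenized sentence. This is only a sketch of how such a program could be applied once predicted; the function name, the tuple encoding of operations, and the example sentence are assumptions, and the neural model that predicts the operations is not shown.

```python
def apply_edits(source_tokens, program):
    """Run an edit program over a tokenized sentence.

    program is a list of operations:
      ("KEEP",)      copy the current source token and advance
      ("DELETE",)    skip the current source token and advance
      ("ADD", word)  emit `word` without advancing
    Unconsumed source tokens are dropped (an assumption of this sketch).
    """
    output = []
    position = 0
    for op in program:
        name = op[0]
        if name == "ADD":
            output.append(op[1])
        elif name in ("KEEP", "DELETE"):
            if position >= len(source_tokens):
                raise ValueError("edit program ran past the end of the source")
            if name == "KEEP":
                output.append(source_tokens[position])
            position += 1
        else:
            raise ValueError(f"unknown operation: {name}")
    return output

# Made-up example: replace a rarer word with a simpler one.
source = "the feline sat".split()
program = [("KEEP",), ("DELETE",), ("ADD", "cat"), ("KEEP",)]
simplified = apply_edits(source, program)
```

Because each output token is traceable to a specific operation, an edit program of this kind is easier to inspect than the output of a plain sequence-to-sequence decoder.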

Decomposable Neural Paraphrase Generation
Zichao Li | Xin Jiang | Lifeng Shang | Qun Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Paraphrasing exists at different granularity levels, such as lexical level, phrasal level and sentential level. This paper presents Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity in a disentangled way. Specifically, the model is composed of multiple encoders and decoders with different structures, each of which corresponds to a specific granularity. The empirical study shows that the decomposition mechanism of DNPG makes paraphrase generation more interpretable and controllable. Based on DNPG, we further develop an unsupervised domain adaptation method for paraphrase generation. Experimental results show that the proposed model achieves competitive in-domain performance compared to state-of-the-art neural models, and significantly better performance when adapting to a new domain.

2018

Paraphrase Generation with Deep Reinforcement Learning
Zichao Li | Xin Jiang | Lifeng Shang | Hang Li
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Automatic generation of paraphrases from a given sentence is an important yet challenging task in natural language processing (NLP). In this paper, we present a deep reinforcement learning approach to paraphrase generation. Specifically, we propose a new framework for the task, which consists of a generator and an evaluator, both of which are learned from data. The generator, built as a sequence-to-sequence learning model, can produce paraphrases given a sentence. The evaluator, constructed as a deep matching model, can judge whether two sentences are paraphrases of each other. The generator is first trained by deep learning and then further fine-tuned by reinforcement learning in which the reward is given by the evaluator. For the learning of the evaluator, we propose two methods based on supervised learning and inverse reinforcement learning respectively, depending on the type of available training data. Experimental results on two datasets demonstrate the proposed models (the generators) can produce more accurate paraphrases and outperform the state-of-the-art methods in paraphrase generation in both automatic evaluation and human evaluation.