Qun Liu - ACL Anthology

Qun Liu

2025

Mixture of insighTful Experts (MoTE): The Synergy of Reasoning Chains and Expert Mixtures in Self-Alignment
Zhili Liu | Yunhao Gou | Kai Chen | Lanqing Hong | Jiahui Gao | Fei Mi | Yu Zhang | Zhenguo Li | Xin Jiang | Qun Liu | James Kwok
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment.In this work, we address a fundamental question:How to effectively incorporate reasoning abilitiesand MoE architectures into self-alignment processin LLMs?We propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignments.From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI’s state-of-the-art o1 model.

Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Yufei Wang | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: NAACL 2025

Current research found the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in confidence during the model’s reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by the reasoning steps required. Furthermore, by analyzing patterns in confidence change, we examine the correctness of the model’s reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model’s reasoning.

REIC: RAG-Enhanced Intent Classification at Scale
Ziji Zhang | Michael Yang | Zhiyu Chen | Yingying Zhuang | Shu-Ting Pi | Qun Liu | Rajashekar Maragoud | Vy Nguyen | Anurag Beniwal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Accurate intent classification is critical for efficient routing in customer service, ensuring customers are connected with the most suitable agents while reducing handling times and operational costs. However, as companies expand their product lines, intent classification faces scalability challenges due to the increasing number of intents and variations in taxonomy across different verticals. In this paper, we introduce REIC, a Retrieval-augmented generation Enhanced Intent Classification approach, which addresses these challenges effectively. REIC leverages retrieval-augmented generation (RAG) to dynamically incorporate relevant knowledge, enabling precise classification without the need for frequent retraining. Through extensive experiments on real-world datasets, we demonstrate that REIC outperforms traditional fine-tuning, zero-shot, and few-shot methods in large-scale customer service settings. Our results highlight its effectiveness in both in-domain and out-of-domain scenarios, demonstrating its potential for real-world deployment in adaptive and large-scale intent classification systems.

Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
Chengwu Liu | Ye Yuan | Yichun Yin | Yan Xu | Xin Xu | Zaoyu Chen | Yasheng Wang | Lifeng Shang | Qun Liu | Ming Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.

ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.

Recent advancements in model architectures and length extrapolation techniques have significantly extended the context length of large language models (LLMs), paving the way for their application in increasingly complex tasks. However, despite the growing capabilities of long-context LLMs, the safety issues in long-context scenarios remain underexplored. While safety alignment in short context has been widely studied, the safety concerns of long-context LLMs have not been adequately addressed. In this work, we introduce ${textbf{LongSafety}}$, a comprehensive safety alignment dataset for long-context LLMs, containing 10 tasks and 17k samples, with an average length of 40.9k tokens. Our experiments demonstrate that training with LongSafety can enhance long-context safety performance while enhancing short-context safety and preserving general capabilities. Furthermore, we demonstrate that long-context safety does not equal long-context alignment with short-context safety data and LongSafety has generalizing capabilities in context length and long-context safety scenarios.

Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs’ Reasoning
Zezhong Wang | Xingshan Zeng | Weiwen Liu | Yufei Wang | Liangyou Li | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Mathematical reasoning through Chain-of-Thought (CoT) has emerged as a powerful capability of Large Language Models (LLMs), which can be further enhanced through Test-Time Scaling (TTS) methods like Beam Search and DVTS. However, these methods, despite improving accuracy by allocating more computational resources during inference, often suffer from path homogenization and inefficient use of intermediate results. To address these limitations, we propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps. It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making. Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results. Experimental results show that SRCA improves reasoning accuracy compared to existing TTS methods across various mathematical datasets.

Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing
Kaishuai Xu | Tiezheng Yu | Wenjun Hou | Yi Cheng | Chak Tou Leong | Liangyou Li | Xin Jiang | Lifeng Shang | Qun Liu | Wenjie Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs’ full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.

Hierarchical Memory Organization for Wikipedia Generation
Eugene J. Yu | Dawei Zhu | Yifan Song | Xiangyu Wong | Jiebin Zhang | Wenxuan Shi | Xiaoguang Li | Qun Liu | Sujian Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

LLM-Based Dialogue Labeling for Multiturn Adaptive RAG
Zhiyu Chen | Biancen Xie | Sidarth Srinivasan | Manikandarajan Ramanathan | Rajashekar Maragoud | Qun Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Customer service often relies on human agents, which, while effective, can be costly and slower to scale. Recent advancements in intelligent chatbots, particularly Retrieval-Augmented Generation (RAG) models, have significantly enhanced efficiency by integrating large language models with external knowledge retrieval. However, developing a multi-turn RAG-based chatbot for real-world customer service presents additional complexities, requiring components like adaptive retrieval and query reformulation. These components typically require substantial annotated data, which is often scarce. To overcome this limitation, we propose methods to automatically generate labels for these components using real customer-agent dialogue data. Specifically, we introduce two labeling strategies for adaptive retrieval: an intent-guided strategy and an explanation-based strategy, along with two query reformulation strategies: natural language query reformulation and keyword-based reformulation. Our experiments reveal that the explanation-based strategy yields the best results for adaptive retrieval, while the keyword-based reformulation improves document retrieval quality.Our findings offer valuable insights for practitioners working on multi-turn RAG systems.

Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu | Tiezheng Yu | Yi Cheng | Wenjun Hou | Liangyou Li | Xin Jiang | Lifeng Shang | Qun Liu | Wenjie Li
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.

Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning
Yunhao Gou | Hansi Yang | Zhili Liu | Kai Chen | Yihan Zeng | Lanqing Hong | Zhenguo Li | Qun Liu | Bo Han | James Kwok | Yu Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Visual Instruction Tuning (VIT) aims to enhance Multimodal Large Language Models (MLLMs), yet its effectiveness is often compromised by corrupted datasets with issues such as hallucinated content, incorrect responses, and poor OCR quality. Previous approaches to address these challenges have focused on refining datasets through high-quality data collection or rule-based filtering that can be costly or limited in scope. In this paper, we conduct a systematic investigation into the impact of corrupted data on MLLMs and discover that, although corrupted data degrade model performance, such adverse effects are largely reversible, and MLLMs are corrupted but not broken. Specifically, we find that disabling a small subset of parameters can almost fully restore performance. Moreover, corrupted MLLMs inherently possess the capability to differentiate between clean and corrupted samples, facilitating dataset cleaning without external intervention. Building on these insights, we introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang | Dawei Zhu | Yifan Song | Wenhao Wu | Chuqiao Kuang | Xiaoguang Li | Lifeng Shang | Qun Liu | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimensions separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision—a strategy we term quantized pruning—can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code isavailable at https://github.com/zhzihao/QPruningKV.

Efficient Inference for Large Language Models –Algorithm, Model, and System
Xuefei Ning | Guohao Dai | Haoli Bai | Lu Hou | Yu Wang | Qun Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The inference of LLMs incurs high computational costs, memory access overhead, and memory usage, leading to inefficiencies in terms of latency, throughput, power consumption, and storage. To this end, this tutorial focuses on the increasingly important topic of Efficient Inference for LLMs and aims to provide a systematic understanding of key facts and methodologies from a designer’s perspective. We start by introducing the basic concepts of modern LLMs, software and hardware. Following this, we define the efficiency optimization problem. To equip the audience with a designer’s mindset, we briefly explain how to diagnose efficiency bottlenecks for a given workload on specific hardware. After introducing the basics, we will introduce our full-stack taxonomy of efficient inference methods for LLMs. We will walk through each category of methodology, using one to three representative methods as examples for each leaf subcategory, elaborating on the design logic behind each method and which inefficiency factors they primarily address. Finally, we will wrap up with a takeaway summary, and future research directions. The tutorial website is https://haolibai.github.io/emnlp-2025-tutorial-efficiency/.

WIKIGENBENCH:Exploring Full-length Wikipedia Generation under Real-World Scenario
Jiebin Zhang | Eugene J. Yu | Qinyu Chen | Chenhao Xiong | Dawei Zhu | Han Qian | Mingbo Song | Weimin Xiong | Xiaoguang Li | Qun Liu | Sujian Li
Proceedings of the 31st International Conference on Computational Linguistics

It presents significant challenges to generate comprehensive and accurate Wikipedia articles for newly emerging events under real-world scenario. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.

2024

Don’t Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention
Shu-Ting Pi | Pradeep Bagavan | Yejia Li | Disha | Qun Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model’s ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.

Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions
Xuming Hu | Xiaochuan Li | Junzhe Chen | Yinghui Li | Yangning Li | Xiaoguang Li | Yasheng Wang | Qun Liu | Lijie Wen | Philip Yu | Zhijiang Guo
Findings of the Association for Computational Linguistics: ACL 2024

Generative search engines have the potential to transform how people seek information online, but generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in the realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment. The dataset and code will be publicly available.

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Shijue Huang | Wanjun Zhong | Jianqiao Lu | Qi Zhu | Jiahui Gao | Weiwen Liu | Yutai Hou | Xingshan Zeng | Yasheng Wang | Lifeng Shang | Xin Jiang | Ruifeng Xu | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2024

The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs’ ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.

MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models
Wai-Chung Kwan | Xingshan Zeng | Yuxin Jiang | Yufei Wang | Liangyou Li | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) are increasingly used for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks mainly focus on single-turn evaluations, overlooking the models’ capabilities in multi-turn interactions. To address this gap, we introduce , a comprehensive benchmark to evaluate the multi-turn conversational abilities of LLMs. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or creating new examples using GPT-4 with a human-in-the-loop process to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 10 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models’ fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance.

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
Wai-Chung Kwan | Xingshan Zeng | Yufei Wang | Yusen Sun | Liangyou Li | Yuxin Jiang | Lifeng Shang | Qun Liu | Kam-Fai Wong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Managing long sequences has become an important and necessary feature for large language models (LLMs). However, assessing their ability to handle long contexts remains a challenge. This paper introduces M⁴LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. It encompasses 36 NLP datasets, covering 11 types of tasks and 12 domains, providing a comprehensive test bed. To address the lack of tasks featuring naturally long sequences, we propose an automatic approach to convert short-sequence tasks into long-sequence scenarios. These scenarios evaluate LLMs’ long-context understanding across five key abilities: understanding of single or multiple relevant spans in long contexts based on explicit or semantic hints, and global context understanding. This automatic approach allows us to create instances evenly distributed from 1k to 8k input length. Our evaluation of 11 prominent LLMs reveals that 1) Current LLMs struggle to understand long context, particularly when tasks require multiple-span attention. 2) Semantic retrieval is more difficult for competent LLMs. 3) Models fine-tuned on longer text with position interpolation have comparable performance to those using Neural Tangent Kernel (NTK) aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area.

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Yuxin Jiang | Yufei Wang | Xingshan Zeng | Wanjun Zhong | Liangyou Li | Fei Mi | Lifeng Shang | Xin Jiang | Qun Liu | Wei Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response follows constraints stated in the instruction. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. To assess whether LLMs’ outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions. By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.

The emerging citation-based QA systems are gaining more attention especially in generative AI search applications. The importance of extracted knowledge provided to these systems is vital from both accuracy (completeness of information) and efficiency (extracting the information in a timely manner). In this regard, citation-based QA systems are suffering from two shortcomings. First, they usually rely only on web as a source of extracted knowledge and adding other external knowledge sources can hamper the efficiency of the system. Second, web-retrieved contents are usually obtained by some simple heuristics such as fixed length or breakpoints which might lead to splitting information into pieces. To mitigate these issues, we propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system. This has been done through designing an adaptive web retriever and incorporating KGs triples in an efficient manner. We demonstrate the effectiveness of over the open-source state-of-the-art (SoTA) web-based and KG baseline models using a comprehensive set of quantitative and human evaluation experiments. Our model is able to: first, improve the web-retriever baseline in terms of extracting more relevant passages (>20%), the coverage of answer span (>25%) and self containment (>35%); second, obtain and integrate KG triples into its pipeline very efficiently (by avoiding any LLM calls) to outperform the web-only and KG-only SoTA baselines significantly in 7 quantitative QA tasks and our human evaluation.

Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk
Zhiyuan Zeng | Qipeng Guo | Xiaoran Liu | Zhangyue Yin | Wentao Shu | Mianqiu Huang | Bo Wang | Yunhua Zhou | Linlin Li | Qun Liu | Xipeng Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The evolution of Large Language Models (LLMs) has led to significant advancements, with models like Claude and Gemini capable of processing contexts up to 1 million tokens. However, efficiently handling long sequences remains challenging, particularly during the prefilling stage when input lengths exceed GPU memory capacity. Traditional methods often segment sequence into chunks and compress them iteratively with fixed-size memory. However, our empirical analysis shows that the fixed-size memory results in wasted computational and GPU memory resources. Therefore, we introduces Incremental Memory (IM), a method that starts with a small memory size and gradually increases it, optimizing computational efficiency. Additionally, we propose Decremental Chunk based on Incremental Memory (IMDC), which reduces chunk size while increasing memory size, ensuring stable and lower GPU memory usage. Our experiments demonstrate that IMDC is consistently faster (1.45x) and reduces GPU memory consumption by 23.3% compared to fixed-size memory, achieving comparable performance on the LongBench Benchmark.

CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search
Fengran Mo | Abbas Ghaddar | Kelong Mao | Mehdi Rezagholizadeh | Boxing Chen | Qun Liu | Jian-Yun Nie
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in conversational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at https://github.com/fengranMark/CHIQ.

“Knowing When You Don’t Know”: A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation
Nandan Thakur | Luiz Bonifacio | Crystina Zhang | Odunayo Ogundepo | Ehsan Kamalloo | David Alfonso-Hermelo | Xiaoguang Li | Qun Liu | Boxing Chen | Mehdi Rezagholizadeh | Jimmy Lin
Findings of the Association for Computational Linguistics: EMNLP 2024

Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish **NoMIRACL**, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least a single judged relevant passage. We measure relevance assessment using: (i) *hallucination rate*, measuring model tendency to hallucinate when the answer is not present in passages in the non-relevant subset, and (ii) *error rate*, measuring model inaccuracy to recognize relevant passages in the relevant subset. In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can achieve up to a 74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work necessary to improve LLM robustness. NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.

Prompt-Based Length Controlled Generation with Multiple Control Types
Renlong Jie | Xiaojun Meng | Lifeng Shang | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) have attracted great attention given their strong performance on a wide range of NLP tasks. In practice, users often expect generated texts to fall within a specific length range, making length controlled generation an important topic, especially for GPT-style models. Existing length control methods mostly focus on a simple control type of “equal to” a target length. Different from them, we propose a prompt-based method to achieve length controlled generation under different control types with high accuracy. In particular, we adopt reinforcement learning (RL) and sample filtering with the reward signal given by rule-based reward models, which enhances the length control ability of models by rewarding outputs that follow certain control instructions. In addition, we introduce a standard prompt extractor to parse arbitrary users’ input into standard control instructions. Experiments show that our method significantly improves the accuracy of prompt-based length control on popular summarization datasets like CNNDM and NYT under multiple control types. Moreover, both the standard prompt extractor and RL-tuned model show strong generalization to unseen control prompt templates.

Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Zhiming Mao | Haoli Bai | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.

Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)
Yvette Graham | Qun Liu | Gerasimos Lampouras | Ignacio Iacobacci | Sinead Madden | Haider Khalid | Rameez Qureshi
Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)

LFED: A Literary Fiction Evaluation Dataset for Large Language Models
Linhao Yu | Qun Liu | Deyi Xiong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The rapid evolution of large language models (LLMs) has ushered in the need for comprehensive assessments of their performance across various dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs on the long fiction comprehension and reasoning. We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations. Through a series of experiments involving various state-of-the-art LLMs, our findings reveal that these models face considerable challenges in effectively addressing questions related to literary fictions, with ChatGPT reaching only 57.08% under the zero-shot setting. The dataset will be publicly available at https://github.com/tjunlp-lab/LFED.git.

ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
Haochen Tan | Zhijiang Guo | Zhan Shi | Lu Xu | Zhili Liu | Yunlong Feng | Xiaoguang Li | Yasheng Wang | Lifeng Shang | Qun Liu | Linqi Song
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have succeeded remarkably in understanding long-form contents. However, exploring their capability for generating long-form contents, such as reports and articles, has been relatively unexplored and inadequately assessed by existing benchmarks. The prevalent evaluation methods, which predominantly rely on crowdsourcing, are recognized for their labor-intensive nature and lack of efficiency, whereas automated metrics, such as the ROUGE score, demonstrate discordance with human judgment criteria. In this paper, we propose ProxyQA, an innovative framework dedicated to assessing long-text generation. ProxyQA comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. LLMs are tasked to generate extensive content in response to these meta-questions, by engaging an evaluator and incorporating the generated texts as contextual background, ProxyQA assesses the generated content’s quality through the evaluator’s accuracy in addressing the proxy-questions. We examine multiple LLMs, emphasizing ProxyQA’s demanding nature as a high-quality assessment tool. Human evaluation demonstrates that the proxy-question method is notably self-consistent and aligns closely with human evaluative standards. The dataset and leaderboard is available at https://proxy-qa.com.

Findings of the First Workshop on Simulating Conversational Intelligence in Chat
Yvette Graham | Mohammed Rameez Qureshi | Haider Khalid | Gerasimos Lampouras | Ignacio Iacobacci | Qun Liu
Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)

The aim of this workshop is to bring together experts working on open-domain dialogue research. In this speedily advancing research area many challenges still exist, such as learning information from conversations, engaging in realistic and convincing simulation of human intelligence and reasoning. SCI-CHAT follows previous workshops on open domain dialogue but with a focus on the simulation of intelligent conversation as judged in a live human evaluation. Models aim to include the ability to follow a challenging topic over a multi-turn conversation, while positing, refuting and reasoning over arguments. The workshop included both a research track and shared task. The main goal of this paper is to provide an overview of the shared task and a link to an additional paper that will include an in depth analysis of the shared task results following presentation at the workshop.

Learning to Edit: Aligning LLMs with Knowledge Editing
Yuxin Jiang | Yufei Wang | Chuhan Wu | Wanjun Zhong | Xingshan Zeng | Jiahui Gao | Liangyou Li | Xin Jiang | Lifeng Shang | Ruiming Tang | Qun Liu | Wei Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (LLMs) without negatively impacting performance across other inputs, have garnered widespread attention. However, existing methods predominantly rely on memorizing the updated knowledge, impeding LLMs from effectively combining the new knowledge with their inherent knowledge when answering questions. To this end, we propose a Learning to Edit (LTE) framework, focusing on teaching LLMs to apply updated knowledge into input questions, inspired by the philosophy of “Teach a man to fish.” LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the Inference Phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. By comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two LLM architectures, we demonstrate LTE’s superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. The data and code are publicly available at https://github.com/YJiangcm/LTE.

2023

DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Pei Ke | Fei Huang | Fei Mi | Yasheng Wang | Qun Liu | Xiaoyan Zhu | Minlie Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability. Specifically, most of the well-performed metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance in untrained metrics for evaluating text summarization and dialogue generation, which also exhibits strong dimension-level / task-level generalization ability and interpretability.

AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models
Siheng Li | Cheng Yang | Yichun Yin | Xinyu Zhu | Zesen Cheng | Lifeng Shang | Xin Jiang | Qun Liu | Yujiu Yang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Information-seeking conversation, which aims to help users gather information through conversation, has achieved great progress in recent years. However, the research is still stymied by the scarcity of training data. To alleviate this problem, we propose AutoConv for synthetic conversation generation, which takes advantage of the few-shot learning ability and generation capacity of large language models (LLM). Specifically, we formulate the conversation generation problem as a language modeling task, then finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process and use it for generating synthetic conversations with high quality. Experimental results on two frequently-used datasets verify that AutoConv has substantial improvements over strong baselines and alleviates the dependence on human annotation. In addition, we also provide several analysis studies to promote future research.

Gradually Excavating External Knowledge for Implicit Complex Question Answering
Chang Liu | Xiaoguang Li | Lifeng Shang | Xin Jiang | Qun Liu | Edmund Lam | Ngai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution due to the reasons of: 1) uncovered or out-of-date domain knowledge, 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire extrinsic information, then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% parameters of its competitors, setting new SOTA in the ~10B LLM class.

Sub-Character Tokenization for Chinese Pretrained Language Models
Chenglei Si | Zhengyan Zhang | Yingfa Chen | Fanchao Qi | Xiaozhi Wang | Zhiyuan Liu | Yasheng Wang | Qun Liu | Maosong Sun
Transactions of the Association for Computational Linguistics, Volume 11

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.

MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions
Hao Sun | Zhexin Zhang | Fei Mi | Yasheng Wang | Wei Liu | Jianwei Cui | Bin Wang | Qun Liu | Minlie Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Morality in dialogue systems has raised great attention in research recently. A moral dialogue system aligned with users’ values could enhance conversation engagement and user connections. In this paper, we propose a framework, MoralDial to train and evaluate moral dialogue systems. In our framework, we first explore the communication mechanisms of morality and resolve expressed morality into three parts, which indicate the roadmap for building a moral dialogue system. Based on that, we design a simple yet effective method: constructing moral discussions between simulated specific users and the dialogue system. The constructed discussions consist of expressing, explaining, revising, and inferring moral views in dialogue exchanges, which makes conversational models learn morality well in a natural manner. Furthermore, we propose a novel evaluation method under the framework. We evaluate the multiple aspects of morality by judging the relation between dialogue responses and human values in discussions, where the multifaceted nature of morality is particularly considered. Automatic and manual experiments demonstrate that our framework is promising to train and evaluate moral dialogue systems.

HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
Zhengkun Zhang | Wenya Guo | Xiaojun Meng | Yasheng Wang | Yadao Wang | Xin Jiang | Qun Liu | Zhenglu Yang
Findings of the Association for Computational Linguistics: ACL 2023

With the scale and capacity of pretrained models growing rapidly, parameter-efficient language model tuning has emerged as a popular paradigm for solving various NLP and Vision-and-Language (V&L) tasks. In this paper, we design a unified parameter-efficient multitask learning framework that works effectively on both NLP and V&L tasks. In particular, we use a shared hypernetwork that takes trainable hyper-embeddings and visual modality as input, and outputs weights for different modules in a pretrained language model, such as the parameters inserted into multi-head attention blocks (i.e., prefix-tuning) and feed-forward blocks (i.e., adapter-tuning.). Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performances and transfer ability compared to state-of-the-art methods. Empirical results on the GLUE benchmark and multiple V&L tasks confirm the effectiveness of our framework.

Structured Pruning for Efficient Generative Pre-trained Language Models
Chaofan Tao | Lu Hou | Haoli Bai | Jiansheng Wei | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Findings of the Association for Computational Linguistics: ACL 2023

The increasing sizes of large generative Pre-trained Language Models (PLMs) hinder their deploymentin real-world applications. To obtain efficient PLMs, previous studies mostly focus on pruning the attention heads and feed-forward networks (FFNs) of the Transformer. Nevertheless, we find that in generative PLMs, the hidden dimension shared by many other modules (e.g., embedding layer and layer normalization) contains persistent outliers regardless of the network input. This study comprehensively investigates the structured pruning of generative PLMs with all the above compressible components. To identify redundant network structures, we assign learnable masks over compressible components followed by sparse training. Various sizes of PLMs can be flexibly extracted via different thresholds, and are then task-specifically fine-tuned for further improvement. Extensive experiments on language modeling, summarization and machine translation validate the effectiveness of the proposed method. For example, the pruned BART brings 1.51x/6.96x inference speedup on GPU/CPU with 67% size reduction, and can be further combined with quantization for more than 25× compression.

NewsDialogues: Towards Proactive News Grounded Conversation
Siheng Li | Yichun Yin | Cheng Yang | Wangjie Jiang | Yiwei Li | Zesen Cheng | Lifeng Shang | Xin Jiang | Qun Liu | Yujiu Yang
Findings of the Association for Computational Linguistics: ACL 2023

Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset NewsDialogues, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker for the ranking of multiple responses to alleviate the exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research.

AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation
Xingshan Zeng | Liangyou Li | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2023

To alleviate the data scarcity problem in End-to-end speech translation (ST), pre-training on data for speech recognition and machine translation is considered as an important technique. However, the modality gap between speech and text prevents the ST model from efficiently inheriting knowledge from the pre-trained models. In this work, we propose AdaTranS for end-to-end ST. It adapts the speech features with a new shrinking mechanism to mitigate the length mismatch between speech and text features by predicting word boundaries. Experiments on the MUST-C dataset demonstrate that AdaTranS achieves better performance than the other shrinking-based methods, with higher inference speed and lower memory usage. Further experiments also show that AdaTranS can be equipped with additional alignment losses to further improve performance.

One Cannot Stand for Everyone! Leveraging Multiple User Simulators to train Task-oriented Dialogue Systems
Yajiao Liu | Xin Jiang | Yichun Yin | Yasheng Wang | Fei Mi | Qun Liu | Xiang Wan | Benyou Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

User simulators are agents designed to imitate human users; recent advances have found that Task-oriented Dialogue (ToD) systems optimized toward a user simulator could better satisfy the need of human users. However, this might result in a sub-optimal ToD system if it is tailored to only one ad hoc user simulator, since human users can behave differently. In this paper, we propose a framework called MUST to optimize ToD systems via leveraging Multiple User SimulaTors. The main challenges of implementing MUST fall in 1) how to adaptively determine which user simulator to interact with the ToD system at each optimization step, since the ToD system might be over-fitted to some specific user simulators, and simultaneously under-fitted to some others; 2) how to avoid catastrophic forgetting of the adaption for a simulator that is not selected for several consecutive optimization steps.To tackle these challenges, we formulate MUST as a Multi-armed bandits (MAB) problem and provide a method called MUST_adaptive that balances i) the boosting adaption for adaptive interactions between different user simulators and the ToD system andii) the uniform adaption to avoid the catastrophic forgetting issue.With both automatic evaluations and human evaluations, our experimental results on MultiWOZ show that the dialogue system trained by MUST achieves a better performance than those trained by a single user simulator. It also has a better generalization ability when testing with unseen user simulators.

MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
Xinyu Zhang | Nandan Thakur | Odunayo Ogundepo | Ehsan Kamalloo | David Alfonso-Hermelo | Xiaoguang Li | Qun Liu | Mehdi Rezagholizadeh | Jimmy Lin
Transactions of the Association for Computational Linguistics, Volume 11

MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.

mCLIP: Multilingual CLIP via Cross-lingual Transfer
Guanhua Chen | Lu Hou | Yun Chen | Wenliang Dai | Lifeng Shang | Xin Jiang | Qun Liu | Jia Pan | Wenping Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable performance on various downstream cross-modal tasks. However, they are usually biased towards English due to the lack of sufficient non-English image-text pairs. Existing multilingual VLP methods often learn retrieval-inefficient single-stream models by translation-augmented non-English image-text pairs. In this paper, we introduce mCLIP, a retrieval-efficient dual-stream multilingual VLP model, trained by aligning the CLIP model and a Multilingual Text Encoder (MTE) through a novel Triangle Cross-modal Knowledge Distillation (TriKD) method. It is parameter-efficient as only two light projectors on the top of them are updated during distillation. Furthermore, to enhance the token- and sentence-level multilingual representation of the MTE, we propose to train it with machine translation and contrastive learning jointly before the TriKD to provide a better initialization. Empirical results show that mCLIP achieves new state-of-the-art performance for both zero-shot and finetuned multilingual image-text retrieval task.

Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of the recent successful generative language models. However, current ATP benchmarks are mainly focus on symbolic inference, but rarely involve the understanding of complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proof but also evaluates a generative LM’s reasoning ability on formulas and capability to manipulate, group, and factor number terms. We gather trigonometric expressions and their reduced forms from web, annotate the simplification process manually, and translate it into the “Lean” formal language system. We then automatically generate additional examples from the annotated samples to expand the dataset. Furthermore, we also create three automatically generated training and testing datasets of varying difficulty and distributions. Our extensive experiments show our proposed TRIGO poses a new challenge for advanced generative LM’s including GPT-4 which is pre-trained on a considerable amount of open-source formal theorem-proving language data, and provide a new tool to study the generative LM’s ability on both formal and mathematical reasoning.

Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment
Boyang Xue | Weichao Wang | Hongru Wang | Fei Mi | Rui Wang | Yasheng Wang | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Pretrained language models (PLMs) based knowledge-grounded dialogue systems are prone to generate responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external factual knowledge they rely upon. Inspired by previous work which identified that feedforward networks (FFNs) within Transformers are responsible for factual knowledge expressions, we investigate two methods to efficiently improve the factual expression capability of FFNs by knowledge enhancement and alignment respectively. We first propose K-Dial, which explicitly introduces extended FFNs in Transformers to enhance factual knowledge expressions given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs’ expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations including sophisticated fine-grained NLI-based metrics. Experimental results on WoW and CMU_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating the efficacy of improving factual consistency for knowledge-grounded dialogue systems.

Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Haoli Bai | Zhiguang Liu | Xiaojun Meng | Li Wentao | Shuang Liu | Yifeng Luo | Nian Xie | Rongfu Zheng | Liangwei Wang | Lu Hou | Jiansheng Wei | Xin Jiang | Qun Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, which can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader brings superior performance on various VDU tasks in both English and Chinese. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.

SongRewriter: A Chinese Song Rewriting System with Controllable Content and Rhyme Scheme
Yusen Sun | Liangyou Li | Qun Liu | Dit-Yan Yeung
Findings of the Association for Computational Linguistics: ACL 2023

Although lyrics generation has achieved significant progress in recent years, it has limited practical applications because the generated lyrics cannot be performed without composing compatible melodies. In this work, we bridge this practical gap by proposing a song rewriting system which rewrites the lyrics of an existing song such that the generated lyrics are compatible with the rhythm of the existing melody and thus singable. In particular, we propose SongRewriter, a controllable Chinese lyric generation and editing system which assists users without prior knowledge of melody composition. The system is trained by a randomized multi-level masking strategy which produces a unified model for generating entirely new lyrics or editing a few fragments. To improve the controllabiliy of the generation process, we further incorporate a keyword prompt to control the lexical choices of the content and propose novel decoding constraints and a vowel modeling task to enable flexible end and internal rhyme schemes. While prior rhyming metrics are mainly for rap lyrics, we propose three novel rhyming evaluation metrics for song lyrics. Both automatic and human evaluations show that the proposed model performs better than the state-of-the-art models in both contents and rhyming quality.

2022

LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Dongsheng Chen | Chaofan Tao | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the proposed LiteVL even outperforms previous video-language pre-trained models by a clear margin, though without any video-language pre-training.

Read before Generate! Faithful Long Form Question Answering with Machine Reading
Dan Su | Xiaoguang Li | Jindi Zhang | Lifeng Shang | Xin Jiang | Qun Liu | Pascale Fung
Findings of the Association for Computational Linguistics: ACL 2022

Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question. While current work on LFQA using large pre-trained model for generation are effective at producing fluent and somewhat relevant content, one primary challenge lies in how to generate a faithful answer that has less hallucinated content. We propose a new end-to-end framework that jointly models answer generation and machine reading. The key idea is to augment the generation model with fine-grained, answer-related salient information which can be viewed as an emphasis on faithful facts. State-of-the-art results on two LFQA datasets, ELI5 and MS MARCO, demonstrate the effectiveness of our method, in comparison with strong baselines on automatic and human evaluation metrics. A detailed analysis further proves the competency of our methods in generating fluent, relevant, and more faithful answers.

UniDS: A Unified Dialogue System for Chit-Chat and Task-oriented Dialogues
Xinyan Zhao | Bin He | Yasheng Wang | Yitong Li | Fei Mi | Yajiao Liu | Xin Jiang | Qun Liu | Huanhuan Chen
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

With the advances in deep learning, tremendous progress has been made with chit-chat dialogue systems and task-oriented dialogue systems. However, these two systems are often tackled separately in current methods. To achieve more natural interaction with humans, dialogue systems need to be capable of both chatting and accomplishing tasks. To this end, we propose a unified dialogue system (UniDS) with the two aforementioned skills. In particular, we design a unified dialogue data schema, compatible for both chit-chat and task-oriented dialogues. Besides, we propose a two-stage training method to train UniDS based on the unified dialogue data schema. UniDS does not need to adding extra parameters to existing chit-chat dialogue systems. Experimental results demonstrate that the proposed UniDS works comparably well as the state-of-the-art chit-chat dialogue systems and task-oriented dialogue systems. More importantly, UniDS achieves better robustness than pure dialogue systems and satisfactory switch ability between two types of dialogues.

Compilable Neural Code Generation with Compiler Feedback
Xin Wang | Yasheng Wang | Yao Wan | Fei Mi | Yitong Li | Pingyi Zhou | Jin Liu | Hao Wu | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2022

Automatically generating compilable programs with (or without) natural language descriptions has always been a touchstone problem for computational linguistics and automated software engineering. Existing deep-learning approaches model code generation as text generation, either constrained by grammar structures in decoder, or driven by pre-trained language models on large-scale code corpus (e.g., CodeGPT, PLBART, and CodeT5). However, few of them account for compilability of the generated programs. To improve compilability of the generated programs, this paper proposes COMPCODER, a three-stage pipeline utilizing compiler feedback for compilable code generation, including language model fine-tuning, compilability reinforcement, and compilability discrimination. Comprehensive experiments on two code generation tasks demonstrate the effectiveness of our proposed approach, improving the success rate of compilation from 44.18 to 89.18 in code completion on average and from 70.3 to 96.2 in text-to-code generation, respectively, when comparing with the state-of-the-art CodeGPT.

Pre-training Language Models with Deterministic Factual Knowledge
Shaobo Li | Xiaoguang Li | Lifeng Shang | Chengjie Sun | Bingquan Liu | Zhenzhou Ji | Xin Jiang | Qun Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Previous works show that Pre-trained Language Models (PLMs) can capture factual knowledge. However, some analyses reveal that PLMs fail to perform it robustly, e.g., being sensitive to the changes of prompts when extracting factual knowledge. To mitigate this issue, we propose to let PLMs learn the deterministic relationship between the remaining context and the masked content. The deterministic relationship ensures that the masked factual content can be deterministically inferable based on the existing clues in the context. That would provide more stable patterns for PLMs to capture factual knowledge than randomly masking. Two pre-training tasks are further introduced to motivate PLMs to rely on the deterministic relationship when filling masks. Specifically, we use an external Knowledge Base (KB) to identify deterministic relationships and continuously pre-train PLMs with the proposed methods. The factual knowledge probing experiments indicate that the continuously pre-trained PLMs achieve better robustness in factual knowledge capturing. Further experiments on question-answering datasets show that trying to learn a deterministic relationship with the proposed methods can also help other knowledge-intensive tasks.

COPEN: Probing Conceptual Knowledge in Pre-trained Language Models
Hao Peng | Xiaozhi Wang | Shengding Hu | Hailong Jin | Lei Hou | Juanzi Li | Zhiyuan Liu | Qun Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Conceptual knowledge is fundamental to human cognition and knowledge bases. However, existing knowledge probing works only focus on evaluating factual knowledge of pre-trained language models (PLMs) and ignore conceptual knowledge. Since conceptual knowledge often appears as implicit commonsense behind texts, designing probes for conceptual knowledge is hard. Inspired by knowledge representation schemata, we comprehensively evaluate conceptual knowledge of PLMs by designing three tasks to probe whether PLMs organize entities by conceptual similarities, learn conceptual properties, and conceptualize entities in contexts, respectively. For the tasks, we collect and annotate 24k data instances covering 393 concepts, which is COPEN, a COnceptual knowledge Probing bENchmark. Extensive experiments on different sizes and types of PLMs show that existing PLMs systematically lack conceptual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing human-like cognition in PLMs. COPEN and our codes are publicly released at https://github.com/THU-KEG/COPEN.

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Tianbo Ji | Yvette Graham | Gareth Jones | Chenyang Lyu | Qun Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of r=0.969. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.

ClusterFormer: Neural Clustering Attention for Efficient and Effective Transformer
Ningning Wang | Guobing Gan | Peng Zhang | Shuai Zhang | Junqiu Wei | Qun Liu | Xin Jiang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, a lot of research has been carried out to improve the efficiency of Transformer. Among them, the sparse pattern-based method is an important branch of efficient Transformers. However, some existing sparse methods usually use fixed patterns to select words, without considering similarities between words. Other sparse methods use clustering patterns to select words, but the clustering process is separate from the training process of the target task, which causes a decrease in effectiveness. To address these limitations, we design a neural clustering method, which can be seamlessly integrated into the Self-Attention Mechanism in Transformer. The clustering task and the target task are jointly trained and optimized to benefit each other, leading to significant effectiveness improvement. In addition, our method groups the words with strong dependencies into the same cluster and performs the attention mechanism for each cluster independently, which improves the efficiency. We verified our method on machine translation, text classification, natural language inference, and text matching tasks. Experimental results show that our method outperforms two typical sparse attention methods, Reformer and Routing Transformer while having a comparable or even better time and memory efficiency.

FPT: Improving Prompt Tuning Efficiency via Progressive Training
Yufei Huang | Yujia Qin | Huadong Wang | Yichun Yin | Maosong Sun | Zhiyuan Liu | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, prompt tuning (PT) has gained increasing attention as a parameter-efficient way of tuning pre-trained language models (PLMs). Despite extensively reducing the number of tunable parameters and achieving satisfying performance, PT is training-inefficient due to its slow convergence. To improve PT’s training efficiency, we first make some novel observations about the prompt transferability of “partial PLMs”, which are defined by compressing a PLM in depth or width. We observe that the soft prompts learned by different partial PLMs of various sizes are similar in the parameter space, implying that these soft prompts could potentially be transferred among partial PLMs. Inspired by these observations, we propose Fast Prompt Tuning (FPT), which starts by conducting PT using a small-scale partial PLM, and then progressively expands its depth and width until the full-model size. After each expansion, we recycle the previously learned soft prompts as initialization for the enlarged partial PLM and then proceed PT. We demonstrate the feasibility of FPT on 5 tasks and show that FPT could save over 30% training computations while achieving comparable performance. The codes are publicly available at https://github.com/thunlp/FastPromptTuning.

G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
Zhongwei Wan | Yichun Yin | Wei Zhang | Jiaxin Shi | Lifeng Shang | Guangyong Chen | Xin Jiang | Qun Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

General pre-trained language models (PLMs), such as BERT, have achieved remarkable performance on various NLP tasks. Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs with domain-specific corpora. However, this domain-adaptive pre-training (DAPT (CITATION)) tends to forget the previous general knowledge acquired by general PLMs, which leads to a catastrophic forgetting phenomenon and sub-optimal performance. To alleviate this problem, we propose a new framework of Memory-Augmented Pre-trained Language Model (MAP), which augments the domain-specific PLM by a memory built from the frozen general PLM without losing the general knowledge. Specifically, we propose a new memory-augmented layer, and based on it, different augmentation strategies are explored to build memory and fusion memory into domain-specific PLM. We demonstrate the effectiveness of MAP on different domains (biomedical and computer science publications, news, and reviews) and different kinds (text classification, QA, NER) of tasks, and the extensive results show that the proposed MAP can achieve SOTA results on these tasks.

Compression of Generative Pre-trained Language Models via Quantization
Chaofan Tao | Lu Hou | Wei Zhang | Lifeng Shang | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The increasing size of generative Pre-trained Language Models (PLMs) have greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings caused by reduced capacity and the varied distribution of weights. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With comparable performance with the full-precision models, we achieve 14.4x and 13.4x compression rate on GPT-2 and BART, respectively.

How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis
Shaobo Li | Xiaoguang Li | Lifeng Shang | Zhenhua Dong | Chengjie Sun | Bingquan Liu | Zhenzhou Ji | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2022

Recently, there has been a trend to investigate the factual knowledge captured by Pre-trained Language Models (PLMs). Many works show the PLMs’ ability to fill in the missing factual words in cloze-style prompts such as ”Dante was born in [MASK].” However, it is still a mystery how PLMs generate the results correctly: relying on effective clues or shortcut patterns? We try to answer this question by a causal-inspired analysis that quantitatively measures and evaluates the word-level patterns that PLMs depend on to generate the missing words. We check the words that have three typical associations with the missing words: knowledge-dependent, positionally close, and highly co-occurred. Our analysis shows: (1) PLMs generate the missing factual words more by the positionally close and highly co-occurred words than the knowledge-dependent words; (2) the dependence on the knowledge-dependent words is more effective than the positionally close and highly co-occurred words. Accordingly, we conclude that the PLMs capture the factual knowledge ineffectively because of depending on the inadequate associations.

Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark
Jingyan Zhou | Jiawen Deng | Fei Mi | Yitong Li | Yasheng Wang | Minlie Huang | Xin Jiang | Qun Liu | Helen Meng
Findings of the Association for Computational Linguistics: EMNLP 2022

Among all the safety concerns that hinder the deployment of open-domain dialog systems (e.g., offensive languages, biases, and toxic behaviors), social bias presents an insidious challenge. Addressing this challenge requires rigorous analyses and normative reasoning. In this paper, we focus our investigation on social bias measurement to facilitate the development of unbiased dialog systems. We first propose a novel Dial-Bias Framework for analyzing the social bias in conversations using a holistic method beyond bias lexicons or dichotomous annotations. Leveraging the proposed framework, we further introduce the CDial-Bias Dataset which is, to the best of our knowledge, the first annotated Chinese social bias dialog dataset. We also establish a fine-grained dialog bias measurement benchmark and conduct in-depth ablation studies to shed light on the utility of the detailed annotations in the proposed dataset. Finally, we evaluate representative Chinese generative models with our classifiers to unveil the presence of social bias in these systems.

End-to-End Simultaneous Speech Translation with Pretraining and Distillation: Huawei Noah’s System for AutoSimTranS 2022
Xingshan Zeng | Pengfei Li | Liangyou Li | Qun Liu
Proceedings of the Third Workshop on Automatic Simultaneous Translation

This paper describes the system submitted to AutoSimTrans 2022 from Huawei Noah’s Ark Lab, which won the first place in the audio input track of the Chinese-English translation task. Our system is based on RealTranS, an end-to-end simultaneous speech translation model. We enhance the model with pretraining, by initializing the acoustic encoder with ASR encoder, and the semantic encoder and decoder with NMT encoder and decoder, respectively. To relieve the data scarcity, we further construct pseudo training corpus as a kind of knowledge distillation with ASR data and the pretrained NMT model. Meanwhile, we also apply several techniques to improve the robustness and domain generalizability, including punctuation removal, token-level knowledge distillation and multi-domain finetuning. Experiments show that our system significantly outperforms the baselines at all latency and also verify the effectiveness of our proposed methods.

FreeTransfer-X: Safe and Label-Free Cross-Lingual Transfer from Off-the-Shelf Models
Yinpeng Guo | Liangyou Li | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: NAACL 2022

Cross-lingual transfer (CLT) is of various applications. However, labeled cross-lingual corpus is expensive or even inaccessible, especially in the fields where labels are private, such as diagnostic results of symptoms in medicine and user profiles in business. Nevertheless, there are off-the-shelf models in these sensitive fields. Instead of pursuing the original labels, a workaround for CLT is to transfer knowledge from the off-the-shelf models without labels. To this end, we define a novel CLT problem named FreeTransfer-X that aims to achieve knowledge transfer from the off-the-shelf models in rich-resource languages. To address the problem, we propose a 2-step knowledge distillation (KD, Hinton et al., 2015) framework based on multilingual pre-trained language models (mPLM). The significant improvement over strong neural machine translation (NMT) baselines demonstrates the effectiveness of the proposed method. In addition to reducing annotation cost and protecting private labels, the proposed method is compatible with different networks and easy to be deployed. Finally, a range of analyses indicate the great potential of the proposed method.

Offline-to-Online Co-Evolutional User Simulator and Dialogue System
Dafeng Chi | Yuzheng Zhuang | Yao Mu | Bin Wang | Jianzhu Bao | Yasheng Wang | Yuhan Dong | Xin Jiang | Qun Liu | Jianye Hao
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)

Reinforcement learning (RL) has emerged as a promising approach to fine-tune offline pretrained GPT-2 model in task-oriented dialogue (TOD) systems. In order to obtain human-like online interactions while extending the usage of RL, building pretrained user simulators (US) along with dialogue systems (DS) and facilitating jointly fine-tuning via RL becomes prevalent. However, joint training brings distributional shift problem caused by compounding exposure bias. Existing methods usually iterative update US and DS to ameliorate the ensued non-stationarity problem, which could lead to sub-optimal policy and less sample efficiency. To take a step further for tackling the problem, we introduce an Offline-to-oNline Co-Evolutional (ONCE) framework, which enables bias-aware concurrent joint update for RL-based fine-tuning whilst takes advantages from GPT-2 based end-to-end modeling on US and DS. Extensive experiments demonstrate that ONCE builds high-quality loops of policy learning and dialogues data collection, and achieves state-of-the-art online and offline evaluation results on MultiWOZ2.1 dataset. Opensourced code will be implemented with Mindspore (MS, 2022) and released on our homepage.

bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen | Yichun Yin | Lifeng Shang | Xin Jiang | Yujia Qin | Fengyu Wang | Zhi Wang | Xiao Chen | Zhiyuan Liu | Qun Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources, and most of the models are trained from scratch without reusing the existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method proposed in computer vision on the Transformer-based language model, and further improve it by proposing a novel method, advanced knowledge for large model’s initialization. In addition, a two-stage learning method is proposed to further accelerate the pre-training. We conduct extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method can save a significant amount of training cost compared with baselines including learning from scratch, StackBERT and MSLT; (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.

MINER: Multi-Interest Matching Network for News Recommendation
Jian Li | Jieming Zhu | Qiwei Bi | Guohao Cai | Lifeng Shang | Zhenhua Dong | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2022

Personalized news recommendation is an essential technique to help users find interested news. Accurately matching user’s interests and candidate news is the key to news recommendation. Most existing methods learn a single user embedding from user’s historical behaviors to represent the reading interest. However, user interest is usually diverse and may not be adequately modeled by a single user embedding. In this paper, we propose a poly attention scheme to learn multiple interest vectors for each user, which encodes the different aspects of user interest. We further propose a disagreement regularization to make the learned interests vectors more diverse. Moreover, we design a category-aware attention weighting strategy that incorporates the news category information as explicit interest signals into the attention mechanism. Extensive experiments on the MIND news recommendation benchmark demonstrate that our approach significantly outperforms existing state-of-the-art methods.

LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework
Mengjie Zhao | Fei Mi | Yasheng Wang | Minglei Li | Xin Jiang | Qun Liu | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2022

Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shotlearners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Active learning is integrated into LMTurk to reduce the amount of queries made to PLMs, minimizing the computational cost of running PLM inference passes. Altogether, LMTurk is an important step towards making effective use of current PLMs.

Universal Conditional Masked Language Pre-training for Neural Machine Translation
Pengfei Li | Liangyou Li | Meng Zhang | Minghao Wu | Qun Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pre-trained sequence-to-sequence models have significantly improved Neural Machine Translation (NMT). Different from prior works where pre-trained models usually adopt an unidirectional decoder, this paper demonstrates that pre-training a sequence-to-sequence model but with a bidirectional decoder can produce notable performance gains for both Autoregressive and Non-autoregressive NMT. Specifically, we propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora in many languages. We also introduce two simple but effective methods to enhance the CeMAT, aligned code-switching & masking and dynamic dual-masking. We conduct extensive experiments and show that our CeMAT can achieve significant performance improvement for all scenarios from low- to extremely high-resource languages, i.e., up to +14.4 BLEU on low resource and +7.9 BLEU improvements on average for Autoregressive NMT. For Non-autoregressive NMT, we demonstrate it can also produce consistent performance gains, i.e., up to +5.3 BLEU. To the best of our knowledge, this is the first work to pre-train a unified model for fine-tuning on both NMT tasks. Code, data, and pre-trained models are available at https://github.com/huawei-noah/Pretrained-Language-Model/CeMAT

Findings of the Third Workshop on Automatic Simultaneous Translation
Ruiqing Zhang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Haifeng Wang | Liang Huang | Qun Liu | Julia Ive | Wolfgang Macherey
Proceedings of the Third Workshop on Automatic Simultaneous Translation

This paper reports the results of the shared task we hosted on the Third Workshop of Automatic Simultaneous Translation (AutoSimTrans). The shared task aims to promote the development of text-to-text and speech-to-text simultaneous translation, and includes Chinese-English and English-Spanish tracks. The number of systems submitted this year has increased fourfold compared with last year. Additionally, the top 1 ranked system in the speech-to-text track is the first end-to-end submission we have received in the past three years, which has shown great potential. This paper reports the results and descriptions of the 14 participating teams, compares different evaluation metrics, and revisits the ranking method.

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Wenliang Dai | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu | Pascale Fung
Findings of the Association for Computational Linguistics: ACL 2022

The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) with a tremendous amount of image-text pair data, has shown its superiority on various multimodal alignment tasks. Despite its success, the resulting models are not capable of multimodal generative tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is pretty data- and computation-efficient compared to the pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model with 7× fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.

Pan More Gold from the Sand: Refining Open-domain Dialogue Training with Noisy Self-Retrieval Generation
Yihe Wang | Yitong Li | Yasheng Wang | Fei Mi | Pingyi Zhou | Xin Wang | Jin Liu | Xin Jiang | Qun Liu
Proceedings of the 29th International Conference on Computational Linguistics

Real human conversation data are complicated, heterogeneous, and noisy, from which building open-domain dialogue systems remains a challenging task. In fact, such dialogue data still contains a wealth of information and knowledge, however, they are not fully explored. In this paper, we show existing open-domain dialogue generation methods that memorize context-response paired data with autoregressive or encode-decode language models underutilize the training data. Different from current approaches, using external knowledge, we explore a retrieval-generation training framework that can take advantage of the heterogeneous and noisy training data by considering them as “evidence”. In particular, we use BERTScore for retrieval, which gives better qualities of the evidence and generation. Experiments over publicly available datasets demonstrate that our method can help models generate better responses, even such training data are usually impressed as low-quality data. Such performance gain is comparable with those improved by enlarging the training set, even better. We also found that the model performance has a positive correlation with the relevance of the retrieved evidence. Moreover, our method performed well on zero-shot experiments, which indicates that our method can be more robust to real-world data.

MTRec: Multi-Task Learning over BERT for News Recommendation
Qiwei Bi | Jian Li | Lifeng Shang | Xin Jiang | Qun Liu | Hanfang Yang
Findings of the Association for Computational Linguistics: ACL 2022

Existing news recommendation methods usually learn news representations solely based on news titles. To sufficiently utilize other fields of news information such as category and entities, some methods treat each field as an additional feature and combine different feature vectors with attentive pooling. With the adoption of large pre-trained models like BERT in news recommendation, the above way to incorporate multi-field information may encounter challenges: the shallow feature encoding to compress the category and entity information is not compatible with the deep BERT encoding. In this paper, we propose a multi-task method to incorporate the multi-field information into BERT, which improves its news encoding capability. Besides, we modify the gradients of auxiliary tasks based on their gradient conflicts with the main task, which further boosts the model performance. Extensive experiments on the MIND news recommendation benchmark show the effectiveness of our approach.

Triangular Transfer: Freezing the Pivot for Triangular Machine Translation
Meng Zhang | Liangyou Li | Qun Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Triangular machine translation is a special case of low-resource machine translation where the language pair of interest has limited parallel data, but both languages have abundant parallel data with a pivot language. Naturally, the key to triangular machine translation is the successful exploitation of such auxiliary data. In this work, we propose a transfer-learning-based approach that utilizes all types of auxiliary data. As we train auxiliary source-pivot and pivot-target translation models, we initialize some parameters of the pivot side with a pre-trained language model and freeze them to encourage both translation models to work in the same pivot language space, so that they can be smoothly transferred to the source-target translation model. Experiments show that our approach can outperform previous ones.

Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Processing
Abbas Ghaddar | Yimeng Wu | Sunyam Bagga | Ahmad Rashid | Khalil Bibi | Mehdi Rezagholizadeh | Chao Xing | Yasheng Wang | Xinyu Duan | Zhefeng Wang | Baoxing Huai | Xin Jiang | Qun Liu | Phillippe Langlais
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

There is a growing body of work in recent years to develop pre-trained language models (PLMs) for the Arabic language. This work addresses two major problems in existing Arabic PLMs that limit the progress of the Arabic NLU and NLG fields. First, existing Arabic PLMs are not well-explored and their pre-training can be improved significantly using a more methodical approach. Second, there is a lack of systematic and reproducible evaluation of these models in the literature. We revisit both the pre-training and evaluation of Arabic PLMs. In terms of pre-training, we explore the impact of the quality of the pretraining data, the size of the model, and the incorporation of character-level information on Arabic PLM. As a result, we release three new Arabic BERT-style models ( JABER, Char-JABER, and SABER), and two T5-style models (AT5S and AT5B). In terms of evaluation, we conduct a comprehensive empirical study to systematically evaluate the performance of existing state-of-the-art models on ALUE, a leaderboard-powered benchmark for Arabic NLU tasks, and on a subset of the Arabic generative tasks. We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks. Our models and source code to reproduce results will be made available upon acceptance.

To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.

Controlled Text Generation Using Dictionary Prior in Variational Autoencoders
Xianghong Fang | Jian Li | Lifeng Shang | Xin Jiang | Qun Liu | Dit-Yan Yeung
Findings of the Association for Computational Linguistics: ACL 2022

While variational autoencoders (VAEs) have been widely applied in text generation tasks, they are troubled by two challenges: insufficient representation capacity and poor controllability. The former results from the posterior collapse and restrictive assumption, which impede better representation learning. The latter arises as continuous latent variables in traditional formulations hinder VAEs from interpretability and controllability. In this paper, we propose Dictionary Prior (DPrior), a new data-driven prior that enjoys the merits of expressivity and controllability. To facilitate controlled text generation with DPrior, we propose to employ contrastive learning to separate the latent space into several parts. Extensive experiments on both language modeling and controlled text generation demonstrate the effectiveness of the proposed approach.

2021

AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models
Yichun Yin | Cheng Chen | Lifeng Shang | Xin Jiang | Xiao Chen | Qun Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Pre-trained language models (PLMs) have achieved great success in natural language processing. Most of PLMs follow the default setting of architecture hyper-parameters (e.g., the hidden dimension is a quarter of the intermediate dimension in feed-forward sub-networks) in BERT. Few studies have been conducted to explore the design of architecture hyper-parameters in BERT, especially for the more efficient PLMs with tiny sizes, which are essential for practical deployment on resource-constrained devices. In this paper, we adopt the one-shot Neural Architecture Search (NAS) to automatically search architecture hyper-parameters. Specifically, we carefully design the techniques of one-shot learning and the search space to provide an adaptive and efficient development way of tiny PLMs for various latency constraints. We name our method AutoTinyBERT and evaluate its effectiveness on the GLUE and SQuAD benchmarks. The extensive experiments show that our method outperforms both the SOTA search-based baseline (NAS-BERT) and the SOTA distillation-based methods (such as DistilBERT, TinyBERT, MiniLM, and MobileBERT). In addition, based on the obtained architectures, we propose a more efficient development method that is even faster than the development of a single PLM. The source code and models will be publicly available upon publication.

Proceedings of the Second Workshop on Automatic Simultaneous Translation
Hua Wu | Colin Cherry | Liang Huang | Zhongjun He | Qun Liu | Maha Elbayad | Mark Liberman | Haifeng Wang | Mingbo Ma | Ruiqing Zhang
Proceedings of the Second Workshop on Automatic Simultaneous Translation

TGEA: An Error-Annotated Dataset and Benchmark Tasks for TextGeneration from Pretrained Language Models
Jie He | Bo Peng | Yi Liao | Qun Liu | Deyi Xiong
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In order to deeply understand the capability of pretrained language models in text generation and conduct a diagnostic evaluation, we propose TGEA, an error-annotated dataset with multiple benchmark tasks for text generation from pretrained language models (PLMs). We use carefully selected prompt words to guide GPT-2 to generate candidate sentences, from which we select 47K for error annotation. Crowdsourced workers manually check each of these sentences and detect 12k erroneous sentences. We create an error taxonomy to cover 24 types of errors occurring in these erroneous sentences according to the nature of errors with respect to linguistics and knowledge (e.g., common sense). For each erroneous span in PLM-generated sentences, we also detect another span that is closely associated with it. Each error is hence manually labeled with comprehensive annotations, including the span of the error, the associated span, minimal correction to the error, the type of the error, and rationale behind the error. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, statistics and analysis of the dataset. This is the first dataset with comprehensive annotations for PLM-generated texts, which facilitates the diagnostic evaluation of PLM-based text generation. Furthermore, we use TGEA as a benchmark dataset and propose a series of automatic diagnosis tasks, including error detection, error type classification, associated span detection, error rationale generation, to further promote future study on the automatic error detection and correction on texts generated by pretrained language models.

Two Parents, One Child: Dual Transfer for Low-Resource Neural Machine Translation
Meng Zhang | Liangyou Li | Qun Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Self-Supervised Quality Estimation for Machine Translation
Yuanhang Zheng | Zhixing Tan | Meng Zhang | Mieradilijiang Maimaiti | Huanbo Luan | Maosong Sun | Qun Liu | Yang Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Quality estimation (QE) of machine translation (MT) aims to evaluate the quality of machine-translated sentences without references and is important in practical applications of MT. Training QE models require massive parallel data with hand-crafted quality annotations, which are time-consuming and labor-intensive to obtain. To address the issue of the absence of annotated training data, previous studies attempt to develop unsupervised QE methods. However, very few of them can be applied to both sentence- and word-level QE tasks, and they may suffer from noises in the synthetic data. To reduce the negative impact of noises, we propose a self-supervised method for both sentence- and word-level QE, which performs quality estimation by recovering the masked target words. Experimental results show that our method outperforms previous unsupervised methods on several QE tasks in different language pairs and domains.

Chinese WPLC: A Chinese Dataset for Evaluating Pretrained Language Models on Word Prediction Given Long-Range Context
Huibin Ge | Chenxi Sun | Deyi Xiong | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper presents a Chinese dataset for evaluating pretrained language models on Word Prediction given Long-term Context (Chinese WPLC). We propose both automatic and manual selection strategies tailored to Chinese to guarantee that target words in passages collected from over 69K novels can only be predicted with long-term context beyond the scope of sentences containing the target words. Dataset analysis reveals that the types of target words range from common nouns to Chinese 4-character idioms. We also observe that linguistic relations between target words and long-range context exhibit diversity, including lexical match, synonym, summary and reasoning. Experiment results show that the Chinese pretrained language model PanGu-𝛼 is 45 points behind human in terms of top-1 word prediction accuracy, indicating that Chinese WPLC is a challenging dataset. The dataset is publicly available at https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC.

Better Robustness by More Coverage: Adversarial and Mixup Data Augmentation for Robust Finetuning
Chenglei Si | Zhengyan Zhang | Fanchao Qi | Zhiyuan Liu | Yasheng Wang | Qun Liu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Revisiting Robust Neural Machine Translation: A Transformer Case Study
Peyman Passban | Puneeth Saladi | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

Transformers have brought a remarkable improvement in the performance of neural machine translation (NMT) systems but they could be surprisingly vulnerable to noise. In this work, we try to investigate how noise breaks Transformers and if there exist solutions to deal with such issues. There is a large body of work in the NMT literature on analyzing the behavior of conventional models for the problem of noise but Transformers are relatively understudied in this context. Motivated by this, we introduce a novel data-driven technique called Target Augmented Fine-tuning (TAFT) to incorporate noise during training. This idea is comparable to the well-known fine-tuning strategy. Moreover, we propose two other novel extensions to the original Transformer: Controlled Denoising (CD) and Dual-Channel Decoding (DCD), that modify the neural architecture as well as the training process to handle noise. One important characteristic of our techniques is that they only impact the training phase and do not impose any overhead at inference time. We evaluated our techniques to translate the English–German pair in both directions and observed that our models have a higher tolerance to noise. More specifically, they perform with no deterioration where up to 10% of entire test words are infected by noise.

Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings
Weixuan Wang | Wei Peng | Meng Zhang | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural Machine Translation (NMT) has shown a strong ability to utilize local context to disambiguate the meaning of words. However, it remains a challenge for NMT to leverage broader context information like topics. In this paper, we propose heterogeneous ways of embedding topic information at the sentence level into an NMT model to improve translation performance. Specifically, the topic information can be incorporated as pre-encoder topic embedding, post-encoder topic embedding, and decoder topic embedding to increase the likelihood of selecting target words from the same topic of the source sentence. Experimental results show that NMT models with the proposed topic knowledge embedding outperform the baselines on the English -> German and English -> French translation tasks.

Improving Unsupervised Question Answering via Summarization-Informed Question Generation
Chenyang Lyu | Lifeng Shang | Yvette Graham | Jennifer Foster | Xin Jiang | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Question Generation (QG) is the task of generating a plausible question for a given <passage, answer> pair. Template-based QG uses linguistically-informed heuristics to transform declarative sentences into interrogatives, whereas supervised QG uses existing Question Answering (QA) datasets to train a system to generate a question given a passage and an answer. A disadvantage of the heuristic approach is that the generated questions are heavily tied to their declarative counterparts. A disadvantage of the supervised approach is that they are heavily tied to the domain/language of the QA dataset used as training data. In order to overcome these shortcomings, we propose a distantly-supervised QG method which uses questions generated heuristically from summaries as a source of training data for a QG system. We make use of freely available news summary data, transforming declarative summary sentences into appropriate questions using heuristics informed by dependency parsing, named entity recognition and semantic role labeling. The resulting questions are then combined with the original news articles to train an end-to-end neural QG model. We extrinsically evaluate our approach using unsupervised QA: our QG model is used to generate synthetic QA pairs for training a QA model. Experimental results show that, trained with only 20k English Wikipedia-based synthetic QA pairs, the QA model substantially outperforms previous unsupervised models on three in-domain datasets (SQuAD1.1, Natural Questions, TriviaQA) and three out-of-domain datasets (NewsQA, BioASQ, DuoRC), demonstrating the transferability of the approach.

NoahNMT at WMT 2021: Dual Transfer for Very Low Resource Supervised Machine Translation
Meng Zhang | Minghao Wu | Pengfei Li | Liangyou Li | Qun Liu
Proceedings of the Sixth Conference on Machine Translation

This paper describes the NoahNMT system submitted to the WMT 2021 shared task of Very Low Resource Supervised Machine Translation. The system is a standard Transformer model equipped with our recent technique of dual transfer. It also employs widely used techniques that are known to be helpful for neural machine translation, including iterative back-translation, selected finetuning, and ensemble. The final submission achieves the top BLEU for three translation directions.

DyLex: Incorporating Dynamic Lexicons into BERT for Sequence Labeling
Baojun Wang | Zhao Zhang | Kun Xu | Guang-Yuan Hao | Yuyang Zhang | Lifeng Shang | Linlin Li | Xiao Chen | Xin Jiang | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Incorporating lexical knowledge into deep learning models has been proved to be very effective for sequence labeling tasks. However, previous works commonly have difficulty dealing with large-scale dynamic lexicons which often cause excessive matching noise and problems of frequent updates. In this paper, we propose DyLex, a plug-in lexicon incorporation approach for BERT based sequence labeling tasks. Instead of leveraging embeddings of words in the lexicon as in conventional methods, we adopt word-agnostic tag embeddings to avoid re-training the representation while updating the lexicon. Moreover, we employ an effective supervised lexical knowledge denoising method to smooth out matching noise. Finally, we introduce a col-wise attention based knowledge fusion mechanism to guarantee the pluggability of the proposed framework. Experiments on ten datasets of three tasks show that the proposed framework achieves new SOTA, even with very large scale lexicons.

HyKnow: End-to-End Task-Oriented Dialog Modeling with Hybrid Knowledge Management
Silin Gao | Ryuichi Takanobu | Wei Peng | Qun Liu | Minlie Huang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering
Zhihong Shao | Lifeng Shang | Qun Liu | Minlie Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Weakly supervised question answering usually has only the final answers as supervision signals while the correct solutions to derive the answers are not provided. This setting gives rise to the spurious solution problem: there may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance (e.g., producing wrong solutions or answers). For example, for discrete reasoning tasks as on DROP, there may exist many equations to derive a numeric answer, and typically only one of them is correct. Previous learning methods mostly filter out spurious solutions with heuristics or using model confidence, but do not explicitly exploit the semantic correlations between a question and its solution. In this paper, to alleviate the spurious solution problem, we propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions. Extensive experiments on four question answering datasets show that our method significantly outperforms previous learning methods in terms of task performance and is more effective in training models to produce correct solutions.

Document Graph for Neural Machine Translation
Mingzhou Xu | Liangyou Li | Derek F. Wong | Qun Liu | Lidia S. Chao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Previous works have shown that contextual information can improve the performance of neural machine translation (NMT). However, most existing document-level NMT methods failed to leverage contexts beyond a few set of previous sentences. How to make use of the whole document as global contexts is still a challenge. To address this issue, we hypothesize that a document can be represented as a graph that connects relevant contexts regardless of their distances. We employ several types of relations, including adjacency, syntactic dependency, lexical consistency, and coreference, to construct the document graph. Then, we incorporate both source and target graphs into the conventional Transformer architecture with graph convolutional networks. Experiments on various NMT benchmarks, including IWSLT English–French, Chinese-English, WMT English–German and Opensubtitle English–Russian, demonstrate that using document graphs can significantly improve the translation quality. Extensive analysis verifies that the document graph is beneficial for capturing discourse phenomena.

Generate & Rank: A Multi-task Framework for Math Word Problems
Jianhao Shen | Yichun Yin | Lin Li | Lifeng Shang | Xin Jiang | Ming Zhang | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

Math word problem (MWP) is a challenging and critical task in natural language processing. Many recent studies formalize MWP as a generation task and have adopted sequence-to-sequence models to transform problem descriptions to mathematical expressions. However, mathematical expressions are prone to minor mistakes while the generation objective does not explicitly handle such mistakes. To address this limitation, we devise a new ranking task for MWP and propose Generate & Rank, a multi-task framework based on a generative pre-trained language model. By joint training with generation and ranking, the model learns from its own mistakes and is able to distinguish between correct and incorrect expressions. Meanwhile, we perform tree-based disturbance specially designed for MWP and an online update to boost the ranker. We demonstrate the effectiveness of our proposed method on the benchmark and the results show that our method consistently outperforms baselines in all datasets. Particularly, in the classical Math23k, our method is 7% (78.4% to 85.4%) higher than the state-of-the-art. Code could be found at https://github.com/huawei-noah/noah-research.

Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training
Minghao Wu | Yitong Li | Meng Zhang | Liangyou Li | Gholamreza Haffari | Qun Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Learning multilingual and multi-domain translation model is challenging as the heterogeneous and imbalanced data make the model converge inconsistently over different corpora in real world. One common practice is to adjust the share of each corpus in the training, so that the learning process is balanced and low-resource cases can benefit from the high resource ones. However, automatic balancing methods usually depend on the intra- and inter-dataset characteristics, which is usually agnostic or requires human priors. In this work, we propose an approach, MultiUAT, that dynamically adjusts the training data usage based on the model’s uncertainty on a small set of trusted clean data for multi-corpus machine translation. We experiments with two classes of uncertainty measures on multilingual (16 languages with 4 settings) and multi-domain settings (4 for in-domain and 2 for out-of-domain on English-German translation) and demonstrate our approach MultiUAT substantially outperforms its baselines, including both static and dynamic strategies. We analyze the cross-domain transfer and show the deficiency of static and similarity based methods.

GhostBERT: Generate More Features with Cheap Operations for BERT
Zhiqi Huang | Lu Hou | Lifeng Shang | Xin Jiang | Xiao Chen | Qun Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Transformer-based pre-trained language models like BERT, though powerful in many tasks, are expensive in both memory and computation, due to their large number of parameters. Previous works show that some parameters in these models can be pruned away without severe accuracy drop. However, these redundant features contribute to a comprehensive understanding of the training data and removing them weakens the model’s representation ability. In this paper, we propose GhostBERT, which generates more features with very cheap operations from the remaining features. In this way, GhostBERT has similar memory and computational cost as the pruned model, but enjoys much larger representation power. The proposed ghost module can also be applied to unpruned BERT models to enhance their performance with negligible additional parameters and computation. Empirical results on the GLUE benchmark on three backbone models (i.e., BERT, RoBERTa and ELECTRA) verify the efficacy of our proposed method.

RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
Xingshan Zeng | Liangyou Li | Qun Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Huawei AARC’s Submissions to the WMT21 Biomedical Translation Task: Domain Adaption from a Practical Perspective
Weixuan Wang | Wei Peng | Xupeng Meng | Qun Liu
Proceedings of the Sixth Conference on Machine Translation

This paper describes Huawei Artificial Intelligence Application Research Center’s neural machine translation systems and submissions to the WMT21 biomedical translation shared task. Four of the submissions achieve state-of-the-art BLEU scores based on the official-released automatic evaluation results (EN->FR, EN<->IT and ZH->EN). We perform experiments to unveil the practical insights of the involved domain adaptation techniques, including finetuning order, terminology dictionaries, and ensemble decoding. Issues associated with overfitting and under-translation are also discussed.

Multilingual Speech Translation with Unified Transformer: Huawei Noah’s Ark Lab at IWSLT 2021
Xingshan Zeng | Liangyou Li | Qun Liu
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

This paper describes the system submitted to the IWSLT 2021 Multilingual Speech Translation (MultiST) task from Huawei Noah’s Ark Lab. We use a unified transformer architecture for our MultiST model, so that the data from different modalities (i.e., speech and text) and different tasks (i.e., Speech Recognition, Machine Translation, and Speech Translation) can be exploited to enhance the model’s ability. Specifically, speech and text inputs are firstly fed to different feature extractors to extract acoustic and textual features, respectively. Then, these features are processed by a shared encoder–decoder architecture. We apply several training techniques to improve the performance, including multi-task learning, task-level curriculum learning, data augmentation, etc. Our final system achieves significantly better results than bilingual baselines on supervised language pairs and yields reasonable results on zero-shot language pairs.

BinaryBERT: Pushing the Limit of BERT Quantization
Haoli Bai | Wei Zhang | Lu Hou | Lifeng Shang | Jin Jin | Xin Jiang | Qun Liu | Michael Lyu | Irwin King
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is hard to be trained directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks. Code will be released.

2020

Huawei’s Submissions to the WMT20 Biomedical Translation Task
Wei Peng | Jianfeng Liu | Minghan Wang | Liangyou Li | Xupeng Meng | Hao Yang | Qun Liu
Proceedings of the Fifth Conference on Machine Translation

This paper describes Huawei’s submissions to the WMT20 biomedical translation shared task. Apart from experimenting with finetuning on domain-specific bitexts, we explore effects of in-domain dictionaries on enhancing cross-domain neural machine translation performance. We utilize a transfer learning strategy through pre-trained machine translation models and extensive scope of engineering endeavors. Four of our ten submissions achieve state-of-the-art performance according to the official automatic evaluation results, namely translation directions on English<->French, English->German and English->Italian.

Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT
Zhiyong Wu | Yun Chen | Ben Kao | Qun Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

By introducing a small set of additional parameters, a probe learns to solve specific linguistic tasks (e.g., dependency parsing) in a supervised manner using feature representations (e.g., contextualized embeddings). The effectiveness of such probing tasks is taken as evidence that the pre-trained model encodes linguistic knowledge. However, this approach of evaluating a language model is undermined by the uncertainty of the amount of knowledge that is learned by the probe itself. Complementary to those works, we propose a parameter-free probing technique for analyzing pre-trained language models (e.g., BERT). Our method does not require direct supervision from the probing tasks, nor do we introduce additional parameters to the probing process. Our experiments on BERT show that syntactic trees recovered from BERT using our method are significantly better than linguistically-uninformed baselines. We further feed the empirically induced dependency structures into a downstream sentiment classification task and find its improvement compatible with or even superior to a human-designed dependency schema.

Word-level Textual Adversarial Attacking as Combinatorial Optimization
Yuan Zang | Fanchao Qi | Chenghao Yang | Zhiyuan Liu | Meng Zhang | Qun Liu | Maosong Sun
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Textual adversarial attacking is challenging because text is discrete and a small perturbation can bring significant change to the original input. Word-level attacking, which can be regarded as a combinatorial optimization problem, is a well-studied class of textual attack methods. However, existing word-level attack models are far from perfect, largely because unsuitable search space reduction methods and inefficient optimization algorithms are employed. In this paper, we propose a novel attack model, which incorporates the sememe-based word substitution method and particle swarm optimization-based search algorithm to solve the two problems separately. We conduct exhaustive experiments to evaluate our attack model by attacking BiLSTM and BERT on three benchmark datasets. Experimental results demonstrate that our model consistently achieves much higher attack success rates and crafts more high-quality adversarial examples as compared to baseline methods. Also, further experiments show our model has higher transferability and can bring more robustness enhancement to victim models by adversarial training. All the code and data of this paper can be obtained on https://github.com/thunlp/SememePSO-Attack.

Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order
Yi Liao | Xin Jiang | Qun Liu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Masked language model and autoregressive language model are two types of language models. While pretrained masked language models such as BERT overwhelm the line of natural language understanding (NLU) tasks, autoregressive language models such as GPT are especially capable in natural language generation (NLG). In this paper, we propose a probabilistic masking scheme for the masked language model, which we call probabilistically masked language model (PMLM). We implement a specific PMLM with a uniform prior distribution on the masking ratio named u-PMLM. We prove that u-PMLM is equivalent to an autoregressive permutated language model. One main advantage of the model is that it supports text generation in arbitrary order with surprisingly good quality, which could potentially enable new applications over traditional unidirectional generation. Besides, the pretrained u-PMLM also outperforms BERT on a bunch of downstream NLU tasks.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Qun Liu | David Schlangen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

TernaryBERT: Distillation-aware Ultra-low Bit BERT
Wei Zhang | Lu Hou | Yichun Yin | Lifeng Shang | Xiao Chen | Xin Jiang | Qun Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Transformer-based pre-training models like BERT have achieved remarkable performance in many natural language processing tasks. However, these models are both computation and memory expensive, hindering their deployment to resource-constrained devices. In this work, we propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Specifically, we use both approximation-based and loss-aware ternarization methods and empirically investigate the ternarization granularity of different parts of BERT. Moreover, to reduce the accuracy degradation caused by lower capacity of low bits, we leverage the knowledge distillation technique in the training process. Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods, and even achieves comparable performance as the full-precision model while being 14.9x smaller.

Accurate Word Alignment Induction from Neural Machine Translation
Yun Chen | Yang Liu | Guanhua Chen | Xin Jiang | Qun Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite its original goal to jointly learn to align and translate, prior researches suggest that Transformer captures poor word alignments through its attention mechanism. In this paper, we show that attention weights do capture accurate word alignments and propose two novel word alignment induction methods Shift-Att and Shift-AET. The main idea is to induce alignments at the step when the to-be-aligned target token is the decoder input rather than the decoder output as in previous work. Shift-Att is an interpretation method that induces alignments from the attention weights of Transformer and does not require parameter update or architecture change. Shift-AET extracts alignments from an additional alignment module which is tightly integrated into Transformer and trained in isolation with supervision from symmetrized Shift-Att alignments. Experiments on three publicly available datasets demonstrate that both methods perform better than their corresponding neural baselines and Shift-AET significantly outperforms GIZA++ by 1.4-4.8 AER points.

HyperText: Endowing FastText with Hyperbolic Geometry
Yudong Zhu | Di Zhou | Jinghui Xiao | Xin Jiang | Xiao Chen | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2020

Natural language data exhibit tree-like hierarchical structures such as the hypernym-hyponym hierarchy in WordNet. FastText, as the state-of-the-art text classifier based on shallow neural network in Euclidean space, may not represent such hierarchies precisely with limited representation capacity. Considering that hyperbolic space is naturally suitable for modelling tree-like hierarchical data, we propose a new model named HyperText for efficient text classification by endowing FastText with hyperbolic geometry. Empirically, we show that HyperText outperforms FastText on a range of text classification tasks with much reduced parameters.

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers
Yimeng Wu | Peyman Passban | Mehdi Rezagholizadeh | Qun Liu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

With the growth of computing power neural machine translation (NMT) models also grow accordingly and become better. However, they also become harder to deploy on edge devices due to memory constraints. To cope with this problem, a common practice is to distill knowledge from a large and accurately-trained teacher network (T) into a compact student network (S). Although knowledge distillation (KD) is useful in most cases, our study shows that existing KD techniques might not be suitable enough for deep NMT engines, so we propose a novel alternative. In our model, besides matching T and S predictions we have a combinatorial mechanism to inject layer-level supervision from T to S. In this paper, we target low-resource settings and evaluate our translation engines for Portuguese→English, Turkish→English, and English→German directions. Students trained using our technique have 50% fewer parameters and can still deliver comparable results to those of 12-layer teachers.

A General Framework for Adaptation of Neural Machine Translation to Simultaneous Translation
Yun Chen | Liangyou Li | Xin Jiang | Xiao Chen | Qun Liu
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Despite the success of neural machine translation (NMT), simultaneous neural machine translation (SNMT), the task of translating in real time before a full sentence has been observed, remains challenging due to the syntactic structure difference and simultaneity requirements. In this paper, we propose a general framework for adapting neural machine translation to translate simultaneously. Our framework contains two parts: prefix translation that utilizes a consecutive NMT model to translate source prefixes and a stopping criterion that determines when to stop the prefix translation. Experiments on three translation corpora and two language pairs show the efficacy of the proposed framework on balancing the quality and latency in adapting NMT to perform simultaneous translation.

The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
Jie He | Tao Wang | Deyi Xiong | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2020

Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contain a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT, GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy ( 6 60.1%) and reasoning consistency (6 31%). We will release our test suite as a machine translation commonsense reasoning testbed to promote future work in this direction.

Proceedings of the Second International Workshop of Discourse Processing
Qun Liu | Deyi Xiong | Shili Ge | Xiaojun Zhang
Proceedings of the Second International Workshop of Discourse Processing

BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models
Bin He | Di Zhou | Jinghui Xiao | Xin Jiang | Qun Liu | Nicholas Jing Yuan | Tong Xu
Findings of the Association for Computational Linguistics: EMNLP 2020

Complex node interactions are common in knowledge graphs (KGs), and these interactions can be considered as contextualized knowledge exists in the topological structure of KGs. Traditional knowledge representation learning (KRL) methods usually treat a single triple as a training unit, neglecting the usage of graph contextualized knowledge. To utilize these unexploited graph-level knowledge, we propose an approach to model subgraphs in a medical KG. Then, the learned knowledge is integrated with a pre-trained language model to do the knowledge generalization. Experimental results demonstrate that our model achieves the state-of-the-art performance on several medical NLP tasks, and the improvement above MedERNIE indicates that graph contextualized knowledge is beneficial.

TinyBERT: Distilling BERT for Natural Language Understanding
Xiaoqi Jiao | Yichun Yin | Lifeng Shang | Xin Jiang | Xiao Chen | Linlin Li | Fang Wang | Qun Liu
Findings of the Association for Computational Linguistics: EMNLP 2020

Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT4 with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERT-Base on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% parameters and ~31% inference time of them. Moreover, TinyBERT6 with 6 layers performs on-par with its teacher BERT-Base.

2019

Huawei’s NMT Systems for the WMT 2019 Biomedical Translation Task
Wei Peng | Jianfeng Liu | Liangyou Li | Qun Liu
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes Huawei’s neural machine translation systems for the WMT 2019 biomedical translation shared task. We trained and fine-tuned our systems on a combination of out-of-domain and in-domain parallel corpora for six translation directions covering English–Chinese, English–French and English–German language pairs. Our submitted systems achieve the best BLEU scores on English–French and English–German language pairs according to the official evaluation results. In the English–Chinese translation task, our systems are in the second place. The enhanced performance is attributed to more in-domain training and more sophisticated models developed. Development of translation models and transfer learning (or domain adaptation) methods has significantly contributed to the progress of the task.

Decomposable Neural Paraphrase Generation
Zichao Li | Xin Jiang | Lifeng Shang | Qun Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Paraphrasing exists at different granularity levels, such as lexical level, phrasal level and sentential level. This paper presents Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity in a disentangled way. Specifically, the model is composed of multiple encoders and decoders with different structures, each of which corresponds to a specific granularity. The empirical study shows that the decomposition mechanism of DNPG makes paraphrase generation more interpretable and controllable. Based on DNPG, we further develop an unsupervised domain adaptation method for paraphrase generation. Experimental results show that the proposed model achieves competitive in-domain performance compared to state-of-the-art neural models, and significantly better performance when adapting to a new domain.

Bilingual-GAN: A Step Towards Parallel Text Generation
Ahmad Rashid | Alan Do-Omri | Md. Akmal Haidar | Qun Liu | Mehdi Rezagholizadeh
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation

Latent space based GAN methods and attention based sequence to sequence models have achieved impressive results in text generation and unsupervised machine translation respectively. Leveraging the two domains, we propose an adversarial latent space based model capable of generating parallel sentences in two languages concurrently and translating bidirectionally. The bilingual generation goal is achieved by sampling from the latent space that is shared between both languages. First two denoising autoencoders are trained, with shared encoders and back-translation to enforce a shared latent state between the two languages. The decoder is shared for the two translation directions. Next, a GAN is trained to generate synthetic ‘code’ mimicking the languages’ shared latent space. This code is then fed into the decoder to generate text in either language. We perform our experiments on Europarl and Multi30k datasets, on the English-French language pair, and document our performance using both supervised and unsupervised machine translation.

Improving Domain Adaptation Translation with Domain Invariant and Specific Information
Shuhao Gu | Yang Feng | Qun Liu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In domain adaptation for neural machine translation, translation performance can benefit from separating features into domain-specific features and common features. In this paper, we propose a method to explicitly model the two kinds of information in the encoder-decoder framework so as to exploit out-of-domain data in in-domain training. In our method, we maintain a private encoder and a private decoder for each domain which are used to model domain-specific information. In the meantime, we introduce a common encoder and a common decoder shared by all the domains which can only have domain-independent information flow through. Besides, we add a discriminator to the shared encoder and employ adversarial training for the whole model to reinforce the performance of information separation and machine translation simultaneously. Experiment results show that our method can outperform competitive baselines greatly on multiple data sets.

Modeling Semantic Compositionality with Sememe Knowledge
Fanchao Qi | Junjie Huang | Chenghao Yang | Zhiyuan Liu | Xiao Chen | Qun Liu | Maosong Sun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Semantic compositionality (SC) refers to the phenomenon that the meaning of a complex linguistic unit can be composed of the meanings of its constituents. Most related works focus on using complicated compositionality functions to model SC while few works consider external knowledge in models. In this paper, we verify the effectiveness of sememes, the minimum semantic units of human languages, in modeling SC by a confirmatory experiment. Furthermore, we make the first attempt to incorporate sememe knowledge into SC models, and employ the sememe-incorporated models in learning representations of multiword expressions, a typical task of SC. In experiments, we implement our models by incorporating knowledge from a famous sememe knowledge base HowNet and perform both intrinsic and extrinsic evaluations. Experimental results show that our models achieve significant performance boost as compared to the baseline methods without considering sememe knowledge. We further conduct quantitative analysis and case studies to demonstrate the effectiveness of applying sememe knowledge in modeling SC.All the code and data of this paper can be obtained on https://github.com/thunlp/Sememe-SC.

Bridging the Gap between Training and Inference for Neural Machine Translation
Wen Zhang | Yang Feng | Fandong Meng | Di You | Qun Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Neural Machine Translation (NMT) generates target words sequentially in the way of predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context while at inference it has to generate the entire sequence from scratch. This discrepancy of the fed context leads to error accumulation among the way. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence which leads to overcorrection over different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the predicted sequence by the model during training, where the predicted sequence is selected with a sentence-level optimum. Experiment results on Chinese->English and WMT’14 English->German translation tasks demonstrate that our approach can achieve significant improvements on multiple datasets.

ERNIE: Enhanced Language Representation with Informative Entities
Zhengyan Zhang | Xu Han | Zhiyuan Liu | Xin Jiang | Maosong Sun | Qun Liu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The code and datasets will be available in the future.

2018

Speeding Up Neural Machine Translation Decoding by Cube Pruning
Wen Zhang | Liang Huang | Yang Feng | Lei Shen | Qun Liu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Although neural machine translation has achieved promising results, it suffers from slow translation speed. The direct consequence is that a trade-off has to be made between translation quality and speed, thus its performance can not come into full play. We apply cube pruning, a popular technique to speed up dynamic programming, into neural machine translation to speed up the translation. To construct the equivalence class, similar target hidden states are combined, leading to less RNN expansion operations on the target side and less softmax operations over the large target vocabulary. The experiments show that, at the same or even better translation quality, our method can translate faster compared with naive beam search by 3.3x on GPUs and 3.5x on CPUs.

Tailoring Neural Architectures for Translating from Morphologically Rich Languages
Peyman Passban | Andy Way | Qun Liu
Proceedings of the 27th International Conference on Computational Linguistics

A morphologically complex word (MCW) is a hierarchical constituent with meaning-preserving subunits, so word-based models which rely on surface forms might not be powerful enough to translate such structures. When translating from morphologically rich languages (MRLs), a source word could be mapped to several words or even a full sentence on the target side, which means an MCW should not be treated as an atomic unit. In order to provide better translations for MRLs, we boost the existing neural machine translation (NMT) architecture with a double- channel encoder and a double-attentive decoder. The main goal targeted in this research is to provide richer information on the encoder side and redesign the decoder accordingly to benefit from such information. Our experimental results demonstrate that we could achieve our goal as the proposed model outperforms existing subword- and character-based architectures and showed significant improvements on translating from German, Russian, and Turkish into English.

Refining Source Representations with Relation Networks for Neural Machine Translation
Wen Zhang | Jiawei Hu | Yang Feng | Qun Liu
Proceedings of the 27th International Conference on Computational Linguistics

Although neural machine translation with the encoder-decoder framework has achieved great success recently, it still suffers drawbacks of forgetting distant information, which is an inherent disadvantage of recurrent neural network structure, and disregarding relationship between source words during encoding step. Whereas in practice, the former information and relationship are often useful in current step. We target on solving these problems and thus introduce relation networks to learn better representations of the source. The relation networks are able to facilitate memorization capability of recurrent neural network via associating source words with each other, this would also help retain their relationships. Then the source representations and all the relations are fed into the attention component together while decoding, with the main encoder-decoder framework unchanged. Experiments on several datasets show that our method can improve the translation performance significantly over the conventional encoder-decoder model and even outperform the approach involving supervised syntactic knowledge.

Multimodal Neural Machine Translation for Low-resource Language Pairs using Synthetic Data
Koel Dutta Chowdhury | Mohammed Hasanuzzaman | Qun Liu
Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP

In this paper, we investigate the effectiveness of training a multimodal neural machine translation (MNMT) system with image features for a low-resource language pair, Hindi and English, using synthetic data. A three-way parallel corpus which contains bilingual texts and corresponding images is required to train a MNMT system with image features. However, such a corpus is not available for low resource language pairs. To address this, we developed both a synthetic training dataset and a manually curated development/test dataset for Hindi based on an existing English-image parallel corpus. We used these datasets to build our image description translation system by adopting state-of-the-art MNMT models. Our results show that it is possible to train a MNMT system for low-resource language pairs through the use of synthetic data and that such a system can benefit from image features.

Improving Character-Based Decoding Using Target-Side Morphological Information for Neural Machine Translation
Peyman Passban | Qun Liu | Andy Way
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recently, neural machine translation (NMT) has emerged as a powerful alternative to conventional statistical approaches. However, its performance drops considerably in the presence of morphologically rich languages (MRLs). Neural engines usually fail to tackle the large vocabulary and high out-of-vocabulary (OOV) word rate of MRLs. Therefore, it is not suitable to exploit existing word-based models to translate this set of languages. In this paper, we propose an extension to the state-of-the-art model of Chung et al. (2016), which works at the character level and boosts the decoder with target-side morphological information. In our architecture, an additional morphology table is plugged into the model. Each time the decoder samples from a target vocabulary, the table sends auxiliary signals from the most relevant affixes in order to enrich the decoder’s current state and constrain it to provide better predictions. We evaluated our model to translate English into German, Russian, and Turkish as three MRLs and observed significant improvements.

E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language
Henry Elder | Sebastian Gehrmann | Alexander O’Connor | Qun Liu
Proceedings of the 11th International Conference on Natural Language Generation

In natural language generation (NLG), the task is to generate utterances from a more abstract input, such as structured data. An added challenge is to generate utterances that contain an accurate representation of the input, while reflecting the fluency and variety of human-generated text. In this paper, we report experiments with NLG models that can be used in task oriented dialogue systems. We explore the use of additional input to the model to encourage diversity and control of outputs. While our submission does not rank highly using automated metrics, qualitative investigation of generated utterances suggests the use of additional information in neural network NLG systems to be a promising research direction.

Knowledge Diffusion for Neural Dialogue Generation
Shuman Liu | Hongshen Chen | Zhaochun Ren | Yang Feng | Qun Liu | Dawei Yin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

End-to-end neural dialogue generation has shown promising results recently, but it does not employ knowledge to guide the generation and hence tends to generate short, general, and meaningless responses. In this paper, we propose a neural knowledge diffusion (NKD) model to introduce knowledge into dialogue generation. This method can not only match the relevant facts for the input utterance but diffuse them to similar entities. With the help of facts matching and entity diffusion, the neural dialogue generation is augmented with the ability of convergent and divergent thinking over the knowledge base. Our empirical study on a real-world dataset prove that our model is capable of generating meaningful, diverse and natural responses for both factoid-questions and knowledge grounded chi-chats. The experiment results also show that our model outperforms competitive baseline models significantly.

Learning to Jointly Translate and Predict Dropped Pronouns with a Shared Reconstruction Mechanism
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations. Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models. In this work, we improve the original model from two perspectives. First, we employ a shared reconstructor to better exploit encoder and decoder representations. Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model. Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy.

2017

Improving Evaluation of Document-level Machine Translation Quality Estimation
Yvette Graham | Qingsong Ma | Timothy Baldwin | Qun Liu | Carla Parra | Carolina Scarton
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Meaningful conclusions about the relative performance of NLP systems are only possible if the gold standard employed in a given evaluation is both valid and reliable. In this paper, we explore the validity of human annotations currently employed in the evaluation of document-level quality estimation for machine translation (MT). We demonstrate the degree to which MT system rankings are dependent on weights employed in the construction of the gold standard, before proposing direct human assessment as a valid alternative. Experiments show direct assessment (DA) scores for documents to be highly reliable, achieving a correlation of above 0.9 in a self-replication experiment, in addition to a substantial estimated cost reduction through quality controlled crowd-sourcing. The original gold standard based on post-edits incurs a 10–20 times greater cost than DA.

Exploiting Cross-Sentence Context for Neural Machine Translation
Longyue Wang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In translation, considering the document as a whole can help to resolve ambiguities and inconsistencies. In this paper, we propose a cross-sentence context-aware approach and investigate the influence of historical contextual information on the performance of neural machine translation (NMT). First, this history is summarized in a hierarchical way. We then integrate the historical representation into NMT in two strategies: 1) a warm-start of encoder and decoder states, and 2) an auxiliary context source for updating decoder states. Experimental results on a large Chinese-English translation task show that our approach significantly improves upon a strong attention-based NMT system by up to +2.1 BLEU points.

Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
Iacer Calixto | Qun Liu | Nick Campbell
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.

Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado | Lifeng Han | Erwan Moreau | Ashjan Alsulaimani | Koel Dutta Chowdhury | Carl Vogel | Qun Liu
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.

DCU System Report on the WMT 2017 Multi-modal Machine Translation Task
Iacer Calixto | Koel Dutta Chowdhury | Qun Liu
Proceedings of the Second Conference on Machine Translation

Incorporating Word Reordering Knowledge into Attention-based Neural Machine Translation
Jinchao Zhang | Mingxuan Wang | Qun Liu | Jie Zhou
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper proposes three distortion models to explicitly incorporate the word reordering knowledge into attention-based Neural Machine Translation (NMT) for further improving translation performance. Our proposed models enable attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty. Experiments on Chinese-English translation show that the approaches can improve word alignment quality and achieve significant translation improvements over a basic attention-based NMT by large margins. Compared with previous works on identical corpora, our system achieves the state-of-the-art performance on translation quality.

If You Can’t Beat Them Join Them: Handcrafted Features Complement Neural Nets for Non-Factoid Answer Reranking
Dasha Bogdanova | Jennifer Foster | Daria Dzendzik | Qun Liu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We show that a neural approach to the task of non-factoid answer reranking can benefit from the inclusion of tried-and-tested handcrafted features. We present a neural network architecture based on a combination of recurrent neural networks that are used to encode questions and answers, and a multilayer perceptron. We show how this approach can be combined with additional features, in particular, the discourse features used by previous research. Our neural approach achieves state-of-the-art performance on a public dataset from Yahoo! Answers and its performance is further improved by incorporating the discourse features. Additionally, we present a new dataset of Ask Ubuntu questions where the hybrid approach also achieves good results.

Deep Neural Machine Translation with Linear Associative Unit
Mingxuan Wang | Zhengdong Lu | Jie Zhou | Qun Liu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deep Neural Networks (DNNs) have provably enhanced the state-of-the-art Neural Machine Translation (NMT) with its capability in modeling complex functions and capturing complex linguistic structures. However NMT with deep architecture in its encoder or decoder RNNs often suffer from severe gradient diffusion due to the non-linear recurrent activations, which often makes the optimization much more difficult. To address this problem we propose a novel linear associative units (LAU) to reduce the gradient propagation path inside the recurrent unit. Different from conventional approaches (LSTM unit and GRU), LAUs uses linear associative connections between input and output of the recurrent unit, which allows unimpeded information flow through both space and time The model is quite simple, but it is surprisingly effective. Our empirical study on Chinese-English translation shows that our model with proper configuration can improve by 11.7 BLEU upon Groundhog and the best reported on results in the same setting. On WMT14 English-German task and a larger WMT14 English-French task, our model achieves comparable results with the state-of-the-art.

CASICT-DCU Neural Machine Translation Systems for WMT17
Jinchao Zhang | Peerachet Porkaew | Jiawei Hu | Qiuye Zhao | Qun Liu
Proceedings of the Second Conference on Machine Translation

Neural Automatic Post-Editing Using Prior Alignment and Reranking
Santanu Pal | Sudip Kumar Naskar | Mihaela Vela | Qun Liu | Josef van Genabith
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a second-stage machine translation (MT) system based on a neural machine translation (NMT) approach to automatic post-editing (APE) that improves the translation quality provided by a first-stage MT system. Our APE system (APE_Sym) is an extended version of an attention based NMT model with bilingual symmetry employing bidirectional models, mt–pe and pe–mt. APE translations produced by our system show statistically significant improvements over the first-stage MT, phrase-based APE and the best reported score on the WMT 2016 APE dataset by a previous neural APE system. Re-ranking (APE_Rerank) of the n-best translations from the phrase-based APE and APE_Sym systems provides further substantial improvements over the symmetric neural APE model. Human evaluation confirms that the APE_Rerank generated PE translations improve on the previous best neural APE system at WMT 2016.

ADAPT Centre Cone Team at IJCNLP-2017 Task 5: A Similarity-Based Logistic Regression Approach to Multi-choice Question Answering in an Examinations Shared Task
Daria Dzendzik | Alberto Poncelas | Carl Vogel | Qun Liu
Proceedings of the IJCNLP 2017, Shared Tasks

We describe the work of a team from the ADAPT Centre in Ireland in addressing automatic answer selection for the Multi-choice Question Answering in Examinations shared task. The system is based on a logistic regression over the string similarities between question, answer, and additional text. We obtain the highest grade out of six systems: 48.7% accuracy on a validation set (vs. a baseline of 29.45%) and 45.6% on a test set.

Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search
Chris Hokamp | Qun Liu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present Grid Beam Search (GBS), an algorithm which extends beam search to allow the inclusion of pre-specified lexical constraints. The algorithm can be used with any model which generates sequences token by token. Lexical constraints take the form of phrases or words that must be present in the output sequence. This is a very general way to incorporate auxillary knowledge into a model’s output without requiring any modification of the parameters or training data. We demonstrate the feasibility and flexibility of Lexically Constrained Decoding by conducting experiments on Neural Interactive-Predictive Translation, as well as Domain Adaptation for Neural Machine Translation. Experiments show that GBS can provide large improvements in translation quality in interactive scenarios, and that, even without any user input, GBS can be used to achieve significant gains in performance in domain adaptation scenarios.

Further Investigation into Reference Bias in Monolingual Evaluation of Machine Translation
Qingsong Ma | Yvette Graham | Timothy Baldwin | Qun Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Monolingual evaluation of Machine Translation (MT) aims to simplify human assessment by requiring assessors to compare the meaning of the MT output with a reference translation, opening up the task to a much larger pool of genuinely qualified evaluators. Monolingual evaluation runs the risk, however, of bias in favour of MT systems that happen to produce translations superficially similar to the reference and, consistent with this intuition, previous investigations have concluded monolingual assessment to be strongly biased in this respect. On re-examination of past analyses, we identify a series of potential analytical errors that force some important questions to be raised about the reliability of past conclusions, however. We subsequently carry out further investigation into reference bias via direct human assessment of MT adequacy via quality controlled crowd-sourcing. Contrary to both intuition and past conclusions, results for show no significant evidence of reference bias in monolingual evaluation of MT.

Blend: a Novel Combined MT Metric Based on Direct Assessment — CASICT-DCU submission to WMT17 Metrics Task
Qingsong Ma | Yvette Graham | Shugen Wang | Qun Liu
Proceedings of the Second Conference on Machine Translation

Incorporating Global Visual Features into Attention-based Neural Machine Translation.
Iacer Calixto | Qun Liu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. Global image features are extracted using a pre-trained convolutional neural network and are incorporated (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate translations into English and German, how different strategies to incorporate global image features compare and which ones perform best. We also study the impact that adding synthetic multi-modal, multilingual data brings and find that the additional data have a positive impact on multi-modal NMT models. We report new state-of-the-art results and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k data set according to all metrics evaluated. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this data set.

Context-Aware Graph Segmentation for Graph-Based Translation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present an improved graph-based translation model which segments an input graph into node-induced subgraphs by taking source context into consideration. Translations are generated by combining subgraph translations left-to-right using beam search. Experiments on Chinese–English and German–English demonstrate that the context-aware segmentation significantly improves the baseline graph-based model.

Semantics-Enhanced Task-Oriented Dialogue Translation: A Case Study on Hotel Booking
Longyue Wang | Jinhua Du | Liangyou Li | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the IJCNLP 2017, System Demonstrations

We showcase TODAY, a semantics-enhanced task-oriented dialogue translation system, whose novelties are: (i) task-oriented named entity (NE) definition and a hybrid strategy for NE recognition and translation; and (ii) a novel grounded semantic method for dialogue understanding and task-order management. TODAY is a case-study demo which can efficiently and accurately assist customers and agents in different languages to reach an agreement in a dialogue for the hotel booking.

Sentence-Level Multilingual Multi-modal Embedding for Natural Language Processing
Iacer Calixto | Qun Liu
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We propose a novel discriminative ranking model that learns embeddings from multilingual and multi-modal data, meaning that our model can take advantage of images and descriptions in multiple languages to improve embedding quality. To that end, we introduce an objective function that uses pairwise ranking adapted to the case of three or more input sources. We compare our model against different baselines, and evaluate the robustness of our embeddings on image–sentence ranking (ISR), semantic textual similarity (STS), and neural machine translation (NMT). We find that the additional multilingual signals lead to improvements on all three tasks, and we highlight that our model can be used to consistently improve the adequacy of translations generated with NMT models when re-ranking n-best lists.

2016

Interactive Attention for Neural Machine Translation
Fandong Meng | Zhengdong Lu | Hang Li | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Conventional attention-based Neural Machine Translation (NMT) conducts dynamic alignment in generating the target sentence. By repeatedly reading the representation of source sentence, which keeps fixed after generated by the encoder (Bahdanau et al., 2015), the attention mechanism has greatly enhanced state-of-the-art NMT. In this paper, we propose a new attention mechanism, called INTERACTIVE ATTENTION, which models the interaction between the decoder and the representation of source sentence during translation by both reading and writing operations. INTERACTIVE ATTENTION can keep track of the interaction history and therefore improve the translation performance. Experiments on NIST Chinese-English translation task show that INTERACTIVE ATTENTION can achieve significant improvements over both the previous attention-based NMT baseline and some state-of-the-art variants of attention-based NMT (i.e., coverage models (Tu et al., 2016)). And neural machine translator with our INTERACTIVE ATTENTION can outperform the open source attention-based NMT system Groundhog by 4.22 BLEU points and the open source phrase-based system Moses by 3.94 BLEU points averagely on multiple test sets.

Fast Gated Neural Domain Adaptation: Language Model as a Case Study
Jian Zhang | Xiaofeng Wu | Andy Way | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Neural network training has been shown to be advantageous in many natural language processing applications, such as language modelling or machine translation. In this paper, we describe in detail a novel domain adaptation mechanism in neural network training. Instead of learning and adapting the neural network on millions of training sentences – which can be very time-consuming or even infeasible in some cases – we design a domain adaptation gating mechanism which can be used in recurrent neural networks and quickly learn the out-of-domain knowledge directly from the word vector representations with little speed overhead. In our experiments, we use the recurrent neural network language model (LM) as a case study. We show that the neural LM perplexity can be reduced by 7.395 and 12.011 using the proposed domain adaptation mechanism on the Penn Treebank and News data, respectively. Furthermore, we show that using the domain-adapted neural LM to re-rank the statistical machine translation n-best list on the French-to-English language pair can significantly improve translation quality.

A subtree-based factorization of dependency parsing
Qiuye Zhao | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We propose a dependency parsing pipeline, in which the parsing of long-distance projections and localized dependencies are explicitly decomposed at the input level. A chosen baseline dependency parsing model performs only on ‘carved’ sequences at the second stage, which are transformed from coarse constituent parsing outputs at the first stage. When k-best constituent parsing outputs are kept, a third-stage is required to search for an optimal combination of the overlapped dependency subtrees. In this sense, our dependency model is subtree-factored. We explore alternative approaches for scoring subtrees, including feature-based models as well as continuous representations. The search for optimal subset to combine is formulated as an ILP problem. This framework especially benefits the models poor on long sentences, generally improving baselines by 0.75-1.28 (UAS) on English, achieving comparable performance with high-order models but faster. For Chinese, the most notable increase is as high as 3.63 (UAS) when the proposed framework is applied to first-order parsing models.

Enriching Phrase Tables for Statistical Machine Translation Using Mixed Embeddings
Peyman Passban | Qun Liu | Andy Way
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The phrase table is considered to be the main bilingual resource for the phrase-based statistical machine translation (PBSMT) model. During translation, a source sentence is decomposed into several phrases. The best match of each source phrase is selected among several target-side counterparts within the phrase table, and processed by the decoder to generate a sentence-level translation. The best match is chosen according to several factors, including a set of bilingual features. PBSMT engines by default provide four probability scores in phrase tables which are considered as the main set of bilingual features. Our goal is to enrich that set of features, as a better feature set should yield better translations. We propose new scores generated by a Convolutional Neural Network (CNN) which indicate the semantic relatedness of phrase pairs. We evaluate our model in different experimental settings with different language pairs. We observe significant improvements when the proposed features are incorporated into the PBSMT pipeline.

Achieving Accurate Conclusions in Evaluation of Automatic Machine Translation Metrics
Yvette Graham | Qun Liu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Novel Approach to Dropped Pronoun Translation
Longyue Wang | Zhaopeng Tu | Xiaojun Zhang | Hang Li | Andy Way | Qun Liu
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Variational Neural Discourse Relation Recognizer
Biao Zhang | Deyi Xiong | Jinsong Su | Qun Liu | Rongrong Ji | Hong Duan | Min Zhang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Neural Network for Heterogeneous Annotations
Hongshen Chen | Yue Zhang | Qun Liu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Extending Phrase-Based Translation with Dependencies by Using Graphs
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

ProphetMT: A Tree-based SMT-driven Controlled Language Authoring/Post-Editing Tool
Xiaofeng Wu | Jinhua Du | Qun Liu | Andy Way
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents ProphetMT, a tree-based SMT-driven Controlled Language (CL) authoring and post-editing tool. ProphetMT employs the source-side rules in a translation model and provides them as auto-suggestions to users. Accordingly, one might say that users are writing in a Controlled Language that is understood by the computer. ProphetMT also allows users to easily attach structural information as they compose content. When a specific rule is selected, a partial translation is promptly generated on-the-fly with the help of the structural information. Our experiments conducted on English-to-Chinese show that our proposed ProphetMT system can not only better regularise an author’s writing behaviour, but also significantly improve translation fluency which is vital to reduce the post-editing time. Additionally, when the writing and translation process is over, ProphetMT can provide an effective colour scheme to further improve the productivity of post-editors by explicitly featuring the relations between the source and target rules.

Calculating the percentage reduction in translator effort when using machine translation
Andrzej Zydrón | Qun Liu
Proceedings of Translating and the Computer 38

Graph-Based Translation Via Graph Segmentation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Improving Phrase-Based SMT Using Cross-Granularity Embedding Similarity
Peyman Passban | Chris Hokamp | Andy Way | Qun Liu
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

Memory-enhanced Decoder for Neural Machine Translation
Mingxuan Wang | Zhengdong Lu | Hang Li | Qun Liu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Combining Translation Memories and Syntax-Based SMT: Experiments with Real Industrial Data
Liangyou Li | Carla Parra Escartin | Qun Liu
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

Phrase-Level Combination of SMT and TM Using Constrained Word Lattice
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Topic-Informed Neural Machine Translation
Jian Zhang | Liangyou Li | Andy Way | Qun Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In recent years, neural machine translation (NMT) has demonstrated state-of-the-art machine translation (MT) performance. It is a new approach to MT, which tries to learn a set of parameters to maximize the conditional probability of target sentences given source sentences. In this paper, we present a novel approach to improve the translation performance in NMT by conveying topic knowledge during translation. The proposed topic-informed NMT can increase the likelihood of selecting words from the same topic and domain for translation. Experimentally, we demonstrate that topic-informed NMT can achieve a 1.15 (3.3% relative) and 1.67 (5.4% relative) absolute improvement in BLEU score on the Chinese-to-English language pair using NIST 2004 and 2005 test sets, respectively, compared to NMT without topic information.

Automatic Construction of Discourse Corpora for Dialogue Translation
Longyue Wang | Xiaojun Zhang | Zhaopeng Tu | Andy Way | Qun Liu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, a novel approach is proposed to automatically construct parallel discourse corpus for dialogue machine translation. Firstly, the parallel subtitle data and its corresponding monolingual movie script data are crawled and collected from Internet. Then tags such as speaker and discourse boundary from the script data are projected to its subtitle data via an information retrieval approach in order to map monolingual discourse to bilingual texts. We not only evaluate the mapping results, but also integrate speaker information into the translation. Experiments show our proposed method can achieve 81.79% and 98.64% accuracy on speaker and dialogue boundary annotation, and speaker-based language model adaptation can obtain around 0.5 BLEU points improvement in translation qualities. Finally, we publicly release around 100K parallel discourse data with manual speaker and dialogue boundary annotation.

2015

Benchmarking SMT Performance for Farsi Using the TEP++ Corpus
Peyman Passban | Andy Way | Qun Liu
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

HandyCAT - An Open-Source Platform for CAT Tool Research
Christopher Hokamp | Qun Liu
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

MT Tuning on RED: A Dependency-Based Evaluation Metric
Liangyou Li | Hui Yu | Qun Liu
Proceedings of the Tenth Workshop on Statistical Machine Translation

Dependency Graph-to-String Translation
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Encoding Source Language with Convolutional Neural Network for Machine Translation
Fandong Meng | Zhengdong Lu | Mingxuan Wang | Hang Li | Wenbin Jiang | Qun Liu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Benchmarking SMT Performance for Farsi Using the TEP++ Corpus
Peyman Passban | Andy Way | Qun Liu
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

HandyCAT - An Open-Source Platform for CAT Tool Research
Chris Hokamp | Qun Liu
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

ParFDA for Fast Deployment of Accurate Statistical Machine Translation Systems, Benchmarks, and Statistics
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Tenth Workshop on Statistical Machine Translation

The EXPERT project: Advancing the state of the art in hybrid translation technologies
Constantin Orasan | Alessandro Cattelan | Gloria Corpas Pastor | Josef van Genabith | Manuel Herranz | Juan José Arevalillo | Qun Liu | Khalil Sima’an | Lucia Specia
Proceedings of Translating and the Computer 37

CASICT-DCU Participation in WMT2015 Metrics Task
Hui Yu | Qingsong Ma | Xiaofeng Wu | Qun Liu
Proceedings of the Tenth Workshop on Statistical Machine Translation

Automatic Adaptation of Annotations
Wenbin Jiang | Yajuan Lü | Liang Huang | Qun Liu
Computational Linguistics, Volume 41, Issue 1 - March 2015

Referential Translation Machines for Predicting Translation Quality and Related Statistics
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Tenth Workshop on Statistical Machine Translation

The DCU Discourse Parser for Connective, Argument Identification and Explicit Sense Classification
Longyue Wang | Chris Hokamp | Tsuyoshi Okita | Xiaojun Zhang | Qun Liu
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task

The DCU Discourse Parser: A Sense Classification Task
Tsuyoshi Okita | Longyue Wang | Qun Liu
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task

genCNN: A Convolutional Architecture for Word Sequence Prediction
Mingxuan Wang | Zhengdong Lu | Hang Li | Wenbin Jiang | Qun Liu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

A discriminative framework of integrating translation memory features into SMT
Liangyou Li | Andy Way | Qun Liu
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Combining Translation Memory (TM) with Statistical Machine Translation (SMT) together has been demonstrated to be beneficial. In this paper, we present a discriminative framework which can integrate TM into SMT by incorporating TM-related feature functions. Experiments on English–Chinese and English–French tasks show that our system using TM feature functions only from the best fuzzy match performs significantly better than the baseline phrase- based system on both tasks, and our discriminative model achieves comparable results to those of an effective generative model which uses similar features. Furthermore, with the capacity of handling a large amount of features in the discriminative framework, we propose a method to efficiently use multiple fuzzy matches which brings more feature functions and further significantly improves our system.

RED: A Reference Dependency Based MT Evaluation Metric
Hui Yu | Xiaofeng Wu | Jun Xie | Wenbin Jiang | Qun Liu | Shouxun Lin
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Syntactic SMT Using a Discriminative Text Generation Model
Yue Zhang | Kai Song | Linfeng Song | Jingbo Zhu | Qun Liu
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems
Ergun Biçici | Qun Liu | Andy Way
Proceedings of the Ninth Workshop on Statistical Machine Translation

Active Learning for Post-Editing Based Incrementally Retrained MT
Aswarth Abhilash Dara | Josef van Genabith | Qun Liu | John Judge | Antonio Toral
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Transformation and Decomposition for Efficiently Implementing and Improving Dependency-to-String Model In Moses
Liangyou Li | Jun Xie | Andy Way | Qun Liu
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

The DCU-ICTCAS MT system at WMT 2014 on German-English Translation Task
Liangyou Li | Xiaofeng Wu | Santiago Cortés Vaíllo | Jun Xie | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tutorial Abstracts
Qun Liu | Fei Xia
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tutorial Abstracts

RED, The DCU-CASICT Submission of Metrics Tasks
Xiaofeng Wu | Hui Yu | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

Augment Dependency-to-String Translation with Fixed and Floating Structures
Jun Xie | Jinan Xu | Qun Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

DCU Terminology Translation System for Medical Query Subtask at WMT14
Tsuyoshi Okita | Ali Vahid | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

A Structured Language Model for Incremental Tree-to-String Translation
Heng Yu | Haitao Mi | Liang Huang | Qun Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

A Dependency Edge-based Transfer Model for Statistical Machine Translation
Hongshen Chen | Jun Xie | Fandong Meng | Wenbin Jiang | Qun Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Review and analysis of China workshop on machine translation 2013 evaluation
Sitong Yang | Heng Yu | Hongmei Zhao | Qun Liu | Yajuan Lü
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

This paper gives a general review and detailed analysis of China Workshop on Machine Translation (CWMT) Evaluation. Compared with the past CWMT evaluation campaigns, CWMT2013 evaluation is characterized as follows: first, adopting gray-box evaluation which makes the results more replicable and controllable; second, adding one rule-based system as a counterpart; third, carrying out manual evaluations on some specific tasks to give a more comprehensive analysis of the translation errors. Boosted by those new features, our analysis and case study on the evaluation results shows the pros and cons of both rule-based and statistical systems, and reveals some interesting correlations bewteen automatic and manual evaluation metrics on different translation systems.

DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation task
Xiaofeng Wu | Rejwanul Haque | Tsuyoshi Okita | Piyush Arora | Andy Way | Qun Liu
Proceedings of the Ninth Workshop on Statistical Machine Translation

A probabilistic feature-based fill-up for SMT
Jian Zhang | Liangyou Li | Andy Way | Qun Liu
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

In this paper, we describe an effective translation model combination approach based on the estimation of a probabilistic Support Vector Machine (SVM). We collect domain knowledge from both in-domain and general-domain corpora inspired by a commonly used data selection algorithm, which we then use as features for the SVM training. Drawing on previous work on binary-featured phrase table fill-up (Nakov, 2008; Bisazza et al., 2011), we substitute the binary feature in the original work with our probabilistic domain-likeness feature. Later, we design two experiments to evaluate the proposed probabilistic feature-based approach on the French-to-English language pair using data provided at WMT07, WMT13 and IWLST11 translation tasks. Our experiments demonstrate that translation performance can gain significant improvements of up to +0.36 and +0.82 BLEU scores by using our probabilistic feature-based translation model fill-up approach compared with the binary featured fill-up approach in both experiments.

Annotation Adaptation and Language Adaptation in NLP
Qun Liu
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Modeling Term Translation for Document-informed Machine Translation
Fandong Meng | Deyi Xiong | Wenbin Jiang | Qun Liu
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Bilingually-Guided Monolingual Dependency Grammar Induction
Kai Liu | Yajuan Lü | Wenbin Jiang | Qun Liu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
Wenbin Jiang | Meng Sun | Yajuan Lü | Yating Yang | Qun Liu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Shallow Semantically-Informed PBSMT and HPBSMT
Tsuyoshi Okita | Qun Liu | Josef van Genabith
Proceedings of the Eighth Workshop on Statistical Machine Translation

Stem Translation with Affix-Based Rule Selection for Agglutinative Languages
Zhiyang Wang | Yajuan Lü | Meng Sun | Qun Liu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Iterative Transformation of Annotation Guidelines for Constituency Parsing
Xiang Li | Wenbin Jiang | Yajuan Lü | Qun Liu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Topic-Triggered Language Model for Statistical Machine Translation
Heng Yu | Jinsong Su | Yajuan Lv | Qun Liu
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Machine Translation in CNGL II
Qun Liu
Proceedings of Machine Translation Summit XIV: European projects

The CNGL-DCU-Prompsit Translation Systems for WMT13
Raphael Rubino | Antonio Toral | Santiago Cortés Vaíllo | Jun Xie | Xiaofeng Wu | Stephen Doherty | Qun Liu
Proceedings of the Eighth Workshop on Statistical Machine Translation

A Novel Graph-based Compact Representation of Word Alignment
Qun Liu | Zhaopeng Tu | Shouxun Lin
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Translation with Source Constituency and Dependency Trees
Fandong Meng | Jun Xie | Linfeng Song | Yajuan Lü | Qun Liu
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

DCU Participation in WMT2013 Metrics Task
Xiaofeng Wu | Hui Yu | Qun Liu
Proceedings of the Eighth Workshop on Statistical Machine Translation

Improving Alignment of System Combination by Using Multi-objective Optimization
Tian Xia | Zongcheng Ji | Shaodan Zhai | Yidong Chen | Qun Liu | Shaojun Wang
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation
Guosheng Ben | Deyi Xiong | Zhiyang Teng | Yajuan Lü | Qun Liu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

A Topic Similarity Model for Hierarchical Phrase-based Translation
Xinyan Xiao | Deyi Xiong | Min Zhang | Qun Liu | Shouxun Lin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

ICT:A System Combination for Chinese Semantic Dependency Parsing
Hao Xiong | Qun Liu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation
Wenbin Jiang | Fandong Meng | Qun Liu | Yajuan Lü
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

ICT: A Translation based Method for Cross-lingual Textual Entailment
Fandong Meng | Hao Xiong | Qun Liu
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Discriminative Boosting from Dictionary and Raw Text – A Novel Approach to Build A Chinese Word Segmenter
Fandong Meng | Wenbin Jiang | Hao Xiong | Qun Liu
Proceedings of COLING 2012: Posters

Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification
Zhaopeng Tu | Yifan He | Jennifer Foster | Josef van Genabith | Qun Liu | Shouxun Lin
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Left-to-Right Tree-to-String Decoding with Prediction
Yang Feng | Yang Liu | Qun Liu | Trevor Cohn
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Hierarchical Chunk-to-String Translation
Yang Feng | Dongdong Zhang | Mu Li | Qun Liu
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Combining Multiple Alignments to Improve Machine Translation
Zhaopeng Tu | Yang Liu | Yifan He | Josef van Genabith | Qun Liu | Shouxun Lin
Proceedings of COLING 2012: Posters

ICT: System Description for CoNLL-2012
Hao Xiong | Qun Liu
Joint Conference on EMNLP and CoNLL - Shared Task

Unsupervised Discriminative Induction of Synchronous Grammar for Machine Translation
Xinyan Xiao | Deyi Xiong | Yang Liu | Qun Liu | Shouxun Lin
Proceedings of COLING 2012

System Combination with Extra Alignment Information
Xiaofeng Wu | Tsuyoshi Okita | Josef van Genabith | Qun Liu
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information
Jinsong Su | Hua Wu | Haifeng Wang | Yidong Chen | Xiaodong Shi | Huailin Dong | Qun Liu
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011

Multi-granularity Word Alignment and Decoding for Agglutinative Language Translation
Zhiyang Wang | Yajuan Lü | Qun Liu
Proceedings of Machine Translation Summit XIII: Papers

Relaxed Cross-lingual Projection of Constituent Syntax
Wenbin Jiang | Qun Liu | Yajuan Lv
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
Xinyan Xiao | Yang Liu | Qun Liu | Shouxun Lin
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Adjoining Tree-to-String Translation
Yang Liu | Qun Liu | Yajuan Lü
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Maximum Rank Correlation Training for Statistical Machine Translation
Daqi Zheng | Yifan He | Yang Liu | Qun Liu
Proceedings of Machine Translation Summit XIII: Papers

ETS: An Error Tolerable System for Coreference Resolution
Hao Xiong | Linfeng Song | Fandong Meng | Yang Liu | Qun Liu | Yajuan Lv
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

Bagging-based System Combination for Domain Adaption
Linfeng Song | Haitao Mi | Yajuan Lü | Qun Liu
Proceedings of Machine Translation Summit XIII: Papers

Feedback Selecting of Manually Acquired Rules Using Automatic Evaluation
Xianhua Li | Yajuan Lü | Yao Meng | Qun Liu | Hao Yu
Proceedings of the 4th Workshop on Patent Translation

A novel dependency-to-string model for statistical machine translation
Jun Xie | Haitao Mi | Qun Liu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Extracting Hierarchical Rules from a Weighted Alignment Matrix
Zhaopeng Tu | Yang Liu | Qun Liu | Shouxun Lin
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

The ICT statistical machine translation system for IWSLT 2010
Hao Xiong | Jun Xie | Hui Yu | Kai Liu | Wei Luo | Haitao Mi | Yang Liu | Yajuan Lü | Qun Liu
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

Dependency Parsing and Projection Based on Word-Pair Classification
Wenbin Jiang | Qun Liu
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su | Yang Liu | Yajuan Lv | Haitao Mi | Qun Liu
Proceedings of the ACL 2010 Conference Short Papers

Statistical Translation Model Based On Source Syntax Structure
Qun Liu | Yang Liu | Haitao Mi
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Joint Parsing and Translation
Yang Liu | Qun Liu
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules
Zhiyang Wang | Yajuan Lv | Qun Liu | Young-Sook Hwang
Proceedings of the ACL 2010 Conference Short Papers

Effective Constituent Projection across Languages
Wenbin Jiang | Yajuan Lv | Yang Liu | Qun Liu
Coling 2010: Posters

An Efficient Shift-Reduce Decoding Algorithm for Phrased-Based Machine Translation
Yang Feng | Haitao Mi | Yang Liu | Qun Liu
Coling 2010: Posters

Discriminative Word Alignment by Linear Modeling
Yang Liu | Qun Liu | Shouxun Lin
Computational Linguistics, Volume 36, Issue 3 - September 2010

Machine Translation with Lattices and Forests
Haitao Mi | Liang Huang | Qun Liu
Coling 2010: Posters

Joint Tokenization and Translation
Xinyan Xiao | Yang Liu | Young-Sook Hwang | Qun Liu | Shouxun Lin
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Dependency-Based Bracketing Transduction Grammar for Statistical Machine Translation
Jinsong Su | Yang Liu | Haitao Mi | Hongmei Zhao | Yajuan Lv | Qun Liu
Coling 2010: Posters

Dependency Forest for Statistical Machine Translation
Zhaopeng Tu | Yang Liu | Young-Sook Hwang | Qun Liu | Shouxun Lin
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Constituency to Dependency Translation with Forests
Haitao Mi | Qun Liu
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

2009

Improving Tree-to-Tree Translation with Packed Forests
Yang Liu | Yajuan Lü | Qun Liu
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Introduction to China’s CWMT2008 Machine Translation Evaluation
Hongmei Zhao | Jun Xie | Qun Liu | Yajuan Lü | Dongdong Zhang | Mu Li
Proceedings of Machine Translation Summit XII: Papers

Bilingually-Constrained (Monolingual) Shift-Reduce Parsing
Liang Huang | Wenbin Jiang | Qun Liu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Reducing SMT Rule Table with Monolingual Key Phrase
Zhongjun He | Yao Meng | Yajuan Lü | Hao Yu | Qun Liu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Lattice-based System Combination for Statistical Machine Translation
Yang Feng | Yang Liu | Haitao Mi | Qun Liu | Yajuan Lü
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study
Wenbin Jiang | Liang Huang | Qun Liu
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions
Zhixiang Ren | Yajuan Lü | Jie Cao | Qun Liu | Yun Huang
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

Automatic Adaptation of Annotation Standards for Dependency Parsing ? Using Projected Treebank as Source Corpus
Wenbin Jiang | Qun Liu
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

Sub-Sentence Division for Tree-Based Machine Translation
Hao Xiong | Wenwen Xu | Haitao Mi | Yang Liu | Qun Liu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Weighted Alignment Matrices for Statistical Machine Translation
Yang Liu | Tian Xia | Xinyan Xiao | Qun Liu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

The ICT statistical machine translation system for the IWSLT 2009
Haitao Mi | Yang Li | Tian Xia | Xinyan Xiao | Yang Feng | Jun Xie | Hao Xiong | Zhaopeng Tu | Daqi Zheng | Yanjuan Lu | Qun Liu
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the ICT Statistical Machine Translation systems that used in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2009. For this year’s evaluation, we participated in the Challenge Task (Chinese-English and English-Chinese) and BTEC Task (Chinese-English). And we mainly focus on one new method to improve single system’s translation quality. Specifically, we developed a sentence-similarity based development set selection technique. For each task, we finally submitted the single system who got the maximum BLEU scores on the selected development set. The four single translation systems are based on different techniques: a linguistically syntax-based system, two formally syntax-based systems and a phrase-based system. Typically, we didn’t use any rescoring or system combination techniques in this year’s evaluation.

Joint Decoding with Multiple Translation Models
Yang Liu | Haitao Mi | Yang Feng | Qun Liu
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Improving Statistical Machine Translation using Lexicalized Rule Selection
Zhongjun He | Qun Liu | Shouxun Lin
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

The ICT system description for IWSLT 2008.
Yang Liu | Zhongjun He | Haitao Mi | Yun Huang | Yang Feng | Wenbin Jiang | Yajuan Lu | Qun Liu
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper presents a description for the ICT systems involved in the IWSLT 2008 evaluation campaign. This year, we participated in Chinese-English and English-Chinese translation directions. Four statistical machine translation systems were used: one linguistically syntax-based, two formally syntax-based, and one phrase-based. The outputs of the four SMT systems were fed to a sentence-level system combiner, which was expected to produce better translations than single systems. We will report the results of the four single systems and the combiner on both the development and test sets.

Maximum Entropy based Rule Selection Model for Syntax-based Statistical Machine Translation
Qun Liu | Zhongjun He | Yang Liu | Shouxun Lin
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Forest-Based Translation
Haitao Mi | Liang Huang | Qun Liu
Proceedings of ACL-08: HLT

Partial Matching Strategy for Phrase-based Statistical Machine Translation
Zhongjun He | Qun Liu | Shouxun Lin
Proceedings of ACL-08: HLT, Short Papers

Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging
Wenbin Jiang | Haitao Mi | Qun Liu
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
Wenbin Jiang | Liang Huang | Qun Liu | Yajuan Lü
Proceedings of ACL-08: HLT

Refinements in BTG-based Statistical Machine Translation
Deyi Xiong | Min Zhang | AiTi Aw | Haitao Mi | Qun Liu | Shouxun Lin
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

2007

Improving Statistical Machine Translation Performance by Training Data Selection and Optimization
Yajuan Lü | Jin Huang | Qun Liu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

A Dependency Treelet String Correspondence Model for Statistical Machine Translation
Deyi Xiong | Qun Liu | Shouxun Lin
Proceedings of the Second Workshop on Statistical Machine Translation

The ICT statistical machine translation systems for IWSLT 2007
Zhongjun He | Haitao Mi | Yang Liu | Deyi Xiong | Weihua Luo | Yun Huang | Zhixiang Ren | Yajuan Lu | Qun Liu
Proceedings of the Fourth International Workshop on Spoken Language Translation

In this paper, we give an overview of the ICT statistical machine translation systems for the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2007. In this year’s evaluation, we participated in the Chinese-English transcript translation task, and developed three systems based on different techniques: a formally syntax-based system Bruin, an extended phrase-based system Confucius and a linguistically syntax-based system Lynx. We will describe the models of these three systems, and compare their performance in detail. We set Bruin as our primary system, which ranks 2 among the 15 primary results according to the official evaluation results.

Forest-to-String Statistical Translation Rules
Yang Liu | Yun Huang | Qun Liu | Shouxun Lin
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Tree-to-String Alignment Template for Statistical Machine Translation
Yang Liu | Qun Liu | Shouxun Lin
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation
Deyi Xiong | Qun Liu | Shouxun Lin
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

Introduction to China’s HTRDP Machine Translation Evaluation
Qun Liu | Hongxu Hou | Shouxun Lin | Yueliang Qian | Yujie Zhang | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Invited papers

Since 1994, China’s HTRDP machine translation evaluation has been conducted for five times. Systems of various translation directions between Chinese, English, Japanese and French have been tested. Both human evaluation and automatic evaluation are conducted in HTRDP evaluation. In recent years, the evaluation was organized jointly with NICT of Japan. This paper introduces some details of this evaluation.

Log-Linear Models for Word Alignment
Yang Liu | Qun Liu | Shouxun Lin
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

Parsing the Penn Chinese Treebank with Semantic Knowledge
Deyi Xiong | Shuanglong Li | Qun Liu | Shouxun Lin | Yueliang Qian
Second International Joint Conference on Natural Language Processing: Full Papers

A Multi-aligner for Japanese-Chinese Parallel Corpora
Yujie Zhang | Qun Liu | Qing Ma | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

Automatic word alignment is an important technology for extracting translation knowledge from parallel corpora. However, automatic techniques cannot resolve this problem completely because of variances in translations. We therefore need to investigate the performance potential of automatic word alignment and then decide how to suitably apply it. In this paper we first propose a lexical knowledge-based approach to word alignment on a Japanese-Chinese corpus. Then we evaluate the performance of the proposed approach on the corpus. At the same time we also apply a statistics-based approach, the well-known toolkit GIZA++, to the same test data. Through comparison of the performances of the two approaches, we propose a multi-aligner, exploiting the lexical knowledge-based aligner and the statistics-based aligner at the same time. Quantitative results confirmed the effectiveness of the multi-aligner.

2003

HHMM-based Chinese Lexical Analyzer ICTCLAS
Hua-Ping Zhang | Hong-Kui Yu | De-Yi Xiong | Qun Liu
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

Chinese Lexical Analysis Using Hierarchical Hidden Markov Model
Hua-Ping Zhang | Qun Liu | Xue-Qi Cheng | Hao Zhang | Hong-Kui Yu
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

Chinese Named Entity Recognition Using Role Model
Hua-Ping Zhang | Qun Liu | Hong-Kui Yu | Xue-Qi Cheng | Shuo Bai
International Journal of Computational Linguistics & Chinese Language Processing, Volume 8, Number 2, August 2003

2002

A Character-net Based Chinese Text Segmentation Method
Lixin Zhou | Qun Liu
COLING-02: SEMANET: Building and Using Semantic Networks

Automatic Recognition of Chinese Unknown Words Based on Roles Tagging
Kevin Zhang | Qun Liu | Hao Zhang | Xue-Qi Cheng
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

基於《知網》的辭彙語義相似度計算 (Word Similarity Computing Based on How-net) [In Chinese]
Qun Liu | Sujian Li
International Journal of Computational Linguistics & Chinese Language Processing, Volume 7, Number 2, August 2002: Special Issue on Computational Chinese Lexical Semantics

1998

TransEasy: A Chinese-English machine translation system based on hybrid approach
Qun Liu | Shiwen Yu
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: System Descriptions

This paper describes the progress of a machine translation system from Chinese to English. The system is based on a reusable platform of MT software components. It’s a rule-based system, and some statistical algorithms are used as heuristic functions in parsing as well. There are about 50,000 Chinese words and 400 global parsing rules in the system. The system got a good result in a public test of MT system in China in Mar. 1998. It is a research vehicle up to now.

Co-authors

Yasheng Wang 21

Wenbin Jiang 20

Deyi Xiong (德意熊) 16

Xiaoguang Li 13

Xingshan Zeng 12

Fandong Meng 10

Yvette Graham 9

Peyman Passban 8

Mehdi Rezagholizadeh 7

Josef van Genabith 7

Tsuyoshi Okita 6

Mingxuan Wang 5

Iacer Calixto 4

Sujian Li (李素建) 4

Xiaojun Zhang 4

David Alfonso-Hermelo 3

Hongshen Chen 3

Xueqi Cheng (程学旗) 3

Koel Dutta Chowdhury 3

Jennifer Foster 3

Abbas Ghaddar 3

Young-Sook Hwang 3

Hua Wu (吴华) 3

Hua-Ping Zhang 3

Zhengyan Zhang 3

Timothy Baldwin 2

Santiago Cortés Vaíllo 2

Daria Dzendzik 2

Mianqiu Huang 2

Ignacio Iacobacci 2

Hitoshi Isahara 2

Ehsan Kamalloo 2

Haider Khalid 2

Wai Chung Kwan 2

Gerasimos Lampouras 2

Bingquan Liu (刘秉权) 2

Rajashekar Maragoud 2

Odunayo Ogundepo 2

Carla Parra Escartín 2

Yueliang Qian 2

Xipeng Qiu (邱锡鹏) 2

Raphael Rubino 2

Cheng-Jie Sun (孙承杰) 2

Nandan Thakur 2

Antonio Toral 2

Jiansheng Wei 2

Chenghao Yang 2

Dit-Yan Yeung 2

Dongdong Zhang 2

Jinchao Zhang 2

Yujie Zhang (张玉洁) 2

Ruiqing Zhang 2

Mohammad Alomrani 1

Ashjan Alsulaimani 1

Pradeep Bagavan 1

Anurag Beniwal 1

Mahdi Biparva 1

Dasha Bogdanova 1

Ondřej Bojar 1

Luiz Bonifacio 1

Nick Campbell 1

Alessandro Cattelan 1

Lidia S. Chao 1

Rajen Chatterjee 1

Dongsheng Chen 1

Huanhuan Chen 1

Guangyong Chen 1

Gloria Corpas Pastor 1

Aswarth Abhilash Dara 1

Mohammad Dehghan 1

Stephen Doherty 1

Xianghong Fang 1

Christian Federmann 1

Yang Gao (扬高) 1

Sebastian Gehrmann 1

Gholamreza Haffari 1

Md. Akmal Haidar 1

Xu Han (韩旭) 1

Guang-Yuan Hao 1

Rejwanul Haque 1

Mohammed Hasanuzzaman 1

Manuel Herranz 1

Shujian Huang (书剑黄) 1

Xuan-Jing Huang (黄萱菁) 1

Matthias Huck 1

Wangjie Jiang 1

Juan José Arevalillo 1

Philipp Koehn 1

Chuqiao Kuang 1

Philippe Langlais 1

Chak Tou Leong 1

Shuanglong Li 1

Xiaodan Liang 1

Mark Liberman 1

Chang Liu (刘畅) 1

Zhengying Liu 1

Varvara Logacheva 1

Michael R. Lyu 1

Wolfgang Macherey 1

Sinead Madden 1

Mieradilijiang Maimaiti 1

Alfredo Maldonado 1

Christof Monz 1

Sudip Kumar Naskar 1

Constantin Orasan 1

Alexander O’Connor 1

Prasanna Parthasarathi 1

Alberto Poncelas 1

Peerachet Porkaew 1

Rameez Qureshi 1

Mohammed Rameez Qureshi 1

Manikandarajan Ramanathan 1

Puneeth Saladi 1

Carolina Scarton 1

David Schlangen 1

Hinrich Schütze 1

Xiaodong Shi (史晓东) 1

Khalil Sima’an 1

Sidarth Srinivasan 1

Ryuichi Takanobu 1

Ningning Wang 1

Liangwei Wang 1

Derek F. Wong (黄辉) 1

Chenhao Xiong 1

Ruifeng Xu (徐睿峰) 1

Jinan Xu (徐金安) 1

Nicholas Jing Yuan 1

Yingxue Zhang 1

Zhengkun Zhang 1

Crystina Zhang 1

Chuanqiang Zhang 1

Yuanhang Zheng 1

Chuanyang Zheng 1

Yingying Zhuang 1

Yuzheng Zhuang 1

Andrzej Zydroń 1

Venues