Yuhao Zhang

Also published as: 裕浩 , YuHao Zhang


2024

pdf bib
Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models
Zhengxuan Wu | Yuhao Zhang | Peng Qi | Yumo Xu | Rujun Han | Yian Zhang | Jifan Chen | Bonan Min | Zhiheng Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Modern language models (LMs) need to follow human instructions while being faithful; yet, they often fail to achieve both. Here, we provide concrete evidence of a trade-off between instruction following (i.e., follow open-ended instructions) and faithfulness (i.e., ground responses in given context) when training LMs with these objectives. For instance, fine-tuning LLaMA-7B on instruction following datasets renders it less faithful. Conversely, instruction-tuned Vicuna-7B shows degraded performance at following instructions when further optimized on tasks that require contextual grounding. One common remedy is multi-task learning (MTL) with data mixing, yet it remains far from achieving a synergic outcome. We propose a simple yet effective method that relies on Reject-sampling by Self-instruct with Continued Fine-tuning (ReSet), which significantly outperforms vanilla MTL. Surprisingly, we find that less is more, as training ReSet with high-quality, yet substantially smaller data (three-fold less) yields superior results. Our findings offer a better understanding of objective discrepancies in alignment training of LMs.

pdf bib
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Rujun Han | Yuhao Zhang | Peng Qi | Yumo Xu | Jenyuan Wang | Lan Liu | William Yang Wang | Bonan Min | Vittorio Castelli
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA’s answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM’s answers are preferred to LFRQA’s answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.

pdf bib
CodeFort: Robust Training for Code Generation Models
Yuhao Zhang | Shiqi Wang | Haifeng Qian | Zijian Wang | Mingyue Shang | Linbo Liu | Sanjay Krishna Gouda | Baishakhi Ray | Murali Krishna Ramanathan | Xiaofei Ma | Anoop Deoras
Findings of the Association for Computational Linguistics: EMNLP 2024

Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the performance of these models. Although improving the robustness of code generation models is crucial to enhancing user experience in real-world applications, existing research efforts do not address this issue. To fill this gap, we propose CodeFort, a framework to improve the robustness of code generation models, generalizing a large variety of code perturbations to enrich the training data and enabling various robust training strategies, mixing data augmentation, batch augmentation, adversarial logits pairing, and contrastive learning, all carefully designed to support high-throughput training. Extensive evaluations show that we increase the average robust pass rates of baseline CodeGen models from 14.79 to 21.74. We notably decrease the robustness drop rate from 95.02% to 54.95% against code-syntax perturbations.

pdf bib
Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
Yi Luo | Zhenghao Lin | YuHao Zhang | Jiashuo Sun | Chen Lin | Chengjin Xu | Xiangdong Su | Yelong Shen | Jian Guo | Yeyun Gong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage.Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model.We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.

2023

pdf bib
CTC-based Non-autoregressive Speech Translation
Chen Xu | Xiaoqian Liu | Xiaowen Liu | Qingxuan Sun | Yuhao Zhang | Murun Yang | Qianqian Dong | Tom Ko | Mingxuan Wang | Tong Xiao | Anxiang Ma | Jingbo Zhu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST).In particular, we develop a model consisting of two encoders that are guided by CTC to predict the source and target texts, respectively. Introducing CTC into NAST on both language sides has obvious challenges: 1) the conditional independent generation somewhat breaks the interdependency among tokens, and 2) the monotonic alignment assumption in standard CTC does not hold in translation tasks. In response, we develop a prediction-aware encoding approach and a cross-layer attention approach to address these issues. We also use curriculum learning to improve convergence of training. Experiments on the MuST-C ST benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×, which is comparable to the autoregressive counterpart and even outperforms the previous best result of 0.9 BLEU points.

pdf bib
基于多尺度建模的端到端自动语音识别方法(An End-to-End Automatic Speech Recognition Method Based on Multiscale Modeling)
Hao Chen (陈昊) | Runlai Zhang (张润来) | Yuhao Zhang (张裕浩) | Chenghao Gao (高成浩) | Chen Xu (许晨) | Anxiang Ma (马安香) | Tong Xiao (肖桐) | Jingbo Zhu (朱靖波)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“近年来,基于深度学习的端到端自动语音识别模型直接对语音和文本进行建模,结构简单且性能上也具有显著优势,逐渐成为主流。然而,由于连续的语音信号与离散的文本在长度及表示尺度上存在巨大差异,二者间的模态鸿沟问题是该类任务一直存在的困扰。为解决该问题,本文提出了多尺度语音识别建模方法,该方法从利用细粒度分布知识的角度出发,建立多个不同尺度形式的文本信息,将特征序列从细粒度的低层次序列逐步对齐预测出文本序列。这种逐级预测的方式能够有效降低预测难度,缓解模态鸿沟带来的影响,并通过融合不同尺度下特征,提高语料信息的丰富性与完整性,进一步增强模型推理能力。本文在LibriSpeech小规模、大规模和TEDLIUM2数据集上实验,相比基线系统词错误率平均降低1.7、0.45和0.76,验证了方法的有效性。”

pdf bib
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation
Yuhao Zhang | Chen Xu | Bei Li | Hao Chen | Tong Xiao | Chunliang Zhang | Jingbo Zhu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Significant improvements in end-to-end speech translation (ST) have been achieved through the application of multi-task learning. However, the extent to which auxiliary tasks are highly consistent with the ST task, and how much this approach truly helps, have not been thoroughly studied. In this paper, we investigate the consistency between different tasks, considering different times and modules. We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations. Furthermore, we propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation. We conduct experiments on the MuST-C dataset. The results demonstrate that our method attains state-of-the-art results. Moreover, when additional data is used, we achieve the new SOTA result on MuST-C English to Spanish task with 20.8% of the training time required by the current SOTA method.

pdf bib
Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Xingyu Fu | Sheng Zhang | Gukyeong Kwon | Pramuditha Perera | Henghui Zhu | Yuhao Zhang | Alexander Hanbo Li | William Yang Wang | Zhiguo Wang | Vittorio Castelli | Patrick Ng | Dan Roth | Bing Xiang
Findings of the Association for Computational Linguistics: ACL 2023

The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias – the tendency to generate certain tokens over other tokens regardless of prompt changes, and high dependency on the PLM quality – only models using GPT-3 can achieve the best result. To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard to train a multi-modal model that directly generates the VQA answer, {pasted macro ‘MODEL’}name first adopts PLM to generate all the possible answers, and then trains a lightweight answer selection model for the correct answer. As proved in our analysis, RASO expands the knowledge coverage from in-domain training data by a large margin. We provide extensive experimentation and show the effectiveness of our pipeline by advancing the state-of-the-art by 4.1% on OK-VQA, without additional computation cost.

pdf bib
RobustQA: Benchmarking the Robustness of Domain Adaptation for Open-Domain Question Answering
Rujun Han | Peng Qi | Yuhao Zhang | Lan Liu | Juliette Burger | William Yang Wang | Zhiheng Huang | Bing Xiang | Dan Roth
Findings of the Association for Computational Linguistics: ACL 2023

Open-domain question answering (ODQA) is a crucial task in natural language processing. A typical ODQA system relies on a retriever module to select relevant contexts from a large corpus for a downstream reading comprehension model. Existing ODQA datasets consist mainly of Wikipedia corpus, and are insufficient to study models’ generalizability across diverse domains as models are trained and evaluated on the same genre of data. We propose **RobustQA**, a novel benchmark consisting of datasets from 8 different domains, which facilitates the evaluation of ODQA’s domain robustness. To build **RobustQA**, we annotate QA pairs in retrieval datasets with rigorous quality control. We further examine improving QA performances by incorporating unsupervised learning methods with target-domain corpus and adopting large generative language models. These methods can effectively improve model performances on **RobustQA**. However, experimental results demonstrate a significant gap from in-domain training, suggesting that **RobustQA** is a challenging benchmark to evaluate ODQA domain robustness.

pdf bib
Improving Cross-task Generalization of Unified Table-to-text Models with Compositional Task Configurations
Jifan Chen | Yuhao Zhang | Lan Liu | Rui Dong | Xinchi Chen | Patrick Ng | William Yang Wang | Zhiheng Huang
Findings of the Association for Computational Linguistics: ACL 2023

There has been great progress in unifying various table-to-text tasks using a single encoder-decoder model trained via multi-task learning (Xie et al., 2022).However, existing methods typically encode task information with a simple dataset name as a prefix to the encoder. This not only limits the effectiveness of multi-task learning, but also hinders the model’s ability to generalize to new domains or tasks that were not seen during training, which is crucial for real-world applications. In this paper, we propose compositional task configurations, a set of prompts prepended to the encoder to improve cross-task generalization of unified models. We design the task configurations to explicitly specify the task type, as well as its input and output types. We show that this not only allows the model to better learn shared knowledge across different tasks at training, but also allows us to control the model by composing new configurations that apply novel input-output combinations in a zero-shot manner. We demonstrate via experiments over ten table-to-text tasks that our method outperforms the UnifiedSKG baseline by noticeable margins in both in-domain and zero-shot settings, with average improvements of +0.5 and +12.6 from using a T5-large backbone, respectively.

pdf bib
Bridging the Granularity Gap for Acoustic Modeling
Chen Xu | Yuhao Zhang | Chengbo Jiao | Xiaoqian Liu | Chi Hu | Xin Zeng | Tong Xiao | Anxiang Ma | Huizhen Wang | Jingbo Zhu
Findings of the Association for Computational Linguistics: ACL 2023

While Transformer has become the de-facto standard for speech, modeling upon the fine-grained frame-level features remains an open challenge of capturing long-distance dependencies and distributing the attention weights. We propose Progressive Down-Sampling (PDS) which gradually compresses the acoustic features into coarser-grained units containing more complete semantic information, like text-level representation. In addition, we develop a representation fusion method to alleviate information loss that occurs inevitably during high compression. In this way, we compress the acoustic features into 1/32 of the initial length while achieving better or comparable performances on the speech recognition task. And as a bonus, it yields inference speedups ranging from 1.20x to 1.47x.By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.

pdf bib
A Query-Parallel Machine Reading Comprehension Framework for Low-resource NER
Yuhao Zhang | Yongliang Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

Named entity recognition (NER) is a fundamental task in natural language processing. Recently, NER has been formulated as a machine reading comprehension (MRC) task, in which manually-crafted queries are used to extract entities of different types. However, current MRC-based NER techniques are limited to extracting a single type of entities at a time and are largely geared towards resource-rich settings. This renders them inefficient during the inference phase, while also leaving their potential untapped for utilization in low-resource settings. We suggest a query-parallel MRC-based approach to address these issues, which is capable of extracting multiple entity types concurrently and is applicable to both resource-rich and resource-limited settings. Specifically, we propose a query-parallel encoder which uses a query-segmented attention mechanism to isolate the semantics of queries and model the query-context interaction with a unidirectional flow. This allows for easier generalization to new entity types or transfer to new domains. After obtaining the query and context representations through the encoder, they are fed into a query-conditioned biaffine predictor to extract multiple entities at once. The model is trained with parameter-efficient tuning technique, making it more data-efficient. We conduct extensive experiments and demonstrate that our model performs competitively against strong baseline methods in resource-rich settings, and achieves state-of-the-art results in low-resource settings, including training-from-scratch, in-domain transfer and cross-domain transfer tasks.

pdf bib
Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
Kaiser Sun | Peng Qi | Yuhao Zhang | Lan Liu | William Wang | Zhiheng Huang
Findings of the Association for Computational Linguistics: EMNLP 2023

Generative models have been widely applied to solve extractive tasks, where parts of the input is extracted to form the desired output, and achieved significant success. For example, in extractive question answering (QA), generative models have constantly yielded state-of-the-art results. In this work, we study the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks after the input and output are tokenized inconsistently by the tokenizer, and thus leads to performance drop as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better in both in-domain and out-of-domain datasets, with a notable average of +1.7 F1 gain when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster, and becomes less likely to generate out-of-context answers. Our results demonstrate the need for increased scrutiny regarding how tokenization is done in extractive tasks and the benefits of consistent tokenization during training.

pdf bib
The NiuTrans End-to-End Speech Translation System for IWSLT23 English-to-Chinese Offline Task
Yuchen Han | Xiaoqian Liu | Hao Chen | Yuhao Zhang | Chen Xu | Tong Xiao | Jingbo Zhu
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper describes the NiuTrans end-to-end speech translation system submitted for the IWSLT 2023 English-to-Chinese offline task. Our speech translation models are composed of pre-trained ASR and MT models under the SATE framework. Several pre-trained models with diverse architectures and input representations (e.g., log Mel-filterbank and waveform) were utilized. We proposed an IDA method to iteratively improve the performance of the MT models and generate the pseudo ST data through MT systems. We then trained ST models with different structures and data settings to enhance ensemble performance. Experimental results demonstrate that our NiuTrans system achieved a BLEU score of 29.22 on the MuST-C En-Zh tst-COMMON set, outperforming the previous year’s submission by 0.12 BLEU despite using less MT training data.

2022

pdf bib
A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space
Yuhao Zhang | Hongji Zhu | Yongliang Wang | Nan Xu | Xiaobo Li | Binqiang Zhao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning high-quality sentence representations is a fundamental problem of natural language processing which could benefit a wide range of downstream tasks. Though the BERT-like pre-trained language models have achieved great success, using their sentence representations directly often results in poor performance on the semantic textual similarity task. Recently, several contrastive learning methods have been proposed for learning sentence representations and have shown promising results. However, most of them focus on the constitution of positive and negative representation pairs and pay little attention to the training objective like NT-Xent, which is not sufficient enough to acquire the discriminating power and is unable to model the partial order of semantics between sentences. So in this paper, we propose a new method ArcCSE, with training objectives designed to enhance the pairwise discriminative power and model the entailment relation of triplet sentences. We conduct extensive experiments which demonstrate that our approach outperforms the previous state-of-the-art on diverse sentence related tasks, including STS and SentEval.

pdf bib
The NiuTrans’s Submission to the IWSLT22 English-to-Chinese Offline Speech Translation Task
Yuhao Zhang | Canan Huang | Chen Xu | Xiaoqian Liu | Bei Li | Anxiang Ma | Tong Xiao | Jingbo Zhu
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

This paper describes NiuTrans’s submission to the IWSLT22 English-to-Chinese (En-Zh) offline speech translation task. The end-to-end and bilingual system is built by constrained English and Chinese data and translates the English speech to Chinese text without intermediate transcription. Our speech translation models are composed of different pre-trained acoustic models and machine translation models by two kinds of adapters. We compared the effect of the standard speech feature (e.g. log Mel-filterbank) and the pre-training speech feature and try to make them interact. The final submission is an ensemble of three potential speech translation models. Our single best and ensemble model achieves 18.66 BLEU and 19.35 BLEU separately on MuST-C En-Zh tst-COMMON set.

2021

pdf bib
Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders
Chen Xu | Bojie Hu | Yanyang Li | Yuhao Zhang | Shen Huang | Qi Ju | Tong Xiao | Jingbo Zhu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Encoder pre-training is promising in end-to-end Speech Translation (ST), given the fact that speech-to-translation data is scarce. But ST encoders are not simple instances of Automatic Speech Recognition (ASR) or Machine Translation (MT) encoders. For example, we find that ASR encoders lack the global context representation, which is necessary for translation, whereas MT encoders are not designed to deal with long but locally attentive acoustic sequences. In this work, we propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation. Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence. In this way, it is straightforward to incorporate the pre-trained models into the system. Also, we develop an adaptor module to alleviate the representation inconsistency between the pre-trained ASR encoder and MT encoder, and develop a multi-teacher knowledge distillation method to preserve the pre-training knowledge. Experimental results on the LibriSpeech En-Fr and MuST-C En-De ST tasks show that our method achieves state-of-the-art BLEU scores of 18.3 and 25.2. To our knowledge, we are the first to develop an end-to-end ST system that achieves comparable or even better BLEU performance than the cascaded ST counterpart when large-scale ASR and MT data is available.

pdf bib
Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain
Asma Ben Abacha | Yassine Mrabet | Yuhao Zhang | Chaitanya Shivade | Curtis Langlotz | Dina Demner-Fushman
Proceedings of the 20th Workshop on Biomedical Language Processing

The MEDIQA 2021 shared tasks at the BioNLP 2021 workshop addressed three tasks on summarization for medical text: (i) a question summarization task aimed at exploring new approaches to understanding complex real-world consumer health queries, (ii) a multi-answer summarization task that targeted aggregation of multiple relevant answers to a biomedical question into one concise and relevant answer, and (iii) a radiology report summarization task addressing the development of clinically relevant impressions from radiology report findings. Thirty-five teams participated in these shared tasks with sixteen working notes submitted (fifteen accepted) describing a wide variety of models developed and tested on the shared and external datasets. In this paper, we describe the tasks, the datasets, the models and techniques developed by various teams, the results of the evaluation, and a study of correlations among various summarization evaluation measures. We hope that these shared tasks will bring new research and insights in biomedical text summarization and evaluation.

pdf bib
Do Syntax Trees Help Pre-trained Transformers Extract Information?
Devendra Sachan | Yuhao Zhang | Peng Qi | William L. Hamilton
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Much recent work suggests that incorporating syntax information from dependency trees can improve task-specific transformer models. However, the effect of incorporating dependency tree information into pre-trained transformer models (e.g., BERT) remains unclear, especially given recent studies highlighting how these models implicitly encode syntax. In this work, we systematically study the utility of incorporating dependency trees into pre-trained transformers on three representative information extraction tasks: semantic role labeling (SRL), named entity recognition, and relation extraction. We propose and investigate two distinct strategies for incorporating dependency structure: a late fusion approach, which applies a graph neural network on the output of a transformer, and a joint fusion approach, which infuses syntax structure into the transformer attention layers. These strategies are representative of prior work, but we introduce additional model design elements that are necessary for obtaining improved performance. Our empirical analysis demonstrates that these syntax-infused transformers obtain state-of-the-art results on SRL and relation extraction tasks. However, our analysis also reveals a critical shortcoming of these models: we find that their performance gains are highly contingent on the availability of human-annotated dependency parses, which raises important questions regarding the viability of syntax-augmented transformers in real-world applications.

pdf bib
Certified Robustness to Programmable Transformations in LSTMs
Yuhao Zhang | Aws Albarghouthi | Loris D’Antoni
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Deep neural networks for natural language processing are fragile in the face of adversarial examples—small input perturbations, like synonym substitution or word duplication, which cause a neural network to change its prediction. We present an approach to certifying the robustness of LSTMs (and extensions of LSTMs) and training models that can be efficiently certified. Our approach can certify robustness to intractably large perturbation spaces defined programmatically in a language of string transformations. Our evaluation shows that (1) our approach can train models that are more robust to combinations of string transformations than those produced using existing techniques; (2) our approach can show high certification accuracy of the resulting models.

pdf bib
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
Yasuhide Miura | Yuhao Zhang | Emily Tsai | Curtis Langlotz | Dan Jurafsky
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (Delta +63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.

2020

pdf bib
Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports
Yuhao Zhang | Derek Merck | Emily Tsai | Christopher D. Manning | Curtis Langlotz
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural abstractive summarization models are able to generate summaries which have high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we develop a general framework where we evaluate the factual correctness of a generated summary by fact-checking it automatically against its reference using an information extraction module. We further propose a training strategy which optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs over a competitive neural summarization system, producing radiology summaries that approach the quality of human-authored ones.

pdf bib
Learning Architectures from an Extended Search Space for Language Modeling
Yinqiao Li | Chi Hu | Yuhao Zhang | Nuo Xu | Yufan Jiang | Tong Xiao | Jingbo Zhu | Tongran Liu | Changliang Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural architecture search (NAS) has advanced significantly in recent years but most NAS systems restrict search to learning architectures of a recurrent or convolutional cell. In this paper, we extend the search space of NAS. In particular, we present a general approach to learn both intra-cell and inter-cell architectures (call it ESS). For a better search result, we design a joint learning method to perform intra-cell and inter-cell NAS simultaneously. We implement our model in a differentiable architecture search system. For recurrent neural language modeling, it outperforms a strong baseline significantly on the PTB and WikiText data, with a new state-of-the-art on PTB. Moreover, the learned architectures show good transferability to other systems. E.g., they improve state-of-the-art systems on the CoNLL and WNUT named entity recognition (NER) tasks and CoNLL chunking task, indicating a promising line of research on large-scale pre-learned architectures.

pdf bib
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi | Yuhao Zhang | Yuhui Zhang | Jason Bolton | Christopher D. Manning
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

pdf bib
Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations
Peng Qi | Yuhao Zhang | Christopher D. Manning
Findings of the Association for Computational Linguistics: EMNLP 2020

We investigate the problem of generating informative questions in information-asymmetric conversations. Unlike previous work on question generation which largely assumes knowledge of what the answer might be, we are interested in the scenario where the questioner is not given the context from which answers are drawn, but must reason pragmatically about how to acquire new information, given the shared conversation history. We identify two core challenges: (1) formally defining the informativeness of potential questions, and (2) exploring the prohibitively large space of potential questions to find the good candidates. To generate pragmatic questions, we use reinforcement learning to optimize an informativeness metric we propose, combined with a reward function designed to promote more specific questions. We demonstrate that the resulting pragmatic questioner substantially improves the informativeness and specificity of questions generated over a baseline model, as evaluated by our metrics as well as humans.

pdf bib
The NiuTrans Machine Translation Systems for WMT20
Yuhao Zhang | Ziyang Wang | Runzhe Cao | Binghao Wei | Weiqiao Shan | Shuhan Zhou | Abudurexiti Reheman | Tao Zhou | Xin Zeng | Laohu Wang | Yongyu Mu | Jingnan Zhang | Xiaoqian Liu | Xuanjun Zhou | Yinqiao Li | Bei Li | Tong Xiao | Jingbo Zhu
Proceedings of the Fifth Conference on Machine Translation

This paper describes NiuTrans neural machine translation systems of the WMT20 news translation tasks. We participated in Japanese<->English, English->Chinese, Inuktitut->English and Tamil->English total five tasks and rank first in Japanese<->English both sides. We mainly utilized iterative back-translation, different depth and widen model architectures, iterative knowledge distillation and iterative fine-tuning. And we find that adequately widened and deepened the model simultaneously, the performance will significantly improve. Also, iterative fine-tuning strategy we implemented is effective during adapting domain. For Inuktitut->English and Tamil->English tasks, we built multilingual models separately and employed pretraining word embedding to obtain better performance.

2019

pdf bib
The NiuTrans Machine Translation Systems for WMT19
Bei Li | Yinqiao Li | Chen Xu | Ye Lin | Jiqiang Liu | Hui Liu | Ziyang Wang | Yuhao Zhang | Nuo Xu | Zeyang Wang | Kai Feng | Hexuan Chen | Tengbo Liu | Yanyang Li | Qiang Wang | Tong Xiao | Jingbo Zhu
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper described NiuTrans neural machine translation systems for the WMT 2019 news translation tasks. We participated in 13 translation directions, including 11 supervised tasks, namely EN↔{ZH, DE, RU, KK, LT}, GU→EN and the unsupervised DE↔CS sub-track. Our systems were built on Deep Transformer and several back-translation methods. Iterative knowledge distillation and ensemble+reranking were also employed to obtain stronger models. Our unsupervised submissions were based on NMT enhanced by SMT. As a result, we achieved the highest BLEU scores in {KK↔EN, GU→EN} directions, ranking 2nd in {RU→EN, DE↔CS} and 3rd in {ZH→EN, LT→EN, EN→RU, EN↔DE} among all constrained submissions.

2018

pdf bib
Graph Convolution over Pruned Dependency Trees Improves Relation Extraction
Yuhao Zhang | Peng Qi | Christopher D. Manning
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Dependency trees help relation extraction models capture long-range relations between words. However, existing dependency-based models either neglect crucial information (e.g., negation) by pruning the dependency trees too aggressively, or are computationally inefficient because it is difficult to parallelize over different tree structures. We propose an extension of graph convolutional networks that is tailored for relation extraction, which pools information over arbitrary dependency structures efficiently in parallel. To incorporate relevant information while maximally removing irrelevant content, we further apply a novel pruning strategy to the input trees by keeping words immediately around the shortest path between the two entities among which a relation might hold. The resulting model achieves state-of-the-art performance on the large-scale TACRED dataset, outperforming existing sequence and dependency-based neural models. We also show through detailed analysis that this model has complementary strengths to sequence models, and combining them further improves the state of the art.

pdf bib
Universal Dependency Parsing from Scratch
Peng Qi | Timothy Dozat | Yuhao Zhang | Christopher D. Manning
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes Stanford’s system at the CoNLL 2018 UD Shared Task. We introduce a complete neural pipeline system that takes raw text as input, and performs all tasks required by the shared task, ranging from tokenization and sentence segmentation, to POS tagging and dependency parsing. Our single system submission achieved very competitive performance on big treebanks. Moreover, after fixing an unfortunate bug, our corrected system would have placed the 2nd, 1st, and 3rd on the official evaluation metrics LAS, MLAS, and BLEX, and would have outperformed all submission systems on low-resource treebank categories on all metrics by a large margin. We further show the effectiveness of different model components through extensive ablation studies.

pdf bib
Learning to Summarize Radiology Findings
Yuhao Zhang | Daisy Yi Ding | Tianpei Qian | Christopher D. Manning | Curtis P. Langlotz
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

The Impression section of a radiology report summarizes crucial radiology findings in natural language and plays a central role in communicating these findings to physicians. However, the process of generating impressions by summarizing findings is time-consuming for radiologists and prone to errors. We propose to automate the generation of radiology impressions with neural sequence-to-sequence learning. We further propose a customized neural model for this task which learns to encode the study background information and use this information to guide the decoding process. On a large dataset of radiology reports collected from actual hospital studies, our model outperforms existing non-neural and neural baselines under the ROUGE metrics. In a blind experiment, a board-certified radiologist indicated that 67% of sampled system summaries are at least as good as the corresponding human-written summaries, suggesting significant clinical validity. To our knowledge our work represents the first attempt in this direction.

2017

pdf bib
Position-aware Attention and Supervised Data Improve Slot Filling
Yuhao Zhang | Victor Zhong | Danqi Chen | Gabor Angeli | Christopher D. Manning
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Organized relational knowledge in the form of “knowledge graphs” is important for many applications. However, the ability to populate knowledge bases with facts automatically extracted from documents has improved frustratingly slowly. This paper simultaneously addresses two issues that have held back prior work. We first propose an effective new model, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction. Then we build TACRED, a large (119,474 examples) supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations. The combination of better supervised data and a more appropriate high-capacity model enables much better relation extraction performance. When the model trained on this new dataset replaces the previous relation extraction component of the best TAC KBP 2015 slot filling system, its F1 score increases markedly from 22.2% to 26.7%.
Search
Co-authors
Fix data