2024
Tokenization Is More Than Compression
Craig W Schmidt | Varshini Reddy | Haoran Zhang | Alec Alameddine | Omri Uzan | Yuval Pinter | Chris Tanner
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer tokens lead to better downstream performance by introducing PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary. Through extensive experimentation, we find that this hypothesis does not hold, casting doubt on our understanding of the reasons for effective tokenization. To examine which other factors play a role, we evaluate design decisions across all three phases of tokenization: pre-tokenization, vocabulary construction, and segmentation, offering new insights into the design of effective tokenizers. Specifically, we illustrate the importance of pre-tokenization and the benefits of using BPE to initialize vocabulary construction. We train 64 language models with varying tokenization, ranging in size from 350M to 2.4B parameters, all of which are made publicly available.
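The minimum-token segmentation that PathPiece performs can be phrased as a shortest-path problem over the text. The sketch below is not the released PathPiece implementation; it is a minimal dynamic-programming illustration of the idea, assuming a plain Python set as the vocabulary and a cap on token length:

```python
def min_token_segmentation(text, vocab, max_token_len=16):
    """Split text into the fewest tokens drawn from vocab (shortest path / DP)."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)             # back[i] = start of the token ending at i
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == float("inf"):
        return None  # vocab cannot cover the text (real tokenizers keep all bytes)
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"t", "o", "k", "e", "n", "i", "z", "a", "ok", "en", "iz", "token", "ization"}
print(min_token_segmentation("tokenization", vocab))  # ['token', 'ization']
```

The double loop makes this O(n x max_token_len) lookups, which is why a cap on the longest vocabulary entry keeps the pass linear in practice.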
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
Siwei Wu | Yizhi Li | Kang Zhu | Ge Zhang | Yiming Liang | Kaijing Ma | Chenghao Xiao | Haoran Zhang | Bohao Yang | Wenhu Chen | Wenhao Huang | Noura Al Moubayed | Jie Fu | Chenghua Lin
Findings of the Association for Computational Linguistics: ACL 2024
Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing. However, current benchmarks for evaluating MMIR performance on image-text pairings overlook the scientific domain, which differs notably from generic data: captions of scientific charts and tables typically describe the analysis of experimental results or scientific principles, in contrast to the human activity or scenery depicted in generic images. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders.
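The zero-shot evaluations mentioned above reduce to embedding figures and captions in a shared space and ranking by similarity. A minimal sketch with Hugging Face's CLIP, not the actual SciMMIR harness; the checkpoint is the public openai/clip-vit-base-patch32 model and the image file name is a placeholder:

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "Figure 2: BLEU scores of the proposed model across language pairs.",
    "Table 1: Statistics of the training corpus.",
]
image = Image.open("paper_figure.png")  # placeholder: a figure cropped from a paper

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-caption similarities
print(logits.softmax(dim=-1))  # higher = better caption match for retrieval
```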
Knowledge-Grounded Dialogue Act Transfer using Prompt-Based Learning for Controllable Open-Domain NLG
Alain Vazquez Risco | Angela Maria Ramirez | Neha Pullabhotla | Nan Qiang | Haoran Zhang | Marilyn Walker | Maria Ines Torres
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Open-domain spoken dialogue systems need to controllably generate many different dialogue acts (DAs) to allow Natural Language Generation (NLG) to create interesting and engaging conversational interactions with users. We aim to create an NLG engine that can produce a variety of DAs that make substantive knowledge-grounded contributions to a conversation. Training such an NLG typically requires dialogue corpora that are labelled for DAs, which are expensive to produce and vulnerable to quality issues. Here, we present a prompt-based learning approach to transfer DAs from one domain, video games, to 7 new domains. For each novel domain, we first crawl WikiData to create Meaning Representations that systematically vary both the number of attributes and the number of hops on the WikiData Knowledge Graph. The proposed method involves a self-training step to create prompt examples for each domain, followed by an overgeneration and ranking step. The result is a novel, high-quality dataset, Wiki-Dialogue, of 71K knowledge-grounded utterances, covering 9 DAs and the Art, Movies, Music, Sports, TV, Animal, and Boardgames domains, with a combined DA and semantic accuracy of 89%. We assess the corpus quality using both automatic and human evaluations and find it to be high. Compared to similar datasets, the corpus is safe, lexically rich, and has a large vocabulary.
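The overgeneration-and-ranking step can be illustrated with any off-the-shelf generator: sample several candidate utterances for a meaning representation, then keep the best one under a ranking criterion. A toy sketch follows, with GPT-2 as a stand-in generator; the meaning representation, prompt format, and slot-based ranker are hypothetical simplifications of the paper's pipeline:

```python
# pip install torch transformers
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical prompt built from a WikiData-derived meaning representation.
prompt = ("Meaning representation: inform(name=The Starry Night, "
          "creator=Vincent van Gogh, year=1889)\nUtterance:")

candidates = generator(prompt, num_return_sequences=8, do_sample=True,
                       max_new_tokens=40, return_full_text=False)

def rank_score(text):
    # Stand-in ranker: prefer candidates that realize more MR attributes.
    slots = ["Starry Night", "van Gogh", "1889"]
    return sum(slot in text for slot in slots)

best = max(candidates, key=lambda c: rank_score(c["generated_text"]))
print(best["generated_text"])
```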
Enhancing Knowledge Selection via Multi-level Document Semantic Graph
Haoran Zhang | Tan Yongmei
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Knowledge selection is a crucial sub-task of a Document Grounded Dialogue System. Existing methods treat knowledge selection as sentence matching or classification. However, those methods cannot capture the semantic relationships within a complex document. We propose a flexible method that automatically constructs a multi-level document semantic graph from the grounding document and effectively stores the semantic relationships within it. We also devise an auxiliary task that leverages the graph more efficiently and helps optimize the knowledge selection task. We conduct extensive experiments on the public datasets WoW (CITATION) and Holl-E (CITATION), and we achieve state-of-the-art results on WoW. Our code has been released at https://github.com/ddf62/multi-level-semantic-document-graph.
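One plausible way to picture a multi-level document semantic graph is a hierarchy of document, section, and sentence nodes. The released code linked above is authoritative; the networkx sketch below is only an illustration of such a structure under that assumed reading:

```python
import networkx as nx

def build_document_graph(title, sections):
    """sections: {section_heading: [sentence, ...]} from the grounding document."""
    g = nx.Graph()
    g.add_node(title, level="document")
    for heading, sentences in sections.items():
        g.add_node(heading, level="section")
        g.add_edge(title, heading)  # document -> section
        for i, sentence in enumerate(sentences):
            node = f"{heading}#s{i}"
            g.add_node(node, level="sentence", text=sentence)
            g.add_edge(heading, node)  # section -> sentence
    return g

g = build_document_graph("Solar power", {
    "History": ["Solar cells date back to 1883.", "Efficiency rose steadily."],
    "Usage": ["Panels now power homes and satellites."],
})
print(g.number_of_nodes(), g.number_of_edges())  # 6 nodes, 5 edges
```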
2021
Essay Quality Signals as Weak Supervision for Source-based Essay Scoring
Haoran Zhang | Diane Litman
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Human essay grading is a laborious task that can consume much time and effort. Automated Essay Scoring (AES) has thus been proposed as a fast and effective solution to the problem of grading student writing at scale. However, because AES typically uses supervised machine learning, a human-graded essay corpus is still required to train the AES model. Unfortunately, such a graded corpus often does not exist, so creating a corpus for machine learning can also be a laborious task. This paper presents an investigation of replacing the use of human-labeled essay grades when training an AES system with two automatically available but weaker signals of essay quality: word count and topic distribution similarity. Experiments using two source-based essay scoring (evidence score) corpora show that while weak supervision does not yield a competitive result when training a neural source-based AES model, it can be used to successfully extract Topical Components (TCs) from a source text, which are required by a supervised feature-based AES model. In particular, results show that feature-based AES performance is comparable with either automatically or manually constructed TCs.
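Of the two weak signals, topic distribution similarity is the less obvious one. A minimal scikit-learn sketch of computing it is given below; the toy texts are invented, and the paper's corpora and exact topic model settings are not reproduced here:

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

source = ["the article explains how solar panels convert sunlight into electricity"]
essays = ["solar panels turn sunlight into usable electricity for homes",
          "last summer i went hiking with my family in the mountains"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(source + essays)
theta = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for essay, t in zip(essays, theta[1:]):
    # weak quality signal: topical similarity between essay and source text
    print(round(cosine(theta[0], t), 3), "|", essay[:45])
```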
2020
Incorporating Inner-word and Out-word Features for Mongolian Morphological Segmentation
Na Liu | Xiangdong Su | Haoran Zhang | Guanglai Gao | Feilong Bao
Proceedings of the 28th International Conference on Computational Linguistics
Mongolian morphological segmentation is regarded as a crucial preprocessing step in many Mongolian-related NLP applications and has received extensive attention. Recently, end-to-end segmentation approaches with long short-term memory networks (LSTM) have achieved excellent results. However, the inner-word features among characters in the word and the out-word features from context are not well utilized in the segmentation process. In this paper, we propose a neural network incorporating inner-word and out-word features for Mongolian morphological segmentation. The network consists of two encoders and one decoder. The inner-word encoder uses self-attention mechanisms to capture the inner-word features of the target word. The out-word encoder employs a two-layer BiLSTM network to extract out-word features from the sentence. The decoder then adopts a multi-head double attention layer to fuse the inner-word and out-word features and produce the segmentation result. Evaluation experiments compare the proposed network with the baselines and explore the effectiveness of its sub-modules.
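A loose PyTorch sketch of the two-encoder/one-decoder shape described above; the dimensions, head counts, and the exact fusion are guesses for illustration, not the authors' configuration:

```python
import torch
import torch.nn as nn

class Segmenter(nn.Module):
    """Sketch: inner-word self-attention encoder + out-word BiLSTM encoder,
    fused by a decoder attending over both feature streams."""
    def __init__(self, char_vocab, word_vocab, n_labels, d=128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, d)
        self.word_emb = nn.Embedding(word_vocab, d)
        self.inner_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.outer_lstm = nn.LSTM(d, d // 2, num_layers=2,
                                  bidirectional=True, batch_first=True)
        self.fuse_inner = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.fuse_outer = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Linear(2 * d, n_labels)

    def forward(self, chars, words):
        # chars: (B, n_chars) characters of the target word
        # words: (B, n_words) words of the surrounding sentence
        c = self.char_emb(chars)
        inner, _ = self.inner_attn(c, c, c)               # inner-word features
        outer, _ = self.outer_lstm(self.word_emb(words))  # out-word features
        a, _ = self.fuse_inner(inner, inner, inner)
        b, _ = self.fuse_outer(inner, outer, outer)  # chars attend to context
        return self.out(torch.cat([a, b], dim=-1))   # per-character tag logits

model = Segmenter(char_vocab=100, word_vocab=5000, n_labels=4)
logits = model(torch.randint(0, 100, (2, 12)), torch.randint(0, 5000, (2, 20)))
print(logits.shape)  # torch.Size([2, 12, 4])
```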
Automated Topical Component Extraction Using Neural Network Attention Scores from Source-based Essay Scoring
Haoran Zhang | Diane Litman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
While automated essay scoring (AES) can reliably grade essays at scale, automated writing evaluation (AWE) additionally provides formative feedback to guide essay revision. However, a neural AES typically does not provide useful feature representations for supporting AWE. This paper presents a method for linking AWE and neural AES by extracting Topical Components (TCs) representing evidence from a source text, using the intermediate output of attention layers. We evaluate performance using a feature-based AES that requires TCs. Results show that performance is comparable whether using automatically or manually constructed TCs for 1) representing essays as rubric-based features, and 2) grading essays.
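The extraction idea, reusing attention weights as word importance, can be sketched as: accumulate each word's attention mass across essays and keep the most-attended words as candidate TCs. This is a hypothetical simplification; the paper's pipeline involves more filtering and grouping:

```python
from collections import Counter

def extract_topical_components(essay_tokens, attention_weights, top_k=10):
    """essay_tokens: list of token lists, one per essay.
    attention_weights: matching per-token weights, assumed to come from the
    neural AES model's attention layer. Returns the top-k attended words."""
    mass = Counter()
    for tokens, weights in zip(essay_tokens, attention_weights):
        for tok, w in zip(tokens, weights):
            mass[tok.lower()] += w  # accumulate attention mass per word type
    return [word for word, _ in mass.most_common(top_k)]

tokens = [["solar", "energy", "reduces", "emissions"],
          ["wind", "and", "solar", "energy", "scale", "well"]]
weights = [[0.40, 0.35, 0.15, 0.10],
           [0.30, 0.02, 0.33, 0.25, 0.06, 0.04]]
print(extract_topical_components(tokens, weights, top_k=3))
# ['solar', 'energy', 'wind']
```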
Active Learning Approaches to Enhancing Neural Machine Translation
Yuekai Zhao | Haoran Zhang | Shuchang Zhou | Zhihua Zhang
Findings of the Association for Computational Linguistics: EMNLP 2020
Active learning is an efficient approach for mitigating data dependency when training neural machine translation (NMT) models. In this paper, we explore new training frameworks by incorporating active learning into various techniques such as transfer learning and iterative back-translation (IBT) under a limited human translation budget. We design a word-frequency-based acquisition function and combine it with a strong uncertainty-based method. The combined method steadily outperforms all other acquisition functions in various scenarios. To the best of our knowledge, we are the first to conduct a large-scale study on actively training Transformers for NMT. Specifically, with a human translation budget of only 20% of the original parallel corpus, we manage to surpass a Transformer trained on the entire parallel corpus in three language pairs.
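Below is a sketch of combining a word-frequency signal with an uncertainty signal into a single acquisition score. The equal weighting, the add-one smoothing, and the uncertainty callable are placeholders, not the paper's exact formulation:

```python
import math
from collections import Counter

def acquisition_scores(pool, corpus_counts, uncertainty, alpha=0.5):
    """pool: untranslated candidate sentences; corpus_counts: word counts over
    the labeled data; uncertainty: hypothetical callable returning an NMT
    uncertainty score (e.g., mean per-token entropy) for a sentence."""
    total = sum(corpus_counts.values())
    scores = []
    for sent in pool:
        words = sent.split()
        # rarer words carry more information -> higher acquisition value
        rarity = sum(-math.log((corpus_counts.get(w, 0) + 1) / total)
                     for w in words) / max(len(words), 1)
        scores.append(alpha * rarity + (1 - alpha) * uncertainty(sent))
    return scores  # higher score = select for human translation first

counts = Counter("the cat sat on the mat the dog ran".split())
pool = ["the cat ran", "quantum entanglement experiments"]
print(acquisition_scores(pool, counts, uncertainty=lambda s: 0.5))
```

The sentence full of unseen words scores higher, matching the intuition that rare vocabulary is where human translations help most.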
2018
Co-Attention Based Neural Network for Source-Dependent Essay Scoring
Haoran Zhang | Diane Litman
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
This paper presents an investigation of using a co-attention based neural network for source-dependent essay scoring. We use a co-attention mechanism to help the model learn the importance of each part of the essay more accurately. This paper also shows that the co-attention based neural network model provides reliable score prediction for source-dependent responses. We evaluate our model on two source-dependent response corpora. Results show that our model outperforms the baseline on both corpora. We also show, through examples, that the model's attention is similar to expert opinions.
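The co-attention at the heart of such a model can be sketched as a single affinity matrix between essay and source representations, softmax-normalized in each direction. This is a minimal illustration with random vectors, not the trained scoring model:

```python
import torch
import torch.nn.functional as F

def co_attention(essay, source):
    """essay: (n_e, d) sentence vectors; source: (n_s, d) sentence vectors.
    Returns each side re-weighted by its affinity with the other."""
    affinity = essay @ source.T                    # (n_e, n_s) similarity
    attn_e = F.softmax(affinity, dim=1) @ source   # source-aware essay repr.
    attn_s = F.softmax(affinity.T, dim=1) @ essay  # essay-aware source repr.
    return attn_e, attn_s

e, s = torch.randn(10, 64), torch.randn(6, 64)
ae, asrc = co_attention(e, s)
print(ae.shape, asrc.shape)  # torch.Size([10, 64]) torch.Size([6, 64])
```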
2017
Word Embedding for Response-To-Text Assessment of Evidence
Haoran Zhang | Diane Litman
Proceedings of ACL 2017, Student Research Workshop