Ji Ma


2022

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models
Jianmo Ni | Gustavo Hernandez Abrego | Noah Constant | Ji Ma | Keith Hall | Daniel Cer | Yinfei Yang
Findings of the Association for Computational Linguistics: ACL 2022

We provide the first exploration of sentence embeddings from text-to-text transformers (T5), including the effects of scaling up sentence encoders to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two that use only the T5 encoder and one that uses the full T5 encoder-decoder. We establish a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark. Our encoder-only models outperform the previous best models on both SentEval and SentGLUE transfer tasks, including semantic textual similarity (STS). Scaling up ST5 from millions to billions of parameters is shown to consistently improve performance. Finally, our encoder-decoder method achieves a new state of the art on STS when using sentence embeddings.
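
A rough sketch of how the encoder-only ST5 variants can be approximated with off-the-shelf tools (not the authors' released code): mean-pool the T5 encoder's token representations into a fixed-size sentence vector. The "t5-base" checkpoint and the mean-pooling choice are assumptions made for illustration.

    # Minimal sketch: mean-pooled T5 encoder outputs as a sentence embedding.
    # Checkpoint and pooling strategy are illustrative assumptions, not the paper's exact setup.
    import torch
    from transformers import AutoTokenizer, T5EncoderModel

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    encoder = T5EncoderModel.from_pretrained("t5-base")

    def embed(sentences):
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state       # [batch, seq, dim]
        mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
        emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling
        return torch.nn.functional.normalize(emb, dim=-1)     # unit-length sentence vectors

    sims = embed(["a small cat", "a tiny kitten"]) @ embed(["stock prices fell"]).T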

ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference
Kai Hui | Honglei Zhuang | Tao Chen | Zhen Qin | Jing Lu | Dara Bahri | Ji Ma | Jai Gupta | Cicero Nogueira dos Santos | Yi Tay | Donald Metzler
Findings of the Association for Computational Linguistics: ACL 2022

State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally adopt either an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws: running the model on all query-document pairs at inference time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model on document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference-time speedups, since the decoder-only architecture only needs to learn to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models.
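
A hedged sketch of the inference-time idea, assuming a T5-style model already finetuned for document-to-query generation ("t5-base" below is only a placeholder): the document is encoded once, and each candidate query is then scored by its decoder log-likelihood over those static encoder states.

    # Sketch: score a query by its generation likelihood given precomputed document encodings.
    # "t5-base" stands in for a model finetuned on document-to-query generation.
    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

    def rerank_score(document: str, query: str) -> float:
        doc = tokenizer(document, return_tensors="pt", truncation=True)
        with torch.no_grad():
            enc = model.get_encoder()(**doc)              # encode the document once, reuse for all queries
            labels = tokenizer(query, return_tensors="pt").input_ids
            out = model(encoder_outputs=enc,              # decoder reads static encoder states only
                        attention_mask=doc["attention_mask"],
                        labels=labels)
        return -out.loss.item()                           # higher query log-likelihood = more relevant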

Large Dual Encoders Are Generalizable Retrievers
Jianmo Ni | Chen Qu | Jing Lu | Zhuyun Dai | Gustavo Hernandez Abrego | Ji Ma | Vincent Zhao | Yi Luan | Keith Hall | Ming-Wei Chang | Yinfei Yang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited compared to models with fine-grained interactions between the query and the passage. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck layer as a single dot-product with a fixed size. With multi-stage training, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. We further analyze the impact of the bottleneck layer and demonstrate diminishing improvement when scaling up the embedding size. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform previous sparse and dense retrievers on the BEIR dataset significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to match the out-of-domain performance of using all supervised data.
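
The bottleneck under discussion is literally a single dot product between fixed-size query and passage vectors; the sketch below illustrates that scoring step, with a random projection standing in for a GTR-style shared encoder (an assumption made for the example).

    # Sketch of dual-encoder retrieval scoring: one dot product over fixed-size vectors.
    # encode() is a placeholder for a shared T5-based encoder; here it returns random embeddings.
    import torch

    dim = 768                                     # fixed bottleneck size, regardless of model scale

    def encode(texts):
        return torch.nn.functional.normalize(torch.randn(len(texts), dim), dim=-1)

    queries = encode(["what is a dual encoder?"])
    passages = encode(["A dual encoder embeds queries and passages separately.",
                       "Stock prices fell sharply on Monday."])
    scores = queries @ passages.T                 # [n_queries, n_passages] relevance scores
    ranking = scores.argsort(dim=-1, descending=True)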

2021

Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation
Ji Ma | Ivan Korotkov | Yinfei Yang | Keith Hall | Ryan McDonald
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

A major obstacle to the widespread adoption of neural retrieval models is that they require large supervised training sets to surpass traditional term-based techniques, which are constructed from raw corpora. In this paper, we propose an approach to zero-shot learning for passage retrieval that uses synthetic question generation to close this gap. The question generation system is trained on general-domain data, but is applied to documents in the targeted domain. This allows us to create arbitrarily large, yet noisy, question-passage relevance pairs that are domain-specific. Furthermore, when this is coupled with a simple hybrid term-neural model, first-stage retrieval performance can be improved further. Empirically, we show that this is an effective strategy for building neural passage retrieval models in the absence of large training corpora. Depending on the domain, this technique can even approach the accuracy of supervised models.
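
A minimal sketch of the data-creation loop, assuming a seq2seq question-generation model trained on general-domain data ("t5-base" below is a placeholder checkpoint, not the authors' model): sample several questions per target-domain passage and keep the resulting noisy (question, passage) pairs as synthetic training data.

    # Sketch: generate synthetic (question, passage) relevance pairs for a target domain.
    # "t5-base" is a placeholder for a general-domain question-generation model.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    qg_model = T5ForConditionalGeneration.from_pretrained("t5-base")

    def synthesize_pairs(passages, questions_per_passage=3):
        pairs = []
        for passage in passages:                  # unlabeled target-domain documents
            inputs = tokenizer("generate question: " + passage,
                               return_tensors="pt", truncation=True)
            outputs = qg_model.generate(**inputs,
                                        do_sample=True,   # sampling yields diverse, noisy questions
                                        num_return_sequences=questions_per_passage,
                                        max_new_tokens=32)
            for question in tokenizer.batch_decode(outputs, skip_special_tokens=True):
                pairs.append((question, passage)) # noisy but domain-specific relevance pair
        return pairs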

Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval
Jing Lu | Gustavo Hernandez Abrego | Ji Ma | Jianmo Ni | Yinfei Yang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In the context of neural passage retrieval, we study three promising techniques: synthetic data generation, negative sampling, and fusion. We systematically investigate how these techniques contribute to the performance of the retrieval system and how they complement each other. We propose a multi-stage framework consisting of pre-training with synthetic data, fine-tuning with labeled data, and negative sampling at both stages. We study six negative sampling strategies and apply them to the fine-tuning stage and, as a noteworthy novelty, to the synthetic data that we use for pre-training. We also explore fusion methods that combine negatives from different strategies. We evaluate our system on two passage retrieval tasks for open-domain QA and on MS MARCO. Our experiments show that augmenting the negative contrast in both stages is effective for improving passage retrieval accuracy and, importantly, that synthetic data generation and negative sampling have additive benefits. Moreover, fusing negatives of different kinds lets us reach performance that establishes a new state-of-the-art level on two of the tasks we evaluated.
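
As a rough illustration of the negative-contrast objective (not the paper's exact loss or sampling code), the sketch below scores a query against its positive passage plus sampled negatives and applies a softmax cross-entropy over the candidates; the embeddings are random placeholders for dual-encoder outputs.

    # Sketch: contrastive loss over a positive passage and explicitly sampled negatives.
    import torch
    import torch.nn.functional as F

    dim, num_negatives = 768, 7
    query = F.normalize(torch.randn(1, dim), dim=-1)
    positive = F.normalize(torch.randn(1, dim), dim=-1)
    negatives = F.normalize(torch.randn(num_negatives, dim), dim=-1)  # e.g. BM25 or in-batch negatives

    candidates = torch.cat([positive, negatives], dim=0)  # index 0 is the relevant passage
    logits = query @ candidates.T                         # dot-product scores: [1, 1 + num_negatives]
    target = torch.zeros(1, dtype=torch.long)             # positive sits at index 0
    loss = F.cross_entropy(logits, target)                # push the positive above the negatives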

2018

State-of-the-art Chinese Word Segmentation with Bi-LSTMs
Ji Ma | Kuzman Ganchev | David Weiss
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

A wide variety of neural-network architectures have been proposed for the task of Chinese word segmentation. Surprisingly, we find that a bidirectional LSTM model, when combined with standard deep learning techniques and best practices, can achieve better accuracy on many of the popular datasets than models based on more complex neural-network architectures. Furthermore, our error analysis shows that out-of-vocabulary words remain challenging for neural-network models, and many of the remaining errors are unlikely to be fixed through architecture changes. Instead, more effort should be made on exploring resources for further improvement.
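
The model family studied is a plain character-level bidirectional LSTM tagger; a minimal sketch is below (the BIES label scheme and hyperparameters are assumptions for illustration, not the paper's exact configuration).

    # Sketch: character-level Bi-LSTM tagger for Chinese word segmentation (B/I/E/S labels assumed).
    import torch
    import torch.nn as nn

    class BiLSTMSegmenter(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden=256, num_labels=4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, num_labels)   # per-character label scores

        def forward(self, char_ids):                       # char_ids: [batch, seq_len]
            states, _ = self.lstm(self.embed(char_ids))
            return self.out(states)                        # [batch, seq_len, num_labels]

    model = BiLSTMSegmenter(vocab_size=5000)
    logits = model(torch.randint(0, 5000, (2, 10)))        # dummy batch of character ids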

2017

Natural Language Processing with Small Feed-Forward Networks
Jan A. Botha | Emily Pitler | Ji Ma | Anton Bakalov | Alex Salcianu | David Weiss | Ryan McDonald | Slav Petrov
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We show that small and shallow feed-forward neural networks can achieve near state-of-the-art results on a range of unstructured and structured language processing tasks while being considerably cheaper in memory and computational requirements than deep recurrent models. Motivated by resource-constrained environments like mobile phones, we showcase simple techniques for obtaining such small neural network models, and investigate different tradeoffs when deciding how to allocate a small memory budget.
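
A hedged sketch of the kind of model described: a small feed-forward network over pooled, hashed sparse features, sized to fit a tight memory budget. The feature hashing, layer sizes, and label count below are assumptions for illustration.

    # Sketch: tiny feed-forward classifier over hashed feature ids (all sizes are illustrative).
    import torch
    import torch.nn as nn

    class TinyFeedForward(nn.Module):
        def __init__(self, num_buckets=2**16, emb_dim=16, hidden=64, num_classes=12):
            super().__init__()
            self.embed = nn.EmbeddingBag(num_buckets, emb_dim, mode="sum")  # pool hashed features
            self.mlp = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_classes))

        def forward(self, feature_ids, offsets):           # flat feature ids + per-example offsets
            return self.mlp(self.embed(feature_ids, offsets))

    model = TinyFeedForward()
    feature_ids = torch.tensor([17, 523, 9001, 42])        # hashed features for two examples
    offsets = torch.tensor([0, 2])                         # example boundaries within feature_ids
    logits = model(feature_ids, offsets)
    print(sum(p.numel() for p in model.parameters()))      # parameter count stays small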

2016

Generalized Transition-based Dependency Parsing via Control Parameters
Bernd Bohnet | Ryan McDonald | Emily Pitler | Ji Ma
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2014

Tagging The Web: Building A Robust Web Tagger with Neural Network
Ji Ma | Yue Zhang | Jingbo Zhu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Punctuation Processing for Projective Dependency Parsing
Ji Ma | Yue Zhang | Jingbo Zhu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Easy-First POS Tagging and Dependency Parsing with Beam Search
Ji Ma | Jingbo Zhu | Tong Xiao | Nan Yang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

Easy-First Chinese POS Tagging and Dependency Parsing
Ji Ma | Tong Xiao | Jingbo Zhu | Feiliang Ren
Proceedings of COLING 2012

NEU Systems in SIGHAN Bakeoff 2012
Ji Ma | LongFei Bai | Zhuo Liu | Ao Zhang | Jingbo Zhu
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

2010

A Multi-stage Clustering Framework for Chinese Personal Name Disambiguation
Huizhen Wang | Haibo Ding | Yingchao Shi | Ji Ma | Xiao Zhou | Jingbo Zhu
CIPS-SIGHAN Joint Conference on Chinese Language Processing