Pascale Fung


2021

pdf bib
Preserving Cross-Linguality of Pre-trained Models via Continual Learning
Zihan Liu | Genta Indra Winata | Andrea Madotto | Pascale Fung
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Recently, fine-tuning pre-trained language models (e.g., multilingual BERT) to downstream cross-lingual tasks has shown promising results. However, the fine-tuning process inevitably changes the parameters of the pre-trained model and weakens its cross-lingual ability, which leads to sub-optimal performance. To alleviate this problem, we leverage continual learning to preserve the original cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks. The experimental result shows that our fine-tuning methods can better preserve the cross-lingual ability of the pre-trained model in a sentence retrieval task. Our methods also achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.

pdf bib
X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing
Zihan Liu | Genta Indra Winata | Peng Xu | Pascale Fung
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries and serves as an essential component of virtual assistants. Current TCSP models rely on numerous training data to achieve decent performance but fail to generalize to low-resource target languages or domains. In this paper, we present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP. Unlike previous models that learn to generate the hierarchical representations for nested intents and slots, we propose to predict intents and slots separately and cast both prediction tasks into sequence labeling problems. After that, we further propose a fertility-based slot predictor that first learns to detect the number of labels for each token, and then predicts the slot types. Experimental results illustrate that our model can significantly outperform existing strong baselines in cross-lingual and cross-domain settings, and our model can also achieve a good generalization ability on target languages of target domains. Furthermore, we show that our model can reduce the latency by up to 66% compared to the generation-based model.

pdf bib
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Tiezheng Yu | Wenliang Dai | Zihan Liu | Pascale Fung
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multimodal abstractive summarization (MAS) models that summarize videos (vision modality) and their corresponding transcripts (text modality) are able to extract the essential information from massive multimodal data on the Internet. Recently, large-scale generative pre-trained language models (GPLMs) have been shown to be effective in text generation tasks. However, existing MAS models cannot leverage GPLMs’ powerful generation ability. To fill this research gap, we aim to study two research questions: 1) how to inject visual information into GPLMs without hurting their generation ability; and 2) where is the optimal place in GPLMs to inject the visual information? In this paper, we present a simple yet effective method to construct vision guided (VG) GPLMs for the MAS task using attention-based add-on layers to incorporate visual information while maintaining their original text generation ability. Results show that our best model significantly surpasses the prior state-of-the-art model by 5.7 ROUGE-1, 5.3 ROUGE-2, and 5.1 ROUGE-L scores on the How2 dataset, and our vision guidance method contributes 83.6% of the overall improvement. Furthermore, we conduct thorough ablation studies to analyze the effectiveness of various modality fusion methods and fusion locations.

pdf bib
Continual Learning in Task-Oriented Dialogue Systems
Andrea Madotto | Zhaojiang Lin | Zhenpeng Zhou | Seungwhan Moon | Paul Crook | Bing Liu | Zhou Yu | Eunjoon Cho | Pascale Fung | Zhiguang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Continual learning in task-oriented dialogue systems allows the system to add new domains and functionalities overtime after deployment, without incurring the high cost of retraining the whole system each time. In this paper, we propose a first-ever continual learning benchmark for task-oriented dialogue systems with 37 domains to be learned continuously in both modularized and end-to-end learning settings. In addition, we implement and compare multiple existing continual learning baselines, and we propose a simple yet effective architectural method based on residual adapters. We also suggest that the upper bound performance of continual learning should be equivalent to multitask learning when data from all domain is available at once. Our experiments demonstrate that the proposed architectural method and a simple replay-based strategy perform better, by a large margin, compared to other continuous learning techniques, and only slightly worse than the multitask learning upper bound while being 20X faster in learning new domains. We also report several trade-offs in terms of parameter usage, memory size and training time, which are important in the design of a task-oriented dialogue system. The proposed benchmark is released to promote more research in this direction.

pdf bib
Zero-Shot Dialogue State Tracking via Cross-Task Transfer
Zhaojiang Lin | Bing Liu | Andrea Madotto | Seungwhan Moon | Zhenpeng Zhou | Paul Crook | Zhiguang Wang | Zhou Yu | Eunjoon Cho | Rajen Subba | Pascale Fung
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the cross-task knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle none value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.

pdf bib
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
Samuel Cahyawijaya | Genta Indra Winata | Bryan Wilie | Karissa Vincentio | Xiaohong Li | Adhiguna Kuncoro | Sebastian Ruder | Zhi Yuan Lim | Syafri Bahar | Masayu Khodra | Ayu Purwarianti | Pascale Fung
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a clean pretraining corpus of Indonesian, Sundanese, and Javanese datasets, Indo4B-Plus, which is used to pretrain our models: IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks—despite using only one-fifth the parameters of a larger multilingual model, mBART-large (Liu et al., 2020). This finding emphasizes the importance of pretraining on closely related, localized languages to achieve more efficient learning and faster inference at very low-resource languages like Javanese and Sundanese.

pdf bib
Are Multilingual Models Effective in Code-Switching?
Genta Indra Winata | Samuel Cahyawijaya | Zihan Liu | Zhaojiang Lin | Andrea Madotto | Pascale Fung
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this paper, we study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting by considering the inference speed, performance, and number of parameters to measure their practicality. We conduct experiments in three language pairs on named entity recognition and part-of-speech tagging and compare them with existing methods, such as using bilingual embeddings and multilingual meta-embeddings. Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching, while using meta-embeddings achieves similar results with significantly fewer parameters.

pdf bib
Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
Zihan Liu | Genta Indra Winata | Pascale Fung
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance
Dan Su | Tiezheng Yu | Pascale Fung
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Language Models are Few-shot Multilingual Learners
Genta Indra Winata | Andrea Madotto | Zhaojiang Lin | Rosanne Liu | Jason Yosinski | Pascale Fung
Proceedings of the 1st Workshop on Multilingual Representation Learning

General-purpose language models have demonstrated impressive capabilities, performing on par with state-of-the-art approaches on a range of downstream natural language processing (NLP) tasks and benchmarks when inferring instructions from very few examples. Here, we evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages without any parameter updates. We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones. Finally, we find the in-context few-shot cross-lingual prediction results of language models are significantly better than random prediction, and they are competitive compared to the existing state-of-the-art cross-lingual models and translation models.

pdf bib
Towards Few-shot Fact-Checking via Perplexity
Nayeon Lee | Yejin Bang | Andrea Madotto | Pascale Fung
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Few-shot learning has drawn researchers’ attention to overcome the problem of data scarcity. Recently, large pre-trained language models have shown great performance in few-shot learning for various downstream tasks, such as question answering and machine translation. Nevertheless, little exploration has been made to achieve few-shot learning for the fact-checking task. However, fact-checking is an important problem, especially when the amount of information online is growing exponentially every day. In this paper, we propose a new way of utilizing the powerful transfer learning ability of a language model via a perplexity score. The most notable strength of our methodology lies in its capability in few-shot learning. With only two training samples, our methodology can already outperform the Major Class baseline by more than an absolute 10% on the F1-Macro metric across multiple datasets. Through experiments, we empirically verify the plausibility of the rather surprising usage of the perplexity score in the context of fact-checking and highlight the strength of our few-shot methodology by comparing it to strong fine-tuning-based baseline models. Moreover, we construct and publicly release two new fact-checking datasets related to COVID-19.

pdf bib
Multimodal End-to-End Sparse Model for Emotion Recognition
Wenliang Dai | Samuel Cahyawijaya | Zihan Liu | Pascale Fung
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline by first extracting feature representations for each single modality with hand crafted algorithms, and then performing end-to-end learning with extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extracting algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance. In this paper, we develop a fully end-to-end model that connects the two phases and optimizes them jointly. In addition, we restructure the current datasets to enable the fully end-to-end training. Furthermore, to reduce the computational overhead brought by the end-to-end model, we introduce a sparse cross-modal attention mechanism for the feature extraction. Experimental results show that our fully end-to-end model significantly surpasses the current state-of-the-art models based on the two-phase pipeline. Moreover, by adding the sparse cross-modal attention, our model can maintain the performance with around half less computation in the feature extraction part of the model.

pdf bib
On Unifying Misinformation Detection
Nayeon Lee | Belinda Z. Li | Sinong Wang | Pascale Fung | Hao Ma | Wen-tau Yih | Madian Khabsa
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we introduce UnifiedM2, a general-purpose misinformation model that jointly models multiple domains of misinformation with a single, unified setup. The model is trained to handle four tasks: detecting news bias, clickbait, fake news, and verifying rumors. By grouping these tasks together, UnifiedM2 learns a richer representation of misinformation, which leads to state-of-the-art or comparable performance across all tasks. Furthermore, we demonstrate that UnifiedM2’s learned representation is helpful for few-shot learning of unseen misinformation tasks/datasets and the model’s generalizability to unseen events.

pdf bib
AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization
Tiezheng Yu | Zihan Liu | Pascale Fung
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

State-of-the-art abstractive summarization models generally rely on extensive labeled data, which lowers their generalization ability on domains where such data are not available. In this paper, we present a study of domain adaptation for the abstractive summarization task across six diverse target domains in a low-resource setting. Specifically, we investigate the second phase of pre-training on large-scale generative models under three different settings: 1) source domain pre-training; 2) domain-adaptive pre-training; and 3) task-adaptive pre-training. Experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. Moreover, we find that continuing pre-training could lead to the pre-trained model’s catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. Furthermore, results illustrate that a huge gap still exists between the low-resource and high-resource settings, which highlights the need for more advanced domain adaptation methods for the abstractive summarization task.

pdf bib
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource language compared to other translation baselines.

pdf bib
ERICA: An Empathetic Android Companion for Covid-19 Quarantine
Etsuko Ishii | Genta Indra Winata | Samuel Cahyawijaya | Divesh Lala | Tatsuya Kawahara | Pascale Fung
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects of the user interface: a web-based virtual agent, Nora vs. the android ERICA via a video call. The experimental results show that the android can offer a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures.

pdf bib
Assessing Political Prudence of Open-domain Chatbots
Yejin Bang | Nayeon Lee | Etsuko Ishii | Andrea Madotto | Pascale Fung
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Politically sensitive topics are still a challenge for open-domain chatbots. However, dealing with politically sensitive content in a responsible, non-partisan, and safe behavior way is integral for these chatbots. Currently, the main approach to handling political sensitivity is by simply changing such a topic when it is detected. This is safe but evasive and results in a chatbot that is less engaging. In this work, as a first step towards a politically safe chatbot, we propose a group of metrics for assessing their political prudence. We then conduct political prudence analysis of various chatbots and discuss their behavior from multiple angles through our automatic metric and human evaluation metrics. The testsets and codebase are released to promote research in this area.

pdf bib
XPersona: Evaluating Multilingual Personalized Chatbot
Zhaojiang Lin | Zihan Liu | Genta Indra Winata | Samuel Cahyawijaya | Andrea Madotto | Yejin Bang | Etsuko Ishii | Pascale Fung
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Personalized dialogue systems are an essential step toward better human-machine interaction. Existing personalized dialogue agents rely on properly designed conversational datasets, which are mostly monolingual (e.g., English), which greatly limits the usage of conversational agents in other languages. In this paper, we propose a multi-lingual extension of Persona-Chat, namely XPersona. Our dataset includes persona conversations in six different languages other than English for evaluating multilingual personalized agents. We experiment with both multilingual and cross-lingual trained baselines and evaluate them against monolingual and translation-pipeline models using both automatic and human evaluation. Experimental results show that the multilingual trained models outperform the translation pipeline and that they are on par with the monolingual models, with the advantage of having a single model across multiple languages. On the other hand, the state-of-the-art cross-lingual trained models achieve inferior performance to the other models, showing that cross-lingual conversation modeling is a challenging task. We hope that our dataset and baselines will accelerate research in multilingual dialogue systems.

pdf bib
CAiRE in DialDoc21: Data Augmentation for Information Seeking Dialogue System
Yan Xu | Etsuko Ishii | Genta Indra Winata | Zhaojiang Lin | Andrea Madotto | Zihan Liu | Peng Xu | Pascale Fung
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative responses based on users’ needs, which. To tackle this challenge, we utilize data augmentation methods and several training techniques with the pre-trained language models to learn a general pattern of the task and thus achieve promising performance. In DialDoc21 competition, our system achieved 74.95 F1 score and 60.74 Exact Match score in subtask 1, and 37.72 SacreBLEU score in subtask 2. Empirical analysis is provided to explain the effectiveness of our approaches.

2020

pdf bib
Dimsum @LaySumm 20
Tiezheng Yu | Dan Su | Wenliang Dai | Pascale Fung
Proceedings of the First Workshop on Scholarly Document Processing

Lay summarization aims to generate lay summaries of scientific papers automatically. It is an essential task that can increase the relevance of science for all of society. In this paper, we build a lay summary generation system based on BART model. We leverage sentence labels as extra supervision signals to improve the performance of lay summarization. In the CL-LaySumm 2020 shared task, our model achieves 46.00 Rouge1-F1 score.

pdf bib
Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-TaskLearning for Offensive Language Detection
Wenliang Dai | Tiezheng Yu | Zihan Liu | Pascale Fung
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Nowadays, offensive content in social media has become a serious problem, and automatically detecting offensive language is an essential task. In this paper, we build an offensive language detection system, which combines multi-task learning with BERT-based models. Using a pre-trained language model such as BERT, we can effectively learn the representations for noisy text in social media. Besides, to boost the performance of offensive language detection, we leverage the supervision signals from other related tasks. In the OffensEval-2020 competition, our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place (92.23%F1). An empirical analysis is provided to explain the effectiveness of our approaches.

pdf bib
Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling
Zihan Liu | Genta Indra Winata | Peng Xu | Pascale Fung
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

As an essential task in task-oriented dialog systems, slot filling requires extensive training data in a certain domain. However, such data are not always available. Hence, cross-domain slot filling has naturally arisen to cope with this data scarcity problem. In this paper, we propose a Coarse-to-fine approach (Coach) for cross-domain slot filling. Our model first learns the general pattern of slot entities by detecting whether the tokens are slot entities or not. It then predicts the specific types for the slot entities. In addition, we propose a template regularization approach to improve the adaptation robustness by regularizing the representation of utterances based on utterance templates. Experimental results show that our model significantly outperforms state-of-the-art approaches in slot filling. Furthermore, our model can also be applied to the cross-domain named entity recognition task, and it achieves better adaptation performance than other existing baselines. The code is available at https://github.com/zliucr/coach.

pdf bib
Meta-Transfer Learning for Code-Switched Speech Recognition
Genta Indra Winata | Samuel Cahyawijaya | Zhaojiang Lin | Zihan Liu | Peng Xu | Pascale Fung
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

An increasing number of people in the world today speak a mixed-language as a result of being multilingual. However, building a speech recognition system for code-switching remains difficult due to the availability of limited resources and the expense and significant effort required to collect mixed-language data. We therefore propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets. Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data. Based on experimental results, our model outperforms existing baselines on speech recognition and language modeling tasks, and is faster to converge.

pdf bib
CAiRE-COVID: A Question Answering and Query-focused Multi-Document Summarization System for COVID-19 Scholarly Information Management
Dan Su | Yan Xu | Tiezheng Yu | Farhad Bin Siddique | Elham Barezi | Pascale Fung
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We present CAiRE-COVID, a real-time question answering (QA) and multi-document summarization system, which won one of the 10 tasks in the Kaggle COVID-19 Open Research Dataset Challenge, judged by medical experts. Our system aims to tackle the recent challenge of mining the numerous scientific articles being published on COVID-19 by answering high priority questions from the community and summarizing salient question-related information. It combines information extraction with state-of-the-art QA and query-focused multi-document summarization techniques, selecting and highlighting evidence snippets from existing literature given a query. We also propose query-focused abstractive and extractive multi-document summarization methods, to provide more relevant information related to the question. We further conduct quantitative experiments that show consistent improvements on various metrics for each module. We have launched our website CAiRE-COVID for broader use by the medical community, and have open-sourced the code for our system, to bootstrap further study by other researches.

pdf bib
Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition
Wenliang Dai | Zihan Liu | Tiezheng Yu | Pascale Fung
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Despite the recent achievements made in the multi-modal emotion recognition task, two problems still exist and have not been well investigated: 1) the relationship between different emotion categories are not utilized, which leads to sub-optimal performance; and 2) current models fail to cope well with low-resource emotions, especially for unseen emotions. In this paper, we propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues. We use pre-trained word embeddings to represent emotion categories for textual data. Then, two mapping functions are learned to transfer these embeddings into visual and acoustic spaces. For each modality, the model calculates the representation distance between the input sequence and target emotions and makes predictions based on the distances. By doing so, our model can directly adapt to the unseen emotions in any modality since we have their pre-trained embeddings and modality mapping functions. Experiments show that our model achieves state-of-the-art performance on most of the emotion categories. Besides, our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.

pdf bib
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
Bryan Wilie | Karissa Vincentio | Genta Indra Winata | Samuel Cahyawijaya | Xiaohong Li | Zhi Yuan Lim | Sidik Soleman | Rahmad Mahendra | Pascale Fung | Syafri Bahar | Ayu Purwarianti
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Although Indonesian is known to be the fourth most frequently used language over the internet, the research progress on this language in natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluation, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained from a large and clean Indonesian dataset (Indo4B) collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, thus enabling everyone to benchmark their system performances.

pdf bib
Zero-Resource Cross-Domain Named Entity Recognition
Zihan Liu | Genta Indra Winata | Pascale Fung
Proceedings of the 5th Workshop on Representation Learning for NLP

Existing models for cross-domain named entity recognition (NER) rely on numerous unlabeled corpus or labeled NER training data in target domains. However, collecting data for low-resource target domains is not only expensive but also time-consuming. Hence, we propose a cross-domain NER model that does not use any external resources. We first introduce a Multi-Task Learning (MTL) by adding a new objective function to detect whether tokens are named entities or not. We then introduce a framework called Mixture of Entity Experts (MoEE) to improve the robustness for zero-resource domain adaptation. Finally, experimental results show that our model outperforms strong unsupervised cross-domain sequence labeling models, and the performance of our model is close to that of the state-of-the-art model which leverages extensive resources.

pdf bib
MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models
Peng Xu | Mostofa Patwary | Mohammad Shoeybi | Raul Puri | Pascale Fung | Anima Anandkumar | Bryan Catanzaro
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Existing pre-trained large language models have shown unparalleled generative capabilities. However, they are not controllable. In this paper, we propose MEGATRON-CNTRL, a novel framework that uses large-scale language models and adds control to text generation by incorporating an external knowledge base. Our framework consists of a keyword predictor, a knowledge retriever, a contextual knowledge ranker, and a conditional text generator. As we do not have access to ground-truth supervision for the knowledge ranker, we make use of weak supervision from sentence embedding. The empirical results show that our model generates more fluent, consistent, and coherent stories with less repetition and higher diversity compared to prior work on the ROC story dataset. We showcase the controllability of our model by replacing the keywords used to generate stories and re-running the generation process. Human evaluation results show that 77.5% of these stories are successfully controlled by the new keywords. Furthermore, by scaling our model from 124 million to 8.3 billion parameters we demonstrate that larger models improve both the quality of generation (from 74.5% to 93.0% for consistency) and controllability (from 77.5% to 91.5%).

pdf bib
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Zhaojiang Lin | Andrea Madotto | Genta Indra Winata | Pascale Fung
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, we propose Minimalist Transfer Learning (MinTL) to simplify the system design process of task-oriented dialogue systems and alleviate the over-dependency on annotated data. MinTL is a simple yet effective transfer learning framework, which allows us to plug-and-play pre-trained seq2seq models, and jointly learn dialogue state tracking and dialogue response generation. Unlike previous approaches, which use a copy mechanism to “carryover” the old dialogue states to the new one, we introduce Levenshtein belief spans (Lev), that allows efficient dialogue state tracking with a minimal generation length. We instantiate our learning framework with two pre-trained backbones: T5 and BART, and evaluate them on MultiWOZ. Extensive experiments demonstrate that: 1) our systems establish new state-of-the-art results on end-to-end response generation, 2) MinTL-based systems are more robust than baseline methods in the low resource setting, and they achieve competitive results with only 20% training data, and 3) Lev greatly improves the inference efficiency.

pdf bib
Cross-lingual Spoken Language Understanding with Regularized Representation Alignment
Zihan Liu | Genta Indra Winata | Peng Xu | Zhaojiang Lin | Pascale Fung
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite the promising results of current cross-lingual models for spoken language understanding systems, they still suffer from imperfect cross-lingual representation alignments between the source and target languages, which makes the performance sub-optimal. To cope with this issue, we propose a regularization approach to further align word-level and sentence-level representations across languages without any external resource. First, we regularize the representation of user utterances based on their corresponding labels. Second, we regularize the latent variable model (Liu et al., 2019) by leveraging adversarial training to disentangle the latent variables. Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios, and our model, trained on a few-shot setting with only 3% of the target language training data, achieves comparable performance to the supervised training with all the training data.

pdf bib
Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction
Marta R. Costa-jussà | Cristina España-Bonet | Pascale Fung | Noah A. Smith
Computational Linguistics, Volume 46, Issue 2 - June 2020

We introduce the Computational Linguistics special issue on Multilingual and Interlingual Semantic Representations for Natural Language Processing. We situate the special issue’s five articles in the context of our fast-changing field, explaining our motivation for this project. We offer a brief summary of the work in the issue, which includes developments on lexical and sentential semantic representations, from symbolic and neural perspectives.

pdf bib
Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning
Zhaojiang Lin | Andrea Madotto | Pascale Fung
Findings of the Association for Computational Linguistics: EMNLP 2020

Fine-tuning pre-trained generative language models to down-stream language generation tasks has shown promising results. However, this comes with the cost of having a single, large model for each task, which is not ideal in low-memory/power scenarios (e.g., mobile). In this paper, we propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pretrained model. The experiments on five diverse language generation tasks show that by just using an additional 2-3% parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model.

pdf bib
Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems
Andrea Madotto | Samuel Cahyawijaya | Genta Indra Winata | Yan Xu | Zihan Liu | Zhaojiang Lin | Pascale Fung
Findings of the Association for Computational Linguistics: EMNLP 2020

Task-oriented dialogue systems are either modularized with separate dialogue state tracking (DST) and management steps or end-to-end trainable. In either case, the knowledge base (KB) plays an essential role in fulfilling user requests. Modularized systems rely on DST to interact with the KB, which is expensive in terms of annotation and inference time. End-to-end systems, instead, use the KB directly as input, but they cannot scale when the KB is larger than a few hundred entries. In this paper, we propose a method to embed the KB, of any size, directly into the model parameters. The resulting model does not require any DST or template responses, nor the KB as input, and it can dynamically update its KB via fine-tuning. We evaluate our solution in five task-oriented dialogue datasets with small, medium, and large KB size. Our experiments show that end-to-end models can effectively embed knowledge bases in their parameters and achieve competitive performance in all evaluated datasets.

pdf bib
Plug-and-Play Conversational Models
Andrea Madotto | Etsuko Ishii | Zhaojiang Lin | Sumanth Dathathri | Pascale Fung
Findings of the Association for Computational Linguistics: EMNLP 2020

There has been considerable progress made towards conversational models that generate coherent and fluent responses; however, this often involves training large language models on large dialogue datasets, such as Reddit. These large conversational models provide little control over the generated responses, and this control is further limited in the absence of annotated conversational datasets for attribute specific generation that can be used for fine-tuning the model. In this paper, we first propose and evaluate plug-and-play methods for controllable response generation, which does not require dialogue specific datasets and does not rely on fine-tuning a large model. While effective, the decoding procedure induces considerable computational overhead, rendering the conversational model unsuitable for interactive usage. To overcome this, we introduce an approach that does not require further computation at decoding time, while also does not require any fine-tuning of a large language model. We demonstrate, through extensive automatic and human evaluation, a high degree of control over the generated conversational responses with regard to multiple desired attributes, while being fluent.

pdf bib
Multi-hop Question Generation with Graph Convolutional Network
Dan Su | Yan Xu | Wenliang Dai | Ziwei Ji | Tiezheng Yu | Pascale Fung
Findings of the Association for Computational Linguistics: EMNLP 2020

Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered evidence from different paragraphs. It is a more challenging yet under-explored task compared to conventional single-hop QG, where the questions are generated from the sentence containing the answer or nearby sentences in the same paragraph without complex reasoning. To address the additional challenges in multi-hop QG, we propose Multi-Hop Encoding Fusion Network for Question Generation (MulQG), which does context encoding in multiple hops with Graph Convolutional Network and encoding fusion via an Encoder Reasoning Gate. To the best of our knowledge, we are the first to tackle the challenge of multi-hop reasoning over paragraphs without any sentence-level information. Empirical results on HotpotQA dataset demonstrate the effectiveness of our method, in comparison with baselines on automatic evaluation metrics. Moreover, from the human evaluation, our proposed model is able to generate fluent questions with high completeness and outperforms the strongest baseline by 20.8% in the multi-hop evaluation. on. The code is publicly availableat https://github.com/HLTCHKU

pdf bib
Getting To Know You: User Attribute Extraction from Dialogues
Chien-Sheng Wu | Andrea Madotto | Zhaojiang Lin | Peng Xu | Pascale Fung
Proceedings of the 12th Language Resources and Evaluation Conference

User attributes provide rich and useful information for user understanding, yet structured and easy-to-use attributes are often sparsely populated. In this paper, we leverage dialogues with conversational agents, which contain strong suggestions of user information, to automatically extract user attributes. Since no existing dataset is available for this purpose, we apply distant supervision to train our proposed two-stage attribute extractor, which surpasses several retrieval and generation baselines on human evaluation. Meanwhile, we discuss potential applications (e.g., personalized recommendation and dialogue systems) of such extracted user attributes, and point out current limitations to cast light on future work.

2019

pdf bib
CAiRE_HKUST at SemEval-2019 Task 3: Hierarchical Attention for Dialogue Emotion Classification
Genta Indra Winata | Andrea Madotto | Zhaojiang Lin | Jamin Shin | Yan Xu | Peng Xu | Pascale Fung
Proceedings of the 13th International Workshop on Semantic Evaluation

Detecting emotion from dialogue is a challenge that has not yet been extensively surveyed. One could consider the emotion of each dialogue turn to be independent, but in this paper, we introduce a hierarchical approach to classify emotion, hypothesizing that the current emotional state depends on previous latent emotions. We benchmark several feature-based classifiers using pre-trained word and emotion embeddings, state-of-the-art end-to-end neural network models, and Gaussian processes for automatic hyper-parameter search. In our experiments, hierarchical architectures consistently give significant improvements, and our best model achieves a 76.77% F1-score on the test set.

pdf bib
Team yeon-zi at SemEval-2019 Task 4: Hyperpartisan News Detection by De-noising Weakly-labeled Data
Nayeon Lee | Zihan Liu | Pascale Fung
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system that has been submitted to SemEval-2019 Task 4: Hyperpartisan News Detection. We focus on removing the noise inherent in the hyperpartisanship dataset from both data-level and model-level by leveraging semi-supervised pseudo-labels and the state-of-the-art BERT model. Our model achieves 75.8% accuracy in the final by-article dataset without ensemble learning.

pdf bib
Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences
Genta Indra Winata | Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Training code-switched language models is difficult due to lack of data and complexity in the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, this require external word alignments or constituency parsers that create erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch one language to the other. Moreover, it captures code-switching constraints by attending and aligning the words in inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.

pdf bib
MoEL: Mixture of Empathetic Listeners
Zhaojiang Lin | Andrea Madotto | Jamin Shin | Peng Xu | Pascale Fung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Previous research on empathetic dialogue systems has mostly focused on generating responses given certain emotions. However, being empathetic not only requires the ability of generating emotional responses, but more importantly, requires the understanding of user emotions and replying appropriately. In this paper, we propose a novel end-to-end approach for modeling empathy in dialogue systems: Mixture of Empathetic Listeners (MoEL). Our model first captures the user emotions and outputs an emotion distribution. Based on this, MoEL will softly combine the output states of the appropriate Listener(s), which are each optimized to react to certain emotions, and generate an empathetic response. Human evaluations on EMPATHETIC-DIALOGUES dataset confirm that MoEL outperforms multitask training baseline in terms of empathy, relevance, and fluency. Furthermore, the case study on generated responses of different Listeners shows high interpretability of our model.

pdf bib
Zero-shot Cross-lingual Dialogue Systems with Transferable Latent Variables
Zihan Liu | Jamin Shin | Yan Xu | Genta Indra Winata | Peng Xu | Andrea Madotto | Pascale Fung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite the surging demands for multilingual task-oriented dialog systems (e.g., Alexa, Google Home), there has been less research done in multilingual or cross-lingual scenarios. Hence, we propose a zero-shot adaptation of task-oriented dialogue system to low-resource languages. To tackle this challenge, we first use a set of very few parallel word pairs to refine the aligned cross-lingual word-level representations. We then employ a latent variable model to cope with the variance of similar sentences across different languages, which is induced by imperfect cross-lingual alignments and inherent differences in languages. Finally, the experimental results show that even though we utilize much less external resources, our model achieves better adaptation performance for natural language understanding task (i.e., the intent detection and slot filling) compared to the current state-of-the-art model in the zero-shot scenario.

pdf bib
Clickbait? Sensational Headline Generation with Auto-tuned Reinforcement Learning
Peng Xu | Chien-Sheng Wu | Andrea Madotto | Pascale Fung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Sensational headlines are headlines that capture people’s attention and generate reader interest. Conventional abstractive headline generation methods, unlike human writers, do not optimize for maximal reader attention. In this paper, we propose a model that generates sensational headlines without labeled data. We first train a sensationalism scorer by classifying online headlines with many comments (“clickbait”) against a baseline of headlines generated from a summarization model. The score from the sensationalism scorer is used as the reward for a reinforcement learner. However, maximizing the noisy sensationalism reward will generate unnatural phrases instead of sensational headlines. To effectively leverage this noisy reward, we propose a novel loss function, Auto-tuned Reinforcement Learning (ARL), to dynamically balance reinforcement learning (RL) with maximum likelihood estimation (MLE). Human evaluation shows that 60.8% of samples generated by our model are sensational, which is significantly better than the Pointer-Gen baseline and other RL models.

pdf bib
Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition
Genta Indra Winata | Zhaojiang Lin | Jamin Shin | Zihan Liu | Pascale Fung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In countries that speak multiple main languages, mixing up different languages within a conversation is commonly called code-switching. Previous works addressing this challenge mainly focused on word-level aspects such as word embeddings. However, in many cases, languages share common subwords, especially for closely related languages, but also for languages that are seemingly irrelevant. Therefore, we propose Hierarchical Meta-Embeddings (HME) that learn to combine multiple monolingual word-level and subword-level embeddings to create language-agnostic lexical representations. On the task of Named Entity Recognition for English-Spanish code-switching data, our model achieves the state-of-the-art performance in the multilingual settings. We also show that, in cross-lingual settings, our model not only leverages closely related languages, but also learns from languages with different roots. Finally, we show that combining different subunits are crucial for capturing code-switching entities.

pdf bib
Generalizing Question Answering System with Pre-trained Language Model Fine-tuning
Dan Su | Yan Xu | Genta Indra Winata | Peng Xu | Hyeondey Kim | Zihan Liu | Pascale Fung
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

With a large number of datasets being released and new techniques being proposed, Question answering (QA) systems have witnessed great breakthroughs in reading comprehension (RC)tasks. However, most existing methods focus on improving in-domain performance, leaving open the research question of how these mod-els and techniques can generalize to out-of-domain and unseen RC tasks. To enhance the generalization ability, we propose a multi-task learning framework that learns the shared representation across different tasks. Our model is built on top of a large pre-trained language model, such as XLNet, and then fine-tuned on multiple RC datasets. Experimental results show the effectiveness of our methods, with an average Exact Match score of 56.59 and an average F1 score of 68.98, which significantly improves the BERT-Large baseline by8.39 and 7.22, respectively

pdf bib
Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
Chien-Sheng Wu | Andrea Madotto | Ehsan Hosseini-Asl | Caiming Xiong | Richard Socher | Pascale Fung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Over-dependence on domain ontology and lack of sharing knowledge across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short when tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using copy mechanism, facilitating transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art 48.62% joint goal accuracy for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show the transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.

pdf bib
Personalizing Dialogue Agents via Meta-Learning
Andrea Madotto | Zhaojiang Lin | Chien-Sheng Wu | Pascale Fung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing personalized dialogue models use human designed persona descriptions to improve dialogue consistency. Collecting such descriptions from existing dialogues is expensive and requires hand-crafted feature designs. In this paper, we propose to extend Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to personalized dialogue learning without using any persona descriptions. Our model learns to quickly adapt to new personas by leveraging only a few dialogue samples collected from the same user, which is fundamentally different from conditioning the response on the persona descriptions. Empirical results on Persona-chat dataset (Zhang et al., 2018) indicate that our solution outperforms non-meta-learning baselines using automatic evaluation metrics, and in terms of human-evaluated fluency and consistency.

bib
Understanding the Shades of Sexism in Popular TV Series
Nayeon Lee | Yejin Bang | Jamin Shin | Pascale Fung
Proceedings of the 2019 Workshop on Widening NLP

[Multiple-submission] In the midst of a generation widely exposed to and influenced by media entertainment, the NLP research community has shown relatively little attention on the sexist comments in popular TV series. To understand sexism in TV series, we propose a way of collecting distant supervision dataset using Character Persona information with the psychological theories on sexism. We assume that sexist characters from TV shows are more prone to making sexist comments when talking about women, and show that this hypothesis is valid through experiment. Finally, we conduct an interesting analysis on popular TV show characters and successfully identify different shades of sexism that is often overlooked.

bib
Exploring Social Bias in Chatbots using Stereotype Knowledge
Nayeon Lee | Andrea Madotto | Pascale Fung
Proceedings of the 2019 Workshop on Widening NLP

Exploring social bias in chatbot is an important, yet relatively unexplored problem. In this paper, we propose an approach to understand social bias in chatbots by leveraging stereotype knowledge. It allows interesting comparison of bias between chatbots and humans, and provides intuitive analysis of existing chatbots by borrowing the finer-grain concepts of sexism and racism.

pdf bib
Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition
Genta Indra Winata | Zhaojiang Lin | Pascale Fung
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

In this paper, we propose Multilingual Meta-Embeddings (MME), an effective method to learn multilingual representations by leveraging monolingual pre-trained embeddings. MME learns to utilize information from these embeddings via a self-attention mechanism without explicit language identification. We evaluate the proposed embedding method on the code-switching English-Spanish Named Entity Recognition dataset in a multilingual and cross-lingual setting. The experimental results show that our proposed method achieves state-of-the-art performance on the multilingual setting, and it has the ability to generalize to an unseen language task.

pdf bib
Modality-based Factorization for Multimodal Fusion
Elham J. Barezi | Pascale Fung
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

We propose a novel method, Modality-based Redundancy Reduction Fusion (MRRF), for understanding and modulating the relative contribution of each modality in multimodal inference tasks. This is achieved by obtaining an (M+1)-way tensor to consider the high-order relationships between M modalities and the output layer of a neural network model. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, results in removing information present in a modality that can be compensated by other modalities, with respect to model outputs. This helps to understand the relative utility of information in each modality. In addition it leads to a less complicated model with less parameters and therefore could be applied as a regularizer avoiding overfitting. We have applied this method to three different multimodal datasets in sentiment analysis, personality trait recognition, and emotion recognition. We are able to recognize relationships and relative importance of different modalities in these tasks and achieves a 1% to 4% improvement on several evaluation measures compared to the state-of-the-art for all three tasks.

pdf bib
Incorporating Word and Subword Units in Unsupervised Machine Translation Using Language Model Rescoring
Zihan Liu | Yan Xu | Genta Indra Winata | Pascale Fung
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes CAiRE’s submission to the unsupervised machine translation track of the WMT’19 news shared task from German to Czech. We leverage a phrase-based statistical machine translation (PBSMT) model and a pre-trained language model to combine word-level neural machine translation (NMT) and subword-level NMT models without using any parallel data. We propose to solve the morphological richness problem of languages by training byte-pair encoding (BPE) embeddings for German and Czech separately, and they are aligned using MUSE (Conneau et al., 2018). To ensure the fluency and consistency of translations, a rescoring mechanism is proposed that reuses the pre-trained language model to select the translation candidates generated through beam search. Moreover, a series of pre-processing and post-processing approaches are applied to improve the quality of final translations.

pdf bib
Learning to Learn Sales Prediction with Social Media Sentiment
Zhaojiang Lin | Andrea Madotto | Genta Indra Winata | Zihan Liu | Yan Xu | Cong Gao | Pascale Fung
Proceedings of the First Workshop on Financial Technology and Natural Language Processing

pdf bib
A Submodular Feature-Aware Framework for Label Subset Selection in Extreme Classification Problems
Elham J. Barezi | Ian D. Wood | Pascale Fung | Hamid R. Rabiee
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Extreme classification is a classification task on an extremely large number of labels (tags). User generated labels for any type of online data can be sparing per individual user but intractably large among all users. It would be useful to automatically select a smaller, standard set of labels to represent the whole label set. We can then solve efficiently the problem of multi-label learning with an intractably large number of interdependent labels, such as automatic tagging of Wikipedia pages. We propose a submodular maximization framework with linear cost to find informative labels which are most relevant to other labels yet least redundant with each other. A simple prediction model can then be trained on this label subset. Our framework includes both label-label and label-feature dependencies, which aims to find the labels with the most representation and prediction ability. In addition, to avoid information loss, we extract and predict outlier labels with weak dependency on other labels. We apply our model to four standard natural language data sets including Bibsonomy entries with users assigned tags, web pages with user assigned tags, legal texts with EUROVOC descriptors(A topic hierarchy with almost 4000 categories regarding different aspects of European law) and Wikipedia pages with tags from social bookmarking as well as news videos for automated label detection from a lexicon of semantic concepts. Experimental results show that our proposed approach improves label prediction quality, in terms of precision and nDCG, by 3% to 5% in three of the 5 tasks and is competitive in the others, even with a simple linear prediction model. An ablation study shows how different data sets benefit from different aspects of our model, with all aspects contributing substantially to at least one data set.

2018

pdf bib
Improving Large-Scale Fact-Checking using Decomposable Attention Models and Lexical Tagging
Nayeon Lee | Chien-Sheng Wu | Pascale Fung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Fact-checking of textual sources needs to effectively extract relevant information from large knowledge bases. In this paper, we extend an existing pipeline approach to better tackle this problem. We propose a neural ranker using a decomposable attention model that dynamically selects sentences to achieve promising improvement in evidence retrieval F1 by 38.80%, with (x65) speedup compared to a TF-IDF method. Moreover, we incorporate lexical tagging methods into our pipeline framework to simplify the tasks and render the model more generalizable. As a result, our framework achieves promising performance on a large-scale fact extraction and verification dataset with speedup.

pdf bib
Reducing Gender Bias in Abusive Language Detection
Ji Ho Park | Jamin Shin | Pascale Fung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Abusive language detection models tend to have a problem of being biased toward identity words of a certain group of people because of imbalanced training datasets. For example, “You are a good woman” was considered “sexist” when trained on an existing dataset. Such model bias is an obstacle for models to be robust enough for practical use. In this work, we measure them on models trained with different datasets, while analyzing the effect of different pre-trained word embeddings and model architectures. We also experiment with three mitigation methods: (1) debiased word embeddings, (2) gender swap data augmentation, and (3) fine-tuning with a larger corpus. These methods can effectively reduce model bias by 90-98% and can be extended to correct model bias in other scenarios.

pdf bib
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems
Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

End-to-end task-oriented dialog systems usually suffer from the challenge of incorporating knowledge bases. In this paper, we propose a novel yet simple end-to-end differentiable model called memory-to-sequence (Mem2Seq) to address this issue. Mem2Seq is the first neural generative model that combines the multi-hop attention over memories with the idea of pointer network. We empirically show how Mem2Seq controls each generation step, and how its multi-hop attention mechanism helps in learning correlations between memories. In addition, our model is quite general without complicated task-specific designs. As a result, we show that Mem2Seq can be trained faster and attain the state-of-the-art performance on three different task-oriented dialog datasets.

pdf bib
Investigating Audio, Video, and Text Fusion Methods for End-to-End Automatic Personality Prediction
Onno Kampman | Elham J. Barezi | Dario Bertero | Pascale Fung
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose a tri-modal architecture to predict Big Five personality trait scores from video clips with different channels for audio, text, and video data. For each channel, stacked Convolutional Neural Networks are employed. The channels are fused both on decision-level and by concatenating their respective fully connected layers. It is shown that a multimodal fusion approach outperforms each single modality channel, with an improvement of 9.4% over the best individual modality (video). Full backpropagation is also shown to be better than a linear combination of modalities, meaning complex interactions between modalities can be leveraged to build better models. Furthermore, we can see the prediction relevance of each modality for each trait. The described model can be used to increase the emotional intelligence of virtual agents.

pdf bib
Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning
Genta Indra Winata | Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Lack of text data has been the major issue on code-switching language modeling. In this paper, we introduce multi-task learning based language model which shares syntax representation of languages to leverage linguistic information and tackle the low resource data issue. Our model jointly learns both language modeling and Part-of-Speech tagging on code-switched utterances. In this way, the model is able to identify the location of code-switching points and improves the prediction of next word. Our approach outperforms standard LSTM based language model, with an improvement of 9.7% and 7.4% in perplexity on SEAME Phase I and Phase II dataset respectively.

pdf bib
Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
Genta Indra Winata | Chien-Sheng Wu | Andrea Madotto | Pascale Fung
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

We propose an LSTM-based model with hierarchical architecture on named entity recognition from code-switching Twitter data. Our model uses bilingual character representation and transfer learning to address out-of-vocabulary words. In order to mitigate data noise, we propose to use token replacement and normalization. In the 3rd Workshop on Computational Approaches to Linguistic Code-Switching Shared Task, we achieved second place with 62.76% harmonic mean F1-score for English-Spanish language pair without using any gazetteer and knowledge-based information.

pdf bib
Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training
Peng Xu | Andrea Madotto | Chien-Sheng Wu | Ji Ho Park | Pascale Fung
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we propose Emo2Vec which encodes emotional semantics into vectors. We train Emo2Vec by multi-task learning six different emotion-related tasks, including emotion/sentiment analysis, sarcasm classification, stress detection, abusive language classification, insult detection, and personality recognition. Our evaluation of Emo2Vec shows that it outperforms existing affect-related representations, such as Sentiment-Specific Word Embedding and DeepMoji embeddings with much smaller training corpora. When concatenated with GloVe, Emo2Vec achieves competitive performances to state-of-the-art results on several tasks using a simple logistic regression classifier.

pdf bib
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts
Donia Scott | Marilyn Walker | Pascale Fung
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

pdf bib
PlusEmo2Vec at SemEval-2018 Task 1: Exploiting emotion knowledge from emoji and #hashtags
Ji Ho Park | Peng Xu | Pascale Fung
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our system that has been submitted to SemEval-2018 Task 1: Affect in Tweets (AIT) to solve five subtasks. We focus on modeling both sentence and word level representations of emotion inside texts through large distantly labeled corpora with emojis and hashtags. We transfer the emotional knowledge by exploiting neural network models as feature extractors and use these representations for traditional machine learning models such as support vector regression (SVR) and logistic regression to solve the competition tasks. Our system is placed among the Top3 for all subtasks we participated.

2017

pdf bib
Zara Returns: Improved Personality Induction and Adaptation by an Empathetic Virtual Agent
Farhad Bin Siddique | Onno Kampman | Yang Yang | Anik Dey | Pascale Fung
Proceedings of ACL 2017, System Demonstrations

pdf bib
One-step and Two-step Classification for Abusive Language Detection on Twitter
Ji Ho Park | Pascale Fung
Proceedings of the First Workshop on Abusive Language Online

Automatic abusive language detection is a difficult but important task for online social media. Our research explores a two-step approach of performing classification on abusive language and then classifying into specific types and compares it with one-step approach of doing one multi-class classification for detecting sexist and racist languages. With a public English Twitter corpus of 20 thousand tweets in the type of sexism and racism, our approach shows a promising performance of 0.827 F-measure by using HybridCNN in one-step and 0.824 F-measure by using logistic regression in two-steps.

2016

pdf bib
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Mona Diab | Pascale Fung | Mahmoud Ghoneim | Julia Hirschberg | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Real-Time Speech Emotion and Sentiment Recognition for Interactive Dialogue Systems
Dario Bertero | Farhad Bin Siddique | Chien-Sheng Wu | Yan Wan | Ricky Ho Yin Chan | Pascale Fung
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Long Short-Term Memory Framework for Predicting Humor in Dialogues
Dario Bertero | Pascale Fung
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Zara The Supergirl: An Empathetic Personality Recognition System
Pascale Fung | Anik Dey | Farhad Bin Siddique | Ruixi Lin | Yang Yang | Yan Wan | Ho Yin Ricky Chan
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
Zara: A Virtual Interactive Dialogue System Incorporating Emotion, Sentiment and Personality Recognition
Pascale Fung | Anik Dey | Farhad Bin Siddique | Ruixi Lin | Yang Yang | Dario Bertero | Yan Wan | Ricky Ho Yin Chan | Chien-Sheng Wu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Zara, or ‘Zara the Supergirl’ is a virtual robot, that can exhibit empathy while interacting with an user, with the aid of its built in facial and emotion recognition, sentiment analysis, and speech module. At the end of the 5-10 minute conversation, Zara can give a personality analysis of the user based on all the user utterances. We have also implemented a real-time emotion recognition, using a CNN model that detects emotion from raw audio without feature extraction, and have achieved an average of 65.7% accuracy on six different emotion classes, which is an impressive 4.5% improvement from the conventional feature based SVM classification. Also, we have described a CNN based sentiment analysis module trained using out-of-domain data, that recognizes sentiment from the speech recognition transcript, which has a 74.8 F-measure when tested on human-machine dialogues.

pdf bib
Deep Learning of Audio and Language Features for Humor Prediction
Dario Bertero | Pascale Fung
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We propose a comparison between various supervised machine learning methods to predict and detect humor in dialogues. We retrieve our humorous dialogues from a very popular TV sitcom: “The Big Bang Theory”. We build a corpus where punchlines are annotated using the canned laughter embedded in the audio track. Our comparative study involves a linear-chain Conditional Random Field over a Recurrent Neural Network and a Convolutional Neural Network. Using a combination of word-level and audio frame-level features, the CNN outperforms the other methods, obtaining the best F-score of 68.5% over 66.5% by CRF and 52.9% by RNN. Our work is a starting point to developing more effective machine learning and neural network models on the humor prediction task, as well as developing machines capable in understanding humor in general.

pdf bib
A Machine Learning based Music Retrieval and Recommendation System
Naziba Mostafa | Yan Wan | Unnayan Amitabh | Pascale Fung
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present a music retrieval and recommendation system using machine learning techniques. We propose a query by humming system for music retrieval that uses deep neural networks for note transcription and a note-based retrieval system for retrieving the correct song from the database. We evaluate our query by humming system using the standard MIREX QBSH dataset. We also propose a similar artist recommendation system which recommends similar artists based on acoustic features of the artists’ music, online text descriptions of the artists and social media data. We use supervised machine learning techniques over all our features and compare our recommendation results to those produced by a popular similar artist recommendation website.

2015

pdf bib
HLTC-HKUST: A Neural Network Paraphrase Classifier using Translation Metrics, Semantic Roles and Lexical Similarity Features
Dario Bertero | Pascale Fung
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib
Proceedings of the First Workshop on Computational Approaches to Code Switching
Mona Diab | Julia Hirschberg | Pascale Fung | Thamar Solorio
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Overview for the First Shared Task on Language Identification in Code-Switched Data
Thamar Solorio | Elizabeth Blair | Suraj Maharjan | Steven Bethard | Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Julia Hirschberg | Alison Chang | Pascale Fung
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Language Modeling with Functional Head Constraint for Code Switching Speech Recognition
Ying Li | Pascale Fung
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Co-Training for Classification of Live or Studio Music Recordings
Nicolas Auguin | Pascale Fung
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The fast-spreading development of online streaming services has enabled people from all over the world to listen to music. However, it is not always straightforward for a given user to find the “right” song version he or she is looking for. As streaming services may be affected by the potential dissatisfaction among their customers, the quality of songs and the presence of tags (or labels) associated with songs returned to the users are very important. Thus, the need for precise and reliable metadata becomes paramount. In this work, we are particularly interested in distinguishing between live and studio versions of songs. Specifically, we tackle the problem in the case where very little-annotated training data are available, and demonstrate how an original co-training algorithm in a semi-supervised setting can alleviate the problem of data scarcity to successfully discriminate between live and studio music recordings.

pdf bib
A Hindi-English Code-Switching Corpus
Anik Dey | Pascale Fung
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The aim of this paper is to investigate the rules and constraints of code-switching (CS) in Hindi-English mixed language data. In this paper, we’ll discuss how we collected the mixed language corpus. This corpus is primarily made up of student interview speech. The speech was manually transcribed and verified by bilingual speakers of Hindi and English. The code-switching cases in the corpus are discussed and the reasons for code-switching are explained.

2013

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition
Ping Xu | Pascale Fung
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Code-Switch Language Model with Inversion Constraints for Mixed Language Speech Recognition
Ying Li | Pascale Fung
Proceedings of COLING 2012

pdf bib
A Multilingual Natural Stress Emotion Database
Xin Zuo | Tian Li | Pascale Fung
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we describe an ongoing effort in collecting and annotating a multilingual speech database of natural stress emotion from university students. The goal is to detect natural stress emotions and study the stress expression differences in different languages, which may help psychologists in the future. We designed a common questionnaire of stress-inducing and non-stress-inducing questions in English, Mandarin and Cantonese and collected a first ever, multilingual corpus of natural stress emotion. All of the students are native speakers of the corresponding language. We asked native language speakers to annotate recordings according to the participants' self-label states and obtained a very good kappa inter labeler agreement. We carried out human perception tests where listeners who do not understand Chinese were asked to detect stress emotion from the Mandarin Chinese database. Compared to the annotation labels, these human perceived emotions are of low accuracy, which shows a great necessity for natural stress detection research.

pdf bib
A Mandarin-English Code-Switching Corpus
Ying Li | Yue Yu | Pascale Fung
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Generally the existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of code-switching speech. The motivation of this paper is to study the rules and constraints code-switching follows and design a corpus for code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four parts: 1) conversational meeting speech and its data; 2) project meeting speech data; 3) student interviews speech; 4) text data of on-line news. The speech was transcribed by an annotator and verified by Mandarin-English bilingual speakers manually. We propose an approach for automatically downloading from the web text data that contains code-switching. The corpus includes both intra-sentential code-switching (switch in the middle of a sentence) and inter-sentential code-switching (switch at the end of the sentence). The distribution of part-of-speech (POS) tags and code-switching reasons are reported.

pdf bib
Using English Acoustic Models for Hindi Automatic Speech Recognition
Anik Dey | Ying Li | Pascale Fung
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

2011

pdf bib
Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web
Simon Shi | Pascale Fung | Emmanuel Prochasson | Chi-kiu Lo | Dekai Wu
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Rare Word Translation Extraction from Aligned Comparable Documents
Emmanuel Prochasson | Pascale Fung
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Unsupervised Synthesis of Multilingual Wikipedia Articles
Yuncong Chen | Pascale Fung
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
A Rhetorical Syntax-Driven Model for Speech Summarization
Jian Zhang | Pascale Fung
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project
Yi Liu | Pascale Fung | Yongsheng Yang | Denise DiPersio | Meghan Glenn | Stephanie Strassel | Christopher Cieri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST’s acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.

2009

pdf bib
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)
Pascale Fung | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

pdf bib
Active Learning of Extractive Reference Summaries for Lecture Speech Summarization
Jian Zhang | Pascale Fung
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

pdf bib
Semantic Roles for SMT: A Hybrid Two-Pass Model
Dekai Wu | Pascale Fung
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib
Can Semantic Role Labeling Improve SMT?
Dekai Wu | Pascale Fung
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2007

pdf bib
Speech Summarization Without Lexical Features for Mandarin Broadcast News
Jian Zhang | Pascale Fung
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Learning bilingual semantic frames: shallow semantic parsing vs. semantic role projection
Pascale Fung | Zhaojun Wu | Yongsheng Yang | Dekai Wu
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

pdf bib
Robust Word Sense Translation by EM Learning of Frame Semantics
Pascale Fung | Benfeng Chen
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Inversion Transduction Grammar Constraints for Mining Parallel Sentences from Quasi-Comparable Corpora
Dekai Wu | Pascale Fung
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib
Automatic Construction of an English-Chinese Bilingual FrameNet
Benfeng Chen | Pascale Fung
Proceedings of HLT-NAACL 2004: Short Papers

pdf bib
Using N-best lists for Named Entity Recognition from Chinese Speech
Lufeng Zhai | Pascale Fung | Richard Schwartz | Marine Carpuat | Dekai Wu
Proceedings of HLT-NAACL 2004: Short Papers

pdf bib
BiFrameNet: Bilingual Frame Semantics Resource Construction by Cross-lingual Induction
Pascale Fung | Benfeng Chen
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Multi-level Bootstrapping For Extracting Parallel Sentences From a Quasi-Comparable Corpus
Pascale Fung | Percy Cheung
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E
Pascale Fung | Percy Cheung
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

pdf bib
Combining Optimal Clustering and Hidden Markov Models for Extractive Summarization
Pascale Fung | Grace Ngai | Chi-Shun Cheung
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

2002

pdf bib
Identifying Concepts Across Languages: A First Step towards a Corpus-based Approach to Automatic Ontology Alignment
Grace Ngai | Marine Carpuat | Pascale Fung
COLING 2002: The 19th International Conference on Computational Linguistics

pdf bib
Semantic Networks: the Path to Profitability
Pascale Fung
COLING-02: SEMANET: Building and Using Semantic Networks

1999

pdf bib
Mixed Language Query Disambiguation
Pascale Fung | Xiaohu Liu | Chi Shun Cheung
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf bib
An IR Approach for Translating New Words from Nonparallel, Comparable Texts
Pascale Fung | Lo Yuen Yee
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
A statistical view on bilingual lexicon extraction
Pascale Fung
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

We present two problems for statistically extracting bilingual lexicon: (1) How can noisy parallel corpora be used? (2) How can non-parallel yet comparable corpora be used? We describe our own work and contribution in relaxing the constraint of using only clean parallel corpora. DKvec is a method for extracting bilingual lexicons, from noisy parallel corpora based on arrival distances of words in noisy parallel corpora. Using DKvec on noisy parallel corpora in English/Japanese and English/Chinese, our evaluations show a 55.35% precision from a small corpus and 89.93% precision from a larger corpus. Our major contribution is in the extraction of bilingual lexicon from non-parallel corpora. We present a first such result in this area, from a new method-Convec. Convec is based on context information of a word to be translated.

pdf bib
An IR Approach for Translating New Words from Nonparallel, Comparable Texts
Pascale Fung | Lo Yuen Yee
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

1997

pdf bib
Finding Terminology Translations from Non-parallel Corpora
Pascale Fung
Fifth Workshop on Very Large Corpora

pdf bib
Dealing with Multilinguality in a Spoken Language Query Translator
Pascale Fung | Bertram Shi | Dekai Wu | Lain Wai Bun | Wong Shuen Kong
Spoken Language Translation

1995

pdf bib
Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus
Pascale Fung
Third Workshop on Very Large Corpora

pdf bib
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora
Pascale Fung
33rd Annual Meeting of the Association for Computational Linguistics

pdf bib
Coerced Markov Models for Cross-Lingual Lexical-Tag Relations
Pascale Fung | Dekai Wu
Proceedings of the Sixth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

1994

pdf bib
Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping
Pascale Fung | Kathleen McKeown
Proceedings of the First Conference of the Association for Machine Translation in the Americas

pdf bib
K-vec: A New Approach for Aligning Parallel Texts
Pascale Fung | Kenneth Ward Church
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

pdf bib
Improving Chinese Tokenization With Linguistic Filters on Statistical Lexical Acquisition
Dekai Wu | Pascale Fung
Fourth Conference on Applied Natural Language Processing

1992

pdf bib
BBN BYBLOS and HARC February 1992 ATIS Benchmark Results
Francis Kubala | Chris Barry | Madeleine Bates | Robert Bobrow | Pascale Fung | Robert Ingria | John Makhoul | Long Nguyen | Richard Schwartz | David Stallard
Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992

Search
Co-authors