Yunyao Li


2021

pdf bib
Improving Cross-lingual Text Classification with Zero-shot Instance-Weighting
Irene Li | Prithviraj Sen | Huaiyu Zhu | Yunyao Li | Dragomir Radev
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Cross-lingual text classification (CLTC) is a challenging task made even harder still due to the lack of labeled data in low-resource languages. In this paper, we propose zero-shot instance-weighting, a general model-agnostic zero-shot learning framework for improving CLTC by leveraging source instance weighting. It adds a module on top of pre-trained language models for similarity computation of instance weights, thus aligning each source instance to the target language. During training, the framework utilizes gradient descent that is weighted by instance weights to update parameters. We evaluate this framework over seven target languages on three fundamental tasks and show its effectiveness and extensibility, by improving on F1 score up to 4% in single-source transfer and 8% in multi-source transfer. To the best of our knowledge, our method is the first to apply instance weighting in zero-shot CLTC. It is simple yet effective and easily extensible into multi-source transfer.

pdf bib
Leveraging Abstract Meaning Representation for Knowledge Base Question Answering
Pavan Kapanipathi | Ibrahim Abdelaziz | Srinivas Ravishankar | Salim Roukos | Alexander Gray | Ramón Fernandez Astudillo | Maria Chang | Cristina Cornelio | Saswati Dana | Achille Fokoue | Dinesh Garg | Alfio Gliozzo | Sairam Gurajada | Hima Karanam | Naweed Khan | Dinesh Khandelwal | Young-Suk Lee | Yunyao Li | Francois Luus | Ndivhuwo Makondo | Nandana Mihindukulasooriya | Tahira Naseem | Sumit Neelam | Lucian Popa | Revanth Gangi Reddy | Ryan Riegel | Gaetano Rossiello | Udit Sharma | G P Shrivatsa Bhargav | Mo Yu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Domain-Aware Dependency Parsing for Questions
Aparna Garimella | Laura Chiticariu | Yunyao Li
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic
Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

pdf bib
Deep Learning on Graphs for Natural Language Processing
Lingfei Wu | Yu Chen | Heng Ji | Yunyao Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

Due to its great power in modeling non-Euclidean data like graphs or manifolds, deep learning on graph techniques (i.e., Graph Neural Networks (GNNs)) have opened a new door to solving challenging graph-related NLP problems. There has seen a surge of interests in applying deep learning on graph techniques to NLP, and has achieved considerable success in many NLP tasks, ranging from classification tasks like sentence classification, semantic role labeling and relation extraction, to generation tasks like machine translation, question generation and summarization. Despite these successes, deep learning on graphs for NLP still face many challenges, including automatically transforming original text sequence data into highly graph-structured data, and effectively modeling complex data that involves mapping between graph-based inputs and other highly structured output data such as sequences, trees, and graph data with multi-types in both nodes and edges. This tutorial will cover relevant and interesting topics on applying deep learning on graph techniques to NLP, including automatic graph construction for NLP, graph representation learning for NLP, advanced GNN based models (e.g., graph2seq, graph2tree, and graph2graph) for NLP, and the applications of GNNs in various NLP tasks (e.g., machine translation, natural language generation, information extraction and semantic parsing). In addition, hands-on demonstration sessions will be included to help the audience gain practical experience on applying GNNs to solve challenging NLP problems using our recently developed open source library – Graph4NLP, the first library for researchers and practitioners for easy use of GNNs for various NLP tasks.

pdf bib
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
Young-bum Kim | Yunyao Li | Owen Rambow
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

pdf bib
Development of an Enterprise-Grade Contract Understanding System
Arvind Agarwal | Laura Chiticariu | Poornima Chozhiyath Raman | Marina Danilevsky | Diman Ghazi | Ankush Gupta | Shanmukha Guttula | Yannis Katsis | Rajasekar Krishnamurthy | Yunyao Li | Shubham Mudgal | Vitobha Munigala | Nicholas Phan | Dhaval Sonawane | Sneha Srinivasan | Sudarshan R. Thitte | Mitesh Vasa | Ramiya Venkatachalam | Vinitha Yaski | Huaiyu Zhu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Contracts are arguably the most important type of business documents. Despite their significance in business, legal contract review largely remains an arduous, expensive and manual process. In this paper, we describe TECUS: a commercial system designed and deployed for contract understanding and used by a wide range of enterprise users for the past few years. We reflect on the challenges and design decisions when building TECUS. We also summarize the data science life cycle of TECUS and share lessons learned.

pdf bib
LNN-EL: A Neuro-Symbolic Approach to Short-text Entity Linking
Hang Jiang | Sairam Gurajada | Qiuhao Lu | Sumit Neelam | Lucian Popa | Prithviraj Sen | Yunyao Li | Alexander Gray
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Entity linking (EL) is the task of disambiguating mentions appearing in text by linking them to entities in a knowledge graph, a crucial task for text understanding, question answering or conversational systems. In the special case of short-text EL, which poses additional challenges due to limited context, prior approaches have reached good performance by employing heuristics-based methods or purely neural approaches. Here, we take a different, neuro-symbolic approach that combines the advantages of using interpretable rules based on first-order logic with the performance of neural learning. Even though constrained to use rules, we show that we reach competitive or better performance with SoTA black-box neural approaches. Furthermore, our framework has the benefits of extensibility and transferability. We show that we can easily blend existing rule templates given by a human expert, with multiple types of features (priors, BERT encodings, box embeddings, etc), and even with scores resulting from previous EL methods, thus improving on such methods. As an example of improvement, on the LC-QuAD-1.0 dataset, we show more than 3% increase in F1 score relative to previous SoTA. Finally, we show that the inductive bias offered by using logic results in a set of learned rules that transfers from one dataset to another, sometimes without finetuning, while still having high accuracy.

2020

pdf bib
CORD-19: The COVID-19 Open Research Dataset
Lucy Lu Wang | Kyle Lo | Yoganand Chandrasekhar | Russell Reas | Jiangjiang Yang | Doug Burdick | Darrin Eide | Kathryn Funk | Yannis Katsis | Rodney Michael Kinney | Yunyao Li | Ziyang Liu | William Merrill | Paul Mooney | Dewey A. Murdick | Devvret Rishi | Jerry Sheehan | Zhihong Shen | Brandon Stilson | Alex D. Wade | Kuansan Wang | Nancy Xin Ru Wang | Christopher Wilhelm | Boya Xie | Douglas M. Raymond | Daniel S. Weld | Oren Etzioni | Sebastian Kohlmeier
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for COVID-19.

pdf bib
Jennifer for COVID-19: An NLP-Powered Chatbot Built for the People and by the People to Combat Misinformation
Yunyao Li | Tyrone Grandison | Patricia Silveyra | Ali Douraghy | Xinyu Guan | Thomas Kieselbach | Chengkai Li | Haiqi Zhang
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

Just as SARS-CoV-2, a new form of coronavirus continues to infect a growing number of people around the world, harmful misinformation about the outbreak also continues to spread. With the goal of combating misinformation, we designed and built Jennifer–a chatbot maintained by a global group of volunteers. With Jennifer, we hope to learn whether public information from reputable sources could be more effectively organized and shared in the wake of a crisis as well as to understand issues that the public were most immediately curious about. In this paper, we introduce Jennifer and describe the design of this proof-of-principle system. We also present lessons learned and discuss open challenges. Finally, to facilitate future research, we release COVID-19 Question Bank, a dataset of 3,924 COVID-19-related questions in 944 groups, gathered from our users and volunteers.

pdf bib
Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization
Qiuhao Lu | Nisansa de Silva | Dejing Dou | Thien Huu Nguyen | Prithviraj Sen | Berthold Reinwald | Yunyao Li
Proceedings of the 28th International Conference on Computational Linguistics

Network representation learning (NRL) is crucial in the area of graph learning. Recently, graph autoencoders and its variants have gained much attention and popularity among various types of node embedding approaches. Most existing graph autoencoder-based methods aim to minimize the reconstruction errors of the input network while not explicitly considering the semantic relatedness between nodes. In this paper, we propose a novel network embedding method which models the consistency across different views of networks. More specifically, we create a second view from the input network which captures the relation between nodes based on node content and enforce the latent representations from the two views to be consistent by incorporating a multiview adversarial regularization module. The experimental studies on benchmark datasets prove the effectiveness of this method, and demonstrate that our method compares favorably with the state-of-the-art algorithms on challenging tasks such as link prediction and node clustering. We also evaluate our method on a real-world application, i.e., 30-day unplanned ICU readmission prediction, and achieve promising results compared with several baseline methods.

pdf bib
Small but Mighty: New Benchmarks for Split and Rephrase
Li Zhang | Huaiyu Zhu | Siddhartha Brahma | Yunyao Li
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones. As a relatively new task, it is paramount to ensure the soundness of its evaluation benchmark and metric. We find that the widely used benchmark dataset universally contains easily exploitable syntactic cues caused by its automatic generation process. Taking advantage of such cues, we show that even a simple rule-based model can perform on par with the state-of-the-art model. To remedy such limitations, we collect and release two crowdsourced benchmark datasets. We not only make sure that they contain significantly more diverse syntax, but also carefully control for their quality according to a well-defined set of criteria. While no satisfactory automatic metric exists, we apply fine-grained manual evaluation based on these criteria using crowdsourcing, showing that our datasets better represent the task and are significantly more challenging for the models.

pdf bib
Learning Explainable Linguistic Expressions with Neural Inductive Logic Programming for Sentence Classification
Prithviraj Sen | Marina Danilevsky | Yunyao Li | Siddhartha Brahma | Matthias Boehm | Laura Chiticariu | Rajasekar Krishnamurthy
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Interpretability of predictive models is becoming increasingly important with growing adoption in the real-world. We present RuleNN, a neural network architecture for learning transparent models for sentence classification. The models are in the form of rules expressed in first-order logic, a dialect with well-defined, human-understandable semantics. More precisely, RuleNN learns linguistic expressions (LE) built on top of predicates extracted using shallow natural language understanding. Our experimental results show that RuleNN outperforms statistical relational learning and other neuro-symbolic methods, and performs comparably with black-box recurrent neural networks. Our user studies confirm that the learned LEs are explainable and capture domain semantics. Moreover, allowing domain experts to modify LEs and instill more domain knowledge leads to human-machine co-creation of models with better performance.

pdf bib
Learning Structured Representations of Entity Names using Active Learning and Weak Supervision
Kun Qian | Poornima Chozhiyath Raman | Yunyao Li | Lucian Popa
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Structured representations of entity names are useful for many entity-related tasks such as entity normalization and variant generation. Learning the implicit structured representations of entity names without context and external knowledge is particularly challenging. In this paper, we present a novel learning framework that combines active learning and weak supervision to solve this problem. Our experimental evaluation show that this framework enables the learning of high-quality models from merely a dozen or so labeled examples.

pdf bib
A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels
Youxuan Jiang | Huaiyu Zhu | Jonathan K. Kummerfeld | Yunyao Li | Walter Lasecki
Findings of the Association for Computational Linguistics: EMNLP 2020

Resources for Semantic Role Labeling (SRL) are typically annotated by experts at great expense. Prior attempts to develop crowdsourcing methods have either had low accuracy or required substantial expert annotation. We propose a new multi-stage crowd workflow that substantially reduces expert involvement without sacrificing accuracy. In particular, we introduce a unique filter stage based on the key observation that crowd workers are able to almost perfectly filter out incorrect options for labels. Our three-stage workflow produces annotations with 95% accuracy for predicate labels and 93% for argument labels, which is comparable to expert agreement. Compared to prior work on crowdsourcing for SRL, we decrease expert effort by 4x, from 56% to 14% of cases. Our approach enables more scalable annotation of SRL, and could enable annotation of NLP tasks that have previously been considered too complex to effectively crowdsource.

pdf bib
CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling
Ishan Jindal | Yunyao Li | Siddhartha Brahma | Huaiyu Zhu
Findings of the Association for Computational Linguistics: EMNLP 2020

Semantic role labeling (SRL) identifies predicate-argument structure(s) in a given sentence. Although different languages have different argument annotations, polyglot training, the idea of training one model on multiple languages, has previously been shown to outperform monolingual baselines, especially for low resource languages. In fact, even a simple combination of data has been shown to be effective with polyglot training by representing the distant vocabularies in a shared representation space. Meanwhile, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages (e.g. adjuncts have more or less similar semantic meaning across languages). To leverage such similarity in annotation space across languages, we propose a method called Cross-Lingual Argument Regularizer (CLAR). CLAR identifies such linguistic annotation similarity across languages and exploits this information to map the target language arguments using a transformation of the space on which source language arguments lie. By doing so, our experimental results show that CLAR consistently improves SRL performance on multiple languages over monolingual and polyglot baselines for low resource languages.

pdf bib
Answering Complex Questions by Combining Information from Curated and Extracted Knowledge Bases
Nikita Bhutani | Xinyi Zheng | Kun Qian | Yunyao Li | H. Jagadish
Proceedings of the First Workshop on Natural Language Interfaces

Knowledge-based question answering (KB_QA) has long focused on simple questions that can be answered from a single knowledge source, a manually curated or an automatically extracted KB. In this work, we look at answering complex questions which often require combining information from multiple sources. We present a novel KB-QA system, Multique, which can map a complex question to a complex query pattern using a sequence of simple queries each targeted at a specific KB. It finds simple queries using a neural-network based model capable of collective inference over textual relations in extracted KB and ontological relations in curated KB. Experiments show that our proposed system outperforms previous KB-QA systems on benchmark datasets, ComplexWebQuestions and WebQuestionsSP.

2019

pdf bib
Low-resource Deep Entity Resolution with Transfer and Active Learning
Jungo Kasai | Kun Qian | Sairam Gurajada | Yunyao Li | Lucian Popa
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Entity resolution (ER) is the task of identifying different representations of the same real-world entities across databases. It is a key step for knowledge base creation and text mining. Recent adaptation of deep learning methods for ER mitigates the need for dataset-specific feature engineering by constructing distributed representations of entity records. While these methods achieve state-of-the-art performance over benchmark data, they require large amounts of labeled data, which are typically unavailable in realistic ER applications. In this paper, we develop a deep learning-based method that targets low-resource settings for ER through a novel combination of transfer learning and active learning. We design an architecture that allows us to learn a transferable model from a high-resource setting to a low-resource one. To further adapt to the target dataset, we incorporate active learning that carefully selects a few informative examples to fine-tune the transferred model. Empirical evaluation demonstrates that our method achieves comparable, if not better, performance compared to state-of-the-art learning-based methods while using an order of magnitude fewer labels.

pdf bib
HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
Prithviraj Sen | Yunyao Li | Eser Kandogan | Yiwei Yang | Walter Lasecki
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

While the role of humans is increasingly recognized in machine learning community, representation of and interaction with models in current human-in-the-loop machine learning (HITL-ML) approaches are too low-level and far-removed from human’s conceptual models. We demonstrate HEIDL, a prototype HITL-ML system that exposes the machine-learned model through high-level, explainable linguistic expressions formed of predicates representing semantic structure of text. In HEIDL, human’s role is elevated from simply evaluating model predictions to interpreting and even updating the model logic directly by enabling interaction with rule predicates themselves. Raising the currency of interaction to such semantic levels calls for new interaction paradigms between humans and machines that result in improved productivity for text analytics model development process. Moreover, by involving humans in the process, the human-machine co-created models generalize better to unseen data as domain experts are able to instill their expertise by extrapolating from what has been learned by automated algorithms from few labelled data.

pdf bib
Towards Universal Semantic Representation
Huaiyu Zhu | Yunyao Li | Laura Chiticariu
Proceedings of the First International Workshop on Designing Meaning Representations

Natural language understanding at the semantic level and independent of language variations is of great practical value. Existing approaches such as semantic role labeling (SRL) and abstract meaning representation (AMR) still have features related to the peculiarities of the particular language. In this work we describe various challenges and possible solutions in designing a semantic representation that is universal across a variety of languages.

2018

pdf bib
DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm Based on Learned High Dimensional Encoding
Min Li | Marina Danilevsky | Sara Noeman | Yunyao Li
Proceedings of the 22nd Conference on Computational Natural Language Learning

Phonetic similarity algorithms identify words and phrases with similar pronunciation which are used in many natural language processing tasks. However, existing approaches are designed mainly for Indo-European languages and fail to capture the unique properties of Chinese pronunciation. In this paper, we propose a high dimensional encoded phonetic similarity algorithm for Chinese, DIMSIM. The encodings are learned from annotated data to separately map initial and final phonemes into n-dimensional coordinates. Pinyin phonetic similarities are then calculated by aggregating the similarities of initial, final and tone. DIMSIM demonstrates a 7.5X improvement on mean reciprocal rank over the state-of-the-art phonetic similarity approaches.

pdf bib
Exploiting Structure in Representation of Named Entities using Active Learning
Nikita Bhutani | Kun Qian | Yunyao Li | H. V. Jagadish | Mauricio Hernandez | Mitesh Vasa
Proceedings of the 27th International Conference on Computational Linguistics

Fundamental to several knowledge-centric applications is the need to identify named entities from their textual mentions. However, entities lack a unique representation and their mentions can differ greatly. These variations arise in complex ways that cannot be captured using textual similarity metrics. However, entities have underlying structures, typically shared by entities of the same entity type, that can help reason over their name variations. Discovering, learning and manipulating these structures typically requires high manual effort in the form of large amounts of labeled training data and handwritten transformation programs. In this work, we propose an active-learning based framework that drastically reduces the labeled data required to learn the structures of entities. We show that programs for mapping entity mentions to their structures can be automatically generated using human-comprehensible labels. Our experiments show that our framework consistently outperforms both handwritten programs and supervised learning models. We also demonstrate the utility of our framework in relation extraction and entity resolution tasks.

pdf bib
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
Srinivas Bangalore | Jennifer Chu-Carroll | Yunyao Li
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

pdf bib
SystemT: Declarative Text Understanding for Enterprise
Laura Chiticariu | Marina Danilevsky | Yunyao Li | Frederick Reiss | Huaiyu Zhu
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

The rise of enterprise applications over unstructured and semi-structured documents poses new challenges to text understanding systems across multiple dimensions. We present SystemT, a declarative text understanding system that addresses these challenges and has been deployed in a wide range of enterprise applications. We highlight the design considerations and decisions behind SystemT in addressing the needs of the enterprise setting. We also summarize the impact of SystemT on business and education.

2017

pdf bib
CROWD-IN-THE-LOOP: A Hybrid Approach for Annotating Semantic Roles
Chenguang Wang | Alan Akbik | Laura Chiticariu | Yunyao Li | Fei Xia | Anbang Xu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Crowdsourcing has proven to be an effective method for generating labeled data for a range of NLP tasks. However, multiple recent attempts of using crowdsourcing to generate gold-labeled training data for semantic role labeling (SRL) reported only modest results, indicating that SRL is perhaps too difficult a task to be effectively crowdsourced. In this paper, we postulate that while producing SRL annotation does require expert involvement in general, a large subset of SRL labeling tasks is in fact appropriate for the crowd. We present a novel workflow in which we employ a classifier to identify difficult annotation tasks and route each task either to experts or crowd workers according to their difficulties. Our experimental evaluation shows that the proposed approach reduces the workload for experts by over two-thirds, and thus significantly reduces the cost of producing SRL annotation at little loss in quality.

2016

pdf bib
Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages
Alan Akbik | Vishwajeet Kumar | Yunyao Li
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
K-SRL: Instance-based Learning for Semantic Role Labeling
Alan Akbik | Yunyao Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Semantic role labeling (SRL) is the task of identifying and labeling predicate-argument structures in sentences with semantic frame and role labels. A known challenge in SRL is the large number of low-frequency exceptions in training data, which are highly context-specific and difficult to generalize. To overcome this challenge, we propose the use of instance-based learning that performs no explicit generalization, but rather extrapolates predictions from the most similar instances in the training data. We present a variant of k-nearest neighbors (kNN) classification with composite features to identify nearest neighbors for SRL. We show that high-quality predictions can be derived from a very small number of similar instances. In a comparative evaluation we experimentally demonstrate that our instance-based learning approach significantly outperforms current state-of-the-art systems on both in-domain and out-of-domain data, reaching F1-scores of 89,28% and 79.91% respectively.

pdf bib
Multilingual Aliasing for Auto-Generating Proposition Banks
Alan Akbik | Xinyu Guan | Yunyao Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Semantic Role Labeling (SRL) is the task of identifying the predicate-argument structure in sentences with semantic frame and role labels. For the English language, the Proposition Bank provides both a lexicon of all possible semantic frames and large amounts of labeled training data. In order to expand SRL beyond English, previous work investigated automatic approaches based on parallel corpora to automatically generate Proposition Banks for new target languages (TLs). However, this approach heuristically produces the frame lexicon from word alignments, leading to a range of lexicon-level errors and inconsistencies. To address these issues, we propose to manually alias TL verbs to existing English frames. For instance, the German verb drehen may evoke several meanings, including “turn something” and “film something”. Accordingly, we alias the former to the frame TURN.01 and the latter to a group of frames that includes FILM.01 and SHOOT.03. We execute a large-scale manual aliasing effort for three target languages and apply the new lexicons to automatically generate large Proposition Banks for Chinese, French and German with manually curated frames. We present a detailed evaluation in which we find that our proposed approach significantly increases the quality and consistency of the generated Proposition Banks. We release these resources to the research community.

pdf bib
Multilingual Information Extraction with PolyglotIE
Alan Akbik | Laura Chiticariu | Marina Danilevsky | Yonas Kbrom | Yunyao Li | Huaiyu Zhu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present PolyglotIE, a web-based tool for developing extractors that perform Information Extraction (IE) over multilingual data. Our tool has two core features: First, it allows users to develop extractors against a unified abstraction that is shared across a large set of natural languages. This means that an extractor needs only be created once for one language, but will then run on multilingual data without any additional effort or language-specific knowledge on part of the user. Second, it embeds this abstraction as a set of views within a declarative IE system, allowing users to quickly create extractors using a mature IE query language. We present PolyglotIE as a hands-on demo in which users can experiment with creating extractors, execute them on multilingual text and inspect extraction results. Using the UI, we discuss the challenges and potential of using unified, crosslingual semantic abstractions as basis for downstream applications. We demonstrate multilingual IE for 9 languages from 4 different language groups: English, German, French, Spanish, Japanese, Chinese, Arabic, Russian and Hindi.

pdf bib
POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels
Alan Akbik | Yunyao Li
Proceedings of ACL-2016 System Demonstrations

2015

bib
Transparent Machine Learning for Information Extraction: State-of-the-Art and the Future
Laura Chiticariu | Yunyao Li | Frederick Reiss
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The rise of Big Data analytics over unstructured text has led to renewed interest in information extraction (IE). These applications need effective IE as a first step towards solving end-to-end real world problems (e.g. biology, medicine, finance, media and entertainment, etc). Much recent NLP research has focused on addressing specific IE problems using a pipeline of multiple machine learning techniques. This approach requires an analyst with the expertise to answer questions such as: “What ML techniques should I combine to solve this problem?”; “What features will be useful for the composite pipeline?”; and “Why is my model giving the wrong answer on this document?”. The need for this expertise creates problems in real world applications. It is very difficult in practice to find an analyst who both understands the real world problem and has deep knowledge of applied machine learning. As a result, the real impact by current IE research does not match up to the abundant opportunities available.In this tutorial, we introduce the concept of transparent machine learning. A transparent ML technique is one that:- produces models that a typical real world use can read and understand;- uses algorithms that a typical real world user can understand; and- allows a real world user to adapt models to new domains.The tutorial is aimed at IE researchers in both the academic and industry communities who are interested in developing and applying transparent ML.

pdf bib
An In-depth Analysis of the Effect of Text Normalization in Social Media
Tyler Baldwin | Yunyao Li
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling
Alan Akbik | Laura Chiticariu | Marina Danilevsky | Yunyao Li | Shivakumar Vaithyanathan | Huaiyu Zhu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2013

pdf bib
Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
Laura Chiticariu | Yunyao Li | Frederick R. Reiss
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Adaptive Parser-Centric Text Normalization
Congle Zhang | Tyler Baldwin | Howard Ho | Benny Kimelfeld | Yunyao Li
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Automatic Term Ambiguity Detection
Tyler Baldwin | Yunyao Li | Bogdan Alexe | Ioana R. Stanoi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
WizIE: A Best Practices Guided Development Environment for Information Extraction
Yunyao Li | Laura Chiticariu | Huahai Yang | Frederick Reiss | Arnaldo Carreno-fuentes
Proceedings of the ACL 2012 System Demonstrations

2011

pdf bib
A Graph Approach to Spelling Correction in Domain-Centric Search
Zhuowei Bao | Benny Kimelfeld | Yunyao Li
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
SystemT: A Declarative Information Extraction System
Yunyao Li | Frederick Reiss | Laura Chiticariu
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

pdf bib
SystemT: An Algebraic Approach to Declarative Information Extraction
Laura Chiticariu | Rajasekar Krishnamurthy | Yunyao Li | Sriram Raghavan | Frederick Reiss | Shivakumar Vaithyanathan
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
Laura Chiticariu | Rajasekar Krishnamurthy | Yunyao Li | Frederick Reiss | Shivakumar Vaithyanathan
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2008

pdf bib
Regular Expression Learning for Information Extraction
Yunyao Li | Rajasekar Krishnamurthy | Sriram Raghavan | Shivakumar Vaithyanathan | H. V. Jagadish
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Search
Co-authors