Peiran Yao


2023

NLP Workbench: Efficient and Extensible Integration of State-of-the-art Text Mining Tools
Peiran Yao | Matej Kosmajac | Abeer Waheed | Kostyantyn Guzhva | Natalie Hervieux | Denilson Barbosa
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

NLP Workbench is a web-based platform for text mining that allows non-expert users to obtain semantic understanding of large-scale corpora using state-of-the-art text mining models. The platform is built upon the latest pre-trained models and open source systems from academia that provide semantic analysis functionalities, including but not limited to entity linking, sentiment analysis, semantic parsing, and relation extraction. Its extensible design enables researchers and developers to smoothly replace an existing model or integrate a new one. To improve efficiency, we employ a microservice architecture that facilitates allocation of acceleration hardware and parallelization of computation. This paper presents the architecture of NLP Workbench and discusses the challenges we faced in designing it. We also discuss diverse use cases of NLP Workbench and the benefits of using it over other approaches. The platform is under active development, with its source code released under the MIT license. A website and a short video demonstrating our platform are also available.

2022

WordTies: Measuring Word Associations in Language Models via Constrained Sampling
Peiran Yao | Tobias Renwick | Denilson Barbosa
Findings of the Association for Computational Linguistics: EMNLP 2022

Word associations are widely used in psychology to provide insights into how humans perceive and understand concepts. Comparing word associations in language models (LMs) to those generated by human subjects can serve as a proxy to uncover embedded lexical and commonsense knowledge in language models. While much helpful work has been done applying direct metrics, such as cosine similarity, to help understand latent spaces, these metrics are symmetric, whereas human word associativity is asymmetric. We propose WordTies, an algorithm based on constrained sampling from LMs, which allows an asymmetric measurement of associated words, given a cue word as the input. Compared to existing methods, word associations found by this method share more overlap with associations provided by humans, and exhibit the asymmetric property of human associations. To examine possible reasons behind associations, we analyze the knowledge and reasoning behind the word pairings as they are linked to lexical and commonsense knowledge graphs. When the knowledge about the nature of the word pairings is combined with a probability that the LM has learned that information, we have a new way to examine what information is captured in LMs.