Shaobin Xu


Recovering Lexically and Semantically Reused Texts
Ansel MacLaughlin | Shaobin Xu | David A. Smith
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Writers often repurpose material from existing texts when composing new documents. Because most documents have more than one source, we cannot trace these connections using only models of document-level similarity. Instead, this paper considers methods for local text reuse detection (LTRD), detecting localized regions of lexically or semantically similar text embedded in otherwise unrelated material. In extensive experiments, we study the relative performance of four classes of neural and bag-of-words models on three LTRD tasks: detecting plagiarism, modeling journalists’ use of press releases, and identifying scientists’ citation of earlier papers. We conduct evaluations on three existing datasets and a new, publicly available citation localization dataset. Our findings shed light on a number of previously unexplored questions in the study of LTRD, including the importance of incorporating document-level context for predictions, the applicability of off-the-shelf neural models pretrained on “general” semantic textual similarity tasks such as paraphrase detection, and the trade-offs between more efficient bag-of-words and feature-based neural models and slower pairwise neural models.


A Multi-Context Character Prediction Model for a Brain-Computer Interface
Shiran Dudy | Shaobin Xu | Steven Bedrick | David Smith
Proceedings of the Second Workshop on Subword/Character LEvel Models

Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeling challenges distinct from other character-entry methods. In particular, the acquired EEG (electroencephalogram) signal is noisier, which, in turn, makes the user intent harder to decipher. To adapt to this condition, we propose to maintain an ambiguous history for every time step, and to employ, apart from the character language model, word information to produce a more robust prediction system. We present preliminary results that compare this proposed Online-Context Language Model (OCLM) to current algorithms that are used in this type of setting. Evaluation on both perplexity and predictive accuracy demonstrates promising results when dealing with ambiguous histories in order to provide the front end with a distribution over the next character the user might type.


Detecting and Evaluating Local Text Reuse in Social Networks
Shaobin Xu | David Smith | Abigail Mullen | Ryan Cordell
Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media