Jian Zhu


2021

Idiosyncratic but not Arbitrary: Learning Idiolects in Online Registers Reveals Distinctive yet Consistent Individual Styles
Jian Zhu | David Jurgens
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

An individual’s variation in writing style is often a function of both social and personal attributes. While structured social variation has been extensively studied, e.g., gender-based variation, far less is known about how to characterize individual styles due to their idiosyncratic nature. We introduce a new approach to studying idiolects through a massive cross-author comparison to identify and encode stylistic features. The neural model achieves strong performance at authorship identification on short texts, and an analogy-based probing task shows that the learned representations exhibit surprising regularities that encode qualitative and quantitative shifts of idiolectal styles. Through text perturbation, we quantify the relative contributions of different linguistic elements to idiolectal variation. Furthermore, we provide a description of idiolects by measuring inter- and intra-author variation, showing that variation in idiolects is often distinctive yet consistent.

The structure of online social networks modulates the rate of lexical change
Jian Zhu | David Jurgens
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

New words are regularly introduced to communities, yet not all of these words persist in a community’s lexicon. Among the many factors contributing to lexical change, we focus on the understudied effect of social networks. We conduct a large-scale analysis of over 80k neologisms in 4420 online communities across a decade. Using Poisson regression and survival analysis, our study demonstrates that the community’s network structure plays a significant role in lexical change. Apart from overall size, properties including dense connections, the lack of local clusters, and more external contacts promote lexical innovation and retention. Unlike offline communities, these topic-based communities do not experience strong lexical leveling despite increased contact but accommodate more niche words. Our work provides support for the sociolinguistic hypothesis that lexical change is partially shaped by the structure of the underlying network but also uncovers findings specific to online communities.

2019

UM-IU@LING at SemEval-2019 Task 6: Identifying Offensive Tweets Using BERT and SVMs
Jian Zhu | Zuoyu Tian | Sandra Kübler
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes the UM-IU@LING system for SemEval 2019 Task 6: OffensEval. We take a mixed approach to identify and categorize hate speech in social media. In subtask A, we fine-tuned a BERT-based classifier to detect abusive content in tweets, achieving a macro F1 score of 0.8136 on the test data and ranking 3rd out of 103 submissions. In subtasks B and C, we used a linear SVM with selected character n-gram features. For subtask C, our system could identify the target of abuse with a macro F1 score of 0.5243, ranking 27th out of 65 submissions.

2011

A New Unsupervised Approach to Word Segmentation
Hanshi Wang | Jian Zhu | Shiping Tang | Xiaozhong Fan
Computational Linguistics, Volume 37, Issue 3 - September 2011