Patrick Juola


A Call for Clarity in Contemporary Authorship Attribution Evaluation
Allen Riddell | Haining Wang | Patrick Juola
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Recent research has documented that results reported in frequently cited authorship attribution papers are difficult to reproduce. Inaccessible code and data are often proposed as factors that block successful reproductions. Even when original materials are available, problems remain which prevent researchers from comparing the effectiveness of different methods. To address the two remaining problems, the lack of fixed test sets and the use of inappropriately homogeneous corpora, our paper contributes materials for five closed-set authorship identification experiments. The five experiments feature texts from 106 distinct authors and involve a range of contemporary non-fiction American English prose. These experiments provide the foundation for comparable and reproducible authorship attribution research involving contemporary writing.

Mode Effects’ Challenge to Authorship Attribution
Haining Wang | Allen Riddell | Patrick Juola
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The success of authorship attribution relies on the presence of linguistic features specific to individual authors. There is, however, limited research assessing to what extent authorial style remains constant when individuals switch from one writing modality to another. We measure the effect of writing mode on writing style in the context of authorship attribution research using a corpus of documents composed online (in a web browser) and documents composed offline using a traditional word processor. The results confirm the existence of a “mode effect” on authorial style. Online writing differs systematically from offline writing in terms of sentence length, word use, readability, and certain part-of-speech ratios. These findings have implications for research design and feature engineering in authorship attribution studies.


Team JUST at the MADAR Shared Task on Arabic Fine-Grained Dialect Identification
Bashar Talafha | Ali Fadel | Mahmoud Al-Ayyoub | Yaser Jararweh | Mohammad AL-Smadi | Patrick Juola
Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we describe our team’s effort on the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. The task requires building a system capable of differentiating between 25 different Arabic dialects in addition to Modern Standard Arabic (MSA). Our approach is simple. After preprocessing the data, we use Data Augmentation (DA) to enlarge the training data six times. We then build a language model, extract word-level and character-level n-gram TF-IDF features, and feed them into a Multinomial Naive Bayes (MNB) classifier. Despite its simplicity, the resulting model performs well, producing the 4th highest F-measure and region-level accuracy and the 5th highest precision, recall, city-level accuracy, and country-level accuracy among the participating teams.


Detecting Stylistic Deception
Patrick Juola
Proceedings of the Workshop on Computational Approaches to Deception Detection

Authorship Attribution and Optical Character Recognition Errors
Patrick Juola | John I. Noecker Jr. | Michael V. Ryan
Traitement Automatique des Langues, Volume 53, Numéro 3 : Du bruit dans le signal : gestion des erreurs en traitement automatique des langues [Managing noise in the signal: Error handling in natural language processing]


Cross-Entropy and Linguistic Typology
Patrick Juola
New Methods in Language Processing and Computational Natural Language Learning