Lewis Mitchell


2024

Personality Profiling: How informative are social media profiles in predicting personal information?
Joshua Watt | Lewis Mitchell | Jonathan Tuke
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

Personality profiling has been utilised by companies for targeted advertising, political campaigns and public health campaigns. However, the accuracy and versatility of such models remain relatively unknown. Here we explore the extent to which people’s online digital footprints can be used to profile their Myers-Briggs personality type. We analyse and compare four models: logistic regression, naive Bayes, support vector machines (SVMs) and random forests. We find that an SVM model achieves the best accuracy of 20.95% for predicting a complete personality type. However, logistic regression models perform only marginally worse and are significantly faster to train and to make predictions. Moreover, we develop a statistical framework for assessing the importance of different sets of features in our models. We find some features to be more informative than others in the Intuitive/Sensory (p = 0.032) and Thinking/Feeling (p = 0.019) models. Many labelled datasets of personal characteristics on social media, including our own, present substantial class imbalance. We therefore highlight the need for careful consideration when reporting model performance on such datasets, and compare a number of methods for addressing class-imbalance problems.
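To make the modelling comparison concrete, here is a minimal sketch of the kind of comparison described, assuming a scikit-learn pipeline with TF-IDF features and balanced class weights as one simple remedy for class imbalance; the toy texts, labels, and hyperparameters are placeholders, not the authors' actual data or pipeline.

```python
# Sketch only: comparing logistic regression and a linear SVM on one
# Myers-Briggs dichotomy. Texts and labels below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "planned my week in a spreadsheet again",
    "just going to see where the day takes me",
    "logic first, feelings later",
    "I decide with my heart every time",
    "routines keep me sane",
    "spontaneous road trip, no map",
]
labels = [1, 0, 1, 0, 1, 0]  # placeholder binary labels for one MBTI axis

for name, clf in [
    ("logistic regression", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("linear SVM", LinearSVC(class_weight="balanced")),
]:
    # TF-IDF features feed directly into each classifier.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```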

Simple models are all you need: Ensembling stylometric, part-of-speech, and information-theoretic models for the ALTA 2024 Shared Task
Joel Thomas | Gia Bao Hoang | Lewis Mitchell
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association

The ALTA 2024 shared task concerned automated detection of AI-generated text. Large language models (LLMs) were used to generate hybrid documents, where individual sentences were authored by either humans or a state-of-the-art LLM. Rather than relying on similarly computationally expensive tools such as transformer-based methods, we approached this task using only an ensemble of lightweight “traditional” methods that could be trained on a standard desktop machine. Our approach used models based on word counts, stylometric features, readability metrics, part-of-speech tagging, and an information-theoretic entropy estimator to predict authorship. These models, combined with a simple weighting scheme, performed well on a held-out test set, achieving an accuracy of 0.855 and a kappa score of 0.695. Our results show that relatively simple, interpretable models can perform effectively at tasks like authorship prediction, even on short texts, which is important for the democratisation of AI as well as for future applications in edge computing.
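As a rough illustration of the ensembling idea, the sketch below combines a word-count model and a crude stylometric model with a fixed weighting scheme; the features, weights, and placeholder data are assumptions, not the shared-task system itself.

```python
# Sketch only: a fixed-weight ensemble of two lightweight models.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def stylometric_features(texts):
    # Crude stylometric features: mean word length, word count,
    # punctuation count. Real systems would use richer feature sets.
    feats = []
    for t in texts:
        words = t.split()
        feats.append([
            float(np.mean([len(w) for w in words])) if words else 0.0,
            float(len(words)),
            float(sum(c in ".,;:!?" for c in t)),
        ])
    return np.array(feats)

# Placeholder data: 0 = human-authored sentence, 1 = LLM-generated.
texts = [
    "short human note, dashed off quickly.",
    "The proposed approach demonstrates consistent improvements across settings.",
] * 3
labels = [0, 1] * 3

vec = CountVectorizer()
X_counts = vec.fit_transform(texts)
count_model = LogisticRegression(max_iter=1000).fit(X_counts, labels)

X_style = stylometric_features(texts)
style_model = LogisticRegression(max_iter=1000).fit(X_style, labels)

# Fixed-weight average of the two models' probabilities; the weights
# here are illustrative, not tuned on held-out data.
w_counts, w_style = 0.6, 0.4
p = (w_counts * count_model.predict_proba(X_counts)[:, 1]
     + w_style * style_model.predict_proba(X_style)[:, 1])
print("ensemble predictions:", (p > 0.5).astype(int))
```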

2020

Life still goes on: Analysing Australian WW1 Diaries through Distant Reading
Ashley Dennis-Henderson | Matthew Roughan | Lewis Mitchell | Jonathan Tuke
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

An increasing amount of historic data is now available in digital (text) formats. This gives quantitative researchers an opportunity to use distant reading techniques, as opposed to traditional close reading, to analyse larger quantities of historic data. Distant reading allows researchers to view overall patterns within the data and reduces researcher bias. One such dataset, recently transcribed, is a collection of over 500 Australian World War I (WW1) diaries held by the State Library of New South Wales. Here we apply distant reading techniques to this corpus to understand what soldiers wrote about and how they felt over the course of the war. Extracting dates accurately is important, as it allows us to perform our analysis over time; however, it is very challenging due to the variety of date formats and abbreviations diarists use. Once dates are extracted, topic modelling and sentiment analysis can be applied to reveal trends, for instance, that despite the horrors of war, Australians in WW1 primarily wrote about their everyday routines and experiences. Our results detail some of the challenges likely to be encountered by quantitative researchers intending to analyse historical texts, and provide some approaches to these issues.
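To illustrate the date-extraction challenge, here is a minimal sketch assuming a small set of regular expressions for common diary date formats; it is not the paper's pipeline, and a real corpus would need many more patterns plus handling of misspellings and partial dates.

```python
# Sketch only: pulling dates in varied formats out of diary-style lines
# so entries can be placed on a timeline for later analysis.
import re
from datetime import datetime

# Two of the many formats a diarist might use (assumed examples).
PATTERNS = [
    (r"(\d{1,2})[./](\d{1,2})[./](\d{2,4})", "%d %m %Y"),
    (r"(\d{1,2})(?:st|nd|rd|th)?\s+"
     r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\.?\s+(\d{4})",
     "%d %b %Y"),
]

def extract_date(line):
    # Try each pattern in turn; return the first parseable date.
    for pattern, fmt in PATTERNS:
        m = re.search(pattern, line, flags=re.IGNORECASE)
        if m:
            try:
                return datetime.strptime(" ".join(m.groups()), fmt)
            except ValueError:
                continue
    return None

print(extract_date("Mon. 25th Apr. 1915 - landed at dawn"))
print(extract_date("3/5/1915 Quiet day in the trenches"))
```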

2019

Podlab at SemEval-2019 Task 3: The Importance of Being Shallow
Andrew Nguyen | Tobin South | Nigel Bean | Jonathan Tuke | Lewis Mitchell
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our linear SVM system for emotion classification from conversational dialogue, entered in SemEval-2019 Task 3. We used off-the-shelf tools coupled with feature engineering and parameter tuning to create a simple, interpretable, yet high-performing classification model. Our system achieves a micro F1 score of 0.7357, which is 92% of the top score for the competition, demonstrating that “shallow” classification approaches can perform well when coupled with detailed feature selection and statistical analysis.
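As a hedged illustration of a “shallow” system of this kind, the sketch below trains a linear SVM on TF-IDF n-grams over flattened three-turn dialogues; the data, separator token, and parameters are placeholders, not the competition configuration.

```python
# Sketch only: a linear SVM emotion classifier over dialogue turns.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Three-turn conversations flattened to single strings (placeholder data).
dialogues = [
    "why would you do that <sep> i dont know <sep> im so angry right now",
    "got the job!! <sep> congrats <sep> im thrilled",
    "she left <sep> oh no <sep> feeling really down",
    "ok <sep> fine <sep> whatever",
]
labels = ["angry", "happy", "sad", "others"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams and bigrams
    LinearSVC(C=1.0),                     # C would be tuned in practice
)
model.fit(dialogues, labels)
print(model.predict(["lost my keys again <sep> ugh <sep> so frustrating"]))
```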

A framework for streamlined statistical prediction using topic models
Vanessa Glenny | Jonathan Tuke | Nigel Bean | Lewis Mitchell
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In the Humanities and Social Sciences, there is increasing interest in approaches to information extraction, prediction, intelligent linkage, and dimension reduction applicable to large text corpora. Because approaches in these fields are grounded in traditional statistical techniques, frameworks are needed whereby advanced NLP techniques such as topic modelling can be incorporated within classical methodologies. This paper provides a classical, supervised, statistical learning framework for prediction from text, using topic models as a data reduction method and the topics themselves as predictors, alongside typical statistical tools for predictive modelling. We apply this framework in a Social Sciences context (applied animal behaviour) and a Humanities context (narrative analysis). The results show that topic regression models perform comparably to their much less efficient equivalents that use individual words as predictors.
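The following sketch shows the general topic-regression idea, assuming scikit-learn's LDA implementation: fit a topic model, then use each document's topic proportions as the predictors in an ordinary statistical model. The data and dimensions are placeholder assumptions, not the paper's case studies.

```python
# Sketch only: topics, not individual words, as regression predictors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = [
    "the dog chased the ball across the yard",
    "the cat slept all afternoon in the sun",
    "stocks fell sharply as markets reacted",
    "the bank raised interest rates again",
] * 3
labels = [0, 0, 1, 1] * 3  # placeholder outcome to predict

X_counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X_counts)  # document-topic proportions

# A large vocabulary is reduced to a handful of interpretable covariates.
clf = LogisticRegression().fit(theta, labels)
print(clf.score(theta, labels))
```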