On Text-based Personality Computing: Challenges and Future Directions

Text-based personality computing (TPC) has attracted considerable research interest in NLP. In this paper, we describe 15 challenges that we consider deserving of the research community's attention. These challenges are organized by the following topics: personality taxonomies, measurement quality, datasets, performance evaluation, modelling choices, as well as ethics and fairness. When addressing each challenge, we not only combine perspectives from both NLP and social sciences, but also offer concrete suggestions. We hope to inspire more valid and reliable TPC research.


Introduction
According to the APA Dictionary of Psychology (APA, 2022), personality refers to personality traits, which are "relatively stable, consistent, and enduring internal characteristics inferred from a pattern of behaviours, attitudes, feelings and habits in individuals". Knowledge about personality can be useful in many societal and scientific applications. For instance, it can help individuals choose learning styles (Komarraju et al., 2011) and occupations (Kern et al., 2019) suited for their personality; it can help clinical psychologists to better understand psychological disorders (Khan et al., 2005) and to deliver personalised treatment plans for mental health patients (Bagby et al., 2016); changes in personality can even help with early diagnosis of Alzheimer's (Robins Wahlin and Byrne, 2011) and Parkinson's disease (Santangelo et al., 2017).
Traditionally, personality assessment is based on self- and other-report questionnaires, which is labour- and time-intensive. Recently, however, automatic personality assessment based on user-generated data (e.g., texts, images, videos) and machine learning algorithms has become a popular alternative. This is known as personality computing (PC) (Phan and Rauthmann, 2021), among many other names.¹
In this paper, we focus on evaluating PC research in NLP, where personality is primarily inferred from text, such as tweets and Reddit posts (Hosseinia et al., 2021), conversations (Mairesse and Walker, 2006) and speech transcriptions (Das and Das, 2017). We refer to such research as text-based personality computing (TPC).
In TPC, on the one hand, we see an increasing number of datasets curated, complex deep-learning algorithms adopted, and (sometimes) high prediction scores achieved. On the other hand, we see relatively little discussion about open challenges and future research directions. For example, the presence of measurement error in questionnaire-based personality scores remains an un(der)addressed issue. Given that such scores are often used as the gold standard for training and validating TPC algorithms, we find it important to discuss related implications and remedies. Another relevant issue concerns how to reduce the risks of TPC research.
Therefore, in this paper, we reflect on current TPC research practices, identify open challenges, and suggest better ways forward. To spot such challenges, we conduct a literature search in the scope of the ACL Anthology.² While there are TPC papers published in other venues, we consider our selection from ACL Anthology a good representation of TPC research in NLP. Appendix A describes our search strategy and results in detail. In total, we find and review 60 empirical TPC papers, based on which we identify 15 challenges. They are organized by the following topics: personality taxonomies (§3), measurement quality (§4), datasets (§5), performance evaluation (§6), modelling choices (§7), as well as ethics and fairness (§8). We discuss each challenge and give concrete suggestions where we also draw on broader NLP and social science literature. Furthermore, during our literature review, we identify 18 TPC datasets that can be (re)used by other researchers. We summarize them in Appendix B.
Note that our paper focuses on identifying and discussing TPC-related research challenges instead of providing a comprehensive overview of past TPC studies. For the latter, we refer you to the survey papers by Stajner and Yenikent (2020); Derakhshi et al. (2021); Mushtaq and Kumar (2023).

Current TPC in a Nutshell
TPC concerns computing personality information from texts. This can be either a regression or a classification task, depending on whether the personality measurements are continuous or discrete. Supervised learning has been the predominant approach, relying on text datasets labelled with personality traits via either self-report or crowdsourced annotations. Among many others, log-linear models (Volkova et al., 2015), random forests (Levitan et al., 2016), GloVe embeddings with Gaussian processes (Arnoux et al., 2017), recurrent neural networks (Liu et al., 2017), convolutional neural networks (Majumder et al., 2017), support vector machines (Lan and Paraboni, 2018), ridge regression (He and de Melo, 2021), graphical networks (Yang et al., 2021b) and transformers (Kreuter et al., 2022) have been used. Another popular, psycho-linguistically motivated approach is the use of dictionaries/lexicons (e.g., Oberlander and Nowson, 2006; Sinha et al., 2015; Das and Das, 2017). These are typically lists of curated terms that have pre-assigned weights associated with different personality traits. This allows researchers to compute personality scores from texts by simply matching a dictionary to the texts and aggregating the weights of matched words and phrases.
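The dictionary-based approach can be illustrated with a minimal sketch. The lexicon below is a made-up toy example (the terms, traits and weights are our assumptions, not taken from any published lexicon), and normalising by token count is just one possible aggregation choice.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: term -> {trait: weight}. Real lexicons are far
# larger and their weights are derived empirically; these values are made up.
TOY_LEXICON = {
    "party": {"extraversion": 0.8},
    "friends": {"extraversion": 0.5, "agreeableness": 0.3},
    "worried": {"neuroticism": 0.9},
    "plan": {"conscientiousness": 0.6},
}


def score_text(text, lexicon):
    """Sum the weights of matched lexicon terms per trait, divided by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    totals = defaultdict(float)
    for token in tokens:
        for trait, weight in lexicon.get(token, {}).items():
            totals[trait] += weight
    n_tokens = max(len(tokens), 1)  # avoid division by zero for empty texts
    return {trait: total / n_tokens for trait, total in totals.items()}


scores = score_text("I plan to party with friends", TOY_LEXICON)
```

Traits without any matched term are simply absent from the result; whether to report them as zero or as missing is a design choice that depends on the lexicon.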

Related Work
We find three relevant TPC survey papers. The first one, Stajner and Yenikent (2020), not only summarizes previous research in TPC but also discusses three interesting issues: the difference between MBTI and the Big-5 (two most popular personality taxonomies, see §3), the difficulty in predicting MBTI from Twitter data, and ethical concerns about TPC. We find all three discussions necessary and helpful. Our paper not only provides our own (differing) perspectives on these issues, but also raises and discusses many others. The other two survey papers, Mushtaq and Kumar (2023) and Derakhshi et al. (2021), after reviewing prior TPC research, very briefly suggest some open challenges in TPC research. While the open challenges mentioned in these two papers (e.g., better data quality, data sharing, and ethics) partially overlap with our list of challenges, we adopt a more evidence-based approach to listing challenges, engage in much more thorough discussion, and offer concrete solutions. Furthermore, our list of challenges goes beyond those in these two papers (e.g., measurement error reduction, performance expectation, joint personality modelling).
We also find three additional papers (Bleidorn and Hopwood, 2019; Stachl et al., 2020; Phan and Rauthmann, 2021) concerning issues in general PC research. Our paper, in contrast, focuses on challenges specific to TPC research.
Lastly, TPC is closely related to other fields, such as automatic emotion recognition (Barriere et al., 2022), opinion mining (Hosseinia et al., 2021) and mental health prediction (Guntuku et al., 2018). These fields are all concerned with the computation of social science constructs (see §4).

Personality Taxonomies
A TPC research project typically starts by choosing a personality taxonomy, which is a descriptive framework for personality traits (John and Srivastava, 1999). Among the 60 papers we review, we find two prominent taxonomies: the Myers-Briggs Type Indicator (MBTI; 14 papers) and the Big-5 (45 papers).³ Fifty of these papers adopt either the MBTI or the Big-5 but not both. This invites the first challenge:

Challenge 1 (C1): MBTI vs. Big-5
The MBTI originated from the theoretical work of Jung (1971) and was further developed by Briggs Myers and Myers (1995). It proposes four personality dimensions that characterize people's differences in perception and judgement processes: Extraversion/Introversion (E/I), Sensing/iNtuition (S/N), Thinking/Feeling (T/F), and Judgement/Perception (J/P). Individuals are classified into one of the two possible categories on each dimension, yielding a four-letter type (e.g., INFP or ESTJ).
In contrast, the Big-5 was developed based on the lexical hypothesis: all important personality traits must have been encoded in natural language and therefore, analysis of personality-related terms should reveal the true personality taxonomy (Goldberg, 1990). Independent research groups (Cattell, 1946; Goldberg, 1982; Costa and McCrae, 1992; Tupes and Christal, 1992) investigated this hypothesis. They identified numerous English terms that might describe inter-individual differences (e.g., warm, curious), asked participants to rate how well these terms describe them on numerical scales, and factor-analyzed the responses, which revealed five consistent dimensions of personality: Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N).⁴ Furthermore, each dimension includes six finer-grained sub-traits called facets (e.g., agreeableness includes trust, straightforwardness, altruism, compliance, modesty and tender-mindedness). Big-5 is the most widely accepted and researched taxonomy of personality traits in psychology (as opposed to the popularity of MBTI in non-academic settings like job interviews) (Phan and Rauthmann, 2021).
We recommend the Big-5 over MBTI for the following reasons: First, the Big-5 is a more realistic and accurate personality taxonomy. It scores individuals along a continuous spectrum, which describes inter-individual differences more accurately and preserves more information (as opposed to MBTI's dichotomous approach). Also, the Big-5 includes facets, which allows for finer-grained analysis of personality. Facets are also more predictive of life outcomes (compared to dimensions; Mershon and Gorsuch 1988; Paunonen and Ashton 2001). A potential new TPC research direction can be to predict facets in addition to dimensions.
Second, Big-5 has a much stronger empirical basis than MBTI. Namely, it is grounded in large-scale quantitative analysis of natural language and survey data. Also, Big-5 questionnaires have undergone much more extensive development and validation processes than MBTI's. Consequently, many validated Big-5 questionnaires exist, which vary in length (15-240 items), the inclusion of facet measures, as well as the target populations (e.g., nationalities, professions, languages, and age groups).⁵ This enables researchers to choose the questionnaire most appropriate for a population of interest, available resources (e.g., can you afford a longer questionnaire?) and research interests (e.g., are you interested in facets?). In comparison, MBTI is purely theory-driven, lacks empirical support and has officially only four questionnaires that have not been thoroughly tested (Pittenger, 1993; Nowack, 1996; Grant, 2013).⁶ Therefore, compared to MBTI, Big-5 is a much more credible and flexible choice for research purposes.
Third, the Big-5 is rooted in natural language (i.e., the lexical hypothesis), suggesting that Big-5-related cues may be more present in text data than MBTI-related cues. This conjecture is supported by Stajner and Yenikent (2021), who find either insufficient or mixed signals for MBTI dimensions in tweets and short essays.
Nevertheless, despite the many advantages of Big-5 over MBTI, we acknowledge that studying MBTI can still be useful given its popularity in non-academic settings (Lloyd, 2012). Furthermore, it is also important to mention that Big-5 is not without criticisms and other personality taxonomies may be preferred, which invites the next challenge.

C2: Beyond the Big-5
While earlier lexical studies of the English language revealed five core dimensions of personality, more recent analysis of both English and non-English languages (e.g., Italian, Dutch, German, Korean), based on larger sets of adjectives, has suggested the existence of a sixth core dimension: Honesty/Humility, giving rise to a new "Big-6" taxonomy (a.k.a. HEXACO) (Ashton and Lee, 2007). Therefore, we encourage TPC researchers to explore alternatives to the Big-5, in timely accordance with developments in personality psychology. For a comprehensive overview of (other) personality taxonomies, see Cervone and Pervin (2022).

Measurement Quality
Personality traits are latent, theoretical variables (a.k.a. social science constructs), which cannot be directly or objectively observed. Other examples of such constructs are emotions, prejudice and political orientation. Thus, personality traits are inherently difficult to measure. We can only approximate the true underlying personality trait scores from often noisy observations collected using personality instruments such as self- and other-report questionnaires. Due to the uncertainty in this approximation process, measurements for personality traits likely contain non-negligible error. This error is called measurement error, defined as the difference between an observed measurement and its true value (Dodge et al., 2003).
[Footnote fragment] ... 1989), the Revised NEO-PI and the NEO Five Factor Inventory (Costa and McCrae, 1992), the Big-5 Inventory (BFI) (Benet-Martinez and John, 1998), BFI-2 (Soto and John, 2017), the Short 15-item Big-5 Inventory (Lang et al., 2011).
The presence of error in personality measurements can have negative consequences for TPC research. For instance, consider the case of having substantial measurement error in questionnaire-based personality scores. When a TPC model treats these measurements as the gold standard for training and validation, the measurement error will likely propagate to the predictions, rendering the model less helpful (or even harmful), especially for diagnostic or clinical purposes. The study by Akrami et al. (2019) lends support to this hypothesis: the authors find that TPC models perform better on small datasets with low measurement error than on large datasets with high measurement error. Therefore, it is important that TPC researchers are aware of the presence and influence of measurement error and can deal with it. Unfortunately, none of the 60 TPC studies that we survey touches upon this issue, suggesting that it is underexplored in TPC research. This observation inspires the next four challenges (3-6).

C3: Choose high-quality instruments
Collecting high-quality personality measurements begins with using high-quality instruments, be they questionnaires or models.⁷ By high quality, we specifically mean high measurement quality. To determine the measurement quality of an instrument, it is important to understand the two components of measurement error: random error and systematic error, and how they relate to two quality criteria: reliability and validity.
Random error refers to random variations in measurements across comparable conditions, due to factors that cannot be controlled (Trochim et al., 2015). For instance, when someone completes a personality questionnaire twice, the responses may differ between the two attempts because the person misread a question in the second attempt. Random error is always present and unpredictable; it can be reduced but not eliminated. In NLP systems, random error can be due to random data splitting (Gorman and Bedrick, 2019), stochastic algorithms (Zhou et al., 2020), and certain random processes in data annotations, such as sampling of annotators and random annotation mistakes (Uma et al., 2021).
In contrast, systematic error occurs due to factors inherent to an instrument (Trochim et al., 2015). For instance, a poorly constructed Big-5 questionnaire may contain an item that is used to measure neuroticism while in fact it does not. Consequently, anyone taking this questionnaire will get a biased estimate of their neuroticism. Thus, systematic error is foreseeable and often constant or proportional to the true value. As long as its cause is identified, systematic error can be removed. In NLP systems, systematic error can occur when spurious correlations (or "short-cuts", instead of causal relationships) are learned (Wang et al., 2022).
Reliability and validity are the two criteria for the measurement quality of an instrument. The former concerns the extent to which an instrument can obtain the same measurement under comparable conditions, while the latter concerns the extent to which an instrument captures what it is supposed to (Trochim et al., 2015). Random error reduces an instrument's reliability, while systematic error undermines its validity. Therefore, a high-quality instrument is a reliable and valid one. We describe below how we can find out about the validity and reliability of a personality instrument.
For personality questionnaires, especially of Big-5, it is relatively easy to determine their measurement quality because many corresponding validity and reliability studies exist (see van der Linden et al. 2010 for an overview). Model-based personality instruments, however, rarely undergo comprehensive analysis of measurement quality. Typically, studies report the predictive performance of a personality model on some test data (using metrics like accuracy, recall, precision, F1, mean squared error, correlation). Such performance numbers can offer insight into the model's validity, assuming that the gold-standard personality measurements are low in error. However, the model's validity in the presence of substantial measurement error in the data, as well as the model's reliability, remains unclear. Therefore, we urge future researchers to also examine and report both the validity and reliability of a model-based personality instrument. In Challenge 5, we discuss how this can be done.
However, it is important to exercise caution when selecting a personality instrument based on validity and reliability information obtained from earlier studies, as these studies had limitations in terms of the populations they examined (especially concerning demographic and linguistic characteristics), the time frames and contexts in which they were conducted. In a new study, the population of interest, the time and context may differ from those of previous studies. Consequently, researchers must carefully evaluate the validity and reliability evidence of an existing instrument in light of the specifics of the new study.
Once a personality instrument is selected, it is also important to cite the source of the instrument and report its validity and reliability information. Among the 60 reviewed papers, 5 use self-identified outcomes (like one's MBTI type or Big-5 scores mentioned in a tweet or user profile) where tracing down the instruments is impossible; 14 make use of proprietary instruments whose reliability and validity information is inaccessible to the public; among the 41 that use an existing personality instrument, 9 do not mention the specific instrument and only 4 report validity or reliability information based on previous studies.

C4: Further reduce measurement error by study design
Even when the best possible instrument is used, there can still be substantial measurement error that results from other design factors of a study, especially when questionnaires are used. Factors like questionnaire characteristics (e.g., the number of questions, visual layout, topics, wording) and data collection modes (e.g., online, in person) can affect the measurement quality of the responses (Biemer et al., 2013). Furthermore, factors related to respondents (e.g., inattention) can also affect measurement quality (Fleischer et al., 2015).
Therefore, when planning personality data collection using questionnaires, it can be beneficial to take into account different possible sources of measurement error. This helps to further reduce measurement error in questionnaire responses, in addition to using a valid, reliable questionnaire. For a comprehensive overview of factors that can influence measurement quality in questionnaires and the possible ways to control for them, we refer you to Callegaro et al. (2015) and Biemer et al. (2013). We also encourage collaboration with survey methodology experts.

C5: Quantify measurement error
Now, assume that personality measurements have been collected. The next step is to quantify both random and systematic error.
For questionnaire-based measurements, we recommend using factor analysis, which is a type of latent variable model that relates a set of observed variables (e.g., personality questionnaire items) to some latent variables (e.g., one's true, underlying personality traits) (Oberski, 2016). Depending on the model specification and data characteristics, factor analysis can decompose the total variation in the observed variables into different sources: variation due to the latent personality traits, variation due to systematic factors like questionnaire characteristics and the time of data collection, and finally, the unexplained variation (at the item level and at the questionnaire level). Larger variation due to the underlying personality traits and lower variation due to systematic factors are desirable, because they indicate higher measurement validity (i.e., less systematic error). In contrast, more unexplained variation indicates more random error (i.e., lack of reliability). Based on this variance decomposition, estimators of reliability and validity can be derived. We refer you to Saris and Gallhofer (2014, Chapter 9-12) for an overview of various factor analysis strategies and estimators of reliability and validity.
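As a simplified illustration of such a variance decomposition (our own notation, assuming that the trait, systematic and random components are additive and mutually uncorrelated; actual factor models are richer), an observed item score $y$ can be written as the sum of a trait component $T$, a systematic component $S$ and a random component $E$:

```latex
\begin{align*}
y &= T + S + E, \\
\operatorname{Var}(y) &= \sigma_T^2 + \sigma_S^2 + \sigma_E^2, \\
\text{reliability} &= \frac{\sigma_T^2 + \sigma_S^2}{\operatorname{Var}(y)}
                    = 1 - \frac{\sigma_E^2}{\operatorname{Var}(y)}, \\
\text{validity} &= \frac{\sigma_T^2}{\sigma_T^2 + \sigma_S^2}.
\end{align*}
```

Under these assumptions, reliability falls as the unexplained variance $\sigma_E^2$ grows, and validity falls as the systematic variance $\sigma_S^2$ grows relative to the trait variance $\sigma_T^2$, which matches the qualitative reading given above.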
For model-based measurements, the number of measurements per personality trait and person is typically limited to one. Factor analysis cannot be applied to such data because the factor model is mathematically non-identifiable (i.e., there is no unique solution). Therefore, different methods for quantifying random and systematic error (or equivalently, reliability and validity) are needed.
Random error is due to factors that cannot be controlled. Therefore, by varying the instrument or measurement condition along such factors, we can quantify the associated random error. In TPC models, one such factor is small variation in text (e.g., the use of singular vs. plural noun, which has not been linked to personality traits by prior research) that should not affect predictions. By introducing simple perturbations to the data, and comparing the new predictions with the ones based on the original data, we can gauge the degree of random error associated with this factor. This approach is analogous to the invariance tests of Ribeiro et al. (2020). Du et al. (2021) also showcase such reliability analyses for word embedding-based gender bias scores.
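A perturbation check of this kind might look like the following sketch. Everything here is illustrative: `toy_predict` is a hypothetical stand-in for a trained TPC model, the singular-to-plural swap table is hand-written, and the mean absolute change in predictions is only one possible summary of instability.

```python
import statistics


def perturb(text, swaps):
    """Replace whole words according to `swaps` (e.g. singular -> plural)."""
    return " ".join(swaps.get(word, word) for word in text.split())


def perturbation_instability(predict, texts, swaps):
    """Mean absolute change in predictions after perturbing each text.

    A value of 0 means the predictions are fully invariant to the perturbation.
    """
    diffs = [abs(predict(t) - predict(perturb(t, swaps))) for t in texts]
    return statistics.mean(diffs)


def toy_predict(text):
    """Hypothetical stand-in for a TPC model: returns the mean word length."""
    words = text.split()
    return sum(len(w) for w in words) / len(words)


texts = ["my dog likes the park", "a friend sent me a book"]
swaps = {"dog": "dogs", "friend": "friends", "book": "books"}
instability = perturbation_instability(toy_predict, texts, swaps)
```

In a real study, `predict` would wrap the model under evaluation, and the instability would be interpreted relative to the scale of the personality scores.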
To quantify validity, apart from the usual performance metrics, we can check whether the predicted scores of different personality traits correlate with one another in expected ways. For instance, while the overall correlations should be low, as different personality traits are distinct constructs, some correlations should be more positive (e.g., between conscientiousness and agreeableness) than others (e.g., between openness and neuroticism). van der Linden et al. (2010) provide an overview of empirical correlations between personality traits across demographic groups that can be expected. To the best of our knowledge, no TPC study has assessed the correlations among predicted personality scores. Furthermore, it can be helpful to check whether the predicted personality scores relate to other constructs like emotions (if data is available) in expected ways. Such tests are conceptually similar to convergent and discriminant validity analyses in the social sciences (Stachl et al., 2020). Even in the presence of large measurement error in the data, they can be useful. For a more in-depth discussion of validity testing in machine learning and NLP, see Jacobs and Wallach (2021) and Fang et al. (2022).
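As a rough sketch of the correlation check, the code below computes pairwise Pearson correlations among predicted trait scores and compares two pairs. The predicted scores are made-up numbers for five hypothetical persons, so the resulting values carry no empirical meaning; in practice, the correlations would be compared against reference values from the literature.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def trait_correlations(predictions):
    """Correlation for every unordered pair of traits in `predictions`."""
    traits = list(predictions)
    return {
        (a, b): pearson(predictions[a], predictions[b])
        for i, a in enumerate(traits)
        for b in traits[i + 1:]
    }


# Made-up predicted scores for five hypothetical persons
predicted = {
    "conscientiousness": [3.1, 3.8, 2.5, 4.0, 3.3],
    "agreeableness": [3.0, 3.9, 2.9, 3.7, 3.4],
    "openness": [3.2, 3.3, 3.0, 3.1, 3.4],
    "neuroticism": [3.5, 2.2, 3.9, 2.0, 3.0],
}
corr = trait_correlations(predicted)

# Expected ordering from the text: C-A should be more positive than O-N
ordering_holds = (
    corr[("conscientiousness", "agreeableness")]
    > corr[("openness", "neuroticism")]
)
```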

C6: Correct for measurement error
Choosing high-quality personality instruments and reducing measurement error by design are likely the most important and effective ways to ensure high-quality measurements. Once personality measurements have been collected, however, much less can be done about measurement error.
For questionnaire-based measurements, it might still be helpful to take a closer look at the results from factor analysis. For instance, do the measurements fit the assumed personality model (e.g., the Big-5)? If not, is it due to one or more questionnaire items that show unexpected relationships with the personality traits (e.g., the relationship is zero; the item correlates strongly with a different personality trait than expected)? If so, removing those problematic items can improve the validity of the personality measurements. Are there items with large unexplained variances? If so, removing them may increase reliability.
As for model-based measurements, if the personality models are proprietary or cannot be modified and retrained (e.g., due to lack of data or model details), then no correction for measurement error is possible. If retraining the model is possible, several techniques may help (see C12, C13 in §7).

Datasets
Across the 60 TPC studies, we find 41 unique datasets, which vary in terms of the personality taxonomy, instrument, type of text data, sample size, sample characteristics, etc. Among them, however, only 18 are potentially accessible to other researchers (see Appendix B). Shareable datasets are key to advancing TPC research, as they lead to accumulation of data and allow for replication studies. This invites the next challenge:

C7: Construct shareable datasets
One obstacle to sharing TPC datasets is privacy preservation, as TPC datasets often contain identifiable information (e.g., names, locations, events) about data subjects. For instance, with social media posts, their authors can be easily found by using the content of the posts as search terms (Norman Adams, 2022). We suggest two ways to make data sharing more privacy-preserving.
First, data pseudonymization and anonymization techniques can be used. With pseudonymization, the data subjects can still be identified if additional information is provided. With anonymization, however, re-identification is impossible. Whether to pseudonymize or anonymize depends on many factors, such as the difficulty in data anonymization and the severity of re-identification. Nevertheless, for TPC datasets containing social media posts, anonymization is likely impossible. We refer you to Lison et al. (2021) for more information.
Second, we can replace texts with paraphrases or synthetic data. The latter aims to "preserve the overall properties and characteristics of the original data without revealing information about actual individual data samples" (Hittmeir et al., 2019). However, whether these strategies are effective enough remains an open research question in NLP.

C8: Finer-grained measurements
All the 18 shareable datasets we find include only aggregated measurements of personality traits. Namely, for MBTI, only the classification types (e.g., INFP and ESTJ; instead of scores on each questionnaire item) are available; for Big-5, only the aggregated scores (e.g., means across items) for the five dimensions. This makes it impossible for other researchers to investigate measurement quality or train TPC models on the facet or item level. Even worse, some datasets provide no information about the personality instrument used. This especially concerns datasets that obtain personality labels from Twitter or Reddit based on the mention of MBTI or Big-5 information in a post or user profile (e.g., "INTJ"; "As an extravert...").
Other problematic treatments of aggregated personality measurements include further discretization, within-sample standardization or normalization to the target population. All of these lead to loss of information and limit the reusability of the measurements. Therefore, we suggest providing raw personality measurements, ideally on the item level.

C9: Include demographic information
We argue that the inclusion of demographic information (e.g., age, gender, education) can be important. Not only can this help researchers decide the appropriate personality instrument to use (in relation to the population of interest; see earlier discussion in §4: C3), it can also provide additional useful features for TPC models. Furthermore, researchers can use the demographic information to diagnose the model, for example by checking whether the measurement quality or the model's prediction performance differs across demographic groups (i.e., fairness). However, it is important to weigh the gain from including extra personal information against potential harm (see §8: C14).

Performance Evaluation
Across the 60 surveyed TPC papers, we identify two challenges related to performance evaluation:

C10: Use more appropriate and consistent performance metrics
Of the 60 studies, 11 model TPC as a regression task. Among them, 9 use Pearson's correlation between predicted personality trait scores and the true scores as the performance metric. However, correlation-based metrics can be misleading, as they are insensitive to the location and scale of the predictions and thus do not reflect how accurate the predictions are on the original scale of the personality scores (Stachl et al., 2020).
Some studies report mean squared error (MSE), which is arguably better than correlations because it quantifies the (squared) difference between the predictions and the true values. However, MSE scores depend on the scale of the personality measurements and are not bounded, making interpretation and comparison (between studies) difficult. Stachl et al. (2020) propose a better performance metric: $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, where RSS refers to the residual sum of squares and TSS to the total sum of squares. $R^2$ has several benefits. First, it can be considered a normalised version of MSE with an upper limit of 1 (perfect agreement). Second, $R^2$ has a natural zero point, which occurs when the mean is used as the prediction. Third, when the model makes worse predictions than a simple mean baseline, $R^2$ becomes negative.
While two studies report $R^2$, their calculation of $R^2$ is unclear. The researchers may have calculated $R^2$ not based on the formula shown earlier, but by squaring Pearson's correlations. This would always yield a non-negative $R^2$, which can be misleading.
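The difference between the two calculations can be seen in a small sketch with made-up scores on a 1-5 scale. The predictions below are a perfect linear function of the true scores but are systematically too high, so the squared Pearson correlation is 1 while $R^2$ computed as $1 - \mathrm{RSS}/\mathrm{TSS}$ is negative. The function names are our own.

```python
import math


def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS/TSS, which is negative for predictions worse than the mean."""
    mean_true = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - rss / tss


def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation, which can never be negative."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return (cov / (norm_t * norm_p)) ** 2


# Made-up scores: predictions are perfectly correlated but biased upwards
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [4.0, 4.2, 4.4, 4.6, 4.8]

r2 = r_squared(y_true, y_pred)            # below zero
r2_from_corr = squared_pearson(y_true, y_pred)  # approximately 1
r2_mean_baseline = r_squared(y_true, [3.0] * 5)  # exactly 0
```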
In the 33 studies that model TPC as a classification task, a more diverse set of metrics is used (i.e., accuracy, recall, precision, F1, and AUC, in macro, micro, weighted or unweighted forms). One problem, however, is that different studies report different metrics (sometimes, only one). This makes comparison across studies difficult. We encourage future researchers to report all common metrics for classification studies (like those mentioned above).
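To illustrate why reporting a single metric can hide information, the sketch below computes accuracy, macro-F1 and micro-F1 for a made-up, imbalanced binary example (the labels "I" and "E" are illustrative). The helper names are our own; in practice, an established library implementation would normally be used.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def classification_summary(y_true, y_pred):
    """Accuracy, macro-F1 and micro-F1 for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts[label] = (tp, fp, fn)
    macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(labels)
    # Micro-averaging pools the counts over all classes before computing F1
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "macro_f1": macro_f1,
        "micro_f1": f1_from_counts(*pooled),
    }


# Made-up, imbalanced example: the minority class "E" is partly missed
y_true = ["I", "I", "I", "I", "E", "E"]
y_pred = ["I", "I", "I", "I", "I", "E"]
summary = classification_summary(y_true, y_pred)
```

In this example, micro-F1 coincides with accuracy, while macro-F1 is lower because it weights the poorly predicted minority class equally.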

C11: Report performance expectation
While it is normal to optimize the prediction performance of TPC models, it is also important to set correct expectations: What kind of performance can we realistically expect? How accurate can personality predictions be when only (short) text is used? How good does the performance need to be for a particular system? Stajner and Yenikent (2021) make the first question even more relevant, as they find either few or mixed MBTI-related signals in typical text data used for MBTI prediction.
Quantifying measurement error in the personality scores used for modelling can help researchers to set clearer expectations about model performance, because high systematic error will limit the model's generalizability to new data, while high random error will result in unstable predictions.
Thus, setting expectations "forces" researchers to learn more about their data and to avoid unrealistic expectations that may lead to problematic research practices like cherry-picking results. Unfortunately, none of the 60 reviewed papers discusses performance expectation.

Modelling Choices
Most TPC studies model different personality traits (i.e., dimensions) separately. This strategy is undesirable, because it ignores the correlations among personality traits that models can learn from. Modelling personality traits jointly may also help to prevent overfitting to a specific trait and thus learn more universal personality representations (Liu et al., 2019). Hence, the next challenge:

C12: Joint personality modelling
Out of the 60 studies, only 5 attempt (some form of) joint modelling of personality traits. Yang et al. (2021a) implement a transformer-based model to predict MBTI types, where the use of questionnaire texts allows the model to infer automatically the relevant MBTI dimension, and hence removes the need for independent modelling of different MBTI dimensions. Gjurković and Šnajder (2018), Bassignana et al. (2020b) and Hosseinia et al. (2021) frame the prediction of MBTI types as a 16-class classification task (as there are in total 16 MBTI types), thereby using only one single model. Hull et al. (2021) apply "stacked single target chains" (Xioufis et al., 2016), which feed the predictions of one personality trait back in as features for the prediction of the next trait(s).
Multitask learning, which trains a model on multiple tasks simultaneously, may also be useful, as it might help to improve the generalizability of the model (Caruana, 1997). In addition, Stachl et al. (2020) suggest modifying a model's loss function such that the correlations between theoretically distinct constructs are minimised. Building on this idea, we can also specify the loss function such that it not only focuses on general prediction performance but also minimises the difference between the predicted covariance matrix and the observed covariance matrix of personality traits.
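One possible form of such a covariance-aware loss is sketched below in plain Python. It is only an illustration under our own assumptions: the general prediction term is taken to be the mean squared error, the covariance mismatch is measured by the squared Frobenius distance, and the weight lam (a hypothetical hyperparameter) balances the two terms. Other choices are equally conceivable.

```python
from statistics import fmean


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, k = len(rows), len(rows[0])
    means = [fmean(row[j] for row in rows) for j in range(k)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in rows)
            / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def joint_loss(y_true, y_pred, lam=1.0):
    """MSE plus lam times the squared Frobenius distance between the
    covariance matrices of the predicted and the observed trait scores."""
    n, k = len(y_true), len(y_true[0])
    mse = sum(
        (t - p) ** 2
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    ) / (n * k)
    cov_true = covariance_matrix(y_true)
    cov_pred = covariance_matrix(y_pred)
    penalty = sum(
        (cov_true[a][b] - cov_pred[a][b]) ** 2
        for a in range(k)
        for b in range(k)
    )
    return mse + lam * penalty
```

Under this loss, a model that predicts the mean score of every trait for every person is penalised twice: once for its prediction error and once because its predictions have zero covariance, unlike the observed traits.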

C13: Build on best modelling choices
As the field progresses, it is important to not only investigate new modelling ideas, but also accumulate knowledge about best modelling practices. We list below several empirically supported ideas that should not slip past the community's attention.
First, researchers should leverage the texts in personality questionnaires. Kreuter et al. (2022), Vu et al. (2020) and Yang et al. (2021a) find that incorporating personality questionnaire texts into model learning can lead to better personality predictions.
Second, when sample sizes are small, data augmentation and dimensionality reduction techniques are beneficial. Kreuter et al. (2022) show that using data augmentation to increase the training size of personality questionnaire items leads to better predictions. V Ganesan et al. (2021) show that PCA helps to overcome the problem of fine-tuning large language models with a small TPC dataset.

Ethics and Fairness
Out of the 60 reviewed papers and the 2 additional survey papers, only 7 provide some reflection about ethics and none about fairness. This may be because ethics and fairness only became central in NLP recently. Nevertheless, both deserve attention, hence the last two challenges:

C14: More ethical and useful TPC
As useful as TPC can be, it is important to ask whether gathering or computing personal information like personality is necessary. This is especially relevant for research where TPC is only an intermediate step to another end such as opinion mining (Hosseinia et al., 2021), dialogue generation (Mairesse and Walker, 2008) and brand preference prediction (Yang et al., 2015). Such studies typically argue that the computed personality traits can be used as features for another task and that this leads to better task performance; however, they do not consider alternatives (e.g., replacing the prediction of personality traits with using lexical cues that are non-personal but still indicative of personality). Thus, we encourage researchers to justify PC and to find alternatives when PC is only a means.
Even when TPC can be justified, it is important to reduce potential harm. For instance, many TPC studies and datasets make use of public social media profiles for predicting personality traits. While this is often legal, no explicit consent for PC is obtained from the social media users, which makes using public social media data an ethically ambiguous issue (Norman Adams, 2022). Boeschoten et al. (2022) propose a privacy-preserving data donation framework that may help to alleviate this problem, where data subjects can voluntarily donate their data download packages (e.g., from social media accounts) for research and give explicit consent.
To further increase the benefit of TPC, we can consider applying it to clinical, professional or educational settings, where (traditional) personality assessment has proven useful and relevant (e.g., personalised treatments; career recommendation; individualised learning). None of the 60 TPC studies in our review investigates these applications.

C15: Research on Fair TPC
Fairness research in machine learning concerns identifying and mitigating biases that may be present within a system, particularly towards specific groups (Mehrabi et al., 2021). For instance, a fair TPC model should exhibit equal predictive performance across different demographic groups. Among many other factors, biased training data contributes significantly to algorithmic bias. From the perspective of measurement quality, choosing a personality instrument that lacks equal validity and reliability across all demographic groups of interest can introduce variations in the quality of personality measurements among different groups. Consequently, these discrepancies can perpetuate algorithmic bias within the system.
Remarkably, none of the 60 TPC papers we survey addresses the topic of fairness. Therefore, there is a clear need for future research on fairness in the context of TPC.
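As a starting point, the requirement of equal predictive performance across groups can be checked with a simple per-group comparison. The sketch below is a minimal illustration under our own assumptions: the error measure is the mean absolute error of predicted trait scores, the group labels are given, and the function names (mae_by_group, max_gap) are hypothetical. It does not replace a full fairness analysis, which would also consider the measurement quality of the labels within each group.

```python
from collections import defaultdict


def mae_by_group(y_true, y_pred, groups):
    """Mean absolute error of trait-score predictions within each group."""
    errors = defaultdict(list)
    for true, pred, group in zip(y_true, y_pred, groups):
        errors[group].append(abs(true - pred))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


def max_gap(per_group):
    """Largest difference in error between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Toy data: observed scores, predicted scores and a group label per person.
scores_true = [3.0, 4.0, 2.0, 5.0]
scores_pred = [3.5, 4.0, 3.0, 3.0]
group_labels = ["a", "a", "b", "b"]

per_group = mae_by_group(scores_true, scores_pred, group_labels)
gap = max_gap(per_group)
```

A large gap indicates that the model predicts personality noticeably worse for some groups than for others, which warrants a closer look at both the training data and the personality instrument.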

Conclusion
In this paper, we review 60 TPC papers from the ACL Anthology and identify 15 challenges that we consider deserving of the attention of the research community. We focus on the following 6 topics: personality taxonomies, measurement quality, datasets, performance evaluation, modelling choices, as well as ethics and fairness. While some of these topics (e.g., personality taxonomies and ethics) have been discussed elsewhere, we provide new perspectives. Furthermore, in light of these challenges, we offer concrete recommendations for future TPC research, which we summarise below:
• Personality taxonomies: Choose Big-5 over MBTI; Try modelling facets and using other taxonomies like HEXACO where appropriate.
• Measurement quality: Pay attention to measurement error in personality measurements, be they based on questionnaires or models; Try to reduce measurement error by design (e.g., choose higher-quality instruments; use better data collection practices); Provide quality evaluation (i.e., validity and reliability) for any new (and also existing) approaches.
• Datasets: Make TPC datasets shareable, which should also contain fine-grained personality measurements and descriptions of the target population.
• Performance evaluation: Report a diverse set of performance metrics; Report R² for a regression task.
• Modelling choices: Make use of the psychometric properties of personality traits when modelling them (e.g., use joint modelling; modify the loss function to preserve the covariance information); For even better predictions, try incorporating personality questionnaire texts, applying data augmentation and dimensionality reduction techniques, as well as incorporating more personality-related variables.
• Ethics and fairness: Avoid unnecessary TPC; Apply TPC to clinical, professional and educational settings; Investigate fairness.
We hope that our paper will inspire better TPC research and new research directions.

Limitations
Our paper has some limitations. First, we do not give detailed instructions about the techniques that we recommend (e.g., factor analysis, synthetic data generation). We rely on our readers' autonomy to acquire the necessary information (that is specific to their research projects) by further reading our recommended references. Second, we only survey TPC papers included in the ACL Anthology, despite other TPC papers existing outside this venue. While this means that the challenges we identified might be specific to these papers, we believe they are still a good representation of the TPC research done in NLP. Lastly, we limit our discussion to text data. It would be beneficial for future research to also discuss challenges facing PC based on other types of data (e.g., images, behaviours, videos), which may offer additional insights to TPC research.

Statement of Ethics and Impact
Our work provides a critical evaluation of past TPC research, where we present 15 open challenges that we consider deserving of the attention of the research community. For each of these challenges, we offer concrete suggestions, thereby hoping to inspire higher-quality TPC research (e.g., more valid and reliable personality measurements, better datasets, better modelling practices). We also discuss issues related to ethics and fairness (see §8). We hope to see more ethical and fair TPC research.