Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality

In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study of how predictive performance depends on dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders), the dimensionality of embedding vectors, and sample size. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty, which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.


Introduction
Transformer-based language models (LMs) have quickly become the foundation for accurately approaching many tasks in natural language processing (Vaswani et al., 2017; Devlin et al., 2019). Their success owes to their ability to capture both syntactic and semantic information (Tenney et al., 2019), modeled over large, deep attention-based networks (transformers) with hidden state sizes on the order of 1000 over tens of layers (Liu et al., 2019; Gururangan et al., 2020). In total, such models typically have from hundreds of millions (Devlin et al., 2019) to a few billion parameters (Raffel et al., 2020). However, the size of such models presents a challenge for tasks involving small numbers of observations, such as the growing number of tasks focused on human-level NLP.
Human-level NLP tasks, rooted in computational social science, focus on making predictions about people from their language use patterns. Some of the more common tasks include age and gender prediction (Sap et al., 2014; Morgan-Lopez et al., 2017), personality (Park et al., 2015; Lynn et al., 2020), and mental health prediction (Coppersmith et al., 2014; Guntuku et al., 2017; Lynn et al., 2018). Such tasks present an interesting challenge for the NLP community to model the people behind the language rather than the language itself, and the social scientific community has begun to see success of such approaches as an alternative or supplement to standard psychological assessment techniques like questionnaires (Kern et al., 2016; Eichstaedt et al., 2018). Generally, such work is helping to embed NLP in a greater social and human context (Hovy and Spruit, 2016; Lynn et al., 2019).
Despite the simultaneous growth of both (1) the use of transformers and (2) human-level NLP, the effective merging of transformers for human-level tasks has received little attention. In a recent human-level shared task on mental health, most participants did not utilize transformers (Zirikly et al., 2019). A central challenge for their utilization in such scenarios is that the number of training examples (i.e. sample size) is often only hundreds while the parameters for such deep models are in the hundreds of millions. For example, recent human-level NLP shared tasks focused on mental health have had N = 947 (Milne et al., 2016), N = 9,146 (Lynn et al., 2018) and N = 993 (Zirikly et al., 2019) training examples. Such sizes all but rule out the increasingly popular approach of fine-tuning transformers, whereby all of a model's millions of parameters are allowed to be updated toward the specific task one is trying to achieve (Devlin et al., 2019; Mayfield and Black, 2020). Recent research not only highlights the difficulty of fine-tuning with few samples (Jiang et al., 2020) but also shows that it becomes unreliable even with thousands of training examples (Mosbach et al., 2020).
On the other hand, some of the common transformer-based approaches of deriving contextual embeddings from the top layers of a pre-trained model (Devlin et al., 2019; Clark et al., 2019) still leave one with approximately as many embedding dimensions as training examples. In fact, in one of the few successful cases of using transformers for a human-level task, further dimensionality reduction was used to avoid overfitting (Matero et al., 2019), but an empirical understanding of the application of transformers for human-level tasks (which models are best, and the relationship between embedding dimensions, sample size, and accuracy) has yet to be established.
In this work, we empirically explore strategies to effectively utilize transformer-based LMs for relatively small sample-size human-level tasks. We provide the first systematic comparison of the most widely used transformer models for demographic, personality, and mental health prediction tasks. Then, we consider the role of dimension reduction to address the challenge of applying such models on small sample sizes, yielding a suggested minimum number of dimensions necessary given a sample size for each of demographic, personality, and mental health tasks¹. While it is suspected that transformer LMs contain more dimensions than necessary for document- or word-level NLP (Li and Eisner, 2019; Bao and Qiao, 2019), this represents the first study on transformer dimensionality for human-level tasks.

Related Work
Recently, NLP has taken to human-level predictive tasks using increasingly sophisticated techniques. The most common approaches use n-grams and LDA (Blei et al., 2003) to model a person's language and behaviors (Resnik et al., 2013; Kern et al., 2016). Other approaches utilize word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and, more recently, contextual word representations (Ambalavanan et al., 2019).
Our work is inspired by one of the top performing systems at a recent mental health prediction shared task (Zirikly et al., 2019) that utilized transformer-based contextualized word embeddings fed through a non-negative matrix factorization to reduce dimensionality (Matero et al., 2019). While the approach seems reasonable for addressing the dimensionality challenge in using transformers, many critical questions remain unanswered: (a) Which type of transformer model is best? (b) Would fine-tuning have worked instead? and (c) Does such an approach generalize to other human-level tasks? Most of the time, one does not have the luxury of a shared task for the problem at hand to determine the best approach. Here, we look across many human-level tasks, some of which have the luxury of relatively large sample sizes (in the thousands) from which to establish upper bounds, and ultimately draw generalizable information on how to approach a human-level task given its domain (demographic, personality, mental health) and sample size.
¹ Dimension reduction techniques can also be pre-trained, leveraging larger sets of unlabeled data.
Our work also falls in line with a rising trend in AI and NLP to quantify the number of dimensions necessary. While this has not been considered for human-level tasks, it has been explored in other domains. The post-processing algorithm for static word embeddings (Mu and Viswanath, 2018), motivated by the power-law distribution of maximum explained variance and the domination of the mean vector, turned out to be very effective in making these embeddings more discriminative. The analysis of contextual embedding models (Ethayarajh, 2019) suggests that static embeddings contribute less than 5% of the explained variance, but the contribution of the mean vector starts dominating when contextual embedding models are used for human-level tasks. This is an effect of averaging the message embeddings to form user representations in human-level tasks. It further motivates the need to process these contextual embeddings into more discriminative features.
Lastly, our work weighs in on the discussion of which type of model is best for producing effective contextual embeddings. A majority of the models fall under two broad categories based on how they are pre-trained: auto-encoder (AE) and auto-regressive (AR) models. We compare AE and AR style LMs using two widely used models from each category with comparable numbers of parameters. From the experiments involving BERT, RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019) and GPT-2 (Radford et al., 2019), we find that AE-based models perform better than AR style models (with comparable model sizes), and RoBERTa is the best choice amongst these four widely used models.

Data & Tasks
We evaluate approaches over 7 human-level tasks spanning demographics, mental health, and personality prediction. The 3 datasets used for these tasks are described below.
FB-Demogs. (age, gen, ope, ext) One of our goals was to leverage one of the largest human-level datasets in order to evaluate over subsamples of varying sizes. For this, we used the Facebook demographic and personality dataset of Kosinski et al. (2013). The data was collected from approximately 71k consenting participants who shared Facebook posts along with demographic and personality scores from Jan-2009 through Oct-2011. The users in this sample had written at least 1000 words and had selected English as their primary language. Age (age) was self-reported and limited to those 65 years or younger (data beyond this age becomes very sparse), as in Sap et al. (2014). Gender (gen) was only provided as a limited single binary, male-female classification.
Personality was derived from the Big 5 personality traits questionnaires, including both extraversion (ext, one's tendency to be energized by social interaction) and openness (ope, one's tendency to be open to new ideas) (Schwartz et al., 2013). Disattenuated Pearson correlation² (r_dis) was used to measure the performance of these two personality prediction tasks.
CLPsych-2018. (bsag, gen2) The CLPsych 2018 shared task (Lynn et al., 2018) consisted of sub-tasks aimed at early prediction of mental health scores (depression, anxiety and BSAG³ score) based on their language. The data for this shared task (Power and Elliott, 2005) comprised English essays written by 11-year-old students along with their gender (gen2) and income classes. There were 9217 students' essays for training and 1000 for testing. The average word count in an essay was less than 200. Each essay was annotated with the student's psychological health measure, BSAG (when 11 years old), and distress scores at ages 23, 33, 42 and 50. This task used a disattenuated Pearson correlation as the metric (r_dis).

CLPsych-2019. (sui) This 2019 shared task (Zirikly et al., 2019) comprised 3 sub-tasks for predicting the suicide risk level of Reddit users. The data included a history of user posts on r/SuicideWatch (SW), a subreddit dedicated to those wanting to seek outside help for processing their current state of emotions. Their posts on other subreddits (NonSuicideWatch) were also collected. The users were annotated with one of 4 risk levels (none, low, moderate and severe risk) based on their history of posts. In total this task spans 496 users in training and 125 in testing. We focused on Task A, predicting suicide risk of a user by evaluating their (English) posts across SW, measured via macro-F1.

Table 1: Summary of the datasets. N_pt is the number of users available for pre-training the dimension reduction model; N_max is the maximum number of users available for task training. For CLPsych 2018 and CLPsych 2019, this would be the same sample as the pre-training data. For Facebook, a disjoint set of 10k users was available for task training; N_te is the number of test users. This is always a disjoint set of users from the pre-training and task training samples.

Methods
Here we discuss how we utilized representations from transformers, our approaches to dimensionality reduction, and our technique for robust evaluation using bootstrapped sampling.

Transformer Representations
The second-to-last layer representations of all the messages were averaged to produce a 768-dimensional feature vector for each user⁴. These user representations are reduced to lower dimensions as described in the following paragraphs. The message representation from a layer was obtained by averaging the token embeddings of that layer. To consider a variety of transformer LM architectures, we explored two popular auto-encoder (BERT and RoBERTa) and two auto-regressive (XLNet and GPT-2) transformer-based models. For fine-tuning evaluations, we used the transformer-based model that performs best across the majority of our task suite. Transformers are typically trained on single messages or pairs of messages at a time. Since we are tuning towards a human-level task, we label each of a user's messages with that user's human-level attribute and treat it as a standard document-level task (Morales et al., 2019). Since we are interested in relative differences in performance, we limit each user to at most 20 randomly sampled messages (approximately the median number of messages) to save compute time for the fine-tuning experiments.
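The two-stage averaging described above (token embeddings to a message vector, message vectors to a user vector) can be sketched as follows. This is a minimal illustration, assuming the token embeddings from the chosen layer have already been extracted; the function names `mean_pool` and `user_embedding` are hypothetical and not taken from the original work.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, one dimension at a time."""
    return [fmean(dim) for dim in zip(*vectors)]


def user_embedding(messages):
    """Build one user-level vector.

    `messages` is a list of messages; each message is a list of token
    vectors taken from a single transformer layer (assumed precomputed).
    """
    # Token embeddings -> one vector per message.
    message_vectors = [mean_pool(tokens) for tokens in messages]
    # Message vectors -> one vector per user.
    return mean_pool(message_vectors)
```

With real transformer outputs each token vector would have 768 entries; the logic is unchanged for any dimensionality.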

Dimension Reduction
We explore singular value decomposition-based methods such as principal components analysis (PCA) (Halko et al., 2011), non-negative matrix factorization (NMF) (Févotte and Idier, 2011) and factor analysis (FA), as well as a deep learning approach: multi-layer non-linear auto-encoders (NLAE) (Hinton and Salakhutdinov, 2006). We also considered the post-processing algorithm (PPA) for word embeddings⁵ (Mu and Viswanath, 2018), which has shown effectiveness with PCA at the word level (Raunak et al., 2019). Importantly, besides transformer LMs being pre-trained, so too can dimension reduction. Therefore, we distinguish: (1) learning the transformation from higher dimensions to lower dimensions (preferably on a large data sample from the same domain) and (2) applying the learned transformation (on the task's train/test set). For the first step, we used a separate set of 56k unlabeled user data in the case of FB-Demogs⁶. For CLPsych-2018 and -2019 (where separate data from the exact domains was not readily available), we used the task training data to train the dimension reduction. Since variance explained in factor analysis typically follows a power law, these methods transformed the 768 original embedding dimensions down to k, in powers of 2: 16, 32, 64, 128, 256 or 512.
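The separation between (1) learning the reduction on domain data and (2) applying it to the task's train/test embeddings can be illustrated with a small PCA sketch. It assumes NumPy is available and uses random arrays as stand-ins for real user embeddings; `fit_pca` and `apply_pca` are hypothetical helper names, and the plain SVD used here is a substitute for whichever PCA solver the authors used.

```python
import numpy as np


def fit_pca(domain_embeddings, k):
    """Step 1: learn a k-dimensional projection from unlabeled domain embeddings."""
    mean = domain_embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(domain_embeddings - mean, full_matrices=False)
    return mean, vt[:k]


def apply_pca(embeddings, mean, components):
    """Step 2: apply the already-learned projection to task embeddings."""
    return (embeddings - mean) @ components.T


# Random stand-ins for user embeddings (assumption: 768-dimensional inputs).
rng = np.random.default_rng(0)
domain = rng.normal(size=(1000, 768))     # plays the role of the N_pt users
task_train = rng.normal(size=(100, 768))  # plays the role of the task training users
task_test = rng.normal(size=(50, 768))    # plays the role of the N_te users

mean, comps = fit_pca(domain, k=128)
train_k = apply_pca(task_train, mean, comps)
test_k = apply_pca(task_test, mean, comps)
```

The key design point is that the mean and components come only from the domain sample, so the task's train and test users pass through an identical, fixed transformation.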

Bootstrapped Sampling & Training
We systematically evaluate the role of training sample size (N_ta) versus embedding dimensions (k) for human-level prediction tasks. The approach is described in Algorithm 1. Varying N_ta, the task-specific training data (after dimension reduction) is sampled randomly (with replacement) to get ten training samples with N_ta users each. Small N_ta values simulate a low-data regime and were used to understand its relationship with the least number of dimensions required to perform the best (N_ta vs k). Bootstrapped sampling was done to arrive at a conservative estimate of performance. Each of the bootstrapped samples was used to train either an L2-penalized (ridge) regression model or a logistic regression model, for the regression and classification tasks respectively. The performance on the test set using models from each bootstrapped training sample was recorded in order to derive a mean and standard error for each N_ta and k for each task.
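The bootstrapped evaluation loop can be sketched as below. This is an illustrative outline rather than the authors' Algorithm 1: `train_and_score` is a hypothetical callable that is assumed to fit a task model (for example ridge or logistic regression) on the sampled users and return its score on the fixed test set.

```python
import random
from statistics import mean, stdev


def bootstrap_scores(train_users, n_ta, train_and_score, n_samples=10, seed=0):
    """Return (mean, standard error) of the test score over bootstrapped samples.

    train_users: pool of task-training users (already dimension-reduced).
    n_ta: number of users drawn for each bootstrapped training sample.
    train_and_score: hypothetical callable, sample -> test-set score.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Sampling with replacement, N_ta users per sample.
        sample = rng.choices(train_users, k=n_ta)
        scores.append(train_and_score(sample))
    std_err = stdev(scores) / len(scores) ** 0.5
    return mean(scores), std_err
```

Running this once per combination of N_ta and k yields the grid of means and standard errors that the later analyses summarize.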
To summarize results over the many tasks and possible k and N_ta values in a useful fashion, we propose a 'first k to peak (fkp)' metric. For each N_ta, this is the first observed k value for which the mean score is within the 95% confidence interval of the peak performance. This quantifies the minimum number of dimensions required for peak performance.
⁵ The 'D' value was set to (number of dimensions)/100.
⁶ These pre-trained dimension reduction models are made available.
Table 2: Comparison of the most commonly used auto-encoder (AE) and auto-regressive (AR) language models after reducing the 768 dimensions to 128 using NMF and training on 100 and 500 samples (N_ta) for each task. N_ta is the number of samples used for training each task. Classification tasks (gen, gen2 and sui) were scored using macro-F1 (F1); the remaining regression tasks were scored using Pearson r (r) or disattenuated Pearson r (r_dis). AE models predominantly perform the best. RoBERTa and BERT show consistent performance, with the former performing the best in most tasks. The LMs in the table were base models (approx. 110M parameters).

Best LM for Human-Level Tasks
We start by comparing transformer LMs, replicating the setup of one of the state-of-the-art systems for the CLPsych-2019 task, in which embeddings from BERT-base were reduced to approximately 100 dimensions using NMF (Matero et al., 2019). Specifically, we used 128 dimensions (to stick with the powers of 2 that we use throughout this work) as we explore the other LMs over multiple tasks (we will explore other dimensions next), and otherwise use the bootstrapped evaluation described in the methods. Table 2 shows the comparison of the four transformer LMs when varying the sample size (N_ta) between two low data regimes: 100 and 500⁷. RoBERTa and BERT were the best performing models in almost all the tasks, suggesting auto-encoder-based LMs are better than auto-regressive models for these human-level tasks. Further, RoBERTa performed better than BERT in the majority of cases. Since the numbers of model parameters are comparable, this may be attributable to RoBERTa's larger pre-training corpus, which includes more human discourse and a larger vocabulary in comparison to BERT.
⁷ The performance of all transformer embeddings without any dimension reduction, along with smaller sized models, can be found in appendix section D.3.
Table 3: Comparison of task-specific fine-tuning of RoBERTa (top 2 layers) and pre-trained RoBERTa embeddings (second-to-last layer) for age and gender prediction tasks. Results are averaged across 5 trials randomly sampling users equal to N_ta from the Facebook data and reducing messages to a maximum of 20 per user.

Fine-Tuning Best LM
We next evaluate fine-tuning in these low data situations⁸. Utilizing RoBERTa, the best performing transformer from the previous experiments, we perform fine-tuning across the age and gender tasks. Following Sun et al. (2019) and Mosbach et al. (2020), we freeze layers 0-9 and fine-tune layers 10 and 11. Even the top 2 layers alone of RoBERTa still result in a model that is updating tens of millions of parameters while being tuned to a dataset of hundreds of users and at most 10,000 messages. In Table 3, results for age and gender are shown for both sample sizes of 100 and 500. For age, the average prediction across all of a user's messages was used as the user's prediction, and for gender the mode was used. Overall, we find that fine-tuning offers lower performance with increased overhead for both training time and modeling complexity (hyperparameter tuning, layer selection, etc.). We did robustness checks for hyperparameters to offer more confidence that this result was not simply due to the fastidious nature of fine-tuning. The process is described in Appendix B, including an extensive exploration of hyperparameters, which never resulted in improvements over the pre-trained setup. We are left to conclude that fine-tuning over such small user samples, at least with current typical techniques, is not able to produce results on par with using transformers to produce pre-trained embeddings.
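Because the fine-tuned model predicts per message while the tasks are scored per user, message-level predictions have to be collapsed into one prediction per user (mean for age, mode for gender). A minimal sketch of that aggregation follows; the function name `aggregate_by_user` and the `(user_id, prediction)` input format are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, mode


def aggregate_by_user(message_preds, reducer):
    """Collapse message-level predictions into one prediction per user.

    message_preds: iterable of (user_id, prediction) pairs.
    reducer: `mean` for a regression target such as age,
             `mode` for a class label such as gender.
    """
    by_user = defaultdict(list)
    for user_id, pred in message_preds:
        by_user[user_id].append(pred)
    return {user_id: reducer(preds) for user_id, preds in by_user.items()}
```

For example, `aggregate_by_user(pairs, mean)` would give user-level age estimates and `aggregate_by_user(pairs, mode)` user-level gender labels.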

Best Reduction Technique for Human-Level Tasks
We evaluated the reduction techniques in the low data regime by comparing their performance on the downstream tasks across 100 and 500 training samples (N_ta). As described in the methods, techniques including PCA, NMF and FA, along with NLAE, were applied to reduce the 768-dimensional RoBERTa embeddings to 128 features. The results in Table 4 show that PCA and NLAE perform most consistently, with PCA having the best scores in the majority of tasks. NLAE's performance appears dependent on the amount of data available during pre-training. This is evident from the results in Table 4, where N_pt was set to a uniform value and tested for all the tasks with N_ta set to 100 and 500. Thus, PCA appears to be the more reliable choice, generalizing better in low-sample settings.

Performance by Sample Size and Dimensions
Now that we have found (1) RoBERTa generally performed best, (2) pre-training worked better than fine-tuning, and (3) PCA was most consistently best for dimension reduction (often doing better than the full dimensions), we can systematically evaluate model performance as a function of training sample size (N_ta) and number of dimensions (k) over tasks spanning demographics, personality, and mental health. We exponentially increase k from 16 to 512, recognizing that variance explained decreases exponentially with dimension (Mu and Viswanath, 2018). The performance is also compared with that of using the RoBERTa embeddings without any reduction. Figure 1 compares the scores at reduced dimensions for age, ext, ope and bsag. These charts depict the experiments in the typical low data regime (N_ta ≤ 1000). Lower dimensional representations performed comparably to the peak performance with just 1/3 of the features while covering the largest number of tasks, and with just 1/12 of the features for the majority of tasks. Charts exploring other ranges of N_ta values and the remaining tasks can be found in Appendix D.1.
Figure 1: Comparison of performance for all regression tasks (age, ext, ope and bsag) over varying N_ta and k. Dimension-reduced results are shown as mean ± standard error, and all-dimension results as the mean. Results vary by task, but predominantly, performance at k=64 is better than the performance without any reduction. The reduced features almost always perform better than or as well as the original embeddings.

Least Number of Dimensions Required
Lastly, we devise an experiment motivated by the question of how many dimensions are necessary to achieve top results, given a limited sample size. Specifically, we define 'first k to peak' (fkp) as the smallest k that produces an accuracy equivalent to the peak performance. A 95% confidence interval was computed for the best score (peak) for each task and each N_ta based on bootstrapped resamples, and fkp was the least number of dimensions where this threshold was passed.
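The fkp definition can be made concrete with a short sketch. It assumes the bootstrapped test scores are available as a mapping from k to a list of scores, and it approximates the 95% confidence interval with a normal-theory bound (mean minus 1.96 standard errors); the exact interval construction used in the original experiments is not specified here, so both the function name and the `z` parameter are illustrative assumptions.

```python
from statistics import mean, stdev


def first_k_to_peak(scores_by_k, z=1.96):
    """Smallest k whose mean score reaches the lower 95% bound of the peak.

    scores_by_k: dict mapping k -> list of bootstrapped test scores.
    z: critical value for the interval (normal approximation, an assumption).
    """
    means = {k: mean(scores) for k, scores in scores_by_k.items()}
    peak_k = max(means, key=means.get)
    peak_scores = scores_by_k[peak_k]
    std_err = stdev(peak_scores) / len(peak_scores) ** 0.5
    lower_bound = means[peak_k] - z * std_err
    return min(k for k, m in means.items() if m >= lower_bound)
```

With only ten bootstrapped samples, a t-distribution critical value would give a somewhat wider interval and could therefore select a smaller k.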
Our goal is that such results can provide a systematic guide for making such modeling decisions in future human-level NLP tasks, where such an experiment (which relies on resampling over larger amounts of training data) is typically not feasible. Table 5 shows the fkp over all of the training sample sizes (N_ta). The exponential median (med) in the table is calculated as med = 2^{Median(log(x))}. The fkp results suggest that having more training samples available makes it possible to leverage more dimensions, but the degree depends on the task. In fact, utilizing all the embedding dimensions was only effective for demographic prediction tasks. The other two tasks benefited from reduction, often with only 1/12 to 1/6 of the original second-to-last transformer layer dimensions.
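The exponential median can be computed directly from its definition. The sketch below assumes the logarithm is taken base 2, which is the natural reading given the base of the exponent and the fact that k takes power-of-2 values; the function name is hypothetical.

```python
from math import log2
from statistics import median


def exponential_median(values):
    """Median taken in log2 space, mapped back: 2 ** Median(log2(x))."""
    return 2 ** median(log2(v) for v in values)
```

For an even number of values this is the geometric mean of the two middle values, so a median between 16 and 64 comes out as 32 rather than the arithmetic midpoint 40.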

Error Analysis
Here, we seek to better understand why using pre-trained models worked better than fine-tuning, and the differences between using PCA and NMF components in the low sample setting (N_ta = 500).

Table 6 (columns: Association, LIWC variables): This suggests that the fine-tuned models have lower error than the pre-trained model when the language is informal and contains more affect words.
Pre-trained vs Fine-tuned. We looked at categories of language from LIWC (Tausczik and Pennebaker, 2010) that correlated with the difference in the absolute error of the pre-trained and fine-tuned models in age prediction. Table 6 suggests that the pre-trained model is better at handling users whose language conforms to formal rules, while fine-tuning helps in learning better representations of affect words and captures informal language well. Furthermore, these LIWC variables are also known to be associated with age (Schwartz et al., 2013). Additional analysis comparing these two models is available in Appendix E.1.
PCA vs NMF. Figure 2 suggests that PCA is better than NMF at handling longer text sequences (more than 55 one-grams on average) when trained with less data. This choice would not make much difference for Tweet-like short texts, but the errors diverge rapidly for longer samples. We also see that PCA is better at capturing information from these texts that has higher predictive power in downstream tasks. This is discussed in Appendix E.2, along with other interesting findings involving the comparison of PCA and the pre-trained model in E.3.

Discussion
Ethical Consideration. We used existing datasets that were either collected with participant consent (FB and CLPsych 2018) or public data with identifiers removed and collected in a non-intrusive manner (CLPsych 2019). All procedures were reviewed and approved by both our institutional review board as well as the IRB of the creators of the data set. Our work can be seen as part of the growing body of interdisciplinary research intended to understand human attributes associated with language, aiming towards applications that can improve human life, such as producing better mental health assessments that could ultimately save lives. However, at this stage, our models are not intended to be used in practice for mental health care nor for labeling individuals publicly with mental health, personality, or demographic scores. Even when the point comes where such models are ready for testing in clinical settings, this should only be done with oversight from professionals in mental health care to establish the failure modes and their rates (e.g. false-positives leading to incorrect treatment or false-negatives leading to missed care; increased inaccuracies due to evolving language; disparities in failure modes by demographics). Malicious use possibilities for which this work is not intended include targeting advertising to individuals using language-based psychology scores, which could present harmful content to those suffering from mental health conditions. We intend that the results of our empirical study are used to inform fellow researchers in computational linguistics and psychology on how to better utilize contextual embeddings towards the goal of improving psychological and mental health assessments. Mental health conditions, such as depression, are widespread, and many suffering from such conditions are under-served, with only 13-49% receiving minimally adequate treatment (Kessler et al., 2003; Wang et al., 2005).
Marginalized populations, such as those with low income or minorities, are especially under-served (Saraceno et al., 2007). Such populations are well represented in social media (Center, 2021) and with this technology developed largely over social media and predominantly using self-reported labels from users (i.e., rather than annotator-perceived labels that sometimes introduce bias (Sap et al., 2019;Flekova et al., 2016)), we do not expect that marginalized populations are more likely to hit failure modes. Still, tests for error disparities (Shah et al., 2020) should be carried out in conjunction with clinical researchers before this technology is deployed. We believe this technology offers the potential to broaden the coverage of mental health care to such populations where resources are currently limited.
Future assessments built on the learnings of this work, and in conjunction with clinical mental health researchers, could help the under-served by both better classifying one's condition as well as identifying an ideal treatment. Any applications to human subjects should consider the ethical implications, undergo human subjects review, and the predictions made by the model should not be shared with the individuals without consulting the experts.
Limitations. Each dataset brings its own unique selection biases across groups of people, which is one reason we tested across many datasets covering a variety of human demographics. Most notably, the FB dataset is skewed young and is geographically focused on residents within the United States. The CLPsych 2018 dataset is a representative sample of citizens of the United Kingdom, all born in the same week, and the CLPsych-2019 dataset was further limited primarily to those posting in a suicide-related forum (Zirikly et al., 2019). Further, tokenization techniques can also impact language model performance (Bostrom and Durrett, 2020). To avoid oversimplification of complex human attributes, in line with psychological research (Haslam et al., 2012), all outcomes were kept in their most dimensional form: for example, personality scores were kept as real values rather than divided into bins, and the CLPsych-2019 risk levels were kept at 4 levels to yield gradation in assessments, as justified by Zirikly et al. (2019).

Conclusion
We provide the first empirical evaluation of the effectiveness of contextual embeddings as a function of dimensionality and sample size for human-level prediction tasks. Multiple human-level tasks, along with many of the most popular language model techniques, were systematically evaluated in conjunction with dimension reduction techniques to derive optimal setups for the low sample regimes characteristic of many human-level tasks.
We first show that fine-tuning transformer LMs in low-data scenarios yields worse performance than pre-trained models. We then show that reducing dimensions of contextual embeddings can improve performance, and while past work used non-negative matrix factorization (Matero et al., 2019), we note that PCA gives the most reliable improvement. Auto-encoder-based transformer language models gave better performance, on average, than their auto-regressive contemporaries of comparable sizes. We find optimized versions of BERT, specifically RoBERTa, to yield the best results.
Finally, we find that many human-level tasks can be achieved with a fraction, often 1

where GPU memory and model sizes (space occupied by the model) are in bytes, trainable params corresponds to the number of trainable parameters during fine-tuning, layers corresponds to the number of layers of embeddings required, hidden_size is the number of dimensions in the hidden state, and max_tokens is the maximum number of tokens (after tokenization) in any batch. We carried out the experiments with 1 NVIDIA Titan Xp GPU, which has around 12 GB of memory. All the other methods were implemented on CPU.

Figure A1: Depiction of the dimension reduction method. Transformer embeddings of domain data (N_pt users' embeddings⁹) are used to pre-train a dimension reduction model that transforms the embeddings down to k dimensions. This step is followed by applying the learned reduction model to the task's train and test data embeddings. The reduced train features (N_max users) are then bootstrap sampled to produce 10 sets of N_ta users each for training task-specific models. All 10 task-specific models are evaluated on the reduced test features, consisting of N_te users, during task evaluation. The mean and standard deviation of the task-specific metric are collected.

B Model Details
NLAE architecture. The model architecture for the non-linear auto-encoders in Table 4 was a twin network that takes inputs of 768 dimensions, reduces them to 128 dimensions through 2 layers, and reconstructs the original 768-dimensional representation with 2 layers. This architecture was chosen to balance enabling non-linear associations against keeping total parameters low, given the low sample size context. The formal definition of the model is: f(a) = max(a, 0); ∀a ∈ R
NLAE Training. The data for domain pre-training of dimension reduction was split into 2 sets for NLAE alone: training and validation sets. 90% of the domain data was randomly sampled for training the NLAE, and the remaining 10% of the pre-training data was used to validate hyperparameters after every epoch. This model was trained with the objective of minimizing the reconstruction mean squared loss over multiple epochs. It was trained until the validation loss increased over 3 consecutive epochs. AdamW was the optimizer used, with the learning rate set to 0.001. This took around 30-40 epochs depending upon the dataset.
Fine-tuning. In our fine-tuning configuration we freeze all but the top 2 layers of the best LM, to prevent overfitting and vanishing gradients at the lower layers (Sun et al., 2019; Mosbach et al., 2020). We also apply early stopping, varying the patience between 3 and 6 depending on the task.
Other hyperparameters for this experiment include L2 regularization (in the form of weight decay on the AdamW optimizer, set to 1), dropout set to 0.3, batch size set to 10, and learning rate initialized to 5e-5. The maximum number of epochs was set to 15, although early stopping ended training after 5-10 epochs, depending on the task and the early stopping patience. We arrived at these hyperparameter values after an extensive search: weight decay was searched in [0.01, 100], dropout in [0.1, 0.5], and learning rate in [5e-5, 5e-4].
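The layer-freezing rule and the reported hyperparameters can be illustrated with a small sketch. Everything here is an assumption for illustration rather than the authors' code: parameter names are assumed to follow the Hugging Face convention (`...encoder.layer.<i>...`), the model is assumed to have 12 layers, the task head is assumed to live under a `classifier` prefix and to remain trainable, and `is_trainable` and `FINE_TUNE_CONFIG` are hypothetical names.

```python
import re

# Values as reported in the text above.
FINE_TUNE_CONFIG = {
    "weight_decay": 1.0,
    "dropout": 0.3,
    "batch_size": 10,
    "learning_rate": 5e-5,
    "max_epochs": 15,
}

LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers=12, top_k=2):
    """Decide whether a parameter is updated during fine-tuning.

    Only the top `top_k` transformer layers are trained. Parameters
    outside the transformer layers are frozen, except the task head,
    which is assumed to sit under the "classifier" prefix.
    """
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        return param_name.startswith("classifier")
    return int(match.group(1)) >= num_layers - top_k
```

With a real model, one would loop over its named parameters and set each parameter's gradient flag according to `is_trainable(name)`.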

C Data
Due to human subjects privacy constraints, most of the data cannot be publicly distributed, but they are available from the original data owners on request for research purposes (e.g. the CLPsych-2018 and CLPsych-2019 shared tasks).

D.1 Results on higher N ta
We can see from Figure A2 that reduction still helps in the majority of tasks at higher N ta. As expected, performance starts to plateau at higher N ta values, and this is visibly consistent across most tasks. With the exception of age and gender prediction using Facebook data, all the other tasks benefit from reduction. Figure A3 compares the performance of reduced dimensions in the low sample size scenario (N ta ≤ 1000) in classification tasks. Except for a few N ta values in gender prediction using the Facebook data, all the other tasks benefit from reduction in achieving the best performance.

D.3 LM comparison for no reduction & smaller models
Table A1 compares the performance of the language models without applying any dimension reduction to the embeddings. Table A2 compares the best transformer models with smaller models (and a distilled version) after reducing the second-to-last layer representation to 128 dimensions.

D.4 Least dimensions required: Higher N ta
The 'fkp' plateaus as the number of training samples grows, as seen in Table A3.

E Additional Analysis
E.1 Pre-trained vs Fine-Tuned models
We also find that the fine-tuned model does not perform better than the pre-trained model for users with typical message lengths, but is better at handling longer sequences after being trained on the tasks' data. This is evident from the graphs in Figure A4.

Table A2: Comparison of the best performing auto-encoder models with smaller LMs (like ALBERT (Lan et al., 2019) and DistilRoBERTa) after reduction to 128 dimensions. These results suggest that reducing the larger counterparts produces better results than reducing these smaller LMs' representations.

process), COGPROC (cognitive process) negatively correlates to the difference in absolute error of PCA and NMF. These variables also happen to have higher correlation with the openness scores (Schwartz et al., 2013). We also see that characteristics typical of an open person, like interest in arts, music, and writing (Kern et al., 2014), appear in the word clouds. The divergence of the absolute errors in NMF and PCA is seen in the bsag and ope tasks as well. From the graphs in Figure A6 we can see that the sequence length at which we see this behavior is close to the previously observed value in the age and ext tasks.

E.3 PCA vs Pre-trained.
PCA models overall perform better than the pre-trained model in the low sample regime, and from Figure A7 we can see that PCA captures slang, affect, and standard social media abbreviations better than the pre-trained models. The task-specific linear layer is better able to capture social media language with fewer dimensions (reduced by PCA) than from the original 768 features produced by the pre-trained models.

Figure A2: Performance recorded for reduced dimensions for all tasks at higher N ta values (≥ 1000). Reduction continues to help in performing the best in personality and mental-health tasks. The 'fkp' is observed to be shifting to a higher value, due to the rise in performance of no reduction and the reduction of standard error. Legend: dimension reduced (mean ± std err); all dimensions (mean).

Figure A3: Comparison of performance in gen, gen2 and sui tasks for varying N ta between 50 and 1000. Legend: dimension reduced (mean ± std err); all dimensions (mean).

Figure A4: The absolute error in age prediction for the fine-tuned model is higher than for pre-trained models for users with short messages. Fine-tuned models have smaller errors for users with longer messages.

Figure A5: The word cloud of the LIWC variables (left) and the 1-grams (right) having negative correlation with the difference in the absolute error of PCA and NMF in Openness prediction. Benjamini-Hochberg FDR, p < .05. We can see that LIWC variables and 1-grams more indicative of a person exhibiting more openness are better captured by the PCA model than by NMF.

Figure A6: Comparison of the absolute error of NMF and PCA with the average number of 1-grams per message. We see that the absolute error of NMF models starts diverging at longer text sequences for the bsag and the ope tasks as well.

Figure A7: Terms having negative (left) and positive (right) correlations with the difference in the absolute error of the PCA and pre-trained model in age prediction. Benjamini-Hochberg FDR, p < .05. The error in the PCA model is lower than in pre-trained models when messages contain more slang, affect words and social media abbreviations.
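The captions for Figures A5 and A7 report correlations that pass Benjamini-Hochberg FDR correction at p < .05. For readers unfamiliar with it, a minimal sketch of the standard step-up procedure follows; the function name is hypothetical and this is not taken from the authors' analysis code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard step-up rule: sort the p-values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject every hypothesis whose
    rank is at most k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected
```

In this setting, each p-value would correspond to the correlation of one LIWC variable or 1-gram with the difference in absolute error, and only the terms flagged `True` would appear in the word clouds.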