Psycholinguistic Tripartite Graph Network for Personality Detection

Most of the recent work on personality detection from online posts adopts multifarious deep neural networks to represent the posts and builds predictive models in a data-driven manner, without exploiting psycholinguistic knowledge that may unveil the connections between one's language use and psychological traits. In this paper, we propose a psycholinguistic knowledge-based tripartite graph network, TrigNet, which consists of a tripartite graph network and a BERT-based graph initializer. The graph network injects structural psycholinguistic knowledge from LIWC, a computerized instrument for psycholinguistic analysis, by constructing a heterogeneous tripartite graph. The initializer is employed to provide initial embeddings for the graph nodes. To reduce the computational cost in graph learning, we further propose a novel flow graph attention network (GAT) that only transmits messages between neighboring parties in the tripartite graph. Benefiting from the tripartite graph, TrigNet can aggregate post information from a psychological perspective, which is a novel way of exploiting domain knowledge. Extensive experiments on two datasets show that TrigNet outperforms the existing state-of-the-art model by 3.47 and 2.10 points in average F1. Moreover, the flow GAT reduces the FLOPS and Memory measures by 38% and 32%, respectively, in comparison to the original GAT in our setting.


Introduction
Personality detection from online posts aims to identify one's personality traits from the online texts they create. This emerging task has attracted great interest from researchers in computational psycholinguistics and natural language processing due to its extensive application scenarios, such as personalized recommendation systems (Yang and Huang, 2019; Jeong et al., 2020), job screening (Hiemstra et al., 2019), and psychological studies (Goreis and Voracek, 2019). Psychological research shows that the words people use in daily life reflect their cognition, emotion, and personality (Gottschalk, 1997; Golbeck, 2016). As a major psycholinguistic instrument, Linguistic Inquiry and Word Count (LIWC) (Tausczik and Pennebaker, 2010) divides words into psychologically relevant categories (e.g., Function, Affect, and Social, as shown in Figure 1) and is commonly used to extract psycholinguistic features in conventional methods (Golbeck et al., 2011; Sumner et al., 2012). Nevertheless, most recent works (Hernandez and Knight, 2017; Jiang et al., 2020; Keh et al., 2019; Lynn et al., 2020; Gjurković et al., 2020) tend to adopt deep neural networks (DNNs) to represent the posts and build predictive models in a data-driven manner. They first encode each post separately and then aggregate the post representations into a user representation. Although numerous improvements have been made over the traditional methods, these approaches are likely to suffer from the following limitations. First, the input of this task is usually a set of topic-agnostic posts, some of which may contain few personality cues. Hence, directly aggregating these posts based on their contextual representations may inevitably introduce noise. Second, personality detection is a typical data-hungry task, since it is non-trivial to obtain personality tags, while DNNs implicitly extract personality cues from the texts and call for tremendous amounts of training data. Naturally, it is desirable to explicitly introduce psycholinguistic knowledge into the models to capture critical personality cues.

Figure 1: An example of our tripartite graph. The contents of Post-1 and Post-2 are "A lot of good advise for me." and "Love it! Thanks for sharing!", respectively.
Motivated by the above discussions, we propose a psycholinguistic knowledge-based tripartite graph network, namely TrigNet, which consists of a tripartite graph network to model the psycholinguistic knowledge and a graph initializer using a pre-trained language model such as BERT (Devlin et al., 2019) to generate the initial representations for all the nodes. As illustrated in Figure 1, a specific tripartite graph is constructed for each user, where three heterogeneous types of nodes, namely post, word, and category, are used to represent the posts of a user, the words contained in both the user's posts and the LIWC dictionary, and the psychologically relevant categories of the words, respectively. The edges are determined by the subordination between word and post nodes as well as between word and category nodes. Besides, considering that there are no direct edges between homogeneous nodes (e.g., between post nodes) in the tripartite graph, a novel flow GAT is proposed that only transmits messages between neighboring parties, reducing the computational cost and allowing for more effective interaction between nodes. Finally, we regard the averaged post node representation as the final user representation for personality classification. Benefiting from the tripartite graph structure, the interaction between posts is based on psychologically relevant words and categories rather than topic-agnostic context.
We conduct extensive experiments on the Kaggle and Pandora datasets to evaluate our TrigNet model. Experimental results show that it achieves consistent improvements over several strong baselines. Compared to the state-of-the-art model, SN+Attn (Lynn et al., 2020), TrigNet brings a remarkable boost of 3.47 in averaged Macro-F1 (%) on Kaggle and a boost of 2.10 on Pandora. Besides, thorough ablation studies and analyses are conducted, demonstrating that the tripartite graph and the flow GAT play an essential role in improving performance and reducing computational cost.
Our contributions are summarized as follows: • This is the first effort to use a tripartite graph to explicitly introduce psycholinguistic knowledge for personality detection, providing a new perspective of using domain knowledge.
• We propose a novel tripartite graph network, TrigNet, with a flow GAT to reduce the computational cost in graph learning.
• We demonstrate through extensive experiments and analyses that TrigNet outperforms strong baselines and that both the tripartite graph and the flow GAT are effective.
Related Work

Personality Detection
As an emerging research problem, text-based personality detection has attracted the attention of both NLP and psychological researchers (Cui and Qi, 2017; Xue et al., 2018; Keh et al., 2019; Jiang et al., 2020; Tadesse et al., 2018; Lynn et al., 2020). Traditional studies on this problem generally resort to feature-engineering methods, which first extract various psychological categories via LIWC (Tausczik and Pennebaker, 2010) or statistical features via the bag-of-words model (Zhang et al., 2010). These features are then fed into a classifier such as SVM (Cui and Qi, 2017) or XGBoost (Tadesse et al., 2018) to predict the personality traits. Although interpretable features can be expected, feature engineering is limited in that it relies heavily on manually designed features.
With the advances of deep neural networks (DNNs), great success has been achieved in personality detection. Tandera et al. (2017) apply an LSTM (Hochreiter and Schmidhuber, 1997) to each post to predict the personality traits. Xue et al. (2018) develop a hierarchical DNN, which relies on an AttRCNN and a variant of Inception (Szegedy et al., 2017) to learn deep semantic features from the posts. Lynn et al. (2020) first encode each post by a GRU (Cho et al., 2014) with attention and then pass the post representations to another GRU to produce the whole contextual representations. Recently, pre-trained language models have been applied to this task. Jiang et al. (2020) simply concatenate all the utterances from a single user into a document and encode it with BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Gjurković et al. (2020) first encode each post by BERT and then use a CNN (LeCun et al., 1998) to aggregate the post representations. Most of these models focus on obtaining more effective contextual representations, with only a few exceptions that try to introduce psycholinguistic features into DNNs, such as Majumder et al. (2017) and Xue et al. (2018). However, these approaches simply concatenate psycholinguistic features with contextual representations, ignoring the gap between the two feature spaces.

Graph Neural Networks
Graph neural networks (GNNs) can effectively deal with tasks with rich relational structures and learn a feature representation for each node in the graph according to the structural information. Recently, GNNs have attracted wide attention in NLP (Cao et al., 2019; Yao et al., 2019; Wang et al., 2020b,a). Among these studies, graph construction lies at the heart, as it directly impacts the final performance. Cao et al. (2019) build a graph for question answering, where the nodes are entities and the edges are determined by whether two nodes are in the same document. Yao et al. (2019) construct a heterogeneous graph for text classification, where the nodes are documents and words, and the edges depend on word co-occurrences and document-word relations. Wang et al. (2020b) define a dependency-based graph by utilizing dependency parsing, in which the nodes are words and the edges rely on the relations in the dependency parse tree. Wang et al. (2020a) present a heterogeneous graph for extractive document summarization, where the nodes are words and sentences, and the edges depend on sentence-word relations. Inspired by the above successes, we construct a tripartite graph, which exploits psycholinguistic knowledge instead of simple document-word or sentence-word relations and is expected to contribute towards psychologically relevant node representations.

Our Approach
Personality detection can be formulated as a multi-document multi-label classification task (Lynn et al., 2020; Gjurković et al., 2020). Formally, each user has a set $P = \{p_1, p_2, \ldots, p_r\}$ of posts. Let $p_i = [w_{i,1}, w_{i,2}, \ldots, w_{i,s}]$ be the $i$-th post with $s$ words, where $p_i$ can be viewed as a document. The goal of this task is to predict $T$ personality traits $Y = \{y^t\}_{t=1}^{T}$ for this user based on $P$, where $y^t \in \{0, 1\}$ is a binary variable. Figure 2 presents the overall architecture of the proposed TrigNet, which consists of a tripartite graph network and a BERT-based graph initializer. The former module aims to explicitly infuse psycholinguistic knowledge to uncover personality cues contained in the posts, and the latter to encode each post and provide initial embeddings for the tripartite graph nodes. In the following subsections, we detail how the two modules work in four steps: graph construction, graph initialization, graph learning, and merge & classification.
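The formulation above can be illustrated with a minimal data sketch (the names are ours, not from the paper's code): each sample pairs a user's post set P with T = 4 binary trait labels.

```python
# Toy illustration of the task formulation: a user is a set of posts
# plus T binary personality labels (T = 4 for MBTI). Hypothetical
# helper, for illustration only.
from typing import List, Dict

def make_user(posts: List[str], labels: List[int]) -> Dict:
    """Bundle a user's posts P = {p_1, ..., p_r} with T binary traits y^t."""
    assert all(y in (0, 1) for y in labels), "each trait is a binary decision"
    return {"posts": posts, "labels": labels}

user = make_user(
    ["A lot of good advise for me.", "Love it! Thanks for sharing!"],
    [1, 0, 0, 1],  # e.g. I/E, S/N, T/F, P/J encoded as 0/1
)
```

The model then makes T independent binary predictions per user rather than one multi-class prediction.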

Graph Construction
As a major psycholinguistic analysis instrument, LIWC (Tausczik and Pennebaker, 2010) divides words into psychologically relevant categories and is adopted in this paper to construct a heterogeneous tripartite graph for each user.
As shown in the right part of Figure 2, the constructed tripartite graph $G = (V, E)$ contains three heterogeneous types of nodes, namely post, word, and category, where $V$ denotes the set of nodes and $E$ represents the edges between nodes. Specifically, we define $V = V_p \cup V_w \cup V_c$, where $V_p = \{p_1, p_2, \cdots, p_r\}$ denotes the $r$ posts of a user, $V_w = \{w_1, w_2, \cdots, w_m\}$ denotes the $m$ unique psycholinguistic words that appear in both the posts $P$ and the LIWC dictionary, and $V_c = \{c_1, c_2, \cdots, c_n\}$ represents the $n$ psychologically relevant categories selected from LIWC. The undirected edge $e_{ij}$ between nodes $i$ and $j$ indicates that word $i$ either belongs to post $j$ or to category $j$.
The interaction between posts in the tripartite graph is implemented by two flows: (1) "p↔w↔p", which means posts interact via their shared psycholinguistic words (e.g., "$p_1$↔$w_1$↔$p_2$" as shown by the red lines in Figure 2); (2) "p↔w↔c↔w↔p", which means posts interact via words that share the same category (e.g., "$p_1$↔$w_2$↔$c_2$↔$w_3$↔$p_2$" as shown by the green lines in Figure 2). Hence, the interaction between posts is based on psychologically relevant words or categories rather than topic-agnostic context.
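As a rough illustration of this construction, the following sketch builds the node sets and edge list from a toy stand-in for the LIWC dictionary (the real dictionary and its word matching are far richer; `LIWC_DICT` and all names here are hypothetical):

```python
# Sketch of tripartite graph construction. Edges connect a word node to
# every post containing it (p-w) and to every LIWC category it belongs
# to (w-c); there are no p-p, w-w, or c-c edges.
LIWC_DICT = {  # toy stand-in for the real LIWC dictionary
    "advice": ["Social"],
    "love": ["Affect"],
    "thanks": ["Affect", "Social"],
}

def build_tripartite_graph(posts):
    """Return (V_w, V_c, E) for the given posts V_p = {p0, p1, ...}."""
    words, cats, edges = [], [], set()
    for i, post in enumerate(posts):
        tokens = post.lower().replace("!", " ").replace(".", " ").split()
        for tok in tokens:
            if tok in LIWC_DICT:
                if tok not in words:
                    words.append(tok)
                edges.add((f"p{i}", tok))        # post-word edge
                for c in LIWC_DICT[tok]:
                    if c not in cats:
                        cats.append(c)
                    edges.add((tok, c))          # word-category edge
    return words, cats, edges

words, cats, edges = build_tripartite_graph(
    ["A lot of good advice for me.", "Love it! Thanks for sharing!"])
```

Here the two posts interact both through a shared category path (p0↔advice↔Social↔thanks↔p1) even though they share no word, mirroring the "p↔w↔c↔w↔p" flow.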

Graph Initialization
As shown in the left part of Figure 2, we employ BERT (Devlin et al., 2019) to obtain the initial embeddings of all the nodes. BERT is built upon the multi-layer Transformer encoder (Vaswani et al., 2017), which consists of a word embedding layer and 12 Transformer layers ("BERT-BASE-UNCASED" is used in this study).

Post Node Embedding The representations at the 12-th layer of BERT are usually used to represent an input sequence. This may not be appropriate for our task, as personality is only weakly related to the higher-order semantic features of posts, making it risky to rely solely on the final layer representations. In our experiments (Section 5.4), we find that the representations at the 11-th and 10-th layers are also useful for this task. Therefore, we utilize the representations at the last three layers to initialize the post node embeddings. Formally, the representation $x_{p_i}^j$ of the $i$-th post at the $j$-th layer can be obtained by:

$$x_{p_i}^j = \mathrm{BERT}_j([\mathrm{CLS};\, p_i;\, \mathrm{SEP}]), \tag{1}$$

where "CLS" and "SEP" are special tokens to denote the start and end of an input sequence, respectively, and $\mathrm{BERT}_j(\cdot)$ returns the representation of the special token "CLS" at the $j$-th layer. In this way, we obtain the representations $\{x_{p_i}^{10}, x_{p_i}^{11}, x_{p_i}^{12}\}$, where each $x_{p_i}^j \in \mathbb{R}^d$ and $d$ is the dimension of each representation. We then apply layer attention (Peters et al., 2018) to collapse the three representations into a single vector $x_{p_i}$:

$$x_{p_i} = \sum_{j=10}^{12} \alpha_j x_{p_i}^j, \tag{2}$$

where $\alpha_j$ are softmax-normalized layer-specific weights to be learned. Consequently, we obtain a set of post representations $X_p = [x_{p_1}, x_{p_2}, \cdots, x_{p_r}]^{\mathrm{T}} \in \mathbb{R}^{r \times d}$ for the given $r$ posts of a user.

Word Node Embedding BERT applies WordPiece (Wu et al., 2016) to split words, which also cuts out-of-vocabulary words into small pieces.
Thus, we obtain the initial node embedding of each word in $V_w$ by considering two cases: (1) if the word is in the vocabulary, we directly look up the BERT embedding layer to obtain its embedding; (2) if the word is out of vocabulary, we use the averaged embedding of its pieces as its initial node embedding. The initial word node embeddings are represented as $X_w = [x_{w_1}, x_{w_2}, \cdots, x_{w_m}]^{\mathrm{T}} \in \mathbb{R}^{m \times d}$.

Category Node Embedding The LIWC dictionary divides words into 9 main categories and 64 subcategories. Empirically, subcategories such as Pronouns, Articles, and Prepositions are not task-related. Besides, our initial experiments show that excessive introduction of subcategories makes the tripartite graph sparse and the learning difficult, resulting in performance deterioration. For these reasons, we select all 9 main categories and the 6 personal-concern subcategories for our study. Specifically, the 9 main categories Function, Affect, Social, Cognitive Processes, Perceptual Processes, Biological Processes, Drives, Relativity, and Informal Language and the 6 personal-concern subcategories Work, Leisure, Home, Money, Religion, and Death are used as our category nodes. Then, we replace the "UNUSED" tokens in BERT's vocabulary with the 15 category names and look up the BERT embedding layer to generate their embeddings $X_c = [x_{c_1}, x_{c_2}, \cdots, x_{c_n}]^{\mathrm{T}} \in \mathbb{R}^{n \times d}$.
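The layer-attention step used for post nodes can be sketched numerically as follows; the three vectors stand in for the 10th-12th layer "CLS" representations, and the scores play the role of the learnable per-layer weights before softmax normalization (toy values, not real BERT outputs):

```python
# Minimal sketch of layer attention: softmax-normalized weights alpha_j
# collapse the last three layer vectors into one post node embedding.
import math

def layer_attention(layer_vecs, scores):
    """x = sum_j alpha_j * x^j with alpha = softmax(scores)."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(layer_vecs[0])
    return [sum(a * v[d] for a, v in zip(alphas, layer_vecs))
            for d in range(dim)]

# stand-ins for the layer-10, -11, -12 "CLS" vectors of one post
x10, x11, x12 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
x_post = layer_attention([x10, x11, x12], scores=[0.0, 0.0, 0.0])
# equal scores -> a simple average of the three layer vectors
```

With learned (unequal) scores, the model can emphasize whichever layers carry more personality-relevant information.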

Graph Learning
Graph attention network (GAT) (Veličković et al., 2018) can be applied over a graph to calculate the attention weight of each edge and update the node representations. However, unlike the traditional graph in which any two nodes may have edges, the connections in our tripartite graph only occur between neighboring parties (i.e., $V_w \leftrightarrow V_p$ and $V_w \leftrightarrow V_c$), as shown in Figure 3. Therefore, applying the original GAT over our tripartite graph would lead to unnecessary computational costs. Inspired by Wang et al. (2020a), we propose a flow GAT for the tripartite graph. Particularly, considering that the interaction between posts in our tripartite graph can be accounted for by the two flows "p↔w↔p" and "p↔w↔c↔w↔p", we design a message passing mechanism that only transmits messages along the two flows. Formally, given a constructed tripartite graph $G = (V, E)$, where $V = V_p \cup V_w \cup V_c$, and the initial node embeddings $X = X_p \cup X_w \cup X_c$, we compute the hidden states at the $l$-th layer by the flow "p↔w↔p":

$$H_w^{(l)} \leftarrow \mathrm{MP}(H_w^{(l-1)}, H_p^{(l-1)}), \quad \tilde{H}_p^{(l)} \leftarrow \mathrm{MP}(H_p^{(l-1)}, H_w^{(l)}), \tag{4}$$

and by the flow "p↔w↔c↔w↔p":

$$H_w^{(l)} \leftarrow \mathrm{MP}(H_w^{(l-1)}, H_p^{(l-1)}), \quad H_c^{(l)} \leftarrow \mathrm{MP}(H_c^{(l-1)}, H_w^{(l)}), \quad H_w^{(l)} \leftarrow \mathrm{MP}(H_w^{(l)}, H_c^{(l)}), \quad \hat{H}_p^{(l)} \leftarrow \mathrm{MP}(H_p^{(l-1)}, H_w^{(l)}), \tag{5}$$

with the post states of the two flows combined as $H_p^{(l)} = \mathrm{mean}(\tilde{H}_p^{(l)}, \hat{H}_p^{(l)})$, where $\leftarrow$ means the message is transmitted from the right nodes to the left nodes, $\mathrm{mean}(\cdot)$ is the mean pooling function, and $\mathrm{MP}(\cdot)$ represents the message passing function, which can be decomposed into three steps. First, it calculates the attention weight $\beta_{ij}^k$ between node $i$ in $V_w$ and its neighbor node $j$ in $V_p$ at the $k$-th head:

$$\beta_{ij}^k = \frac{\exp\left(\sigma\left(W_z^k \left[W_w^k h_{w_i} \,\|\, W_p^k h_{p_j}\right]\right)\right)}{\sum_{j' \in N_i} \exp\left(\sigma\left(W_z^k \left[W_w^k h_{w_i} \,\|\, W_p^k h_{p_{j'}}\right]\right)\right)},$$

where $\sigma$ is the LeakyReLU activation function, $W_z^k$, $W_w^k$ and $W_p^k$ are learnable weights, $N_i$ denotes the neighbor nodes of node $i$ in $V_p$, and $\|$ is the concatenation operation. Second, the updated hidden state $\tilde{h}_{w_i}^{(l)}$ is obtained by a weighted combination of its neighbor nodes in $V_p$:

$$\tilde{h}_{w_i}^{(l)} = \Big\Vert_{k=1}^{K} \sum_{j \in N_i} \beta_{ij}^k W_v^k h_{p_j},$$

where $K$ is the number of heads and $W_v^k$ is a learnable weight matrix. Third, noting that the above steps do not take the information of node $i$ itself into account, and to avoid gradient vanishing, we introduce a residual connection to produce the final updated node representation:

$$h_{w_i}^{(l)} = \tilde{h}_{w_i}^{(l)} + h_{w_i}^{(l-1)}.$$
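A dependency-free sketch of a single MP(·) step may help: one word node attends over its neighboring post nodes, takes the weighted sum, and adds a residual. For brevity this toy version uses a single head with identity projections and dot-product scores instead of the paper's LeakyReLU-scored concatenation, so it illustrates the aggregation pattern rather than the exact parameterization:

```python
# Toy single-head message-passing step: attention over neighbors,
# weighted combination, then a residual connection.
import math

def mp_step(h_word, neighbor_posts):
    """Update one word node state from its post-node neighbors."""
    # attention logits: similarity between the word node and each neighbor
    logits = [sum(a * b for a, b in zip(h_word, h_p)) for h_p in neighbor_posts]
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    betas = [e / z for e in exps]          # softmax over neighbors
    # weighted combination of neighbor states
    agg = [sum(b * h_p[d] for b, h_p in zip(betas, neighbor_posts))
           for d in range(len(h_word))]
    # residual connection keeps the node's own information
    return [a + h for a, h in zip(agg, h_word)]

h_w = mp_step([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because edges only run between neighboring parties, each MP step touches far fewer node pairs than a full GAT over all node combinations, which is where the FLOPS savings come from.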

Merge & Classification
After $L$ layers of iteration, we obtain the final node representations $H^{(L)} = H_p^{(L)} \cup H_w^{(L)} \cup H_c^{(L)}$. Then, we merge all post node representations $H_p^{(L)}$ via mean pooling to produce the user representation:

$$u = \mathrm{mean}(H_p^{(L)}).$$

Finally, we employ $T$ softmax-normalized linear transformations to predict the $T$ personality traits. For the $t$-th personality trait, we compute:

$$p(y^t) = \mathrm{softmax}(W_u^t u + b_u^t),$$

where $W_u^t$ is a trainable weight matrix and $b_u^t$ is a bias term. The objective function of our TrigNet model is defined as:

$$\mathcal{L}(\theta) = -\sum_{v=1}^{V} \sum_{t=1}^{T} \log p(y_v^t \mid \theta),$$

where $V$ is the number of training samples, $T$ is the number of personality traits, $y_v^t$ is the true label for the $t$-th trait, and $p(y_v^t \mid \theta)$ is the predicted probability of this label under parameters $\theta$.
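The merge-and-classification step can be sketched with toy numbers (our naming; `trait_weights` plays the role of the per-trait transformation, with the bias omitted for brevity):

```python
# Sketch of merge & classification: mean-pool post node states into a
# user vector u, apply one 2-way softmax per trait, and sum the negative
# log-probabilities of the true labels as the loss.
import math

def predict_traits(post_states, trait_weights):
    dim = len(post_states[0])
    # mean pooling over post nodes -> user representation u
    u = [sum(h[d] for h in post_states) / len(post_states) for d in range(dim)]
    probs = []
    for W in trait_weights:                  # one (2 x dim) matrix per trait
        logits = [sum(w[d] * u[d] for d in range(dim)) for w in W]
        exps = [math.exp(l) for l in logits]
        z = sum(exps)
        probs.append([e / z for e in exps])  # softmax over {0, 1}
    return u, probs

def nll_loss(probs, labels):
    """Negative log-likelihood of the true label of each trait."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))

u, probs = predict_traits([[1.0, 0.0], [0.0, 1.0]],
                          [[[1.0, 0.0], [0.0, 1.0]]])  # T = 1 trait here
loss = nll_loss(probs, [0])
```

Each trait gets its own classifier head, so the four MBTI decisions are trained jointly but predicted independently.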

Experiments
In this section, we introduce the datasets, baselines, and settings of our experiments.

Datasets
We choose two public MBTI datasets for evaluation, which have been widely used in recent studies (Tadesse et al., 2018; Hernandez and Knight, 2017; Majumder et al., 2017; Jiang et al., 2020; Gjurković et al., 2020). The Kaggle dataset is collected from PersonalityCafe, where people share their personality types and discussions about health, behavior, care, etc. There are a total of 8675 users in this dataset, and each user has 45-50 posts. Pandora is another dataset, collected from Reddit, where personality labels are extracted from the short descriptions that users with MBTI results write to introduce themselves. Each of the 9067 users in this dataset has dozens to hundreds of posts.
The traits of MBTI include Introversion vs. Extroversion (I/E), Sensing vs. iNtuition (S/N), Thinking vs. Feeling (T/F), and Perceiving vs. Judging (P/J). Following previous works (Majumder et al., 2017; Jiang et al., 2020), we delete words that match any personality label to avoid information leaks. The Macro-F1 metric is adopted to evaluate the performance on each personality trait, since both datasets are highly imbalanced, and average Macro-F1 is used to measure the overall performance. We shuffle the datasets and split them in a 60-20-20 proportion for training, validation, and testing, respectively. According to our statistics, there are on average 20.45 and 28.01 LIWC words in each post in the two datasets, respectively, and very few posts (0.021/0.002 posts per user) appear as disconnected nodes in the graph. We show the statistics of the two datasets in Table 1.
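For concreteness, the evaluation metric can be re-implemented in a few lines: Macro-F1 for a single trait averages the F1 of its two classes, and the reported average Macro-F1 further averages over the four traits (a plain illustration, not the authors' evaluation script; scikit-learn's `f1_score(..., average="macro")` computes the same quantity):

```python
# Macro-F1 for one binary trait: average the per-class F1 of classes 0 and 1.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred):
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / 2  # unweighted average over the two classes

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

Averaging the two class-wise F1 scores (rather than accuracy) keeps the minority class visible, which matters given the class imbalance of both datasets.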

Baselines
The following mainstream models are adopted as baselines to evaluate our model:

SVM (Cui and Qi, 2017) and XGBoost (Tadesse et al., 2018): A support vector machine (SVM) or XGBoost is utilized as the classifier, with features extracted by TF-IDF and LIWC from all posts.

BiLSTM (Tandera et al., 2017): A bi-directional LSTM (Hochreiter and Schmidhuber, 1997) is first employed to encode each post, and then the averaged post representation is used as the user representation. GloVe (Pennington et al., 2014) is employed for the word embeddings.

BERT (Keh et al., 2019): The fine-tuned BERT is first used to encode each post, and then mean pooling is performed over the post representations to generate the user representation.

AttRCNN (Xue et al., 2018): This model adopts a hierarchical structure, in which a variant of Inception (Szegedy et al., 2017) is utilized to encode each post and a CNN-based aggregator is employed to obtain the user representation. Besides, it considers psycholinguistic knowledge by concatenating the LIWC features with the user representation.

SN+Attn (Lynn et al., 2020): As the latest model, SN+Attn employs a hierarchical attention network, in which a GRU (Cho et al., 2014) with word-level attention is used to encode each post and another GRU with post-level attention is used to generate the user representation.
To make a fair comparison between the baselines and our model, we replace the post encoders in AttRCNN and SN+Attn with the pre-trained BERT.

Training Details
We implement our TrigNet in PyTorch (https://pytorch.org/) and train it on four NVIDIA RTX 2080Ti GPUs. Adam (Kingma and Ba, 2014) is utilized as the optimizer, with the learning rate set to 2e-5 for BERT and 1e-3 for the other components. We set the maximum number of posts, r, to 50 and the maximum length of each post, s, to 70, considering the limit of available computational resources. After tuning on the validation set, we set the dropout rate to 0.2 and the mini-batch size to 32. The maximum number of nodes, r + m + n, is set to 500 for Kaggle and 970 for Pandora, which covers 98.95% and 97.07% of the samples, respectively. Moreover, two hyperparameters, the number of flow GAT layers L and the number of heads K, are searched in {1, 2, 3} and {1, 2, 4, 6, 8, 12, 16, 24}, respectively, and the best choices are L = 1 and K = 12. The reasons for L = 1 are likely twofold. First, our flow GAT can already realize the interactions between nodes when L = 1, whereas the vanilla GAT needs to stack 4 layers. Second, after trying L = 2 and L = 3, we find that they lead to slight performance drops compared to L = 1.
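The two-learning-rate setup can be expressed as optimizer parameter groups. The sketch below only partitions parameter names by a hypothetical "bert." prefix; in real code the matching parameter tensors, not names, would be passed to torch.optim.Adam:

```python
# Sketch of the two-learning-rate Adam setup: BERT parameters at 2e-5,
# all other components at 1e-3, as optimizer parameter groups keyed by
# name prefix. The prefix and parameter names are hypothetical.
def make_param_groups(param_names, bert_prefix="bert."):
    bert = [n for n in param_names if n.startswith(bert_prefix)]
    rest = [n for n in param_names if not n.startswith(bert_prefix)]
    return [
        {"params": bert, "lr": 2e-5},  # pre-trained encoder: small steps
        {"params": rest, "lr": 1e-3},  # randomly initialized parts: larger steps
    ]

groups = make_param_groups(
    ["bert.encoder.layer.0.weight", "gat.W_v", "classifier.weight"])
```

Using a much smaller learning rate for the pre-trained encoder is a common fine-tuning practice that protects the BERT weights while the new graph components train quickly.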

Results and Analyses
In this section, we report the overall results and provide thorough analyses and discussions.

Overall Results
The overall results are presented in Table 2, from which our observations are as follows. First, the proposed TrigNet consistently surpasses the other competitors in F1 scores, demonstrating its superiority on text-based personality detection and establishing new state-of-the-art performance. Specifically, compared with the existing state of the art, SN+Attn, TrigNet achieves boosts of 3.47 and 2.10 in average F1 on the Kaggle and Pandora datasets, respectively. Second, compared with BERT, a basic module of TrigNet, TrigNet yields improvements of 4.62 and 2.46 in average F1 on the two datasets, verifying that the tripartite graph network can effectively capture the psychological relations between posts. Third, compared with AttRCNN, another method of leveraging psycholinguistic knowledge, TrigNet obtains increments of 3.61 and 2.38 in average F1 on the two datasets, demonstrating that injecting psycholinguistic knowledge via the tripartite graph is more effective. Besides, the shallow models SVM and XGBoost achieve performance comparable to the non-pre-trained model BiLSTM, further showing that the words people use are important for personality detection.

Ablation Study
We conduct an ablation study of our TrigNet model on the Kaggle dataset by removing each component to investigate its contribution. Table 3 shows the results, which are categorized into two groups.
In the first group, we investigate the contributions of the network components. We can see that removing the flow "p↔w↔c↔w↔p" defined in Eq. (5) results in larger performance declines than removing the flow "p↔w↔p" defined in Eq. (4), implying that the category nodes are helpful for capturing personality cues from the texts. Besides, removing the layer attention mechanism also leads to a performance decline.
In the second group, we investigate the contribution of each category node. The results, sorted by performance decrease in ascending order, demonstrate that the introduction of every category node is beneficial to TrigNet. Among these category nodes, Affect is shown to be the most crucial one for our model, as the average Macro-F1 score drops most significantly after it is removed. This implies that the Affect category strongly reflects one's personality. Similar conclusions are reported by Depue and Collins (1999) and Zhang et al. (2019). In addition, the Function node is the least impactful category node. The reason could be that function words reflect pure linguistic knowledge and are weakly connected to personality.

Analysis of the Computational Cost
In this work we propose a flow GAT to reduce the computational cost of the vanilla GAT. To show its effect, we compare it with the vanilla GAT (as illustrated in the left part of Figure 3). The results are reported in Table 4, from which we can observe that flow GAT reduces the computational cost in FLOPS and Memory by 38% and 32%, respectively, without introducing extra parameters. Besides, flow GAT is superior to vanilla GAT when the number of layers is 1. The cause is that the former can already capture adequate interactions between nodes with one layer, while the latter has to stack four layers to achieve this. We also compare our TrigNet with the vanilla BERT in terms of computational cost. The results show that the flow GAT takes only about 1.14% more FLOPS than the vanilla BERT (297.3G).

Layer Attention Analysis
This study adopts layer attention (Peters et al., 2018), as shown in Eq. (2), to produce initial embeddings for post nodes. To show which layers are more useful, we conduct a simple experiment on the two datasets by using all 12 layer representations of BERT and visualize the attention weight of each layer. As plotted in Figure 4, we find that the attention weights of layers 10 to 12 are significantly greater than those of the remaining layers on both datasets, which explains why the last three layers are chosen for layer attention in our model.

Conclusion
Figure 4: Visualization of layer attention weights. The last three layers supply more information for this task.

In this work, we proposed a novel psycholinguistic knowledge-based tripartite graph network, TrigNet, for personality detection. TrigNet introduces structural psycholinguistic knowledge from LIWC by constructing a tripartite graph, in which interactions between posts are captured through psychologically relevant words and categories rather than simple document-word or sentence-word relations. Besides, a novel flow GAT that only transmits messages between neighboring parties was developed to reduce the computational cost. Extensive experiments and analyses on two datasets demonstrate the effectiveness and efficiency of TrigNet. This work is the first effort to leverage a tripartite graph to explicitly incorporate psycholinguistic knowledge for personality detection, providing a new perspective for exploiting domain knowledge.