ExpertPLM: Pre-training Expert Representation for Expert Finding



Introduction
Community Question Answering (CQA) websites have become popular platforms that help people share knowledge in the form of questions and answers. Large portals such as Stack Exchange¹ have attracted millions of users (Fu et al., 2020), who can raise questions or post answers to questions they are interested in or good at. Due to the large participation, many questions wait too long for answers (Zhao et al., 2017; Yuan et al., 2020). Hence, it is a great challenge to route questions to suitable experts who can provide satisfactory answers (Chang and Pal, 2013; Zhao et al., 2014). Expert finding in CQA websites can effectively route questions and help question raisers receive high-quality answers quickly, and it has attracted considerable attention recently.

¹ https://stackexchange.com. Qiyao Peng and Hongtao Liu contributed equally. Qing Yang is the corresponding author.

Figure 1: Several historical questions answered by an expert (user ID 3353 in the History site of Stack Exchange), e.g., "In Mongolian Conquests, how did they screen the population for engineers?". The blue boxes represent original questions and the red boxes represent the vote scores provided by the CQA community for the answers; the higher the vote score of an answer, the more professional expertise it reflects. The expert's reputation is 37,339, which reflects his overall capability (top 0.36% in the History domain).
Generally speaking, accurately learning expert representations is the central problem in expert finding. Most existing methods infer an expert's interest representation from her/his historical answered questions, then measure the matching score between experts and new questions. For example, PMEF (Peng et al., 2022a) designs a title-body-tag multi-view paradigm to learn representations of questions and experts respectively. Notably, most existing methods focus on modeling expert interests and ignore whether the expert has the ability to answer the question, i.e., the expertise.
Recently, Pretrained Language Models (PLMs), e.g., BERT, pre-train general corpus-level language knowledge and fine-tune on downstream tasks, and have achieved great success in various areas (Qiu et al., 2021). Motivated by this, we ask: "Can we pre-train expert-level representations on CQA domains, and then fine-tune on the downstream expert finding task?" Different from corpus-level pre-training in Natural Language Processing (NLP), which focuses on learning general language knowledge, expert pre-training needs to consider the following two core capabilities: 1) Interest modeling. We can infer expert interests from the historical answered questions. As shown in Figure 1, the expert has answered multiple questions related to "Ancient Greece" and "Alexandria", which reflects his interest in the history of Ancient Greece. However, simply adopting existing PLMs, or further pre-training over the CQA corpus, cannot effectively capture such expert-level interest.
2) Expertise modeling. The expertise of experts plays an important role in expert finding. From Figure 1, we can see that the expert is interested in both Ancient Greece and the Mongols, yet the vote scores obtained for these two types of questions are very different (e.g., +39 and −4), which indicates the expert's different expertise for different questions. However, most existing PLMs fail to model an expert's ability to answer different questions.
Hence, it is necessary to design a more effective pre-training framework for learning comprehensive expert representations.
To bridge these gaps, we propose an Expert-level Pre-training Language Model for expert finding (ExpertPLM), which pre-trains expert representations effectively. We extend the typical corpus-level pre-training paradigm in the following aspects: (1) Expert interest modeling. We reconstruct the model input by aggregating the expert's historical answered questions for pre-training. Compared with the corpus-level input paradigm (i.e., one line, one sentence) of PLMs, our approach employs expert-level input, which learns more comprehensive interest features from histories during pre-training. (2) Expert ability modeling. Unlike expert interest, expert ability is not explicitly reflected in the historical answered questions. Fortunately, the vote score an answer receives indicates CQA users' satisfaction with the answer, which reflects the expert's ability to answer this question (i.e., the higher the vote score, the higher the expertise). Hence, we encode the vote score and integrate it with the corresponding historical answered question's input embedding to indicate the expert's ability for the question.
To further promote expertise learning, we introduce the expert reputation shown in Figure 1, which indicates the expert's overall ability, and design a reputation-augmented Masked Language Model (MLM) pre-training strategy to capture the expert reputation information. In this way, our method pre-trains expert representations including both interest and expertise effectively. In fine-tuning, we use the pre-trained weights to encode the expert and the question, then accomplish the downstream expert finding task.
In summary, the contributions of our work are: • We propose a novel expert-level pre-training language model for the expert finding task in CQA websites, which can effectively pre-train expert representations.
• We unify the historical question titles, vote scores during pre-training and design a reputation-augmented MLM task to empower the model for capturing the interest and expertise of experts.
• Extensive experiments on six real-world datasets show that our method achieves better performance than existing baselines and validate the effectiveness of our approach ExpertPLM.

Related Works
In this section, we briefly review some related works about Pre-training for NLP, Pre-training for RS, and Expert Finding.

Pre-training for NLP
There is a long history of pre-training general language representations. Earlier methods, such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), learned word embeddings by capturing word co-occurrence information, which offered significant improvements in various tasks. However, these methods are incapable of considering contextual information. Recently, a series of pre-training methods based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) and BART (Lewis et al., 2020), have changed the original training paradigm. Through the pre-training and fine-tuning paradigm, they jointly capture general language knowledge from a large text corpus and task-specific knowledge, which improves downstream task performance.

Pre-training for RS
Recently, some recommendation tasks employ pre-training technology to learn item co-occurrence information and improve recommendation performance. For example, BERT4Rec employed the Cloze task to predict masked items using the left and right context based on the Transformer structure, capturing contextual user-interaction representations. The model was then fine-tuned on the pre-trained encoder to accomplish next-item recommendation and obtained better performance. However, different from the general recommendation task, expert finding in CQA is often cold-start and has quite unique characteristics, such as the modeling of expert expertise.

Expert Finding
In CQA websites, expert finding aims to find capable experts to provide satisfactory answers to questions (Yuan et al., 2020; Liu et al.; Peng et al., 2022b). The majority of previous works fall into two categories: traditional methods and deep learning-based methods. Traditional methods mostly employed feature engineering or topic modeling to model the questions and experts and then routed questions to suitable experts. For example, Yang et al. (2013) proposed a topic expertise model to jointly model expert topical interests and expertise for better recommendation. Deep learning-based methods employ neural networks to model experts and measure the matching relevance with target questions (Li et al., 2019; Fu et al., 2020; Ghasemi et al., 2021). For example, TCQR employed a question encoder to learn question words and learned the answerers' representation in the context of both semantic and temporal information for expert representation learning.

Problem definition
In this section, we formulate the problem of expert finding in CQA websites. Suppose there is a target question $q_t$ and a candidate expert set $C_u = \{c^u_1, \cdots, c^u_M\}$, where $M$ is the number of experts. A candidate expert $c^u_i \in C_u$ with reputation $r^u_i$ is associated with a set of her/his historical answered questions $Q^u_i = \{q_1, \cdots, q_n\}$, where $n$ is the number of historical questions, and with the corresponding vote scores $V^u_i = \{v_1, \cdots, v_n\}$. Each question is represented by its title, which consists of a sequence of words. The primary objective of expert finding is to predict the most suitable expert for answering the target question. The expert who provides the "accepted answer" for a question is regarded as the ground truth; note that each question has only one "accepted answer".
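For concreteness, the problem inputs above can be sketched as a small data structure. The class and field names here are our own illustration, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Expert:
    """A candidate expert c^u_i with reputation r^u_i."""
    expert_id: str
    reputation: int
    history_titles: List[str]   # Q^u_i = {q_1, ..., q_n}: answered question titles
    vote_scores: List[int]      # V^u_i = {v_1, ..., v_n}: aligned with the titles

def ground_truth(experts: List[Expert], accepted_answerer_id: str) -> Expert:
    """The expert who provided the single accepted answer is the label."""
    return next(e for e in experts if e.expert_id == accepted_answerer_id)

# Toy usage mirroring Figure 1's expert (user ID 3353, reputation 37,339).
e = Expert("3353", 37339, ["Why was Alexandria so important?"], [39])
assert ground_truth([e], "3353") is e
```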

Proposed Method
In this section, we introduce our method ExpertPLM in detail. The expert pre-training language model is illustrated in Figure 2. In the pre-training stage, we pre-train the model on the concatenated historical answered questions of each expert (i.e., one input line contains all of one expert's historical question titles) from different CQA domains to capture expert interest. To indicate the expert's different abilities to answer different questions, we integrate the vote score embedding with the corresponding question's input embedding. Furthermore, we design a reputation-augmented MLM pre-training task to capture the expert's overall expertise and CQA language knowledge. In the fine-tuning stage, as shown in Figure 3, by conducting the supervised expert finding task on expert and question representations generated with the pre-trained weights, we obtain an improved expert finding model for a specific domain.

Pre-training Expert Representation
The goals of the pre-training stage are: 1) teaching ExpertPLM how to capture expert interest and expertise; 2) learning general CQA-domain language knowledge. Next, we introduce ExpertPLM from three aspects: the Input Layer, the Model Architecture and the Pre-training Task.

Input Embedding As shown in Figure 2, to empower BERT to model expert interest, we simply concatenate the words of the expert's historical answered questions into a whole sequence as one expert-level input. Then, we add the special tokens [CLS] and [SEP] at the beginning and end of the input word sequence respectively. Furthermore, to distinguish different historical answered questions, we add the special token [HSEP] between the histories (e.g., [HSEP] between $q_1$ and $q_2$).

Figure 2: ExpertPLM Pre-training Framework. Expert historical answered questions ($q_1, q_2, \cdots$) and vote scores are aligned as the input for indicating expert interest and expertise. The first token [CLS] is always masked to pre-train the user reputation during pre-training.

Given an expert-level input, considering that the original pre-trained BERT weight (e.g., bert-base-uncased) already carries a great deal of language knowledge, we use it to initialize the token, segment and position embeddings. Hence, the input token representation matrix $E_t$ is constructed by summing the corresponding token, segment and position embeddings:

$$E_t = E_{token} + E_{segment} + E_{position}. \quad (1)$$

Furthermore, the vote scores an expert has received indicate his/her expertise in answering different questions. Generally speaking, the higher the vote score an answer receives, the more satisfied the community is with the answer, and the stronger the answerer's professional expertise for such questions. For example, as shown in Figure 2, the vote score of the answer about the Mongols (i.e., −4) suggests the expert may lack the ability to answer Mongol-related questions, while the vote score of +39 shows much expertise for questions related to Ancient Greece. Hence, we introduce the vote score corresponding to each historical question to measure the expert's abilities in different question fields effectively. We encode the normalized vote scores and integrate them with the input representation $E_t$:

$$E = E_t + E_{vote}, \quad (2)$$

where the vote score embedding is aligned with the corresponding historical question. In other words, for the historical question $q_1$ and its vote score $v_1$, the vote score embedding is mapped to the same dimension as the input embedding of $q_1$ and added at its token positions. In this way, the BERT model can capture the expert's interests and expertise for different questions.
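A minimal sketch of the expert-level input construction and the vote-augmented embedding described above. The whitespace tokenizer and the table sizes are simplifying assumptions of ours; the paper initializes the token, segment and position tables from bert-base-uncased and adds [HSEP] as a special token:

```python
import torch
import torch.nn as nn

VOCAB, DIM, NUM_VOTE_BINS, MAX_LEN = 30524, 768, 12, 512  # hypothetical sizes

def build_expert_input(histories):
    """Concatenate historical titles: [CLS] q1 [HSEP] q2 ... [SEP]."""
    tokens = ["[CLS]"]
    for i, title in enumerate(histories):
        if i > 0:
            tokens.append("[HSEP]")          # separator between histories
        tokens.extend(title.lower().split())  # stand-in for a WordPiece tokenizer
    tokens.append("[SEP]")
    return tokens

class VoteAugmentedEmbedding(nn.Module):
    """E = E_token + E_segment + E_position + E_vote, aligned per token."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.seg = nn.Embedding(2, DIM)
        self.pos = nn.Embedding(MAX_LEN, DIM)
        self.vote = nn.Embedding(NUM_VOTE_BINS, DIM)  # one row per normalized vote bin

    def forward(self, token_ids, seg_ids, vote_ids):
        # vote_ids repeats each question's normalized vote score over its tokens,
        # so the vote embedding is added at every position of that question.
        n = token_ids.size(1)
        pos_ids = torch.arange(n, device=token_ids.device).unsqueeze(0)
        return (self.tok(token_ids) + self.seg(seg_ids)
                + self.pos(pos_ids) + self.vote(vote_ids))
```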
BERT Layer The BERT model architecture consists of multiple bidirectional Transformer encoder layers. Each Transformer encoder layer has two major sub-layers, i.e., multi-head self-attention and a position-wise feed-forward network. Let $E^l_{in}$ denote the input representation of the $(l+1)$-th Transformer encoder layer. We omit the layer subscript $l$ of each parameter for convenience.
Multi-Head Self-Attention. This sub-layer captures contextual representations for each word. The self-attention function is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big)V,$$

where $Q$, $K$ and $V$ represent the query, key and value matrices respectively. The multi-head self-attention layer $\mathrm{MH}(\cdot)$ projects the input into multiple sub-spaces and captures the interaction information:

$$\mathrm{MH}(E) = [\mathrm{head}_1; \cdots; \mathrm{head}_h]W, \quad \mathrm{head}_i = \mathrm{Attention}(EW^q_i, EW^k_i, EW^v_i),$$

where $W^q_i, W^k_i, W^v_i \in \mathbb{R}^{d \times \frac{d}{h}}$ and $W \in \mathbb{R}^{d \times d}$ are parameters. Via multi-head self-attention, the input representation $E$ is transformed into $H \in \mathbb{R}^{n \times d}$, where $n$ is the number of tokens.
Position-wise Feed-Forward. For the input $H$, the calculation is defined as:

$$\mathrm{FFN}(H) = \sigma(HW_{f_1} + b_{f_1})W_{f_2} + b_{f_2},$$

where $W_{f_1}, W_{f_2}$ and $b_{f_1}, b_{f_2}$ are learnable parameters and $\sigma$ is the activation function (GELU in BERT). Furthermore, a residual connection is applied around each of the two sub-layers, followed by layer normalization.
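The two sub-layers can be sketched in PyTorch as follows. This is a generic Transformer encoder layer under stated assumptions (GELU activation, post-layer-norm, `nn.MultiheadAttention` standing in for the per-head projections), not the authors' released code:

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: MH self-attention + position-wise FFN,
    each wrapped with a residual connection and layer normalization."""
    def __init__(self, d=768, h=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, E):
        H = self.ln1(E + self.attn(E, E, E)[0])  # multi-head self-attention sub-layer
        return self.ln2(H + self.ffn(H))         # position-wise feed-forward sub-layer
```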

Pre-training Task
In this section, we present the reputation-augmented MLM training task in the pre-training stage, which is the core task that enforces the PLM to model expert abilities and to capture CQA language knowledge.
Firstly, considering that the original MLM task is a powerful way to train PLMs, we adopt it to learn language knowledge in the CQA scenario: some words are first randomly masked, and the bidirectional context is then used to reconstruct the input sequence. However, the original MLM operates only on the input corpus and is incapable of learning expert-level features (e.g., ability).
As noted above, the reputation an expert has received in CQA (as shown in Figure 1) reflects his/her overall expertise in answering questions. Generally speaking, the higher the reputation, the higher the expertise, and the more likely the expert's answers are to satisfy users of the CQA community. Hence, we design a reputation-augmented MLM task to pre-train the model and empower it to capture the overall expertise of experts. Specifically, given the example input illustrated in Figure 2, the output of the special token [CLS] captures the information of the whole input sequence. Hence, we adopt the [CLS] token as a special indicator to predict the expert reputation. We normalize all expert reputations into the range 0-11 and transform them into special tokens (e.g., [0]-[11]) for convenient prediction. Note that the reputation [CLS] token is always masked during the pre-training phase.
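The masking strategy can be sketched as follows. The function name and the deterministic seed are our illustration, assuming reputation bins are rendered as special tokens [0]-[11] and the [CLS] position is always masked while other tokens are masked at the MLM ratio:

```python
import random

MASK, CLS = "[MASK]", "[CLS]"

def reputation_mlm_mask(tokens, reputation_bin, mask_ratio=0.15, seed=0):
    """Always mask position 0 ([CLS]) and label it with the expert's reputation
    token; randomly mask other tokens for the ordinary MLM objective."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    masked[0], labels[0] = MASK, f"[{reputation_bin}]"  # [CLS] -> reputation target
    for i in range(1, len(tokens)):
        if tokens[i] != "[SEP]" and rng.random() < mask_ratio:
            masked[i], labels[i] = MASK, tokens[i]      # standard MLM target
    return masked, labels

m, y = reputation_mlm_mask([CLS, "greek", "fire", "[SEP]"], reputation_bin=11)
assert m[0] == MASK and y[0] == "[11]"
```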
In this way, our model ExpertPLM can pre-train expert-level representations containing expert interests and capabilities, which are beneficial to the downstream expert finding task.

Fine-tuning for Expert finding
Though ExpertPLM has pre-trained the expert-level representation, the downstream task still differs slightly from pre-training, since it focuses not only on modeling the expert but also on modeling the interaction between the expert and the target question. Considering that the MLM-based pre-trained model naturally captures CQA language knowledge, we also use the pre-trained model to learn question features. As illustrated in Figure 3, we feed the expert's historical answered questions and the target question into two identical pre-trained encoders separately. Then, we concatenate the two [CLS] representations to predict the matching score $S_c$ between the expert and the target question. We employ the negative sampling technique (Huang et al., 2013) and the cross-entropy loss to fine-tune our model:

$$\mathcal{L} = -\sum_c \big[\hat{S}_c \log S_c + (1 - \hat{S}_c)\log(1 - S_c)\big],$$

where $\hat{S}_c$ is the ground-truth label and $S_c$ is the normalized probability predicted by the model.
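A minimal sketch of the fine-tuning head, assuming a simple MLP scorer over the concatenated [CLS] vectors (the exact scorer architecture is not specified in the text, so this is an illustration):

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Concatenate the expert [CLS] and question [CLS] vectors and
    predict a normalized matching probability S_c in (0, 1)."""
    def __init__(self, d=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, expert_cls, question_cls):
        return torch.sigmoid(self.score(torch.cat([expert_cls, question_cls], dim=-1)))

# Cross-entropy over one positive expert and sampled negative experts.
head = MatchHead()
loss_fn = nn.BCELoss()
e_cls, q_cls = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([1.0, 0.0, 0.0, 0.0]).unsqueeze(-1)  # 1 positive + 3 negatives
loss = loss_fn(head(e_cls, q_cls), labels)
```

In practice, the two [CLS] vectors would come from the ExpertPLM-initialized encoders over the expert history and the target question title respectively.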

Datasets and Experimental Settings
We construct a dataset containing 103,005 expert-level input examples for pre-training expert representations, collected from StackExchange (https://stackexchange.com). For fine-tuning and verifying the effect of the model in specific domains, we select six different domains, i.e., English, Biology, Es, Electronics, Gis and CodeReview. Each dataset includes a question set in which each question is associated with its title and an "accepted answer" among several answers provided by different answerers; the provider of the "accepted answer" is the ground-truth expert. We follow the preprocessing method of previous work (Peng et al., 2022a). The detailed statistics of the datasets are shown in Table 2. We split each dataset into a training set, a validation set and a testing set with ratios of 80%, 10% and 10% respectively, in chronological order. We adopt the pre-trained weight bert-base-uncased as the base model; the ExpertPLM pre-training weight contains 110M parameters. To alleviate overfitting, we utilize dropout (Srivastava et al., 2014) with a ratio of 0.2. We adopt the Adam (Kingma and Ba, 2015) optimization strategy and set the learning rate to 5e-5 in further pre-training and 5e-2 in fine-tuning. We independently repeat each experiment 5 times and report the average results. All experiments are implemented with the PyTorch framework on two 24GB-memory RTX 3090 GPU servers with Intel(R) Xeon(R)@2.20GHz CPUs. Our code, pre-trained weight and the validation data are anonymously available online³. We briefly list statistical information of the vote score and reputation in two datasets as examples, shown in Table 3. Since the vote score and the reputation exhibit similar characteristics, we only describe the pre-processing of the vote score; the reputation is pre-processed in a similar way. First, we perform an overall translation of the vote scores to eliminate negative numbers.
Then, we perform a logarithmic operation on the vote scores (i.e., ln(·)) to mitigate the effect of excessive variance. To facilitate model calculation and make the number of scores contained in each score segment approximately similar, we normalize each vote score to an integer between 1 and 10:

$$v^*_i = \mathrm{round}\Big(\frac{v_i - v_{min}}{v_{max} - v_{min}} \times 9\Big) + 1,$$

where $v_{min}$ and $v_{max}$ represent the minimum and maximum scores in the vote score sequence $V_u$, $v^*_i$ is the normalized score of the $i$-th vote score $v_i$ in $V_u$, and $\mathrm{round}$ is the rounding operator.

³ https://github.com/pengqy/EMNLP2022_ExpertPLM
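Assuming min-max scaling after the log transform (the paper's exact binning may differ), the shift-log-normalize pipeline can be sketched as:

```python
import math

def normalize_votes(votes):
    """Shift to eliminate negatives, log-transform, then map to integers 1..10."""
    shifted = [v - min(votes) + 1 for v in votes]   # overall translation, keep > 0
    logged = [math.log(v) for v in shifted]         # ln(.) to damp large variance
    lo, hi = min(logged), max(logged)
    if hi == lo:                                    # degenerate case: all scores equal
        return [1] * len(votes)
    return [round((x - lo) / (hi - lo) * 9) + 1 for x in logged]

normalize_votes([-4, 0, 39])  # -> [1, 5, 10]
```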

Baselines and Evaluation metrics
We compare our method ExpertPLM with recent competitive methods including: (1) Doc2Vec selects experts who have previously answered questions relevant to the target question.
(2) CNTN (Qiu and Huang, 2015) employs a CNN to model questions and computes ranking scores between questions and experts. RMRN (Fu et al., 2020) is equipped with a recurrent memory reasoning network to explore the implicit relevance between experts and questions. (6) UserEmb (Ghasemi et al., 2021) utilizes node2vec to capture social features and word2vec to capture semantic features, then integrates them to improve expert finding. (7) PMEF (Peng et al., 2022a) designs a personalized expert finding method under a multi-view paradigm, which can comprehensively model experts and questions. The evaluation metrics include Mean Reciprocal Rank (MRR) (Craswell, 2009), P@1 (i.e., Precision@1), P@3 (i.e., Precision@3) and Normalized Discounted Cumulative Gain (NDCG@20) (Järvelin and Kekäläinen, 2002), which verify the expert ranking quality.

Performance Comparison
We report the experimental results of ExpertPLM and the comparative methods in Table 1, from which we have several findings. Some earlier methods (e.g., Doc2Vec, CNTN) obtain poor results on almost all datasets; the reason may be that they usually apply max or mean pooling over histories to model the expert, which ignores the differing importance of the histories. On the contrary, recent methods (e.g., RMRN, PMEF, etc.) achieve better results on different datasets, because these methods focus on modeling different interests for different questions.
As we can see, our model ExpertPLM outperforms the other comparative methods and achieves great improvements. Our method introduces an expert-level representation pre-training mechanism to pre-train the expert interests and expertise for different questions over different CQA domains. Via pre-training, the expert representation can be roughly captured by the model, which benefits the downstream expert finding task. Meanwhile, this paradigm captures general CQA language knowledge during pre-training, which enhances the modeling of questions and experts in the downstream task and yields better performance.

Ablation Study
To highlight the effectiveness of our reputation-augmented MLM pre-training task, we design three model variants: (a) Only Cm adopts corpus-level MLM to pre-train over the CQA corpus (i.e., one question title per input line) and then fine-tunes, instead of expert-level pre-training; (b) Only Em adopts expert-level input for MLM pre-training but removes the vote score information and the reputation task during expert-level pre-training; (c) w/o Rep adopts the expert-level MLM task for pre-training but removes the reputation task during expert-level pre-training.
As shown in Table 4, we have the following observations: (1) Only Em outperforms Only Cm. This is because Only Em employs expert-level input and pre-trains specifically for experts; hence it learns more precise representations of experts than the corpus-level pre-training in Only Cm.
(2) w/o Rep outperforms Only Em. Compared with Only Em, w/o Rep additionally introduces the vote score information to pre-train the expert's different abilities to answer different questions, which is the core of the downstream task. (3) Our complete model ExpertPLM obtains the best results, since it pre-trains expert-level representations including both interest and expertise; further, the pre-trained model also captures CQA language knowledge, which yields better performance. In all, the ablation results match our motivation and validate the effectiveness of our proposed pre-training task.

Pre-trained Weight Analysis
In this section, we conduct two experiments to further explore the influence of the pre-trained weight from the following two aspects, by comparing with the original BERT weight.

Effect of Pre-trained Weight In our method, we employ the ExpertPLM pre-training weight to accomplish expert finding via fine-tuning. Hence, we explore the effectiveness of the pre-trained weight in this section. We replace the weight directly with the original bert-base-uncased in the fine-tuning stage, i.e., we employ two bert-base-uncased weights to learn the expert and question representations respectively and then compute the matching score. The results are illustrated in Figure 4. We find that ExpertPLM is useful for the downstream expert finding task. As noted above, accurately learning expert representations is a critical task for expert finding, as it encodes expert interest and expertise for answering different questions. Compared with ExpertPLM, bert-base-uncased cannot capture such expert characteristics, which reduces the performance of the downstream expert finding task. This observation validates the effectiveness of our ExpertPLM pre-training weight.
Effect of Training Data Ratio in Fine-tuning In this part, we adjust the training data ratio in the fine-tuning stage to explore its effect on model training. We employ [40%, 50%, 60%, 70%, 80%] of all data in the Gis dataset as training data to fine-tune our model, while the ratios of validation and testing data remain the same (i.e., 10%) as in the main experiments. As shown in Figure 5, the gap between the results of ExpertPLM and bert-base-uncased grows as the training data is reduced, which indicates that the advantage of the pre-trained expert model is larger when training data is more scarce. This may be because ExpertPLM can exploit expert histories and vote scores to capture expert interest and expertise during the pre-training phase, which reduces the dependency on training data during fine-tuning, whereas bert-base-uncased is incapable of capturing expert-level representations and is thus more affected by the training data ratio in fine-tuning.

Effect of Mask Ratio in Pre-training
The mask ratio is an important hyperparameter of ExpertPLM during pre-training; we vary the ratio over [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35] to explore the parameter sensitivity of the masked language model in pre-training.
The results are shown in Figure 6. All metric results first increase as the ratio of masked tokens increases, reach a maximum (i.e., the best model performance), and then degrade. When the mask ratio is small, the BERT model cannot capture adequate CQA language knowledge, which reduces the performance of downstream tasks. On the contrary, when the mask ratio is large, the [MASK] symbol appears more frequently in the pre-training stage, which intensifies the mismatch between pre-training and fine-tuning. Hence, we set the mask ratio to 0.15 during the pre-training stage.

Conclusion
In this paper, we propose ExpertPLM, a pre-training language model for the expert finding task in CQA. The core of our method is an expert-specific pre-training framework based on a masked language model, towards precisely modeling experts (i.e., interest and expertise) based on their historical answered questions and vote scores. Meanwhile, the pre-trained language model captures CQA language knowledge, which benefits the downstream task. We conduct detailed experiments on real-world CQA datasets, and the results fully validate the effectiveness of our proposed pre-training method. In the future, we would like to explore a larger-scale comprehensive expert pre-training model and extend the pre-trained model to more downstream tasks.

Limitation
Although our model achieves excellent performance, this study has some limitations that could be addressed in future research. First, the existing pre-training dataset is still rather small, which may lead to inadequate pre-training; in the future, we will construct a larger dataset for larger-scale CQA pre-training. Second, some users have many historical answered questions, which causes the input sequence length to exceed 512; in the future, we will explore CQA pre-training based on long-sequence modeling. Third, during pre-training, we only design the expertise learning task for users; in the future, we will explore introducing more user modeling tasks (e.g., interest modeling) during pre-training.