Engage the Public: Poll Question Generation for Social Media Posts

This paper presents a novel task to generate poll questions for social media posts. It offers an easy way to hear the voice of the public and learn how they feel about important social topics. While most related work tackles formal language (e.g., exam papers), we generate poll questions for short and colloquial social media messages exhibiting severe data sparsity. To deal with that, we propose to encode user comments and discover latent topics therein as contexts. They are then incorporated into a sequence-to-sequence (S2S) architecture for question generation and its extension with dual decoders to additionally yield poll choices (answers). For experiments, we collect a large-scale Chinese dataset from Sina Weibo containing over 20K polls. The results show that our model outperforms popular S2S models that do not exploit topics from comments, and that the dual decoder design can further benefit the prediction of both questions and answers. Human evaluations further show our superiority in yielding high-quality polls that help draw user engagement.


Introduction
Social media is a crucial outlet for people to exchange ideas, share viewpoints, and keep connected with the world. It allows us to hear the public voice for decision making and for better understanding our society. Nevertheless, the silent majority tend to read others' messages instead of voicing their own opinions with words, possibly because of introverted personalities, busy schedules, and other reasons. How shall we better engage them in the discussions and learn from their thoughts?
In this work, we present a novel application to automatically generate a poll question for a social media post. It will encourage public users, especially those reluctant to comment with words, to input their reflections via voting. For example, the statistics of our dataset show that 13K users on average engaged in a poll, compared with 173 who commented on a post. For a better illustration of the task, Figure 1 shows two example poll questions on Sina Weibo, henceforth Weibo, a popular Chinese microblog. The goal of our task is to output an opinion question, such as Q1 and Q2, and invite other users to engage in the discussion of a source post (e.g., P1 and P2); poll choices (answers like A1 and A2) can be produced together to allow easy public engagement (via voting).
* Jing Li is the corresponding author.
To date, most progress in question generation builds upon the success of encoder-decoder frameworks (Du et al., 2017). Despite the extensive efforts made in this line (Sun et al., 2018; Yao et al., 2018; Chai and Wan, 2020; Sun et al., 2020), most previous work focuses on processing formally written texts, such as exam questions in reading comprehension tests. The existing methods are therefore suboptimal for social media language, which is short and informal and thus makes it challenging to make sense of the source posts and decide what to ask. For example, from the limited words in P1, it is hard to capture the meanings of "B站" (B site) and "爱奇艺" (iQiyi) as video apps, which is nevertheless crucial to predict Q1. Moreover, the question itself, being in social media fashion, is likely to contain fresh words, such as "c位" (center position) in Q2, which may further hinder the models' capability to predict poll questions in social media style.
To tackle these challenges, we first enrich the short contexts of source posts with other users' comments; a neural topic model is employed to discover topic words therein and help identify the key points made in source posts. It is based on the assumption that the salient words in a source post are likely to be echoed in its comments (Wang et al., 2019b), which is potentially useful for learning the mapping from posts to poll questions. For example, the core words in Q1, "app" and "视频" (video), co-occur frequently in the comments with "B站" (B site) and "爱奇艺" (iQiyi), which may help the model link their meanings together. The topic representations are then incorporated into a sequence-to-sequence (S2S) architecture to decode poll questions word by word. Furthermore, we extend the basic S2S to a version with dual decoders to generate questions and answers in a multi-task learning setting and further exploit their correlations. For example, modeling the answers in A2 might help indicate that P2 centers around "赵粤" (Akira) and "希林娜依高" (Curley G), two celebrities.
To the best of our knowledge, this work is the first to study poll questions on social media, where the interactions among answer choices, source posts, and readers' comments are comprehensively explored. As a pilot study over social media polls, we also contribute the very first dataset containing around 20K Weibo polls associated with their source posts and user comments. 2 We believe our dataset, being the first of its kind, will largely benefit the research on social media polls and how they help promote public engagement.
2 Our dataset and code are publicly available at https://github.com/polyusmart/Poll-Question-Generation
On our dataset, we first compare the model performance on poll question generation in terms of automatic evaluation and human evaluation. The automatic evaluation results show that the latent topics learned from the first few pieces of user comments are already helpful: they result in our models' significantly better performance than the S2S baselines and their trendy extensions proposed for other tasks. For example, our full model achieves 38.24 ROUGE-1 while S2S with RoBERTa (Liu et al., 2019) yields 34.08. Human evaluation further demonstrates our models' capability to generate poll questions that are relevant to the source post, fluent in language, and particularly engaging in drawing user attention to discussions. We then quantify the models' sensitivity to varying lengths of source posts and poll questions, where the scores of our model are consistently better. Next, we find our model exhibits an increasing trend in predicting poll questions that will engage more comments in the future, which suggests the potential helpfulness of comments in indicating engaging questions. At last, the performance of the dual decoder design is discussed, and it is shown that joint prediction of questions and their answers can benefit both tasks.

Task Formulation
Our major input is a social media post (i.e., source post) and the main output is a poll question that continues the sense of the source post and encourages public users to voice opinions. For each question, possible answer choices (i.e., answers) may also be yielded as a side product to enable participants to easily input their thoughts. To enrich the contexts of source posts, their reply messages (i.e., user comments) are also encoded as external features.

Data Description
Here we describe the dataset we collect to empirically study social media polls.
Data Collection. Weibo allows users to create polls, asking questions to the public and inviting others to share their thoughts via voting. This enables the construction of a dataset with user-generated polls. At the beginning, we gathered around 100K random Weibo posts, yet less than 0.1% of them contained polls. Such a sparse distribution of polls makes it challenging to scale up the dataset. To deal with that, we looked into the sampled polls and drew two interesting observations: first, many polls carry trendy hashtags (user-annotated topic labels like #COVID19) to draw user attention; second, a user who once created a poll is likely to do it again. Inspired by these observations, we first obtained the popular hashtags since Nov 2019. Then, we gathered the posts under these hashtags through the Weibo search API and picked out the ones containing polls. Next, we examined the authors of these polls and accessed their posting history to gather more polls they created via the Weibo user timeline API. Afterwards, for each post, we crawled its comments via the comment API. Finally, 20,252 polls were obtained from 1,860 users.
Data Analysis. The statistics of the dataset are displayed in Table 1. As can be seen, comments are shorter than posts, probably because users tend to put more effort into crafting original posts than into replying to others; hence comments may be relatively noisier than original posts. Both questions and answers are short, which follows the fashion of user-generated content on social media.
To further investigate the data sparsity in social media content, we sample some texts from the LDC news corpus (formally written texts) (Ahtaridis et al., 2012), such that the samples contain the same number of tokens as our social media texts. Our corpus's vocabulary size and entropy are 24,884 and 7.46, while those for the news corpus are 9,891 and 5.98. The larger vocabulary and higher entropy suggest the sparsity of social media data.
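The two sparsity statistics compared above can be computed along the following lines. This is a minimal sketch rather than the authors' script: it assumes the text is already tokenized, that entropy means the unigram entropy of the token distribution, and that the logarithm is base 2 (the paper does not state the base). The function name is hypothetical.

```python
import math
from collections import Counter

def vocab_size_and_entropy(tokens):
    """Return (vocabulary size, unigram entropy) of a token list.

    Assumption: entropy is measured in bits (log base 2).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return len(counts), entropy

# Toy comparison on two samples with the same number of tokens.
social_sample = ["B站", "爱奇艺", "app", "视频", "c位", "app"]
news_sample = ["政府", "经济", "政府", "经济", "政府", "经济"]
print(vocab_size_and_entropy(social_sample))
print(vocab_size_and_entropy(news_sample))
```

A larger vocabulary and a higher entropy for the same token budget indicate that probability mass is spread over more word types, which is the sense of sparsity used here.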
We also observe that each post exhibits more voters than comments, implying that users may prefer to voice opinions via voting, which is easier than commenting with words. We further analyze the effects of polls on user engagement and draw an interesting finding. For the same author, posts with polls receive on average 1.65, 22.2, and 1.80 times as many comments, likes, and reposts as posts without polls. This implies that adding polls indeed helps to draw user engagement to a post. For each poll, there are fewer than 4 answer choices on average. To further characterize that, Figure 2(a) shows the count of polls over varying numbers of answer choices; the statistics suggest that most users are not willing to craft over 5 poll choices, which, interestingly, resembles the statistics of exam questions. In addition, we probe into what types of topics are more likely to contain polls. To that end, we examined source posts with hashtags and manually categorized the hashtags into 11 topics. Figure 2(b) shows the poll distribution over topics. Most polls fall into the "social events" category, which mostly concerns public emergencies; in our dataset, a large number of posts focus on the outbreak of COVID-19. A large proportion of polls also concern entertainment topics such as celebrities and TV shows, probably initiated for advertising purposes.

Poll Question Generation Framework
This section introduces our framework with two variants: one based on a basic S2S (single decoder) and the other is its extension with dual decoders to predict poll questions and answer choices in a multitask learning setting. The model architecture of the dual decoder model is shown in Figure 3.

Source Posts and Comments Encoding
Following the common practice in S2S (Du et al., 2017), we encode a source post $P$ in the form of a word sequence $w_1, w_2, \ldots, w_{|P|}$, where $|P|$ is the number of words in the post. For user comments $C$, bag-of-words (BoW) representations are employed for topic modeling, henceforth $C_{bow}$ over the BoW vocabulary. More details are provided below.
Source Post Encoding. To encode the post sequence $P$, a bidirectional gated recurrent unit (Bi-GRU) (Cho et al., 2014) is adopted. For the $i$-th word $w_i \in P$, we first convert it into an embedding vector $\nu_i$, which is later processed into a hidden state $h_i$ and sequentially put into a memory bank $M = \langle h_1, h_2, \ldots, h_{|P|} \rangle$, which will be further delivered to the decoders for their attentive retrieval.
User Comments Modeling. Considering the noisy nature of user comments, latent topics are employed to recognize the salient contents therein. They are explored based on word statistics and represented as clusters of words tending to co-occur in the comments of some posts (probably concerning similar topics), such as the names of video apps in Figure 1. In topic modeling, we assume there are $K$ topics and each topic $k$ is represented with a topic-word distribution over the BoW vocabulary. A post $P$ has a topic mixture $\theta$, which is learned from the words appearing in its comments $C_{bow}$.
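As an illustration of the BoW input described above, the following sketch maps the tokenized comments of one post to a count vector over a fixed BoW vocabulary. The function name and the toy vocabulary are hypothetical, and out-of-vocabulary words are simply dropped, which is an assumption rather than something the paper specifies.

```python
from collections import Counter

def comments_to_bow(comments, vocab):
    """Map tokenized comments to a count vector over a fixed BoW vocabulary.

    comments: list of token lists (one list per comment).
    vocab: list of words; position i of the output counts vocab[i].
    Assumption: words outside the vocabulary are ignored.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter(tok for comment in comments for tok in comment if tok in index)
    vector = [0] * len(vocab)
    for word, n in counts.items():
        vector[index[word]] = n
    return vector

bow = comments_to_bow([["app", "视频"], ["app", "好"]], ["app", "视频", "B站"])
print(bow)
```

Word order is discarded on purpose: the topic model only needs co-occurrence statistics, which makes it less sensitive to the informal phrasing of comments.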
Our topic learning methods (from comments) are inspired by the neural topic model (NTM) based on the variational auto-encoder (VAE) (Miao et al., 2017; Zeng et al., 2018), which allows the end-to-end training of the NTM with other modules in a unified neural architecture. It employs an encoder and a decoder to resemble the data reconstruction process of the comment words in BoW.
Concretely, the input $C_{bow}$ is first encoded into prior parameters $\mu$ and $\sigma$ using neural perceptrons. Then, through Gaussian transformation, they are applied to draw a latent variable $z \sim \mathcal{N}(\mu, \sigma^2)$, which is further taken to produce the topic composition of comments ($\theta$) with softmax transformation.
At last, the decoder reconstructs the comments and produces a BoW vector $C'_{bow}$ (conditioned on the latent topic $\theta$) through another neural perceptron.
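The sampling step above can be sketched in a few lines. This is an illustrative sketch only: it assumes the encoder has already produced $\mu$ and $\sigma$ as plain lists of floats, draws $z$ with the standard reparameterization $z = \mu + \sigma \cdot \epsilon$ (a common choice for VAEs, not confirmed by the text), and omits the perceptrons and the reconstruction decoder. All names are hypothetical.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def sample_topic_mixture(mu, sigma, rng=random):
    """Draw z ~ N(mu, sigma^2) elementwise, then return theta = softmax(z).

    Assumption: z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization).
    """
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return softmax(z)

# K = 3 topics; a fixed seed makes the draw repeatable.
theta = sample_topic_mixture([0.0, 1.0, -1.0], [0.5, 0.5, 0.5], random.Random(0))
print(theta)
```

The softmax guarantees that $\theta$ is a valid mixture (non-negative, summing to one), so it can be read as the proportion of each latent topic in the comments.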

Poll Decoding
Here we further describe how we generate questions (and answers in the dual decoders settings) with the encoded source posts and comments.
Question Generation. To handle the output of a question $Q$, the corresponding decoder (i.e., question decoder) is formed with a uni-directional GRU and fed with the memory bank $M$ from source post encoding and the topic distribution $\theta$ from user comment modeling. The words in $Q$ are predicted sequentially with the following formula:

$$\Pr(Q \mid P, C_{bow}) = \prod_{j=1}^{|Q|} \Pr(q_j \mid q_{<j}, M, \theta) \qquad (1)$$

where $q_j$ means the $j$-th word in $Q$ and $q_{<j}$ refers to $Q$'s predicted word sequence from slot $1$ to $j-1$.
To leverage comment modeling results in the decoding, we incorporate $\theta$ into the attention weights (defined below) over source posts and concentrate on topic words therein for question generation.

$$\alpha_{ij} = \frac{\exp\left(f_\alpha(h_i, s_j, \theta)\right)}{\sum_{i'=1}^{|P|} \exp\left(f_\alpha(h_{i'}, s_j, \theta)\right)} \qquad (2)$$

Here $s_j$ is the GRU decoder's $j$-th hidden state and $f_\alpha(\cdot)$ scores how well the $i$-th source word matches the $j$-th decoding step given the topics $\theta$. In addition, we adopt the copy mechanism (See et al., 2017) to allow the generated questions to contain the keywords from the source posts:

$$p_j = \lambda_j \cdot p_{gen} + (1 - \lambda_j) \cdot p_{copy} \qquad (4)$$

$p_{gen}$ refers to the likelihood to generate a word while $p_{copy}$ is the extractive distribution derived from the attention weights over the source input. The soft switcher $\lambda_j \in [0, 1]$ determines whether to copy a word or generate a new one, in a way that is aware of the comments' topics.
Answer Generation. To further explore the relations between questions ($Q$) and answers ($A$), we "replicate" the question decoder's architecture and form another decoder to handle answer generation (answer decoder). The answer choices are concatenated to form an answer sequence and neighboring choices are separated with a special token "<sep>". The answer decoder also adopts the same topic-aware attentions (Eq. 2) as the question decoder (denoted as $\beta_{ij}$ here) and copy mechanisms (Eq. 4) to be able to put topic words from the source into the answer choices, such as "赵粤" (Akira) and "希林娜依高" (Curley G) in Figure 1.
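To make the role of the soft switcher concrete, the sketch below mixes a generation distribution with a copy distribution. It assumes the usual pointer-generator form, $\lambda \cdot p_{gen} + (1 - \lambda) \cdot p_{copy}$, where $p_{copy}$ sums the attention mass of each source token; how $\lambda_j$ itself is computed is not reproduced here, and the function name and toy numbers are hypothetical.

```python
def mix_generate_and_copy(p_gen, attention, source_tokens, lam):
    """Return the final word distribution lam * p_gen + (1 - lam) * p_copy.

    p_gen: dict mapping vocabulary words to generation probabilities.
    attention: attention weights over source positions (sums to 1).
    source_tokens: source words aligned with the attention weights.
    lam: soft switcher in [0, 1]; higher values favor generation.
    """
    # A source word occurring several times accumulates its attention mass.
    p_copy = {}
    for token, weight in zip(source_tokens, attention):
        p_copy[token] = p_copy.get(token, 0.0) + weight
    words = set(p_gen) | set(p_copy)
    return {w: lam * p_gen.get(w, 0.0) + (1.0 - lam) * p_copy.get(w, 0.0) for w in words}

mixed = mix_generate_and_copy(
    p_gen={"你": 0.5, "看": 0.5},
    attention=[0.75, 0.25],
    source_tokens=["B站", "看"],
    lam=0.5,
)
print(mixed)
```

Note that "B站" receives probability even though it is absent from the generation distribution, which is how copying lets rare source words appear in the output.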
Question decoder and answer decoder work together in a dual decoders setting, whose parameters are updated simultaneously to exploit the essential correlations of poll questions and their answers.
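The answer decoder's target, a single sequence in which neighboring choices are separated by "<sep>", can be built and parsed as sketched below. The helper names are hypothetical, and dropping empty choices when parsing generated output is an assumption made for robustness.

```python
SEP = "<sep>"

def join_choices(choices):
    """Concatenate tokenized answer choices into one target sequence."""
    sequence = []
    for i, choice in enumerate(choices):
        if i > 0:
            sequence.append(SEP)
        sequence.extend(choice)
    return sequence

def split_choices(sequence):
    """Recover the answer choices from a (generated) answer sequence.

    Assumption: empty choices, e.g. from repeated separators, are discarded.
    """
    choices, current = [], []
    for token in sequence:
        if token == SEP:
            choices.append(current)
            current = []
        else:
            current.append(token)
    choices.append(current)
    return [c for c in choices if c]

target = join_choices([["赵粤"], ["希林娜依高"]])
print(target)
print(split_choices(target))
```

Flattening the choices this way lets the answer decoder reuse the question decoder's sequential architecture without a separate mechanism for deciding how many choices to emit.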

Model Training
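This subsection describes how we jointly train the neural topic model (henceforth NTM) for comment modeling and the decoders for question and answer generation with multi-task learning. The loss function for NTM is defined as:

$$\mathcal{L}_{NTM} = D_{KL}\left(q(z \mid C) \,\|\, p(z)\right) - \mathbb{E}_{q(z \mid C)}\left[\log p(C \mid z)\right] \qquad (6)$$

The $C$ above refers to $C_{bow}$. The first term is the KL divergence loss and the second is the reconstruction loss in VAE. For question generation, the loss is:

$$\mathcal{L}_{QG} = -\sum_{n=1}^{N} \log \Pr(Q_n \mid P_n, \theta_n) \qquad (7)$$

$N$ is the number of training samples; $Q_n$, $P_n$, and $\theta_n$ are the target poll question, source post, and topic distribution of the $n$-th training sample. The answer generation loss $\mathcal{L}_{AG}$ is defined similarly. The training loss of the entire model is defined as:

$$\mathcal{L} = \mathcal{L}_{NTM} + \gamma_Q \cdot \mathcal{L}_{QG} + \gamma_A \cdot \mathcal{L}_{AG} \qquad (8)$$

where $\gamma_Q$ and $\gamma_A$ balance the weights over NTM and the two decoders.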
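The pieces of the joint objective can be sketched numerically as follows. This is a hedged illustration, not the released training code: it assumes a standard normal prior with a diagonal Gaussian posterior (so the KL term has a closed form), a multinomial BoW likelihood for reconstruction, and a total loss that adds the NTM loss to the two decoder losses weighted by $\gamma_Q$ and $\gamma_A$. All function names are hypothetical.

```python
import math

def gaussian_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), assuming a standard normal prior."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))

def bow_reconstruction_loss(bow_counts, word_probs):
    """Negative log-likelihood of BoW counts under the decoder's word distribution."""
    return -sum(c * math.log(p) for c, p in zip(bow_counts, word_probs) if c > 0)

def sequence_nll(step_probs):
    """Negative log-likelihood of one target sequence.

    step_probs: probability assigned to the gold word at each decoding step.
    """
    return -sum(math.log(p) for p in step_probs)

def total_loss(l_ntm, l_qg, l_ag, gamma_q=1.0, gamma_a=1.0):
    """Joint loss: NTM loss plus the weighted question and answer losses."""
    return l_ntm + gamma_q * l_qg + gamma_a * l_ag

l_ntm = gaussian_kl([0.0, 0.0], [1.0, 1.0]) + bow_reconstruction_loss([2, 0], [0.5, 0.5])
l_qg = sequence_nll([0.5, 0.25])
l_ag = sequence_nll([0.5])
print(total_loss(l_ntm, l_qg, l_ag))
```

With $\gamma_Q = \gamma_A = 1$ the three terms contribute equally, so neither decoder dominates the gradients that reach the shared encoder and topic model.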
Then, for some poll questions echoed in the source posts, we took them away for fair experiments.
Next, the open-source toolkit jieba is employed for Chinese word segmentation. 8 Afterwards, we filtered out stop words and, for the remaining words, maintained two vocabularies: the most frequent 50K words for sequences (input and output) and another 100K words for BoW. Finally, comments are capped at the first 100 words to examine poll question generation with the early comments and their potential to draw future user engagement.
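The vocabulary truncation and comment capping described above might look like the sketch below. It assumes the text is already segmented (the segmentation call itself is omitted), that the cap counts tokens across a post's comments in posting order, and that ties in frequency are broken by first occurrence; the function names are hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, max_size, stop_words=frozenset()):
    """Keep the max_size most frequent words, ignoring stop words."""
    counts = Counter(t for tokens in token_lists for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(max_size)]

def cap_comments(comments, max_words=100):
    """Keep only the first max_words tokens of a post's comments.

    Assumption: comments are given in posting order, earliest first.
    """
    kept = []
    for comment in comments:
        for token in comment:
            if len(kept) == max_words:
                return kept
            kept.append(token)
    return kept

corpus = [["a", "b", "a"], ["a", "c", "b"]]
print(build_vocab(corpus, max_size=2))
print(cap_comments([["x", "y"], ["z"]], max_words=2))
```

In the setting described here, `max_size` would be 50,000 for the sequence vocabulary and 100,000 for the BoW vocabulary.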
In evaluations, we split our data into 80% for training, 10% for validation, and 10% for test. The S2S baselines with pre-trained models were implemented with the PaddleHub platform. 9 For all S2S baselines with pre-trained models, the pre-trained parameters were further fine-tuned on our training data.
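The 80/10/10 split mentioned above can be reproduced with a seeded shuffle, as in the sketch below. Whether the original split was random, and which seed was used, is not stated in the paper; both are assumptions here, and the function name is hypothetical.

```python
import random

def split_dataset(items, seed=0, train=0.8, valid=0.1):
    """Shuffle items with a fixed seed and split into train/validation/test.

    Assumption: a uniformly random split; the remainder after the train and
    validation portions becomes the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )

train_set, valid_set, test_set = split_dataset(range(20252))
print(len(train_set), len(valid_set), len(test_set))
```

Fixing the seed keeps the three subsets identical across runs, which matters when hyperparameters are tuned on the validation portion.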
Then, we consider the following S2S extensions with copy mechanism (i.e., COPY) (Meng et al., 2017), topic modeling from posts (i.e., TOPIC) (Wang et al., 2019a), and bidirectional attentions over posts and comments (i.e., CMT (BIATT)) (Wang et al., 2019b). All of them were proposed for keyphrase generation tasks and set up following their original papers.
For our models, we consider two variants: CMT (NTM) in the single decoder architecture and its dual decoder version DUAL DEC. 10
Model Settings. All the hyperparameters are tuned on the validation set via grid search. The NTM is pre-trained for 50 epochs before joint training, after which the different modules take turns to update their parameters. We adopt a two-layer bidirectional GRU to build the source post encoder and one-layer unidirectional GRUs for the question and answer decoders. The hidden size of each GRU is 300. Word embeddings have size 150 and are randomly initialized. In training, we apply the Adam optimizer with an initial learning rate of 1e-3, gradient clipping at 1.0, and an early-stopping strategy. The weights to trade off losses in multitask learning are set to $\gamma_Q = \gamma_A = 1$ (Eq. 8).
8 https://github.com/fxsjy/jieba
9 https://www.paddlepaddle.org.cn/hub
10 We also fine-tuned BERT with our models yet could not observe much performance gain. One possible reason is that the NTM is able to learn essential features from the input, so BERT cannot provide additional benefits. Another possible reason is that a social media BERT is unavailable in Chinese, and one trained on out-of-domain data (e.g., news) might not fit well with the language on Weibo. Large-scale Weibo data might be acquired for continued pre-training (Gururangan et al., 2020), which is beyond the scope of this paper and will be explored in future work.
Evaluation Metrics. We adopt both automatic measures and human ratings for evaluation. For the former, we examine two popular metrics for language generation tasks: ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002). For the latter, human annotators rate the outputs on a 4-point Likert scale (i.e., {0, 1, 2, 3}) over three criteria: the relevance to the source posts (relevance), how fluently the generated language reads (fluency), and how attractive the questions are in drawing people's engagement (engagingness).
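For intuition about what ROUGE-1 measures, a simplified unigram-overlap F1 is sketched below. It is not the official ROUGE toolkit: it assumes pre-tokenized input, a single reference, and no stemming or stop-word handling, and the function name is hypothetical.

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two token lists."""
    # Multiset intersection clips each word's count to the smaller side.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f(["你", "怎么", "看"], ["你", "怎么", "看", "B站"]))
```

Scores reported in the paper should be compared only against numbers produced by the same scoring tool and tokenization.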

Experimental Results
In this section, we first show the main comparison results on poll question generation involving both automatic evaluations and human ratings (in §5.1). Then, model sensitivity to varying lengths of source posts and poll questions is discussed in §5.2, followed by the analyses of models' capability to handle poll questions exhibiting varying degrees of user engagement (§5.3). Next, §5.4 discusses the performance of dual decoders that jointly generate questions and answers. A case study is presented at last (in §5.5) to interpret the sample outputs.

Comparison on Poll Question Generation
We first show the comparison results on poll question generation, where we will discuss automatic evaluations and human ratings in turn below.
Automatic Evaluations. Table 2 reports the automatically measured results on question generation. As can be seen, our task is challenging and basic S2S performs poorly. Pre-trained models from the BERT family offer some, though limited, help. This is probably because the pre-training data is from other domains (e.g., news and online encyclopedias), so the learned representations cannot fully reflect the styles of social media language.
We then observe that the copy mechanism and latent topics (learned from posts) are both useful, where the former allows keywords extracted from the post to form a question while the latter further helps find topic words to be copied. On the contrary, user comments, though able to provide useful information, are noisy (also implied by Table 1). So, it is important to encode the comments in an appropriate way: CMT (NTM) captures salient topic features from the comments and performs much better than CMT (BIATT), which might be hindered by the noise and exhibits the second worst results. In addition, we notice that DUAL DEC slightly outperforms its single decoder variant CMT (NTM), though the gain is small. To better examine their prediction results, we conduct human evaluations.
Human Ratings. Here we sampled 400 source posts (and their outputs), and invited four native Chinese speakers to rate the poll questions on a 4-point Likert scale (0 for extremely bad, 1 for bad, 2 for good, and 3 for extremely good) without knowing where the results come from. Each annotator reviewed 100 samples, with assignments differing across annotators; Table 3 shows the average ratings over the four annotators.
All the models are rated worse than the gold standard, which means automatic poll question generation still has a long way to go. We also observe that models with latent topics exhibit relatively better relevance. This may be because topic models allow the capture of salient contents from the input and the injection of details into the output. Besides, CMT (NTM) and DUAL DEC perform the best in engagingness, probably because user comments and poll answers might provide implicit clues (e.g., fresh words) helpful to predict engaging questions. For fluency, BASE outperforms our models by a small margin, as it tends to yield short and generic questions, such as "你怎么看" (What's your viewpoint?), based on our observation. Moreover, we measure the length of questions generated by BASE and DUAL DEC (our full model) and find that 11.0% of the questions generated by BASE contain fewer than 5 words, whereas the number for DUAL DEC is only 1.6%. This again demonstrates our potential to generate longer questions with richer details.
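The short-question statistic quoted above (the share of generated questions with fewer than 5 words) is a simple proportion, sketched below. It assumes questions are given as token lists after word segmentation; the function name and the toy outputs are hypothetical.

```python
def share_of_short_questions(questions, min_words=5):
    """Fraction of questions containing fewer than min_words tokens."""
    if not questions:
        return 0.0
    short = sum(1 for question in questions if len(question) < min_words)
    return short / len(questions)

outputs = [
    ["你", "怎么", "看"],
    ["你", "最", "喜欢", "哪个", "视频", "app"],
    ["你", "支持", "谁", "站", "c位"],
    ["你", "觉得", "呢"],
]
print(share_of_short_questions(outputs))
```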

Effects of Post and Question Length
We further quantify the question generation results over varying lengths of source posts and poll questions and show the corresponding ROUGE-1 scores in Figure 4. Here, we compare BASE, ROBERTA, TOPIC, and our CMT (NTM). 11 Post length does not seem to affect the models' performance much, probably because of the length limitation in Weibo: even the relatively longer posts contain limited words. On the contrary, for the question length, the two S2S baselines both exhibit obvious performance drops when generating long questions, while TOPIC and CMT (NTM) perform steadily. This suggests that latent topics, either captured from posts or comments, may have the potential to enrich questions with detailed descriptions, and hence can better tackle long questions. Nevertheless, CMT (NTM) presents consistently better ROUGE-1 in diverse scenarios.
11 In §5.2 and §5.3, we experiment in the single decoder settings so as to focus on the quality of generated questions. We will further discuss the dual decoders in §5.4.
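A length-sensitivity analysis of this kind amounts to grouping test samples into length buckets and averaging a per-sample score within each bucket. The sketch below shows one way to do that; the bucket width is an assumption (the bins used in Figure 4 are not given in the text), and the function name is hypothetical.

```python
from collections import defaultdict

def mean_score_by_length(lengths, scores, bucket_width=5):
    """Average per-sample scores within fixed-width length buckets.

    Returns a dict mapping each bucket's lower bound to the mean score of the
    samples whose length falls in [lower, lower + bucket_width).
    """
    buckets = defaultdict(list)
    for length, score in zip(lengths, scores):
        buckets[(length // bucket_width) * bucket_width].append(score)
    return {lower: sum(vals) / len(vals) for lower, vals in sorted(buckets.items())}

by_length = mean_score_by_length([3, 4, 7, 12], [0.25, 0.75, 0.5, 1.0])
print(by_length)
```

The same helper applies to the engagement analysis if comment counts are passed in place of lengths.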

Poll Questions vs. User Engagements
As shown in the human ratings (§5.1), comments might help to generate engaging poll questions. For further discussion, Figure 5 shows the ROUGE-1 of ROBERTA, TOPIC, and CMT (NTM) in handling questions for polls that later attract varying numbers of user comments. Interestingly, CMT (NTM) performs better when predicting questions that eventually engage more comments. This suggests that early comments might provide useful clues for models to distinguish attractive questions with the potential to draw more public engagement in the future. Lacking the ability to learn from comments, TOPIC exhibits relatively more stable trends.

Discussion on Dual Decoders
The previous two subsections are discussed in the single decoder setting and here we further examine the effectiveness to jointly predict questions and answers. BASE, COPY, TOPIC, and CMT (NTM) with single and dual decoders are discussed.
We first compare question generation results, and Figure 6 shows the ROUGE-1 scores. It is seen that dual decoders can boost the results of BASE and COPY, implying that questions and answers are indeed related and that exploiting their interactions can bring performance gains. However, we cannot observe large-margin improvements in TOPIC and CMT (NTM), probably because many words in answers, such as "赵粤" (Akira) and "希林娜依高" (Curley G) in Figure 1, are also topic words that can be discovered with topic models. Therefore, jointly generating answers only provides limited help to their question generation results.
Then, we analyze how the multitask learning ability of dual decoders influences the prediction of poll answers. Table 4 displays the comparison results with pipeline models that sequentially generate questions and then answers. By examining the pipeline results, we first find that source posts are helpful in answer generation, as PT+QS outperforms QS ONLY. Besides, training answer generation with predicted questions or with the gold standards does not make much difference. Gold standard questions might exhibit higher quality, while predicted questions may better fit the test condition (answer choices should be predicted without knowing the human-crafted questions).
For dual decoders, CMT (NTM) still performs the best, implying that latent topics from user comments can also contribute to better prediction of poll answers. In comparison with the best pipeline model (PT+QS), the scores from CMT (NTM) are competitive, while the dual decoder allows end-to-end training and is easier to use (with less manual effort in model training and application).

Case Study
To provide more insights, we further take the two Weibo posts in Figure 1 as the input cases and examine the output of varying models in Table 5. Unsurprisingly, BASE tends to yield generic questions as limited features are encoded from the noisy source. ROBERTA sometimes produces repeated words (e.g., its output to P1), hindering its capability to generate fluent language (also indicated by Table 3). This is possibly caused by overfitting, as RoBERTa might rely on large-scale in-domain data for fine-tuning.
We also find that modeling topics and user comments may enable the output to contain trendy wordings, making it more engaging, such as "c位" (center position) in CMT (NTM)'s output question for P2 and the names of many new video apps in DUAL DEC's generated answer choices for P1. Furthermore, the dual decoders might learn the cohesive relations between questions and answers, such as Akira and Curley G occurring in both the generated questions and answer choices (P2).

Related Work
Our work is in line with question generation, where most prior efforts focus on how to ask good exam questions given an article and the pre-defined answers. Some adopt manually crafted rules or features (Labutov et al., 2015; Dhole and Manning, 2020; Fabbri et al., 2020), largely relying on a labor-intensive process for rule design or feature engineering. To simplify the training, automatic feature learning hence becomes popular. For example, Chali and Hasan (2015) first employ a Bayesian model to learn topic features and then leverage them to yield questions. These pipeline methods require expert involvement to manually customize the model inference algorithms, while our neural network design allows end-to-end training of topic modeling and question generation.
Recently, S2S-based question generation architectures have demonstrated promising results (Du et al., 2017; Chai and Wan, 2020). To better encode the input, researchers adopt successful training designs from other tasks, such as self-attention mechanisms (Scialom et al., 2019), language model pre-training (Pan et al., 2019), variational inference, and reinforcement learning (Pan et al., 2019). Heuristic features, e.g., the answers' positions in the article (Sun et al., 2018; Kim et al., 2019; Liu, 2020), are sometimes considered. For question decoding, certain constraints are added to control the generation, such as some aspects to be contained (Hu et al., 2018), varying levels of difficulty, and specificity (Cao et al., 2019). Our work is also related to previous work handling the generation of questions and answers in a multitask learning setting (Tang et al., 2017; Sun et al., 2020). Nonetheless, none of the aforementioned research concerns poll questions and answers on social media, which exhibit very different language styles from the texts in existing studies and have not been extensively explored.
Table 5: Questions generated for the source posts in Figure 1: P1 (top) and P2 (bottom). For DUAL DEC (i.e., CMT (NTM) with dual decoders), the question is followed by the answer in the next row.

Conclusion
We have presented a novel task to generate social media poll questions. User comments encoded with a neural topic model are leveraged in an S2S framework; a dual decoder architecture is further adopted to explore the interactions between questions and answers. Extensive experiments on a large-scale dataset newly collected from Weibo have demonstrated the effectiveness of our proposed model.

Ethical Considerations
The task will not pose ethical problems. First, the polls are openly accessible to public users (so as to collect their opinions). Second, Weibo allows any user to report suspicious cases with ethical concerns, and the reported contents will be removed immediately. Third, the polls run anonymously to protect the privacy of voters.
The dataset is collected through the official APIs of Weibo and is consistent with the Weibo terms of use. We also took the following measures. First, we conducted data anonymization and manually examined the data to ensure there are no privacy or ethical concerns, e.g., personal information, toxic language, and hate speech. In the generated polls, we did not spot any cases that might raise such concerns. Second, the involved Weibo users are all public ones. To that end, we automatically filtered out personal users without the official confirmation of Weibo (the confirmed public users can be identified with a "VIP" tag). The user list was manually checked again to mitigate the ethical concern.
For the annotation, we recruited part-time research assistants, who were paid 15.7 USD/hour and worked at most 20 hours per week.