In recent years, with the vast and rapidly increasing amounts of spoken and textual data, Named Entity Recognition (NER) tasks have evolved into three distinct categories, i.e., text-based NER (TNER), Speech NER (SNER) and Multimodal NER (MNER). However, existing approaches typically require designing separate models for each task, overlooking the potential connections between tasks and limiting the versatility of NER methods. To mitigate these limitations, we introduce a new task named Integrated Multimodal NER (IMNER) to break the boundaries between different modal NER tasks, enabling a unified implementation of them. To achieve this, we first design a unified data format for inputs from different modalities. Then, leveraging the pre-trained MMSpeech model as the backbone, we propose an **I**ntegrated **M**ultimod**a**l **Ge**neration Framework (**IMAGE**), formulating the Chinese IMNER task as an entity-aware text generation task. Experimental results demonstrate the feasibility of our proposed IMAGE framework in the IMNER task. Our work in integrated multimodal learning in advancing the performance of NER may set up a new direction for future research in the field. Our source code is available at https://github.com/NingJinzhong/IMAGE4IMNER.
Disclaimer: Samples in this paper may be harmful and cause discomfort! Patronizing and condescending language (PCL) is a form of speech directed at vulnerable groups. As an essential branch of toxic language, this type of language exacerbates conflicts and confrontations among Internet communities and detrimentally impacts disadvantaged groups. Traditional pre-trained language models (PLMs) perform poorly in detecting PCL due to its implicit toxicity traits like hypocrisy and false sympathy. With the rise of large language models (LLMs), we can harness their rich emotional semantics to establish a paradigm for exploring implicit toxicity. In this paper, we introduce PclGPT, a comprehensive LLM benchmark designed specifically for PCL. We collect, annotate, and integrate the Pcl-PT/SFT dataset, and then develop a bilingual PclGPT-EN/CN model group through a comprehensive pre-training and supervised fine-tuning staircase process to facilitate implicit toxic detection. Group detection results and fine-grained detection from PclGPT and other models reveal significant variations in the degree of bias in PCL towards different vulnerable groups, necessitating increased societal attention to protect them.
Yonkoma Manga, characterized by its four-panel structure, presents unique challenges due to its rich contextual information and strong sequential features. To address the limitations of current multimodal large language models (MLLMs) in understanding this type of data, we create a novel dataset named YManga from the Internet. After filtering out low-quality content, we collect a dataset of 1,015 yonkoma strips, containing 10,150 human annotations. We then define three challenging tasks for this dataset: panel sequence detection, generation of the author’s creative intention, and description generation for masked panels. These tasks progressively introduce the complexity of understanding and utilizing such image-text data. To the best of our knowledge, YManga is the first dataset specifically designed for yonkoma manga strips understanding. Extensive experiments conducted on this dataset reveal significant challenges faced by current multimodal large language models. Our results show a substantial performance gap between models and humans across all three tasks.
In the realm of artificial intelligence and linguistics, the automatic generation of humor, particularly puns, remains a complex task. This paper introduces an innovative approach that employs a Generative Adversarial Network (GAN) and semantic pruning techniques to generate humorous puns. We initiate our process by identifying potential pun candidates via semantic pruning. This is followed by the use of contrastive learning to decode the unique characteristics of puns, emphasizing both correct and incorrect interpretations. The learned features from contrastive learning are utilized within our GAN model to better capture the semantic nuances of puns. Specifically, the generator exploits the pruned semantic tree to generate pun texts, while the discriminator evaluates the generated puns, ensuring both linguistic correctness and humor. Evaluation results highlight our model’s capacity to produce semantically coherent and humorous puns, demonstrating an enhancement over prior methods and approach human-level performance. This work contributes significantly to the field of computational humor, advancing the capabilities of automatic pun generation.
Emotion recognition in conversation (ERC) is essential for dialogue systems to identify the emotions expressed by speakers. Although previous studies have made significant progress, accurate recognition and interpretation of similar fine-grained emotion properly accounting for individual variability remains a challenge. One particular under-explored area is the role of individual beliefs and desires in modelling emotion. Inspired by the Belief-Desire Theory of Emotion, we propose a novel method for conversational emotion recognition that incorporates both belief and desire to accurately identify emotions. We extract emotion-eliciting events from utterances and construct graphs that represent beliefs and desires in conversations. By applying message passing between nodes, our graph effectively models the utterance context, speaker’s global state, and the interaction between emotional beliefs, desires, and utterances. We evaluate our model’s performance by conducting extensive experiments on four popular ERC datasets and comparing it with multiple state-of-the-art models. The experimental results demonstrate the superiority of our proposed model and validate the effectiveness of each module in the model.
With the development of the Internet, social media has produced a large amount of user-generated data, which brings new challenges for humor computing. Traditional humor computing research mainly focuses on the content, while neglecting the information of interaction relationships in social media. In addition, both content and users are important in social media, while existing humor computing research mainly focuses on content rather than people. To address these problems, we model the information transfer and entity interactions in social media as a heterogeneous graph, and create the first dataset which introduces the social context information - HumorWB, which is collected from Chinese social media - Weibo. Two humor-related tasks are designed in the dataset. One is a content-oriented humor recognition task, and the other is a novel humor evaluation task. For the above tasks, we purpose a graph-based model called SCOG, which uses heterogeneous graph neural networks to optimize node representation for downstream tasks. Experimental results demonstrate the effectiveness of feature extraction and graph representation learning methods in the model, as well as the necessity of introducing social context information.
Researchers have attempted to mitigate lexical bias in toxic language detection (TLD). However, existing methods fail to disentangle the “useful” and “misleading” impact of lexical bias on model decisions. Therefore, they do not effectively exploit the positive effects of the bias and lead to a degradation in the detection performance of the debiased model. In this paper, we propose a Counterfactual Causal Debiasing Framework (CCDF) to mitigate lexical bias in TLD. It preserves the “useful impact” of lexical bias and eliminates the “misleading impact”. Specifically, we first represent the total effect of the original sentence and biased tokens on decisions from a causal view. We then conduct counterfactual inference to exclude the direct causal effect of lexical bias from the total effect. Empirical evaluations demonstrate that the debiased TLD model incorporating CCDF achieves state-of-the-art performance in both accuracy and fairness compared to competitive baselines applied on several vanilla models. The generalization capability of our model outperforms current debiased models for out-of-distribution data.
This paper describes our system used in the SemEval-2024 Task 9 on two sub-tasks, BRAINTEASER: A Novel Task Defying Common Sense. In this work, we developed a system SHTL, which means simulate human thinking capabilities by Large Language Model (LLM). Our approach bifurcates into two main components: Common Sense Reasoning and Rationalize Defying Common Sense. To mitigate the hallucinations of LLM, we implemented a strategy that combines Retrieval-augmented Generation (RAG) with the the Self-Adaptive In-Context Learning (SAICL), thereby sufficiently leveraging the powerful language ability of LLM. The effectiveness of our method has been validated by its performance on the test set, with an average performance on two subtasks that is 30.1 higher than ChatGPT setting zero-shot and only 0.8 lower than that of humans.
Recent advancements in Large Language Models (LLMs) have propelled text generation to unprecedented heights, approaching human-level quality. However, it poses a new challenge to distinguish LLM-generated text from human-written text. Presently, most methods address this issue through classification, achieved by fine-tuning on small language models. Unfortunately, small language models suffer from anisotropy issue, where encoded text embeddings become difficult to differentiate in the latent space. Moreover, LLMs possess the ability to alter language styles with versatility, further complicating the classification task. To tackle these challenges, we propose Gated Mixture-of-Experts Fine-tuning (GMoEF) to detect LLM-generated text. GMoEF leverages parametric whitening to normalize text embeddings, thereby mitigating the anisotropy problem. Additionally, GMoEF employs the mixture-of-experts framework equipped with gating router to capture features of LLM-generated text from multiple perspectives. Our GMoEF achieved an impressive ranking of #8 out of 70 teams. The source code is available on https://gitlab.com/sigrs/gmoef.
This paper describes our system used in the SemEval-2024 Task 4 Multilingual Detection of Persuasion Techniques in Memes. Our team proposes a detection system that employs a Teacher Student Fusion framework. Initially, a Large Language Model serves as the teacher, engaging in abductive reasoning on multimodal inputs to generate background knowledge on persuasion techniques, assisting in the training of a smaller downstream model. The student model adopts CLIP as an encoder for text and image features, and we incorporate an attention mechanism for modality alignment. Ultimately, our proposed system achieves a Macro-F1 score of 0.8103, ranking 1st out of 20 on the leaderboard of Subtask 2b in English. In Bulgarian, Macedonian and Arabic, our detection capabilities are ranked 1/15, 3/15 and 14/15.
The development of social platforms has facilitated the proliferation of disinformation, with memes becoming one of the most popular types of propaganda for disseminating disinformation on the internet. Effectively detecting the persuasion techniques hidden within memes is helpful in understanding user-generated content and further promoting the detection of disinformation on the internet. This paper demonstrates the approach proposed by Team DUTIR938 in Subtask 2b of SemEval-2024 Task 4. We propose a dual-channel model based on semi-supervised learning and model ensemble. We utilize CLIP to extract image features, and employ various pretrained language models under task-adaptive pretraining for text feature extraction. To enhance the detection and generalization capabilities of the model, we implement sample data augmentation using semi-supervised pseudo-labeling methods, introduce adversarial training strategies, and design a two-stage global model ensemble strategy. Our proposed method surpasses the provided baseline method, with Macro/Micro F1 values of 0.80910/0.83667 in the English leaderboard. Our submission ranks 3rd/19 in terms of Macro F1 and 1st/19 in terms of Micro F1.
Detecting persuasive communication is an important topic in Natural Language Processing (NLP), as it can be useful in identifying fake information on social media. We have developed a system to identify applied persuasion techniques in text fragments across four languages: English, Bulgarian, North Macedonian, and Arabic. Our system uses data augmentation methods and employs an ensemble strategy that combines the strengths of both RoBERTa and DeBERTa models. Due to limited resources, we concentrated solely on task 1, and our solution achieved the top ranking in the English track during the official assessments. We also analyse the impact of architectural decisions, data constructionand training strategies.
This paper describes our system used in the SemEval-2023 Task 9 Multilingual Tweet Intimacy Analysis. There are two key challenges in this task: the complexity of multilingual and zero-shot cross-lingual learning, and the difficulty of semantic mining of tweet intimacy. To solve the above problems, our system extracts contextual representations from the pretrained language models, XLM-T, and employs various optimization methods, including adversarial training, data augmentation, ordinal regression loss and special training strategy. Our system ranked 14th out of 54 participating teams on the leaderboard and ranked 10th on predicting languages not in the training data. Our code is available on Github.
Sexism is an injustice afflicting women and has become a common form of oppression in social media. In recent years, the automatic detection of sexist instances has been utilized to combat this oppression. The Subtask A of SemEval-2023 Task 10, Explainable Detection of Online Sexism, aims to detect whether an English-language post is sexist. In this paper, we describe our system for the competition. The structure of the classification model is based on RoBERTa, and we further pre-train it on the domain corpus. For fine-tuning, we adopt Unsupervised Data Augmentation (UDA), a semi-supervised learning approach, to improve the robustness of the system. Specifically, we employ Easy Data Augmentation (EDA) method as the noising operation for consistency training. We train multiple models based on different hyperparameter settings and adopt the majority voting method to predict the labels of test entries. Our proposed system achieves a Macro-F1 score of 0.8352 and a ranking of 41/84 on the leaderboard of Subtask A.
Sarcasm, as a form of irony conveying mockery and contempt, has been widespread in social media such as Twitter and Weibo, where the sarcastic text is commonly characterized as an incongruity between the surface positive and negative situation. Naturally, it has an urgent demand to automatically identify sarcasm from social media, so as to illustrate people’s real views toward specific targets. In this paper, we develop a novel sarcasm detection method, namely Sarcasm Detector with Augmentation of Potential Result and Reaction (SD-APRR). Inspired by the direct access view, we treat each sarcastic text as an incomplete version without latent content associated with implied negative situations, including the result and human reaction caused by its observable content. To fill the latent content, we estimate the potential result and human reaction for each given training sample by [xEffect] and [xReact] relations inferred by the pre-trained commonsense reasoning tool COMET, and integrate the sample with them as an augmented one. We can then employ those augmented samples to train the sarcasm detector, whose encoder is a graph neural network with a denoising module. We conduct extensive empirical experiments to evaluate the effectiveness of SD-APRR. The results demonstrate that SD-APRR can outperform strong baselines on benchmark datasets.
The Relational Triple Extraction (RTE) task is a fundamental and essential information extraction task. Recently, the table-filling RTE methods have received lots of attention. Despite their success, they suffer from some inherent problems such as underutilizing regional information of triple. In this work, we treat the RTE task based on table-filling method as an Object Detection task and propose a one-stage Object Detection framework for Relational Triple Extraction (OD-RTE). In this framework, the vertices-based bounding box detection, coupled with auxiliary global relational triple region detection, ensuring that regional information of triple could be fully utilized. Besides, our proposed decoding scheme could extract all types of triples. In addition, the negative sampling strategy of relations in the training stage improves the training efficiency while alleviating the imbalance of positive and negative relations. The experimental results show that 1) OD-RTE achieves the state-of-the-art performance on two widely used datasets (i.e., NYT and WebNLG). 2) Compared with the best performing table-filling method, OD-RTE achieves faster training and inference speed with lower GPU memory usage. To facilitate future research in this area, the codes are publicly available at https://github.com/NingJinzhong/ODRTE.
The widespread dissemination of toxic online posts is increasingly damaging to society. However, research on detecting toxic language in Chinese has lagged significantly due to limited datasets. Existing datasets suffer from a lack of fine-grained annotations, such as the toxic type and expressions with indirect toxicity. These fine-grained annotations are crucial factors for accurately detecting the toxicity of posts involved with lexical knowledge, which has been a challenge for researchers. To tackle this problem, we facilitate the fine-grained detection of Chinese toxic language by building a new dataset with benchmark results. First, we devised Monitor Toxic Frame, a hierarchical taxonomy to analyze the toxic type and expressions. Then, we built a fine-grained dataset ToxiCN, including both direct and indirect toxic samples. ToxiCN is based on an insulting vocabulary containing implicit profanity. We further propose a benchmark model, Toxic Knowledge Enhancement (TKE), by incorporating lexical features to detect toxic language. We demonstrate the usability of ToxiCN and the effectiveness of TKE based on a systematic quantitative and qualitative analysis.
Metaphor is a pervasive aspect of human communication, and its presence in multimodal forms has become more prominent with the progress of mass media. However, there is limited research on multimodal metaphor resources beyond the English language. Furthermore, the existing work in natural language processing does not address the exploration of categorizing the source and target domains in metaphors. This omission is significant considering the extensive research conducted in the fields of cognitive linguistics, which emphasizes that a profound understanding of metaphor relies on recognizing the differences and similarities between domain categories. We, therefore, introduce MultiCMET, a multimodal Chinese metaphor dataset, consisting of 13,820 text-image pairs of advertisements with manual annotations of the occurrence of metaphors, domain categories, and sentiments metaphors convey. We also constructed a domain lexicon that encompasses categorizations of metaphorical source domains and target domains and propose a Cascading Domain Knowledge Integration (CDKI) benchmark to detect metaphors by introducing domain-specific lexical features. Experimental results demonstrate the effectiveness of CDKI. The dataset and code are publicly available.
Ancient Chinese poetry is the earliest literary genre that took shape in Chinese literature and has a dissemination effect, showing China’s profound cultural heritage. At the same time, the generation of ancient poetry is an important task in the field of digital humanities, which is of great significance to the inheritance of national culture and the education of ancient poetry. The current work in the field of poetry generation is mainly aimed at improving the fluency and structural accuracy of words and sentences, ignoring the theme unity of poetry generation results. In order to solve this problem, this paper proposes a graph neural network poetry theme representation model based on label embedding. On the basis of the network representation of poetry, the topic feature representation of poetry is constructed and learned from the granularity of words. Then, the features of the poetry theme representation model are combined with the autoregressive language model to construct a theme-oriented ancient Chinese poetry generation model TLPG (Poetry Generation with Theme Label). Through machine evaluation and evaluation by experts in related fields, the model proposed in this paper has significantly improved the topic consistency of poetry generation compared with existing work on the premise of ensuring the fluency and format accuracy of poetry.
Patronizing and Condescending Language (PCL) towards vulnerable communities in general media has been shown to have potentially harmful effects. Due to its subtlety and the good intentions behind its use, the audience is not aware of the language’s toxicity. In this paper, we present our method for the SemEval-2022 Task4 titled “Patronizing and Condescending Language Detection”. In Subtask A, a binary classification task, we introduce adversarial training based on Fast Gradient Method (FGM) and employ pre-trained model in a unified architecture. For Subtask B, framed as a multi-label classification problem, we utilize various improved multi-label cross-entropy loss functions and analyze the performance of our method. In the final evaluation, our system achieved official rankings of 17/79 and 16/49 on Subtask A and Subtask B, respectively. In addition, we explore the relationship between PCL and emotional polarity and intensity it contains.
Chinese Named Entity Recognition (NER) has continued to attract research attention. However, most existing studies only explore the internal features of the Chinese language but neglect other lingual modal features. Actually, as another modal knowledge of the Chinese language, English contains rich prompts about entities that can potentially be applied to improve the performance of Chinese NER. Therefore, in this study, we explore the bilingual enhancement for Chinese NER and propose a unified bilingual interaction module called the Adapted Cross-Transformers with Global Sparse Attention (ACT-S) to capture the interaction of bilingual information. We utilize a model built upon several different ACT-Ss to integrate the rich English information into the Chinese representation. Moreover, our model can learn the interaction of information between bilinguals (inter-features) and the dependency information within Chinese (intra-features). Compared with existing Chinese NER methods, our proposed model can better handle entities with complex structures. The English text that enhances the model is automatically generated by machine translation, avoiding high labour costs. Experimental results on four well-known benchmark datasets demonstrate the effectiveness and robustness of our proposed model.
Intelligent medical services have attracted great research interests for providing automated medical consultation. However, the lack of corpora becomes a main obstacle to related research, particularly data from real scenarios. In this paper, we construct RealMedDial, a Chinese medical dialogue dataset based on real medical consultation. RealMedDial contains 2,637 medical dialogues and 24,255 utterances obtained from Chinese short-video clips of real medical consultations. We collected and annotated a wide range of meta-data with respect to medical dialogue including doctor profiles, hospital departments, diseases and symptoms for fine-grained analysis on language usage pattern and clinical diagnosis. We evaluate the performance of medical response generation, department routing and doctor recommendation on RealMedDial. Results show that RealMedDial are applicable to a wide range of NLP tasks with respect to medical dialogue.
Metaphor involves not only a linguistic phenomenon, but also a cognitive phenomenon structuring human thought, which makes understanding it challenging. As a means of cognition, metaphor is rendered by more than texts alone, and multimodal information in which vision/audio content is integrated with the text can play an important role in expressing and understanding metaphor. However, previous metaphor processing and understanding has focused on texts, partly due to the unavailability of large-scale datasets with ground truth labels of multimodal metaphor. In this paper, we introduce MultiMET, a novel multimodal metaphor dataset to facilitate understanding metaphorical information from multimodal text and image. It contains 10,437 text-image pairs from a range of sources with multimodal annotations of the occurrence of metaphors, domain relations, sentiments metaphors convey, and author intents. MultiMET opens the door to automatic metaphor understanding by investigating multimodal cues and their interplay. Moreover, we propose a range of strong baselines and show the importance of combining multimodal cues for metaphor understanding. MultiMET will be released publicly for research.
The wanton spread of hate speech on the internet brings great harm to society and families. It is urgent to establish and improve automatic detection and active avoidance mechanisms for hate speech. While there exist methods for hate speech detection, they stereotype words and hence suffer from inherently biased training. In other words, getting more affective features from other affective resources will significantly affect the performance of hate speech detection. In this paper, we propose a hate speech detection framework based on sentiment knowledge sharing. While extracting the affective features of the target sentence itself, we make better use of the sentiment features from external resources, and finally fuse features from different feature extraction units to detect hate speech. Experimental results on two public datasets demonstrate the effectiveness of our model.
Although researches on word embeddings have made great progress in recent years, many tasks in natural language processing are on the sentence level. Thus, it is essential to learn sentence embeddings. Recently, Sentence BERT (SBERT) is proposed to learn embeddings on the sentence level, and it uses the inner product (or, cosine similarity) to compute semantic similarity between sentences. However, this measurement cannot well describe the semantic structures among sentences. The reason is that sentences may lie on a manifold in the ambient space rather than distribute in an Euclidean space. Thus, cosine similarity cannot approximate distances on the manifold. To tackle the severe problem, we propose a novel sentence embedding method called Sentence BERT with Locality Preserving (SBERT-LP), which discovers the sentence submanifold from a high-dimensional space and yields a compact sentence representation subspace by locally preserving geometric structures of sentences. We compare the SBERT-LP with several existing sentence embedding approaches from three perspectives: sentence similarity, sentence classification and sentence clustering. Experimental results and case studies demonstrate that our method encodes sentences better in the sense of semantic structures.
Recent metaphor identification approaches mainly consider the contextual text features within a sentence or introduce external linguistic features to the model. But they usually ignore the extra information that the data can provide, such as the contextual metaphor information and broader discourse information. In this paper, we propose a model augmented with hierarchical contextualized representation to extract more information from both sentence-level and discourse-level. At the sentence level, we leverage the metaphor information of words that except the target word in the sentence to strengthen the reasoning ability of our model via a novel label-enhanced contextualized representation. At the discourse level, the position-aware global memory network is adopted to learn long-range dependency among the same words within a discourse. Finally, our model combines the representations obtained from these two parts. The experiment results on two tasks of the VUA dataset show that our model outperforms every other state-of-the-art method that also does not use any external knowledge except what the pre-trained language model contains.
In recent years, the plentiful information contained in Chinese legal documents has attracted a great deal of attention because of the large-scale release of the judgment documents on China Judgments Online. It is in great need of enabling machines to understand the semantic information stored in the documents which are transcribed in the form of natural language. The technique of information extraction provides a way of mining the valuable information implied in the unstructured judgment documents. We propose a Legal Triplet Extraction System for drug-related criminal judgment documents. The system extracts the entities and the semantic relations jointly and benefits from the proposed legal lexicon feature and multi-task learning framework. Furthermore, we manually annotate a dataset for Named Entity Recognition and Relation Extraction in Chinese legal domain, which contributes to training supervised triplet extraction models and evaluating the model performance. Our experimental results show that the legal feature introduction and multi-task learning framework are feasible and effective for the Legal Triplet Extraction System. The F1 score of triplet extraction finally reaches 0.836 on the legal dataset.
In our daily life, metaphor is a common way of expression. To understand the meaning of a metaphor, we should recognize the metaphor words which play important roles. In the metaphor detection task, we design a sequence labeling model based on ALBERT-LSTM-softmax. By applying this model, we carry out a lot of experiments and compare the experimental results with different processing methods, such as with different input sentences and tokens, or the methods with CRF and softmax. Then, some tricks are adopted to improve the experimental results. Finally, our model achieves a 0.707 F1-score for the all POS subtask and a 0.728 F1-score for the verb subtask on the TOEFL dataset.
Humor plays important role in human communication, which makes it important problem for natural language processing. Prior work on the analysis of humor focuses on whether text is humorous or not, or the degree of funniness, but this is insufficient to explain why it is funny. We therefore create a dataset on humor with 9,123 manually annotated jokes in Chinese. We propose a novel annotation scheme to give scenarios of how humor arises in text. Specifically, our annotations of linguistic humor not only contain the degree of funniness, like previous work, but they also contain key words that trigger humor as well as character relationship, scene, and humor categories. We report reasonable agreement between annota-tors. We also conduct an analysis and exploration of the dataset. To the best of our knowledge, we are the first to approach humor annotation for exploring the underlying mechanism of the use of humor, which may contribute to a significantly deeper analysis of humor. We also contribute with a scarce and valuable dataset, which we will release publicly.
Metaphors are frequently used to convey emotions. However, there is little research on the construction of metaphor corpora annotated with emotion for the analysis of emotionality of metaphorical expressions. Furthermore, most studies focus on English, and few in other languages, particularly Sino-Tibetan languages such as Chinese, for emotion analysis from metaphorical texts, although there are likely to be many differences in emotional expressions of metaphorical usages across different languages. We therefore construct a significant new corpus on metaphor, with 5,605 manually annotated sentences in Chinese. We present an annotation scheme that contains annotations of linguistic metaphors, emotional categories (joy, anger, sadness, fear, love, disgust and surprise), and intensity. The annotation agreement analyses for multiple annotators are described. We also use the corpus to explore and analyze the emotionality of metaphors. To the best of our knowledge, this is the first relatively large metaphor corpus with an annotation of emotions in Chinese.
Homographic puns have a long history in human writing, widely used in written and spoken literature, which usually occur in a certain syntactic or stylistic structure. How to recognize homographic puns is an important research. However, homographic pun recognition does not solve very well in existing work. In this work, we first use WordNet to understand and expand word embedding for settling the polysemy of homographic puns, and then propose a WordNet-Encoded Collocation-Attention network model (WECA) which combined with the context weights for recognizing the puns. Our experiments on the SemEval2017 Task7 and Pun of the Day demonstrate that the proposed model is able to distinguish between homographic pun and non-homographic pun texts. We show the effectiveness of the model to present the capability of choosing qualitatively informative words. The results show that our model achieves the state-of-the-art performance on homographic puns recognition.