Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
Dmitry Ustalov | Swapna Somasundaran | Alexander Panchenko | Fragkiskos D. Malliaros | Ioana Hulpuș | Peter Jansen | Abhik Jana
Knowledge graphs (KGs) of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge graphs are typically incomplete, it is useful to perform knowledge graph completion or link prediction, i.e. predict whether a relationship not in the knowledge graph is likely to be true. This paper serves as a comprehensive survey of embedding models of entities and relationships for knowledge graph completion, summarizing up-to-date experimental results on standard benchmark datasets and pointing out potential future research directions.
Entity Resolution (ER) identifies records that refer to the same real-world entity. Deep learning approaches improved the generalization ability of entity matching models, but hardly overcame the impact of noisy or incomplete data sources. In real scenes, an entity usually consists of multiple semantic facets, called aspects. In this paper, we focus on entity augmentation, namely retrieving the values of missing aspects. The relationship between aspects is naturally suitable to be represented by a knowledge graph, where entity augmentation can be modeled as a link prediction problem. Our paper proposes a novel graph-based approach to solve entity augmentation. Specifically, we apply a dedicated random walk algorithm, which uses node types to limit the traversal length, and encodes graph structure into low-dimensional embeddings. Thus, the missing aspects could be retrieved by a link prediction model. Furthermore, the augmented aspects with fixed orders are served as the input of a deep Siamese BiLSTM network for entity matching. We compared our method with state-of-the-art methods through extensive experiments on downstream ER tasks. According to the experiment results, our model outperforms other methods on evaluation metrics (accuracy, precision, recall, and f1-score) to a large extent, which demonstrates the effectiveness of our method.
Named entity recognition (NER) from visual documents, such as invoices, receipts or business cards, is a critical task for visual document understanding. Most classical approaches use a sequence-based model (typically BiLSTM-CRF framework) without considering document structure. Recent work on graph-based model using graph convolutional networks to encode visual and textual features have achieved promising performance on the task. However, few attempts take geometry information of text segments (text in bounding box) in visual documents into account. Meanwhile, existing methods do not consider that related text segments which need to be merged to form a complete entity in many real-world situations. In this paper, we present GraphNEMR, a graph-based model that uses graph convolutional networks to jointly merge text segments and recognize named entities. By incorporating geometry information from visual documents into our model, richer 2D context information is generated to improve document representations. To merge text segments, we introduce a novel mechanism that captures both geometry information as well as semantic information based on pre-trained language model. Experimental results show that the proposed GraphNEMR model outperforms both sequence-based and graph-based SOTA methods significantly.
Graph-based semi-supervised learning is appealing when labels are scarce but large amounts of unlabeled data are available. These methods typically use a heuristic strategy to construct the graph based on some fixed data representation, independently of the available labels. In this pa- per, we propose to jointly learn a data representation and a graph from both labeled and unlabeled data such that (i) the learned representation indirectly encodes the label information injected into the graph, and (ii) the graph provides a smooth topology with respect to the transformed data. Plugging the resulting graph and representation into existing graph-based semi-supervised learn- ing algorithms like label spreading and graph convolutional networks, we show that our approach outperforms standard graph construction methods on both synthetic data and real datasets.
BERT is a popular language model whose main pre-training task is to fill in the blank, i.e., predicting a word that was masked out of a sentence, based on the remaining words. In some applications, however, having an additional context can help the model make the right prediction, e.g., by taking the domain or the time of writing into account. This motivates us to advance the BERT architecture by adding a global state for conditioning on a fixed-sized context. We present our two novel approaches and apply them to an industry use-case, where we complete fashion outfits with missing articles, conditioned on a specific customer. An experimental comparison to other methods from the literature shows that our methods improve personalization significantly.
Word Sense Disambiguation (WSD) is a well-known problem in the natural language processing. In recent years, there has been increasing interest in applying neural net-works and machine learning techniques to solve WSD problems. However, these previ-ous supervised approaches often suffer from the lack of manually sense-tagged exam-ples. In this paper, to solve these problems, we propose a semi-supervised WSD method using graph embeddings based learning method in order to make effective use of labeled and unlabeled examples. The results of the experiments show that the proposed method performs better than the previous semi-supervised WSD method. Moreover, the graph structure between examples is effective for WSD and it is effective to utilize a graph structure obtained by fine-tuning BERT in the proposed method.
We present a novel method for injecting temporality into entailment graphs to address the problem of spurious entailments, which may arise from similar but temporally distinct events involving the same pair of entities. We focus on the sports domain in which the same pairs of teams play on different occasions, with different outcomes. We present an unsupervised model that aims to learn entailments such as win/lose → play, while avoiding the pitfall of learning non-entailments such as win ̸→ lose. We evaluate our model on a manually constructed dataset, showing that incorporating time intervals and applying a temporal window around them, are effective strategies.
We propose a simple and efficient framework to learn syntactic embeddings based on information derived from constituency parse trees. Using biased random walk methods, our embeddings not only encode syntactic information about words, but they also capture contextual information. We also propose a method to train the embeddings on multiple constituency parse trees to ensure the encoding of global syntactic representation. Quantitative evaluation of the embeddings show a competitive performance on POS tagging task when compared to other types of embeddings, and qualitative evaluation reveals interesting facts about the syntactic typology learned by these embeddings.
We propose an open-world knowledge graph completion model that can be combined with common closed-world approaches (such as ComplEx) and enhance them to exploit text-based representations for entities unseen in training. Our model learns relation-specific transformation functions from text-based to graph-based embedding space, where the closed-world link prediction model can be applied. We demonstrate state-of-the-art results on common open-world benchmarks and show that our approach benefits from relation-specific transformation functions (RST), giving substantial improvements over a relation-agnostic approach.
The 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating large detailed multi-fact explanations for standardized science exam questions. Given a question, correct answer, and knowledge base, models must rank each fact in the knowledge base such that facts most likely to appear in the explanation are ranked highest. Explanations consist of an average of 6 (and as many as 16) facts that span both core scientific knowledge and world knowledge, and form an explicit lexically-connected “explanation graph” describing how the facts interrelate. In this second iteration of the explanation regeneration shared task, participants are supplied with more than double the training and evaluation data of the first shared task, as well as a knowledge base nearly double in size, both of which expand into more challenging scientific topics that increase the difficulty of the task. In total 10 teams participated, and 5 teams submitted system description papers. The best-performing teams significantly increased state-of-the-art performance both in terms of ranking (mean average precision) and inference speed on this challenge task.
This paper describes the system designed by the Baidu PGL Team which achieved the first place in the TextGraphs 2020 Shared Task. The task focuses on generating explanations for elementary science questions. Given a question and its corresponding correct answer, we are asked to select the facts that can explain why the answer is correct for the question and answering (QA) from a large knowledge base. To address this problem, we use a pre-trained language model to recall the top-K relevant explanations for each question. Then, we adopt a re-ranking approach based on a pre-trained language model to rank the candidate explanations. To further improve the rankings, we also develop an architecture consisting both powerful pre-trained transformers and GNNs to tackle the multi-hop inference problem. The official evaluation shows that, our system can outperform the second best system by 1.91 points.
In this work, we describe the system developed by a group of undergraduates from the Indian Institutes of Technology for the Shared Task at TextGraphs-14 on Multi-Hop Inference Explanation Regeneration (Jansen and Ustalov, 2020). The shared task required participants to develop methods to reconstruct gold explanations for elementary science questions from the WorldTreeCorpus (Xie et al., 2020). Although our research was not funded by any organization and all the models were trained on freely available tools like Google Colab, which restricted our computational capabilities, we have managed to achieve noteworthy results, placing ourselves in 4th place with a MAPscore of 0.49021in the evaluation leaderboard and 0.5062 MAPscore on the post-evaluation-phase leaderboard using RoBERTa. We incorporated some of the methods proposed in the previous edition of Textgraphs-13 (Chia et al., 2019), which proved to be very effective, improved upon them, and built a model on top of it using powerful state-of-the-art pre-trained language models like RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), SciB-ERT (Beltagy et al., 2019) among others. Further optimization of our work can be done with the availability of better computational resources.
Textgraphs 2020 Workshop organized a shared task on ‘Explanation Regeneration’ that required reconstructing gold explanations for elementary science questions. This work describes our submission to the task which is based on multiple components: a BERT baseline ranking, an Integer Linear Program (ILP) based re-scoring and a regression model for re-ranking the explanation facts. Our system achieved a Mean Average Precision score of 0.3659.
Explainable question answering for science questions is a challenging task that requires multi-hop inference over a large set of fact sentences. To counter the limitations of methods that view each query-document pair in isolation, we propose the LSTM-Interleaved Transformer which incorporates cross-document interactions for improved multi-hop ranking. The LIT architecture can leverage prior ranking positions in the re-ranking setting. Our model is competitive on the current leaderboard for the TextGraphs 2020 shared task, achieving a test-set MAP of 0.5607, and would have gained third place had we submitted before the competition deadline. Our code implementation is made available at [https://github.com/mdda/worldtree_corpus/tree/textgraphs_2020](https://github.com/mdda/worldtree_corpus/tree/textgraphs_2020).