<?xml version="1.0" encoding="UTF-8"?>
<algorithms version="110505">
<algorithm name="ParsCit" version="110505">
<citationList>
<citation valid="true">
<authors>
<author>Michele Banko</author>
<author>Michael J Cafarella</author>
<author>Stephen Soderland</author>
<author>Matthew Broadhead</author>
<author>Oren Etzioni</author>
</authors>
<title>Open information extraction from the web.</title>
<date>2007</date>
<booktitle>In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).</booktitle>
<contexts>
<context position="17889" citStr="Banko et al. (2007)" startWordPosition="2827" endWordPosition="2830">st corpus is given, together with the number of relations that could be extracted additionally from the test corpus. e.g., in a sentence like The exhibition [...] shows &lt;PER&gt;Clemens Brentano&lt;\PER&gt;, &lt;PER&gt;Achim von Arnim&lt;\PER&gt; and &lt;PER&gt;Heinrich von Kleist&lt;\PER&gt;, and between NEs occurring in the same (complex) argument, e.g., &lt;PER&gt;Hanns Peter Nerger&lt;\PER&gt;, CEO of &lt;ORG&gt;Berlin Tourismus Marketing GmbH (BTM) &lt;\ORG&gt;, sums it up [...]. 5. Related work Our work is related to previous work on domainindependent unsupervised relation extraction, in particular Sekine (2006), Shinyama and Sekine (2006) and Banko et al. (2007). Sekine (2006) introduces On-demand information extraction, which aims at automatically identifying salient patterns and extracting relations based on these patterns. He retrieves relevant documents from a newspaper corpus based on a query and applies a POS tagger, a dependency analyzer and an extended NE tagger. Using the information from the taggers, he extracts patterns and applies paraphrase recognition to create sets of semantically similar patterns. Shinyama and Sekine (2006) apply NER, coreference resolution and parsing to a corpus of newspaper articles to extract two-place relations b</context>
<context position="19356" citStr="Banko et al. (2007)" startWordPosition="3058" endWordPosition="3061">luster the relations. However, only relations among the five most highly-weighted entities in a cluster are extracted and only the first ten sentences of each article are taken into account. Banko et al. (2007) use a much larger corpus, namely 9 million web pages, to extract all relations between noun phrases. Due to the large amount of data, they apply POS tagging only. Their output consists of millions of relations, most of them being abstract assertions such as (executive, hired by, company) rather than concrete facts. Our approach can be regarded as a combination of these approaches: Like Banko et al. (2007), we extract relations from noisy web documents rather than comparably homogeneous news articles. However, rather than extracting relations from millions of pages we reduce the size of our corpus beforehand using a query in order to be able to apply more linguistic preprocessing. Like Sekine (2006) and Shinyama and Sekine (2006), we concentrate on relations involving NEs, the assumption being that these relations are the potentially interesting ones. The relation clustering step allows us to group similar relations, which can, for example, be useful for the generation of answers in a Question </context>
</contexts>
<marker>Banko, Cafarella, Soderland, Broadhead, Etzioni, 2007</marker>
<rawString>Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).</rawString>
</citation>
<citation valid="true">
<authors>
<author>Andrea Heyl</author>
</authors>
<title>Unsupervised relation extraction.</title>
<date>2008</date>
<note>To appear. Master’s thesis.</note>
<institution>Saarland University.</institution>
<marker>Heyl, 2008</marker>
<rawString>Andrea Heyl. to appear 2008. Unsupervised relation extraction. Master’s thesis, Saarland University.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Lc4j</author>
</authors>
<title>Language categorization library for Java.</title>
<date>2007</date>
<note>http://www.olivo.net/software/lc4j/.</note>
<marker>Lc4j, 2007</marker>
<rawString>Lc4j. 2007. Language categorization library for Java. http://www.olivo.net/software/lc4j/.</rawString>
</citation>
<citation valid="true">
<authors>
<author>LingPipe</author>
</authors>
<date>2007</date>
<note>http://www.alias-i.com/lingpipe/.</note>
<contexts>
<context position="7069" citStr="LingPipe, 2007" startWordPosition="1112" endWordPosition="1114">ion when downloading the documents. However, this does not prevent some documents written in a language other than our target language (English) from entering our corpus. In addition, some web sites contain text written in several languages. In order to restrict the processing to sentences written in English, we apply a language guesser tool, lc4j (Lc4j, 2007) and remove sentences not classified as written in English. This reduces errors on the following levels of processing. We also remove sentences that only contain non-alphanumeric characters. To all remaining sentences, we apply LingPipe (LingPipe, 2007) for sentence boundary detection, named entity recognition (NER) and coreference resolution. As a result of this step database tables are created, containing references to the original document, sentences and detected named entities (NEs). 2.2. Relation extraction Relation extraction is done on the basis of parsing potentially relevant sentences. We define a sentence to be of potential relevance if it at least contains two NEs. In the first step, so-called skeletons (simplified dependency trees) are extracted. To build the skeletons, the Stanford parser (Stanford Parser, 2007) is used to gener</context>
</contexts>
<marker>LingPipe, 2007</marker>
<rawString>LingPipe. 2007. http://www.alias-i.com/lingpipe/.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Satoshi Sekine</author>
</authors>
<title>On-demand information extraction.</title>
<date>2006</date>
<booktitle>In ACL.</booktitle>
<publisher>The Association for Computer Linguistics.</publisher>
<marker>Sekine, 2006</marker>
<rawString>Satoshi Sekine. 2006. On-demand information extraction. In ACL. The Association for Computer Linguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Yusuke Shinyama</author>
<author>Satoshi Sekine</author>
</authors>
<title>Preemptive information extraction using unrestricted relation discovery.</title>
<date>2006</date>
<booktitle>In Proc. of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics.</booktitle>
<pages>304--311</pages>
<publisher>Association for Computational Linguistics.</publisher>
<contexts>
<context position="17865" citStr="Shinyama and Sekine (2006)" startWordPosition="2822" endWordPosition="2825">an example sentence from the test corpus is given, together with the number of relations that could be extracted additionally from the test corpus. e.g., in a sentence like The exhibition [...] shows &lt;PER&gt;Clemens Brentano&lt;\PER&gt;, &lt;PER&gt;Achim von Arnim&lt;\PER&gt; and &lt;PER&gt;Heinrich von Kleist&lt;\PER&gt;, and between NEs occurring in the same (complex) argument, e.g., &lt;PER&gt;Hanns Peter Nerger&lt;\PER&gt;, CEO of &lt;ORG&gt;Berlin Tourismus Marketing GmbH (BTM) &lt;\ORG&gt;, sums it up [...]. 5. Related work Our work is related to previous work on domainindependent unsupervised relation extraction, in particular Sekine (2006), Shinyama and Sekine (2006) and Banko et al. (2007). Sekine (2006) introduces On-demand information extraction, which aims at automatically identifying salient patterns and extracting relations based on these patterns. He retrieves relevant documents from a newspaper corpus based on a query and applies a POS tagger, a dependency analyzer and an extended NE tagger. Using the information from the taggers, he extracts patterns and applies paraphrase recognition to create sets of semantically similar patterns. Shinyama and Sekine (2006) apply NER, coreference resolution and parsing to a corpus of newspaper articles to extra</context>
<context position="19686" citStr="Shinyama and Sekine (2006)" startWordPosition="3111" endWordPosition="3114">large amount of data, they apply POS tagging only. Their output consists of millions of relations, most of them being abstract assertions such as (executive, hired by, company) rather than concrete facts. Our approach can be regarded as a combination of these approaches: Like Banko et al. (2007), we extract relations from noisy web documents rather than comparably homogeneous news articles. However, rather than extracting relations from millions of pages we reduce the size of our corpus beforehand using a query in order to be able to apply more linguistic preprocessing. Like Sekine (2006) and Shinyama and Sekine (2006), we concentrate on relations involving NEs, the assumption being that these relations are the potentially interesting ones. The relation clustering step allows us to group similar relations, which can, for example, be useful for the generation of answers in a Question Answering system. 6. Future work Since many errors were due to the noisiness of the arbitrarily downloaded web documents, a more sophisticated filtering step for extracting relevant textual information from web sites before applying NE recognition, parsing, etc. is likely to improve the performance of the system. The NER compone</context>
</contexts>
<marker>Shinyama, Sekine, 2006</marker>
<rawString>Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proc. of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 304–311. Association for Computational Linguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Stanford Parser</author>
</authors>
<date>2007</date>
<note>http://nlp.stanford.edu/downloads/lex-parser.shtml.</note>
<contexts>
<context position="7652" citStr="Parser, 2007" startWordPosition="1201" endWordPosition="1202">pply LingPipe (LingPipe, 2007) for sentence boundary detection, named entity recognition (NER) and coreference resolution. As a result of this step database tables are created, containing references to the original document, sentences and detected named entities (NEs). 2.2. Relation extraction Relation extraction is done on the basis of parsing potentially relevant sentences. We define a sentence to be of potential relevance if it at least contains two NEs. In the first step, so-called skeletons (simplified dependency trees) are extracted. To build the skeletons, the Stanford parser (Stanford Parser, 2007) is used to generate dependency trees for the potentially relevant sentences. For each NE pair in a sentence, the common root element in the corresponding tree is identified and the elements from each of the NEs to the root are collected. An example of a skeleton is shown in Figure 5. In the second step, information based on dependency types is extracted for the potentially relevant sentences. Focusing on verb relations (this can be extended to other types of relations), we collect for each verb its subject(s), object(s), preposition(s) with arguments and auxiliary verb(s). We can now extract </context>
</contexts>
<marker>Stanford Parser, 2007</marker>
<rawString>Stanford Parser. 2007. http://nlp.stanford.edu/downloads/lex-parser.shtml.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Ian H Witten</author>
<author>Eibe Frank</author>
</authors>
<title>Data Mining: Practical machine learning tools and techniques.</title>
<date>2005</date>
<publisher>Morgan Kaufmann, San Francisco, 2nd edition.</publisher>
<contexts>
<context position="9534" citStr="Witten and Frank, 2005" startWordPosition="1486" endWordPosition="1489">cuments+ NE tables skeletons + sov−relations filtering of relevant sentences syntactic + typed dependency parsing Relation extraction table of clustered relations relation filtering clustering Relation clustering Preprocessing web documents document retrieval topic specific documents conversion plain text documents sentence boundary detection, NE recognition, coreference language filtering resolution The comparably large amount of data in the corpus requires the use of an efficient clustering algorithm. Standard ML clustering algorithms such as k-means and EM (as provided by the Weka toolbox (Witten and Frank, 2005)) have been tested for clustering the relations at hand but were not able to deal with the large number of features and instances required for an adequate representation of our dataset. We thus decided to use a scoring algorithm that compares a relation to other relations based on certain aspects and calculates a similarity score. If this similarity score exceeds a predefined threshold, two relations are grouped together. Similarity is measured based on the output from the different preprocessing steps as well as lexical information from WordNet (WordNet, 2007): • WordNet: WordNet information </context>
</contexts>
<marker>Witten, Frank, 2005</marker>
<rawString>Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition.</rawString>
</citation>
<citation valid="true">
<authors>
<author>WordNet</author>
</authors>
<date>2007</date>
<note>http://wordnet.princeton.edu/.</note>
<marker>WordNet, 2007</marker>
<rawString>WordNet. 2007. http://wordnet.princeton.edu/.</rawString>
</citation>
</citationList>
</algorithm>
</algorithms>