<?xml version="1.0" encoding="UTF-8"?>
<algorithms version="110505">
<algorithm name="ParsCit" version="110505">
<citationList>
<citation valid="true">
<authors>
<author>Michele Banko</author>
<author>Michael J Cafarella</author>
<author>Stephen Soderland</author>
<author>Matthew Broadhead</author>
<author>Oren Etzioni</author>
</authors>
<title>Open information extraction from the web.</title>
<date>2007</date>
<booktitle>In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).</booktitle>
<contexts>
<context position="17889" citStr="Banko et al. (2007)" startWordPosition="2827" endWordPosition="2830">st corpus is given, together with the number of relations that could be extracted additionally from the test corpus. e.g., in a sentence like The exhibition [...] shows &lt;PER&gt;Clemens Brentano&lt;\PER&gt;, &lt;PER&gt;Achim von Arnim&lt;\PER&gt; and &lt;PER&gt;Heinrich von Kleist&lt;\PER&gt;, and between NEs occurring in the same (complex) argument, e.g., &lt;PER&gt;Hanns Peter Nerger&lt;\PER&gt;, CEO of &lt;ORG&gt;Berlin Tourismus Marketing GmbH (BTM) &lt;\ORG&gt;, sums it up [...]. 5. Related work Our work is related to previous work on domainindependent unsupervised relation extraction, in particular Sekine (2006), Shinyama and Sekine (2006) and Banko et al. (2007). Sekine (2006) introduces On-demand information extraction, which aims at automatically identifying salient patterns and extracting relations based on these patterns. He retrieves relevant documents from a newspaper corpus based on a query and applies a POS tagger, a dependency analyzer and an extended NE tagger. Using the information from the taggers, he extracts patterns and applies paraphrase recognition to create sets of semantically similar patterns. Shinyama and Sekine (2006) apply NER, coreference resolution and parsing to a corpus of newspaper articles to extract two-place relations b</context>
<context position="19356" citStr="Banko et al. (2007)" startWordPosition="3058" endWordPosition="3061">luster the relations. However, only relations among the five most highly-weighted entities in a cluster are extracted and only the first ten sentences of each article are taken into account. Banko et al. (2007) use a much larger corpus, namely 9 million web pages, to extract all relations between noun phrases. Due to the large amount of data, they apply POS tagging only. Their output consists of millions of relations, most of them being abstract assertions such as (executive, hired by, company) rather than concrete facts. Our approach can be regarded as a combination of these approaches: Like Banko et al. (2007), we extract relations from noisy web documents rather than comparably homogeneous news articles. However, rather than extracting relations from millions of pages we reduce the size of our corpus beforehand using a query in order to be able to apply more linguistic preprocessing. Like Sekine (2006) and Shinyama and Sekine (2006), we concentrate on relations involving NEs, the assumption being that these relations are the potentially interesting ones. The relation clustering step allows us to group similar relations, which can, for example, be useful for the generation of answers in a Question </context>
</contexts>
<marker>Banko, Cafarella, Soderland, Broadhead, Etzioni, 2007</marker>
<rawString>Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).</rawString>
</citation>
<citation valid="true">
<authors>
<author>Andrea Heyl</author>
</authors>
<title>Unsupervised relation extraction.</title>
<date>2008</date>
<note>To appear. Master’s thesis.</note>
<institution>Saarland University.</institution>
<marker>Heyl, 2008</marker>
<rawString>Andrea Heyl. to appear 2008. Unsupervised relation extraction. Master’s thesis, Saarland University.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Lc4j</author>
</authors>
<title>Language categorization library for Java.</title>
<date>2007</date>
<note>http://www.olivo.net/software/lc4j/.</note>
<marker>Lc4j, 2007</marker>
<rawString>Lc4j. 2007. Language categorization library for Java. http://www.olivo.net/software/lc4j/.</rawString>
</citation>
<citation valid="true">
<authors>
<author>LingPipe</author>
</authors>
<date>2007</date>
<note>http://www.alias-i.com/lingpipe/.</note>
<contexts>
<context position="7069" citStr="LingPipe, 2007" startWordPosition="1112" endWordPosition="1114">ion when downloading the documents. However, this does not prevent some documents written in a language other than our target language (English) from entering our corpus. In addition, some web sites contain text written in several languages. In order to restrict the processing to sentences written in English, we apply a language guesser tool, lc4j (Lc4j, 2007) and remove sentences not classified as written in English. This reduces errors on the following levels of processing. We also remove sentences that only contain non-alphanumeric characters. To all remaining sentences, we apply LingPipe (LingPipe, 2007) for sentence boundary detection, named entity recognition (NER) and coreference resolution. As a result of this step database tables are created, containing references to the original document, sentences and detected named entities (NEs). 2.2. Relation extraction Relation extraction is done on the basis of parsing potentially relevant sentences. We define a sentence to be of potential relevance if it at least contains two NEs. In the first step, so-called skeletons (simplified dependency trees) are extracted. To build the skeletons, the Stanford parser (Stanford Parser, 2007) is used to gener</context>
</contexts>
<marker>LingPipe, 2007</marker>
<rawString>LingPipe. 2007. http://www.alias-i.com/lingpipe/.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Satoshi Sekine</author>
</authors>
<title>On-demand information extraction.</title>
<date>2006</date>
<booktitle>In ACL.</booktitle>
<publisher>The Association for Computer Linguistics.</publisher>
<marker>Sekine, 2006</marker>
<rawString>Satoshi Sekine. 2006. On-demand information extraction. In ACL. The Association for Computer Linguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Yusuke Shinyama</author>
<author>Satoshi Sekine</author>
</authors>
<title>Preemptive information extraction using unrestricted relation discovery.</title>
<date>2006</date>
<booktitle>In Proc. of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics.</booktitle>
<pages>304--311</pages>
<publisher>Association for Computational Linguistics.</publisher>
<contexts>
<context position="17865" citStr="Shinyama and Sekine (2006)" startWordPosition="2822" endWordPosition="2825">an example sentence from the test corpus is given, together with the number of relations that could be extracted additionally from the test corpus. e.g., in a sentence like The exhibition [...] shows &lt;PER&gt;Clemens Brentano&lt;\PER&gt;, &lt;PER&gt;Achim von Arnim&lt;\PER&gt; and &lt;PER&gt;Heinrich von Kleist&lt;\PER&gt;, and between NEs occurring in the same (complex) argument, e.g., &lt;PER&gt;Hanns Peter Nerger&lt;\PER&gt;, CEO of &lt;ORG&gt;Berlin Tourismus Marketing GmbH (BTM) &lt;\ORG&gt;, sums it up [...]. 5. Related work Our work is related to previous work on domainindependent unsupervised relation extraction, in particular Sekine (2006), Shinyama and Sekine (2006) and Banko et al. (2007). Sekine (2006) introduces On-demand information extraction, which aims at automatically identifying salient patterns and extracting relations based on these patterns. He retrieves relevant documents from a newspaper corpus based on a query and applies a POS tagger, a dependency analyzer and an extended NE tagger. Using the information from the taggers, he extracts patterns and applies paraphrase recognition to create sets of semantically similar patterns. Shinyama and Sekine (2006) apply NER, coreference resolution and parsing to a corpus of newspaper articles to extra</context>
<context position="19686" citStr="Shinyama and Sekine (2006)" startWordPosition="3111" endWordPosition="3114">large amount of data, they apply POS tagging only. Their output consists of millions of relations, most of them being abstract assertions such as (executive, hired by, company) rather than concrete facts. Our approach can be regarded as a combination of these approaches: Like Banko et al. (2007), we extract relations from noisy web documents rather than comparably homogeneous news articles. However, rather than extracting relations from millions of pages we reduce the size of our corpus beforehand using a query in order to be able to apply more linguistic preprocessing. Like Sekine (2006) and Shinyama and Sekine (2006), we concentrate on relations involving NEs, the assumption being that these relations are the potentially interesting ones. The relation clustering step allows us to group similar relations, which can, for example, be useful for the generation of answers in a Question Answering system. 6. Future work Since many errors were due to the noisiness of the arbitrarily downloaded web documents, a more sophisticated filtering step for extracting relevant textual information from web sites before applying NE recognition, parsing, etc. is likely to improve the performance of the system. The NER compone</context>
</contexts>
<marker>Shinyama, Sekine, 2006</marker>
<rawString>Yusuke Shinyama and Satoshi Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proc. of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 304–311. Association for Computational Linguistics.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Stanford Parser</author>
</authors>
<date>2007</date>
<note>http://nlp.stanford.edu/downloads/lex-parser.shtml.</note>
<contexts>
<context position="7652" citStr="Parser, 2007" startWordPosition="1201" endWordPosition="1202">pply LingPipe (LingPipe, 2007) for sentence boundary detection, named entity recognition (NER) and coreference resolution. As a result of this step database tables are created, containing references to the original document, sentences and detected named entities (NEs). 2.2. Relation extraction Relation extraction is done on the basis of parsing potentially relevant sentences. We define a sentence to be of potential relevance if it at least contains two NEs. In the first step, so-called skeletons (simplified dependency trees) are extracted. To build the skeletons, the Stanford parser (Stanford Parser, 2007) is used to generate dependency trees for the potentially relevant sentences. For each NE pair in a sentence, the common root element in the corresponding tree is identified and the elements from each of the NEs to the root are collected. An example of a skeleton is shown in Figure 5. In the second step, information based on dependency types is extracted for the potentially relevant sentences. Focusing on verb relations (this can be extended to other types of relations), we collect for each verb its subject(s), object(s), preposition(s) with arguments and auxiliary verb(s). We can now extract </context>
</contexts>
<marker>Stanford Parser, 2007</marker>
<rawString>Stanford Parser. 2007. http://nlp.stanford.edu/downloads/lex-parser.shtml.</rawString>
</citation>
<citation valid="true">
<authors>
<author>Ian H Witten</author>
<author>Eibe Frank</author>
</authors>
<title>Data Mining: Practical machine learning tools and techniques.</title>
<date>2005</date>
<publisher>Morgan Kaufmann, San Francisco, 2nd edition.</publisher>
<contexts>
<context position="9534" citStr="Witten and Frank, 2005" startWordPosition="1486" endWordPosition="1489">cuments+ NE tables skeletons + sov−relations filtering of relevant sentences syntactic + typed dependency parsing Relation extraction table of clustered relations relation filtering clustering Relation clustering Preprocessing web documents document retrieval topic specific documents conversion plain text documents sentence boundary detection, NE recognition, coreference language filtering resolution The comparably large amount of data in the corpus requires the use of an efficient clustering algorithm. Standard ML clustering algorithms such as k-means and EM (as provided by the Weka toolbox (Witten and Frank, 2005)) have been tested for clustering the relations at hand but were not able to deal with the large number of features and instances required for an adequate representation of our dataset. We thus decided to use a scoring algorithm that compares a relation to other relations based on certain aspects and calculates a similarity score. If this similarity score exceeds a predefined threshold, two relations are grouped together. Similarity is measured based on the output from the different preprocessing steps as well as lexical information from WordNet (WordNet, 2007): • WordNet: WordNet information </context>
</contexts>
<marker>Witten, Frank, 2005</marker>
<rawString>Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition.</rawString>
</citation>
<citation valid="true">
<authors>
<author>WordNet</author>
</authors>
<date>2007</date>
<note>http://wordnet.princeton.edu/.</note>
<marker>WordNet, 2007</marker>
<rawString>WordNet. 2007. http://wordnet.princeton.edu/.</rawString>
</citation>
</citationList>
</algorithm>
</algorithms>