<?xml version="1.0" encoding="UTF-8"?>
<algorithms version="110505">
<algorithm name="SectLabel" version="110505">
<variant no="0" confidence="0.000756">
<title confidence="0.986763">
Unsupervised Relation Extraction from Web Documents
</title>
<author confidence="0.957796">
Kathrin Eichler, Holmer Hemsen and G¨unter Neumann
</author>
<affiliation confidence="0.869386">
DFKI GmbH, LT-Lab, Stuhlsatzenhausweg 3 (Building D3 2), D-66123 Saarbr¨ucken
</affiliation>
<email confidence="0.909296">
{FirstName.SecondName}@dfki.de
</email>
<sectionHeader confidence="0.999425" genericHeader="abstract">
Abstract
</sectionHeader>
<bodyText confidence="0.981787">
The IDEX system is a prototype of an interactive dynamic Information Extraction (IE) system. A user of the system
expresses an information request in the form of a topic description, which is used for an initial search in order to retrieve
a relevant set of documents. On basis of this set of documents, unsupervised relation extraction and clustering is done by
the system. The results of these operations can then be interactively inspected by the user. In this paper we describe the
relation extraction and clustering components of the IDEX system. Preliminary evaluation results of these components are
presented and an overview is given of possible enhancements to improve the relation extraction and clustering components.
1. Introduction
Information extraction (IE) involves the process of au-
tomatically identifying instances of certain relations of
interest, e.g., produce(&lt;company&gt;, &lt;product&gt;, &lt;lo-
cation&gt;), in some document collection and the con-
struction of a database with information about each
individual instance (e.g., the participants of a meet-
ing, the date and time of the meeting). Currently, IE
systems are usually domain-dependent and adapting
the system to a new domain requires a high amount
of manual labour, such as specifying and implement-
ing relation–specific extraction patterns manually (cf.
Fig. 1) or annotating large amounts of training cor-
pora (cf. Fig. 2). These adaptations have to be made
offline, i.e., before the specific IE system is actually
made. Consequently, current IE technology is highly
statical and inflexible with respect to a timely adapta-
tion to new requirements in the form of new topics.
</bodyText>
<figureCaption confidence="0.961357">
Figure 1: A hand-coded rule–based IE–system (schemat-
ically): A topic expert implements manually task–specific
extraction rules on the basis of her manual analysis of a
representative corpus.
</figureCaption>
<subsectionHeader confidence="0.785035">
1.1. Our goal
</subsectionHeader>
<bodyText confidence="0.805705">
The goal of our IE research is the conception and im-
plementation of core IE technology to produce a new
</bodyText>
<figureCaption confidence="0.992279166666667">
Figure 2: A data–oriented IE system (schematically): The
task–specific extraction rules are automatically acquired by
means of Machine Learning algorithms, which are using
a sufficiently large enough corpus of topic–relevant docu-
ments. These documents have to be collected and costly
annotated by a topic–expert.
</figureCaption>
<bodyText confidence="0.999827823529412">
IE system automatically for a given topic. Here, the
pre–knowledge about the information request is given
by a user online to the IE core system (called IDEX)
in the form of a topic description (cf. Fig. 3). This
initial information source is used to retrieve relevant
documents and extract and cluster relations in an un-
supervised way. In this way, IDEX is able to adapt
much better to the dynamic information space, in par-
ticular because no predefined patterns of relevant re-
lations have to be specified, but relevant patterns are
determined online. Our system consists of a front-end,
which provides the user with a GUI for interactively in-
specting information extracted from topic-related web
documents, and a back-end, which contains the rela-
tion extraction and clustering component. In this pa-
per, we describe the back-end component and present
preliminary evaluation results.
</bodyText>
<subsectionHeader confidence="0.98033">
1.2. Application potential
</subsectionHeader>
<bodyText confidence="0.9976295">
However, before doing so we would like to motivate
the application potential and impact of the IDEX ap-
</bodyText>
<figureCaption confidence="0.657726571428571">
Figure 3: The dynamic IE system IDEX (schematically):
a user of the IDEX IE system expresses her information
request in the form of a topic description which is used for
an initial search in order to retrieve a relevant set of doc-
uments. From this set of documents, the system extracts
and collects (using the IE core components of IDEX) a set
of tables of instances of possibly relevant relations. These
tables are presented to the user (who is assumed to be the
topic–expert), who will analyse the data further for her in-
formation research. The whole IE process is dynamic, since
no offline data is required, and the IE process is interactive,
since the topic expert is able to specify new topic descrip-
tions, which express her new attention triggered by a novel
relationship she was not aware of beforehand.
</figureCaption>
<bodyText confidence="0.999941551724138">
proach by an example application. Consider, e.g., the
case of the exploration and the exposure of corruptions
or the risk analysis of mega construction projects. Via
the Internet, a large pool of information resources of
such mega construction projects is available. These
information resources are rich in quantity, but also
in quality, e.g., business reports, company profiles,
blogs, reports by tourists, who visited these construc-
tion projects, but also web documents, which only
mention the project name and nothing else. One of
the challenges for the risk analysis of mega construc-
tion projects is the efficient exploration of the possibly
relevant search space. Developing manually an IE sys-
tem is often not possible because of the timely need
of the information, and, more importantly, is proba-
bly not useful, because the needed (hidden) informa-
tion is actually not known. In contrast, an unsuper-
vised and dynamic IE system like IDEX can be used
to support the expert in the exploration of the search
space through pro–active identification and clustering
of structured entities. Named entities like for example
person names and locations, are often useful indicators
of relevant text passages, in particular, if the names are
in some relationship. Furthermore, because the found
relationships are visualized using an advanced graph-
ical user interface, the user can select specific names
and find associated relationships to other names, the
documents they occur in or she can search for para-
phrases of sentences.
</bodyText>
<sectionHeader confidence="0.699195" genericHeader="keywords">
2. System architecture
</sectionHeader>
<bodyText confidence="0.9996695">
The back-end component, visualized in Figure 4, con-
sists of three parts, which are described in detail in this
section: preprocessing, relation extraction and relation
clustering.
</bodyText>
<subsectionHeader confidence="0.987372">
2.1. Preprocessing
</subsectionHeader>
<bodyText confidence="0.999991375">
In the first step, for a specific search task, a topic of
interest has to be defined in the form of a query. For
this topic, documents are automatically retrieved from
the web using the Google search engine. HTML and
PDF documents are converted into plain text files. As
the tools used for linguistic processing (NE recogni-
tion, parsing, etc.) are language-specific, we use the
Google language filter option when downloading the
documents. However, this does not prevent some doc-
uments written in a language other than our target
language (English) from entering our corpus. In ad-
dition, some web sites contain text written in several
languages. In order to restrict the processing to sen-
tences written in English, we apply a language guesser
tool, lc4j (Lc4j, 2007) and remove sentences not clas-
sified as written in English. This reduces errors on
the following levels of processing. We also remove sen-
tences that only contain non-alphanumeric characters.
To all remaining sentences, we apply LingPipe (Ling-
Pipe, 2007) for sentence boundary detection, named
entity recognition (NER) and coreference resolution.
As a result of this step database tables are created,
containing references to the original document, sen-
tences and detected named entities (NEs).
</bodyText>
<subsectionHeader confidence="0.884206">
2.2. Relation extraction
</subsectionHeader>
<bodyText confidence="0.999658913043478">
Relation extraction is done on the basis of parsing po-
tentially relevant sentences. We define a sentence to be
of potential relevance if it at least contains two NEs.
In the first step, so-called skeletons (simplified depen-
dency trees) are extracted. To build the skeletons, the
Stanford parser (Stanford Parser, 2007) is used to gen-
erate dependency trees for the potentially relevant sen-
tences. For each NE pair in a sentence, the common
root element in the corresponding tree is identified and
the elements from each of the NEs to the root are col-
lected. An example of a skeleton is shown in Figure 5.
In the second step, information based on dependency
types is extracted for the potentially relevant sen-
tences. Focusing on verb relations (this can be ex-
tended to other types of relations), we collect for each
verb its subject(s), object(s), preposition(s) with ar-
guments and auxiliary verb(s). We can now extract
verb relations using a simple algorithm: We define a
verb relation to be a verb together with its arguments
(subject(s), object(s) and prepositional phrases) and
consider only those relations to be of interest where at
least the subject or the object is an NE. We filter out
relations with only one argument.
</bodyText>
<subsectionHeader confidence="0.951681">
2.3. Relation clustering
</subsectionHeader>
<bodyText confidence="0.9836815">
Relation clusters are generated by grouping relation
instances based on their similarity.
</bodyText>
<figureCaption confidence="0.99235">
Figure 4: System architecture
Figure 5: Skeleton for the NE pair “Hohenzollern” and “Brandenburg” in the sentence “Subsequent members of
the Hohenzollern family ruled until 1918 in Berlin, first as electors of Brandenburg.“
</figureCaption>
<figure confidence="0.997035454545454">
sentence/documents+
NE tables
skeletons +
sov−relations
filtering of
relevant
sentences
syntactic
+
typed dependency
parsing
Relation extraction
table of clustered relations
relation
filtering
clustering
Relation clustering
Preprocessing
web documents
document
retrieval
topic specific
documents
conversion
plain text
documents
sentence boundary
detection,
NE recognition,
coreference
language
filtering
resolution
</figure>
<bodyText confidence="0.9482615625">
The comparably large amount of data in the corpus
requires the use of an efficient clustering algorithm.
Standard ML clustering algorithms such as k-means
and EM (as provided by the Weka toolbox (Witten
and Frank, 2005)) have been tested for clustering the
relations at hand but were not able to deal with the
large number of features and instances required for an
adequate representation of our dataset. We thus de-
cided to use a scoring algorithm that compares a re-
lation to other relations based on certain aspects and
calculates a similarity score. If this similarity score ex-
ceeds a predefined threshold, two relations are grouped
together.
Similarity is measured based on the output from the
different preprocessing steps as well as lexical informa-
tion from WordNet (WordNet, 2007):
</bodyText>
<listItem confidence="0.998273">
• WordNet: WordNet information is used to deter-
mine if two verb infinitives match or if they are in
the same synonym set.
• Parsing: The extracted dependency information is
</listItem>
<bodyText confidence="0.981516666666667">
used to measure the token overlap of the two sub-
jects and objects, respectively. We also compare
the subject of the first relation with the object of
the second relation and vice versa. In addition,
we compare the auxiliary verbs, prepositions and
preposition arguments found in the relation.
</bodyText>
<listItem confidence="0.977363545454546">
• NE recognition: The information from this step
is used to count how many of the NEs occurring
in the contexts, i.e., the sentences in which the
two relations are found, match and whether the
NE types of the subjects and objects, respectively,
match.
• Coreference resolution: This type of information
is used to compare the NE subject (or object) of
one relation to strings that appear in the same
coreference set as the subject (or object) of the
second relation.
</listItem>
<bodyText confidence="0.992911">
Manually analyzing a set of extracted relation in-
stances, we defined weights for the different similarity
measures and calculated a similarity score for each re-
lation pair. We then defined a score threshold and clus-
tered relations by putting two relations into the same
cluster if their similarity score exceeded this threshold
value.
</bodyText>
<sectionHeader confidence="0.870685" genericHeader="introduction">
3. Experiments and results
</sectionHeader>
<bodyText confidence="0.9999215">
For our experiments, we built a test corpus of doc-
uments related to the topic “Berlin Hauptbahnhof”
by sending queries describing the topic (e.g., “Berlin
Hauptbahnhof”, “Berlin central station”) to Google
and downloading the retrieved documents specifying
English as the target language. After preprocessing
these documents as described in 2.1., our corpus con-
sisted of 55,255 sentences from 1,068 web pages, from
which 10773 relations were automatically extracted
and clustered.
</bodyText>
<subsectionHeader confidence="0.97048">
3.1. Clustering
</subsectionHeader>
<bodyText confidence="0.995808">
From the extracted relations, the system built 306 clus-
ters of two or more instances, which were manually
evaluated by two authors of this paper. 81 of our clus-
ters contain two or more instances of exactly the same
relation, mostly due to the same sentence appearing in
several documents of the corpus. Of the remaining 225
clusters, 121 were marked as consistent, 35 as partly
consistent, 69 as not consistent. We defined consis-
tency based on the potential usefulness of a cluster to
the user and identified three major types of potentially
useful clusters:
</bodyText>
<listItem confidence="0.946797375">
• Relation paraphrases, e.g.,
accused (Mr Moore, Disney, In letter)
accused (Michael Moore, Walt Disney
Company)
• Different instances of the same pattern, e.g.,
operates (Delta, flights, from New York)
offers (Lufthansa, flights, from DC)
• Relations about the same topic (NE), e.g.,
</listItem>
<bodyText confidence="0.708695833333333">
rejected (Mr Blair, pressure, from Labour
MPs)
reiterated (Mr Blair, ideas, in speech, on
March)
created (Mr Blair, doctrine)
...
Of our 121 consistent clusters, 76 were classified as be-
ing of the type ’same pattern’, 27 as being of the type
’same topic’ and 18 as being of the type ’relation para-
phrases’. As many of our clusters contain two instances
only, we are planning to analyze whether some clusters
should be merged and how this could be achieved.
</bodyText>
<subsectionHeader confidence="0.906138">
3.2. Relation extraction
</subsectionHeader>
<bodyText confidence="0.999918097560975">
In order to evaluate the performance of the relation ex-
traction component, we manually annotated 550 sen-
tences of the test corpus by tagging all NEs and verbs
and manually extracting potentially interesting verb
relations. We define ’potentially interesting verb rela-
tion’ as a verb together with its arguments (i.e., sub-
ject, objects and PP arguments), where at least two
of the arguments are NEs and at least one of them
is the subject or an object. On the basis of this crite-
rion, we found 15 potentially interesting verb relations.
For the same sentences, the IDEX system extracted 27
relations, 11 of them corresponding to the manually
extracted ones. This yields a recall value of 73% and
a precision value of 41%.
There were two types of recall errors: First, errors in
sentence boundary detection, mainly due to noisy in-
put data (e.g., missing periods), which lead to parsing
errors, and second, NER errors, i.e., NEs that were
not recognised as such. Precision errors could mostly
be traced back to the NER component (sequences of
words were wrongly identified as NEs).
In the 550 manually annotated sentences, 1300 NEs
were identified as NEs by the NER component. 402
NEs were recognised correctly by the NER, 588
wrongly and in 310 cases only parts of an NE were
recognised. These 310 cases can be divided into three
groups of errors. First, NEs recognised correctly, but
labeled with the wrong NE type. Second, only parts
of the NE were recognised correctly, e.g., “Touris-
mus Marketing GmbH” instead of “Berlin Tourismus
Marketing GmbH”. Third, NEs containing additional
words, such as “the” in “the Brandenburg Gate”.
To judge the usefulness of the extracted relations, we
applied the following soft criterion: A relation is con-
sidered useful if it expresses the main information given
by the sentence or clause, in which the relation was
found. According to this criterion, six of the eleven
relations could be considered useful. The remaining
five relations lacked some relevant part of the sen-
tence/clause (e.g., a crucial part of an NE, like the
’ICC’ in ’ICC Berlin’).
</bodyText>
<sectionHeader confidence="0.785832" genericHeader="method">
4. Possible enhancements
</sectionHeader>
<bodyText confidence="0.99984015">
With only 15 manually extracted relations out of 550
sentences, we assume that our definition of ’potentially
interesting relation’ is too strict, and that more inter-
esting relations could be extracted by loosening the ex-
traction criterion. To investigate on how the criterion
could be loosened, we analysed all those sentences in
the test corpus that contained at least two NEs in order
to find out whether some interesting relations were lost
by the definition and how the definition would have to
be changed in order to detect these relations. The ta-
ble in Figure 6 lists some suggestions of how this could
be achieved, together with example relations and the
number of additional relations that could be extracted
from the 550 test sentences.
In addition, more interesting relations could be
found with an NER component extended by more
types, e.g., DATE and EVENT. Open domain NER
may be useful in order to extract NEs of additional
types. Also, other types of relations could be inter-
esting, such as relations between coordinated NEs,
</bodyText>
<table confidence="0.9902265">
option example additional relations
extraction of relations, Co-operation with &lt;ORG&gt;M.A.X. 25
where the NE is not the 2001&lt;\ORG&gt; &lt;V&gt;is&lt;\V&gt; clearly of
complete subject, object or benefit to &lt;ORG&gt;BTM&lt;\ORG&gt;.
PP argument, but only part
of it
extraction of relations with &lt;ORG&gt;BTM&lt;\ORG&gt; &lt;V&gt;invited and or 7
a complex VP supported&lt;\V&gt; more than 1,000 media rep-
resentatives in &lt;LOC&gt;Berlin&lt;\LOC&gt;.
resolution of relative pro- The &lt;ORG&gt;Oxford Centre for Maritime 2
nouns Archaeology&lt;\ORG&gt; [...] which will
&lt;V&gt;conduct&lt;\V&gt; a scientific symposium in
&lt;LOC&gt;Berlin&lt;\LOC&gt;.
combination of several of the &lt;LOC&gt;Berlin&lt;\LOC&gt; has &lt;V&gt;developed to 7
options mentioned above become&lt;\V&gt; the entertainment capital of
&lt;LOC&gt;Germany&lt;\LOC&gt;.
</table>
<figureCaption confidence="0.913857">
Figure 6: Table illustrating different options according to which the definition of ”potentially interesting relation”
</figureCaption>
<bodyText confidence="0.602132888888889">
could be loosened. For each option, an example sentence from the test corpus is given, together with the number
of relations that could be extracted additionally from the test corpus.
e.g., in a sentence like The exhibition [...] shows
&lt;PER&gt;Clemens Brentano&lt;\PER&gt;, &lt;PER&gt;Achim
von Arnim&lt;\PER&gt; and &lt;PER&gt;Heinrich von
Kleist&lt;\PER&gt;, and between NEs occurring in the
same (complex) argument, e.g., &lt;PER&gt;Hanns Peter
Nerger&lt;\PER&gt;, CEO of &lt;ORG&gt;Berlin Tourismus
Marketing GmbH (BTM) &lt;\ORG&gt;, sums it up [...].
</bodyText>
<sectionHeader confidence="0.998432" genericHeader="method">
5. Related work
</sectionHeader>
<bodyText confidence="0.999839844444445">
Our work is related to previous work on domain-
independent unsupervised relation extraction, in par-
ticular Sekine (2006), Shinyama and Sekine (2006) and
Banko et al. (2007).
Sekine (2006) introduces On-demand information ex-
traction, which aims at automatically identifying
salient patterns and extracting relations based on these
patterns. He retrieves relevant documents from a
newspaper corpus based on a query and applies a POS
tagger, a dependency analyzer and an extended NE
tagger. Using the information from the taggers, he ex-
tracts patterns and applies paraphrase recognition to
create sets of semantically similar patterns. Shinyama
and Sekine (2006) apply NER, coreference resolution
and parsing to a corpus of newspaper articles to ex-
tract two-place relations between NEs. The extracted
relations are grouped into pattern tables of NE pairs
expressing the same relation, e.g., hurricanes and their
locations. Clustering is performed in two steps: they
first cluster all documents and use this information to
cluster the relations. However, only relations among
the five most highly-weighted entities in a cluster are
extracted and only the first ten sentences of each arti-
cle are taken into account.
Banko et al. (2007) use a much larger corpus, namely
9 million web pages, to extract all relations between
noun phrases. Due to the large amount of data, they
apply POS tagging only. Their output consists of mil-
lions of relations, most of them being abstract asser-
tions such as (executive, hired by, company) rather
than concrete facts.
Our approach can be regarded as a combination of
these approaches: Like Banko et al. (2007), we extract
relations from noisy web documents rather than com-
parably homogeneous news articles. However, rather
than extracting relations from millions of pages we re-
duce the size of our corpus beforehand using a query in
order to be able to apply more linguistic preprocessing.
Like Sekine (2006) and Shinyama and Sekine (2006),
we concentrate on relations involving NEs, the assump-
tion being that these relations are the potentially in-
teresting ones. The relation clustering step allows us
to group similar relations, which can, for example, be
useful for the generation of answers in a Question An-
swering system.
</bodyText>
<sectionHeader confidence="0.998324" genericHeader="method">
6. Future work
</sectionHeader>
<bodyText confidence="0.999936806451613">
Since many errors were due to the noisiness of the ar-
bitrarily downloaded web documents, a more sophisti-
cated filtering step for extracting relevant textual infor-
mation from web sites before applying NE recognition,
parsing, etc. is likely to improve the performance of
the system.
The NER component plays a crucial role for the qual-
ity of the whole system, because the relation extraction
component depends heavily on the NER quality, and
thereby the NER quality influences also the results of
the clustering process. A possible solution to improve
NER in the IDEX System is to integrate a MetaNER
component, combining the results of several NER com-
ponents. Within the framework of the IDEX project
a MetaNER component already has been developed
(Heyl, to appear 2008), but not yet integrated into the
prototype. The MetaNER component developed uses
the results from three different NER systems. The out-
put of each NER component is weighted depending on
the component and if the sum of these values for a pos-
sible NE exceeds a certain threshold it is accepted as
NE otherwise it is rejected.
The clustering step returns many clusters containing
two instances only. A task for future work is to in-
vestigate, whether it is possible to build larger clus-
ters, which are still meaningful. One way of enlarging
cluster size is to extract more relations. This could
be achieved by loosening the extraction criteria as de-
scribed in section 4. Also, it would be interesting to see
whether clusters could be merged. This would require
a manual analysis of the created clusters.
</bodyText>
<sectionHeader confidence="0.99039" genericHeader="conclusions">
Acknowledgement
</sectionHeader>
<bodyText confidence="0.9992096">
The work presented here was partially supported by a
research grant from the “Programm zur F¨orderung von
Forschung, Innovationen und Technologien (ProFIT)”
(FKZ: 10135984) and the European Regional Develop-
ment Fund (ERDF).
</bodyText>
<sectionHeader confidence="0.960003" genericHeader="references">
7. References
</sectionHeader>
<reference confidence="0.996732692307692">
Michele Banko, Michael J. Cafarella, Stephen Soder-
land, Matthew Broadhead, and Oren Etzioni. 2007.
Open information extraction from the web. In Proc.
of the International Joint Conference on Artificial
Intelligence (IJCAI).
Andrea Heyl. to appear 2008. Unsupervised relation
extraction. Master’s thesis, Saarland University.
Lc4j. 2007. Language categorization library for Java.
http://www.olivo.net/software/lc4j/.
LingPipe. 2007. http://www.alias-i.com/lingpipe/.
Satoshi Sekine. 2006. On-demand information extrac-
tion. In ACL. The Association for Computer Lin-
guistics.
Yusuke Shinyama and Satoshi Sekine. 2006. Preemp-
tive information extraction using unrestricted re-
lation discovery. In Proc. of the main conference
on Human Language Technology Conference of the
North American Chapter of the Association of Com-
putational Linguistics, pages 304–311. Association
for Computational Linguistics.
Stanford Parser. 2007. http://nlp.stanford.edu/
downloads/lex-parser.shtml.
Ian H. Witten and Eibe Frank. 2005. Data Min-
ing: Practical machine learning tools and techniques.
Morgan Kaufmann, San Francisco, 2nd edition.
WordNet. 2007. http://wordnet.princeton.edu/.
</reference>
</variant>
</algorithm>
<algorithm name="AAMatching" version="110505">
  <results time="1337156134" date="Wed May 16 16:15:34 SGT 2012">
    <authors>
      <author>
        <fullname source="parscit">G ¨ unter Neumann</fullname>
        <institutions>
          <institution>DFKI GmbH, LT-Lab, Stuhlsatzenhausweg 3 ( Building D3 2 ) , D-66123 Saarbr ¨ ucken</institution>
        </institutions>
      </author>
      <author>
        <fullname source="parscit">Kathrin Eichler</fullname>
        <institutions>
          <institution>DFKI GmbH, LT-Lab, Stuhlsatzenhausweg 3 ( Building D3 2 ) , D-66123 Saarbr ¨ ucken</institution>
        </institutions>
      </author>
      <author>
        <fullname source="parscit">Holmer Hemsen</fullname>
        <institutions>
          <institution>DFKI GmbH, LT-Lab, Stuhlsatzenhausweg 3 ( Building D3 2 ) , D-66123 Saarbr ¨ ucken</institution>
        </institutions>
      </author>
    </authors>
    <institutions>
      <institution>DFKI GmbH, LT-Lab, Stuhlsatzenhausweg 3 ( Building D3 2 ) , D-66123 Saarbr ¨ ucken</institution>
    </institutions>
  </results>
</algorithm>

</algorithms>