Personal Attributes Extraction Based on the Combination of Trigger Words, Dictionary and Rules

Personal Attributes Extraction in Unstructured Chinese Text is a subtask of the 3rd CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2014). In this report, we propose a method that combines trigger words, dictionaries and rules to extract personal attributes. We describe the extraction process and present the results of the bakeoff, which show that our method is feasible and effective.


Introduction
In recent years, the growth of the Internet has made a vast amount of information conveniently available to users. However, as the amount of information increases, filtering out redundant information and finding the knowledge that users really want in large volumes of unstructured text is becoming more and more difficult. For example, when we search for details about a person, general search engines usually return many pages, and we must inspect them one by one even if we need only a few of them. Extracting personal attributes from unstructured text has therefore become an important task. The Personal Attributes Extraction in Unstructured Chinese Text task is designed to extract person-specific attributes, such as date of birth, spouse, husband, children, education or title, from unstructured Chinese texts. The corresponding techniques play an important role in information extraction, event tracking, entity disambiguation and other related research areas.
In this report, we introduce a method that combines trigger words, dictionaries and rules to extract personal attributes. We build a basic framework of task-specific trigger words, dictionaries and rules to extract person-specific attributes. In Section 2, we introduce the two basic approaches to information extraction and several recent studies on this topic, and the task is described in detail in Section 3. In Section 4, we give the steps for building our extraction model. We present the main framework in Section 4.1. Then, from Section 4.2 to Section 4.4, we describe in detail how we build the trigger word table, the attribute dictionaries and the personal attribute rules. In Section 5, we present the evaluation metrics and the final experimental results to demonstrate the feasibility of our method. In Section 6, we point out the shortcomings of our system, propose some improvements to our model and conclude.

Related Works
Rule-based methods and statistics-based methods are currently the two main approaches to information extraction. Rule-based information extraction is a two-phase process: learning the rules, and applying them to extract the target information. Extraction rules mainly come from the constrained context in which the target appears. Once we find constraint information in the text that satisfies the rules, we can also find the target information. Learning and formulating the rules themselves is therefore the key to rule-based information extraction. Statistics-based methods generally have lower accuracy, but they are more portable for this kind of extraction problem. Some statistical models, such as HMMs and CRFs, have a strong theoretical basis and well-established training algorithms. However, statistics-based information extraction requires a large amount of labeled training data.
Currently, there is little literature on personal attribute extraction, and no mature system exists for this problem. However, personal attribute extraction is closely related to information extraction, and persons are one category of entity. To a certain extent, entity relation extraction methods can therefore also be applied to personal attribute extraction. Ye et al. [1] treated personal attribute extraction as a specific application of entity relation extraction. They used HowNet to acquire the trigger words that describe personal attributes, and then turned the relationship between trigger words and names into a classification problem. Their solution needs manually labeled data to train the classifier and relies on a semantic resource. Wang et al. [2] put forward a relation judgment algorithm that filters and classifies the relational tuples extracted by patterns according to the semantic similarity between the current tuple and the relation set, using Wikipedia as a knowledge base. Their approach builds on a sentence-group extraction model that uses features such as chunks and named entity recognition tags. Wang et al. [3] used knowledge engineering to extract personal attributes. They manually summarized rules, based on an analysis of a large amount of web text and on research in natural language processing, and then built a pattern repository for matching. Yu [4] used trigger words and a classifier to extract basic personal information, and built a person search engine on top of the stored extracted information.

Task Descriptions
In this task, there are 25 predefined personal attributes to be extracted: alternate_names, date_of_birth, age, country_of_birth, stateorprovince_of_birth, city_of_birth, date_of_death, country_of_death, stateorprovince_of_death, city_of_death, countries_of_residence, stateorprovince_of_residence, cities_of_residence, title, member_of, employee_of, religion, spouse, children, parents, siblings, other_family, charges, cause_of_death and schools_attended. The test data are provided as a series of folders, each named after a person whose attributes need to be extracted. Each folder contains a Wikipedia XML document and some unstructured Chinese texts about the person. In addition to the actual attribute values, the extraction results should also contain the source documents that the values come from and their positions in those documents. Attributes that already appear in the "Facts" tags of the Wikipedia document do not need to be extracted again. For attributes whose values are not unique, such as parents, children and cities of residence, we are required to extract all probable attribute values.

Methods
Before selecting an extraction method, we carefully analyzed the attributes to be extracted, the sample data and the test data provided by the conference. Because we did not have enough training data, and collecting and labeling training data manually requires a large amount of work, we gave up on statistics-based extraction. However, by examining a large number of Wikipedia pages and personal information, we found that most attributes are expressed in very similar and regular ways. We therefore use a method that combines trigger words, dictionaries and rules to extract personal attributes.

Basic Framework
As shown in Figure 1, the architecture includes several parts:
1. The test corpus is provided by the conference. For each person whose attributes are to be extracted, the corpus includes an XML file containing the person's Wikipedia record and a number of unstructured documents relating to the person.
2. Build attribute trigger words. The trigger words are intended to narrow down the extraction scope. For example, date of birth and place of birth appear in sentences containing "出生" (birth) or "生于" (born).
3. Build attribute dictionaries. The dictionaries cover state, province, school, cause of death and similar attributes with fixed values, that is, attributes that can be extracted directly by dictionary lookup.
4. Build attribute extraction rules. We summarize the general characteristics of the attributes in the corpus using a combination of word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and sentence parsing features. We then formulate grammar rules corresponding to these characteristics and use them in the extraction process.
5. Extract the attribute information. Attributes are extracted from the input unstructured documents according to the rules and the dictionaries.

Build Trigger word Table
A trigger word is a word that locates and identifies a particular attribute and thereby activates the corresponding extraction task. When a sentence in a document contains a trigger word, it triggers the extraction of the corresponding attribute in that sentence, so that the scope of the attribute extraction is greatly narrowed. In this work, by analyzing the characteristics of the texts and the descriptive style of Chinese, we built trigger word sets for some of the attributes, while attributes without trigger words require extraction over the whole document. The trigger word table is shown in Table 1.
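The narrowing step can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the table entries, the names `TRIGGERS`, `UNTRIGGERED`, `split_sentences` and `candidate_sentences`, and the punctuation-based sentence splitter are all assumptions made for illustration, and the trigger words for date_of_death are hypothetical rather than taken from Table 1.

```python
import re

# Hypothetical trigger-word table. Only the date_of_birth triggers come from
# the text; the other entries are illustrative and do not reproduce Table 1.
TRIGGERS = {
    "date_of_birth": ["出生", "生于"],
    "date_of_death": ["逝世", "去世"],
}

# Attributes without trigger words are extracted over the whole document.
UNTRIGGERED = ["age"]


def split_sentences(text):
    """Split Chinese text into sentences ending in 。, ！ or ？ (assumed splitter)."""
    return [s for s in re.findall(r"[^。！？]+[。！？]?", text) if s.strip()]


def candidate_sentences(text):
    """Map each attribute to the sentences in which its extraction should run."""
    sentences = split_sentences(text)
    scope = {attr: list(sentences) for attr in UNTRIGGERED}
    for attr, words in TRIGGERS.items():
        # Keep only the sentences that contain at least one trigger word.
        scope[attr] = [s for s in sentences if any(w in s for w in words)]
    return scope
```

For a three-sentence document, an attribute with trigger words is searched only in the sentences that contain one of them, while an attribute listed in `UNTRIGGERED` is searched in all three.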

Build Attribute Dictionary
We built attribute dictionaries for country, province or state, city, school and similar attributes, which can be extracted directly by dictionary lookup.
Compared with rules, dictionary extraction is more convenient and more accurate. For some of the attributes, we built 8 dictionaries covering country, school, religion and so on, as shown in Table 2.
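A dictionary lookup of this kind might look like the following sketch. The dictionary contents, the names `DICTIONARIES` and `dictionary_lookup`, and the longest-match-first policy are assumptions for illustration. Character offsets are returned because the task also asks for the position of each value in its source document.

```python
# Hypothetical miniature dictionaries; Table 2 is not reproduced here.
DICTIONARIES = {
    "country": ["中国", "美国", "英国"],
    "religion": ["佛教", "基督教", "伊斯兰教"],
}


def dictionary_lookup(sentence, entries):
    """Return (entry, start, end) for every dictionary entry found in sentence.

    Longer entries are tried first, and a shorter entry is skipped when it
    overlaps a span that a longer entry has already claimed (an assumed policy).
    """
    matches = []
    for entry in sorted(entries, key=len, reverse=True):
        start = sentence.find(entry)
        while start != -1:
            end = start + len(entry)
            overlaps = any(s < end and start < e for _, s, e in matches)
            if not overlaps:
                matches.append((entry, start, end))
            start = sentence.find(entry, start + 1)
    # Report matches in the order in which they appear in the sentence.
    return sorted(matches, key=lambda m: m[1])
```

With the longest-match policy, a place name embedded in a longer entry (for example "北京" inside "北京大学") is not reported separately when both entries are in the same list.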

Build Personal Attribute Rules
Rules are very important for the proposed personal attribute extraction: their quality directly determines the quality of the extraction. While studying the personal attributes, we found that expressions of the same attribute have a lot in common. Based on these similarities, and in combination with word segmentation, part-of-speech tagging and named entity recognition, we built rules for each attribute. The rule sets are shown in Table 3.

Table 3
alternate_names: The nearest word tagged "NN" after the trigger words; the quoted words after the trigger words.
date_of_birth: Generate all regular time format templates in advance, and take the time expression matched in the first sentence containing the trigger words as the result.
country_of_birth, city_of_birth, stateorprovince_of_birth: Match the corresponding dictionary in the first sentence containing the trigger words.
age: Extract the numbers followed by "岁" and take the maximum as the result; add specific rules for ages written in Chinese numerals, such as "六十 岁".
date_of_death: When the content of the <date_of_death> tag is empty, take the time expression matched in the sentence containing the trigger words as the result.
country_of_death, city_of_death, stateorprovince_of_death: Match the corresponding dictionary in the sentence containing the trigger words.
cause_of_death: Match the corresponding dictionary in the sentences containing the trigger words; search, after a "由于" or "因" tagged "P", for a string whose tag sequence is NN, NN + NN + VV, NN + NN, NN + VV or NN + VA, within a distance of fewer than five words and stopping at punctuation.
schools_attended, countries_of_residence, cities_of_residence, stateorprovince_of_residence: Match the corresponding dictionary in the sentence containing the trigger words.
title: Match the title dictionary backward in the phrase containing the trigger words or the person's name; the nearest word tagged "NN" after a phrase with the structure trigger words or person's name + "是"; when these queries fail, match the title dictionary in all the sentences containing the person's name.
member_of, employee_of: The chunks tagged "ORG" by named entity recognition in the sentences containing the trigger words or the title attribute; search bidirectionally for the nearest chunk tagged "NP" in the phrase containing the trigger words; mark the results containing "会", "军" or "队" as member_of attribute values and the rest as employee_of attribute values.
religion: Match the religion dictionary in the sentences containing the trigger words.
spouse, parents, children, siblings: The chunks tagged "PER" by named entity recognition in the sentences containing the trigger words, rejecting the person's own name.
other_family: The chunks tagged "PER" by named entity recognition in the sentences containing the trigger words, rejecting the person's own name and any name already assigned to another attribute.
charges: Match the corresponding dictionary in the sentences containing the person's name; search for a string with the tag sequence VV or AD + VV before the trigger word. The string between the phrase and the trigger word is the value.
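As an illustration of how two of these rules could be realized, the following Python sketch implements the age rule (numbers followed by "岁", Arabic or Chinese numerals, taking the maximum) and the date_of_birth rule (time templates matched in the first sentence containing a trigger word). It is a simplified sketch rather than the system's code: the function names, the three year-month-day templates and the restriction to Chinese numerals below 1000 are assumptions.

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}


def chinese_to_int(numeral):
    """Convert a Chinese numeral below 1000 (e.g. "六十", "一百零二") to an int."""
    total, current = 0, 0
    for ch in numeral:
        if ch in CN_DIGITS:
            current = CN_DIGITS[ch]
        elif ch == "十":
            total += (current or 1) * 10   # a bare "十" means 10
            current = 0
        elif ch == "百":
            total += (current or 1) * 100
            current = 0
    return total + current


# A number (Arabic or Chinese), optional whitespace, then "岁".
AGE_PATTERN = re.compile(r"([0-9]+|[零一二两三四五六七八九十百]+)\s*岁")


def extract_age(text):
    """Return the largest age mentioned in text, or None if there is none."""
    ages = []
    for match in AGE_PATTERN.finditer(text):
        token = match.group(1)
        ages.append(int(token) if token.isdigit() else chinese_to_int(token))
    return max(ages) if ages else None


# Assumed time templates, ordered from most to least specific.
DATE_PATTERNS = [re.compile(p) for p in (
    r"[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日",
    r"[0-9]{4}年[0-9]{1,2}月",
    r"[0-9]{4}年",
)]


def extract_date(sentence):
    """Return the first time expression matching a template, or None."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return None


def extract_date_of_birth(sentences, triggers=("出生", "生于")):
    """Apply the date templates to the first sentence containing a trigger word."""
    for sentence in sentences:
        if any(t in sentence for t in triggers):
            return extract_date(sentence)
    return None
```

Trying the most specific template first means a full date is preferred over a bare year when both would match the same sentence.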

Experiments
This work is designed to extract person specific attributes from unstructured Chinese texts.  [5].

Single attributes evaluation metric: When NumCorrect is zero, we set NumCorrect to 1.0.

List attributes evaluation metric: ListSlotValue is computed from IP and IR. When both IP and IR are zero, we set ListSlotValue to 0.0.

Overall evaluation metric: We use the average of the single attributes evaluation score and the list attributes evaluation score as the final evaluation score.

In the evaluation, both a lenient evaluation and a strict evaluation are performed. In the strict evaluation, all instance attributes are compared to the answers, while in the lenient evaluation, the offsets of the string from the beginning word to the ending word are ignored. Table 4 and Table 5 give the results for the lenient evaluation and the strict evaluation, respectively. Six teams participated in this bakeoff, as shown in the first column of Table 4 and Table 5, in which our team is called CIST-BUPT. Our method achieved good results, ranking second among the six teams. The results show that the method based on the combination of trigger words, dictionaries and rules is feasible, and that the trigger words and rules we formulated performed well.
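The difference between the two evaluation modes, and the way the final score is formed, can be sketched in Python as follows. The record layout `Extraction` and the functions `is_correct` and `overall_score` are hypothetical. In particular, the assumption that the lenient mode still compares the attribute value and the source document, and ignores only the offsets, is our reading of the description above rather than the organizers' scoring script.

```python
from collections import namedtuple

# Hypothetical record layout: attribute value, source document, character offsets.
Extraction = namedtuple("Extraction", "value doc start end")


def is_correct(pred, gold, strict):
    """Compare an extracted instance with an answer.

    Assumption: the lenient mode compares the value and the source document;
    the strict mode additionally requires the offsets to match.
    """
    if pred.value != gold.value or pred.doc != gold.doc:
        return False
    if strict:
        return (pred.start, pred.end) == (gold.start, gold.end)
    return True


def overall_score(single_score, list_score):
    """Final score: the average of the single and list attribute scores."""
    return (single_score + list_score) / 2.0
```

Under this reading, an instance with the right value and document but the wrong offsets counts as correct in the lenient evaluation and as incorrect in the strict one.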
However, our method still has some problems. The list attributes evaluation score is far lower than the single attributes evaluation score, which suggests that we missed many instances. In addition, when the offsets of the extracted strings are taken into account, both the single attributes and the list attributes evaluation scores decline. This indicates that some results are only partly right: for example, the attribute value is correct but the source or the object is wrong. In future work, we need to develop dedicated strategies to extract more accurate results.

Conclusions and Future Work
In this report, we proposed a method that combines trigger words, dictionaries and rules to extract person-specific attributes from unstructured Chinese texts. The trigger words narrow the scope of extraction, and they are then combined with dictionary lookup and extraction rules to extract the 25 person-specific attributes.
Given the limited time, and because this was our first attempt at this kind of bakeoff, our system still has shortcomings. For example, to address the "Missing Words" case, we can refine the rules, or collect and label data manually to obtain more training data and then use machine learning to extract personal attributes. To address the "Incorrect Words" case, we plan to add a check on the subject of each sentence, so that we avoid extracting attributes that belong to other people. In addition, we can write more specific rules for place names that occur inside the names of schools or organizations, to reduce their effect on the place-related attributes.
We believe that with these improvements our system can produce more accurate extraction results. We also look forward to developing more principled and more complete machine learning algorithms and rules for extracting person-specific attributes from unstructured Chinese text with less human labor in the loop.