A Study on Personal Attributes Extraction Based on the Combination of Sentences Classifications and Rules

Personal attribute extraction plays a significant role in information mining, event tracing and personal name disambiguation. It mainly involves two problems: attribute recognition, and deciding whether a recognized attribute belongs to the extracted person. Personal attributes generally involve named entities, most of which are recognized by adjusting the word segmentation software. Those that cannot be recognized during word segmentation are recognized with a combination of feature words and rules. Attribute ownership is decided by combining sentence classification with rules. First, all the sentences in a document are classified into those with attribute words and those without, and the latter are omitted. The former are then classified into description sentences with one person and description sentences with more persons, according to whether more than one person is described in the sentence. Statistics show that description sentences with one person do not require anaphora resolution, which avoids the recognition errors caused by anaphora resolution failures. Minimum slicing is used for description sentences with more persons: attribute ownership is decided within the minimum language segment in which the person and the attribute co-occur. This method achieves SF_Value scores of 0.507388780 in the lenient evaluation and 0.489505010 in the strict evaluation of the CIPS-SIGHAN2014 Bakeoff, the best results in the bakeoff, which shows that the method is effective.


Introduction
An attribute, characterized by its objectivity, is inherent in things (Zhuang, 2000). Personal attribute extraction aims at automatically extracting from unstructured texts specific attributes associated with a personal name, such as the person's date of birth, work units, spouses, children, education and title. This plays a significant role in information mining, event tracing and personal name disambiguation. The international TAC KBP evaluation has been conducted since 2009 (Bikel et al., 2009; McNamee et al., 2009), and CIPS-SIGHAN2014 (http://www.cipsc.org.cn/clp2014/webpage/en/home_en.htm) has referred to and revised its Slot Filling tasks to design personal attribute extraction tasks in Chinese. Six groups participated in this bakeoff. Personal attribute extraction mainly involves two problems: attribute recognition, and deciding whether a recognized attribute belongs to the extracted person. The latter can be called attribute ownership decision. Personal attributes are generally named entities, such as personal names, place names, organization names and temporal nouns, so named entity recognition technology is needed in attribute recognition. Although named entity recognition is a difficult problem in natural language processing, there are plenty of experiences and methods to draw upon, as it has been studied for some 30 years since the introduction of Chinese word segmentation, as in (Sun et al., 1995; Zhao et al., 1999; Liao et al., 2004; Yu et al., 2006; Ye et al., 2007). Therefore, after a brief introduction to personal attribute extraction, this paper focuses on attribute ownership decision, which is more complicated because it involves anaphora resolution and ownership decision among several persons. Some bakeoff papers on slot filling have noticed these problems, as in (Bikel et al., 2009; Burman et al., 2012).
In this paper, we propose deciding attribute ownership through the combination of sentence classification and rules, in accordance with natural language features and the task requirements of the bakeoff. This method achieved good results in the evaluation. The rest of the paper is organized as follows: Section 2 introduces the main ideas, Section 3 presents the methods of personal attribute recognition, Section 4 discusses the methods of personal attribute ownership decision, Section 5 reports the experimental results and Section 6 concludes.

Main ideas
Attribute recognition is mainly named entity recognition, which we attempt to settle during word segmentation. According to the requirements of the attribute recognition task, the word segmentation software used in this study has been adjusted so that it can recognize most named entities. Those that cannot be recognized by the software are handled with feature words together with rules. After attribute recognition, all the sentences in the document are classified into those with attribute words and those without, and the latter are omitted. Attribute ownership decision is therefore conducted only on the sentences marked with attribute words. Since personal pronouns are widely used as anaphors in most sentences, attribute ownership decision involves anaphora resolution, that is, determining the antecedent of an anaphor (Wang, 2005). Anaphora resolution is difficult in Chinese and is far from being settled satisfactorily (Wang, 2002; Wang, 2005). In order to decrease the reliance on anaphora resolution, we studied the test documents and found that in most of them the described person is the extracted person. When most sentences in a document describe the extracted person, it is not necessary to employ anaphora resolution. Anaphora resolution or some other method is needed to find the attributes of the extracted person only for sentences with more persons. In a small number of documents, such as "马伟明_T1.xml" and "白志东_T1.xml", the extracted person is the only person in the whole text. As such, in attribute ownership decision, it should be determined whether more than one person is described in the sentence.
In this way, the sentences marked with attribute words are classified into description sentences with one person and description sentences with more persons. This decreases the reliance on anaphora resolution and so improves decision precision by reducing the recognition errors caused by anaphora resolution failures. The challenge is how to determine which sentences involve more persons, which will be expounded later.

Personal attribute recognition
Personal attribute recognition involves two jobs. One is to adjust word segmentation software in order to achieve full recognition of various types of named entities, and the other is to annotate feature words to ensure exact decision of attribute identity of some named entities.

Adjusting word segmentation software
Named entity recognition is mainly completed during word segmentation. The word segmentation software used is CUCBst, a dictionary- and rule-based system developed by the Broadcasting Media Center, Communications University of China. The adjustment includes adjusting tagging, adding words and adjusting rules.

Adding words
Words are added in two stages. Stage One, during system development, is to collect and sort dictionaries and add names such as titles, nations and places to the segmentation dictionary. Stage Two, during the evaluation period, is to add OOV words to the segmentation dictionary: new words are recognized automatically in the evaluation corpus and then confirmed with manual intervention. It should be pointed out that certain noun phrases are regarded as single words and kept in the dictionary. These noun phrases are mainly organization names, nicknames and titles, such as "北平研究院物理研究所 (Institute of Physics of Peking Academy of Sciences)", "罗彻斯特储蓄银行 (Rochester Bank)", "橙县小姐 (Miss Orange County)" and "名誉理事长 (Honorary chairman)".

Adjusting rules
The CUCBst segmentation system provides both coarse-grained and fine-grained segmentation, which are implemented by rules. We adjust some merging rules so that they achieve better attribute recognition. For example:
Example Sentence 2: 斯托曼1953年出生于美国纽约曼哈顿地区的犹太人家庭。
Translation: Stallman was born into a Jewish family in Manhattan, New York, in 1953.
Its segmented versions before the rule adjustment are:
coarse-grained segmentation: 斯托曼/nr 1953年/t 出生/v 于/p 美国纽约曼哈顿地区/ns 的/u 犹太人/n 家庭/n
fine-grained segmentation: 斯托曼/nr 1953年/t 出生/v 于/p 美国/ns 纽约/ns 曼哈顿/ns 地区/n 的/u 犹太人/n 家庭/n
In the coarse-grained version, "美国纽约曼哈顿地区", which according to the evaluation outline includes two personal attributes, country of birth and city of birth, is merged into one word. Further analysis and processing are needed for correct recognition. In the fine-grained version, "美国纽约曼哈顿地区" is divided into 4 words, "美国/ns 纽约/ns 曼哈顿/ns 地区/n", in which the country of birth is correctly segmented. However, the city of birth still requires the last three words to be merged.
The segmented version of Example Sentence 2 after the rule adjustment is:
斯托曼/nr 1953年/t 出生/v 于/p 美国/gj 纽约曼哈顿地区/ns 的/u 犹太人/n 家庭/n
In this version, "美国纽约曼哈顿地区" is segmented into 2 words, "美国/gj 纽约曼哈顿地区/ns", which are the country of birth and the city of birth respectively. This makes the recognition and extraction of the related attributes convenient.
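The effect of the adjusted merging rule can be illustrated with a minimal Python sketch. It is not the CUCBst rule itself, whose form is not given here: the country list `COUNTRIES`, the suffix list `PLACE_SUFFIXES` and the function names are assumptions made for illustration. Starting from the fine-grained output, a country name is retagged as "gj", and the run of place names that follows it, plus a trailing place suffix, is merged into one "ns" word.

```python
# Toy resources, assumed for illustration only.
COUNTRIES = {"美国", "中国", "英国"}
PLACE_SUFFIXES = {"地区", "市", "县"}


def parse(tagged):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def adjust_places(tokens):
    """Retag country names as 'gj' and merge the following place names into one 'ns' word."""
    out = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "ns" and word in COUNTRIES:
            out.append((word, "gj"))
            i += 1
        elif tag == "ns":
            merged = word
            i += 1
            # Absorb the consecutive non-country place names.
            while i < len(tokens) and tokens[i][1] == "ns" and tokens[i][0] not in COUNTRIES:
                merged += tokens[i][0]
                i += 1
            # Absorb one trailing suffix noun such as 地区.
            if i < len(tokens) and tokens[i][1] == "n" and tokens[i][0] in PLACE_SUFFIXES:
                merged += tokens[i][0]
                i += 1
            out.append((merged, "ns"))
        else:
            out.append((word, tag))
            i += 1
    return out


fine = "斯托曼/nr 1953年/t 出生/v 于/p 美国/ns 纽约/ns 曼哈顿/ns 地区/n 的/u 犹太人/n 家庭/n"
print(" ".join(f"{w}/{t}" for w, t in adjust_places(parse(fine))))
```

On the fine-grained version of Example Sentence 2, the sketch reproduces the adjusted segmentation given above, with "美国/gj" and "纽约曼哈顿地区/ns" as separate words.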

Finding nearest named entity through the feature word
Although named entities and some personal attributes receive specific tags during word segmentation, not all tagged named entities are personal attributes. For example, 1998 is not always a person's date of birth, since it could be the date of an event or something else. Therefore, it is necessary to decide the personal attribute through the feature word, by finding the named entity nearest to the feature word within the sentence. Take the time of birth as an example:
Example Sentence 3: 张幼仪/nr 生于/bir 1900年/t ，/w 比/p 徐志摩/nr 小/a 4/m 岁/q 。/w
Translation: Zhang Youyi was born in 1900, and she was four years younger than Xu Zhimo.
Example Sentence 4: 鲁桂珍/nr 1904年/t 生于/bir 南京/ns
Translation: In 1904, Lu Guizhen was born in Nanjing.
When segmented, "生于 (be born)" is tagged as "bir", which means the word is a feature word associated with a person's birth. When there is a "bir" tag in a sentence, the system searches before and after the feature word for the nearest time noun. In Example Sentence 3, 1900 is after the feature word, and in Example Sentence 4, 1904 is before the feature word.
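The search around the feature word can be sketched as follows. This is a minimal illustration that assumes the sentence is available as "word/tag" tokens as in the examples above. The function names are hypothetical, and the tie-breaking behaviour (the earlier token wins when two candidates are equally near) is an assumption, since the text does not specify it.

```python
def parse(tagged):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def nearest_entity(tokens, feature_tag, entity_tag):
    """Return the word tagged entity_tag that is nearest to the first feature word."""
    feature_positions = [i for i, (_, tag) in enumerate(tokens) if tag == feature_tag]
    entity_positions = [j for j, (_, tag) in enumerate(tokens) if tag == entity_tag]
    if not feature_positions or not entity_positions:
        return None
    i = feature_positions[0]
    # Distance is counted in tokens, looking both before and after the feature word.
    j = min(entity_positions, key=lambda pos: abs(pos - i))
    return tokens[j][0]


sentence_3 = "张幼仪/nr 生于/bir 1900年/t ，/w 比/p 徐志摩/nr 小/a 4/m 岁/q 。/w"
sentence_4 = "鲁桂珍/nr 1904年/t 生于/bir 南京/ns"
print(nearest_entity(parse(sentence_3), "bir", "t"))
print(nearest_entity(parse(sentence_4), "bir", "t"))
```

The same function can look for other attribute types by changing `entity_tag`, for example "ns" for the place of birth in Example Sentence 4.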

Deciding whether the attribute belongs to the extracted person
In this section, we first classify the sentences at two levels and then decide attribute ownership within each class. For a description sentence with one person, we decide whether that person is the extracted person; if not, the sentence is omitted. For a description sentence with more persons, we decide attribute ownership by extracting the attribute within the minimum language segment in which the person and the attribute co-occur.

Sentence classification
Sentence classification involves two levels. First, the sentences are classified into those with attribute marks and those without. Then the sentences with attribute marks are classified into those with one person and those with more persons.

Classifying all the sentences into two types
All the attributes and feature words are marked in word segmentation. In terms of these marks, all the sentences are classified into two types. Those without attribute marks will be directly omitted, whereas those with attribute marks will be kept for further processing.
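A minimal sketch of this first-level filter is given below. The tag set `ATTRIBUTE_TAGS` is an assumption assembled from the tags that appear in the examples of this paper (bir, gj, sh, sx, t, ns); the full tag set used by the system is not listed here. The second test sentence is made up for illustration.

```python
# Assumed set of attribute and feature-word tags, taken from the examples in this paper.
ATTRIBUTE_TAGS = {"bir", "gj", "sh", "sx", "t", "ns"}


def tags_of(tagged):
    """Collect the tags of a 'word/tag word/tag ...' sentence."""
    return {tok.rsplit("/", 1)[1] for tok in tagged.split() if "/" in tok}


def keep_attribute_sentences(sentences):
    """Keep only the sentences that carry at least one attribute mark."""
    return [s for s in sentences if tags_of(s) & ATTRIBUTE_TAGS]


sentences = [
    "鲁桂珍/nr 1904年/t 生于/bir 南京/ns",
    "他/r 喜欢/v 读书/v 。/w",  # hypothetical sentence without attribute marks
]
print(keep_attribute_sentences(sentences))
```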

Classifying the sentences with attribute marks into two types
The sentences with attribute marks are classified into those with one described person and those with two or more described persons. Character recognition is significant in this step. A character may appear in five forms: personal names, surnames or first names alone, personal pronouns, zero form and kinship titles. Personal names and kinship titles can be either antecedents or anaphors, whereas the other three forms can only be anaphors.
(1) personal names
Here, the number of personal names in the sentence decides whether the sentence is one with one described person. Example Sentence 5 is a sentence with one described person, for there is only one personal name, "冯白驹", in it, whereas Example Sentence 6 is a sentence with more described persons, for there are two personal names in it, "王文明" and "冯白驹".
(2) only surnames or first names
For non-Chinese names, the whole name is used first and then generally the surname alone is used as the anaphor. In Example Sentence 13, there are three persons: "Blanchett", "father" and "mother".
In addition, we find that when attributes of the extracted person's teacher, student, friend or leader are described, that person's name will appear. However, when a word such as teacher, student or professor is used in a general sense, it has little to do with attribute extraction, so it is not regarded as a character.
[...], on the basis of summarizing research on Carothers, published "Principles of Polymer Chemistry", which shook the whole globe. The book is still the bible-like theoretical basis of today's realm of polymer.
"老师 (the teacher)" in Example Sentence 14 and "同事、学生 (colleagues, students)" in Example Sentence 15 are used in a general sense, so both sentences are ones with one person. In contrast, Example Sentence 16 gives the date of birth, date of death and some other information about Carothers' student, Paul J. Flory (the student is specifically named), so the sentence is one with more persons.
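The second-level classification can be sketched in a simplified form. The sketch counts distinct personal names (tag "nr") and kinship titles, and ignores general-sense words such as 老师 or 学生. The word list `KINSHIP_WORDS` and the function names are assumptions for illustration. Matching a surname or first name to its full name is not handled, and sentences with only pronouns or zero form count as sentences with one person.

```python
# Assumed toy list of kinship titles; general-sense words such as 老师 are deliberately absent.
KINSHIP_WORDS = {"父亲", "母亲", "妻子", "丈夫", "儿子", "女儿"}


def parse(tagged):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def count_persons(tokens):
    """Count distinct named persons and kinship titles in a tagged sentence."""
    persons = {word for word, tag in tokens if tag == "nr"}
    persons |= {word for word, _ in tokens if word in KINSHIP_WORDS}
    return len(persons)


def classify(tagged):
    """Return 'more' if more than one person is described, otherwise 'one'."""
    return "more" if count_persons(parse(tagged)) > 1 else "one"


print(classify("鲁桂珍/nr 1904年/t 生于/bir 南京/ns"))
print(classify("张幼仪/nr 生于/bir 1900年/t ，/w 比/p 徐志摩/nr 小/a 4/m 岁/q 。/w"))
```

Under this sketch, Example Sentence 4 is a sentence with one person, while Example Sentence 3, which names both 张幼仪 and 徐志摩, is a sentence with more persons.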

Attribute ownership decision
By employing the above mentioned character recognition features to classify the sentences, we get two sentence sets, the description sentences with one person (including zero anaphora) and the description sentences with more persons.

The description sentences with one person
(1) affirming the extracted person
For the sentences with personal names, including those with only first names or surnames, the extracted person's name, first name or surname is used for matching. The difficulty lies in the sentences with personal pronouns and zero form. As mentioned above, most documents in the test texts mainly describe the extracted persons, so when a description sentence with one person involves a personal pronoun or zero form, it can be hypothesized that the described person is the extracted person. To test this hypothesis, we study the use of the third-person singular pronoun "他 (he)" in all the sentences. Through automated recognition, we obtain 369 sentences with one person that contain "他 (he)". We then check each sentence to see whether "他 (he)" is an anaphor of the extracted person. Fig. 1 and Fig. 2 show the results.
As illustrated in Fig. 1 and Fig. 2, the 356 sentences (in 112 documents) in which "他 (he)" is an anaphor of the extracted person account for 96 percent of all the sentences, whereas the 14 sentences (in 5 documents) in which "他 (he)" is not an anaphor of the extracted person account for only 4 percent. We study these 5 documents and find that in 3 of them, "鲁桂珍_T2.xml", "鲁桂珍_T3.xml" and "陈济棠_T3.xml", the chiefly described person is not the extracted person. In "鲁桂珍_T2.xml" and "鲁桂珍_T3.xml", the chiefly described person is 鲁桂珍's husband, 李约瑟, not the extracted person, 鲁桂珍. In "陈济棠_T3.xml", the chiefly described person is 陈济棠's son, 陈树柏, not the extracted person, 陈济棠. In this document, there are 5 sentences with one person that contain "他 (he)": in 4 of them "他 (he)" is not an anaphor of the extracted person, and in only one is it an anaphor of the extracted person. We therefore call this a document with overlaps. The other two documents are "马伟明_T3.xml" and "白志东_T3.xml". Although the chiefly described person is the extracted person in both documents, they are narrated from a first-person perspective.
In addition, we perform a statistical analysis of the use of zero form. As zero anaphora is very frequent, 193 sentences with zero anaphora are randomly chosen from 126 documents. We then check each of these sentences to see whether the zero form is an anaphor of the extracted person. The results are shown in Fig. 3 and Fig. 4.
As illustrated in Fig. 3 and Fig. 4, zero anaphora behaves similarly to the third-person singular pronoun "他 (he)". By analyzing the documents in which zero anaphora does not refer to the extracted person, we find that the chiefly described person is not the extracted person. However, there is no first-person perspective, which differs from the case of "他 (he)". The data above demonstrate that our hypothesis is in line with reality. If we had first classified the documents by features such as the chiefly described person and the narrative perspective, and then classified the sentences within the documents, we could have achieved better results.
(2) attribute extraction
The character in a description sentence with one person is affirmed first. If the character is not the extracted person, the sentence is omitted. If the character is the extracted person, the attributes are extracted and put into different attribute lists according to their marks. For example:
Example Sentence 17: 1943年11月/t ，/w 白志东/nr 出生/bir 于/p 河北省/sh 乐亭县/sx 。/w
Translation: In November 1943, Bai Zhidong was born in Laoting County, Hebei Province.
According to the feature word "出生 (birth)" and the attribute marks, the attributes "1943年11月 (Nov. 1943)", "河北省 (Hebei Province)" and "乐亭县 (Laoting County)" are put into the attribute lists for date of birth, province of birth and city of birth (including towns and villages) of the extracted person "白志东".
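The extraction step for a sentence with one person can be sketched as follows. This is an illustration only, restricted to birth attributes. The mapping `BIRTH_SLOTS` from tags to attribute lists and the slot names are assumptions based on the tags in Example Sentence 17. The person check is simplified: a sentence is kept if it names the extracted person, or names nobody (pronoun or zero form).

```python
# Assumed mapping from segmentation tags to birth attribute lists.
BIRTH_SLOTS = {
    "t": "date_of_birth",
    "gj": "country_of_birth",
    "sh": "province_of_birth",
    "sx": "city_of_birth",
}


def parse(tagged):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def extract_birth_attributes(tagged, target):
    """Fill birth attribute lists for `target` from a sentence with one person."""
    tokens = parse(tagged)
    names = [word for word, tag in tokens if tag == "nr"]
    # Omit the sentence if it names somebody other than the extracted person.
    if names and target not in names:
        return {}
    # The feature word 'bir' must be present for birth attributes.
    if not any(tag == "bir" for _, tag in tokens):
        return {}
    slots = {}
    for word, tag in tokens:
        if tag in BIRTH_SLOTS:
            slots.setdefault(BIRTH_SLOTS[tag], []).append(word)
    return slots


sentence_17 = "1943年11月/t ，/w 白志东/nr 出生/bir 于/p 河北省/sh 乐亭县/sx 。/w"
print(extract_birth_attributes(sentence_17, "白志东"))
```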

The description sentences with more persons
Attribute ownership decision in the description sentences with more persons turns out to be the main challenge of this evaluation task. For example:
Example Sentence 18: 李济深升为军长，陈济棠升任第十一师师长。 (陈济棠)
Translation: Li Jishen was promoted to army corps commander, and Chen Jitang was promoted to commander of the Eleventh Division. (Chen Jitang)
In this sentence, "军长 (army commander)" is the title of "李济深", while "师长 (divisional commander)" is the title of "陈济棠", the person to be extracted. Attribute ownership decision requires us to correctly recognize which title belongs to "陈济棠" and then extract it. We mainly employ two rules to decide attribute ownership, minimum slicing with the co-occurrence of the extracted person and the attribute, and the nearest distance principle, which are expounded below.
(1) minimum co-occurrence slicing
When the person and the attribute co-occur in the smallest possible grammatical unit, and there is only one person in that unit, the attribute belongs to that person. In the two clauses of Example Sentence 19, "黄永胜任司令" and "丁盛任二十四师师长" mean that the title "司令 (commander)" belongs to "黄永胜", while the title "师长 (divisional commander)" belongs to "丁盛". In Example Sentence 20, "副总司令张学良", "主任杨虎城" and "主席邵力子" show that the person and the attribute co-occur in the same subject-predicate phrase.
(2) the nearest distance principle
When there is a long distance between the person and the attribute, and there are several persons in the sentence, the attribute is assigned to the nearest person. This principle can fail: in Example Sentence 24, the title "部长 (minister)" belongs to "彭述之", the person at the longer distance. Such a sentence needs deeper syntactic or semantic analysis, which is difficult to process at present.
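The two rules can be combined in a short sketch. Several simplifications are assumed here: the minimum language segment is approximated by the clause between punctuation tokens (tag "w"), titles carry a hypothetical tag "zw", and the tagging of Example Sentence 18 shown in the code is our own guess rather than actual CUCBst output. Within a clause, a title goes to the only person if there is one, and otherwise to the nearest person counted in tokens.

```python
def parse(tagged):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]


def split_clauses(tokens):
    """Slice a token list into clauses at punctuation tokens (tag 'w')."""
    clauses, clause = [], []
    for token in tokens:
        if token[1] == "w":
            if clause:
                clauses.append(clause)
            clause = []
        else:
            clause.append(token)
    if clause:
        clauses.append(clause)
    return clauses


def assign_titles(tagged, title_tag="zw"):
    """Map each person to the titles assigned to them, clause by clause."""
    owners = {}
    for clause in split_clauses(parse(tagged)):
        persons = [(i, w) for i, (w, t) in enumerate(clause) if t == "nr"]
        titles = [(i, w) for i, (w, t) in enumerate(clause) if t == title_tag]
        if not persons:
            continue  # no named person in this clause: leave it to anaphora resolution
        for title_pos, title in titles:
            # One person: minimum co-occurrence. Several persons: nearest distance.
            _, owner = min(persons, key=lambda p: abs(p[0] - title_pos))
            owners.setdefault(owner, []).append(title)
    return owners


# Assumed tagging of Example Sentence 18; 'zw' is a hypothetical title tag.
sentence_18 = "李济深/nr 升为/v 军长/zw ，/w 陈济棠/nr 升任/v 第十一师/n 师长/zw 。/w"
print(assign_titles(sentence_18).get("陈济棠"))
```

As noted for Example Sentence 24, the nearest distance fallback can assign a title to the wrong person, so this sketch inherits the same limitation.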

Anaphora resolution of personal pronouns
For anaphora resolution in the description sentences with more persons, we mainly refer to the methods in (Wang, 2001; Wang, 2005). Since the extracted person is known, its designation and sex can be annotated in advance, which facilitates anaphora resolution. For instance, "居里夫妇" in Example Sentence 25 is plural, whereas "他 (he)" in Example Sentence 26 refers to "钱三强" in the preceding sentence, which is a singular male name.
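How annotated sex and number can narrow the choice of antecedent is illustrated by the sketch below. It is not the method of (Wang, 2001; Wang, 2005), which is not described here, but a simple agreement filter assumed for illustration: the pronoun is resolved to the most recent preceding candidate that agrees with it in number and, where known, in sex. The pronoun table, the candidate list and the function name are all hypothetical.

```python
# Assumed features of some third-person pronouns: (sex, number).
PRONOUNS = {
    "他": ("male", "singular"),
    "她": ("female", "singular"),
    "他们": (None, "plural"),
    "她们": ("female", "plural"),
}


def resolve(pronoun, candidates):
    """Pick the most recent candidate that agrees with the pronoun.

    `candidates` lists (name, sex, number) in order of appearance;
    sex may be None when it is unknown or mixed.
    """
    sex, number = PRONOUNS[pronoun]
    for name, cand_sex, cand_number in reversed(candidates):
        if cand_number != number:
            continue
        if sex is not None and cand_sex is not None and cand_sex != sex:
            continue
        return name
    return None


# Hypothetical candidate list with features annotated in advance.
candidates = [("居里夫妇", None, "plural"), ("钱三强", "male", "singular")]
print(resolve("他", candidates))
print(resolve("他们", candidates))
```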

Experimental results
The performance of the 6 groups that took part in this bakeoff is shown in Table 1. According to the evaluation results, our system achieves SF_Value scores of 0.507388780 in the lenient evaluation and 0.489505010 in the strict evaluation of the CIPS-SIGHAN2014 Bakeoff, which are the best results. This shows that our system is effective. However, an SF_Value of about 50 percent implies that there is still room to improve the system. The system performance could be improved in 3 aspects:
1. To establish a word segmentation system specific to personal attribute extraction.
2. To establish a grammatical knowledge system for personal attribute extraction. For example, "我父亲住在北京 (My father lives in Beijing)" is different from "我和父亲住在北京 (My father and I live in Beijing)", with "我父亲" as a modifier-head construction in the former and "我和父亲" as a parallel construction in the latter.
3. To establish a semantic knowledge system for personal attribute extraction. For example, in the sentence "凯利与女演员劳里·莫顿结婚后居住于Goatstown. (After their wedding, Kelly and the actress Laurie Morton settled in Goatstown.)", certain semantic knowledge is needed to correctly extract the information that Laurie Morton is Kelly's wife.

Conclusion
This bakeoff is full of challenges, with a large number of personal attributes to be extracted. CUCBst, the word segmentation software, plays a significant role in named entity recognition, which provides a solid foundation for attribute extraction.
The strategy of sentence classification is employed in attribute ownership decision. Although it cannot solve all the problems, it simplifies the analysis and helps to improve the precision of attribute ownership decision.