Semantic Query Expansion for Arabic Information Retrieval

Traditional keyword-based search suffers from several limitations, such as word-sense ambiguity.


Introduction
As traditional keyword-based search techniques are known to have several limitations, many researchers have sought to overcome them by developing semantic information retrieval techniques. These techniques target the meaning the user seeks rather than the exact words of the user's query. We consider four main features that make users prefer semantic-based search systems over keyword-based ones: handling generalizations, handling morphological variants, handling concept matches, and handling synonyms with the correct sense (word sense disambiguation).

Semantic-based Search Features
In this section we discuss the main features of semantic search that make it a more attractive choice than traditional keyword-based techniques.

Handling Generalization
Handling generalizations allows the system to provide the user with pages that contain material relevant to sub-concepts of the user's query. Consider the following example in Table 1, where a query contains a general term or concept, "عنف" (violence).

Handling Morphological Variations
Handling morphological variations allows the system to provide the user with pages that contain words derived from the same root as those in the user's query. Consider the following example in Table 2.

Handling Concept Matches
The system should also be aware of concepts or named entities that may be addressed with different words. Consider the following example in Table 3.

Handling Synonyms With Correct Sense
Although the meaning of many Arabic words depends on the word's diacritics, most Arabic text is unvowelized. For example, Table 4 shows that the word "شعب" has more than one meaning depending on its diacritization; the system should be aware of which meaning to consider for expansion. Xu and Weischedel (2001) used the 50 terms with the highest TF-IDF scores extracted from the top 10 documents retrieved from AFP (i.e., the TREC 2001 corpus). These 50 terms were weighted according to their TF-IDF scores and added to the original query, together with terms from other thesauri, using the following formula:

weight(t) = Σ_{d ∈ D} tf(t, d) · idf(t)
Where D is the set of top retrieved documents and t is the original term. Larkey and Connell (2001) used a similar technique, but with a different scoring method.
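This pseudo-relevance-feedback step can be sketched as follows. The function name, the `doc_freq` table, and the exact IDF smoothing are our illustrative assumptions, not details from the cited papers:

```python
import math
from collections import Counter

def top_expansion_terms(top_docs, doc_freq, n_docs, k=50):
    """Score every term in the top retrieved documents by summed TF-IDF
    and return the k highest-scoring terms (pseudo-relevance feedback)."""
    scores = Counter()
    for doc in top_docs:                        # each doc is a token list
        tf = Counter(doc)
        for term, freq in tf.items():
            idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
            scores[term] += freq * idf          # sum TF-IDF over the top docs
    return [t for t, _ in scores.most_common(k)]
```

With the top 10 retrieved documents as `top_docs` and k = 50, this yields the expansion terms that are then weighted and appended to the original query.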
Wikipedia has been considered an ontology source by many researchers, owing to its large coverage, currency, and domain independence. Alkhalifa and Rodríguez (2008) proposed an automatic technique for extending the named entities of Arabic WordNet using Wikipedia, depending mainly on Wikipedia's "redirect" pages and cross-lingual links. A technique for deriving a large-scale taxonomy from Wikipedia was proposed by Ponzetto and Strube (2007). Abouenour et al. (2010) proposed a system that uses Arabic WordNet to enhance Arabic question answering: synonyms from WordNet are used to expand the question in order to extract the passages most semantically relevant to it.
Milne et al. (2007) proposed a system called "KORU" for query expansion using the Wikipedia articles most relevant to the user's query. The system allows the user to refine the set of Wikipedia pages used for expansion. KORU uses "redirect" pages for expansion, and "hyperlinks" and "disambiguation pages" to disambiguate unrestricted text.
Our proposed system differs from KORU in several points: (1) adding "subcategories" to handle generalization; (2) adding the Wikipedia "gloss" (the first phrase of the article) when no "redirect" pages are available; (3) allowing the user either to expand all terms in a single query or to expand each term separately, producing multiple queries whose result lists are then combined into a single result list; (4) adding terms from two additional supporting thesauri, namely the "Al Raed" dictionary and our constructed "Google_WordNet" dictionary.

Arabic Wikipedia
Our system depends mainly on the Arabic Wikipedia as its semantic information source. According to Wikipedia, the Arabic Wikipedia is currently the 23rd largest edition of Wikipedia by article count, and was the first Semitic-language edition to exceed 100,000 articles.
We were able to extract 397,552 Arabic semantic sets, with 690,236 collocations. The term "semantic set" denotes a set of expressions that refer to the same meaning or entity. For example, the following concepts form a semantic set for "بريطانيا" (Britain): ["المملكة المتحدة لبريطانيا" (the United Kingdom of Britain), "المملكة المتحدة لبريطانيا العظمى وآيرلندا" (the United Kingdom of Great Britain and Ireland), "أنكلترة" (England), "بريطانيا العظمى" (Great Britain)]. To extract the semantic sets, we depend on the "redirect" pages, in addition to the article gloss, which may contain a semantic match; this match appears in bold in the first paragraph of the article. The categorization system of Wikipedia is very useful for expanding generic queries into a more specific form, by adding "subcategories" of the original term to the expanded terms.
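The redirect-based part of this extraction can be sketched as follows; the function name and the shape of the `redirects` mapping are our assumptions, and the real system additionally mines the bold gloss match and subcategories:

```python
from collections import defaultdict

def build_semantic_sets(redirects):
    """Group each article title with all redirect titles pointing to it;
    each group is one 'semantic set' (expressions for one meaning/entity).
    `redirects` maps a redirect-page title to its target article title."""
    sets_by_target = defaultdict(set)
    for source, target in redirects.items():
        sets_by_target[target].add(source)   # redirect title joins the set
        sets_by_target[target].add(target)   # canonical title is a member too
    return dict(sets_by_target)
```

For instance, redirects from "UK" and "Great Britain" to "United Kingdom" collapse into the single semantic set {"UK", "Great Britain", "United Kingdom"}.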

The Al Raed Monolingual Dictionary
The "Al Raed" dictionary is a monolingual dictionary of modern Arabic words; it contains 204,303 modern Arabic expressions.

The Google_WordNet Dictionary
We collected all the words in WordNet and translated them into Arabic using Google Translate. For each English word, Google Translate provides several Arabic translations, each corresponding to a different sense, and each sense has a list of possible English synonyms. Using this information, we were able to extend WordNet synset entries into a bilingual Arabic-English dictionary that maps a set of Arabic synonyms to its equivalent set of English synonyms. The basic idea is that two sets of English synonyms (each nominally belonging to a different sense) can be fused into one sense if they share two or more words. Fusing two English sets also fuses their Arabic translations into one set, forming a list of Arabic synonyms matched to a list of English synonyms. Table 5 shows a sample of Google Translate output for the word "tough". We can fuse the first and fourth senses because they have two words in common, namely "strong" and "robust"; the same applies to the second and third senses, with "strict" and "tough" in common. Finally, we use words of the same Arabic set to expand each other in queries.
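The fusion rule can be sketched as below; the data layout (a list of Arabic-set/English-set pairs) and the sample Arabic translations are our illustrative assumptions:

```python
def fuse_senses(senses, min_overlap=2):
    """Merge senses whose English synonym sets share at least `min_overlap`
    words; the Arabic translations of merged senses are pooled as well.
    `senses` is a list of (arabic_synonyms, english_synonyms) pairs."""
    merged = [(set(ar), set(en)) for ar, en in senses]
    changed = True
    while changed:                      # repeat until no pair can be fused
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if len(merged[i][1] & merged[j][1]) >= min_overlap:
                    merged[i] = (merged[i][0] | merged[j][0],
                                 merged[i][1] | merged[j][1])
                    del merged[j]       # pair fused into one sense
                    changed = True
                    break
            if changed:
                break
    return merged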

Indexing and Retrieval
Our system depends on LUCENE, a free, open-source information retrieval library released under the Apache Software License. LUCENE was originally written in Java but has been ported to other programming languages as well; we use the ".Net" version of LUCENE.
LUCENE uses the Vector Space Model (VSM) of information retrieval, together with the Boolean model, to determine how relevant a given document is to a user's query. LUCENE offers a very useful set of features, such as the "OR" and "AND" operators on which we depend for our expanded queries. Documents are analyzed before being added to the index in two steps: diacritic and stop-word removal, and text normalization. A list of 75 stop words (pronouns, prepositions, etc.) is used.
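A minimal sketch of this analysis step is below. The stop list shown is a tiny illustrative excerpt (stored in normalized form), not the paper's actual 75-word list, and the normalization rules are the common Arabic IR conventions:

```python
import re

# Arabic diacritics (tashkeel, U+064B-U+0652) plus tatweel (U+0640)
DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')

# Illustrative excerpt of a stop list, kept in already-normalized form
STOP_WORDS = {"في", "من", "علي", "الي", "عن", "هذا", "هذه", "التي", "الذي"}

def preprocess(text):
    """Remove diacritics, normalize common letter variants, drop stop words."""
    text = DIACRITICS.sub("", text)
    text = (text.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")
                .replace("ة", "ه").replace("ى", "ي"))  # hamza/taa/alif-maqsura
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

Normalizing before the stop-word check keeps the filter insensitive to spelling variants of the function words themselves.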

Stemming
We implemented the Light-10 stemmer developed by Larkey (2007), as it has shown superior performance over other stemming approaches.
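A compact sketch of the light-stemming idea is below. The affix lists follow the published Light10 definition, but the length guards and single-pass suffix order are simplifications of the full algorithm:

```python
# Light stemming: strip the definite article / conjunction prefixes and a
# small set of suffixes, with no full morphological analysis.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light10_stem(word):
    for p in PREFIXES:                  # remove at most one prefix
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                  # strip matching suffixes in order
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
    return word
```

For example, this maps both "طاعة" and "الطاعون" to the shared stem "طاع", which is exactly the ambiguity our co-occurrence dictionary below is designed to guard against.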
Instead of stemming the whole corpus before indexing, we grouped words that share the same stem and appear in the same document into a dictionary, and then used this dictionary in expansion. This reduces the probability of matching two words that share the same stem but have different senses, as they must co-occur in at least one document in the corpus to be used in expansion.
Consider the following example: the words "طاعة" (obedience) and "الطاعون" (the plague) share the same stem, "طاع", yet we do not expand "طاعة" with "الطاعون", as no document in the corpus contains both words.
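The construction of this dictionary can be sketched as follows; the function names and the pluggable `stem` callable are our assumptions:

```python
from collections import defaultdict

def build_stem_expansion_dict(docs, stem):
    """Map each stem to the surface words that share it AND co-occur in at
    least one document, so unrelated words with a common stem are never
    used to expand each other."""
    expansions = defaultdict(set)
    for doc in docs:                          # each doc is a token list
        by_stem = defaultdict(set)
        for word in doc:
            by_stem[stem(word)].add(word)
        for st, words in by_stem.items():
            if len(words) > 1:                # only co-occurring variants
                expansions[st] |= words
    return dict(expansions)
```

Words that share a stem but never co-occur in a document simply never enter the same expansion group.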

Query Expansion
To expand a query, we first look up named entities or concepts from the query in Wikipedia. If a named entity or concept is located, we add the titles of the "redirect" pages that lead to the same concept, in addition to its subcategories from Wikipedia's categorization system. Otherwise, we depend on the other two dictionaries, Al Raed and Google_WordNet, for expansion.
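This lookup cascade can be sketched as below; all four lookup tables are illustrative stand-ins for the real resources, and the function name is ours:

```python
def expand_term(term, wiki_redirects, wiki_subcats, al_raed, google_wordnet):
    """Prefer Wikipedia knowledge (redirect titles plus subcategories);
    fall back to the two dictionaries when the term is not found as a
    Wikipedia concept or named entity."""
    if term in wiki_redirects or term in wiki_subcats:
        return wiki_redirects.get(term, []) + wiki_subcats.get(term, [])
    return al_raed.get(term, []) + google_wordnet.get(term, [])
```

Note that the dictionaries are consulted only when Wikipedia yields nothing, mirroring their supporting role in the system.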
We investigated two methodologies for query expansion. The first is the most common methodology: producing a single expanded query that contains all expanded terms. The second, which we introduce, expands one term at a time, producing multiple queries, and then combines their results into a single result list. The second methodology proved less sensitive to noise, because each expanded query has only one source of noise, namely the term being expanded, while the other terms are left unexpanded. It also allows the system to boost documents retrieved by one expanded query over others according to the relevancy score of the expanded term.
The following example illustrates this intuition for the query "أحكام الأضاحي" (rulings on sacrificial animals). In the single expanded query, the term "أحكام" receives fewer expansions than the term "الأضاحي"; this is because "الأضاحي" is less frequent in the corpus and thus needs more expansions. We then combine the results of the multiple expanded queries with the following algorithm:
1. For each expanded query q_t (the query in which term t is expanded):
   a. For each document d retrieved for q_t with score s(d):
   b. If the final list contains d, increment the score of d by s(d) · R(t).
   c. Else, add d to the final list with score s(d) · R(t).
Here R is a list of relevancy factors, one for each term in the original query; this factor depends on the term's frequency in the corpus.
R(t) is calculated according to the following formula:

R(t) = n_stem(t) / n(t)

where t is the term whose relevancy score we need, n(t) is the number of times t appears in the corpus, and n_stem(t) is the number of times words sharing the stem of t appear in the corpus. We then sort the final list in descending order of score.
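The combination algorithm above can be sketched as follows, assuming a `search` callable that returns (doc_id, score) pairs and a precomputed table of per-term relevancy factors; both names are our illustrative assumptions:

```python
def combine_results(expanded_queries, search, relevancy):
    """Merge the result lists of per-term expanded queries into one list.
    `expanded_queries` holds (term, query) pairs, `search(q)` returns
    (doc_id, score) pairs, and `relevancy[t]` is the factor R(t)."""
    final = {}
    for term, query in expanded_queries:
        for doc_id, score in search(query):
            # new documents are added; repeats accumulate weighted score
            final[doc_id] = final.get(doc_id, 0.0) + score * relevancy[term]
    # rank documents by combined score, best first
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)
```

A document retrieved by several expanded queries accumulates score from each, so agreement across queries pushes it up the final ranking.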
Note that the multiple-expanded-queries methodology takes more time than the single expanded query, because each expanded query is sent to LUCENE separately, and the returned document lists must then be combined into a final list.
We also limit the maximum number of terms added for each original term, in order to reduce the noise introduced by the query expansion step; this maximum also depends on the term's relevancy factor. We set the maximum number of terms added to a single query to 50. Each term is expanded with a number of terms proportional to its relevancy score. This also increases recall, as less frequent terms receive more expansions than frequent ones, allowing LUCENE to find more relevant pages for infrequent terms.

Experiments
For testing our system, we used a data set constructed from the book "Zad Al-Ma'ad" by the Islamic scholar Ibn Al-Qayyim. The data set contains 25 queries and 2,730 documents. Titles of the book's chapters are used as queries, and the sections of each chapter are used as the set of relevant documents for that query. Each query is tested against all sections.

Conclusion
In this paper we introduced a new technique for semantic query expansion using a domain-independent semantic ontology constructed from the Arabic Wikipedia. We focused on four features of semantic search: (1) handling generalizations; (2) handling morphological variants; (3) handling concept matches; (4) handling synonyms with the correct sense. We compared both the single-expanded-query and multiple-expanded-queries approaches against traditional keyword-based search. Both techniques outperformed the baseline, and the multiple-expanded-queries approach performed better than the single expanded query at most levels.