The use of Large Language Models (LLMs) has grown rapidly, and LLMs are now used to generate text in many languages and for many tasks. Moreover, with major companies such as Google and OpenAI offering them, LLMs have become widely accessible and easy to use. An important open issue, however, is how to distinguish AI-generated texts from human-written ones. In this article, we investigate the problem of AI-generated text detection from two different aspects: semantics and syntax. We present an AI model that distinguishes AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, the semantic approach is more helpful for detection, while the syntactic approach leaves considerable room for improvement and is a promising direction for future work.
The identification of Verbal Multiword Expressions (VMWEs) presents a greater challenge compared to non-verbal MWEs due to their higher surface variability. VMWEs are linguistic units that exhibit varying levels of semantic opaqueness and pose difficulties for computational models in terms of both their identification and the degree of compositionality. In this study, a new approach to predicting the compositional nature of VMWEs in Persian is presented. The method begins with an automatic identification of VMWEs in Persian sentences, which is approached as a sequence labeling problem for recognizing the components of VMWEs. The method then creates word embeddings that better capture the semantic properties of VMWEs and uses them to determine the degree of compositionality through multiple criteria. The study compares two neural architectures for identification, BiLSTM and ParsBERT, and shows that a fine-tuned BERT model surpasses the BiLSTM model in evaluation metrics with an F1 score of 89%. Next, a word2vec embedding model is trained to capture the semantics of identified VMWEs and is used to estimate their compositionality, resulting in an accuracy of 70.9% as demonstrated by experiments on a collected dataset of expert-annotated compositional and non-compositional VMWEs.
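The compositionality criterion described above can be sketched as comparing the embedding of the whole expression (treated as a single token) with the composed embeddings of its parts. The following is a minimal illustration with hand-made toy vectors; the real system uses trained word2vec embeddings, and the 0.5 decision threshold here is an assumption, not the paper's tuned value.

```python
import math

# Toy vectors standing in for trained word2vec embeddings; the Persian
# VMWE "harf zadan" ("to talk") is tokenized as a single unit.
vectors = {
    "harf_zadan": [0.9, 0.1, 0.2],  # the expression as one token
    "harf":       [0.2, 0.8, 0.5],  # component 1
    "zadan":      [0.1, 0.3, 0.9],  # component 2
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def compositionality(mwe, components):
    """Similarity of the whole expression to the average of its parts."""
    parts = [vectors[c] for c in components]
    mean = [sum(col) / len(parts) for col in zip(*parts)]
    return cosine(vectors[mwe], mean)

score = compositionality("harf_zadan", ["harf", "zadan"])
is_compositional = score >= 0.5  # assumed threshold; low score = idiomatic
print(round(score, 3))
```

A low score indicates that the expression's meaning diverges from its parts, i.e., it is likely non-compositional.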
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, and others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They have been (re-)split following the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
Supervised approaches usually achieve the best performance on the Word Sense Disambiguation (WSD) problem. However, the unavailability of large sense-annotated corpora for many low-resource languages makes these approaches inapplicable to them in practice. In this paper, we mitigate this issue for the Persian language by proposing a fully automatic approach for obtaining Persian SemCor (PerSemCor), a Persian Bag-of-Words (BoW) sense-annotated corpus. We evaluated PerSemCor both intrinsically and extrinsically and showed that it can be used effectively as a training set for Persian supervised WSD systems. To encourage future research on Persian Word Sense Disambiguation, we release PerSemCor at http://nlp.sbu.ac.ir.
In this paper, we present a WSD system that uses LDA topics for the semantic expansion of document words. Our system also uses sense frequency information from SemCor to give higher priority to the senses that are more likely to occur.
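The combination of topic-based evidence with a SemCor frequency prior can be sketched as a weighted score per candidate sense. The sense inventory, topic words, frequencies, and the 0.7 mixing weight below are all illustrative placeholders, not the system's actual data or parameters.

```python
# Toy sense inventory for the ambiguous word "bank".
senses = {
    "bank.n.01": {"topic_words": {"money", "loan", "deposit"}, "semcor_freq": 20},
    "bank.n.02": {"topic_words": {"river", "shore", "water"}, "semcor_freq": 5},
}

def score_sense(sense, context_topic_words, alpha=0.7):
    """Weighted mix of LDA-topic overlap and a SemCor sense-frequency prior."""
    info = senses[sense]
    overlap = len(info["topic_words"] & context_topic_words) / len(info["topic_words"])
    total = sum(s["semcor_freq"] for s in senses.values())
    prior = info["semcor_freq"] / total
    return alpha * overlap + (1 - alpha) * prior

# Words drawn from the document's dominant LDA topic.
context = {"loan", "interest", "money"}
best = max(senses, key=lambda s: score_sense(s, context))
print(best)  # bank.n.01
```

The prior acts as a tie-breaker: when topic evidence is weak, the more frequent SemCor sense wins.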
Researchers use wordnets as a knowledge base in many natural language processing tasks and applications, such as question answering, textual entailment, discourse classification, and so forth. Lexico-semantic relations among words or concepts are an important part of the knowledge encoded in wordnets. As the use of wordnets becomes extensively widespread, extending the existing ones gets more attention. Manual construction and extension of lexico-semantic relations for wordnets or knowledge graphs is very time-consuming; automatic relation extraction methods can speed up this process. In this study, we exploit an ensemble of LSTM and convolutional neural networks in a supervised manner to capture lexico-semantic relations, which can either be used directly in NLP applications or compose the edges of wordnets. The whole procedure of learning vector space representations of relations is language-independent. We used Princeton WordNet 3.1, FarsNet 3.0 (the Persian wordnet), Root09 and EVALution as gold standards to evaluate the predictive performance of our model, and the results are comparable across the two languages. Empirical results demonstrate that our model outperforms state-of-the-art models.
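One common way to combine an LSTM and a CNN classifier is soft voting: average their per-class probability distributions and take the argmax. The sketch below illustrates only that combination step, with hypothetical model outputs; the paper's actual ensembling strategy and class set may differ.

```python
# Hypothetical class-probability outputs of the two networks for three
# word pairs, over relation classes (0=hypernymy, 1=antonymy, 2=synonymy).
lstm_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
cnn_probs  = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]

def soft_vote(p, q):
    """Average the two class distributions and return the argmax class."""
    avg = [(a + b) / 2 for a, b in zip(p, q)]
    return max(range(len(avg)), key=avg.__getitem__)

predictions = [soft_vote(p, q) for p, q in zip(lstm_probs, cnn_probs)]
print(predictions)  # [0, 1, 2]
```

Soft voting lets a confident model outvote an uncertain one, which is why it often beats hard majority voting for two-member ensembles.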
WordNet or ontology development for resource-poor languages like Persian requires combining several strategies and employing appropriate heuristics. Structured lexical and linguistic resources for Persian are limited, and the language exhibits considerable diversity as well as structural and syntagmatic complexity. This paper proposes a system for extracting verbal synsets and relations to extend FarsNet (the Persian WordNet). The proposed method extracts verbal words and concepts using noun and adjective words and synsets. It exploits data from digital lexicon glossaries, which leads to the identification of 6890 proper verbal words and 2790 verbal synsets, with 91% and 67% precision respectively. The proposed system also extracts relations such as semantic roles of verbal arguments (instrument, location, agent, and patient), "related-to" (unlabeled) relations, and co-occurrences between verbs and other concepts. For this purpose, a combination of linguistic approaches, such as morphological analysis of words, semantic analysis, and the use of key phrases and syntactic and semantic patterns, together with corpus-based approaches, statistical techniques, and co-occurrence analysis, has been utilized. The presented strategy extracts 5600 proper relations between the existing concepts in FarsNet 2.0 with 76% precision.
In this paper, we describe our proposed method for measuring the semantic similarity of a given pair of words in the SemEval-2017 monolingual semantic word similarity task. We use a combination of knowledge-based and corpus-based techniques: FarsNet, the Persian WordNet, alongside deep learning techniques, to extract the similarity of words. We evaluated our proposed approach on the Persian (Farsi) test data at SemEval-2017; it outperformed the other participants and ranked first in the challenge.
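A hybrid of knowledge-based and corpus-based scores is often realized as a weighted interpolation, backing off to the corpus score when a word pair is missing from the knowledge base. This is a minimal sketch of that idea; the 0.6 weight and the back-off rule are assumptions, not the method's tuned configuration.

```python
def combined_similarity(knowledge_sim, corpus_sim, w=0.6):
    """Interpolate a FarsNet-based score with an embedding-based score.

    knowledge_sim is None when the pair is not covered by FarsNet, in
    which case we back off to the corpus-based score alone.
    """
    if knowledge_sim is None:
        return corpus_sim
    return w * knowledge_sim + (1 - w) * corpus_sim

print(round(combined_similarity(0.9, 0.7), 2))  # 0.82
print(combined_similarity(None, 0.7))           # 0.7
```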
Sentiment shifters, i.e., words and expressions that can affect text polarity, play an important role in opinion mining. However, the limited ability of current automated opinion mining systems to handle shifters represents a major challenge. The majority of existing approaches rely on a manual list of shifters; few attempts have been made to automatically identify shifters in text, and most of those focus only on negation shifters. This paper presents a novel and efficient semi-automatic method for identifying sentiment shifters in drug reviews, aiming at improving the overall accuracy of opinion mining systems. To this end, we use weighted association rule mining (WARM), a well-known data mining technique, to find frequent dependency patterns representing sentiment shifters in a domain-specific corpus. These patterns, which include different kinds of shifter words such as shifter verbs and quantifiers, are able to handle both local and long-distance shifters. We also combine these patterns with a lexicon-based approach for the polarity classification task. Experiments on drug reviews demonstrate that the extracted shifters can improve the precision of the lexicon-based approach for polarity classification by 9.25 percent.
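The core of weighted association rule mining is weighted support: the weight mass of the transactions containing a candidate pattern, divided by the total weight. The toy review snippets and weights below are illustrative (weights could come from, e.g., review helpfulness votes); the paper's transaction encoding is richer (dependency patterns).

```python
# Each transaction: (set of context tokens around a polarity word, weight).
transactions = [
    ({"no", "longer", "pain"}, 3.0),
    ({"no", "more", "pain"}, 2.0),
    ({"severe", "pain"}, 1.0),
    ({"no", "side", "effects"}, 2.0),
]

def weighted_support(pattern, transactions):
    """Weight of transactions containing the pattern over total weight."""
    hit = sum(w for items, w in transactions if pattern <= items)
    total = sum(w for _, w in transactions)
    return hit / total

# Candidate shifter pattern: the negator "no" preceding a polarity word.
support = weighted_support({"no"}, transactions)
print(round(support, 3))  # 0.875
```

Patterns whose weighted support clears a minimum threshold are kept as shifter candidates for the polarity classifier.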
This paper discusses the semantic augmentation of FarsNet, the Persian WordNet, with new relations and structures for verbs. FarsNet 1.0, the first Persian WordNet, follows the structure of Princeton WordNet 2.1. In this paper we discuss FarsNet 2.0, in which new inter-POS relations and verb frames are added; in fact, FarsNet 2.0 is a combination of a WordNet and a VerbNet for Persian. It includes more than 30,000 lexical entries arranged in about 20,000 synsets, with about 18,000 mappings to Princeton WordNet synsets. There are about 43,000 relations between synsets and senses in FarsNet 2.0. It also includes verb frames at two levels (syntactic and thematic) for about 200 simple Persian verbs.
Nowadays, wordnets are used in natural language processing as major linguistic resources. Having such a resource for the Persian language helps researchers in computational linguistics and natural language processing develop more accurate systems with higher performance. In this research, we propose a model for the semi-automatic construction of a Persian wordnet of verbs. Compound verbs are a very productive structure in Persian, and the number of compound verbs is much greater than that of simple verbs in this language. This research aims at finding the structure of Persian compound verbs and the relations between verb components. The main idea behind the system is to use the wordnets of other POS categories (namely nouns and adjectives) to extract Persian compound verbs, their synsets and their relations. This paper focuses on three main tasks: 1. extracting compound verbs, 2. extracting verbal synsets, and 3. extracting the relations among verbal synsets, such as hypernymy, antonymy and cause.
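Since Persian compound verbs typically pair a nominal preverb with a small closed class of light verbs (e.g., kardan "to do", zadan "to hit"), candidate extraction can be sketched as generate-and-filter: pair nouns with light verbs, then keep combinations attested in a corpus. The transliterated word lists and the toy corpus below are illustrative assumptions.

```python
# Closed class of common Persian light verbs (transliterated).
light_verbs = ["kardan", "shodan", "zadan", "dadan"]
# Nouns taken (hypothetically) from the noun wordnet.
nouns = ["harf", "kar", "komak"]
# Toy stand-in for corpus attestation of verb candidates.
corpus = {"harf zadan", "kar kardan", "komak kardan"}

# Generate all noun + light-verb candidates, keep the attested ones.
candidates = {f"{n} {lv}" for n in nouns for lv in light_verbs}
compound_verbs = sorted(candidates & corpus)
print(compound_verbs)  # ['harf zadan', 'kar kardan', 'komak kardan']
```

In the full system, the surviving candidates inherit synset and relation information from the noun or adjective component's wordnet entry.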
Semantic lexicons and lexical ontologies are major resources in natural language processing. Developing such resources is a time-consuming task for which some automatic methods have been proposed. This paper describes methods used in the semi-automatic development of FarsNet, a lexical ontology for the Persian language. FarsNet includes the Persian WordNet with more than 10,000 synsets of nouns, verbs and adjectives. In this paper we discuss the extraction of lexico-conceptual relations, such as synonymy, antonymy, hypernymy, hyponymy, meronymy, holonymy, and other lexical or conceptual relations between words and concepts (synsets), from Persian resources. Relations are extracted from different resources such as the web, corpora, Wikipedia, Wiktionary, dictionaries and WordNet. In the presented system, a variety of approaches are applied to the task of relation extraction to extract labeled or unlabeled relations. They exploit the texts, structures, hyperlinks and statistics of web documents, as well as the relations of the English WordNet and the entries of mono- and bilingual dictionaries.
Many NLP applications need fundamental tools to convert the input text into an appropriate form or format and extract the primary linguistic knowledge of words and sentences. These tools perform segmentation of text into sentences, words and phrases, check and correct spellings, do lexical and morphological analysis, POS tagging and so on. Persian is among the languages with complex preprocessing tasks. Differing writing prescriptions, spacing between or within words, character encodings and spellings are some of the difficulties and challenges in converting various texts into a standard one. The lack of fundamental text processing tools such as a morphological analyser (especially for derivational morphology) and a POS tagger is another problem in Persian text processing. This paper introduces a set of fundamental tools for Persian text processing in the STeP-1 package. STeP-1 (Standard Text Preparation for Persian language) performs a combination of tokenization, spell checking, morphological analysis and POS tagging. It also converts all Persian texts with different prescribed forms of writing into a series of tokens in the standard style introduced by the Academy of Persian Language and Literature (APLL). Experimental results show high performance.
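The character-encoding and spacing normalization step can be sketched with two well-known Persian cases: mapping Arabic code points to their Persian equivalents and replacing the plain space before the plural suffix "ها" with a zero-width non-joiner. This rule set is a tiny illustrative subset, not STeP-1's actual tables.

```python
# Map Arabic code points to their Persian equivalents.
CHAR_MAP = {
    "\u064a": "\u06cc",  # Arabic Yeh  -> Persian Yeh
    "\u0643": "\u06a9",  # Arabic Kaf  -> Persian Keheh
}

def normalize(text):
    """Standardize characters, then fix spacing before the plural suffix."""
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    # Replace a plain space before "ها" with a zero-width non-joiner (ZWNJ).
    return text.replace(" \u0647\u0627", "\u200c\u0647\u0627")

print(normalize("كتاب ها"))  # "کتاب‌ها": Persian Kaf, ZWNJ-joined plural
```

Normalizing to a single code point per letter is what makes downstream tokenization and lexicon lookup reliable across differently encoded source texts.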
Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Although there are a number of semantic lexicons for English and some other languages, Persian lacks such a complete resource for use in NLP work. In this paper we introduce an ongoing project on developing a lexical ontology for Persian called FarsNet. We exploit a hybrid semi-automatic approach to acquire lexical and conceptual knowledge from resources such as WordNet, bilingual dictionaries, monolingual corpora, and morpho-syntactic and semantic templates. FarsNet is an ontology whose elements are lexicalized in Persian. It provides links between various types of words (cross-POS relations) and also between words and their corresponding concepts in other ontologies (cross-ontology relations). FarsNet aggregates the power of WordNet on nouns, the power of FrameNet on verbs, and the wide range of conceptual relations from the ontology community.
In many applications of natural language processing (NLP), grammatically tagged corpora are needed; thus, part-of-speech (POS) tagging is of high importance in NLP. Many taggers have been designed with different approaches to reach high performance and accuracy. These taggers usually deal with inter-word relations and make use of lexicons. In this paper we present a new tagging algorithm with a hybrid approach. The algorithm combines the features of probabilistic and rule-based taggers to tag Persian unknown words. In contrast to many other tagging algorithms, it deals with the internal structure of words and does not need any built-in knowledge. The introduced tagging algorithm is domain-independent because it uses morphological rules. In this algorithm, POS tags are assigned to an unknown word with a probability that reflects the confidence of the assigned tag. Although this tagger is proposed for Persian, it can be adapted to other languages by applying their morphological rules.
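Tagging an unknown word from its internal structure can be sketched as matching the longest morphological suffix rule and emitting the rule's tag together with its confidence. The transliterated suffixes and probabilities below are illustrative placeholders, not the paper's actual rule set or learned values.

```python
# (suffix, tag, probability) rules; longest matching suffix wins.
SUFFIX_RULES = [
    ("gari", "NOUN", 0.90),  # e.g. the "-gari" nominalizer
    ("ane",  "ADV",  0.75),  # e.g. the "-ane" adverbializer
    ("i",    "ADJ",  0.60),  # e.g. the relational "-i"
]

def tag_unknown(word, default=("NOUN", 0.50)):
    """Return (tag, probability) from the longest matching suffix rule."""
    for suffix, tag, prob in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return tag, prob
    return default

print(tag_unknown("kuzegari"))  # ('NOUN', 0.9)
```

Because the rules reference only word-internal morphology, the same mechanism transfers to another language by swapping in that language's suffix table.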