This paper describes a method to enrich lexical resources with content relating to linguistic diversity, based on knowledge from the field of lexical typology. We capture the phenomenon of diversity through the notion of lexical gap and use a systematic method to infer gaps semi-automatically on a large scale, which we demonstrate on the kinship domain. The resulting free diversity-aware terminological resource consists of 198 concepts, 1,911 words, and 37,370 gaps in 699 languages. We see great potential in the use of resources such as ours for the improvement of a variety of cross-lingual NLP tasks, which we illustrate through an application in the evaluation of machine translation systems.
The emergence of Multi-task learning (MTL) models in recent years has helped push the state of the art in Natural Language Understanding (NLU). We strongly believe that many NLU problems in Arabic are especially poised to reap the benefits of such models. To this end we propose the Arabic Language Understanding Evaluation Benchmark (ALUE), based on 8 carefully selected and previously published tasks. For five of these, we provide new privately held evaluation datasets to ensure the fairness and validity of our benchmark. We also provide a diagnostic dataset to help researchers probe the inner workings of their models. Our initial experiments show that MTL models outperform their singly trained counterparts on most tasks. But in order to entice participation from the wider community, we stick to publishing singly trained baselines only. Nonetheless, our analysis reveals that there is plenty of room for improvement in Arabic NLU. We hope that ALUE will play a part in helping our community realize some of these improvements. Interested researchers are invited to submit their results to our online, and publicly accessible leaderboard.
We present a new wordnet resource for Scottish Gaelic, a Celtic minority language spoken by about 60,000 speakers, most of whom live in Northwestern Scotland. The wordnet contains over 15 thousand word senses and was constructed by merging ten thousand new, high-quality translations, provided and validated by language experts, with an existing wordnet derived from Wiktionary. This new, considerably extended wordnet—currently among the 30 largest in the world—targets multiple communities: language speakers and learners; linguists; computer scientists solving problems related to natural language processing. By publishing it as a freely downloadable resource, we hope to contribute to the long-term preservation of Scottish Gaelic as a living language, both offline and on the Web.
In this paper we discuss several models we used to classify 25 city-level Arabic dialects in addition to Modern Standard Arabic (MSA) as part of the MADAR shared task (sub-task 1). We propose an ensemble of the best-performing classifiers, selected experimentally and trained on a varied set of features. Our system achieves a macro F1-score of 69.3% on the DEV dataset, with an improvement of 1.4% in accuracy over the baseline model. Our best submitted run ranked third out of 19 participating teams on the TEST dataset, only 0.12% macro F1-score behind the top-ranked system.
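As an illustration of how an ensemble can combine the outputs of several classifiers, the following minimal sketch uses hard majority voting. The abstract does not state how the ensemble combines its members, so the voting rule, the tie-breaking policy, and the dialect labels below are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one sentence.

    Hard voting is an assumption here; the abstract does not specify the
    combination rule. Ties go to the label predicted by the earliest
    classifier in the list.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers for two sentences.
per_classifier = [
    ["CAI", "RAB"],  # classifier 1
    ["CAI", "TUN"],  # classifier 2
    ["ALX", "TUN"],  # classifier 3
]

# Transpose so that each tuple holds all votes for one sentence.
ensemble = [majority_vote(votes) for votes in zip(*per_classifier)]
```

Here `ensemble` holds one combined label per sentence. A weighted or probability-averaging scheme would be an equally plausible reading of the abstract.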
This paper describes the solution that we propose for the MADAR 2019 Arabic Fine-Grained Dialect Identification task. The proposed solution uses a set of classifiers that we trained on character and word features. These classifiers are: Support Vector Machines (SVM), Bernoulli Naive Bayes (BNB), Multinomial Naive Bayes (MNB), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Passive Aggressive (PA) and Perceptron (PC). The system achieved competitive results, with a performance of 62.87% and 62.12% on the development and test sets, respectively.
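To make the notion of character and word features concrete, the sketch below counts word unigrams and character n-grams for a sentence. The n-gram sizes, the whitespace tokenization, and the feature-name prefixes are assumptions chosen for illustration; the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text`, padded with one space per side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def extract_features(text, char_sizes=(2, 3)):
    """Count word unigrams and character n-grams in one sentence.

    The sizes in `char_sizes` are hypothetical defaults, not values taken
    from the paper.
    """
    features = Counter()
    for word in text.split():
        features[f"w={word}"] += 1
    for n in char_sizes:
        for gram in char_ngrams(text, n):
            features[f"c{n}={gram}"] += 1
    return features


features = extract_features("ab ab")
```

The resulting counts could be fed to any of the classifiers listed above once they are mapped to a numeric vector.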
In this paper we present the TrentoTeam system, which participated in Task 3 at SemEval-2017 (Nakov et al., 2017). We concentrated our work on applying Grice's Maxims (used in many state-of-the-art machine learning applications (Vogel et al., 2013; Kheirabadi and Aghagolzadeh, 2012; Dale and Reiter, 1995; Franke, 2011)) to ranking the answers to a question by relevancy. In particular, we created a ranker system based on relevancy scores assigned by three main components: named entity recognition, similarity score, and sentiment analysis. Our system obtained results comparable to those of machine learning systems.
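A ranker of this kind can be sketched as a weighted sum of the three component scores followed by a sort. The abstract does not say how the scores are combined, so the linear combination, the equal default weights, and the field names below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    ner_score: float         # hypothetical named entity overlap with the question
    similarity_score: float  # hypothetical question-answer similarity
    sentiment_score: float   # hypothetical sentiment-based score


def relevancy(answer, weights=(1.0, 1.0, 1.0)):
    """Combine the three component scores into a single relevancy score.

    A weighted sum is an assumed combination rule, not the paper's.
    """
    w_ner, w_sim, w_sent = weights
    return (w_ner * answer.ner_score
            + w_sim * answer.similarity_score
            + w_sent * answer.sentiment_score)


def rank_answers(answers, weights=(1.0, 1.0, 1.0)):
    """Return the answers sorted from most to least relevant."""
    return sorted(answers, key=lambda a: relevancy(a, weights), reverse=True)


candidates = [
    Answer("a1", ner_score=0.25, similarity_score=0.5, sentiment_score=0.0),
    Answer("a2", ner_score=0.5, similarity_score=0.5, sentiment_score=0.25),
    Answer("a3", ner_score=0.0, similarity_score=0.25, sentiment_score=0.0),
]
ranked = [a.text for a in rank_answers(candidates)]
```

Changing `weights` lets one component dominate the ranking, which is one simple way to probe how much each component contributes.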
WordNet represents polysemous terms by capturing their different meanings at the lexical level, but without emphasizing the polysemy types to which such terms belong. State-of-the-art polysemy approaches identify several polysemy types in WordNet, but they do not explain how to classify and organize them. In this paper, we present a novel approach for classifying polysemy types that exploits taxonomic principles, which in turn allow us to discover a set of polysemy structural patterns.