In recent years, the NLP community has done considerable work on detecting and mitigating gender bias in language systems. Yet, to our knowledge, no one has focused on the difficult task of detecting and mitigating heteronormative language. We consider this an urgent issue, since language technologies are increasingly present in the world and, as various studies have shown, biased NLP systems can have adverse real-life consequences for women, gender minorities, racial minorities, and queer people. For these reasons, we propose and evaluate HeteroCorpus, a corpus created specifically for studying heteronormative language in English. Additionally, we propose a baseline set of classification experiments on our corpus, in order to show how it performs in classification tasks.
Word Association Norms (WAN) are collections that present stimulus words and the set of their associated responses. Such collections are widely used in diverse areas of expertise. To reduce the effort of obtaining a good-quality resource that can be reproduced in many languages with minimal sources, we propose a methodology for building Automatic Word Association Norms (AWAN). The methodology takes two simple inputs: a) a dictionary, and b) pre-processed word embeddings. This new kind of WAN is evaluated in two ways: i) learning word embeddings based on the node2vec algorithm and comparing them with human-annotated benchmarks, and ii) performing a lexical search for a reverse dictionary. Both evaluations are carried out on a weighted graph built from the AWAN lexical elements. The results show that the methodology produces good-quality AWANs.
This paper presents our systems for Task 3 of SemEval-2020, which aims to predict the effect that context has on the human perception of word similarity. The task consists of two subtasks in English, Croatian, Finnish, and Slovenian: (1) predicting the change in similarity and (2) predicting the human similarity scores, both for a pair of words within two different contexts. We tackled the problem by developing two systems. The first uses a centroid approach and word vectors. The second uses the ELMo language model, which is trained for each pair of words with the given context. Our approach achieved the highest score in subtask 2 for English.
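The centroid idea mentioned above can be illustrated with a minimal sketch. It is not the submitted system: the toy vectors, the way each word is mixed with its context, and the function names `mean_vector`, `in_context` and `similarity_change` are all assumptions made for illustration.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
VECTORS = {
    "bank": [0.9, 0.1, 0.3],
    "river": [0.1, 0.9, 0.2],
    "money": [0.9, 0.0, 0.1],
    "shore": [0.2, 0.8, 0.3],
    "loan": [0.8, 0.1, 0.0],
}


def mean_vector(vectors):
    """Return the component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_context(word, context):
    """Represent a word as the centroid of its own vector and its context vectors.

    Context words without a vector are skipped (an assumption of this sketch).
    """
    context_vectors = [VECTORS[w] for w in context if w in VECTORS]
    return mean_vector([VECTORS[word]] + context_vectors)


def similarity_change(word1, word2, context_a, context_b):
    """Similarity of the pair in context B minus its similarity in context A."""
    sim_a = cosine(in_context(word1, context_a), in_context(word2, context_a))
    sim_b = cosine(in_context(word1, context_b), in_context(word2, context_b))
    return sim_b - sim_a


if __name__ == "__main__":
    change = similarity_change("bank", "shore", ["money", "loan"], ["river"])
    print(f"change in similarity: {change:+.3f}")
```

In this sketch, subtask (1) corresponds to `similarity_change`, while subtask (2) would use the two per-context cosine values directly.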
In this paper, we present the enhancement of the Demanded Skills Diagnosis (DiCoDe: Diagnóstico de Competencias Demandadas), a system developed by Mexico City’s Ministry of Labor and Employment Promotion (STyFE: Secretaría de Trabajo y Fomento del Empleo de la Ciudad de México) that seeks to reduce information asymmetries between job seekers and employers. The project uses web scraping techniques to retrieve, on a daily basis, job vacancies posted on private job portals, with the purpose of informing training and individual case management policies as well as labor market monitoring. To this end, a collaboration project between STyFE and the Language Engineering Group (GIL: Grupo de Ingeniería Lingüística) was established in order to enhance DiCoDe by applying NLP models and semantic analysis. Through this collaboration, the macro-structure of DiCoDe’s job vacancies system and its geographic referencing at the city hall (municipality) level were improved. More specifically, dictionaries were created to identify demanded competencies, skills and abilities (CSA), and algorithms were developed for dynamically classifying vacancies and identifying search terms in free text, in order to improve the results and processing time of queries.
Mexico is a Spanish-speaking country with great linguistic diversity, comprising 68 linguistic groups and 364 varieties. Because these languages lack representation in education, government, public services and media, they are highly endangered. Due to the lack of data available on social media and the internet, few technologies have been developed for these languages. To analyze different linguistic phenomena in the country, the Language Engineering Group developed the Corpus Paralelo de Lenguas Mexicanas (CPLM) [The Mexican Languages Parallel Corpus], a collaborative parallel corpus for the low-resourced languages of Mexico. The CPLM aligns Spanish with six indigenous languages: Maya, Ch’ol, Mazatec, Mixtec, Otomi, and Nahuatl. First, this paper describes the process of building the CPLM: text search, digitization, and alignment. We also present some difficulties regarding dialectal and orthographic variation. Second, we present the interface and the types of search it supports, as well as the use of filters.
The computational treatment of temporal relations is based on the work of Allen, who establishes 13 different types, and Freksa, who designs a cognitive procedure to manage them. Freksa’s notation is not widely used because, although it has cognitive and expressive advantages, it is too complex from the computational perspective. This paper proposes a system for the annotation and management of temporal relations that combines the richness and expressiveness of Freksa’s approach with the simplicity of Allen’s notation. Our method rests on the application of bounding relations, which make it possible to obtain the temporal representation of complete neighborhoods capable of representing vague temporal relations such as those frequently found in text. These advantages are obtained without greatly increasing the complexity of the labeling process, since the markup language is almost the same as TimeML, to which only a second “relType”-type label for the temporal relation is added. Our experiments show that temporal relations that present vagueness are in fact much more common than those in which a single relation can be established precisely. For these reasons, our new labeling system achieves a more agreeable representation of temporal relations.
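To make the 13 interval relations and the notion of a vague relation concrete, the following minimal sketch classifies two intervals and then enumerates which relations remain possible under partial knowledge. It is not the annotation system proposed in the paper: the relation names follow common usage for Allen's algebra, and the functions `allen_relation` and `possible_relations` are hypothetical helpers written for this illustration.

```python
from itertools import product


def allen_relation(a, b):
    """Return the Allen relation holding between intervals a and b.

    Each interval is a (start, end) pair with start < end.
    """
    (a1, a2), (b1, b2) = a, b
    if not (a1 < a2 and b1 < b2):
        raise ValueError("intervals must satisfy start < end")
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # The intervals overlap partially and share no endpoint.
    return "overlaps" if a1 < b1 else "overlapped-by"


def possible_relations(constraint, points=range(4)):
    """Collect every relation compatible with a partial constraint.

    Four time points are enough to realise all 13 relations.
    """
    found = set()
    for a1, a2, b1, b2 in product(points, repeat=4):
        if a1 < a2 and b1 < b2 and constraint((a1, a2), (b1, b2)):
            found.add(allen_relation((a1, a2), (b1, b2)))
    return found


if __name__ == "__main__":
    # Vague knowledge: we only know that A starts before B starts.
    vague = possible_relations(lambda a, b: a[0] < b[0])
    print(sorted(vague))
```

The set returned for the vague case contains several relations at once, which is the kind of situation a single precise label cannot express.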
This work describes the compilation of a corpus of Spanish-Mixtec parallel documents. There are not many Spanish-Mixtec parallel texts, and most of the sources are non-digital books. As a result, we had to deal with errors introduced when digitizing the sources and with difficulties in sentence alignment, as well as with the lack of a standard orthography. Our parallel corpus consists of sixty texts drawn from books and digital repositories. These documents belong to different domains: history, traditional stories, didactic material, recipes, ethnographic descriptions of each town, and instruction manuals for disease prevention. We have classified this material into five major categories: didactic (6 texts), educative (6 texts), interpretative (7 texts), narrative (39 texts), and poetic (2 texts). The final token count is 49,814 Spanish words and 47,774 Mixtec words. The texts come from the states of Oaxaca (48 texts), Guerrero (9 texts) and Puebla (3 texts). These figures show that the corpus is unbalanced with respect to the representation of the different territories. While 55% of speakers are in Oaxaca, 80% of the texts come from this region. Guerrero has 30% of the speakers and 15% of the texts, and Puebla, with 15% of the speakers, accounts for 5% of the corpus.
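The proportions quoted above follow directly from the text counts. The short check below recomputes them from the figures given in the abstract; the variable and function names are ours, and the speaker percentages are simply copied from the text.

```python
# Text counts per state and per category, as reported in the abstract.
TEXTS_BY_STATE = {"Oaxaca": 48, "Guerrero": 9, "Puebla": 3}
TEXTS_BY_CATEGORY = {
    "didactic": 6,
    "educative": 6,
    "interpretative": 7,
    "narrative": 39,
    "poetic": 2,
}
# Share of speakers per state (percent), as reported in the abstract.
SPEAKER_SHARE = {"Oaxaca": 55, "Guerrero": 30, "Puebla": 15}


def text_share(counts):
    """Return each entry's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    shares = text_share(TEXTS_BY_STATE)
    for state, share in shares.items():
        gap = share - SPEAKER_SHARE[state]
        print(f"{state}: {share:.0f}% of texts, {SPEAKER_SHARE[state]}% of speakers ({gap:+.0f} points)")
    # Both breakdowns should add up to the sixty texts of the corpus.
    print(sum(TEXTS_BY_STATE.values()), sum(TEXTS_BY_CATEGORY.values()))
```

Both breakdowns sum to sixty texts, and the resulting text shares are 80%, 15% and 5% for Oaxaca, Guerrero and Puebla respectively.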
This paper presents our approach to Task 5 of SemEval-2019, which aims at detecting hate speech against immigrants and women on Twitter. The task consists of two sub-tasks, in Spanish and English: (A) detection of hate speech and (B) classification of hateful tweets as aggressive or not, and identification of the harassed target as an individual or a group. We used linguistically motivated features and several types of n-grams (words, characters, function words, punctuation symbols, and POS tags, among others). For sub-task A, we trained a Support Vector Machine using a combinatorial framework, whereas for sub-task B we followed a multi-label approach using a Random Forest classifier. Our approach achieved the highest F1-score in sub-task A for Spanish.
This work presents the Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students, a corpus of raw data for general use. Its purpose is to offer data for the study of language and interactions via Instant Messaging (IM) among undergraduate students. Our paper gives an overview of both the corpus’s content and its demographic metadata. Furthermore, it presents the research currently being conducted with it, namely on parenthetical expressions, orality traits, and code-switching. This work also includes a brief outline of similar corpora and recent studies in the field of IM.
Childhood acquisition of written language is not straightforward. Writing skills evolve differently depending on external factors, such as the conditions in which children practice their productions and the quality of their instructors’ guidance. This can be challenging in low-income areas, where schools may struggle to ensure ideal acquisition conditions. Developing computational tools to support the learning process may counterbalance negative environmental influences; however, little work exists on the use of information technologies to improve childhood literacy. This work centers on the computational study of Spanish word and syllable structure in documents written by 2nd and 3rd year elementary school students. The studied texts were compared against a corpus of short stories aimed at the same age group, in order to observe whether the children tend to produce written patterns similar to the ones they are expected to interpret at their literacy level. The results show some significant differences between the two kinds of texts, pointing towards possible strategies for the implementation of new educational software in support of written language acquisition.
Older adults tend to suffer a decline in some of their cognitive capabilities, with language being one of the least affected processes. Word association norms (WAN), also known as free word associations, reflect word-word relations: the participant reads or hears a word and is asked to write or say the first word that comes to mind. Free word associations show that the organization of semantic memory remains almost unchanged with age. We have performed a WAN task with very small samples of older adults with Alzheimer’s disease (AD), vascular dementia (VaD) and mixed dementia (MxD), and also with a control group of typically aging adults, matched by age, sex and education. All of them are native speakers of Mexican Spanish. The results show, as expected, that Alzheimer’s disease has a very important impact on lexical retrieval, unlike vascular and mixed dementia. This suggests that linguistic tests built from WAN could also be used for detecting AD at early stages.
A graph-based algorithm is used to analyze the co-occurrences of words in the British National Corpus. It is shown that the statistical regularities detected can be exploited to predict human word associations. The corpus-derived associations are evaluated using a large test set comprising several thousand stimulus/response pairs as collected from humans. The finding is that there is a high agreement between the two types of data. The considerable size of the test set allows us to split the stimulus words into a number of classes relating to particular word properties. For example, we construct six saliency classes, and for the words in each of these classes we compare the simulation results with the human data. It turns out that for each class there is a close relationship between the performance of our system and human performance. This is also the case for classes based on two other properties of words, namely syntactic and semantic word ambiguity. We interpret these findings as evidence for the claim that human association acquisition must be based on the statistical analysis of perceived language and that when producing associations the detected statistical regularities are replicated.
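The general idea of predicting associations from co-occurrence statistics can be sketched as follows. This is a deliberately simplified stand-in, not the graph-based algorithm of the paper: it only counts words that appear within a fixed window of each other in a toy corpus. The corpus, the window size, and the function names `cooccurrence_counts` and `predict_association` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a large corpus (hypothetical sentences).
CORPUS = [
    "bread and butter",
    "butter on bread",
    "bread with butter",
    "cold butter",
]


def cooccurrence_counts(sentences, window=2):
    """Count, for every word, the words found within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def predict_association(counts, stimulus):
    """Return the most frequent co-occurring word, or None for unseen stimuli."""
    neighbours = counts.get(stimulus)
    if not neighbours:
        return None
    return neighbours.most_common(1)[0][0]


if __name__ == "__main__":
    counts = cooccurrence_counts(CORPUS)
    print(predict_association(counts, "bread"))
```

A realistic system would also need to down-weight very frequent function words, for instance with an association measure rather than raw counts; that step is omitted here.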
Human-computer interfaces require models of dialogue structure that capture the variability and unpredictability within dialogue. Semantic and pragmatic context evolve continuously during conversation, especially through the distribution of turns, which has a direct effect on dialogue exchanges. In this paper we use a formal language paradigm for modelling multi-agent system conversations. Our computational model combines minimal pragmatic units (speech acts) to construct dialogues. In this framework, we show how turn-taking distribution can be ambiguous and propose an algorithm for resolving it, considering turn coherence, trajectories and turn pairing. Finally, we suggest overlapping as one of the possible phenomena emerging from unresolved turn-taking.