In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges, experiences and prospects of their application to under-resourced languages. We rst introduce the goals of cross-lingual linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied to language data can play in this context. We de ne under-resourced languages with a speci c focus on languages actively used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language technology. We argue that languages for which considerable amounts of textual data and (at least) a bilingual word list are available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.
The OntoLex-Lemon model provides a vocabulary to enrich ontologies with linguistic information that can be exploited by Natural Language Processing applications. The increasing uptake of Lemon illustrates the growing interest in combining linguistic information and Semantic Web technologies. In this paper, we present Fuzzy Lemon, an extension of Lemon that allows to assign an uncertainty degree to lexical semantic relations. Our approach is based on an OWL ontology that defines a hierarchy of data properties encoding different types of uncertainty. We also illustrate the usefulness of Fuzzy Lemon by showing that it can be used to represent the confidence degrees of automatically discovered translations between pairs of bilingual dictionaries from the Apertium family.
This article discusses a survey carried out within the NexusLinguarum COST Action which aimed to give an overview of existing guidelines (GLs) and best practices (BPs) in linguistic linked data. In particular it focused on four core tasks in the production/publication of linked data: generation, interlinking, publication, and validation. We discuss the importance of GLs and BPs for LLD before describing the survey and its results in full. Finally we offer a number of directions for future work in order to address the findings of the survey.
The objective of the Translation Inference Across Dictionaries (TIAD) series of shared tasks is to explore and compare methods and techniques that infer translations indirectly between language pairs, based on other bilingual/multilingual lexicographic resources. In this fifth edition, the participating systems were asked to generate new translations automatically among three languages - English, French, Portuguese - based on known indirect translations contained in the Apertium RDF graph. Such evaluation pairs have been the same during the four last TIAD editions. Since the fourth edition, however, a larger graph is used as a basis to produce the translations, namely Apertium RDF v2. The evaluation of the results was carried out by the organisers against manually compiled language pairs of K Dictionaries. For the second time in the TIAD series, some systems beat the proposed baselines. This paper gives an overall description of the shard task, the evaluation data and methodology, and the systems’ results.
This paper describes the participation of two different approaches in the 3rd Translation Inference Across Dictionaries (TIAD 2020) shared task. The aim of the task is to automatically generate new bilingual dictionaries from existing ones. To that end, we essayed two different types of techniques: based on graph exploration on the one hand and, on the other hand, based on cross-lingual word embeddings. The task evaluation results show that graph exploration is very effective, accomplishing relatively high precision and recall values in comparison with the other participating systems, while cross-lingual embeddings reaches high precision but smaller recall.
Legal technology is currently receiving a lot of attention from various angles. In this contribution we describe the main technical components of a system that is currently under development in the European innovation project Lynx, which includes partners from industry and research. The key contribution of this paper is a workflow manager that enables the flexible orchestration of workflows based on a portfolio of Natural Language Processing and Content Curation services as well as a Multilingual Legal Knowledge Graph that contains semantic information and meaningful references to legal documents. We also describe different use cases with which we experiment and develop prototypical solutions.
In this paper we describe the contributions made by the European H2020 project “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applicable to a wide range of sectors and applications and based around language resources and language technologies that can be integrated by means of semantic technologies. We describe the methods implemented for increasing the number of language data sets in the LLOD. We also present the approach for ensuring interoperability and for porting LLOD data sets and services to other infrastructures, as well as the contribution of the projects to existing standards.
We present a portfolio of natural legal language processing and document curation services currently under development in a collaborative European project. First, we give an overview of the project and the different use cases, while, in the main part of the article, we focus upon the 13 different processing services that are being deployed in different prototype applications using a flexible and scalable microservices architecture. Their orchestration is operationalised using a content and document curation workflow manager.
The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes’ degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and constitutes a large unified array of linked lexical entries and translations that are available and accessible on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows for interesting exploitation opportunities, some of which are addressed in this paper. Two ‘massive’ experiments are reported: in the first one, the original EN-ES translation set was removed from the Apertium RDF Graph and a new EN-ES version was generated. The results were compared against the previously removed EN-ES data and against the Concise Oxford Spanish Dictionary. In the second experiment, a new non-existent EN-FR translation set was generated. In this case the results were compared against a converted wiktionary English-French file. The results we got are really good and perform well for the extreme case of correlated polysemy. This lead us to address the possibility to use cycles and nodes degree to identify potential oddities in the source data. If cycle density proves efficient when considering potential targets, we can assume that in dense graphs nodes with low degree may indicate potential errors.
The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.
Language resources, such as multilingual lexica and multilingual electronic dictionaries, contain collections of lexical entries in several languages. Having access to the corresponding explicit or implicit translation relations between such entries might be of great interest for many NLP-based applications. By using Semantic Web-based techniques, translations can be available on the Web to be consumed by other (semantic enabled) resources in a direct manner, not relying on application-specific formats. To that end, in this paper we propose a model for representing translations as linked data, as an extension of the lemon model. Our translation module represents some core information associated to term translations and does not commit to specific views or translation theories. As a proof of concept, we have extracted the translations of the terms contained in Terminesp, a multilingual terminological database, and represented them as linked data. We have made them accessible on the Web both for humans (via a Web interface) and software agents (with a SPARQL endpoint).