European Language Grid: A Joint Platform for the European Language Technology Community

Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.


Introduction
Europe is a multilingual society with 24 EU Member State languages and dozens of additional languages including regional and minority languages and languages spoken by immigrants, trade partners and tourists. The only option to enable and to benefit from multilingualism is through Language Technologies (LT) including Natural Language Processing (NLP) and Speech Technologies (Rehm, 2017). While the European LT landscape is world class, it is also massively fragmented (Vasiljevs et al., 2019; Rehm et al., 2020d).
We describe Release 2 of the European Language Grid (ELG) cloud platform. This scalable system is targeted to evolve into the primary platform for LT in Europe. It will provide one umbrella platform for all LTs developed by the European LT landscape, including research and industry, addressing a gap that has been repeatedly raised by the European LT community for many years (Rehm and Uszkoreit, 2013; Rehm et al., 2016b; STOA, 2017; Rehm, 2017; Rehm and Hegele, 2018; European Parliament, 2018). ELG is meant to be a virtual home and marketplace for all products, services and organisations active in the LT space in Europe (Rehm et al., 2020a). The platform can be used by all stakeholders to showcase, share and distribute their products, services, tools and resources. At the end of the EU project ELG (2019-2022), which will establish a legal entity in early 2022, the platform will provide access to approx. 1300 commercial and non-commercial tools and services for all European languages, as well as thousands of language resources (LRs). ELG will enable the European LT community to deposit and upload their technologies and data sets and to deploy them through the grid. The ELG is also meant to support digital language equality in Europe (STOA, 2017; European Parliament, 2018), i. e., to create a situation in which all languages are supported through technologies equally well. The current imbalance is characterised by a stark predominance of LTs for English, while almost all other languages are only marginally supported and, thus, in danger of digital language extinction (Rehm and Uszkoreit, 2012; Kornai, 2013; Rehm et al., 2014, 2016a; ELRC, 2019).
Section 2 gives an overview of the ELG platform and related activities. Section 3 touches upon related work. Section 4 concludes the paper.

The European Language Grid
The European LT community has been demanding a dedicated LT platform for years. ELG concentrates on commercial and non-commercial LTs, both functional (processing and generation, written and spoken language) and non-functional (corpora, data sets etc.). We want to establish the ELG as the primary marketplace for the fragmented European LT landscape (Rehm et al., 2020d) to connect demand and supply. The ELG is based on robust, scalable and reliable open source technologies, enabling it to scale with the growing demand and supply. It contains records of all resources, service and application types, languages as well as LT companies, research organisations, projects, etc. (see Figure 1 and Figure 4 in the appendix).

Architectural Overview
ELG is a scalable platform with a web user interface, backend components and REST APIs. It offers access (search, discovery, etc.) to various kinds of LT-related resources such as functional services, corpora, data sets and organisations. An ELG functional service is an LT tool wrapped with the ELG LT Service API 2 and packaged in a Docker container; both steps have to be carried out by the LT provider. Then, the LT service container is integrated into the ELG (Section 2.7) so that it can be used through the web UI or APIs. The architecture consists of three layers: base infrastructure, platform backend, platform frontend (Figure 2).
The base infrastructure is operated on a Kubernetes 3 cluster in the data centre of a cloud provider located in Berlin, Germany, where all platform components and all LT functional services run as Docker containers. The only components outside the cluster are the S3 storage, Read the Docs (ELG documentation), and any LT services deployed through external servers.
The platform backend consists of five parts. (1) The backend components of the ELG catalogue, i. e., an inventory of all metadata records (Section 2.3). Users can browse and search the catalogue through queries or by utilising filters (e. g., language, service type, domain etc.). Users with the "LT provider" role can create new entries either by uploading XML descriptions or through a graphical metadata editor. The catalogue backend is implemented using Django, PostgreSQL and Elasticsearch. (2) The LT Service Execution Server, which offers a common REST API for executing functional services and also handles failures, timeouts etc. (3) The user management and authentication module, which is based on Keycloak, an identity and access management solution. (4) The Storage Proxy, which is used for interacting with the S3-compatible storage. (5) All integrated LT services. Additional components, especially for billing and monitoring purposes, are currently work in progress.
2 https://gitlab.com/europeanlanguagegrid/platform/
3 https://kubernetes.io
The platform frontend consists of UIs for the different types of users, e. g., LT providers, potential buyers and administrators (Section 2.6.2). These include (1) catalogue UIs (browse, search, view), provider and metadata editor UIs for uploading and registering functional and non-functional resources. They are implemented using React and packaged in the same container. (2) The administration pages are implemented using Django. (3) The test/trial UIs for functional services run in separate containers. The UIs are powered by the catalogue REST API, e. g., a resource's metadata record is returned as a JSON object and rendered as HTML. The frontend also includes a Drupal-based CMS for additional content (Section 2.6.2).
All core components of the ELG platform are built with robust, scalable, reliable and widely used technologies, e. g., Django, Angular and React. For managing LT service containers, ELG makes use of Knative, a layer on top of Kubernetes that handles autoscaling.

Base Infrastructure
The base infrastructure consists of the nodes running the ELG platform, volume storage, networking facilities and S3-compatible object storage. We use managed Kubernetes, i. e., the maintenance and operation of Kubernetes itself is taken care of by the provider. The infrastructure also consists of a large set of Git repositories and Docker registries, hosted in a common group on GitLab for all ELG source and configuration files. Many external registries are used to pull in third-party components, like database servers (MariaDB, PostgreSQL), authentication and identity management (Keycloak), monitoring (Prometheus), among others. Most LT services offered by the ELG platform are pulled from the Docker registries of their respective developers.
ELG uses a GitOps approach to deployment, with the cluster configuration stored in a dedicated Git repository as a set of Helm charts. A continuous integration pipeline triggers a deployment with each check-in to this repository.
Since ELG will eventually host more than one thousand LT services with different hardware needs, we cannot keep all of them running concurrently, as this would require hundreds of gigabytes of RAM. Knative is used to automatically scale down services not currently in use to zero replicas. A service is scaled up further if a certain threshold of requests is exceeded. This setup is suitable for services with little traffic. For services intended to power actual applications, however, the time to spin up a container is likely too long. ELG will, later on, offer scaling profiles, which will keep a specific number of replicas online at all times.
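The scaling behaviour described above can be illustrated with a small Python sketch. It is not Knative's actual autoscaling algorithm; the function name, the per-replica request target and the replica limits are assumptions chosen for illustration. It only shows the idea of scaling to zero when idle, scaling up once a request threshold is exceeded, and keeping a minimum number of replicas online when a scaling profile asks for it.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Return how many replicas to run for the observed load.

    Hypothetical illustration only: min_replicas=0 models scale-to-zero,
    min_replicas>0 models a scaling profile that keeps replicas warm.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if in_flight_requests <= 0:
        # No traffic: fall back to the configured minimum (possibly zero).
        return min_replicas
    # One replica per `target_per_replica` concurrent requests, rounded up.
    needed = math.ceil(in_flight_requests / target_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(needed, max_replicas))
```

With `min_replicas=0`, an idle service releases all of its resources but pays the container start-up time on the next request; a profile with `min_replicas=2` trades memory for latency.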
Non-functional LT resources uploaded to the platform are persisted in an S3-compatible object storage and can be downloaded from there.

Catalogue
The metadata records stored in the catalogue enable access to services and data resources. They are described using the ELG metadata schema (Labropoulou et al., 2020) and can be browsed and explored. The catalogue also includes a registry of stakeholders who develop LT services or products, and relevant projects, thus providing an overview of the whole European LT landscape. The ELG metadata schema builds upon, consolidates and updates the META-SHARE schema (Gavrilidou et al., 2012; Piperidis et al., 2018; Labropoulou et al., 2018), taking into account ELG's requirements, recent developments in the metadata domain (e. g., FAIR), and the need for creating a common pool of resources through exchange mechanisms with collaborating initiatives.
The metadata schema caters for the description of the ELG core entities, i. e., Language Technologies (tools/services), including functional services and non-functional ones, and Data Language Resources, comprising data sets (corpora), language descriptions (i. e., models) and lexical/conceptual resources (e. g., gazetteers, ontologies, etc.). It also provides for related entities involved in the production, namely actors (organizations, groups and persons), projects, documents, and licences/terms of use. Metadata records are created by providers using the online editor (Section 2.6.1), or from other sources through harvesting and conversion APIs (Section 2.5), gradually enriched through (semi-)automatic processes and curated by persons who rightfully claim them.

Functional Services
The European LT landscape is broad and varied, with many providers of different classes of services and tools, exposed through different APIs and data formats. We attempt to bring more order to this varied landscape by identifying classes of related services, and providing a generic API for each class. So far, we have identified three classes. (1) Machine Translation (MT) services take text in one language and translate it into text in another language, possibly with additional metadata associated with each segment. (2) Information Extraction (IE) services take text and annotate it with metadata on specific segments. This class can cover a wide variety of services from basic NER through to complex sentiment analysis and domain-specific tools. (3) Automatic Speech Recognition (ASR) services take audio as input and produce text (e. g., a transcription) as output, possibly with metadata associated with each segment.
Other clusters are emerging as we are preparing more services for integration, e. g., text-to-speech and text classification. Our goal is to provide services of all classes for all official EU languages and for other EU and non-EU languages that are of social or strategic interest in the EU. Table 1 shows the overall language coverage of each category of services across all consortium partners; languages have been divided into four groups: (A) EU official languages; (B) other EU languages without official status, plus languages from candidate countries and free trade partners; (C) languages spoken by immigrants or important trade and political partners; (D) languages that do not fit (A), (B), (C).
Release 1 of the platform (April 2020) targeted the languages spoken in the countries of the ELG consortium, with 141 IE and text analysis services, 24 MT, nine ASR, four TTS and two text categorisation services. Further services are being added on a regular basis with 200+ additional IE and text analysis services, 21 MT, eight ASR and nine TTS scheduled to be included by the time of ELG Release 2 in February 2021.
We aim to make it as simple as possible for LT providers to integrate their services, but in a way that avoids the proliferation of incompatible APIs for the same task, allowing users to access the widest range of services without being locked in to a single vendor. Our generic APIs use HTTP as the transport protocol and specific schemas of JSON-based messages as the payload. Providers who want to integrate their services into the ELG need to provide a Docker image that presents an HTTP endpoint that can receive requests and return responses in the specified format (user authentication, authorisation, etc. are handled by the platform). Once a service is integrated, it can be used via the public APIs and UIs (Section 2.6).
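To make the provider side more concrete, the following Python sketch shows the general shape of a request handler that accepts a JSON message and returns a JSON response. The field names (`content`, `annotations`, `start`, `end`, `label`) and the toy "annotate capitalised tokens" logic are assumptions for illustration; they are not the ELG message schema, which is defined in the ELG LT Service API specification.

```python
import json


def handle_request(body: bytes) -> tuple[int, bytes]:
    """Process one JSON request and return (HTTP status, JSON response body).

    Hypothetical message format: {"content": "<text>"} in,
    {"annotations": [{"start": .., "end": .., "label": ..}]} out.
    """
    try:
        message = json.loads(body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        error = {"error": "request body is not valid JSON"}
        return 400, json.dumps(error).encode("utf-8")

    if not isinstance(message, dict) or not isinstance(message.get("content"), str):
        error = {"error": "expected a JSON object with a string 'content' field"}
        return 400, json.dumps(error).encode("utf-8")

    text = message["content"]
    annotations = []
    offset = 0
    # Toy stand-in for a real IE tool: mark every capitalised token.
    for token in text.split():
        start = text.index(token, offset)
        end = start + len(token)
        offset = end
        if token[0].isupper():
            annotations.append({"start": start, "end": end, "label": "Capitalised"})

    return 200, json.dumps({"annotations": annotations}).encode("utf-8")
```

Keeping the handler a pure function of the request body means it can be called from the POST handler of any HTTP framework inside the Docker image and tested without starting a server.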

Data Sets and Language Resources
ELG already provides access to more than 2700 language resources. We ingested substantial resources from existing repositories, especially ELDA/ELRA, ELRC-SHARE (Lösch et al., 2018; Piperidis et al., 2018; Smal et al., 2020) and META-SHARE (Piperidis, 2012; Piperidis et al., 2014). We have also been working on 'external' repositories, about 220 of which have been identified so far. Some (e. g., Zenodo, Quantum Stat) are already being ingested together with two repositories related to ELG, LINDAT/CLARIAH-CZ and ELRA-SHARE-LRs (LRs published at LREC).

Access Methods and User Interfaces
Our main groups of users are: (1) LT/LR providers: companies or research organisations with tools, services or data that can be provided through the ELG; (2) Developers and integrators: companies and research institutions interested in using LT; (3) General LT information seekers; (4) Stakeholders who wish to provide information about events etc.; (5) Casual visitors. We provide three ways of access: REST APIs, web UIs, Python package.

REST APIs
The ELG exposes several REST APIs, which are used by all clients. They are exposed for (1) browsing and searching the catalogue, (2) creating, updating and retrieving metadata records, (3) executing services, (4) downloading resources. Authentication is performed through OAuth2 (OpenID Connect) using JSON Web Tokens.
The catalogue API is based on a JSON serialisation of the metadata schema. The entry point is the search operation, which supports free text search as well as faceted browsing. The metadata record creation, update and retrieval API is controlled by the catalogue module and associates each record with a creator and curator. The curator can edit and update the record until it is published.
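As a rough sketch of how a client might combine free text search, facets and token-based authentication, the following Python snippet builds (but does not send) a search request. The endpoint URL and the parameter names (`q`, `language`) are hypothetical placeholders, not the actual ELG catalogue API; passing the token in an `Authorization: Bearer` header follows common OAuth2 practice and is assumed here.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
CATALOGUE_SEARCH_URL = "https://example.org/catalogue/search"


def build_search_request(
    query: str,
    facets: Optional[dict[str, list[str]]] = None,
    token: Optional[str] = None,
) -> Request:
    """Build a GET request with a free text query and repeated facet parameters."""
    params = [("q", query)]
    # Each facet may carry several values, e.g. language=de&language=fr.
    for name, values in sorted((facets or {}).items()):
        for value in values:
            params.append((name, value))

    headers = {"Accept": "application/json"}
    if token:
        # Assumed OAuth2 bearer token usage; the token itself would be a JWT.
        headers["Authorization"] = f"Bearer {token}"

    url = f"{CATALOGUE_SEARCH_URL}?{urlencode(params)}"
    return Request(url, headers=headers, method="GET")
```

The resulting object could be passed to `urllib.request.urlopen`, and the JSON response would then be parsed with `json.load`.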
The functional service API (internal LT API) provides a way of executing any functional service deployed in the ELG. All functional services of a given class (MT, ASR, etc.) are presented under a common API for that class, allowing the user to choose the best service for their requirements without being locked in to a single vendor. 12 The public-facing LT service API mirrors the internal LT service provider API (see above), being based around the same JSON message formats, but also offers simplified options. It is possible to HTTP POST plain text to an MT service, or binary audio to an ASR service, without having to wrap it in the full JSON envelope or multipart MIME structure used by the internal API. Since the public and internal APIs are conceptually distinct, we can add and offer public APIs that use other technologies (e. g., gRPC). The LT Service Execution Server component translates requests between the public and internal APIs. An asynchronous interaction style is offered for services that require a longer run time to process a request; this works by returning an immediate response that directs the caller to another URL, which it can then poll to request the result.
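The asynchronous interaction style can be sketched on the client side as a simple polling loop. In the Python sketch below, the message keys (`result`, `error`) and the injected `fetch_status` callable are assumptions for illustration; the real message format is defined by the ELG API. Injecting the fetch and sleep functions keeps the loop independent of any particular HTTP library.

```python
import time
from typing import Callable


def poll_for_result(
    fetch_status: Callable[[str], dict],
    status_url: str,
    interval_seconds: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `status_url` until a result or an error is reported.

    `fetch_status` is a caller-supplied function that performs the HTTP GET
    and returns the decoded JSON message (hypothetical keys: 'result', 'error').
    """
    for attempt in range(max_attempts):
        message = fetch_status(status_url)
        if "result" in message:
            return message["result"]
        if "error" in message:
            raise RuntimeError(f"service reported an error: {message['error']}")
        # Still in progress: wait before asking again, except after the last try.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

A fixed polling interval is the simplest choice; a production client would probably add exponential backoff to reduce load on long-running jobs.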

Web Interface (GUI)
Angular 9.0 and TypeScript were adopted for developing the Drupal CMS frontend, which is used for presenting content such as news or conferences. For the catalogue UI we use React. Currently, both web applications (CMS and catalogue frontend) use client-side rendering, i. e., they deliver a single HTML file; the rest of the application comes as JavaScript files. User authorisation is ensured by adding a JSON Web Token (JWT) to data requests, where the user identity data is encoded and sent as a signed JSON object.
For LT services the catalogue record detail page includes a trial GUI, allowing users to experiment with the service in the browser. Generic trial UIs have been developed for the principal service types (ASR, MT, TTS, text annotation and classification services) but LT service providers can also supply their own GUI if the standard ones are not suitable. An example is the family of UDPipe dependency parser services, where the provider has created a custom UI to visualise dependency graphs. 13 The web GUI also includes a metadata editor that supports different entities (LTs, organisations, etc.). It provides validation rules, lookup mechanisms that use values from previously filled-in metadata elements and online help.
12 While workflows that consist of multiple services are currently not addressed by ELG, we do experiment with workflow composition and platform interoperability (Rehm et al., 2020b,c; Moreno-Schneider et al., 2020a).
Figure 3: Python Client Package - code example

Python Client Package
The Python Client Package, available through the Python package manager pip 14 , comprises a command line interface and utility scripts for querying the ELG catalogue and executing ELG-hosted services via REST API calls. For features that require authentication, e. g., calling services, the client prompts the user to enter a token which is received after successful authentication in a browser window (Figure 3). This simplifies the integration of ELG-hosted services into Python projects.

Contribution of Services and Resources
We want to enable commercial and non-commercial providers to adapt their LT services so that they can be integrated into the ELG and also to make this ingestion as simple as possible. Currently, the process consists of six steps: (1) adapt the service to fit the ELG API; (2) create a Docker image; (3) push the image into a Docker registry; (4) deploy the service by creating a Kubernetes configuration file; (5) create an ELG provider account; (6) register the service by creating a metadata record. For some of the ELG services, the integration took a few days, for others only a few hours. This effort was recently reduced further by adding Docker templates for the most common cases and introducing the metadata editor. Two alternative ways of integrating a service exist. It is possible to package the LT tool in a container that does not implement the ELG LT service API. In this case, a second container is required as an adapter, which implements the ELG LT service API and communicates with the LT tool container. It is also possible to run an LT service outside the cluster: here, a proxy container that implements the ELG LT service API is required and deployed in the cluster for accessing the external service. Libraries are available that produce skeleton code.
13 Trial UIs can include third-party code. They are sandboxed using an iframe and configured via JavaScript message passing.
14 https://pypi.org/project/elg/
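The adapter variant can be pictured as a thin translation layer between the platform's JSON messages and whatever interface the existing tool already speaks. The Python sketch below is illustrative only: `legacy_tool`, its tab-separated protocol and the JSON field names (`content`, `texts`) are all hypothetical, and in a real deployment the adapter would reach the tool container over the network rather than through a function call.

```python
import json
from typing import Callable


def legacy_tool(payload: str) -> str:
    """Stand-in for an existing tool with its own plain-text protocol.

    Hypothetical input format: "<source>\t<target>\t<text>".
    The "translation" is a toy: it just upper-cases the text.
    """
    source, target, text = payload.split("\t", 2)
    return f"[{source}->{target}] {text.upper()}"


def make_adapter(
    tool: Callable[[str], str], source: str, target: str
) -> Callable[[bytes], bytes]:
    """Wrap `tool` so that it accepts and returns JSON messages."""

    def handle(body: bytes) -> bytes:
        message = json.loads(body)
        # Translate the JSON request into the tool's native input format.
        native_output = tool(f"{source}\t{target}\t{message['content']}")
        # Translate the tool's native output back into a JSON response.
        response = {"texts": [{"content": native_output}]}
        return json.dumps(response).encode("utf-8")

    return handle
```

Because the tool itself is left untouched, this approach suits providers who cannot or do not want to change an existing image; the cost is an extra container and an extra hop per request.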

Key Stakeholders
The ELG is meant to be a joint umbrella platform for the whole European LT landscape including industry and research. ELG caters for commercial LT providers who want to showcase their products, services and their organisation. We want to provide the marketplace for European LT, which requires coverage of, ideally, all European provider companies. In December 2020 we populated the ELG catalogue with a list of 900 LT companies.
Representatives of these organisations can claim (or delete) their record and take over maintenance of their ELG page, including upload of services or data sets. Research centres and universities are also LT providers but their interest is research-driven, providing data sets and experimental software. LT users are, e. g., organisations who want to make use of LT. They interact with the ELG in the role of a consumer or potential customer. ELG also collaborates with a number of EU-funded projects and initiatives (Rehm et al., 2020c,d) and set up a network of 32 National Competence Centres (NCCs), which function as bridges between the national and regional communities and the ELG.

Open Calls: Pilot Projects
ELG provides approx. 30% of its project budget to a number of pilot projects. The pilots either broaden ELG's portfolio (by developing services or resources), or demonstrate the ELG's usefulness. Financial support is awarded following an open, transparent and expert-driven evaluation process. The first call was published in March 2020, the second one in October 2020. The first set of projects started in July 2020; the second set starts in February 2021 with a duration of 9-12 months.
In the first call, 110 proposals were accepted for evaluation with applicants from 29 countries. We received more proposals from SMEs (62) than research organisations (48). While 79 proposals focused on contributing services or resources, 31 proposals concentrated on developing applications using the ELG. We selected ten projects for funding, amounting to a sum of 1,363,915€ in total. We received a total of 106 proposals to the second call with applicants from 28 countries. Again, we had more proposals from SMEs (61) than from research organisations (45). In February 2021, five projects were selected for funding.

Legal Entity
We will establish a not-for-profit legal entity in early 2022, which will take over operation of the ELG platform after the end of the current EU project (June 2022). The long-term operational model is currently under development.

Related Work
ELG builds upon previous work of the ELG consortium and the wider European LT community, especially META-NET/META and ELRC.
In addition, we have collected more than 30 platforms, projects or initiatives that can be considered relevant for ELG including, among others, UIMA (Ferrucci and Lally, 2003), CLARIN (Hinrichs and Krauwer, 2014), DKPro (Gurevych et al., 2007); Rehm et al. (2020a) provide an exhaustive comparison. They share at least one of the following goals with ELG, i. e., they provide: (1) a collection of LT/NLP tools or data sets; (2) a platform which harvests metadata records from distributed sources; (3) a platform for the sharing of tools or data sets. While related projects do exist, the approach of ELG is unique. The platform that most closely resembles ELG is the National Platform for LT, operated by the Ministry of Electronics and Information Technology in India.
Several global technology enterprises offer LT services. Among these are Amazon Comprehend and Microsoft Azure Cognitive Services (Del Sole, 2018). Furthermore, Google recently (Sept. 2018) released a search platform for data sets. Intento offers commercial LT services from different providers for selected tasks.

Conclusions and Future Work
It has been argued that Europe should not outsource its multilingual communication and language infrastructure to other continents since the European demands are unique and complex (Rehm and Uszkoreit, 2013; Rehm, 2017; Rehm et al., 2020d). Instead, Europe should make use of and support its own LT community. One of the obstacles to overcome is the creation of a joint technology platform. The ELG will foster LTs for Europe built in Europe. In its first two years, the ELG project has seen the demo of the MVP in October 2019, Release 1 in early 2020 and two successfully completed open calls for pilot projects. We have been improving and extending the platform and continuously added services and data sets. While Release 2 of the platform will follow in March 2021, Release 3 is foreseen for early 2022.