pdf | bib
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
Ingo Siegert | Mickaël Rigault | Victoria Arranz
pdf | bib | abs
Keynote Speech - Major developments in the legal framework concerning language resources
Pawel Kamocki
Introductory Talk for the Workshop on Legal and Ethical Issues in Human Language Technologies, LREC 2022, Marseille, 24 June 2022
pdf | bib | abs
Sentiment Analysis and Topic Modeling for Public Perceptions of Air Travel: COVID Issues and Policy Amendments
Avery Field | Aparna Varde | Pankaj Lal
Air travel is among the many industries impacted by the COVID pandemic. Airlines and airports rely on public sector information to enforce guidelines for ensuring the health and safety of travelers. Such guidelines can be policy amendments or laws during the pandemic. In response to the inception of COVID preventive policies, travelers have exercised freedom of expression via the avenue of online reviews. This avenue facilitates voicing public concern while anonymizing / concealing user identity as needed. It is important to assess opinions on policy amendments to ensure transparency and openness, while also preserving confidentiality and ethics. Hence, this study leverages data science to analyze, with identity protection, the online reviews of airlines and airports since 2017, considering the impacts of COVID issues and relevant policy amendments since 2020. Supervised learning with VADER sentiment analysis is deployed to predict changes in opinion from 2017 to date. Unsupervised learning with LDA topic modeling is employed to discover air travelers’ major areas of concern before and after the pandemic. This study reveals that COVID policies have worsened public perceptions of air travel and aroused notable new concerns affecting the economy, the environment and health.
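As a rough illustration of the pipeline sketched in this abstract, the snippet below scores review sentiment with VADER and extracts topics with LDA. It is a minimal sketch assuming the off-the-shelf vaderSentiment and gensim packages and a toy list of reviews; it is not the authors' actual data or code.

```python
# Minimal sketch of a VADER + LDA review-analysis pipeline (illustrative only).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from gensim import corpora
from gensim.models import LdaModel

reviews = [  # toy stand-ins for airline/airport reviews
    "Mask rules at the gate were confusing and the flight was delayed",
    "Smooth check-in, friendly crew, felt safe the whole trip",
    "Refund policy after the COVID cancellation was a nightmare",
]

# Sentiment: VADER's compound score ranges from -1 (most negative) to +1 (most positive).
analyzer = SentimentIntensityAnalyzer()
for text in reviews:
    print(analyzer.polarity_scores(text)["compound"], text)

# Topics: bag-of-words LDA over lower-cased, whitespace-tokenised reviews.
tokens = [r.lower().split() for r in reviews]
dictionary = corpora.Dictionary(tokens)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokens]
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, top_words in lda.print_topics(num_words=5):
    print(topic_id, top_words)
```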
pdf | bib | abs
Data Protection, Privacy and US Regulation
Denise DiPersio
This paper examines the state of data protection and privacy in the United States. There is no comprehensive federal data protection or data privacy law despite bipartisan and popular support. There are several data protection bills pending in the 2022 session of the US Congress, five of which are examined in Section 2 below. Although it is not likely that any will be enacted, the growing number reflects the concerns of citizens and lawmakers about the power of big data. Recent actions against data abuses, including data breaches, litigation and settlements, are reviewed in Section 3 of this paper. These reflect the real harm caused when personal data is misused. Section 4 contains a brief US copyright law update on the fair use exemption, highlighting a recent court decision and indications of a re-thinking of the fair use analysis. In Section 5, some observations are made on the role of privacy in data protection regulation. It is argued that privacy should be considered from the start of the data collection and technology development process. Enhanced awareness of ethical issues, including privacy, through university-level data science programs will also lay the groundwork for best practices throughout the data and development cycles.
pdf | bib | abs
Pseudonymisation of Speech Data as an Alternative Approach to GDPR Compliance
Pawel Kamocki | Ingo Siegert
The debate on the use of personal data in language resources usually focuses — and rightfully so — on anonymisation. However, this very same debate usually ends quickly with the conclusion that proper anonymisation would necessarily cause loss of linguistically valuable information. This paper discusses an alternative approach — pseudonymisation. While pseudonymisation does not solve all the problems (inasmuch as pseudonymised data are still to be regarded as personal data and therefore their processing should still comply with the GDPR principles), it does provide significant relief, especially — but not only — for those who process personal data for research purposes. This paper describes pseudonymisation as a measure to safeguard the rights and interests of data subjects under the GDPR (with a special focus on the right to be informed). It also provides a concrete example of pseudonymisation carried out within a research project at the Institute of Information Technology and Communications of the Otto von Guericke University Magdeburg.
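To make the contrast with anonymisation concrete, the sketch below shows one common way pseudonymisation is implemented in practice: direct identifiers (here, speaker names) are replaced by consistent pseudonyms derived with a secret key that is stored separately, so only the key holder can link records back to individuals. This is a generic illustration using Python's standard hmac and hashlib modules and a hypothetical key; it is not the pipeline used in the Magdeburg project.

```python
# Illustrative pseudonymisation: replace direct identifiers with keyed, stable pseudonyms.
# The secret key must be kept separately from the pseudonymised data (GDPR Art. 4(5)).
import hmac
import hashlib

SECRET_KEY = b"store-this-key-separately"  # hypothetical key held only by the data controller

def pseudonym(identifier: str) -> str:
    """Derive a stable pseudonym; the same speaker always maps to the same code."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return "SPEAKER_" + digest[:8]

record = {"speaker": "Jane Doe", "transcript": "I would like to book a table for two."}
record["speaker"] = pseudonym(record["speaker"])
print(record)  # the name is gone, but the key holder can still recompute the mapping
```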
pdf | bib | abs
Categorizing legal features in a metadata-oriented task: defining the conditions of use
Mickaël Rigault | Victoria Arranz | Valérie Mapelli | Penny Labropoulou | Stelios Piperidis
In recent times, more attention has been paid by the Human Language Technology (HLT) community to the legal framework for making available and reusing Language Resources (LR) and tools. Licensing is now an issue that is foreseen in most research projects and that is essential to provide legal certainty for repositories when distributing resources. Some repositories, such as Zenodo or Quantum Stat, do not offer the possibility to search for resources by license, which can make searching for relevant resources a very complex task. Other repositories, such as Hugging Face, propose a search feature by license, which may still make it difficult to figure out what use can be made of such resources. During the European Language Grid (ELG) project, we moved a step forward to link metadata with the terms and conditions of use. In this paper, we document the process we undertook to categorize the legal features of the licenses listed in the SPDX license list and widely used in the HLT community, as well as those licenses used within the ELG platform.
pdf | bib | abs
About Migration Flows and Sentiment Analysis on Twitter data: Building the Bridge between Technical and Legal Approaches to Data Protection
Thilo Gottschalk | Francesca Pichierri
Sentiment analysis has always been an important driver of political decisions and campaigns across all fields. Novel technologies allow automating the analysis of sentiment on a large scale and hence provide allegedly more accurate outcomes. With user numbers in the billions and their increasingly important role in societal discussions, social media platforms have become an obvious data source for these types of analysis. Due to its public availability, the relative ease of access and the sheer amount of available data, the Twitter API has become a particularly important source for researchers and data analysts alike. Despite the evident value of these data sources, the analysis of such data comes with legal, ethical and societal risks that should be taken into consideration when analysing data from Twitter. This paper describes these risks along the technical processing pipeline and proposes related mitigation measures.
pdf | bib | abs
Transparency and Explainability of a Machine Learning Model in the Context of Human Resource Management
Sebastien Delecraz | Loukman Eltarr | Olivier Oullier
We describe how the proprietary machine learning algorithms developed by Gojob, an HR Tech company, to match candidates to a job offer are made as transparent and explainable as possible to users (i.e., our recruiters) and our clients (e.g., companies looking to fill jobs). We detail how our matching algorithm (which identifies the best candidates for a job offer) controls the fairness of its outcome. We describe the steps we have taken to ensure that the decisions made by our mathematical models not only inform but also improve the performance of our recruiters.
pdf | bib | abs
Public Interactions with Voice Assistant – Discussion of Different One-Shot Solutions to Preserve Speaker Privacy
Ingo Siegert | Yamini Sinha | Gino Winkelmann | Oliver Jokisch | Andreas Wendemuth
In recent years, the use of voice assistants has grown rapidly. The user’s speech data is stored and processed on a cloud platform, which is the decisive factor for good performance in speech processing and understanding. Although voice assistants are usually found in private households, many business cases also employ them in public places, be it as an information service, a tour guide, or a booking system. As long as the systems are used in private spaces, it could be argued that the usage is voluntary and that users themselves are responsible for what is processed by the voice assistant system. Once the private space is left, the use is no longer voluntary: users may merely be made aware that their voice is processed in the cloud, and background voices can be unintentionally recorded and processed as well. Thus, the usage of voice assistants in public environments raises many privacy concerns. In this contribution, we discuss possible anonymization solutions to hide the speakers’ identity, thus allowing safe cloud processing of speech data. Thereby, we promote the public use of voice assistants.
pdf | bib | abs
Keynote Speech - Voice anonymization and the GDPR
Brij Mohan Lal Srivastava
Talk for the Workshop on Legal and Ethical Issues in Human Language Technologies, LREC 2022, Marseille, 24 June 2022
pdf | bib | abs
Cross-Clinic De-Identification of Swedish Electronic Health Records: Nuances and Caveats
Olle Bridal | Thomas Vakili | Marina Santini
Privacy preservation of sensitive information is one of the main concerns in clinical text mining. Due to the inherent privacy risks of handling clinical data, the clinical corpora used to create the clinical Named Entity Recognition (NER) models underlying clinical de-identification systems cannot be shared. This situation implies that clinical NER models are trained and tested on data originating from the same institution since it is rarely possible to evaluate them on data belonging to a different organization. These restrictions on sharing make it very difficult to assess whether a clinical NER model has overfitted the data or if it has learned any undetected biases. This paper presents the results of the first-ever cross-institution evaluation of a Swedish de-identification system on Swedish clinical data. Alongside the encouraging results, we discuss differences and similarities across EHR naming conventions and NER tagsets.
pdf | bib | abs
Generating Realistic Synthetic Curricula Vitae for Machine Learning Applications under Differential Privacy
Andrea Bruera | Francesco Alda | Francesco Di Cerbo
Applications involving machine learning in Human Resources (HR, the management of human talent in order to accomplish organizational goals) must respect the privacy of the individuals whose data is being used. This is a difficult aim, given the extremely personal nature of text data handled by HR departments, such as Curricula Vitae (CVs).
pdf | bib | abs
MAPA Project: Ready-to-Go Open-Source Datasets and Deep Learning Technology to Remove Identifying Information from Text Documents
Victoria Arranz | Khalid Choukri | Montse Cuadros | Aitor García Pablos | Lucie Gianola | Cyril Grouin | Manuel Herranz | Patrick Paroubek | Pierre Zweigenbaum
This paper presents the outcomes of the MAPA project: a set of annotated corpora for 24 languages of the European Union and an open-source, customisable toolkit able to detect and substitute sensitive information in text documents from any domain, using state-of-the-art, deep learning-based named entity recognition techniques. In the context of the project, the toolkit has been developed and tested on administrative, legal and medical documents, obtaining state-of-the-art results. As a result of the project, 24 dataset packages have been released and the de-identification toolkit is available as open source.
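For readers who want a feel for the detect-and-substitute step such a toolkit performs, the sketch below masks named entities with an off-the-shelf Hugging Face transformers NER pipeline. It only illustrates the general technique under that assumption; it is not the MAPA toolkit or its models, which are available from the project's open-source release.

```python
# Generic NER-based de-identification sketch (not the MAPA toolkit itself):
# detect entity spans with a transformers NER pipeline, then substitute placeholders.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # downloads a default English NER model

def deidentify(text: str) -> str:
    entities = ner(text)
    # Replace spans from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(deidentify("John Smith was admitted to Saint-Louis Hospital in Paris on Monday."))
# e.g. "[PER] was admitted to [ORG] in [LOC] on Monday."
```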
pdf | bib | abs
PriPA: A Tool for Privacy-Preserving Analytics of Linguistic Data
Jeremie Clos | Emma McClaughlin | Pepita Barnard | Elena Nichele | Dawn Knight | Derek McAuley | Svenja Adolphs
The days of large amorphous corpora collected with armies of Web crawlers and stored indefinitely are, or should be, coming to an end. There is a wealth of linguistic information that is increasingly difficult to access, hidden in personal data that would be unethical and technically challenging to collect using traditional methods such as Web crawling and mass surveillance of online discussion spaces. Advances in privacy regulations such as the GDPR and changes in the public perception of privacy bring into question the problematic ethical dimension of extracting information from unaware if not unwilling participants. Modern corpora need to adapt, be focused on testing specific hypotheses, and be respectful of the privacy of the people who generated their data. Our work focuses on using a distributed participatory approach and continuous informed consent to solve these issues, by allowing participants to voluntarily contribute their own censored personal data at a granular level. We evaluate our approach in a three-pronged manner: testing the accuracy of measurement of statistical measures of language with respect to standard corpus linguistics tools, evaluating the usability of our application with a participant involvement panel, and using the tool for a case study on health communication.
pdf | bib | abs
Legal and Ethical Challenges in Recording Air Traffic Control Speech
Mickaël Rigault | Claudia Cevenini | Khalid Choukri | Martin Kocour | Karel Veselý | Igor Szoke | Petr Motlicek | Juan Pablo Zuluaga-Gomez | Alexander Blatt | Dietrich Klakow | Allan Tart | Pavel Kolčárek | Jan Černocký
In this paper, the authors detail the various legal and ethical issues faced during the ATCO2 project. This project is aimed at developing tools to automatically collect and transcribe air traffic conversations, especially conversations between pilots and air traffic control towers. The authors discuss issues related to intellectual property, public data, privacy, and general ethical questions raised by the collection of air traffic control speech.
pdf | bib | abs
It is not Dance, is Data: Gearing Ethical Circulation of Intangible Cultural Heritage practices in the Digital Space
Jorge Yánez | Amel Fraisse
The documentation, protection and dissemination of Intangible Cultural Heritage (ICH) in the digital age pose significant theoretical, technological and legal challenges. Through a multidisciplinary lens, this paper presents new approaches for collecting, documenting, encrypting and protecting ICH-related data for more ethical circulation. Human-movement recognition technologies such as motion capture allow for the recording, extraction and reproduction of human movement with unprecedented precision. The once indistinguishable or hard-to-trace reproduction of dance steps between their creators and unauthorized third parties becomes patent through the transmission of embodied knowledge, but in the form of data. This new battlefield prompted by digital technologies only adds to the disputes within the creative industries in terms of authorship, ownership and commodification of body language. In this paper, we aim to disentangle the various layers present in the process of digitisation of the dancing body, and to identify its by-products as well as the possible ownership rights that might arise. “Who owns what?”, the basic premise of intellectual property law, is transposed, in this case, onto the various types of data generated when intangible cultural heritage, in the form of dance, is digitised through motion capture and encrypted with blockchain technologies.