2024
Selling Personal Information: Data Brokers and the Limits of US Regulation
Denise DiPersio
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024
A principal pillar of the US Blueprint for an AI Bill of Rights is data privacy, specifically, that individuals should be protected from abusive practices by data collectors and data aggregators, and that users should have control over how their personal information is collected and used. An area that spotlights the need for such protections is found in the common practices of data brokers who scrape, purchase, process and reassemble personal information in bulk and sell it for a variety of downstream uses. Such activities almost always occur in the absence of users’ knowledge or meaningful consent, yet they are legal under US law. This paper examines how data brokers operate, provides some examples of recent US regulatory actions taken against them, summarizes federal efforts to redress data broker practices and concludes that as long as there continues to be no comprehensive federal data protection and privacy scheme, efforts to control such behavior will have only a limited effect. This paper also addresses the limits of informed consent on the use of personal information in language resources and suggests a solution in a holistic approach to data protection and privacy across the data/development life cycle.
2022
Data Protection, Privacy and US Regulation
Denise DiPersio
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
This paper examines the state of data protection and privacy in the United States. There is no comprehensive federal data protection or data privacy law despite bipartisan and popular support. There are several data protection bills pending in the 2022 session of the US Congress, five of which are examined in Section 2 below. Although it is not likely that any will be enacted, the growing number reflects the concerns of citizens and lawmakers about the power of big data. Recent actions against data abuses, including data breaches, litigation and settlements, are reviewed in Section 3 of this paper. These reflect the real harm caused when personal data is misused. Section 4 contains a brief US copyright law update on the fair use exemption, highlighting a recent court decision and indications of a re-thinking of the fair use analysis. In Section 5, some observations are made on the role of privacy in data protection regulation. It is argued that privacy should be considered from the start of the data collection and technology development process. Enhanced awareness of ethical issues, including privacy, through university-level data science programs will also lay the groundwork for best practices throughout the data and development cycles.
2020
Related Works in the Linguistic Data Consortium Catalog
Daniel Jaquette | Christopher Cieri | Denise DiPersio
Proceedings of the Twelfth Language Resources and Evaluation Conference
Defining relations between language resources provides an archive with the ability to better serve its users. This paper covers the development and implementation of a Related Works addition to the Linguistic Data Consortium’s (LDC) catalog. The authors go step-by-step through the development of the Related Works schema, implementation of the software and database changes, and data entry of the relations. The Related Works schema involved developing a set of controlled terms for relations based on previous work and other schemas. Software and database changes consisted of both front and back end interface additions, along with modification and additions to the LDC Catalog database tables. Data entry consisted of two parts: seed data from previous work and 2019 language resources, and ongoing legacy population. Previous work in this area is discussed as well as overview information about the LDC Catalog. A list of the full LDC Related Works terms is included with brief explanations.
A Progress Report on Activities at the Linguistic Data Consortium Benefitting the LREC Community
Christopher Cieri | James Fiumara | Stephanie Strassel | Jonathan Wright | Denise DiPersio | Mark Liberman
Proceedings of the Twelfth Language Resources and Evaluation Conference
This latest in a series of Linguistic Data Consortium (LDC) progress reports to the LREC community does not describe any single language resource, evaluation campaign or technology but sketches the activities, since the last report, of a data center devoted to supporting the work of LREC attendees among other research communities. Specifically, we describe 96 new corpora released in 2018-2020 to date, a new technology evaluation campaign, ongoing activities to support multiple common task human language technology programs, and innovations to advance the methodology of language data collection and annotation.
2018
From ‘Solved Problems’ to New Challenges: A Report on LDC Activities
Christopher Cieri | Mark Liberman | Stephanie Strassel | Denise DiPersio | Jonathan Wright | Andrea Mazzucchi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Trends in HLT Research: A Survey of LDC’s Data Scholarship Program
Denise DiPersio | Christopher Cieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Since its inception in 2010, the Linguistic Data Consortium’s data scholarship program has awarded no-cost grants of data to 64 recipients from 26 countries. A survey of the twelve cycles to date (two awards each in the Fall and Spring semesters from Fall 2010 through Spring 2016) yields an interesting view into graduate program research trends in human language technology and related fields and the particular data sets deemed important to support that research. The survey also reveals regions in which such activity appears to be on the rise, including Arabic-speaking regions and portions of the Americas and Asia.
Data Management Plans and Data Centers
Denise DiPersio | Christopher Cieri | Daniel Jaquette
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Data management plans, data sharing plans and the like are now required by funders worldwide as part of research proposals. Concerned with promoting the notion of open scientific data, funders view such plans as the framework for satisfying the generally accepted requirements for data generated in funded research projects, among them that it be accessible, usable, standardized to the degree possible, secure and stable. This paper examines the origins of data management plans, their requirements and issues they raise for data centers and HLT resource development in general.
2014
New Directions for Language Resource Development and Distribution
Christopher Cieri | Denise DiPersio | Mark Liberman | Andrea Mazzucchi | Stephanie Strassel | Jonathan Wright
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Despite the growth in the number of linguistic data centers around the world, their accomplishments and expansions and the advances they have helped enable, the language resources that exist are a small fraction of those required to meet the goals of Human Language Technologies (HLT) for the world's languages and the promises they offer: broad access to knowledge, direct communication across language boundaries and engagement in a global community. Using the Linguistic Data Consortium as a focus case, this paper sketches the progress of data centers, summarizes recent activities and then turns to several issues that have received inadequate attention and proposes some new approaches to their resolution.
Intellectual Property Rights Management with Web Service Grids
Christopher Cieri | Denise DiPersio
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT
2012
LDC Language Resource Database: Building a Bibliographic Database
Eleftheria Ahtaridis | Christopher Cieri | Denise DiPersio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The Linguistic Data Consortium (LDC) creates and provides language resources (LRs) including data, tools and specifications. In order to assess the impact of these LRs and to support both LR users and authors, LDC is collecting metadata about and URLs for research papers that introduce, describe, critique, extend or rely upon LDC LRs. Current collection efforts focus on papers published in journals and conference proceedings that are available online. To date, nearly 300, or over half, of the LRs LDC distributes have been searched for extensively, and almost 8000 research papers about these LRs have been documented. This paper discusses the issues with collecting references and includes preliminary analysis of those results. The remaining goals of the project are also outlined.
Twenty Years of Language Resource Development and Distribution: A Progress Report on LDC Activities
Christopher Cieri | Marian Reed | Denise DiPersio | Mark Liberman
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
On the Linguistic Data Consortium's (LDC) 20th anniversary, this paper describes the changes to the language resource landscape over the past two decades, how LDC has adjusted its practice to adapt to them and how the business model continues to grow. Specifically, we will discuss LDC's evolving roles and changes in the sizes and types of LDC language resources (LR) as well as the data they include and the annotations of that data. We will also discuss adaptations of the LDC business model and the sponsored projects it supports.
2010
A Very Large Scale Mandarin Chinese Broadcast Corpus for GALE Project
Yi Liu | Pascale Fung | Yongsheng Yang | Denise DiPersio | Meghan Glenn | Stephanie Strassel | Christopher Cieri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present the design, collection, transcription and analysis of a Mandarin Chinese Broadcast Collection of over 3000 hours. The data was collected by Hong Kong University of Science and Technology (HKUST) in China on a cable TV and satellite transmission platform established in support of the DARPA Global Autonomous Language Exploitation (GALE) program. The collection includes broadcast news (BN) and broadcast conversation (BC) including talk shows, roundtable discussions, call-in shows, editorials and other conversational programs that focus on news and current events. HKUST also collects detailed information about all recorded programs. A subset of BC and BN recordings are manually transcribed with standard Chinese characters in UTF-8 encoding, using specific mark-ups for a small set of spontaneous and conversational speech phenomena. The collection is among the largest and first of its kind for Mandarin Chinese Broadcast speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks, such as spontaneous speech processing and recognition, topic detection, information retrieval, and speaker recognition. HKUST's acoustic analysis of 500 hours of the speech and transcripts demonstrates the positive impact this data could have on system performance.
Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Kevin Walker | Christopher Caruso | Denise DiPersio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper will present a detailed description of the design and implementation of LDC's collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system's operation. This paper will also discuss the challenges of managing remote collections, in particular, the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC's collection database and downstream tasking workflow.
2008
The Linguistic Data Consortium Member Survey: Purpose, Execution and Results
Marian Reed | Denise DiPersio | Christopher Cieri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The Linguistic Data Consortium (LDC) seeks to provide its members with quality linguistic resources and services. In order to pursue these ideals and to remain current, LDC monitors the needs and sentiments of its communities. One mechanism LDC uses to generate feedback on consortium and resource issues is the LDC Member Survey. The survey allows LDC Members and nonmembers to provide LDC with valuable insight into their own unique circumstances, their current and future data needs and their views on LDC's role in meeting them. When the 2006 Survey was found to be a useful tool for communicating with the Consortium membership, a 2007 Survey was organized and administered. As a result of the surveys, LDC has confirmed that it has made a positive impact on the community and has identified ways to improve the quality of service and the diversity of monthly offerings. Many respondents recommended ways to improve LDC's functions, ordering mechanism and webpage. Some of these comments have inspired changes to LDC's operation and strategy.
2006
Integrated Linguistic Resources for Language Exploitation Technologies
Stephanie Strassel | Christopher Cieri | Andrew Cole | Denise Dipersio | Mark Liberman | Xiaoyi Ma | Mohamed Maamouri | Kazuaki Maeda
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The Linguistic Data Consortium has recently embarked on an effort to create integrated linguistic resources and related infrastructure for language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program. GALE targets an end-to-end system consisting of three major engines: Transcription, Translation and Distillation. Multilingual speech or text from a variety of genres is taken as input and English text is given as output, with information of interest presented in an integrated and consolidated fashion to the end user. GALE's goals require a quantum leap in the performance of human language technology, while also demanding solutions that are more intelligent, more robust, more adaptable, more efficient and more integrated. LDC has responded to this challenge with a comprehensive approach to linguistic resource development designed to support GALE's research and evaluation needs and to provide lasting resources for the larger Human Language Technology community.