Nathan Zhang
2024
Data Anonymization for Privacy-Preserving Large Language Model Fine-Tuning on Call Transcripts
Shayna Gardiner | Tania Habib | Kevin Humphreys | Masha Azizi | Frederic Mailhot | Anne Paling | Preston Thomas | Nathan Zhang
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Large language models in public-facing industrial applications must accurately process data for the domain in which they are deployed, but they must not leak sensitive or confidential information when used. We present a process for anonymizing training data, a framework for quantitatively and qualitatively assessing the effectiveness of this process, and an assessment of the effectiveness of models fine-tuned on anonymized data in comparison with commercially available LLM APIs.
2021
Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction
Vagrant Gautam | Wang Yau Li | Zafarullah Mahmood | Fred Mailhot | Shreekantha Nadig | Riqiang Wang | Nathan Zhang
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
We describe three baseline-beating systems for the high-resource English-only sub-task of the SIGMORPHON 2021 Shared Task 1: a small ensemble that Dialpad’s speech recognition team uses internally, a well-known off-the-shelf model, and a larger ensemble model comprising these and others. We additionally discuss the challenges related to the provided data, along with the processing steps we took.
Co-authors
- Frederic Mailhot 2
- Masha Azizi 1
- Shayna Gardiner 1
- Vagrant Gautam 1
- Tania Habib 1