2023
Party Extraction from Legal Contract Using Contextualized Span Representations of Parties
Sanjeepan Sivapiran | Charangan Vasantharajan | Uthayasanker Thayasivam
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Extracting legal entities from legal documents, particularly the parties in contract documents, poses a significant challenge for legal assistive software. Many existing party extraction systems generate numerous false positives due to the complex structure of legal text. In this study, we present a novel and accurate method for extracting parties from legal contract documents by leveraging contextual span representations. To facilitate our approach, we have curated a large-scale dataset comprising 1000 contract documents with party annotations. Our method incorporates several enhancements to a SQuAD 2.0 question-answering system, specifically tailored to the intricate nature of legal text. These enhancements include modifications to the activation function, an increased number of encoder layers, and the addition of normalization and dropout layers stacked on top of the output encoder layer. Baseline experiments reveal that our model, fine-tuned on our dataset, outperforms the current state-of-the-art model. Furthermore, we explore various combinations of the aforementioned techniques to further improve accuracy. A hybrid approach that combines 24 encoder layers with normalization and dropout layers achieves the best results, with an exact match score of 0.942 (a +6.2% improvement).
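The idea of stacking normalization and dropout layers on top of the encoder output before predicting span boundaries, as the abstract describes, can be sketched as a minimal NumPy forward pass. All layer sizes, weights, and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token representation (hypothetical post-encoder normalization)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dropout(x, p=0.1, train=False):
    # Dropout is the identity at inference time
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return mask * x / (1.0 - p)

def span_logits(encoder_out, w_start, w_end):
    # Project each token vector to start/end scores, SQuAD-style
    h = dropout(layer_norm(encoder_out))
    return h @ w_start, h @ w_end

seq_len, hidden = 6, 8                      # toy sizes
enc = rng.normal(size=(seq_len, hidden))    # stand-in for the encoder output
w_s = rng.normal(size=hidden)
w_e = rng.normal(size=hidden)
start, end = span_logits(enc, w_s, w_e)
span = (int(np.argmax(start)), int(np.argmax(end)))
print(span)
```

The predicted party span is simply the argmax start and end positions; the real system would additionally constrain start ≤ end and handle the no-answer case from SQuAD 2.0.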
2021
A Survey on Paralinguistics in Tamil Speech Processing
Anosha Ignatius | Uthayasanker Thayasivam
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Speech carries not only semantic content but also paralinguistic information, which captures the speaking style. Speaker traits and emotional states affect how words are spoken. Research on paralinguistic information is an emerging field in speech and language processing, with many potential applications including speech recognition, speaker identification and verification, emotion recognition, and accent recognition. Among these, there is significant interest in emotion recognition from speech. This paper presents a detailed study of the paralinguistic information present in the speech signal and an overview of research on speech emotion for the Tamil language.
Hypers@DravidianLangTech-EACL2021: Offensive language identification in Dravidian code-mixed YouTube Comments and Posts
Charangan Vasantharajan | Uthayasanker Thayasivam
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Code-mixed offensive content has been used pervasively in social media posts over the last few years. Consequently, identifying the different forms of such content (e.g., hate speech and sentiment) has attracted significant attention from the research community and has contributed to the creation of datasets. Most recent studies deal with high-resource languages (e.g., English), for which many datasets are publicly available; owing to the lack of datasets in low-resource languages, those languages remain comparatively under-studied. This study therefore focuses on offensive language identification in code-mixed, low-resource Dravidian languages such as Tamil, Kannada, and Malayalam, using a bidirectional approach and fine-tuning strategies. On the benchmark leaderboard, the proposed model achieved F1-scores of 0.96 for Malayalam, 0.73 for Tamil, and 0.70 for Kannada. Moreover, among the multilingual models, this model ranked 3rd, achieving favorable results and confirming it as the best among all systems submitted to these shared tasks in the three languages.
2020
Dialog policy optimization for low resource setting using Self-play and Reward based Sampling
Tharindu Madusanka | Durashi Langappuli | Thisara Welmilla | Uthayasanker Thayasivam | Sanath Jayasena
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
A Privacy Preserving Data Publishing Middleware for Unstructured, Textual Social Media Data
Prasadi Abeywardana | Uthayasanker Thayasivam
Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management
Privacy is going to be an integral part of data science and analytics in the coming years. The next wave of data experimentation will depend heavily on privacy-preserving techniques, mainly because privacy is going to be a legal responsibility rather than a mere social responsibility. Privacy preservation becomes more challenging especially in the context of unstructured data. Social networks have become predominantly popular over the past couple of decades and are creating a huge data lake at high velocity. Social media profiles contain a wealth of personal and sensitive information, creating enormous opportunities for third parties to analyze them with different algorithms, draw conclusions, and use the results in disinformation campaigns and micro-targeting-based dark advertising. This study provides a mitigation mechanism for disinformation campaigns that are based on insights extracted from personal/sensitive data analysis. Specifically, this research aims to build a privacy-preserving data publishing middleware for unstructured social media data without compromising the true analytical value of that data. A novel way is proposed to apply traditional structured privacy-preserving techniques to unstructured data. Creating a comprehensive Twitter corpus annotated with privacy attributes is another objective of this research, especially because the research community lacks one.
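One way to carry a structured-data technique such as suppression over to unstructured text, in the spirit the abstract describes, is to detect privacy-sensitive spans and replace them with category placeholders. The sketch below uses simple regular expressions as a stand-in for an actual detection pipeline; the patterns and category names are illustrative assumptions, not the paper's method:

```python
import re

# Illustrative patterns; a real middleware would use NER or a trained detector
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\+?\d[\d\s-]{7,}\d",
    "HANDLE": r"@\w+",
}

def suppress(text):
    """Replace each detected sensitive span with its category placeholder."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

tweet = "Contact me at jane.doe@example.com or +94 71 234 5678, says @jane"
print(suppress(tweet))
```

Replacing spans with category labels (rather than deleting them) preserves some analytical value, e.g. the fact that a contact detail was shared, without exposing the detail itself.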
2019
Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages
Yohan Karunanayake | Uthayasanker Thayasivam | Surangika Ranathunga
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Current state-of-the-art speech-based user interfaces use data-intensive methodologies to recognize free-form speech commands. However, this is not viable for low-resource languages, which lack speech data. This restricts the usability of such interfaces to a limited number of languages. In this paper, we propose a methodology to develop a robust domain-specific speech command classification system for low-resource languages using speech data of a high-resource language. In this transfer learning-based approach, we used a Convolutional Neural Network (CNN) to identify a fixed set of intents from an ASR-based character probability map. We achieved significant results on Sinhala and Tamil datasets using an English-based ASR, which attests to the robustness of the proposed approach.
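The classifier described above, a CNN applied over an ASR character probability map to pick one of a fixed set of intents, can be sketched as a minimal NumPy forward pass. The filter width, number of filters, and number of intents below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_chars, n_frames, n_intents = 28, 50, 6   # hypothetical sizes

# ASR output: a per-frame probability distribution over characters
prob_map = rng.random((n_frames, n_chars))
prob_map /= prob_map.sum(axis=1, keepdims=True)

def conv1d_relu(x, filters):
    """Valid 1-D convolution over time; filters: (n_filters, width, n_chars)."""
    n_f, width, _ = filters.shape
    steps = x.shape[0] - width + 1
    out = np.empty((steps, n_f))
    for t in range(steps):
        window = x[t:t + width]                              # (width, n_chars)
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)                              # ReLU

filters = rng.normal(size=(16, 5, n_chars))                  # 16 filters of width 5
w_out = rng.normal(size=(16, n_intents))

features = conv1d_relu(prob_map, filters).max(axis=0)        # global max-pool over time
logits = features @ w_out
intent = int(np.argmax(logits))
print(intent)
```

Because the CNN consumes character probabilities rather than raw audio, the same classifier can sit on top of an ASR trained on a different, high-resource language, which is the transfer step the abstract relies on.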
2018
Graph Based Semi-Supervised Learning Approach for Tamil POS tagging
Mokanarangan Thayaparan | Surangika Ranathunga | Uthayasanker Thayasivam
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
DataSEARCH at IEST 2018: Multiple Word Embedding based Models for Implicit Emotion Classification of Tweets with Deep Learning
Yasas Senarath | Uthayasanker Thayasivam
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
This paper describes an approach to implicit emotion classification that uses pre-trained word embedding models to train multiple neural networks. The system is composed of a sequential combination of a Long Short-Term Memory network and a Convolutional Neural Network for feature extraction, followed by a Feedforward Neural Network for classification. We show that features extracted using multiple pre-trained embeddings can improve the overall performance of the system, with emoji being one of the significant features. The evaluations show that our approach outperforms the baseline system by more than 8% without using any external corpus or lexicon. This approach ranked 8th in the Implicit Emotion Shared Task (IEST) at WASSA-2018.
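The multi-embedding idea above, extracting features under several pre-trained embedding spaces and feeding their combination to a feed-forward classifier, can be sketched as follows. The toy embedding tables, dimensions, and single-layer classifier are illustrative assumptions; the actual system uses LSTM and CNN feature extractors rather than mean pooling:

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = {"happy": 0, "sad": 1, "rain": 2, ":)": 3}

# Two hypothetical pre-trained embedding tables with different dimensions
emb_a = rng.normal(size=(len(vocab), 4))
emb_b = rng.normal(size=(len(vocab), 6))

def featurize(tokens):
    """Mean-pool each embedding space separately, then concatenate the results."""
    ids = [vocab[t] for t in tokens if t in vocab]
    return np.concatenate([emb_a[ids].mean(axis=0), emb_b[ids].mean(axis=0)])

def classify(x, w, b):
    # Single feed-forward layer with a softmax over emotion classes
    z = x @ w + b
    e = np.exp(z - z.max())
    return e / e.sum()

n_classes = 3                               # toy emotion inventory
w = rng.normal(size=(10, n_classes))        # 10 = 4 + 6 concatenated features
b = np.zeros(n_classes)
probs = classify(featurize(["happy", "rain", ":)"]), w, b)
print(probs)
```

Concatenating features from independently trained embedding spaces lets the classifier exploit complementary signals, which is the mechanism the abstract credits for the performance gain.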