Mike Joy
2024
WarwickNLP at SemEval-2024 Task 1: Low-Rank Cross-Encoders for Efficient Semantic Textual Relatedness
Fahad Ebrahim | Mike Joy
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This work participates in SemEval 2024 Task 1 on Semantic Textual Relatedness (STR), Track A (supervised regression), in two languages: English and Moroccan Arabic. The task is to score how closely two sentences relate to each other. The system uses a cross-encoder with a merged fine-tuned Low-Rank Adapter (LoRA). It ranked eighth in English with a Spearman coefficient of 0.842 and seventh in Moroccan Arabic with a score of 0.816. Additional experiments examine the impact of different models and adapters on the performance and accuracy of the system.
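The notion of a "merged" adapter can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which the low-rank update is folded into the base weight as W' = W + (alpha / r) * B * A; the toy matrices, the function names, and the values of alpha and r below are hypothetical and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Fold a low-rank update into the base weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def apply_weight(weight, x):
    """Apply a weight matrix to a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical toy values: a 2x3 base weight and a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[0.5], [-1.0]]        # shape 2 x r
A = [[1.0, 2.0, 3.0]]      # shape r x 3
ALPHA, RANK = 2.0, 1

merged = merge_lora(W, B, A, ALPHA, RANK)

# The merged weight gives the same output as running base and adapter separately.
x = [1.0, 1.0, 1.0]
base_out = apply_weight(W, x)
adapter_out = apply_weight(B, apply_weight(A, x))
separate = [b + (ALPHA / RANK) * a for b, a in zip(base_out, adapter_out)]
merged_out = apply_weight(merged, x)
```

After merging, inference needs only the single matrix `merged`, so the adapter adds no extra computation at prediction time.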
2023
Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning
Fahad Ebrahim | Mike Joy
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Source code plagiarism is a critical ethical issue in computer science education, where students present someone else’s work as their own. It can be treated as a binary classification problem where the output is either yes (plagiarism found) or no (plagiarism not found). In this research, we use the open-source dataset ‘SOCO’, which covers two programming languages (PLs), namely Java and C/C++ (although our method could be applied to any PL). Source code is converted to vector representations that capture both the syntax and semantics of the text, known as contextual embeddings. These embeddings are generated using source code pre-trained models (CodePTMs). The cosine similarity scores of three different CodePTMs were selected as features. Classifier selection and parameter tuning were conducted with the assistance of Automated Machine Learning (AutoML). The selected classifiers were tested first on Java, where the proposed approach produced average to high results compared to other published research and surpassed the baseline (the JPlag plagiarism detection tool). For C/C++, the approach outperformed other research work and produced the highest ranking score.
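The feature construction described above can be sketched as follows. This is a minimal illustration, assuming each of three models has already produced one embedding vector per source file; the toy vectors and the function names are hypothetical, and the actual CodePTMs and the AutoML step are not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(embeddings_a, embeddings_b):
    """One cosine score per model, forming the feature vector for a pair of files."""
    return [cosine_similarity(a, b) for a, b in zip(embeddings_a, embeddings_b)]


# Hypothetical 2-dimensional embeddings of two files from three different models.
file_a = [[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
file_b = [[1.0, 0.0], [1.0, -1.0], [4.0, 3.0]]

features = pair_features(file_a, file_b)
```

The resulting three-element vector would then be the input to a binary classifier that labels the pair as plagiarized or not.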
2021
English-Arabic Cross-language Plagiarism Detection
Naif Alotaibi | Mike Joy
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
The advancement of the web and information technology has contributed to the rapid growth of digital libraries and automatic machine translation tools, which easily translate texts from one language into another. These have increased the content accessible in different languages, which makes translated plagiarism, referred to as “cross-language plagiarism”, easy to perform. Recognizing plagiarism among texts in different languages is more challenging than identifying plagiarism within a corpus written in the same language. This paper proposes a new technique for enhancing English-Arabic cross-language plagiarism detection at the sentence level. The technique is based on semantic and syntactic feature extraction using word order, word embedding and word alignment with multilingual encoders. Those features, and their combinations, are then used with different machine learning (ML) algorithms to classify sentences as either plagiarized or non-plagiarized. The proposed approach has been deployed and assessed using datasets presented at SemEval-2017. Analysis of the experimental data demonstrates that using the extracted features and their combinations with various ML classifiers achieves promising results.