Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution

Shifali Agrahari; Abhishek Anand; Shubham Kannaujiya; Sanasam Ranbir Singh; Sujit Kumar

Abstract

The rise of code-generating LLMs such as DeepSeek, Qwen, and Meta-LLaMA has improved developer productivity but has also increased the risks of plagiarism, copyright misuse, and insecure machine-generated code. While AI-text detection is well studied, machine-generated source-code detection remains underexplored, especially across multiple programming languages, LLM families, and out-of-distribution (OOD) conditions. SemEval-2026 Task 13 addresses this gap via two subtasks: (A) binary human–machine code detection and (B) multi-class authorship attribution across ten LLM families. For Subtask A, we fine-tune RoBERTa, CodeBERT, GraphCodeBERT, and StarCoderBase-1B, introducing a stratified sampling strategy with class-weighted loss to mitigate class imbalance and OOD shifts. For Subtask B, we mitigate the extreme human-class imbalance using undersampling, inverse-frequency weights, syntactic noising, and curriculum-based dual-path training with TinyStarCoderPy and CodeBERT. Results on both subtasks show that long-context modeling, distribution-aligned sampling, and lightweight noise-robust training are key factors for reliable machine-generated code detection and authorship attribution in real-world settings.
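The abstract names undersampling and inverse-frequency weights without giving formulas, so the following is only a minimal sketch of how such imbalance handling is commonly done, not the authors' implementation. It assumes the widely used "balanced" heuristic, weight = N / (K × count), and the function names and label strings (`"human"`, `"llm_a"`) are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (K * count): rarer classes get larger weights.

    Assumption: this is the common 'balanced' heuristic, not necessarily
    the exact weighting used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def undersample_class(labels, target, cap, seed=0):
    """Return indices that keep at most `cap` examples of class `target`.

    All examples of the other classes are kept; the original order is preserved.
    """
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    keep = set(rng.sample(target_idx, min(cap, len(target_idx))))
    return [i for i, y in enumerate(labels) if y != target or i in keep]


# Hypothetical toy labels: a dominant human class and one rare LLM class.
labels = ["human"] * 8 + ["llm_a"] * 2
weights = inverse_frequency_weights(labels)
kept = undersample_class(labels, target="human", cap=3)
```

In a typical training setup the resulting weights would be passed to a weighted cross-entropy loss, with undersampling applied first when the majority class would otherwise dominate each batch.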

Anthology ID: 2026.semeval-1.360
Volume: Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Ekaterina Kochmar, Debanjan Ghosh, Kai North, Mamoru Komachi
Venues: SemEval | WS
Publisher: Association for Computational Linguistics
Pages: 2866–2876
URL: https://aclanthology.org/2026.semeval-1.360/
Cite (ACL): Shifali Agrahari, Abhishek Anand, Shubham Kannaujiya, Sanasam Ranbir Singh, and Sujit Kumar. 2026. Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution. In Proceedings of the 20th International Workshop on Semantic Evaluation (2026), pages 2866–2876, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution (Agrahari et al., SemEval 2026)
PDF: https://aclanthology.org/2026.semeval-1.360.pdf
