CaseEncoder: A Knowledge-enhanced Pre-trained Model for Legal Case Encoding
Yixiao Ma | Yueyue Wu | Weihang Su | Qingyao Ai | Yiqun Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Legal case retrieval is a critical process for modern legal information systems. While recent studies have utilized pre-trained language models (PLMs) based on the general-domain self-supervised pre-training paradigm to build models for legal case retrieval, there are limitations in using general-domain PLMs as backbones. Specifically, these models may not fully capture the underlying legal features in legal case documents. To address this issue, we propose CaseEncoder, a legal document encoder that leverages fine-grained legal knowledge in both the data sampling and pre-training phases. In the data sampling phase, we enhance the quality of the training data by utilizing fine-grained law article information to guide the selection of positive and negative examples. In the pre-training phase, we design legal-specific pre-training tasks that align with the judging criteria of relevant legal cases. Based on these tasks, we introduce an innovative loss function called Biased Circle Loss to enhance the model’s ability to recognize case relevance at a fine-grained level. Experimental results on multiple benchmarks demonstrate that CaseEncoder significantly outperforms both existing general pre-training models and legal-specific pre-training models in zero-shot legal case retrieval. The source code of CaseEncoder can be found at https://github.com/Anonymous-EMNLP2023/CaseEncoder.


Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover’s Distance Regularization
Meng Zhang | Yang Liu | Huanbo Luan | Yiqun Liu | Maosong Sun
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Being able to induce word translations from non-parallel data is often a prerequisite for cross-lingual processing in resource-scarce languages and domains. Previous endeavors typically simplify this task by imposing the one-to-one translation assumption, which is too strong to hold for natural languages. We remove this constraint by introducing the Earth Mover’s Distance into the training of bilingual word embeddings. In this way, we take advantage of its capability to handle multiple alternative word translations in a natural form of regularization. Our approach shows significant and consistent improvements across four language pairs. We also demonstrate that our approach is particularly preferable in resource-scarce settings as it only requires a minimal seed lexicon.


Identify Temporal Websites Based on User Behavior Analysis
Yong Wang | Yiqun Liu | Min Zhang | Shaoping Ma | Liyun Ru
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I