A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Large language models (LLMs) have demonstrated great potential for domain-specific applications, such as the law domain. However, recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks. To systematically investigate their competency in the law, we design practical baseline solutions based on LLMs and test them on the task of legal judgment prediction. In our solutions, LLMs can work alone to answer open questions or coordinate with an information retrieval (IR) system to learn from similar cases or solve simplified multi-choice questions. We show that similar cases and multi-choice options, namely label candidates, included in prompts can help LLMs recall domain knowledge that is critical for expert legal reasoning. We additionally present an intriguing paradox wherein an IR system surpasses the performance of LLM+IR due to the limited gains acquired by weaker LLMs from powerful IR systems. In such cases, the role of LLMs becomes redundant. Our evaluation pipeline can be easily extended to other tasks to facilitate evaluations in other domains. Code is available at https://github.com/srhthu/LM-CompEval-Legal


Introduction
Large language models have achieved great success in various Natural Language Processing (NLP) tasks (Brown et al., 2020; Touvron et al., 2023), yet there are still disputes over their potential for domain-specific applications (Martínez, 2023). Focusing on the law domain, the leading LLM, GPT-4 (OpenAI, 2023), was claimed to pass the Uniform Bar Exam (UBE) with a 90th-percentile score. Although inspiring, this result was later argued to be overestimated (Martínez, 2023).
Figure 1: The task of Legal Judgment Prediction and the evaluation settings. Different colors refer to different charges. For similar cases, "T" refers to true similar cases with the same charges as the query cases, while "F" refers to false similar cases. For task settings, "ZS" is the abbreviation for zero-shot and "FS" for few-shot.
This raises an interesting question: how exactly do LLMs perform in various real-world legal tasks?
In this paper, we design practical baseline solutions based on LLMs and systematically investigate their competency in the law, hoping to shed light on other domains as well. We attribute the main issues of previous benchmarks to the following. First, UBE is too general and not subject to any legal jurisdiction (Martínez, 2023). Second, UBE contains multi-choice questions and open-ended questions that require human experts to evaluate. To avoid human evaluation, some datasets (Hendrycks et al., 2020) replace open-ended questions with multi-choice questions. However, real-world applications involve not only multi-choice but also open questions, so using multi-choice questions alone may not be comprehensive enough. Third, especially in but not limited to common law (Shulayeva et al., 2017; Xiao et al., 2019), similar cases are routinely introduced as evidence to support expert legal reasoning (Zhong et al., 2020b), which has not been fully studied in previous benchmarks (Hendrycks et al., 2020).
For the first issue, we choose legal judgment prediction (LJP) (Xiao et al., 2018; Chalkidis et al., 2019; Zhong et al., 2020a) as the example task for investigation. It is a real-world problem: determining the charges committed by the defendants under a juridical system, as shown in Figure 1. LJP is typically formulated as a classification task that predicts the most probable charge from a list of predefined charges. Then, for the second and third issues, we design four settings derived from two work scenarios of LLMs to cover open and multi-choice questions and the usage of similar cases. In the first scenario, LLMs work alone without explicit knowledge in prompts, assuming all domain knowledge is implicitly stored in their parameters. In the second scenario, LLMs coordinate with an information retrieval (IR) system that enriches prompts with similar demonstrations and label candidates to benefit expert reasoning. Specifically, demonstrations consist of pairs of similar cases and their charges, retrieved by the IR system based on the similarity of case facts. Labels of the retrieved cases form the label candidates, shown as circles of different colors in Figure 1, which hint LLMs with label information and narrow down the label space (Ma et al., 2023).
The four evaluation settings in Figure 1 can be categorized based on the presence of two elements in prompts: demonstrations (similar cases) and label candidates. Demonstrations convert the setting from zero-shot to few-shot prompting, while label candidates simplify the task from open questions to multi-choice questions. The first scenario corresponds to the first setting, where neither element is present, while the second scenario encompasses the remaining three settings. We evaluate five up-to-date LLMs: the closed-source GPT-3 (Brown et al., 2020) family members ChatGPT and GPT-4 (OpenAI, 2023), and the open-source LLMs Vicuna (Chiang et al., 2023), ChatGLM (Du et al., 2022), and BLOOMZ (Muennighoff et al., 2022). The evaluation is conducted on a Chinese LJP dataset, namely CAIL (Xiao et al., 2018), which contains cases of 112 criminal law charges.
We highlight our key findings as follows:
1. Similar cases and label candidates can help LLMs recall domain knowledge that is critical for expert legal reasoning.
2. Label candidates result in more consistent outputs, indicating LLMs gain greater confidence in their domain knowledge (Jiang et al., 2021).
3. Irrelevant demonstrations formed by fixed cases hardly improve performance, which rules out a task-illustration effect.
4. Paradox: an IR system can outperform LLM+IR, since weaker LLMs acquire limited gains from informative documents retrieved by a powerful IR system. Thus, it is critical to adapt LLMs to generate with retrieved documents.
5. More similar cases introduce more knowledge and more noise simultaneously; the net outcome depends on the LLM.
The main contributions are summarized in three aspects:
• We investigate the law competency of LLMs on the task of legal judgment prediction.
• We propose practical baseline solutions for LLMs that tackle two scenarios: working alone or in coordination with an IR system.
• We evaluate five LLMs and conduct comprehensive analyses to demystify their characteristics of expert reasoning.

Baseline Method
The goal of legal judgment prediction is to determine the committed charges given case facts. To harness LLMs for LJP, we adopt in-context learning (Brown et al., 2020) and use LLMs to generate the charges conditioned on prompts (Section 2.1).
To enhance LLMs, we incorporate label candidates and demonstrations consisting of similar cases into prompts, which are acquired by an IR system (Section 2.2). This yields four settings of baseline solutions, namely zero-shot open questions, few-shot open questions, zero-shot multi-choice questions, and few-shot multi-choice questions. The multi-choice settings employ label candidates, while the few-shot settings include demonstrations, as shown in Figure 1. Finally, we introduce how to simulate IR systems with different capabilities to understand their effects (Section 2.3).

LLM Prompting
Prompt Design. A prompt begins with an instruction that illustrates the task, followed by label candidates and task demonstrations in the form of input-output pairs. The templates of prompts are displayed in Appendix A.1.
Parsing. We adopt one automatic parsing function for all LLMs to map LLM outputs to predefined charge labels. No ad hoc heuristics are employed, for a fair comparison. Specifically, we use the BM25 algorithm to measure text similarity between outputs and pre-defined charges and predict the most similar charge. BM25 is robust and yields performance comparable to neural similarity methods such as text2vec in our pilot experiments.
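This parsing step can be sketched as follows, with a pure-Python BM25 and whitespace tokenization over hypothetical English charge names standing in for the paper's Chinese labels and segmentation:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with standard BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def parse_output(llm_output, charges):
    """Map a raw LLM output to the most similar pre-defined charge."""
    docs = [c.split() for c in charges]
    scores = bm25_scores(llm_output.split(), docs)
    return charges[max(range(len(charges)), key=scores.__getitem__)]

# Hypothetical English labels; the paper's labels are Chinese charge names.
charges = ["theft", "credit card fraud", "intentional injury"]
prediction = parse_output("The defendant committed theft of the goods", charges)
```

The same scoring function can serve both parsing and case retrieval, since both rely on BM25 similarity.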
Inference. Sampling is enabled during generation to obtain consistent results, as inspired by Wang et al. (2022). Five outputs are sampled for each prompt with a temperature of 0.8, and their similarity scores over the pre-defined labels are averaged.
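The score-averaging step might look like the following, with made-up per-sample similarity scores in place of real BM25 outputs over five samples:

```python
# Hypothetical per-sample similarity scores over the label set; in the paper
# these come from BM25 over outputs sampled at temperature 0.8.
sample_scores = [
    {"theft": 0.9, "fraud": 0.1},
    {"theft": 0.7, "fraud": 0.3},
    {"theft": 0.2, "fraud": 0.8},
]
labels = sample_scores[0].keys()
# Average each label's score across the sampled outputs, then take the argmax.
avg_scores = {l: sum(s[l] for s in sample_scores) / len(sample_scores) for l in labels}
prediction = max(avg_scores, key=avg_scores.get)
```

Averaging over samples smooths out the variance that temperature-based sampling introduces into any single generation.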

IR System for Knowledge Incorporation
IR systems are utilized to retrieve similar cases, which are commonly referenced by lawyers and judges to inform their judgments. In addition to providing demonstrations, these similar cases can also supply potential labels, drawn from the labels of the top similar cases. With these smaller sets of predefined charges, namely label candidates, complex open questions can be simplified into multiple-choice questions. This approach is effective in enhancing LM prompting (Ma et al., 2023), as including hundreds of charges directly in prompts is impractical.
Implementation of IR System. We use the BM25 algorithm to measure the textual similarity between cases. Similar cases are retrieved from the training dataset. To guarantee that the demonstrations exemplify one of the multi-choice options, we exclude demonstrations with labels that are not among the candidate options.
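A minimal sketch of forming label candidates and filtering demonstrations, assuming hypothetical retrieved `(fact, charge)` pairs already ranked by BM25 score:

```python
# Hypothetical retrieved cases as (fact, charge) pairs, ranked by BM25 score.
retrieved = [
    ("fact of case A", "theft"),
    ("fact of case B", "fraud"),
    ("fact of case C", "arson"),
    ("fact of case D", "theft"),
]

top_k = 2  # charges of the top-k cases form the label candidates
candidates = []
for _, charge in retrieved[:top_k]:
    if charge not in candidates:  # keep order, drop duplicates
        candidates.append(charge)

# Exclude demonstrations whose label is not among the candidate options.
demonstrations = [(f, c) for f, c in retrieved if c in candidates]
```

The filtering guarantees every few-shot demonstration exemplifies one of the multi-choice options shown to the LLM.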

Simulation of IR Systems
To investigate the effects of IR capabilities, we simulate a series of IR systems of different capabilities as measured by Precision@1. The top retrieved cases are then used as demonstrations. We consider cases with charges identical to those of the query case as true similar cases, and all others as false similar cases.
Realistic Simulation. We prioritize returning true similar cases for easy query cases rather than assigning them at random. Query difficulty is measured by the Precision@10 of the BM25 retriever described in Section 2.2, the motivation being that queries with shallow linguistic features are more likely to get relevant retrieval results than complex or obscure queries. For a specific value (e.g., a%) of Precision@1 to be simulated, the top a% of easy test cases are guaranteed a true similar case, while the rest are assigned false similar cases.
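The simulation procedure can be sketched as follows; `query_ease` is an assumed list of per-query Precision@10 values of the base retriever (higher meaning an easier query):

```python
def simulate_ir(query_ease, precision_at_1):
    """Return, per query, whether the simulated IR system yields a true
    similar case. The easiest fraction of queries (by the base retriever's
    Precision@10) is guaranteed a true similar case; the rest get false ones."""
    n = len(query_ease)
    n_true = round(precision_at_1 * n)
    easiest_first = sorted(range(n), key=lambda i: -query_ease[i])
    has_true = [False] * n
    for i in easiest_first[:n_true]:
        has_true[i] = True
    return has_true

# Simulate an IR system with Precision@1 = 0.5 over four queries.
flags = simulate_ir([0.9, 0.1, 0.5, 0.7], 0.5)
```

Sweeping `precision_at_1` from 0 to 1 produces the family of simulated IR systems used in Section 4.4.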
Experimental Setup

Models
Below is a concise introduction to the five LLMs to be evaluated. GPT-4 (OpenAI, 2023) and ChatGPT are available from the OpenAI API; the versions gpt-4-0314 and gpt-3.5-turbo-0301 are used. Regarding technological details, ChatGPT is claimed to be a sibling model of InstructGPT (Ouyang et al., 2022), which is trained to follow instructions and align with human preferences via the RLHF algorithm (Christiano et al., 2017).
ChatGLM-6B is a dialog language model based on the GLM (Du et al., 2022) architecture that supports English and Chinese.
BLOOMZ (Muennighoff et al., 2022) is an instruction fine-tuned version of BLOOM (Scao et al., 2022), a multilingual language model. We use the bloomz-7b1-mt version, which is tuned for multilingual prompts. Unlike BLOOMZ, Vicuna and ChatGLM are mainly fine-tuned on conversational data.

Dataset and Pre-processing
The Chinese LJP dataset CAIL (Xiao et al., 2018) is used in our experiments. Each sample consists of the case facts and the committed charge as the label. As the original dataset is very large (~100K for training and ~20K for test), we randomly sample a balanced small test set from the original test set: five cases per charge, for a total of 560 test cases across 112 charges. Similarly, we sample the training and validation sets with 10 cases per charge. The training set is used to retrieve similar cases (Section 2.2), while the validation set is used to determine the optimal k of the kNN algorithm.
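The balanced sampling step might be implemented roughly like this (the `(fact, charge)` pair format and the fixed seed are assumptions for illustration):

```python
import random
from collections import defaultdict

def balanced_sample(cases, per_charge, seed=0):
    """cases: (fact, charge) pairs; draw up to per_charge cases per charge."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    by_charge = defaultdict(list)
    for case in cases:
        by_charge[case[1]].append(case)
    sample = []
    for charge in sorted(by_charge):  # deterministic charge order
        pool = list(by_charge[charge])
        rng.shuffle(pool)
        sample.extend(pool[:per_charge])
    return sample

# Toy corpus with two charges, ten cases each; the paper uses 112 charges.
corpus = [(f"fact {i} {charge}", charge)
          for charge in ("theft", "fraud") for i in range(10)]
test_set = balanced_sample(corpus, per_charge=5)
```

The same routine, with `per_charge=10`, would produce the balanced training and validation sets.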
Truncation. Since some cases have very long descriptions, we truncate the case facts of demonstrations to 500 tokens and those of test samples to 1000 tokens. Note that the text is tokenized by each model's own tokenizer before truncation, for a fair comparison. Recently, Petrov et al. (2023) showed that a tokenizer can lead to different performances across languages. This suggests that the performance on a particular language can also be influenced by the varying language encoding efficiency of different models' tokenizers.
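A minimal sketch of the truncation step; whitespace splitting stands in for each model's actual tokenizer, and the limits are those stated above:

```python
def truncate(text, max_tokens, tokenize=str.split, detokenize=" ".join):
    """Truncate text to max_tokens using the model's own tokenizer.
    Whitespace splitting is a stand-in; a real setup would pass the
    model-specific tokenize/detokenize functions."""
    return detokenize(tokenize(text)[:max_tokens])

DEMO_LIMIT, TEST_LIMIT = 500, 1000  # per-role limits from the paper
short = truncate("one two three four five", max_tokens=3)
```

Because each model's tokenizer segments Chinese differently, the same character budget yields different token counts per model, which is why truncation happens after tokenization.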
Table 1 shows statistics of the number of tokens processed by different tokenizers. The most efficient tokenizers for Chinese are those of ChatGLM and BLOOMZ, as indicated by the medians of token counts. In contrast, the tokenizer of ChatGPT produces about 2× as many tokens, and that of Vicuna about 2.5×. The truncation lengths are sufficient to accommodate most samples.

LLM vs. LLM with IR System
We first present the overall results, highlighting the importance of label candidates and similar cases, and conduct a comparative analysis of the models. Subsequently, we investigate the relationship between label candidates and self-consistency to unveil their actual effects on expert reasoning. Additionally, we perform an ablation study that replaces similar cases with fixed cases as demonstrations to further understand their impact.

Overall Results
The macro comparison between the four settings is shown in Figure 2, where each point represents the performance of one specific run of one model.

Significance of label candidates and similar cases. In comparison to the zero-shot open question setting where LLMs work alone, the inclusion of label candidates, similar cases, or both yields noteworthy enhancements. This highlights the effectiveness of our baseline solutions that leverage IR systems to expand the capabilities of LLMs in legal domains. These findings align with previous research that has also recognized the significance of the two components (Ma et al., 2023; Liu et al., 2021).
The effects of label candidates and similar cases differ slightly in terms of performance mean and variance. Label candidates contribute to a higher mean performance, while similar cases introduce greater variance. Examining the model performances in the third setting (+Sim Case) in Figure 2, GPT-4 and ChatGPT exhibit more significant improvements from similar cases than their smaller counterparts. They also gain more benefit from similar cases than from label candidates. This observation can be attributed to the varying difficulty of knowledge utilization: while the knowledge within label candidates is readily accessible and straightforward, leveraging similar cases requires stronger language understanding and few-shot learning abilities.
Furthermore, the coexistence of label candidates and similar cases further enhances the performance of GPT-4 and ChatGPT but diminishes that of Vicuna, ChatGLM, and BLOOMZ. This suggests that smaller LLMs may struggle to manage knowledge in multiple forms simultaneously, leading to confusion.
Model comparison. The performances of the models under zero-shot and few-shot prompting are shown in Figure 3, where few-shot performances are averaged over 1-shot to 4-shot.
The zero-shot setting emphasizes the ability to understand instructions. When only instructions are available, BLOOMZ performs better than ChatGPT, indicating superior multilingual instruction-following ability. This result is reasonable, as BLOOMZ is the only smaller LLM fine-tuned on multilingual instructions. Once provided with explicit domain knowledge, however, ChatGPT outperforms all smaller LLMs. The same holds between BLOOMZ and ChatGLM, where ChatGLM overtakes BLOOMZ given the knowledge of label candidates. BLOOMZ performs worst when prompted with both forms of knowledge, indicating that it is not very robust to prompts. Among the three smaller LLMs, ChatGLM is the most robust to various forms of knowledge.
The significant effects of label candidates and similar cases can be explained as activating the LLM's memory of relevant domain knowledge. This view is supported by two pieces of evidence: the relationship between label candidates and self-consistency (Section 4.2) and the negligible effect of irrelevant cases as fixed demonstrations (Section 4.3).

Label Candidates Enhance Self-consistency and Confidence
To further understand the effect of label candidates, we propose a metric to measure the self-consistency of LLMs, calculated as the count of the majority prediction among sampled outputs. Consistent outputs indicate a high level of confidence, which is often associated with a better grasp of knowledge (Jiang et al., 2021, 2023).
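The metric can be sketched in a few lines; the five sampled predictions below are hypothetical:

```python
from collections import Counter

def self_consistency(predictions):
    """Count of the majority prediction among sampled outputs.
    With 5 samples, the value ranges from 1 (all different) to 5 (unanimous)."""
    return Counter(predictions).most_common(1)[0][1]

# Five hypothetical sampled predictions for one prompt.
consistency = self_consistency(["theft", "theft", "fraud", "theft", "robbery"])
```

A value of 5 means the model gave the same charge in all five samples, i.e., maximal confidence under sampling.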
The changes in performance and self-consistency after introducing label candidates are shown as arrows in Figure 4. We observe that the incorporation of label candidates leads to more consistent outputs (8 of 10 cases) and higher confidence in LLMs, except for zero-shot GPT-4 with a slight decrease and few-shot BLOOMZ. In the zero-shot setting, label candidates significantly boost LLM performance. We postulate that label candidates help by eliciting pre-stored domain knowledge with concise charge names. Besides, self-consistency also correlates with model performance (7 of 10 cases). Such correlation has also been observed in other tasks like question answering (Jiang et al., 2021). It is worth noting that label candidates decrease both the self-consistency and the performance of few-shot prompted BLOOMZ, which also aligns with the correlation.

Domain Knowledge Is More Critical Than Task Illustration
One possible argument is that similar demonstrations help LLMs understand instructions and tasks. To disentangle their effects on task illustration and the provision of domain knowledge, we experiment with irrelevant demonstrations fixed for all test samples. We manually select two common cases with frequent charges from the original dataset as the fixed demonstrations. The 1-shot performance is averaged over the two demonstrations.
We compare the effects of fixed and similar demonstrations with the baseline setting of zero-shot open questions in Figure 5. The change in performance from center to left demonstrates that fixed demonstrations hardly benefit LLMs and sometimes harm performance (e.g., ChatGLM). This indicates that LLMs can basically understand instructions and do not need general demonstrations for task clarification, implying that the main challenge of expert reasoning is to recall domain knowledge rather than to understand a specific task.
We inspect the notable performance drop of ChatGLM caused by fixed demonstrations. We find that ChatGLM tends to analyze the cases of both the demonstrations and the test samples and then answer with both of their charges. Its wordy style seems to result from the dialog fine-tuning corpus, where an assistant LLM is supposed to provide rich information. In contrast, similar cases seem to encourage more concise outputs that follow the format of the demonstrations.

Paradox of Information Retrieval System
The significance of similar demonstrations illustrated in Section 4.3 has motivated research on prompting-oriented IR systems (Rubin et al., 2021; Sun et al., 2023) that retrieve high-quality demonstrations. However, we raise an intuitive question: do LLMs gain substantial improvement from IR systems compared to a kNN baseline that harnesses IR systems for classification tasks?
The question is inspired by our observation that the BM25 retriever achieves 48.03% Precision@1 and 57.68% prediction accuracy by majority vote over the top k = 17 retrieved similar cases. This suggests a paradoxical scenario wherein an IR system outperforms the combination of LLM and IR, with the LLM taking the leading role and the IR serving a supporting role. In such a scenario, the LLM becomes redundant due to its failure to fully utilize the informative retrieved documents.
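The kNN baseline amounts to a majority vote over the charges of the top-k retrieved cases, sketched here with hypothetical labels:

```python
from collections import Counter

def knn_predict(retrieved_labels, k):
    """Majority vote over the charges of the top-k retrieved similar cases.
    No LLM is involved; the IR system alone produces the prediction."""
    return Counter(retrieved_labels[:k]).most_common(1)[0][0]

# Hypothetical charges of retrieved cases, ranked by BM25 similarity;
# the paper uses k = 17 tuned on the validation set.
prediction = knn_predict(["theft", "fraud", "theft", "arson"], k=3)
```

Comparing this vote against the LLM+IR pipeline is what exposes the paradox: if the vote wins, the LLM adds nothing over the retriever.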
To investigate the paradox, instead of experimenting with different IR systems, we manipulate the BM25 retriever to simulate a series of IR systems with different capabilities measured by Precision@1, as described in Section 2.3. We take a case study of ChatGPT, whose 1-shot performance under different IR systems (measured by Precision@1) is shown in Figure 6.
Results. Although the performance of ChatGPT enhanced by IR systems improves with IR capability, it eventually underperforms the IR system once the IR capability surpasses a certain threshold. Even in the ideal situation where true similar cases are always retrieved, ChatGPT is unable to attain 100% accuracy and lags significantly behind the optimal IR system. According to Appendix A.4, none of the smaller LLMs is comparable to the BM25 retriever.
Discussion. The findings demonstrate that LLMs face challenges in effectively leveraging informative retrieved documents. This underscores the need for research efforts to enhance the synergy between auto-regressive language models and retrieval by conditioning model outputs more on retrieved documents. Previous work has explored augmenting LLMs with retrieval at both the pre-training and fine-tuning stages (Borgeaud et al., 2022; Wang et al., 2023). Moreover, the marginal improvement with retrieval indicates the limited legal reasoning ability of existing general LLMs. Future efforts are needed to enhance the domain-specific reasoning abilities of pre-trained foundation models.

More Demonstrations Are Not Always Better
The impact of the number of similar demonstrations (n) is depicted in Figure 7. GPT-4 and ChatGPT handle larger numbers of demonstrations well, with enhanced performance, whereas Vicuna, ChatGLM, and BLOOMZ experience varying degrees of performance degradation as n increases. Notably, ChatGLM displays the least sensitivity to n. Furthermore, even ChatGPT's performance declines when n increases from three to four. The performance improvement from larger n can be attributed to the increased recall of true similar cases; conversely, the decline can be attributed to the noise introduced by more false similar cases.
Performance variations. The changes in performance after including an additional demonstration are visualized as heat maps in Figure 8. For each model, the three heat maps correspond to the variations from k-shot to (k+1)-shot, denoted below them. In each heat map, the two rows indicate the inclusion of a new demonstration with true (T) or false (F) similar cases, while the columns indicate the combinations of existing demonstrations. Take the second heat map as an example: the cell in the column (F, T) and the row (T) displays the performance variation between 2-shot with (F, T) demonstrations and 3-shot with (F, T, T) demonstrations. Purple represents performance improvement, while green represents performance decline.
For ChatGPT and BLOOMZ, the second rows of their three heat maps are mainly purple, indicating significant enhancements from the inclusion of true similar cases. However, the first rows for BLOOMZ display a deeper green than those for ChatGPT, suggesting that BLOOMZ suffers greater performance declines from the inclusion of false similar cases. These findings indicate different sensitivities to false similar demonstrations. Powerful language models like GPT-4 and ChatGPT are robust to the noise in false similar cases, allowing them to remain focused on the relevant information in true similar cases. In contrast, weaker LLMs are susceptible to such noise. Overall, ChatGPT performs better when provided with more similar demonstrations, whereas BLOOMZ demonstrates the opposite, as shown in Figure 7.
In conclusion, increased numbers of demonstrations have both positive and negative implications for expert reasoning. However, LLMs could still benefit from additional demonstrations in tasks that require clear task illustration.

The Impact of Absent Ground Truth Labels
We manually add ground-truth labels to the label candidates in cases where they are absent, which may occur due to the limited recall of the IR system described in Section 2.2. The test samples are categorized into two groups, "Easy" and "Hard", based on whether the IR system retrieves their ground truth labels. The original performance of the two groups, together with the performance of the "Hard" group under prompts modified to include ground truth labels ("Hard+GT"), is displayed in Figure 9. The performance gaps between the "Easy" and "Hard+GT" groups suggest that samples that are challenging for IR systems are also difficult for LLMs. However, this gap is insignificant for the powerful GPT-4, which perceives them as equally challenging. The improvement of "Hard+GT" over "Hard" is notable for GPT-4, ChatGPT, and ChatGLM, but inconspicuous for Vicuna, which has inferior competency in the law. Considering the relatively small size of the "Hard" group (79/560), the absence of ground truth labels does not have a significant impact, especially for weaker LLMs.

Incorporation of Law Articles
We examine the effect of incorporating legal articles that explicitly define the charges into prompts.
For each charge retrieved by the IR system (the ground truth charge is also included), ChatGPT is required to determine whether the defendant is guilty of that particular charge by answering yes or no. We find that 94.46% of the ground truth charges are accurately detected, while only 27.31% of the detected charges are correct. The high recall and low precision indicate a substantial difference between ChatGPT and legal experts in the ability to distinguish charges and make precise judgments.

Discussion
We compare the LLMs with supervised baselines. We fine-tune BERT (Devlin et al., 2018) on the same training set; it achieves an accuracy of 68%, comparable to ChatGPT but lower than GPT-4. Since the LLMs are not fine-tuned on the specific LJP task, this result highlights their remarkable superiority in acquiring knowledge and leveraging transfer learning (Raffel et al., 2020). However, we observe that BERT's performance improves to 89% when trained with the original training set (~10K). We find that certain knowledge is present in shallow features, which can be easily learned with supervision. These superficial features can result in biased supervised models. Fortunately, unsupervised pre-training objectives make LLMs more robust and less vulnerable to this issue. This depicts a promising future for NLP applications in various domains.

Conclusion
To address the deficiency in evaluating the competency of LLMs in the field of law, we focused on the task of legal judgment prediction and devised four settings to facilitate a thorough evaluation that encompassed both open and multiple-choice questions and incorporated similar cases to aid in the decision-making process.
The evaluation results revealed different behaviors between the prominent LLMs, namely GPT-4 and ChatGPT, and their smaller counterparts. Both GPT-4 and ChatGPT exhibited remarkable proficiency in effectively leveraging domain knowledge in various formats. Among the smaller LLMs, ChatGLM displayed greater robustness, while BLOOMZ showcased superior zero-shot ability.
We presented an intriguing paradox wherein LLMs could become redundant in the presence of a powerful IR system. When improving IR systems to benefit LLMs, it is crucial for researchers to acknowledge this paradoxical scenario and prevent a great disparity between LLMs and IR systems.

Limitations
One limitation of this paper is the use of the closed-source GPT-4 and ChatGPT, whose availability depends on the commercial company OpenAI. According to OpenAI, the ChatGPT and GPT-4 versions used in this paper, namely gpt-3.5-turbo-0301 and gpt-4-0314, will be deprecated and unavailable after September 13th, 2023.
Another limitation pertains to the selection of LLMs. Due to the rapid emergence of new LLMs, we are unable to include all of them given limited time. Instead of covering more models, we focus on designing comprehensive evaluation settings and conducting insightful analyses to shed light on other domains.

Ethics Statement
The task of legal judgment prediction is used to evaluate LLMs' competency in the law. The primary objective of this task is to assist judges and lawyers in comprehending lengthy legal documents by offering them a supplementary tool. It is important to note that this task does not seek to replace the roles of judges and lawyers, nor does it aim to determine the guilt or charges of defendants through machine learning algorithms. Additionally, there is research focused on interpreting LJP models, aiming to enhance the transparency of black-box models for improved utilization by legal practitioners. This paper utilizes a public and anonymized dataset to exclude the potential issue of personal information leakage.

Figure 2: The macro comparison between the four settings. "+Label" refers to zero-shot multi-choice questions; "+Sim Case" refers to few-shot open questions; and "+Label +Sim Case" refers to few-shot multi-choice questions. More than one point for a model in the last two settings corresponds to runs with different numbers of demonstrations.

Figure 3: Model performances under zero-shot and few-shot prompting, with few-shot performances averaged over 1-shot to 4-shot.

Figure 4: Changes in performance and self-consistency after adding label candidates. The change for each model is illustrated by an arrow pointing from the open question setting to the multi-choice setting.

Figure 5: The effects of fixed (irrelevant) and similar cases as demonstrations. Divided by the baseline setting of zero-shot open questions, the left part refers to fixed demonstrations with increasing numbers of demonstrations, while the right part refers to similar demonstrations. The shaded area represents the range of standard deviation.

Figure 6: The performance of ChatGPT coordinated with a series of simulated IR systems of varying capabilities, measured by Precision@1. The vertical blue line marks the threshold of IR capability at which IR systems overtake ChatGPT. The performance of ChatGPT in the real setting (1-shot open questions) is indicated by the red plus sign.

Figure 7: Performance vs. the number of similar demonstrations for the five LLMs.

Figure 8: Heat maps of performance variations resulting from the inclusion of an additional demonstration. "T" corresponds to demonstrations with true similar cases, while "F" represents those with false similar cases. Each row represents the newly included demonstration, while each column indicates the status of the existing demonstrations.

Figure 9: Performance of the "Easy" and "Hard" groups, and of the "Hard" group with ground truth labels added to the label candidates ("Hard+GT").

Table 1: Statistics of the number of tokens across tokenizers. The last two columns present the ratios of test samples with token counts below the specified values.