Parameter Selection: Why We Should Pay More Attention to It

The importance of parameter selection in supervised learning is well known. However, because of the many parameter combinations, an incomplete or insufficient selection procedure is often applied, and the resulting conclusions can be misleading or confusing. In this opinion paper, through an intriguing example, we point out that the problem is more serious than generally recognized. In multi-label classification for medical code prediction, one influential paper conducted a proper parameter selection on one data set, but when moving to a subset of frequently occurring labels, the authors reused the same parameters without a separate tuning. This set of frequent labels became a popular benchmark in subsequent studies, which kept pushing the state of the art. However, we discovered that most results in these studies would not surpass the approach in the original paper had its parameters been tuned at the time. It is thus unclear how much progress the subsequent developments have actually brought. The lesson clearly indicates that without sufficient attention to parameter selection, the research progress in our field can be uncertain or even illusory.


Introduction
The importance of parameter selection in supervised learning is well known. While parameter tuning has been a common practice in machine learning and natural language processing applications, the process remains challenging due to the huge number of parameter combinations. The recent trend of applying complicated neural networks makes the situation even more acute. In many situations, an incomplete or insufficient procedure for parameter selection is applied, so misleading or confusing conclusions sometimes occur. In this opinion paper, we present a very intriguing example showing that, without sufficient attention to parameter selection, the research progress in our field can be uncertain or even illusory.
In the topic of multi-label classification for medical code prediction, Mullenbach et al. (2018) is an early work applying deep learning. The evaluation was conducted on MIMIC-III and MIMIC-II (Johnson et al., 2016), which are perhaps the most widely used sets of open medical records. For MIMIC-III, besides using all 8,922 labels, they followed Shi et al. (2017) in also checking the 50 most frequently occurring labels. We refer to these two sets as MIMIC-III-full and MIMIC-III-50, respectively.
For the data set MIMIC-III-full, Mullenbach et al. (2018) tuned parameters to find the model that achieves the best validation performance. However, when moving to the set MIMIC-III-50, they applied the same parameters without a separate tuning. We will show that this decision had a profound effect. Many works directly copied values from Mullenbach et al. (2018) for comparison and presented superior results. However, as demonstrated in this paper, if parameters for MIMIC-III-50 had been separately tuned, the approach in Mullenbach et al. (2018) would easily surpass most subsequent developments. The results fully indicate that parameter selection is more important than generally recognized.

This paper is organized as follows. In Section 2, we analyze past results. The main investigation is in Section 3, while Section 4 provides some discussion. Some implementation details are in the appendix. Code and supplementary materials can be found at http://www.csie.ntu.edu.tw/~cjlin/papers/parameter_selection.


Analysis of Past Results
The task considered in Mullenbach et al. (2018) is to predict the associated ICD (International Classification of Diseases) codes of each medical document, where an ICD code is referred to as a label. The neural network applies a convolutional operation based on Kim (2014), followed by an output layer. A focus in Mullenbach et al. (2018) was on the use of attention, so they compared the following two settings in detail: CNN, in which each filter's output is reduced by taking the maximal value across all words (i.e., max-pooling, without attention), and CAML, in which a per-label attention layer replaces the max-pooling.
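To make the CNN/CAML distinction concrete, the following is a minimal PyTorch sketch of the two settings as described above: a shared convolutional encoder, with either max-pooling (CNN) or per-label attention (CAML) on top. Class names and the illustrative sizes are ours, not from the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Shared convolutional encoder over word embeddings (Kim, 2014 style)."""
    def __init__(self, vocab_size, emb_dim=100, num_filters=50, kernel_size=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        e = self.embed(x).transpose(1, 2)      # (batch, emb_dim, seq_len)
        return torch.tanh(self.conv(e))        # (batch, num_filters, seq_len')

class CNN(nn.Module):
    """No attention: take the maximal value across all words, then a linear layer."""
    def __init__(self, vocab_size, num_labels, **kw):
        super().__init__()
        self.enc = ConvEncoder(vocab_size, **kw)
        self.out = nn.Linear(self.enc.conv.out_channels, num_labels)

    def forward(self, x):
        h = self.enc(x)                        # (batch, filters, seq_len')
        pooled = h.max(dim=2).values           # max-pooling over word positions
        return self.out(pooled)                # (batch, num_labels) logits

class CAML(nn.Module):
    """Per-label attention replaces the max-pooling of CNN."""
    def __init__(self, vocab_size, num_labels, **kw):
        super().__init__()
        self.enc = ConvEncoder(vocab_size, **kw)
        nf = self.enc.conv.out_channels
        self.U = nn.Linear(nf, num_labels)     # one attention query per label
        self.beta = nn.Parameter(torch.randn(num_labels, nf) * 0.01)
        self.bias = nn.Parameter(torch.zeros(num_labels))

    def forward(self, x):
        h = self.enc(x)                                       # (batch, nf, seq)
        alpha = F.softmax(self.U(h.transpose(1, 2)), dim=1)   # (batch, seq, labels)
        v = torch.bmm(h, alpha)                               # (batch, nf, labels)
        return (v * self.beta.t().unsqueeze(0)).sum(dim=1) + self.bias
```

Both models map a batch of token-id sequences to one logit per label, so the same binary cross-entropy training procedure applies to either.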
For the data set MIMIC-III-full, CAML, which includes an attention layer, was shown to be significantly better than CNN on all criteria; see Table 1a. However, for MIMIC-III-50, the subset of the 50 most frequent labels, the authors reported in Table 1b that CAML is not better than CNN.
The paper by Mullenbach et al. (2018) has been highly influential. By using exactly their training, validation, and test sets for experiments, many subsequent studies have proposed new and better approaches; see the references listed in Section 1. Most of them copied the CNN and CAML results from Mullenbach et al. (2018) as the baseline for comparison. Table 2 summarizes their superior results on MIMIC-III-50. While using the same MIMIC-III-50 set, these subsequent studies differ from Mullenbach et al. (2018) in various ways. They proposed sophisticated networks and may incorporate additional information (e.g., label descriptions, a knowledge graph of words, etc.). Further, they may change settings not considered as parameters for tuning in Mullenbach et al. (2018). For example, Mullenbach et al. (2018) truncated each document to at most 2,500 tokens, but Vu et al. (2020) used 4,000.

Investigation
We investigate the performance of the CNN and CAML approaches of Mullenbach et al. (2018) on the set MIMIC-III-50. Some implementation details are left to the supplementary materials.

Parameter Selection in Mullenbach et al. (2018)
Mullenbach et al. (2018) conducted parameter tuning on a validation set of MIMIC-III-full. Over the parameter ranges shown in Table 3, they applied Bayesian optimization (Snoek et al., 2012) to choose the parameters achieving the highest precision@8 on the validation set; see the selected values in Table 3 and the definition of precision in Table 1. However, the following settings are fixed instead of being treated as parameters for tuning.
• Each document is truncated to have at most 2,500 tokens. Word embeddings are from the CBOW method (Mikolov et al., 2013) with embedding size 100.
• The stochastic gradient method Adam implemented in PyTorch is used with its default setting. However, the batch size is fixed to 16, and the learning rate is treated as a parameter. The binary cross-entropy loss is used.
• The Adam method is terminated if the validation precision@8 does not improve for 10 epochs. The model achieving the highest validation precision@8 is used to predict the test set for obtaining the results in Table 1a.
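The fixed training protocol above can be sketched as follows. This is a simplified illustration, not the original implementation: it uses in-memory tensors instead of the real data pipeline, and the function names are ours.

```python
import torch
import torch.nn as nn

def precision_at_k(logits, y_true, k=8):
    """Mean fraction of true labels among each instance's top-k scored labels."""
    topk = logits.topk(k, dim=1).indices        # (n, k) predicted label indices
    hits = y_true.gather(1, topk)               # (n, k) 0/1 indicators
    return hits.float().mean().item()

def train(model, train_set, val_set, lr=1e-3, batch_size=16,
          patience=10, max_epochs=200, k=8):
    """Adam + binary cross-entropy; stop when validation precision@k has not
    improved for `patience` epochs; keep the best model state."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    Xtr, Ytr = train_set
    Xval, Yval = val_set
    best, best_state, stall = -1.0, None, 0
    for epoch in range(max_epochs):
        for i in range(0, len(Xtr), batch_size):
            opt.zero_grad()
            loss = loss_fn(model(Xtr[i:i+batch_size]), Ytr[i:i+batch_size])
            loss.backward()
            opt.step()
        with torch.no_grad():
            score = precision_at_k(model(Xval), Yval, k=k)
        if score > best:                        # new best validation score
            best, stall = score, 0
            best_state = {n: p.clone() for n, p in model.state_dict().items()}
        else:
            stall += 1
            if stall >= patience:               # no improvement for 10 epochs
                break
    model.load_state_dict(best_state)           # restore the best checkpoint
    return model, best
```

Note that only the learning rate is tuned here; the batch size, loss, and stopping criterion are fixed, mirroring the settings listed above.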
Interestingly, for the 50-label subset of MIMIC-III, Mullenbach et al. (2018) did not conduct a parameter-selection procedure. Instead, a decision was made to use the same parameters selected for the full-label set. Further, they switched to presenting precision@5 instead of precision@8 because, on average, each instance in MIMIC-III-50 is associated with fewer labels.

Table 2: Results on MIMIC-III-50 reported by subsequent studies (values averaged over the random seeds; results of our investigation in Section 3 are listed for comparison).

Method | Macro-F1 | Micro-F1 | P@5 | Y/N | Main technique
CNN (Mullenbach et al., 2018) | 0.576 | 0.625 | 0.620 | Y |
CAML (Mullenbach et al., 2018) | 0.532 | 0.614 | 0.609 | Y |
New network architectures:
MVC-LDA (Sadoughi et al., 2018) | 0.597 | 0.668 | 0.644 | N | multi-view convolutional layers
DACNM (Cao et al., 2020b) | 0.579 | 0.641 | 0.616 | N | dilated convolution
BERT-Large (Chen, 2020) | 0.531 | 0.605 | - | N | BERT model
MultiResCNN (Li and Yu, 2020) | 0.606 | 0.670 | 0.641 | Y | multi-filter convolution and residual convolution
DCAN (Ji et al., 2020) | 0.615 | 0.671 | 0.642 | Y | dilated convolution, residual connections
G-Coder without additional information (Teng et al., 2020) | - | 0.670 | 0.637 | N | multiple convolutional layers
LAAT (Vu et al., 2020) | 0.666 | 0.715 | 0.675 | Y | LSTM before attention
New network architectures + additional information (e.g., label description, label co-occurrence, label embeddings, knowledge graph, adversarial learning, etc.):
LEAM (Wang et al., 2018) | 0.540 | 0.619 | 0.612 | Y | label embeddings used
MVC-RLDA (Sadoughi et al., 2018) | 0.615 | 0.674 | 0.641 | N | label description used
MSATT-KG (Xie et al., 2019) | 0.638 | 0.684 | 0.644 | N | knowledge graph
HyperCore (Cao et al., 2020a) | 0.609 | 0.663 | 0.632 | N | label co-occurrence and hierarchy used
G-Coder with additional information (Teng et al., 2020) | - | 0.692 | 0.653 | N | knowledge graph, adversarial learning

The decision of not separately tuning parameters for MIMIC-III-50, as we will see, has a profound effect. In fact, because CAML is slightly worse than CNN in Table 1b, Mullenbach et al. (2018) already suspected that parameter tuning might be needed. They stated that "we hypothesize that this is because the relatively large value of k = 10 for CAML leads to a larger network that is more suited to larger datasets; tuning CAML's hyperparameters on this dataset would be expected to improve performance on all metrics." However, it seems no subsequent work tried to tune the parameters of CNN or CAML on MIMIC-III-50.

Reproducing Results in Mullenbach et al. (2018)
To ensure the correctness of our implementation, we first reproduce the results in Mullenbach et al. (2018) by considering the following two programs.
• The public code by Mullenbach et al. (2018) at github.com/jamesmullenbach/caml-mimic.
• Our implementation of CNN/CAML following the description in Mullenbach et al. (2018), with the parameters listed in Table 3. The code is part of our development of the software LibMultiLabel.
After some tweaks, on one GPU machine both programs give exactly the same results.

Parameter Selection for MIMIC-III-50
We apply the parameter-selection procedure that Mullenbach et al. (2018) used for MIMIC-III-full to MIMIC-III-50; see details in Section 3.1. One difference is that, because training on MIMIC-III-50 is faster than on MIMIC-III-full, instead of using Bayesian optimization, we directly check a grid of parameters roughly within the ranges given in Table 3. Because Mullenbach et al. (2018) switched to reporting test precision@5 for MIMIC-III-50, for validation we also use precision@5.
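A grid search of this kind can be sketched as follows. The parameter names and candidate values below are illustrative only; the actual grid follows the ranges in Table 3, and `train_and_validate` stands for any routine that trains one model and returns its validation precision@5.

```python
import itertools

def grid_search(train_and_validate, grid):
    """Exhaustively try every parameter combination in `grid` and keep the
    one with the highest validation precision@5."""
    best_params, best_p5 = None, -1.0
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        p5 = train_and_validate(**params)   # train one model, get validation P@5
        if p5 > best_p5:
            best_params, best_p5 = params, p5
    return best_params, best_p5

# Hypothetical ranges for illustration; not the values of Table 3.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "num_filters":   [50, 150, 500],
    "kernel_size":   [2, 4, 8],
    "dropout":       [0.2, 0.4],
}
```

Unlike Bayesian optimization, the grid visits every combination, which is affordable here only because training on MIMIC-III-50 is relatively fast.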
To see the effect of random seeds, besides the one used in Mullenbach et al. (2018), we checked two other seeds, 1,331 and 42, selected solely because they are the lucky numbers of an author. Table 4 shows the CNN/CAML results after parameter selection, and we have the following observations.
• Both CNN and CAML achieve better results than those reported in Table 1b by Mullenbach et al. (2018). The improvement of CAML is so significant that it becomes better than CNN.
• From details in the supplementary materials, for some parameters (e.g., d_c and q for CAML), the selected values are very different from those used by Mullenbach et al. (2018). Thus parameters selected for MIMIC-III-full are not transferable to MIMIC-III-50, and a separate tuning is essential.
• Results are not sensitive to the random seeds.
• A comparison with Table 2 shows that most subsequent developments cannot surpass our CAML results. Some are even inferior to CNN, which is the baseline of all these studies.
• We checked whether the subsequent developments conducted parameter selection; a summary is in the supplementary materials. Based on our results, how much progress past works have made is therefore unclear.
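Controlling the seed so that each run is repeatable can be sketched as below; this is a common recipe rather than the exact code of any of the works discussed.

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Fix the random number generators relevant to one training run so that
    repeated runs with the same seed give the same result."""
    random.seed(seed)                 # Python's own generator
    np.random.seed(seed)              # NumPy (e.g., data shuffling)
    torch.manual_seed(seed)           # PyTorch CPU generator
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU generators, if any

# One independent run per seed, e.g., the two extra seeds checked here:
for seed in (1331, 42):
    set_seed(seed)
    # ... train and evaluate one model ...
```

Note that identical seeds guarantee identical results only on the same hardware and software stack, which is why the appendix also records the GPU and CUDA versions used.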

Discussion and Conclusions
The intention of this paper is to provide constructive critiques of past works rather than to place blame on their authors. Because of the many parameter combinations, checking all of them is extremely difficult. However, our investigation shows that when resources and time are available, more attention should be paid to parameter selection. For Mullenbach et al. (2018), as they had already done a comprehensive selection on the superset MIMIC-III-full, the same procedure on the smaller MIMIC-III-50 was entirely feasible. The decision of not doing so led to a weak baseline in the subsequent developments.
In conclusion, besides proposing new techniques such as sophisticated networks, more attention should be placed on parameter selection. In the future, this helps to ensure that strong baselines are used to check progress.


A Implementation Details
In the implementation of Mullenbach et al. (2018), the documents in a batch are padded so that all documents in the batch have the same number of tokens. Thus, results of the forward operation depend on the batch size. This setting causes issues in validation because a result independent of the batch size is needed. Further, for many applications one instance appears at a time in the prediction stage. Thus we follow Mullenbach et al. (2018) to use batch size = 1 in validation and prediction.
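The batch-size dependence comes from per-batch padding, which the following toy sketch illustrates (function and variable names are ours): the same document is padded to a different length depending on which documents share its batch, so any pooling that sees the padded positions can change with the batch composition.

```python
import torch

def pad_batch(docs, pad_id=0):
    """Pad every document to the longest length in its batch so the batch can
    be stacked into one tensor.  The pad length therefore depends on which
    documents happen to share a batch."""
    max_len = max(len(d) for d in docs)
    return torch.tensor([d + [pad_id] * (max_len - len(d)) for d in docs])

short = [5, 6, 7]
long_doc = [1, 2, 3, 4, 8, 9]
alone = pad_batch([short])               # short alone: padded length 3
together = pad_batch([short, long_doc])  # short with a longer doc: padded length 6
```

With batch size = 1 in validation and prediction, every document is padded only to its own length, so the output no longer depends on how instances are grouped.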
After the convolutional layer, Mullenbach et al. (2018) consider the tanh activation function. For both convolutional and linear layers, a bias term is included.
Before the training process, Mullenbach et al. (2018) sort the data according to their lengths. In the stochastic gradient procedure, the data are not reshuffled, so the instances in each batch are the same across epochs. While this setting is less common in other works, we follow suit to ensure the reproducibility of their results.
In the stochastic gradient procedure, we follow Mullenbach et al. (2018) in setting 200 as the maximal number of epochs. This setting differs from the default 100 epochs in the software LibMultiLabel employed for our experiments. In most situations, the program does not reach the maximal number of epochs. Instead, it terminates after the validation P@5 does not improve in 10 epochs; this criterion also follows Mullenbach et al. (2018).
All models were trained on one NVIDIA Tesla P40 GPU with the CUDA 10.2 platform and cuDNN 7.6. Note that results may slightly vary if experiments are run on different architectures.


B A Note on Macro-F1
Mullenbach et al. (2018) report Macro-F1 defined as the F1 value of macro-precision and macro-recall, where macro-precision and macro-recall are respectively the means of per-class precision and recall. This definition differs from the Macro-F1 used in most other works, in which an F1 value is obtained for each class first and the mean of these values is taken as Macro-F1; see the discussion of Macro-F1 definitions in Opitz and Burst (2021). Because the works mentioned in Table 2 may not indicate whether they use the same Macro-F1 formula as Mullenbach et al. (2018), readers should exercise caution in interpreting the Macro-F1 results in Table 2.
However, based on the Micro-F1 and P@5 results, the main point of this paper still stands.
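The two Macro-F1 definitions can give noticeably different values, as the following self-contained example (with made-up per-class counts) shows.

```python
import numpy as np

def macro_f1_of_means(tp, fp, fn):
    """Mullenbach et al. (2018): F1 of macro-precision and macro-recall."""
    prec = np.mean(tp / (tp + fp))          # mean of per-class precisions
    rec = np.mean(tp / (tp + fn))           # mean of per-class recalls
    return 2 * prec * rec / (prec + rec)

def mean_of_f1s(tp, fp, fn):
    """The more common definition: per-class F1 first, then the mean."""
    f1 = 2 * tp / (2 * tp + fp + fn)        # per-class F1 values
    return f1.mean()

# Two classes with made-up true positives / false positives / false negatives:
tp = np.array([9.0, 1.0])
fp = np.array([1.0, 9.0])
fn = np.array([0.0, 0.0])
# macro_f1_of_means gives 2/3, while mean_of_f1s gives about 0.565.
```

With these counts, macro-precision is 0.5 and macro-recall is 1.0, so the first definition yields 2/3; the per-class F1 values are 18/19 and 2/11, so the second yields their mean, about 0.565. Comparing a value computed under one definition against a value computed under the other is therefore not meaningful.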