Intent Identification and Entity Extraction for Healthcare Queries in Indic Languages

Scarcity of data and technological limitations for resource-poor languages in developing countries like India pose a serious obstacle to the development of sophisticated NLU systems for healthcare. To assess the current status of various state-of-the-art language models in healthcare, this paper studies the problem by proposing two healthcare datasets, Indian Healthcare Query Intent-WebMD and Indian Healthcare Query Intent-1mg (IHQID-WebMD and IHQID-1mg), along with one real-world Indian hospital query dataset, in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati), annotated with query intents as well as entities. Our aim is to detect query intents and the corresponding entities. We perform extensive experiments on a set of models in various realistic settings, and explore two scenarios based on access to English data only (less costly) versus access to target-language data (more expensive). We analyze context-specific practical relevance through empirical analysis. The results, expressed in terms of overall F-score, show that our approach is practically useful for identifying intents and entities.


Introduction
Healthcare is a top priority for every country. People across the world ask millions of health-related queries, hoping to get a response from a domain expert (Gebbia et al., 2020). These queries mostly deal with the medical history of patients, possible drug interactions, disease-related concerns, treatment protocols and so on. Conversational agents for healthcare play a pivotal role by facilitating useful information dissemination (Li et al., 2020; Maniou and Veglis, 2020). In order to understand these queries better, practical conversational systems for healthcare need to be developed. However, the primary obstacle in developing such technologies for low-resource languages is the lack of usable data (Mehta et al., 2020; Daniel et al., 2019; Liu, 2022).
India is a country with a diverse language-speaking population suffering from abject poverty and low economic status (Mohanty, 2010; Pande and Yazbeck, 2003). This linguistic diversity and complex socio-economic situation certainly pose significant challenges in developing automatic healthcare systems, and there is a lack of linguistic resources specific to the medical domain. For example, the patient and the doctor speaking different languages is not an uncommon situation in rural India. Such individuals are unable to avail themselves of the existing systems and facilities, which exist mainly in English. Recent efforts in developing automatic translation systems, even for extremely low-resource languages such as 'Mundari' and 'Gondi' (Joshi et al., 2019), should ideally improve this situation, but there is no extensive study on that front.
In order to bridge this language barrier, Massively Multilingual Transformer-based Language Models (MMLMs) (Devlin et al., 2019; Lample and Conneau, 2019) have made impressive advances on a wide range of downstream applications. But the real-world implications of such advances for the Indian healthcare system remain largely unexplored. In this paper, we aim to explore the scarcity of data and study the extent to which existing language technologies can be leveraged to develop practically useful healthcare systems for low-resource languages in developing countries.
With an aim to answer our research question, we create two multilingual healthcare datasets, namely, IHQID-WebMD and IHQID-1mg. These datasets are created by crawling frequently asked questions from two healthcare websites, WebMD and 1mg. They comprise frequently asked questions about drugs, diseases and treatment methods in seven languages, namely English, Hindi, Bengali, Tamil, Telugu, Gujarati and Marathi. The queries are manually tagged with intent labels and entity tags by domain experts and translated by native speakers of the corresponding languages. We also collect annotated real-world Indian hospital queries in the seven languages to check the empirical effectiveness of our approach. Fig. 1 shows an example of a health query belonging to the 'treatment' intent class, manually translated into three different languages. We then evaluate the performance of state-of-the-art language models (LMs), in both English and multilingual setups, on our datasets to answer questions regarding their deployability and practicality. Various experimental configurations (Section 4) are tried on these datasets, where we attempt to identify the best technologies through extensive experimentation in two real-world scenarios. First, we assume access to only English training queries (less costly), while the test queries are multilingual in nature. We observe that the translate-test setup with RoBERTa seems to be a reasonable choice of technology. Second, we assume access to manually written multilingual training and test queries in the target languages, which is indeed quite expensive in terms of data collection effort. Here, back-translation of both train and test queries proves to be a reasonable choice if we have the budget to collect data in the target languages.
In sum, our contributions are fourfold: • We propose two intent- and entity-labelled Indian healthcare datasets (annotated by domain experts) comprising frequently asked questions from users.
• Even though large language models have proved their effectiveness in almost every NLU task, we want to determine their effectiveness at intent detection and slot filling for practical domain-specific healthcare scenarios in the Indian context. We intend to analyze how research and resource-building investments should be prioritized for economically backward countries with a high percentage of multilingual population. This will make us aware of the best techniques for deploying language models in various scenarios, such as availability of English training data vs. multilingual training data. Keeping this in mind, all our experiments are carried out using both monolingual and multilingual setups of these models. Through our experiments, we try to point out the best possible language models and techniques for developing practically useful NLU solutions (a pipeline-based approach for intent detection and corresponding entity extraction from queries).
• Through extensive experiments on the datasets, we recommend that the community use back-translation of test queries to English as a reasonable choice when only English training data is accessible. The same strategy can be applied to both train and test queries given the budget to collect data in the target languages.
• Our findings imply that the back-translation of queries using an intermediate bridge language proves to be a useful strategy in the intent recognition experiments.

Related Work
We organize our study of related work into the following buckets: generalized intent and entity detection, entity and intent detection in healthcare, healthcare in Indian languages, and multilingual healthcare datasets.

Necessity of a new dataset
India is a country with a diverse language-speaking population, and there is an increasing number of users consuming Indian-language content. This linguistic diversity certainly poses significant challenges in a healthcare setup, particularly when healthcare providers and patients speak different languages (also termed Language Discordance) (Shamsi et al., 2020). As a result, individuals with limited English proficiency are left behind and suffer worse health outcomes than those who speak English with high proficiency. The growing need for the deployment of multilingual conversational agents in hospitals and healthcare facilities in India, highlighted especially by the plight of healthcare workers during the COVID-19 pandemic, warrants a multilingual healthcare query intent dataset in Indian languages (Daniel et al., 2019). Therefore, we create two novel Indian Healthcare Query Intent Datasets (IHQID-WebMD and IHQID-1mg) and one real-world healthcare dataset from hospitals.

Source of the dataset
Due to the unavailability of open-source multilingual NLU datasets in the healthcare setup, we sample frequently asked medical queries (FAQs) in English from two popular data sources. WebMD1: an American website containing a large repository of healthcare data; the queries, taken from the WebMD health forum, are asked by ordinary users regarding a wide range of problems. 1mg2: an Indian website, which is also a rich source of healthcare data, especially in the Indian context; the English queries are scraped from the FAQ sections of its drug and disease pages.
Although both the above datasets are curated from online forums where users post healthcare concerns, in order to evaluate our approach in a practical Indian context, we also develop a real-world healthcare query dataset for the Indian scenario. We collect real-world healthcare queries (asked by patients) from doctors in local hospitals. All queries are anonymous, without the identity or any details of the patients. For each language, we fetch 100 queries (some of which overlap) belonging to different categories.

Dataset Sampling
The FAQs sampled from these data sources are unlabeled. Hence, for the purpose of supervised classification, it is necessary to categorize each query into a specific intent and a list of corresponding entities. We broadly categorize queries into four intent types, namely, 'Disease', 'Drug', 'Treatment Plan' and 'Other'. Each query is assigned one of the four intent labels. Two English-speaking medical graduates annotate the intents of the English queries to prepare the datasets. Annotators also mark entities belonging to the three medical entity categories present in the datasets: 'Disease', 'Drug' and 'Treatment'. Queries are retained with their intent labels where both annotators agree, and discarded otherwise. This filtering led to the rejection of around 10% of the samples across all our setups and languages. The overall inter-annotator agreement, Cohen's κ, is 0.89.
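The agreement statistic above is straightforward to reproduce; below is a minimal, self-contained sketch of Cohen's κ (the label sequences in the example are hypothetical, not drawn from the dataset):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c]
                   for c in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical intent labels from two annotators on four queries
ann1 = ["Drug", "Drug", "Disease", "Other"]
ann2 = ["Drug", "Disease", "Disease", "Other"]
kappa = cohen_kappa(ann1, ann2)  # about 0.64: substantial but imperfect agreement
```

A κ of 0.89, as reported above, indicates near-perfect agreement after chance correction, which supports the reliability of the intent labels.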

Parallel Data Generation
In order to generate parallel corpora of these frequently asked questions, we choose six Indian languages in addition to English. Language Selection: The language set includes English: USA version (EN-US), termed 'En', Hindi ('Hi'), Bengali ('Bn'), Tamil ('Ta'), Telugu ('Te'), Gujarati ('Gu') and Marathi ('Mr'). The choice of languages was driven by (a) the number of native speakers of those languages in India, (b) the number of annotators available for creating the dataset, and (c) typological diversity among the languages: we choose languages from different language families. For instance, Bengali, Hindi, Gujarati and Marathi belong to the Indo-Aryan family, whereas Tamil and Telugu belong to the Dravidian group. Annotation and Quality Control: Since gold-standard annotated queries are not available online in Indian languages, the English queries of 1mg and WebMD have to be manually translated. After discussions with doctors and different patients, we create the annotation guidelines. Annotators are told to formulate the queries in their own regional languages with the help of the Bing Translator API3. Annotators are also asked to annotate the entities and their types (in their respective native languages) for each corrected query, keeping in mind how common people of the corresponding native language generally phrase healthcare queries to doctors.
Three annotators are selected per language after several discussions, subject to fulfilling several criteria: annotators should have native proficiency in their language of annotation and domain expertise, along with good working proficiency in English. Initial labeling is done by two annotators, and any annotation discrepancy is checked and resolved by the third annotator after discussion with the others. While formulating the query manually on their own, the annotators are also asked to annotate the entities and their types (in their respective native languages) for each corrected query. The above quality control measures ensure that the translated data is of high quality, resembling real-world data in the target language. In the case of a word such as a proper noun like 'Paracetamol' (drug), which has no translation in the target vocabulary, the word is simply transliterated into the target language.
In order to prepare the real-world hospital query dataset in the Indian healthcare context, we collect healthcare queries from doctors of local hospitals. This dataset also covers the six Indic languages along with English, with a hundred queries for each language. These queries have the same intent classes and entity categories, which are labelled by the doctors. During the collection of queries, we fix a minimum number of samples for each intent class across all languages.
In order to maintain the quality of the Indian-language annotations, the annotators are directed to use native-language words and grammar, keeping the original interpretation of the query. All query logs, annotations and changes are recorded for future verification and analysis. On completion of the translation process, the annotators are asked to exchange their work and check the quality of the translation for fluency and semantic stability. Inaccuracies are noted, and the respective queries are rectified in the dataset.
In the end, we have three multilingual intent- and entity-labelled datasets: IHQID-WebMD, IHQID-1mg and a real-world hospital query test dataset, each in seven languages; their distributions are provided in Table 1. The first two datasets (IHQID-WebMD and IHQID-1mg) are used to build the models, and the real-world hospital dataset is used to evaluate our approaches in real-world contexts. Table 1 also shows statistical details across the different intent classes ('Disease', 'Drug', 'Treatment' and 'Other') and the corresponding entities (of the 'Disease', 'Drug' and 'Treatment' categories), along with the total counts and train-test divisions. It also shows the distribution of the hospital-collected practical healthcare queries across the different languages (right part of the table).

Strategies of Evaluation
In this section, we illustrate the strategies for evaluating state-of-the-art LMs on our datasets. Our evaluation of these models for healthcare is scoped down to two fundamental NLU tasks: (a) Intent Recognition (Section 5.1) and (b) Entity Extraction (Section 5.2). Evaluation Setup Description: Our evaluation is conducted keeping in mind the availability of human-translated monolingual and multilingual training data in two possible real-life scenarios: 1) Scenario A: we assume access to only English training data (less costly); 2) Scenario B: we assume access to manually written training queries in all the target languages (very expensive). During inference/testing, we expect all queries to be in the corresponding target languages. Scenario A: Setup 1) Backtranslated Test (S1) [Translate-Test]: Here we develop our system by training the models on the English queries, and evaluate the intent detection and entity extraction systems in different languages by automatically backtranslating the test queries into English (similar to (Gupta et al., 2021)). Setup 2) Zero-Shot Cross-Lingual Test (S2): Cross-lingual transfer learning is a useful methodology for tasks involving scarce data (Zhou et al., 2016; Karamanolakis et al., 2020). In this setup, the models make use of zero-shot cross-lingual capabilities from training on the English data (scraped from WebMD and 1mg) and are used for inference on test queries in Indic languages. Setup 3) Bridge Language Backtranslation (S3): Here a relatively low-resource language is first translated into an intermediate language and then into English. The motivation behind this setup lies in the fact that even though these Indic languages use different scripts, there are linguistic and morphological similarities among them which may improve the translation to English when they are used as intermediate languages.
In this paper, we consider 'Hindi' as the bridge language. The notion of such "bridge" languages has been explored previously in the context of Machine Translation (Paul et al., 2013) and zero/few-shot transfer in MMLMs (Lauscher et al., 2020). Scenario B: Setup 4) Train and Test on Indic Data (S4): In this setup, we use the training data in Indic languages to train our NLU models in the different target languages. Here, we use the IHQID-WebMD and IHQID-1mg Indic (non-English) data to evaluate the NLU performance of the developed models. Jennifer Bot (Li et al., 2020) uses a similar setup to extend an English bot to Spanish. Setup 5) Full Backtranslation (S5): In this setup, both train and test data are backtranslated to English. This is useful for countries with poor technical support for low-resource languages, since an automated approach can translate low-resource medical queries to a resource-rich language for both training and testing.
In all backtranslation experiments, we use the Bing Translator API4.
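The two backtranslation routes (Setup 1 and Setup 3) differ only in how a query is routed through languages. The sketch below fixes that routing logic; the `translate` argument is a placeholder for whatever MT service is available (the paper uses the Bing Translator API) and the stub below is purely illustrative, not the authors' implementation:

```python
def translate_test(query, src_lang, translate):
    """Setup 1 (S1): backtranslate the test query directly into English."""
    return translate(query, src_lang, "en")

def bridge_backtranslate(query, src_lang, translate, bridge="hi"):
    """Setup 3 (S3): route the query through a bridge language
    (Hindi in the paper) before translating into English."""
    intermediate = translate(query, src_lang, bridge)
    return translate(intermediate, bridge, "en")

# Example with a stub translator that just records the route taken
route = []
def stub_translate(text, src, tgt):
    route.append((src, tgt))
    return text

bridge_backtranslate("query text in Bengali", "bn", stub_translate)
# route is now [("bn", "hi"), ("hi", "en")]
```

In deployment, the stub would be replaced by a call to the MT service; the rest of the pipeline (training and inference on English) is unchanged between S1 and S3.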

Experiments and Results
Experimental Setup: Our experiments are conducted on two Tesla P100 GPUs with 16 GB RAM, a 6 Gbps clock cycle and GDDR5 memory. All entity extraction and intent detection methods took less than 30 GPU-minutes to train. We perform a hyperparameter search, report the results of the settings that achieve the best performance, and then fix these settings for all models. The batch size is 16, the number of epochs is 10, the optimizer is AdamW, and the learning rate is 1e-5 with cross-entropy as the loss function.
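For reference, the fixed hyperparameters above can be collected in a single configuration block (an illustrative fragment; the key names are ours, not taken from the authors' code):

```python
# Hyperparameters fixed for all models after the search described above.
TRAIN_CONFIG = {
    "batch_size": 16,
    "epochs": 10,
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "loss": "cross_entropy",
}
```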

Intent Detection
Task Description: Intent detection can be defined as a multi-class classification task of correctly assigning a medical query an intent label from a fixed set of intents (drug, disease, treatment and other). Classification Models: Since in Setups 1, 3 and 5 both the training and test sets are in English, we use state-of-the-art LMs pre-trained on English corpora (as shown in (i)) for our classification experiments, whereas in Setups 2 and 4 we make use of multilingual LMs (as shown in (ii)), which have been widely used for various benchmark tasks in Indian languages. The baselines are as follows: (i) Pre-trained English Models: For Setups 1, 3 and 5, we fine-tune the last layer of RoBERTa (Liu et al., 2019) and Bio_ClinicalBERT (Alsentzer et al., 2019) on the English queries for intent detection by adding a classification layer that takes the [CLS] token as input. The latter is a state-of-the-art domain-specific transformer-based language model pre-trained on MIMIC-III notes5, a collection of electronic health records and discharge notes.
(ii) Pre-trained Multilingual Models: Two pre-trained multilingual LMs are used, mBERT (bert-base-multilingual-uncased) (Pires et al., 2019) and XLM-RoBERTa (xlm-roberta-base) (Conneau et al., 2020); both support all the Indic languages in the datasets along with English. In Setup 2, we perform zero-shot classification using these models: the zero-shot setting involves fine-tuning the model on English data and testing on Indic languages. In Setup 4, we first train these models using the entire train sets in the target languages, separately for WebMD and 1mg, and check the performance on the test sets.

Entity Recognition
Task Description: This task is analogous to Named Entity Recognition (NER) for three categories, namely, drugs, diseases and treatments, over the query texts. We follow the standard BIO tagging scheme while annotating the entities word by word. The train and test files for each configuration and language are constructed from our WebMD and 1mg datasets. Extraction Frameworks: For entity recognition, we follow the same strategies for evaluating the predictive performance of the LMs as described in Section 4. The same models (described in Section 5.1) are also used for the entity recognition experiments.
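To make the BIO scheme concrete, the sketch below converts token-level entity spans into BIO labels. The helper and the example query are our own illustration, not part of the released annotation tooling:

```python
def bio_tags(tokens, entities):
    """Convert half-open token spans (start, end, type) into BIO labels."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"        # continuation tokens
    return tags

tokens = "What is the dosage of paracetamol for viral fever".split()
entities = [(5, 6, "Drug"), (7, 9, "Disease")]
print(bio_tags(tokens, entities))
# ['O', 'O', 'O', 'O', 'O', 'B-Drug', 'O', 'B-Disease', 'I-Disease']
```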

Evaluation
For all our experiments on intent detection and entity recognition, we calculate precision and recall and report the F1-score.
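For clarity, the macro-averaged F1 reported below weights every class equally, regardless of class frequency. A minimal reference implementation (ours, for illustration; libraries such as scikit-learn provide the same metric):

```python
def macro_f1(gold, pred):
    """Macro F1: per-class F1 averaged with equal weight per class."""
    f1_scores = []
    for cls in set(gold) | set(pred):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# A rare class scoring zero is penalised as heavily as a frequent one
gold = ["drug", "drug", "disease", "other"]
pred = ["drug", "drug", "disease", "drug"]
macro_f1(gold, pred)  # 0.6: the missed 'other' pulls the macro average down
```

Macro averaging is why a weak class (such as the 'Other' intent discussed later) can drag the overall score down despite strong performance on the frequent classes.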

Results and Analysis
Intent Detection: Table 2 shows the intent detection results of the five experimental strategies on the IHQID-WebMD and IHQID-1mg datasets in terms of macro F1-score (in percentage). Finding 1: We observe that, in general, Backtranslated Test (Setup 1) performs better than Zero-Shot Cross-Lingual Test (Setup 2). Moreover, it is interesting to notice that even though the performance of these models for most of the target languages in Setup 1 is comparable with that of English on WebMD (an average drop of 3% across languages compared to English), there is a significant drop (an average of 6%) in the F1 scores for Setup 1 on the 1mg dataset. This holds for both the RoBERTa and bcBERT experiments. This indicates that state-of-the-art English models, pre-trained on both generic and medical-domain corpora, perform decently after backtranslation of medical queries into English but lag behind when the vocabularies of the medical entities are in the Indian context. This definitely calls for immediate attention to developing LMs pre-trained on India-specific medical datasets.
Finding 2: Another interesting observation is that the use of Bridge Language Backtranslation (Setup 3, Table 2) helps boost performance for most languages on the 1mg dataset in comparison to Setup 1. The observation does not hold for intent recognition on the WebMD dataset. This might be attributed to the fact that using a bridge Indian language as an intermediate helps preserve the domain-specific sense of the queries, instead of directly converting the queries from the target language to English. This seems a reasonable alternative for developing useful intent recognition models for healthcare in Indian languages.
Finding 3: In comparison with zero-shot cross-lingual transfer (Setup 2), both the mBERT and XLM-R models are outperformed by the few-shot experiments (Setup 4) for intent detection. This observation holds for both the WebMD and 1mg datasets. However, Setup 4 is much more cost-intensive than Setup 2.
Finding 4: We report the average (Avg) F1-score across all languages. The best-performing model is RoBERTa (Setup 1 for English and Setup 5 for non-English) for both WebMD (74.94%) and 1mg (70.33%). RoBERTa is used for further evaluations.
Entity Extraction: Table 3 displays the results of the entity recognition task under the five different strategies on the IHQID-WebMD and IHQID-1mg datasets.
Finding 1: For the Backtranslated Test (Setup 1), we observe that on the WebMD dataset the drop relative to English (English is 0.33% higher in average F1 for RoBERTa and 3.58% higher for bcBERT) is far less significant than the drop observed for 1mg (English is 9.66% higher in average F1 for RoBERTa and 10.49% higher for bcBERT). This implies that the loss of information is quite high for entities in the Indian context during backtranslation.
Finding 2: Unlike our findings for Setup 3 in intent recognition, we observe that backtranslation using a bridge language seems to induce more loss of information on the entities compared to Setup 1. This observation holds for both models across the two datasets.
Finding 3: Similar to intent recognition, we observe that completely backtranslating both training and test data to English performs best among S1, S3 and S5. This holds for both datasets and both models. However, this operation is indeed expensive in terms of data curation cost, since it requires original data in the target languages for both training and testing.
Finding 4: The abysmal performance of the multilingual models shown in Table 3, for both S2 and S4, indicates that these approaches are not so useful in our case.
Finding 5: We report the average (Avg) F1-score across all languages. BioClinicalBERT performs best (Setup 1 for English and Setup 5 for non-English (Avg)) for both WebMD (63.14%) and 1mg (68.69%). It is used for further evaluations.

Ablation Study
Experiments with Varying Training Size: We experiment with varying training sizes on both the intent detection and entity extraction tasks using the best-performing models, by taking 10%, 30%, 50%, 70% and 100% of the training set. We then show the F1-scores (Y-axis) for all the languages with different training sizes (X-axis) in Fig. 2. Real-World Hospital Data Evaluation: We use the real-world healthcare query dataset (100 queries per language) to test the usability of our models in practical Indian hospital scenarios. We run the best-performing models trained on the WebMD and 1mg data for intent detection (RoBERTa in Setup 1 for English and Setup 5 for non-English) and entity extraction (BioClinicalBERT in Setup 1 for English and Setup 5 for non-English), and report the average of the two models (trained on IHQID-WebMD and IHQID-1mg) for each language. Fig. 3 shows the average F1-score for each language, which is consistent with the earlier results shown in Tables 2 and 3. This shows that the best-performing proposed setup performs satisfactorily on real-world data in Indic languages.

Demonstration
To make the proposed methods accessible and usable by the community, we create an online interface, which can be found in our GitHub repository6. With the help of this website, one can post a health query in an allowed language and obtain predictions using our best models.

Discussion and Error Analysis
We categorize the mis-classification issues into two broad themes. The primary reason is model prediction error. Figure 4 shows the model prediction errors for various intents in different languages. For example, 'How common is syphilis' belongs to the 'disease' intent category, but the model wrongly predicts the 'other' category.
Another reason is misclassification due to incorrect translation of the medical entities: for example, the disease 'urticaria' was transformed into 'ambat' during backtranslation, as shown in Figure 5, and was consequently not detected as an entity. Thus, backtranslation errors lead to both intent misclassification and entity extraction errors. We speculate that such absurd behaviour arises when the context of the query and the languages involved are semantically distant. Secondly, there are also certain issues with fluency and grammatical correctness after backtranslation. For instance,

Conclusion
We focus on developing novel Indian healthcare query datasets and propose frameworks to detect intents and extract entities from queries in different Indian languages. Through extensive experiments on our proposed datasets, we recommend that the community use backtranslation of test queries to English as a reasonable choice when only English training data is available. The same strategy can be applied to both train and test queries given the budget to collect data in the target languages. Backtranslation of queries using an intermediate bridge language also proves to be a useful strategy in some cases.

Limitations
Our dataset needs to be scaled up in terms of size and intent labels, which we aim to do as part of future work. Another constraint is that we do not consider cases where queries are multi-labelled (e.g., both drug and disease); we shall explore this in future work.

Ethical Concerns
We propose to release the dataset, which neither reveals any personally sensitive information about the patients nor contains any toxic statements. Besides, we have paid adequate remuneration (the exact amount will be revealed upon acceptance to the conference) to the domain-expert annotators who helped us manually tag the medical queries.

Figure 1: Example of a query of the 'treatment' intent category in different languages, along with the associated entities.

Figure 2: Intent detection and entity extraction F1-scores (Y-axis) for different percentages of training data (X-axis) for WebMD and 1mg. Fig. 2a and 2b show that the performance of the intent detection models does not vary much with increasing training data. However, Fig. 2c and 2d clearly show that entity extraction F1-scores increase significantly with the amount of training data. Thus, we can conclude that the intent detection model does not require a large amount of data to generalise, as opposed to the entity extraction model. Category-wise intent detection and entity extraction for the best model: We evaluate the F1-scores for the different intent classes for the RoBERTa model (Setup 1 for English and Setup 5 for non-English) trained on WebMD and 1mg (see Section 4 for setup descriptions). Similarly, with the help of BioClinicalBERT (Setup 1 for English and Setup 5 for non-English), we find the F1-scores for each individual entity class. The results in Table 4 show that the model is able to detect the 'disease', 'drug' and 'treatment' intent classes with high F1-scores, but performance on the 'Other' class is poor, bringing the macro-averaged F1 score down considerably. This may be because the system fails to detect the open-ended query types present in the 'Other' class. This is supported by the intent-class-wise entity distribution, which shows an overwhelming dominance of 'drug', 'disease' and 'treatment' entities in their corresponding intent categories ('drug', 'disease' and 'treatment plan' intents, respectively), whereas the 'other' intent class, of which there are comparatively few instances, has no dominant entity class associated with it. In the entity extraction task, the best-performing model extracts all three entity categories with similar F1-score performance.

Figure 3: Macro average F1 score for intent detection and entity extraction across all languages on the real-world hospital data.

Table 1 :
Distributions of the different types of intent and entity labels in the WebMD and 1mg datasets (IHQID) and the Real-World Hospital Query Data. (– + –) represents the (train + test) division. # denotes the count.

Table 2 :
Macro-F1 scores for intent classification on the WebMD (WMD) and 1mg datasets for the five setups (three setups training on English (Scenario A) and two setups training on Indic data (Scenario B)). bcBERT indicates BioClinicalBERT; mBERT indicates Multilingual BERT. Underlining denotes the best across the five settings.

Table 3 :
Macro-F1 scores for entity extraction on the WebMD (WMD) and 1mg datasets for the five setups (three setups training on English (Scenario A) and two setups training on Indic data (Scenario B)). bcBERT indicates BioClinicalBERT; mBERT indicates Multilingual BERT. Underlining denotes the best across the five settings.

Table 4 :
Macro-F1 scores for intent identification and entity extraction on the WebMD (WMD) and 1mg datasets.