CarExpert: Leveraging Large Language Models for In-Car Conversational Question Answering

Large language models (LLMs) have demonstrated remarkable performance by following natural language instructions without being fine-tuned on domain-specific tasks and data. However, leveraging LLMs for domain-specific question answering suffers from severe limitations. The generated answers tend to hallucinate: off-the-shelf models are limited by their training data collection time, while retrieval-augmented generation suffers from complex user utterances and wrong retrieval. Furthermore, due to the lack of awareness about the domain and the expected output, such LLMs may generate unexpected and unsafe answers that are not tailored to the target domain. In this paper, we propose CarExpert, an in-car retrieval-augmented conversational question-answering system leveraging LLMs for different tasks. Specifically, CarExpert employs LLMs to control the input, provide domain-specific documents to the extractive and generative answering components, and control the output to ensure safe and domain-specific answers. A comprehensive empirical evaluation exhibits that CarExpert outperforms state-of-the-art LLMs in generating natural, safe and car-specific answers.


Introduction
Conversational question answering (CQA) has recently gained increased attention due to the advancements of Transformer-based (Vaswani et al., 2017) large language models (LLMs). These LLMs (Devlin et al., 2019; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023b) are nowadays widely adopted for performing question answering in both open-domain and domain-specific settings (Robinson and Wingate, 2023). As the source of additional knowledge, conversational question answering systems are typically provided with text paragraphs (Kim et al., 2021; Rony et al., 2022c) or knowledge graphs (Rony et al., 2022b; Chaudhuri et al., 2021) for generating informative dialogues in a domain-specific setting, where such systems typically engage in a multi-turn interaction with a user in the form of speech or text. Figure 1 demonstrates a conversation between a user and a conversational question answering system (CarExpert) in a BMW car.
Leveraging LLMs end-to-end has several drawbacks (Liang et al., 2022; Srivastava et al., 2023; OpenAI, 2023). Firstly, the generated answer is often hallucinated, as the knowledge in the pre-trained weights of LLMs is limited to their training data collection time (Ji et al., 2022). Retrieval-augmented answer generation suffers from hallucination as well, due to wrong retrieval and the complexity of the user utterance and the retrieved document. Secondly, LLMs can be exploited using adversarial instructions that may lead the system to ingest malicious input and generate unsafe output (Perez and Ribeiro, 2022; Greshake et al., 2023). In the context of a car, these downsides imply that an answer could lead to unsafe handling of the vehicle due to a lack of instructions, precautions, warning messages, or appropriate information, or by providing erroneous or confusing information.
Addressing the aforementioned issues, in this paper we propose CarExpert, an in-car conversational question-answering system powered by LLMs. CarExpert is a modular, language-model-agnostic, easily extensible and controllable conversational question-answering system that operates on the text level. On a high level, CarExpert performs question answering in two steps. First, given a user utterance, it retrieves domain-specific relevant documents in which the potential answer may exist. Second, to predict the answer, CarExpert employs both extractive and generative answering mechanisms. Specifically, there are four sub-tasks involved in the overall process: 1) orchestration, 2) semantic search, 3) answer generation, and 4) answer moderation. CarExpert tackles unsafe scenarios by employing control mechanisms in three ways: i) in the Orchestrator using an input filter, ii) by defining prompts for controlling LLM-based answer generation, and iii) by an output filter in the Answer Moderator. Furthermore, CarExpert employs a heuristic during answer moderation to select among answers from multiple models (extractive and generative) and provide the user with the potentially best answer as the output. To facilitate voice-based user interaction in the car for real-life use, we encapsulate CarExpert with text-to-speech and speech-to-text services. Figure 2 depicts a high-level overview of the CarExpert architecture. This modular design allows flexible integration with various types of interfaces, such as a web browser or a mobile app (i.e., the BMW App).
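The four-stage flow above can be expressed as a rough sketch. Every component here is a toy placeholder (a keyword-based input filter, word-overlap retrieval, stand-in reader and generator), not the actual CarExpert implementation:

```python
# Toy sketch of the four CarExpert sub-tasks. Every component below is a
# hypothetical placeholder, not the actual implementation.

def orchestrate(utterance: str) -> str:
    """Input control: reject unsafe input, otherwise pass it through."""
    blocked = ("ignore previous instructions", "disable the airbag")
    return "REJECT" if any(b in utterance.lower() for b in blocked) else "ANSWER"

def semantic_search(utterance, documents, k=3):
    """Toy retrieval: rank documents by word overlap with the utterance."""
    words = set(utterance.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def moderate(extractive, generative, docs):
    """Toy answer moderation: prefer a candidate grounded in a document."""
    for cand in (extractive, generative):
        if any(cand in d for d in docs):
            return cand
    return generative

def answer(utterance, documents):
    if orchestrate(utterance) == "REJECT":
        return "I cannot answer that."
    docs = semantic_search(utterance, documents)
    extractive = docs[0]                                 # stand-in MRC reader
    generative = "According to the manual: " + docs[0]   # stand-in LLM
    return moderate(extractive, generative, docs)

docs = ["The high-beam assistant switches the headlights automatically.",
        "Park Assist warns of obstacles to the side of the vehicle."]
print(answer("How does the high-beam assistant work?", docs))
```

The point of the sketch is the separation of concerns: input control, retrieval, answering and output control are independent stages, so each can be swapped out without touching the others.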
To assess the performance of CarExpert we conduct exhaustive qualitative and quantitative evaluations. An empirical evaluation exhibits that CarExpert outperforms off-the-shelf state-of-the-art LLMs in in-car question answering. The contributions of this paper can be summarized as follows: • We introduce CarExpert, a modular, language-model-agnostic, safe and controllable in-car conversational question answering system.
• A novel answer moderation heuristic for selecting a potential best answer from multiple possible outputs.
• A comprehensive empirical evaluation, demonstrating the effectiveness of CarExpert over the state-of-the-art LLMs for in-car conversational question answering.

Approach
CarExpert aims to generate domain-specific, document-grounded answers. The task is divided into four sub-tasks: 1) Orchestration, 2) Semantic Search, 3) Answer Generation, and 4) Answer Moderation. We describe the sub-tasks below.

Orchestration
A prompt-based Orchestrator component is incorporated in CarExpert to tackle unsafe content and deal with multi-turn scenarios. Since the system is designed to only answer questions about the car, depending on the user utterance CarExpert can, for example, respond that it does not have enough information, or ask a clarification question. Thus the Orchestrator controls the input to CarExpert.
The prompt used for this purpose is as follows:

Task: Given a question and paragraphs:
1. For unsafe or harmful questions, politely decline to answer as they are out of context. Stop any further generation.
2. Flag any unsafe or harmful questions by politely stating that you cannot provide an answer. Stop any further generation.
3. If the question is safe and relevant, suggest a clarification question that demonstrates comprehension of the concept and incorporates information from the provided paragraphs. Start the question with "Do you mean".
4. If unsure about suggesting a specific clarification question, politely request more information to provide an accurate response. Stop any further generation.
Question: {user utterance}
Paragraphs: {paragraphs}
Answer:

where user utterance represents the current turn's user utterance and paragraphs the top-3 retrieved documents obtained from the semantic search (discussed in Section 2.2).

Semantic Search
For efficient and fast semantic search over the relevant documents, CarExpert pre-processes data and parses clean content from various curated sources (owners' manuals, self-service FAQs, car configurator feature descriptions and press club publications) using a data pipeline (more details in Appendix A.1.1). The parsed data is used in two different ways. Firstly, we put humans in the loop to obtain high-quality, domain-expert-annotated question-answer pairs for training an answer extraction model (discussed in Section 2.3.1). Secondly, the vector representation of the text is indexed only once, as a pre-processing step, to facilitate fast semantic search over a large set of texts during inference (see Figure 3). In the next step, LLMs are fed with the top-3 retrieved documents for answer generation. We use the terms 'document' and 'paragraph' interchangeably throughout this paper.
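The index-once, search-many pattern can be sketched as follows. The bag-of-words encoder is a toy stand-in for the sentence encoder used in the paper, and the documents are illustrative:

```python
import re
import numpy as np

# Sketch of index-once, search-many semantic retrieval. The bag-of-words
# encoder is a toy stand-in for a real sentence encoder.

def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())

documents = [
    "The high-beam assistant switches the headlights automatically.",
    "Park Assist warns of obstacles to the side of the vehicle.",
    "Check the tire pressure at least once a month.",
]
vocab = sorted({w for d in documents for w in tokenize(d)})

def embed(text: str) -> np.ndarray:
    toks = tokenize(text)
    vec = np.array([toks.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pre-processing step: the document vectors are indexed only once.
doc_matrix = np.stack([embed(d) for d in documents])

def search(query: str, k: int = 3) -> list:
    scores = doc_matrix @ embed(query)  # cosine similarity of unit vectors
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

print(search("How do I check the tire pressure?", k=1))
```

Because the document matrix is built once ahead of time, each query costs only one embedding plus a matrix-vector product, which is what makes inference-time retrieval fast.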

Answer Generation
CarExpert employs both extractive and generative models to obtain answers for the same user utterance. The answer generation step is controlled by instructing the LLM using prompts, and subsequently by an Answer Moderator component, which selects the best answer based on an extraction-ratio-based heuristic (discussed in Section 2.4). We describe the answer generation methods in the following sections.

LLM-based Answer Generation
In this step, CarExpert takes off-the-shelf GPT-3.5-turbo and instructs it in a few-shot manner to generate an answer based on the current user utterance, the retrieved documents and the dialogue history. The probability of generating a response can be formally defined as:

$p_\theta(S_t \mid P; H; Q) = \prod_{i=1}^{n} p_\theta(s_i \mid s_{<i}, P; H; Q)$

where $S_t = (s_1, \ldots, s_n)$ is the generated answer, P is the prompt, H is the dialogue history, Q is the user utterance in the current turn, θ denotes the model parameters, and n is the length of the response. Here, ";" indicates a concatenation operation between two texts. Depending on the type of questions that the user may ask, the generation task is split into two major categories: 1) Abstractive Summarization and 2) Informal Talk. We design separate prompt templates for both categories to handle various types of user utterances. We provide a brief description of both categories below.
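The concatenation of prompt P, history H and current utterance Q into a single model input can be sketched as follows; the helper name and exact layout are illustrative assumptions:

```python
# Illustrative assembly of the LLM input from the prompt template P, the
# dialogue history H (adjacent user-system pairs) and the current utterance Q.

def build_input(prompt: str, history: list, utterance: str) -> str:
    turns = "\n".join(f"User: {u}\nSystem: {s}" for u, s in history)
    return f"{prompt}\n{turns}\nUser: {utterance}\nSystem:"

P = "Task: Answer questions about the car given the following context and dialog."
H = [("What is the High Beam Assistant?",
      "It automatically switches the high beam on and off.")]
print(build_input(P, H, "How do I activate it?"))
```

The assembled string ends with "System:", so the model's continuation is precisely the next system turn being scored by the factorized probability.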
i. Abstractive Summarization: We design a prompt template to handle information-seeking user utterances that can be answered from the semantic search results, where the template aims to generate the answer as a natural sentence. The abstractive summarization template is as follows:

Task: Answer questions about the car given the following context and dialog. Answer always helpful. Answer in complete sentences. Don't use more than two sentences. Extract the answer always from the context as literally as possible.
Dialogue 1: {example dialogue 1}
[...]
Dialogue 6:
Context: {top paragraphs, dialogue history}
User: {user utterance}
System:

where example dialogue 1 is a variable that represents a complete multi-turn conversation. Each example dialogue may contain 1 to 5 user-system utterance pairs. The variables top paragraphs and dialogue history represent the top-3 paragraphs from the semantic search results and the complete dialogue history (adjacent user-system pairs), respectively. Furthermore, user utterance indicates the current user utterance that the system needs to answer.
ii. Informal Talk: A conversational AI system not only deals with information-seeking utterances but also needs to handle follow-up questions, clarifications, commands, etc., which makes the conversation engaging and natural. To tackle these various forms of user utterances we design an Informal Talk template as follows:

Task: Answer the user feedback in a friendly and positive way. When asked about factual knowledge or about your opinion, just say that you can't answer these questions. Please never answer a question with a factual statement. If a question is about something else than the car, you may append a 'Please ask me something about the car'.
Dialogue 1: {example dialogue 1}
[...]
Dialogue 20:
User: {user utterance}
System:

In the Informal Talk template we provide 20 example dialogues covering various forms of user utterances. This way, both the abstractive summarization and informal talk templates leverage a pre-trained large language model in a few-shot manner to generate natural and engaging dialogues. The prompt templates are stored in the Prompt Template Store.

Answer Extraction
In CarExpert, we investigate two different answer extraction methods: i. Machine Reading Comprehension (MRC) Reader: Given a user utterance and a document, the task of an MRC Reader model is to predict a continuous text span from the provided document that answers the user question. We fine-tune an Albert (Lan et al., 2020) model for the answer extraction task.
ii. LLM-based Reader: Engineering prompts is a popular way to instruct LLMs to leverage their knowledge for downstream NLP tasks. In this approach, we leverage the pre-trained knowledge of LLMs, contained in their parameters, to perform the same answer extraction task as the MRC Reader. However, in this case CarExpert does not need training data to perform the answer extraction. Specifically, we design a prompt that instructs the LLM to extract the answer as literally as possible using both the question and the top-3 paragraphs from the semantic search results. The prompt template is as follows: Task: Given the following question and paragraphs, extract exactly one continuous answer span from only one of the paragraphs.
Question: {user utterance}
Paragraphs: {paragraphs}
Answer:

During inference, the variables user utterance and paragraphs are replaced with the actual user utterance and the top three paragraphs retrieved from the semantic search.
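Because the prompt requests a literal span, a lightweight post-check can verify that the model output actually occurs in one of the retrieved paragraphs. This check is an illustrative addition, not a component described in the paper:

```python
# Verify that an answer claimed to be extractive actually occurs literally
# (up to whitespace and case) in one of the retrieved paragraphs.

def is_literal_span(answer: str, paragraphs: list) -> bool:
    norm = lambda t: " ".join(t.lower().split())
    return any(norm(answer) in norm(p) for p in paragraphs)

paragraphs = [
    "The system is activated automatically when the car is started.",
    "Park Assist warns of obstacles to the side of the vehicle.",
]
print(is_literal_span("activated automatically when the car", paragraphs))  # True
print(is_literal_span("activated manually", paragraphs))                    # False
```

A failed check signals that the LLM paraphrased or hallucinated instead of extracting, so the candidate can be discarded before moderation.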

Answer Moderation
An Answer Moderator component selects the best answer given the user utterance and the candidate answers (extractive and generative). We investigate the following two techniques for answer moderation.
i. Cosine Similarity: This approach measures the semantic similarity between the user utterance and each candidate system response. The answer with the higher similarity score is selected as the system response. Formally, the answer selection can be defined as:

$\hat{a} = \arg\max_{\vec{a} \in \{\vec{a}_{ex}, \vec{a}_{g}\}} \cos(\vec{a}, \vec{Q})$

where $\vec{a}_{ex}$, $\vec{a}_{g}$, and $\vec{Q}$ are the embedding representations of the extracted answer, the generated answer and the user utterance, respectively.
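This selection step can be sketched as follows; the embedding vectors here are toy stand-ins for real sentence embeddings:

```python
import numpy as np

# Sketch of cosine-similarity answer moderation. The 2-d embedding vectors
# are toy stand-ins for real sentence embeddings.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def moderate(q_vec, extracted, generated):
    """Return the candidate answer whose embedding is closer to the question."""
    (text_ex, vec_ex), (text_g, vec_g) = extracted, generated
    return text_ex if cosine(vec_ex, q_vec) >= cosine(vec_g, q_vec) else text_g

q_vec = np.array([1.0, 0.0])
extracted = ("extracted span", np.array([0.9, 0.1]))
generated = ("generated sentence", np.array([0.5, 0.5]))
print(moderate(q_vec, extracted, generated))  # the extracted span is closer
```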
ii. Extraction Score: This is a weighted Levenshtein distance-based heuristic that measures how syntactically close the system response is to the retrieved paragraphs. Formally, the Extraction Score (ES) can be defined as:

$\mathrm{ES}(x) = \min_{1 \le i \le n} \operatorname{dist}(x, y_i)$

where x is the generated answer, $y_i$ is the i-th paragraph and n is the number of paragraphs. The cost of each edit operation is computed by dist(·). This moderation technique allows CarExpert to generate a controlled and document-grounded answer by (i) grounding the system response in the retrieved documents, and (ii) filtering out incorrect and hallucinated responses. More details on the edit operations can be found in Appendix A.5.
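A minimal sketch of this heuristic, assuming token-level operations, illustrative per-operation costs (the paper's actual costs are listed in its Table 8), and a minimum over the retrieved paragraphs:

```python
# Sketch of the Extraction Score. Per-operation costs are illustrative;
# deleting a reference token is penalized more heavily than inserting.

def weighted_edit_distance(x, y, ins=1.0, dele=2.0, sub=1.5):
    """Weighted Levenshtein distance between token sequences x and y."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * ins
    for j in range(1, n + 1):
        d[0][j] = j * dele
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j] + ins,      # extra answer token
                              d[i][j - 1] + dele,     # missed reference token
                              d[i - 1][j - 1] + sub)  # substituted token
    return d[m][n]

def extraction_score(answer: str, paragraphs) -> float:
    """Distance of the answer to its closest retrieved paragraph (lower = more grounded)."""
    x = answer.lower().split()
    return min(weighted_edit_distance(x, p.lower().split()) for p in paragraphs)

paras = ["the system is activated automatically when the car is started"]
grounded = extraction_score("activated automatically when the car is started", paras)
free = extraction_score("you must press the red button twice", paras)
print(grounded < free)  # grounded answers sit closer to the retrieved text
```

An answer copied verbatim from a paragraph scores near zero, while a freely generated answer accumulates substitution and deletion costs, which is what lets the moderator filter hallucinated responses.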

Experimental Setup
Data: The reader and retriever models in CarExpert are fine-tuned and evaluated on car-specific data from various sources (owners' manuals, self-service FAQs, car configurator feature descriptions and press club publications).
Metrics: To measure the performance of the Retriever we use Mean Reciprocal Rank (MRR@3). For evaluating the extractive Reader, we utilize token-level metrics, such as F1-Score and Exact Match (EM). Furthermore, we employ Cosine Similarity and METEOR (Banerjee and Lavie, 2005) to capture the similarity of the generated answer against the reference response. Further details of the datasets, hyper-parameter settings, and metrics can be found in Appendix A.1, A.3 and A.4, respectively.

Experiments and Results
We conduct both qualitative and quantitative experiments to assess different parts contributing to the overall performance of CarExpert.

Quantitative Analysis
Table 2 and Table 3 demonstrate that the fine-tuned DPR and fine-tuned Reader perform better than the baseline models in the corresponding tasks.
The performance improvement may be attributed to their inherent capability of effectively learning and capturing the distribution and characteristics of the training data. In Table 2, we notice that the fine-tuned DPR outperforms the fine-tuned Sentence-transformer. The fine-tuned DPR model performs best in terms of MRR@1, and hence we integrate DPR as the retriever used for semantic search in CarExpert.
From Table 4 we observe that GPT-3.5-turbo performs better than the Luminous-extended model, since the former is a larger model and hence offers better representations and generalization. CarExpert incorporates the Extraction Score-based heuristic for answer moderation; the Extraction Score technique is described in Appendix A.5.

Qualitative Analysis
Table 1 demonstrates a qualitative comparison of answer generation between CarExpert (with documents) and GPT-3.5-turbo (with and without documents). When documents are provided, we instruct both models to answer from them. In the first case, without any documents GPT-3.5-turbo could not answer the question, whereas with the documents it generated a very long answer. Furthermore, its answer refers to a specific paragraph ("..The first paragraph mentions.."), which is irrelevant to the user. CarExpert in this case correctly generated the expected answer. In the second case, we asked the system how to mount a child seat. Off-the-shelf GPT-3.5-turbo generated a generic answer from its pre-trained knowledge, which includes unnecessary detail such as "..Read the instruction.." and is not tailored to the target car brand. With documents, GPT-3.5-turbo generated a better answer, but it still includes additional irrelevant and lengthy details which are not suitable for in-car CQA (right column, 3rd row: items 1, 6 and 7). Overall, in both cases CarExpert exhibits more precise answer prediction than off-the-shelf GPT-3.5-turbo with and without documents. Although CarExpert leverages GPT-3.5-turbo for answer generation, carefully designed prompts helped the system generate precise answers, which are suitable for real-time use in the car, where the user may find an unnecessarily detailed answer (as generated by GPT-3.5-turbo) very exhausting. More lemon- and cherry-picked examples can be found in Appendix D.

CarExpert: To install the child restraint system in the vehicle, proceed as follows: 1. Mount the child restraint system in accordance with the instructions from the child seat manufacturer. 2. Make sure that the child restraint system anchors engage correctly on both sides in the lower seat mounts. 3. Pull out the seat belt strap completely. 4. Secure the child restraint system with the seat belt. 5. Allow the seat belt strap to be pulled in, then pull it tight against the child restraint system.

GPT-3.5-turbo (without documents): Mounting child seats properly is crucial for the safety of your child while traveling in a vehicle. The specific instructions for mounting a child seat may vary depending on the make and model of the seat and the vehicle. However, here is a general guide on how to mount child seats: 1. Read the instructions: Carefully read the instruction manual provided with your child seat. It will provide specific details about the installation process and any additional requirements. Choose the appropriate seat: [...]

GPT-3.5-turbo (with documents): To mount child seats, follow these steps: 1. Read and follow the instructions provided by the child seat manufacturer. 2. Ensure that the child restraint system anchors engage correctly on both sides in the lower seat mounts. 3. Pull out the seat belt strap compl. [...] 6. Refer strictly to the installation instructions supplied with the child restraint system for specific guidelines. 7. Each rear seat position is equipped with a head restraint. [...]
Table 1: Dialog examples with answers of CarExpert compared to stand-alone GPT-3.5-turbo and GPT-3.5-turbo with relevant paragraphs provided through prompting. Longer texts are truncated and replaced by [...] for demonstration purposes.

Reader F1 EM
Pre-trained Albert-large 0.31 0.01
Fine-tuned Albert-large 0.60 0.21
GPT-3.5-turbo 0.51 0.14
Luminous-extended 0.36 0.01

Related Work
Recent LLMs such as LLaMA (Touvron et al., 2023a) and GPT-4 (OpenAI, 2023) are capable of performing complex downstream tasks without being trained for those tasks. A different line of recent research focuses on controlling the behaviour of LLMs, such as NeMo-Guardrails. Inspired by the human capability of following instructions in natural language, recent research works fine-tuned LLMs so that they can understand instructions in zero-shot or few-shot settings and perform a given task following the language instruction (Wei et al., 2022; Taori et al., 2023; Brown et al., 2020; Rony et al., 2022a; Schick and Schütze, 2021; Prasad et al., 2023). In CarExpert, prompt-guided LLMs are employed to control various tasks of the answer generation process.

Conclusion
We have introduced CarExpert, a new and controlled in-car conversational question-answering system powered by LLMs. Specifically, CarExpert employs semantic search to restrict the system-generated answer to the car domain and incorporates LLMs to predict natural, controlled and safe answers. Furthermore, to tackle hallucinated answers, CarExpert uses an Extraction Score-based Answer Moderator. We anticipate that the proposed approach is not only applicable to in-car question answering but can also be easily extended and adapted to other domain-specific settings. In future, we plan to integrate multi-task models to handle multiple tasks using a single LLM and reduce error propagation in the system.

Limitations
While our modular framework offers considerable flexibility in employing diverse models and aligning them with specific tasks and objectives, it comes with a few challenges as well. One major drawback is the difficulty of jointly optimizing and fine-tuning the individual components toward a common objective. When optimized independently, each module may overfit to certain tasks and subsequently propagate errors due to intricate interactions, ultimately impacting the overall system performance. Furthermore, given our reliance on LLMs, occasional hallucinations may occur despite our efforts to maintain control. Moreover, our system may struggle with handling highly complex and ambiguous queries, potentially requiring external resolution modules. In future, we intend to tackle the existing issues to develop a more robust conversational question answering system.

A.1 Datasets
[...] with 2 turns, 33% with 4 turns and 33% with 6 turns), curated from 40 different paragraphs from a randomly sampled document collection. We ensured that at least one dialog is crafted for every paragraph in this evaluation set. The human-annotation process for collecting these data is described in Section C.

A.1.1 Data Processing Pipeline
The data processing pipeline in CarExpert takes data in various formats (such as unstructured text, PDF, Excel, CSV, XML) and transforms them into the SQuAD (Rajpurkar et al., 2016) format, a widely used question answering dataset format. The paragraphs in the SQuAD format are then converted into vectors obtained from the Sentence-transformer and stored in a vector database to facilitate quick semantic search (retrieval) given a user query.
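The transformation into SQuAD-style JSON can be sketched as follows; the field names follow the public SQuAD v1.1 schema, while the input records and helper name are illustrative:

```python
import json

# Sketch of converting parsed question-answer records into SQuAD-style JSON.
# Field names follow the public SQuAD v1.1 schema; records are illustrative.

def to_squad(records, version="1.1"):
    data = []
    for r in records:
        qa = {"id": r["id"],
              "question": r["question"],
              "answers": [{"text": r["answer"],
                           "answer_start": r["context"].index(r["answer"])}]}
        data.append({"title": r.get("title", ""),
                     "paragraphs": [{"context": r["context"], "qas": [qa]}]})
    return {"version": version, "data": data}

records = [{
    "id": "q1",
    "title": "High Beam Assistant",
    "question": "When is the system activated?",
    "context": "The system is activated automatically when the car is started.",
    "answer": "activated automatically",
}]
print(json.dumps(to_squad(records))[:60])
```

Storing the answer as a character offset into the context (answer_start) is what makes the format directly usable for training an extractive reader.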

A.3 Hyper-parameter Settings
We describe the hyper-parameters used in different components of the CarExpert below.
Retriever: We fine-tune the DPR model employing facebook/dpr-question_encoder-multiset-base as the query encoder and facebook/dpr-ctx_encoder-multiset-base as the paragraph encoder. We trained for 10 epochs with a batch size of 8, 6 warm-up steps, and one hard negative sample per data point. We further fine-tuned the Sentence-transformer model all-MiniLM-L6-v2 with a batch size of 16 for 1 epoch, combining the objectives of reducing the Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) losses.
Reader: As the reader model, we fine-tuned Albert-large (Lan et al., 2020). For the LLM-based reader, we used the GPT-3.5-turbo and Luminous-extended models. In both cases, we set a temperature of 0 to facilitate deterministic text generation, as well as a presence penalty of 0, a top-p sampling rate of 0 and a repetition penalty of 1.
Generator: For the LLM-based answer generation, we use GPT-3.5-turbo and Luminous-extended with a temperature of 0.8, a top-p sampling rate of 0.4, a repetition penalty of 1 and a presence penalty of 0.6. These settings allow for more flexible answer generation, in contrast to the LLM-based reader.

A.4 Metrics
For quantitative evaluation of the system components and the system as a whole, we relied upon the following metrics.
Retriever: (i) Mean Reciprocal Rank (MRR) for the top-3 paragraphs calculates the average reciprocal rank of the first relevant document across multiple queries.The focus is on the rank of the first relevant document.
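MRR@k can be sketched as follows; the binary relevance judgments are illustrative:

```python
# Sketch of Mean Reciprocal Rank at k. For each query we record, in retrieval
# order, whether each returned document is relevant; ranks beyond k score 0.

def mrr_at_k(relevance_lists, k=3):
    total = 0.0
    for relevance in relevance_lists:
        for rank, is_relevant in enumerate(relevance[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_lists)

# Query 1: first relevant document at rank 2; query 2: at rank 1;
# query 3: no relevant document in the top-3.
queries = [[False, True, False], [True, False, False], [False, False, False]]
print(mrr_at_k(queries, k=3))  # (1/2 + 1 + 0) / 3 = 0.5
```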
Reader: (i) F1-Score considers both precision (how many predicted words are correct) and recall (how many correct words are predicted). (ii) Exact Match (EM) measures the percentage of predicted answers that exactly match the ground-truth answers. It is a strict metric that demands the model response be identical to the ground truth.
Generator: (i) Cosine Similarity between the system response and the human-annotated response. (ii) METEOR (Banerjee and Lavie, 2005) provides a single score reflecting the overall quality and fluency of the generated response against the human-annotated response.
Answer Moderator: (i) Accuracy of correctly yielding the extracted or the generated response, as annotated by the human annotators.
System as a whole: (i) Cosine Similarity between the final system response and the expected system response. (ii) Component Contributions, revealing whether the system yields more extractive or generative responses.
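The token-level Reader metrics above can be sketched as follows, with normalization simplified to lowercasing and whitespace splitting:

```python
from collections import Counter

# Token-level Exact Match and F1 for the Reader, with normalization
# simplified to lowercasing and whitespace splitting.

def exact_match(pred: str, gold: str) -> bool:
    return pred.lower().split() == gold.lower().split()

def f1_score(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)       # token multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Activated automatically", "activated automatically"))  # True
print(round(f1_score("activated automatically when started",
                     "the system is activated automatically"), 3))        # 0.444
```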

A.5 Answer Moderator
Edit Operations in the Extraction Score: Table 8 demonstrates the edit operation costs used in the Extraction Score. Note that when the system deletes any reference token, it receives the maximum penalty. Finally, the distance is normalized to a consistent scale using the maximum absolute value.

B.1 Retriever
We performed an extensive ablation study on different types of retriever (sparse, static, contextual, and hybrid) on both in-house and human-annotated evaluation datasets.
The retriever scores of the traditional BM25 and the static models are, as expected, significantly lower than those of the remaining candidates. We observe that our datasets are reasonably hard for retrievers which rely on just the frequencies of, or associations between, query-document pairs, essentially failing to yield meaningful contextual representations. The fine-tuned DPR performs best on the human-annotated evaluation set, while the fine-tuned Sentence-transformer model performs best on the in-house test set. It is also worth noting that the off-the-shelf SPLADE model performs almost as well as the fine-tuned contextual models. This could be attributed to how hybrid models are trained to combine the best of both worlds from the sparse and dense representations.

B.2 System as a whole
Table 10 demonstrates the experimental results of CarExpert with various system configurations. The component-wise evaluation presented earlier (Tables 2 through 5) motivated us to conduct this elaborate study, scoped to (i) the fine-tuned DPR and fine-tuned Sentence-transformer models as Retriever, (ii) the fine-tuned Reader and a GPT-3.5-turbo-based Reader, (iii) GPT-3.5-turbo as the Generator, and (iv) both answer moderation techniques.
It is evident from the results that the Extraction Score-based Answer Moderator prefers extractive responses over generative responses more often than its Cosine Similarity-based counterpart. For instance, the configurations C01 and C03 differ only in the Answer Moderator, yet the contribution of extractive responses increases significantly, from 23% to 52%. This moderation technique helps our model stay controllable regardless of the nature of the user utterances. The highest share of extractive responses is obtained with C03.
We also observe how different retriever models affect the overall system response. For instance, the configurations C04 and C08 differ only in the retriever used, yet show a significant difference in the similarity between the system response and the reference response. In future, we intend to explore other sophisticated metrics that measure more nuanced aspects of language generation. In addition, we hypothesize that the cosine-similarity-based system evaluation might be biased towards the cosine-similarity-based arbitration method, as they may be measuring similar aspects of response similarity.

D.1 Cherry-picked Example
[...] considered to be high-quality responses generated by the system. Furthermore, Table 6 illustrates a complete conversation between CarExpert and a user in a real-life in-car setting.

D.2 Lemon-picked Example
Refer to Table 12 for a selection of lemon-picked example question answer pairs.

E Error Analysis
Table 14 and Table 15 include the cases where the system failed, along with the most likely error source. Note that the modular architecture of our system helps us make a well-educated identification of the erroneous component. We conduct the error analysis by comparing our system with GPT-3.5-turbo and Luminous-extended. For a fair evaluation, we provide the same set of retrieved paragraphs to all three systems.

E.1 Helpfulness vs Harmlessness trade-off
This type of query poses a trade-off between providing helpful and potentially harmful information. For example, with the query in Table 14, "How can I disable the safety feature that prevents the engine from starting automatically in my car?", the user requests information to gain more control of the system. At the same time, the information might be dangerous if not handled responsibly. Balancing such a request against the importance of safety is therefore crucial. Furthermore, in Table 14, all three CQA systems try to be helpful and promptly provide an appropriate answer to the user. While desirable, it is important for the system to warn the driver of potential risks like engine damage, legal violations, or compromised safety.

E.2 Hallucination
Hallucinations by LLMs are generated responses that may comprise misleading, factually incorrect, or fictional information, which may seem very plausible and linguistically correct to humans. Despite the efforts to minimize hallucination through a controlled-architecture pipeline, our evaluation points at instances of hallucination, as illustrated in Table 15. The table demonstrates an example where both CarExpert and Luminous-extended generate hallucinated responses by relying on the retrieved paragraphs. Even though the GPT-3.5-turbo response seems better, it also hallucinates due to the limited information found on battery health. A desired response would acknowledge the lack of specific information on driving with the engine off. This observation suggests that the retriever component sometimes retrieves paragraphs with incomplete information, leading to error propagation.

The car is equipped with standard 20-inch aerodynamically optimized light-alloy wheels. 21-inch and 22-inch Air Performance wheels are optional.

The following sizes are recommended and approved by the vehicle manufacturer for the approved wheels and tires per vehicle type and special equipment: wheel and tire combinations, rim designs, tire sizes, and tire brands. You can ask an authorized service center or another qualified service center or repair shop about the approved wheels and tires for the vehicle and the special equipment. For each tire size, the manufacturer of the vehicle recommends certain tire brands.

How can I avoid parking damage? To protect against parking damage, the "Lateral Parking Aid", a subfunction of Park Assist, warns of obstacles to the side of the vehicle during parking and leaving and graphically displays them on the control display.

Park the vehicle as far away as possible from passing traffic and on solid ground. You can exit the car while driving on the highway by pressing the button.
Table 13: Example erroneous cases. The answer selected by the Answer Moderator is highlighted in yellow.

Figure 1: Illustration of a multi-turn in-car conversation between a user (in gray) and CarExpert (in blue).

Figure 2: High-level overview of the CarExpert system architecture.

Figure 3: Semantic search during inference (the vector space is depicted as a vector database for demonstration). The potential answer to the question is encapsulated in the box of retrieved document A.

Table 5 exhibits that the Extraction Score does a better job in moderating and selecting the best answer, which aligns better with the retrieved documents.

1 https://openai.com/
2 https://www.aleph-alpha.com/

Table 2: Performance comparison of retriever models.

Table 3: Evaluation results on the Reader module.

Table 4: Performance of LLM-based Generator models.

Table 5: Performance of Answer Moderator approaches.

Table 8: Insertion costs (INS), deletion costs (DEL) and substitution costs (SUB) for different types of tokens.

Table 11: Example cherry-picked question-answer pairs. The answer selected by the Answer Moderator is highlighted in yellow.

Table 12: Example lemon-picked question-answer pairs. The answer selected by the Answer Moderator is highlighted in yellow.