Why chatbots cannot learn directly from human conversations

From the series: A practical approach between myth and reality

In the previous article, we presented two ways of categorizing conversational agents, more widely known as chatbots, along with their advantages, limitations, and use case scenarios. In this post, we focus on how chatbots infer knowledge from human conversations through a basic rule-based system, machine learning, and natural language processing, all of which play a crucial role in automating the request handling process.

A few months ago, we took on the challenge of building a chatbot that could integrate seamlessly into a customer support platform and be applicable to various industries. At Tremend, we developed a chatbot prototype for a major telecom client. The primary scope of the project was to create a chatbot of consistent quality that could minimize human labor on tedious or repetitive tasks, which are usually time-consuming and hard to scale over brief periods of time and on an ad hoc basis.

Therefore, the goal was to prototype a chatbot that was both closed-domain (i.e. applied to the knowledge of a specific field of expertise) and task-oriented (handling at least part of the complete solution to a user’s requests).

In the following paragraphs, I am going to tackle a few assumptions about the bot’s characteristics and behavior that seem reasonable to learn from a human perspective, but that are difficult, if not impossible, to model in a machine learning environment.

Myth: Once it has processed real instances of human-to-human conversations, the bot will correctly infer the knowledge from them.

We dealt with a relatively limited dataset of approximately five hundred recorded conversations between an end-user and a human customer service representative. These conversations covered some of the main customer issues and complaints, with all the data acquired and corroborated over a period of three months. However, the same challenges and limitations apply to chatbots whose development and training go beyond the three-month period and the five hundred real-life conversations we tested.

The initial hypothesis: an end-to-end system should be easily attainable for representative conversations.

The first idea that springs to mind is to train a sequence-to-sequence model over the word representations of each conversation, where the encoder models the user query sequence and the decoder learns the answer given by the human representative.

1. Natural language diversity

As someone with experience in what this type of model may predict, I argue that the results were the opposite of spectacular, for several reasons. For instance, the conversation dataset was too small, and although it was industry-specific and the categories of predominant issues were limited (12 categories with more than 10 instances each), there were multiple valid natural language utterances expressing the same idea. It was therefore very challenging for the model to capture the full set of information exchanged between the two interlocutors, even if the individual word representations were fairly good.

It is worth mentioning that the embeddings we employed for word representations were fastText vectors pre-trained on a larger corpus for the Romanian language. However, some industry-specific terms carry a meaning that differs from their everyday, denotative one, and that domain-specific sense is harder to learn.

2. Semantic challenges

Certain elements of an online conversation are meant to provide additional information. They are relatively easy for humans to comprehend, but difficult for machines to process. These include, but are not limited to, typographical errors, sentence corrections, suggested replacements for misspelled words, and extra information that completes a previous sentence. All of these may appear not only in the customer’s utterances (which the chatbot has to understand) but also in the expert’s utterances (which the chatbot has to learn correctly).

To illustrate the challenge that our Tremend team was seeking to overcome, here is a realistic scenario. In our data, we may find conversations such as:

“C: Hello! I have no recption in the area where I live.

C: recetpion

C: reception, sorry

S: No problem. Can you confirm your address, please? …”

If we decide that the encoder sequence for our model consists of the concatenation of the user’s consecutive utterances, then after some preprocessing it would look as follows: “Hello ! I have no recption in the area where I live . recetpion reception , sorry”. Such a construction barely makes sense even to a human reader, let alone to a model.
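The concatenation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(speaker, text)` transcript format, the regular-expression tokenizer, and the helper names `merge_turns` and `to_training_pairs` are hypothetical and are not the preprocessing used in the actual prototype.

```python
import re
from itertools import groupby

# Hypothetical transcript format: (speaker, text) pairs, "C" = customer, "S" = support.
conversation = [
    ("C", "Hello! I have no recption in the area where I live."),
    ("C", "recetpion"),
    ("C", "reception, sorry"),
    ("S", "No problem. Can you confirm your address, please?"),
]

def tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def merge_turns(turns):
    """Concatenate consecutive utterances by the same speaker into one token string."""
    merged = []
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        tokens = [tok for _, text in group for tok in tokenize(text)]
        merged.append((speaker, " ".join(tokens)))
    return merged

def to_training_pairs(turns):
    """Pair each merged customer sequence with the support answer that follows it."""
    merged = merge_turns(turns)
    return [
        (source, target)
        for (s1, source), (s2, target) in zip(merged, merged[1:])
        if s1 == "C" and s2 == "S"
    ]

pairs = to_training_pairs(conversation)
encoder_input, decoder_target = pairs[0]
print(encoder_input)
# Hello ! I have no recption in the area where I live . recetpion reception , sorry
```

The typo and its two corrections all end up in the same encoder sequence, so the model sees three competing spellings with no signal about which one the customer meant.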

3. Particular instances

If the model learns the training instances so well that it overfits the representative’s side of the conversation, or if answers that include politeness phrases (here: “No problem.”) occur often in our dataset, a situation like the following may arise in a live conversation:

“C: Hello! I am experiencing some issues with the mobile phone signal.

S: No problem. Can you confirm your address, please? …”

Naturally, we can encounter even greater challenges and more problematic scenarios, all stemming from the diversity of natural language utterances and from the fact that, for anything underrepresented in the conversation dataset, a chatbot can draw inferences in a different direction than the expected one.

For instance, the chatbot can erroneously observe and assimilate language patterns from recorded conversations between a customer and a human representative and employ these patterns inaccurately in its own dialogue with the customer. In other words, these challenges arise from the underlying diversity of natural language among humans, insofar as certain phrases and utterances are underrepresented in the conversation dataset used for training the chatbot.

4. Linguistic accuracy

On the other hand, all these errors, minor or more significant, could be avoided by generating multiple answers and selecting the best one. This could be achieved by developing and implementing a reliable method for measuring the correctness of an answer, both from a lexico-syntactic perspective and, more importantly, from a business and industry perspective. I dare to argue that no one would accept a recommendation that suggests throwing their phone across the room to fix their connectivity issue.
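To make the generate-and-select idea concrete, here is a minimal sketch of ranking candidate answers. Everything in it is an assumption made for illustration: the candidate list stands in for the outputs of a generative model, the approved and forbidden action lists are invented, and the two scoring functions are crude placeholders for the reliable correctness measure discussed above, which is precisely the hard part.

```python
# Hypothetical candidate answers, standing in for the outputs of a generative model.
candidates = [
    "Throw your phone across the room.",
    "please restart phone the your",
    "Please restart your phone and check the signal again.",
]

# Assumed business rules: actions the company approves of, and ones it never recommends.
APPROVED_ACTIONS = {"restart", "confirm", "check"}
FORBIDDEN_ACTIONS = {"throw", "break"}

def words(answer):
    """Lowercase the answer and strip basic punctuation from each word."""
    return [w.strip(".,!?").lower() for w in answer.split()]

def form_score(answer):
    """Very rough surface check: capitalised start and final punctuation."""
    starts_ok = answer[:1].isupper()
    ends_ok = answer.rstrip()[-1:] in ".?!"
    return (starts_ok + ends_ok) / 2

def business_score(answer):
    """1.0 if an approved action is present and no forbidden one, else 0.0."""
    tokens = set(words(answer))
    if tokens & FORBIDDEN_ACTIONS:
        return 0.0
    return 1.0 if tokens & APPROVED_ACTIONS else 0.0

def score(answer, business_weight=0.7):
    """Weighted sum in which the business check counts more than the surface form."""
    return business_weight * business_score(answer) + (1 - business_weight) * form_score(answer)

best = max(candidates, key=score)
print(best)
```

Note that the surface check cannot detect the scrambled word order of the second candidate; it is penalised only for its missing capital letter and final punctuation. A measure that captures real syntactic and business correctness is much harder to build than this sketch suggests.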

Other language-related matters include the concision, coherence, and cohesion of the answer(s), which are also hard to predict or to measure with human-acceptable accuracy. On the one hand, conciseness calls for a short answer that still contains enough detail to be easily followed by a non-technical person. On the other hand, the current answer should not only make sense when interpreted along with the previous exchange of utterances, ensuring a cohesive conversation, but also continue the idea of the previous sentence or sentences in a natural way, without creating an incoherent, disruptive dialogue.

Let’s take a closer look at an output example that shows the importance of these natural language qualities:

“C: Hello! I have no reception in the area where I live.

S: Hello! How may I help you today? Please tell me about your problem briefly below.

C: I have bad reception in certain areas.

S: In order to activate your new SIM card, you need to follow these instructions: … 

Please let me know how long ago you acquired your SIM card.

C: ?!”

The above-mentioned key points are just some of the challenges that prevent such end-to-end models from behaving ideally in most real-life scenarios. In a business environment, where the accuracy and speed of the information provided are crucial, we cannot rely on a generative component for the answering part, whether it comes from a sequence-to-sequence ensemble or from other models.

Solution by Tremend

Here at TremendLabs, we found that it was not feasible for a model to accurately infer knowledge directly from previous conversations fed to an end-to-end system, regardless of how well-structured the data were beforehand. The scenarios presented in this article support this argument.

Instead, a task-oriented, closed-domain chatbot such as the one for customer support should always be governed, at least to some extent, by a set of rules and instructions so as to ensure greater control over the quality of the information provided when opening a dialogue with a customer.

The benefits are indisputable

Using rules as a guidance schema is a surefire way to ensure that the same piece of information reaches each customer unaltered. If something changes within the infrastructure or in the way the company decides to handle a specific request, it will not take long for the modifications to be applied. No human specialist needs to handle conversations in order to produce initial training data and, moreover, there is no long learning period before the new information takes effect. Once again, all clients benefit from the new, updated regulation.
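A minimal sketch of what such a rule layer might look like, assuming the request category (intent) has already been detected. The intent names, the fallback text, and the `answer` helper are hypothetical; the answer texts are borrowed from the example dialogues in this article.

```python
# Hypothetical rule table: one approved answer per request category.
RULES = {
    "no_reception": "Can you confirm your address, please?",
    "sim_activation": "In order to activate your new SIM card, you need to follow these instructions: ...",
}
# Assumed fallback for requests that no rule covers.
FALLBACK = "I will connect you with a human colleague."

def answer(intent, rules=RULES):
    """Return the approved answer for a detected intent, or the fallback."""
    return rules.get(intent, FALLBACK)

# Every customer with the same intent receives exactly the same text.
first = answer("no_reception")
second = answer("no_reception")

# A policy change is a one-line edit: no new training data, no learning period.
RULES["no_reception"] = "Can you confirm your address and postcode, please?"
updated = answer("no_reception")

print(first == second)  # True
print(updated)          # Can you confirm your address and postcode, please?
```

The trade-off is that the rule table only covers what someone has written down, which is why the detection of the intent itself is left to learned components.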

Machine learning remains the key component

We applied artificial intelligence models to several areas while prototyping our chatbot: classifying the initial conversations, extracting similar classes of requests, and classifying the utterances at each state of the conversation. We also used several neural network models to improve natural language understanding and, therefore, the overall reliability of the system.
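As a rough illustration of utterance classification, the sketch below assigns an utterance to the request category whose word profile it resembles most. It is deliberately simplistic and is not the approach used in the prototype: it swaps the neural models mentioned above for a bag-of-words cosine similarity so that it runs with the standard library alone, and the labelled utterances and category names are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled utterances; the real categories and data are not shown in the article.
TRAINING = [
    ("I have no reception in the area where I live", "no_reception"),
    ("I am experiencing some issues with the mobile phone signal", "no_reception"),
    ("bad reception in certain areas", "no_reception"),
    ("how do I activate my new SIM card", "sim_activation"),
    ("my new SIM card is not active yet", "sim_activation"),
]

def bag_of_words(text):
    """Count lowercase, whitespace-separated words."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(examples):
    """Sum the word counts of every example into one profile per class."""
    profiles = defaultdict(Counter)
    for text, label in examples:
        profiles[label].update(bag_of_words(text))
    return dict(profiles)

def classify(text, profiles):
    """Return the class whose word profile is most similar to the utterance."""
    query = bag_of_words(text)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))

profiles = train(TRAINING)
label = classify("there is no signal in my area", profiles)
print(label)  # no_reception
```

Because this version matches exact word forms only, a paraphrase that shares no vocabulary with the training utterances would be misclassified, which is one reason pre-trained embeddings and neural models are attractive for this step.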

In this article, we explained why a chatbot cannot easily infer and output human-acceptable answers for a known set of issues, even when it is fed sufficient training samples. In the next article in this series, we will tackle some of the other implementation challenges, which are easy to disregard at first glance but crucial for the overall performance of the chatbot.