Natural Language Understanding Benchmarking Report: Which conversational AI platforms are the best?

Written by Radha Yadav on Jan 26, 2022

A performance report comparing Google Dialogflow, IBM Watson, Microsoft LUIS, Netomi and RASA.

Business adoption of AI has accelerated tenfold as a result of the COVID-19 pandemic, and a handful of conversational AI platforms have emerged as the leaders in the space. We set out to see how Netomi’s Natural Language Understanding (NLU) performs against the most prominent conversational AI platforms: Google Dialogflow, IBM Watson, Microsoft LUIS and RASA. 

The report reveals significant disparity in AI performance, even amongst the most established platforms, signaling that even as the market matures, the end-user experience still varies greatly depending on the underlying AI platform a company deploys.

When we look holistically at conversational AI performance in the context of the end-user experience, customers are anywhere between 0.6X and 7.44X less frustrated when engaging with Netomi-powered bots compared to other AI platforms. This is because IBM Watson, Microsoft LUIS, RASA and Google Dialogflow are more likely to respond incorrectly to topics they have not been trained on. Furthermore, Netomi's AI is anywhere between 11.97 and 23.38 percentage points better at predicting the right course of action than the other leading AI platforms.

NLU Benchmarking Report: The Process 

Each platform was thoroughly tested with 14 datasets representing real-world customer service queries. We assessed each platform's ability to correctly handle use cases for businesses of all sizes, from small business to enterprise. The datasets spanned four core consumer-facing industries: retail, travel, gaming and telecommunications.


Download your copy of the NLU Benchmarking Report now for the full results.


NLU Benchmarking Report: The Results 

In our benchmarking report, we measured a few key metrics, including Accuracy, Out-of-Scope Accuracy and Balanced Accuracy. Here are the highlights of the report:

  • Accuracy: In conversational AI, accuracy is the proportion of correct replies out of all replies for topics the AI has been trained on. Higher accuracy means the AI is responding correctly, causing less user frustration than if the bot provided an incorrect or irrelevant response. In the Conversational AI Benchmarking Report, Netomi has the highest accuracy (85.17%), followed by IBM Watson (73.20%), Google Dialogflow (71.16%), RASA (68.56%) and Microsoft LUIS (61.79%).
  • Out-of-Scope Accuracy: Conversational AI with high out-of-scope accuracy decreases user frustration, as it understands which topics it is not trained on and follows the appropriate behavior (i.e. escalates a user query to a human agent or directs them to another channel). We found that Netomi has the highest out-of-scope accuracy at 92.45%, signaling that it causes the least amount of user frustration associated with incorrectly answering a question that it has not been trained on. IBM Watson performs second best at only 52.82%, followed by Google Dialogflow (36.45%), Microsoft LUIS (19.65%) and RASA (10.64%).
  • Balanced Accuracy: This metric accounts for both coverage rates and out-of-scope accuracy, making it a measure of the overall value delivered by a conversational AI agent (see the sketch after this list). When Balanced Accuracy is low, user frustration can be expected due to the bot's incorrect or irrelevant responses. For Balanced Accuracy, Netomi scores the highest at 68.46%, followed by IBM Watson (59.81%), Google Dialogflow (52.95%), RASA (40.13%) and Microsoft LUIS (39.52%).
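
For readers who want to sanity-check these definitions against their own test sets, here is a minimal Python sketch. It is not the report's methodology: the score_bot helper and its labeling scheme (an "out_of_scope" label for untrained topics) are hypothetical, and Balanced Accuracy is assumed here to be the simple mean of in-scope accuracy and out-of-scope accuracy, since the report does not publish its exact formula.

```python
from typing import Dict, List


def score_bot(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Hypothetical helper: compute the three benchmark metrics for a test set.

    Each query has a true label (its intent, or "out_of_scope" for topics the
    bot was never trained on) and a predicted label from the bot.
    """
    in_scope = [(p, y) for p, y in zip(predictions, labels) if y != "out_of_scope"]
    out_scope = [(p, y) for p, y in zip(predictions, labels) if y == "out_of_scope"]

    # Accuracy: share of correct replies on topics the bot was trained on.
    accuracy = sum(p == y for p, y in in_scope) / len(in_scope)

    # Out-of-scope accuracy: share of untrained topics the bot correctly
    # flags as out of scope (e.g. by escalating to a human agent).
    oos_accuracy = sum(p == y for p, y in out_scope) / len(out_scope)

    # Balanced accuracy: assumed here to weight both behaviors equally.
    balanced = (accuracy + oos_accuracy) / 2

    return {
        "accuracy": accuracy,
        "out_of_scope_accuracy": oos_accuracy,
        "balanced_accuracy": balanced,
    }


# Toy example: 3 in-scope queries (2 answered correctly) and 2 out-of-scope
# queries (1 correctly escalated) give roughly 66.7% / 50.0% / 58.3%.
print(score_bot(
    predictions=["refund", "shipping", "refund", "out_of_scope", "billing"],
    labels=["refund", "shipping", "billing", "out_of_scope", "out_of_scope"],
))
```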

Businesses are adopting conversational AI platforms to enhance the user experience, create efficiencies and reduce costs. To reach these goals, the underlying AI platform needs to accurately understand the user and recognize what it hasn't been trained on so it can take the right course of action. If not, users become frustrated and often reach out to companies on other channels, driving up costs. This study found that even amongst the most prominent conversational AI platforms, there is a huge disparity in the strength of the NLU. Companies need to understand which platforms will deliver the best end-user experience in order to take full advantage of the superpowers of AI.

To see all of the findings from the Conversational AI Benchmarking Report, download your copy today.