NLP Project: How to Build an Automated Question Answering Model from FAQs Using Word-embeddings

Sakshi Gupta | 4 minute read | August 22, 2021

Natural language processing (NLP) is a fast-growing field within machine learning and artificial intelligence. Simply put, it’s the process of teaching machines to read, understand and process human languages. NLP has hundreds of applications across search, spell check, auto-correct, chatbots, product recommendations and more. We recently conducted a hands-on session with Lavanya Tekumalla (founder of AiFonic Labs and mentor at Springboard India) in which she walked us through an automated question answering NLP project using FAQs from her website. Lavanya holds a PhD in machine learning from the Indian Institute of Science. She has also worked in the machine learning and data science teams of Amazon, Myntra, InMobi and more.

Start your First NLP Project with FAQs

Answering questions is a simple and common application of natural language processing. Most websites have a bank of frequently asked questions. An NLP algorithm can match a user’s query to your question bank and automatically present the most relevant answer. This is called ‘automated question answering’ and it is the NLP project we are going to implement today.

NLP Project Approach: How to Begin

In practice, the terms in user queries will not always match the questions in your FAQs bank. Your algorithm must be able to compare the user query sentence with the sentences/phrases in your FAQs. This is a sentence similarity comparison problem and here’s how to solve it.

  1. Convert the user query into a vector
  2. Convert FAQs into vectors
  3. Compare the two and measure cosine similarity
  4. Retrieve the best match result
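
As a rough illustration of steps 3 and 4, here is a minimal sketch in plain Python. The vectors are made-up three-dimensional toy values and the helper names (`cosine_similarity`, `best_match`) are hypothetical. In a real project the vectors would come from one of the methods described below.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query_vec, faq_vecs):
    """Return (index, score) of the FAQ vector most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in faq_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy, made-up vectors for illustration only.
faq_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
query_vec = [0.0, 2.0, 2.0]

index, score = best_match(query_vec, faq_vecs)
print(index, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why the query `[0, 2, 2]` matches the FAQ vector `[0, 1, 1]` perfectly here.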

In this blog post, we’ll build the automated question answering NLP project using four different methods: bag-of-words, Word2Vec using Skipgram, GloVe embeddings and BERT embeddings.

1. Pre-processing

The first step in this NLP project is getting the FAQs pre-processed. This includes getting all the questions and answers into a CSV file, filling in missing values, standardizing the data so that every entry follows a uniform format, etc.

Then, do the NLP-specific pre-processing:

  • Convert all sentences into lower case. This ensures that the meanings don’t change based on the case of the word. 
  • Remove non-alphanumeric characters.
  • Remove stop words: Frequently occurring words like conjunctions, prepositions, etc. don’t tell much about the underlying meaning of a sentence, so these can be removed. Just make sure you are careful about words like ‘not’, which can change the meaning of the sentence entirely.
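
The three NLP-specific steps above might look like the following minimal sketch. It assumes English text, and the stop-word list is a tiny hypothetical one (a real project would likely use a fuller list from an NLP library). Note that ‘not’ is deliberately kept.

```python
import re

# A tiny, hypothetical stop-word list; "not" is deliberately left out of it.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "i", "to", "of",
              "and", "or", "in", "on", "for", "my"}

def preprocess(sentence):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    lowered = sentence.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(preprocess("How do I reset my password?"))
print(preprocess("The refund is NOT available!"))
```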

2. NLP Project Using Bag-of-Words 

The bag-of-words model represents the text as a multiset of words. In essence, it’s exactly as the name suggests — it is a bag of individual words, without taking into account the word order or grammar. To perform retrieval using the bag-of-words in an NLP model, we typically follow these steps:

  • Break the dataset into individual words
  • Make a dictionary of the words
  • Print the dictionary
  • Create a bag-of-words corpus, from which you can make a sparse representation for each sentence
  • Get bag-of-words representation for a user query
  • Loop over all the question vectors created and perform a cosine similarity function with the query vector
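
The steps above can be sketched in plain Python as follows. This is a stand-in for illustration and not necessarily the library used in the session. The FAQ questions are hypothetical and assumed to be pre-processed already.

```python
import math
from collections import Counter

# Hypothetical, already pre-processed FAQ questions.
faq_questions = [
    "how reset password",
    "how cancel subscription",
    "which payment methods accepted",
]

# Break the dataset into words and build a dictionary (word -> integer id).
tokenized = [q.split() for q in faq_questions]
dictionary = {}
for tokens in tokenized:
    for tok in tokens:
        dictionary.setdefault(tok, len(dictionary))
print(dictionary)

def to_bow(tokens):
    """Sparse bag-of-words {word id: count}; words outside the dictionary are ignored."""
    return Counter(dictionary[t] for t in tokens if t in dictionary)

def cosine(a, b):
    """Cosine similarity between two sparse {id: count} vectors."""
    dot = sum(count * b[i] for i, count in a.items())  # a Counter returns 0 for missing ids
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Bag-of-words corpus, one sparse vector per FAQ question.
corpus = [to_bow(tokens) for tokens in tokenized]

# Bag-of-words vector for the user query, then compare against every question.
query = to_bow("reset password forgot".split())
scores = [cosine(query, doc) for doc in corpus]
best = max(range(len(scores)), key=scores.__getitem__)
print(faq_questions[best], round(scores[best], 3))
```

The word "forgot" is not in the dictionary, so it contributes nothing to the query vector. Only exact word overlap counts.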

When you use this model, you’ll be comparing the exact words in your query sentence and FAQs to identify similarities. This means that the bag-of-words model doesn’t capture underlying meaning or context, so it can return poor matches when a query expresses the same idea in different words.

3. Word2Vec Model Using Skipgram

The word2vec model is more advanced than the bag-of-words. There are two ways to do word2vec: continuous-bag-of-words (CBOW) and Skipgram. In the session, Lavanya explained the latter.

Skipgram learns a vector for each word by training a model to predict the words that surround it. To represent a sentence or phrase, you can add up (or average) its word vectors, which gives a somewhat meaningful vector for the whole text. These models are typically pre-trained on a large volume of text, so words with similar meanings end up with similar vectors. For instance, ‘require’ and ‘need’ will have high cosine similarity under word2vec. This way, word2vec models capture meaning much better than bag-of-words.
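
To make the idea concrete, here is a minimal sketch that averages word vectors into a sentence vector. The `toy_vectors` table is made up for illustration and stands in for real pre-trained embeddings, which typically have hundreds of dimensions. No actual word2vec model is trained or loaded here.

```python
import math

# Made-up 3-dimensional vectors standing in for pre-trained embeddings.
toy_vectors = {
    "need":     [0.9, 0.1, 0.0],
    "require":  [0.8, 0.2, 0.1],
    "password": [0.0, 0.9, 0.3],
    "invoice":  [0.1, 0.0, 0.9],
}

def sentence_vector(tokens):
    """Average the vectors of the known words; return None if no word is known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = sentence_vector(["need", "password"])
faq_a = sentence_vector(["require", "password"])
faq_b = sentence_vector(["require", "invoice"])
print(round(cosine(query, faq_a), 3), round(cosine(query, faq_b), 3))
```

Because ‘need’ and ‘require’ have similar toy vectors, the query scores higher against the first FAQ even though the two share only one exact word.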

4. GloVe Embeddings

This model is similar to word2vec but is trained differently. GloVe embeddings use matrix factorisation, a technique that was popular before deep neural networks came into the picture. It creates a word-to-word co-occurrence matrix, which records how frequently each word occurs with every other word, and uses factorisation methods to learn word vectors that capture context.
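
The sketch below shows only the counting step, on a hypothetical toy corpus with an assumed window of two words. The factorisation step that turns these counts into embeddings is not shown, and in practice you would load pre-trained GloVe vectors rather than build them yourself.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each pair of words appears within `window` positions of each other."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look only to the right, then record both directions so the matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Hypothetical toy corpus.
corpus = ["reset your password", "change your password", "reset your email"]
counts = cooccurrence_counts(corpus)
print(counts[("your", "password")], counts[("reset", "password")], counts[("reset", "email")])
```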

5. BERT Embeddings

Among the techniques discussed so far, this one is the most sophisticated. It leverages deep learning and is trained with a technique called masked language modeling, in which some words in a sentence are hidden and the model learns to predict them. This model can capture long-range dependencies in sentences, which helps it understand context more deeply.

Depending on the context in the sentence, it creates different vectors for the same word when its meaning differs. While using this model, it’s best not to remove stop words, as they are helpful in identifying dependencies. Thanks to its advanced ability to capture context and semantic similarity, it can handle queries for which Word2Vec might not be enough.

Other things to keep in mind while executing an NLP project

It is extremely important that you have your evaluation criteria set clearly to measure the accuracy rate of your models. This can be done by:

  • Getting labelled data: Typically, someone manually picks the most relevant question for each user query, which is then compared to those retrieved by the model.
  • Setting the criterion: Measure what percentage of queries are getting the right questions retrieved and work on it.
  • Having a baseline for the notion of confidence: Set a measure below which you are not confident that the closest answer retrieved is relevant. In this case, you may not present any response to the user query.
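
A minimal sketch of these three ideas together might look like this. The `evaluate` helper, the threshold of 0.5 and the sample data are all hypothetical. Here a query that falls below the confidence threshold is treated as unanswered and counted as a miss, which is one possible convention among several.

```python
def evaluate(predictions, labels, threshold=0.5):
    """Score retrieval results against hand-labelled answers.

    predictions: list of (faq_index, similarity_score), one per query.
    labels: the manually chosen faq_index for each query.
    Returns (accuracy over all queries, number of queries answered).
    """
    correct = 0
    answered = 0
    for (faq_index, score), label in zip(predictions, labels):
        if score < threshold:
            continue  # not confident enough: present no response
        answered += 1
        if faq_index == label:
            correct += 1
    return correct / len(labels), answered

# Hypothetical model outputs and hand-labelled answers for four queries.
predictions = [(0, 0.91), (2, 0.75), (1, 0.30), (3, 0.66)]
labels = [0, 1, 1, 3]

accuracy, answered = evaluate(predictions, labels)
print(accuracy, answered)
```

Running the same evaluation for each of the four methods gives a like-for-like comparison on your own FAQs.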

If you’re interested in receiving access to the stepwise analysis from this session, please fill out this form. For details of how to use each of the above models, watch Lavanya’s session on YouTube. This way, you can run your own NLP project and add it to your portfolio.

About Sakshi Gupta

Sakshi is a Senior Associate Editor at Springboard. She is a technology enthusiast who loves to read and write about emerging tech. She is a content marketer and has experience working in the Indian and US markets.