In the last blog post, Ebbot explained the training process that helps him respond correctly to your queries. As mentioned at the end of that post, every time Ebbot fails to understand you, he learns from the messages in the conversation to improve his performance. But did you know that not every sentence is useful for the learning process? There is information we do not want Ebbot to memorize, such as phone numbers, emails and spam messages (e.g. asdfda, wqrherewrere safdfa). That is why we decided to build a Machine Learning (ML) model that classifies messages as spam or not spam, so we can filter out unnecessary data. Keep reading to find out how we trained this spam classifier and how accurate it is!
Collecting and labeling the dataset
Using data from conversation logs between users and Ebbot, we collected 2924 phrases in total and labeled them as "Spam" or "Not Spam". With the help of our sentence similarity model, we were able to cluster meaningful and spam sentences into two groups. This let us avoid manual data labeling as much as possible and saved a lot of time.
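If you are curious what this kind of embedding-based pre-labeling can look like, here is a minimal sketch using a generic sentence-transformers encoder and k-means with two clusters. The model name, example phrases and clustering choice are illustrative assumptions, not our exact sentence similarity setup.

```python
# Minimal sketch of pre-labeling via sentence embeddings (model name and phrases are illustrative).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

phrases = [
    "How do I reset my password?",
    "asdfda wqrherewrere safdfa",
    "What are your opening hours?",
    "kjhsdkjh 123123",
]

# Encode each phrase into a dense vector with a general-purpose sentence encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(phrases)

# Split the embeddings into two clusters; gibberish/spam tends to land in its own cluster,
# so each cluster only needs a quick manual check instead of labeling every phrase.
clusters = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(embeddings)

for phrase, cluster in zip(phrases, clusters):
    print(cluster, phrase)
```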
Training and evaluating the model
Inspired by the article "Create a SMS spam classifier in Python", we chose a Multinomial Naive Bayes classifier for this project. 75% of the dataset was used for training and 25% was held out for testing. While analyzing the dataset, we also noticed that the average length of spam messages (12.4 characters) is much shorter than that of non-spam messages (21.5 characters).
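Below is a rough sketch of that training setup with scikit-learn: a bag-of-words representation feeding a Multinomial Naive Bayes model, with a 75/25 split. The file name, column names and parameters are assumptions for illustration rather than our exact code.

```python
# Sketch of the training setup: bag-of-words + Multinomial Naive Bayes with a 75/25 split.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled dataset with columns "text" and "label" (1 = spam, 0 = not spam).
df = pd.read_csv("labeled_phrases.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.25, random_state=42
)

# Turn each phrase into token counts, then fit the Naive Bayes model on those counts.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

model = MultinomialNB()
model.fit(X_train_counts, y_train)
```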
After fitting the model on the training data, we used the test set to see how it performs. The AUC-ROC (Area Under the Receiver Operating Characteristic curve) score was very high, approximately 0.97!
AUC-ROC score for the model after fitting our dataset
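Continuing the sketch above, scoring the held-out 25% takes only a few lines; `vectorizer` and `model` refer to the objects fitted in the previous snippet.

```python
# Evaluate on the held-out 25%: predict spam probabilities and compute the AUC-ROC score.
from sklearn.metrics import roc_auc_score

X_test_counts = vectorizer.transform(X_test)          # reuse the already-fitted vectorizer
spam_probabilities = model.predict_proba(X_test_counts)[:, 1]

print("AUC-ROC:", roc_auc_score(y_test, spam_probabilities))
```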
Building and deploying a web app to host the spam classifier
Before giving Ebbot this filter to help him collect only meaningful data, we still want to test and improve the model. Thanks to Streamlit, we were able to build a web app in under an hour. Furthermore, we included a feedback system that collects false predictions so we can improve the model. The web app was then deployed using [Heroku](https://www.heroku.com). If you want to test it live, here is 🥁🥁🥁🥁🥁🥁🥁🥁🥁🥁🥁 the link!
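To give a sense of how little code such a demo needs, here is a hedged sketch of a Streamlit app with a simple feedback button. The file names, widget labels and saved-model paths are illustrative assumptions, not our production app.

```python
# streamlit_app.py -- minimal sketch of a spam-classifier demo with a feedback button.
import csv
import joblib
import streamlit as st

# Load a vectorizer and model saved after training (hypothetical file names).
vectorizer = joblib.load("vectorizer.joblib")
model = joblib.load("spam_model.joblib")

st.title("Ebbot spam classifier")
message = st.text_input("Type a message to classify")

if message:
    spam_probability = model.predict_proba(vectorizer.transform([message]))[0, 1]
    verdict = "Spam" if spam_probability >= 0.5 else "Not spam"
    st.write(f"**{verdict}** (spam probability: {spam_probability:.2f})")

    # Simple feedback loop: store messages the model got wrong so they can be relabeled later.
    if st.button("The prediction is wrong"):
        with open("feedback.csv", "a", newline="") as f:
            csv.writer(f).writerow([message, verdict])
        st.write("Thanks! Your feedback was saved.")
```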
We are hoping to implement this filter in our Bot-builder product in the near future, to help our clients reduce the training time of their digital assistants. As soon as it is launched, we will definitely share the good news in another blog post. Until then, please feel free to look at the other posts on our website or follow our LinkedIn to stay up to date with exciting news!