Detecting toxic messages in Swedish

The rapid development of the Internet and social media has done a great deal for human connection, but it is also the reason toxic behavior has become more common online. As a result, toxic comment classification has been an active research topic in the Machine Learning field for the past few years. Recently, one of our clients asked us to teach Ebbot to detect toxic messages in conversations. Thanks to this special request, we got a chance to work on one of the most difficult topics in the Natural Language Processing (NLP) field. And yes, we could not be more excited! 🥳

Challenges with collecting a dataset

To implement this classification task, we have to train Ebbot on a dataset of text labeled for toxicity. Large labeled training datasets exist, but they are not available in Swedish. Machine-translating them into Swedish is not a good approach either, since many slang expressions cannot be translated accurately by machines.

Ebbot's solution to toxic message detection

After some research, we found an open-source, highly accurate pretrained model built by Laura Hanu at Unitary. In addition to the original version, which supports only English and was trained on Wikipedia comments, Unitary also provides a multilingual model trained on 7 languages (English, French, Spanish, Italian, Portuguese, Turkish and Russian).

At the same time, we found a machine translation model by the Language Technology Research Group at the University of Helsinki. Combining the two lets us work around the missing Swedish dataset and meet our client's request. When Ebbot receives input text in Swedish, it first translates it to English, then runs the translation through the toxicity classifier. The output is a score for each of six categories of toxic messages: toxicity, severe toxicity, obscene, threat, insult and identity hate. With this method we can decide not only whether a message is toxic, but also which type of inappropriate behavior it contains.
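The translate-then-classify flow can be sketched roughly as follows. This is a minimal illustration, not Ebbot's production code. It assumes the Helsinki translation model is the `Helsinki-NLP/opus-mt-sv-en` checkpoint loaded through the `transformers` library, and that the classifier is the `original` variant from Unitary's `detoxify` package. The post names neither detail. The function names and the 0.5 threshold are hypothetical choices made for the example. The label keys returned by the library may be spelled differently from the category names above, so the sketch does not hard-code them.

```python
from typing import Callable, Dict

# Assumption: the post does not name the exact checkpoint; this is the
# Swedish-to-English OPUS-MT model published by the Helsinki group.
TRANSLATION_MODEL = "Helsinki-NLP/opus-mt-sv-en"


def load_translator(model_name: str = TRANSLATION_MODEL) -> Callable[[str], str]:
    """Return a function that translates one Swedish string to English."""
    # Imported lazily so the pipeline logic below has no hard dependency.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(text: str) -> str:
        batch = tokenizer([text], return_tensors="pt", padding=True)
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate


def load_classifier(variant: str = "original") -> Callable[[str], Dict[str, float]]:
    """Return a function that scores one English string per toxicity category."""
    from detoxify import Detoxify

    model = Detoxify(variant)

    def classify(text: str) -> Dict[str, float]:
        # predict() returns a dict of label -> score for a single string.
        return {label: float(score) for label, score in model.predict(text).items()}

    return classify


def detect_toxicity(
    message: str,
    translate: Callable[[str], str],
    classify: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,  # hypothetical cut-off, tune on real feedback
) -> dict:
    """Translate a Swedish message, score it, and flag categories over threshold."""
    english = translate(message)
    scores = classify(english)
    flagged = sorted(label for label, score in scores.items() if score >= threshold)
    return {
        "translation": english,
        "scores": scores,
        "flagged": flagged,
        "is_toxic": bool(flagged),
    }


if __name__ == "__main__":
    # Downloads both models on first run.
    result = detect_toxicity("Du är en idiot", load_translator(), load_classifier())
    print(result["translation"], result["flagged"])
```

Passing the translator and classifier in as plain functions keeps the decision logic separate from the models, so either one can be swapped (for example, for the multilingual classifier) without touching `detect_toxicity`.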

We are aware that this is not the best solution to a Machine Learning/Artificial Intelligence problem. Nevertheless, when no training dataset is available, we consider it one of the quickest and easiest ways to tackle multilingual NLP challenges. We are currently testing the model and gathering user feedback to improve the app's performance. Please feel free to contact us if you have any inquiries about our bot-builder product or special NLP integrations 🙌 We are usually very responsive 😉

Mia

March 22, 2021
