Extending SentenceTransformers to the Swedish language
The story of Hello Ebbot's NLP team extending SentenceTransformers to Swedish started a month ago, when we unexpectedly received a call from Santa Claus...
🎅🏼 Santa: Hello, is this Hello Ebbot's NLP team? It's Santa Claus speaking! Hello Ebbot is on the Nice List this year and I have a gift for you.
Hello Ebbot team: Oh Santa!! Really, you have a gift for us?
🎅🏼 Santa: Yes, of course. You have all been working very hard in 2020. How may I help reduce your workload?
Hello Ebbot team: Hmm, there is actually one thing that we want to improve right now! For our digital co-worker to respond to human language, he has to be trained to detect intent, which is basically the purpose of a message. He learns to predict each intent accurately from 10-20 example sentences. It would be nice to have an application that takes one sentence as input and outputs many sentences with the same meaning, so we don't have to come up with these examples ourselves.
🎅🏼 Santa: Aaah, then I know exactly what you need. How about my intelligent SentenceTransformers model? He can help you turn sentences into numeric vectors, and you can use cosine similarity to find similar sentences in a big corpus.
Hello Ebbot team: That's great! We will prepare and clean the list of example sentences in our database and wait for your gift!
🎅🏼 Santa: One little problem: you have to teach SentenceTransformers Swedish! He only speaks English.
Hello Ebbot team: That's okay, Santa. We know you have to talk to other companies on the Nice List. Let us take care of this from here.
That's when we decided to train SentenceTransformers so that the model can embed Swedish text. And finally, after hours of training and many cups of coffee... SentenceTransformers now speaks Swedish fluently! 🥳
How we extended SentenceTransformers to Swedish
Based on the publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we extended the English teacher SentenceTransformers model to a Swedish student model using an English-Swedish parallel sentence dataset: the TED2020 corpus, containing 119,602 sentences. We trained our Transformer based on UKPLab's example training script in a Colab Pro notebook. Using Colab Pro's Graphics Processing Unit (GPU), training took only two hours, and we achieved an accuracy of 95.6% on the test set.
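To make the teacher-student idea concrete, here is a minimal sketch of the training objective as we understand it from the publication: for each parallel sentence pair, the student's embeddings of both the English sentence and its Swedish translation are pulled towards the teacher's embedding of the English sentence, using mean squared error. This is not UKPLab's training script. The function names and the three-dimensional toy vectors below are made up for illustration, and real embeddings have hundreds of dimensions.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_sv: Vector) -> float:
    """Loss for one English-Swedish sentence pair (hypothetical helper).

    The teacher's embedding of the English sentence is the target for the
    student's embedding of the English sentence AND of its Swedish translation.
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_sv)


# Toy vectors standing in for real embeddings (assumed values, illustration only).
teacher_en = [0.2, 0.8, -0.1]          # teacher("When will my order ship?")
student_en = [0.2, 0.7, -0.1]          # student on the same English sentence
student_sv_untrained = [0.5, 0.1, 0.3]     # student on the Swedish translation, before training
student_sv_trained = [0.25, 0.75, -0.05]   # student on the Swedish translation, after training

loss_before = distillation_loss(teacher_en, student_en, student_sv_untrained)
loss_after = distillation_loss(teacher_en, student_en, student_sv_trained)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

As the Swedish embedding moves closer to the teacher's English embedding, the loss shrinks. That is why a sentence and its translation end up near each other in the same vector space.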
Hello Ebbot's application built using SentenceTransformers
After extending SentenceTransformers to Swedish, we used the model to embed our corpus, a cleaned list of 56,538 example phrases that we had come up with in the past to teach Ebbot. We then applied cosine similarity to measure the semantic similarity between a given text and each sentence in the corpus. The application prints out the most similar sentences along with their similarity scores. Using Streamlit, our NLP team built a simple web app that lets users choose how many similar phrases they want to generate. There is also an option to print either the top similar sentences or all sentences whose score falls within a chosen percentage range.
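The search step described above can be sketched in a few lines of plain Python. This is an illustration under assumptions and not our production code: the function names, the `top_k`, `min_score` and `max_score` parameters, and the two-dimensional toy vectors are all hypothetical. In a real setup the vectors would come from the trained model (for example via the `encode` method of a `SentenceTransformer`), and a vectorized library would be used instead of Python loops.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_vec: Vector,
    corpus: Sequence[Tuple[str, Vector]],
    top_k: Optional[int] = None,
    min_score: float = 0.0,
    max_score: float = 1.0,
) -> List[Tuple[str, float]]:
    """Score every corpus phrase against the query and return the best matches.

    Keeps phrases whose score lies in [min_score, max_score], sorted from most
    to least similar. If top_k is given, only the first top_k are returned.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in corpus]
    in_range = [(text, s) for text, s in scored if min_score <= s <= max_score]
    in_range.sort(key=lambda pair: pair[1], reverse=True)
    return in_range if top_k is None else in_range[:top_k]


# Toy corpus: placeholder phrases with made-up 2-D "embeddings".
corpus = [
    ("delivery question", [1.0, 0.0]),
    ("delivery question with thanks", [1.0, 1.0]),
    ("thank-you message", [0.0, 1.0]),
]
query = [1.0, 0.0]  # pretend embedding of a delivery question

results = most_similar(query, corpus, top_k=2)
for text, score in results:
    print(f"- {text} (Score: {score:.2f})")
```

Calling `most_similar(query, corpus, min_score=0.5)` instead returns every phrase scoring at least 0.5, which mirrors the "all sentences within a range" option of the app.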
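Let's take a look at some examples! Each query sentence is followed by the ten most similar phrases from our corpus and their similarity scores.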
När skickar ni grejerna som jag beställt?
- undrar när ni skickar iväg det jag beställt av er (Score: 0.93)
- jag undrar om när jag får grejerna som jag beställt (Score: 0.91)
- hej jag har beställt varor utav er fått undrar vart resterande tagit vägen (Score: 0.90)
- när får jag mina beställda varor (Score: 0.89)
- när får jag mitt paket som jag beställt (Score: 0.89)
- när måste jag hämta beställning (Score: 0.89)
- när kommer saker jag beställer fram (Score: 0.88)
- när skickas min beställning (Score: 0.88)
- och du undrar jag hur jag ska gå till väga skickar ni hit någon som hämtar den då jag hade hemleverans (Score: 0.88)
- vart är mina saker som jag har beställt (Score: 0.88)
Tack för all hjälp, ni är bäst!
- kanon tack för all hjälp ha det gott (Score: 0.98)
- superbra tack så mycket för hjälpen (Score: 0.98)
- toppen tack så mycket för hjälpen (Score: 0.98)
- toppen tack för din hjälp (Score: 0.98)
- tack för hjälpen ha det så bra (Score: 0.98)
- stort tack du har varit till stor hjälp (Score: 0.98)
- perfekt tack så mycket för hjälpen (Score: 0.98)
- toppen tack tack för bra service (Score: 0.98)
- oh toppen tack för din hjälp (Score: 0.98)
- excellent thanks for your help (Score: 0.98)
You can see that the application does not just find sentences that share words with the query; it returns sentences with the same meaning. This is what makes SentenceTransformers a powerful and helpful tool for us, because the more creative we are with the example phrases, the better Ebbot becomes at detecting intents!
Extremely excited about our result, Santa 🎅🏼 called to congratulate us and ask when the application would be ready for production. Even though we are proud of ourselves for successfully extending SentenceTransformers to Swedish, we told him that we still want to test it internally and make improvements before the official release. We thanked Santa 🎅🏼 again and promised him we would work even harder in 2021 to stay on the Nice List. And so Hello Ebbot's journey for the year 2021 begins...