
Extending SentenceTransformers to the Swedish language

The story of the NLP team from Hello Ebbot extending SentenceTransformers to Swedish started a month ago, when we unexpectedly received a call from Santa Claus...

🎅🏼 Santa: Hello, is this Hello Ebbot's NLP team? It's Santa Claus speaking! Hello Ebbot is on the Nice List this year and I have a gift for you.

👾 Hello Ebbot team: Oh Santa!! Really, you have a gift for us?

🎅🏼 Santa: Yes, of course, you have all been working very hard in the year 2020. How may I help reduce your workload?

👾 Hello Ebbot team: Hmm, there is actually one thing that we want to improve right now! For our digital co-worker to respond to human language, he has to be trained to detect intent, which is basically the purpose of a message. He learns to predict each intent accurately from 10-20 example sentences. It would be nice if we could have an application that takes one sentence as input and outputs many sentences with the same meaning, so we don't have to come up with these examples ourselves.

🎅🏼 Santa: Aaah, then I know exactly what you need, how about my intelligent SentenceTransformers model? He can help you translate the sentences into numbers and you can use cosine similarity to find similar sentences in a big corpus.

👾 Hello Ebbot team: That's great! We will prepare and clean our list of example sentences in our database and wait for your gift!

🎅🏼 Santa: One little problem, you have to teach SentenceTransformers Swedish! He only speaks English.

👾 Hello Ebbot team: That's okay, Santa, we know you have to talk to other companies on the Nice List. Let us take care of this from here.

That's when we decided to train SentenceTransformers so that the model could embed Swedish text. And finally, after hours of training and many cups of coffee... SentenceTransformers now speaks Swedish fluently! 🥳 🎉

How we extended SentenceTransformers to Swedish

Based on the publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we extended an English SentenceTransformers model (the teacher) to a Swedish model (the student) using a dataset of English-Swedish parallel sentences: the TED2020 corpus, containing 119,602 sentences. We trained our Transformer with UKPLab's example training script in a Colab Pro notebook. Thanks to Colab Pro's Graphics Processing Unit (GPU), training took only two hours, and the model achieved an accuracy of 95.6% on the test set.
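For readers curious about what "knowledge distillation" means here, the idea in the publication is that the student should map both an English sentence and its Swedish translation to the same vector the teacher produces for the English sentence. The snippet below is only a toy sketch of that objective in plain Python, not our actual training code (we used UKPLab's script). The function names and the tiny hand-written vectors are made up for illustration.

```python
# Toy sketch of the multilingual knowledge-distillation objective.
# All names and vectors here are illustrative assumptions, not the real training code.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en, student_en, student_sv):
    """Loss for one parallel sentence pair.

    teacher_en: teacher embedding of the English sentence (the fixed target)
    student_en: student embedding of the same English sentence
    student_sv: student embedding of the Swedish translation
    """
    return mse(student_en, teacher_en) + mse(student_sv, teacher_en)


def batch_loss(batch):
    """Average the per-pair loss over a batch of (teacher_en, student_en, student_sv) triples."""
    return sum(distillation_loss(*triple) for triple in batch) / len(batch)


if __name__ == "__main__":
    perfect = ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])   # student already matches the teacher
    imperfect = ([1.0, 0.0], [0.0, 0.0], [1.0, 2.0])  # student is off for both languages
    print(batch_loss([perfect, imperfect]))
```

When the loss reaches zero, the student gives the Swedish sentence the same embedding the teacher gives its English translation, which is why a model that "only speaks English" can be extended to a new language from parallel sentences alone.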

Hello Ebbot's application built using SentenceTransformers

After extending SentenceTransformers to Swedish, we used the model to embed our corpus, a cleaned list of 56,538 example phrases that we had written in the past to teach Ebbot. We then applied cosine similarity to measure the semantic similarity between a given text and each sentence in the corpus. The application prints out the most similar sentences along with their similarity scores. Using Streamlit, our NLP team built a simple web app that lets users choose how many similar phrases they want to generate. There is also an option to print either the top similar sentences or all sentences whose score falls within a chosen percentage range.
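The search step described above can be sketched in a few lines. This is a simplified illustration rather than our application code: the helper names (rank_corpus, top_k, within_range) are hypothetical, and the two-dimensional vectors stand in for real sentence embeddings, which in practice would come from the trained model.

```python
import math

# Simplified sketch of similarity search over a pre-embedded corpus.
# Helper names and toy vectors are assumptions made for illustration.

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corpus(query_vec, corpus):
    """Score every (sentence, vector) pair against the query, best match first."""
    scored = [(sentence, cosine_similarity(query_vec, vec)) for sentence, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k(query_vec, corpus, k):
    """Return the k most similar sentences with their scores."""
    return rank_corpus(query_vec, corpus)[:k]


def within_range(query_vec, corpus, low, high):
    """Return all sentences whose score lies in [low, high], best match first."""
    return [(s, score) for s, score in rank_corpus(query_vec, corpus) if low <= score <= high]


if __name__ == "__main__":
    # Toy corpus: sentences paired with made-up 2-d "embeddings".
    corpus = [
        ("när skickas min beställning", [1.0, 0.0]),
        ("när får jag mitt paket som jag beställt", [1.0, 1.0]),
        ("tack för hjälpen ha det så bra", [0.0, 1.0]),
    ]
    query = [1.0, 0.0]  # stand-in for the embedding of the user's sentence
    for sentence, score in top_k(query, corpus, k=2):
        print(f"{sentence} (Score: {score:.2f})")
```

The two helpers mirror the two options in the web app: top_k for "give me N similar phrases" and within_range for "give me everything between, say, 80% and 95% similar".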

Let's take a look at some examples!

När skickar ni grejerna som jag beställt?

  • undrar när ni skickar iväg det jag beställt av er (Score: 0.93)
  • jag undrar om när jag får grejerna som jag beställt (Score: 0.91)
  • hej jag har beställt varor utav er fått undrar vart resterande tagit vägen (Score: 0.90)
  • när får jag mina beställda varor (Score: 0.89)
  • när får jag mitt paket som jag beställt (Score: 0.89)
  • när måste jag hämta beställning (Score: 0.89)
  • när kommer saker jag beställer fram (Score: 0.88)
  • när skickas min beställning (Score: 0.88)
  • och du undrar jag hur jag ska gå till väga skickar ni hit någon som hämtar den då jag hade hemleverans (Score: 0.88)
  • vart är mina saker som jag har beställt (Score: 0.88)

Tack för all hjälp, ni är bäst!

  • kanon tack för all hjälp ha det gott (Score: 0.98)
  • superbra tack så mycket för hjälpen (Score: 0.98)
  • toppen tack så mycket för hjälpen (Score: 0.98)
  • toppen tack för din hjälp 👍🏾 (Score: 0.98)
  • tack för hjälpen ha det så bra (Score: 0.98)
  • stort tack du har varit till stor hjälp (Score: 0.98)
  • perfekt tack så mycket för hjälpen (Score: 0.98)
  • toppen tack tack för bra service (Score: 0.98)
  • oh toppen tack för din hjälp (Score: 0.98)
  • excellent thanks for your help (Score: 0.98)

You can see that the application is not only finding other sentences with similar words, but is actually able to return sentences with the same meaning. This is what makes SentenceTransformers a powerful and helpful tool for us, because the more creative we are with the example phrases, the better Ebbot becomes at detecting intents!

Extremely excited about our result, Santa 🎅🏼 called to congratulate us and ask when the application would be ready for production. Even though we are proud of ourselves for successfully extending SentenceTransformers to Swedish, we told him that we still want to test it internally and make improvements before the official release. We thanked Santa 🎅🏼 again and promised him we would work even harder in the year 2021 to stay on the Nice List 🎄 And so Hello Ebbot's journey for the year 2021 begins...

Mia
December 16, 2020