
Extending SentenceTransformers to the Swedish language

The story of the NLP team from Hello Ebbot extending SentenceTransformers to Swedish started a month ago, when we unexpectedly received a call from Santa Claus...

🎅🏼 Santa: Hello, is this Hello Ebbot's NLP team? It's Santa Claus speaking! Hello Ebbot is on the Nice List this year and I have a gift for you.

👾 Hello Ebbot team: Oh Santa!! Really, you have a gift for us?

🎅🏼 Santa: Yes, of course. You have all been working very hard in 2020. How may I help reduce your workload?

👾 Hello Ebbot team: Hmm, there is actually one thing that we want to improve right now! For our digital co-worker to respond to human language, he has to be trained to detect intent, which is basically the purpose of a message. He learns to predict it accurately from 10-20 example sentences for every intent. It would be nice to have an application that takes one sentence as input and outputs many sentences with the same meaning, so we don't have to come up with these examples ourselves.

🎅🏼 Santa: Aaah, then I know exactly what you need. How about my intelligent SentenceTransformers model? He can help you translate sentences into numbers, and you can use cosine similarity to find similar sentences in a big corpus.

👾 Hello Ebbot team: That's great! We will prepare and clean our list of example sentences in our database and wait for your gift!

🎅🏼 Santa: One little problem: you have to teach SentenceTransformers Swedish! He only speaks English.

👾 Hello Ebbot team: That's okay, Santa. We know you have to talk to other companies on the Nice List. Let us take care of this from here.

That's when we decided to train SentenceTransformers so that the model can embed Swedish text. And finally, after hours of training and many cups of coffee... SentenceTransformers now speaks Swedish fluently! 🥳 🎉

How we extended SentenceTransformers to Swedish

Following the publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, we extended the English SentenceTransformers model (the teacher) to a Swedish model (the student) using an English-Swedish parallel sentence dataset: the TED2020 corpus, containing 119,602 sentences. We trained our Transformer with UKPLab's example training script in a Colab Pro notebook. Thanks to Colab Pro's Graphics Processing Unit (GPU), training took only two hours, and the model achieved an accuracy of 95.6% on the test set.
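To make the teacher-student idea more concrete, here is a minimal sketch of the training objective described in that publication: the student is pulled towards the teacher's English embedding for both the English sentence and its Swedish translation, using mean squared error. This is not our training script. The function names and the tiny 3-dimensional vectors are made up for illustration, and real embeddings have hundreds of dimensions.

```python
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_sv: Vector) -> float:
    """Loss for one English-Swedish sentence pair.

    The target is always the teacher's embedding of the English sentence.
    The student is penalised for deviating from it on the English sentence
    and on the Swedish translation.
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_sv)


# Toy embeddings, invented purely for illustration.
teacher_en = [1.0, 0.0, 2.0]  # teacher embedding of the English sentence
student_en = [1.0, 0.0, 2.0]  # student embedding of the same English sentence
student_sv = [0.0, 0.0, 2.0]  # student embedding of the Swedish translation

loss = distillation_loss(teacher_en, student_en, student_sv)
print(f"loss = {loss:.4f}")
```

Training lowers this loss over all sentence pairs, so a Swedish sentence ends up close to where the teacher would place its English translation.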

Hello Ebbot's application built using SentenceTransformers

After extending SentenceTransformers to Swedish, we used the model to embed our corpus, a cleaned list of 56,538 example phrases that we had come up with to teach Ebbot in the past. We then applied cosine similarity to measure the semantic similarity between a given text and each sentence in the corpus. The application prints out the most similar sentences along with their similarity scores. Using Streamlit, our NLP team built a simple web app that lets users choose how many similar phrases they want to generate. There is also an option to print either the top similar sentences or all sentences within a chosen percentage range.
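The search step can be sketched in a few lines of plain Python. This is a simplified illustration, not our application code. The helper names (`rank_corpus`, `top_k`, `within_range`), the three-sentence corpus and its 3-dimensional vectors are all hypothetical. In practice the vectors would come from the trained model.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corpus(query: Vector, corpus: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Score every corpus sentence against the query, best match first."""
    scored = [(sentence, cosine_similarity(query, vec)) for sentence, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k(query: Vector, corpus: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k most similar sentences."""
    return rank_corpus(query, corpus)[:k]


def within_range(query: Vector, corpus: Dict[str, Vector],
                 low: float, high: float) -> List[Tuple[str, float]]:
    """Return all sentences whose score lies in [low, high]."""
    return [(s, score) for s, score in rank_corpus(query, corpus) if low <= score <= high]


# Hypothetical corpus with invented embeddings.
corpus = {
    "var finns mitt paket": [0.9, 0.1, 0.0],
    "jag vill ha mitt paket nu": [0.8, 0.3, 0.1],
    "toppen tack": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # invented embedding of the user's input sentence

for sentence, score in top_k(query, corpus, k=2):
    print(f"{sentence} (Score: {score:.2f})")
```

Sorting the whole corpus once and then slicing or filtering keeps both options (top-k and score range) on the same code path, which is good enough for a corpus of tens of thousands of phrases.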

Let's take a look at some examples!

NÀr skickar ni grejerna som jag bestÀllt?

  • undrar nĂ€r ni skickar ivĂ€g det jag bestĂ€llt av er (Score: 0.93)
  • jag undrar om nĂ€r jag fĂ„r grejerna som jag bestĂ€llt (Score: 0.91)
  • hej jag har bestĂ€llt varor utav er fĂ„tt undrar vart resterande tagit vĂ€gen (Score: 0.90)
  • nĂ€r fĂ„r jag mina bestĂ€llda varor (Score: 0.89)
  • nĂ€r fĂ„r jag mitt paket som jag bestĂ€llt (Score: 0.89)
  • nĂ€r mĂ„ste jag hĂ€mta bestĂ€llning (Score: 0.89)
  • nĂ€r kommer saker jag bestĂ€ller fram (Score: 0.88)
  • nĂ€r skickas min bestĂ€llning (Score: 0.88)
  • och du undrar jag hur jag ska gĂ„ till vĂ€ga skickar ni hit nĂ„gon som hĂ€mtar den dĂ„ jag hade hemleverans (Score: 0.88)
  • vart Ă€r mina saker som jag har bestĂ€llt (Score: 0.88)

Tack för all hjälp, ni är bäst!

  • kanon tack för all hjälp ha det gott (Score: 0.98)
  • superbra tack så mycket för hjälpen (Score: 0.98)
  • toppen tack så mycket för hjälpen (Score: 0.98)
  • toppen tack för din hjälp 👍🏾 (Score: 0.98)
  • tack för hjälpen ha det så bra (Score: 0.98)
  • stort tack du har varit till stor hjälp (Score: 0.98)
  • perfekt tack så mycket för hjälpen (Score: 0.98)
  • toppen tack tack för bra service (Score: 0.98)
  • oh toppen tack för din hjälp (Score: 0.98)
  • excellent thanks for your help (Score: 0.98)

You can see that the application is not only finding other sentences with similar words, but is actually able to return sentences with the same meaning. This is what makes SentenceTransformers a powerful and helpful tool for us, because the more creative we are with the example phrases, the better Ebbot becomes at detecting intents!

Extremely excited about our result, Santa 🎅🏼 called to congratulate us and asked when we would have the application ready for production. Even though we are proud of ourselves for successfully extending SentenceTransformers to Swedish, we told him that we still want to test it internally and make improvements before the official release. We thanked Santa 🎅🏼 again and promised him we would work even harder in 2021 to stay on the Nice List 🎄 And so Hello Ebbot's journey for the year 2021 begins...

Mia
December 16, 2020