Fun with word embeddings

10 minutes to read

In this tutorial, you will learn more about word embeddings and how they work using the german language and using byte-pair encodings (BPEmb). For this I will do some experiments to give you a little more insight, about how word-embeddings work.

In case you ask, word embeddings are a vectorized representation of words or word-tokens. They grew from the idea, that usign One-Hot encoding, you can only tell if a word appears in a sequence, but you have no information about its context and also about it meaning compared to others words.

Lets take the sentence “Peter likes to eat marmalade on his bread” and we have a task of finding other words, you could replace marmalade with. We would need a system, that understood the context, in which marmalade is used. Therefore, words that are used in the same context as marmalade will be in the same vector space as marmalade, they could be also things you can put on your bread! Lets look at this further.

Into the python code!

As mentioned before, we use the byte-pair encodings of BPEmb. You can find more about them on https://github.com/bheinzerling/bpemb.
So we need to get the package first by using in terminal:

pip install bpemb

1	pip install bpemb

Then we can jump immediatly into the python code!

We import bpemb and numpy and define, that we want to use the german model with a vocabulary size of 25.000 and 25 dimensions.

But what does this mean? Let me show you a simple example, where we code the encodings manually and to vizualize the meaning of dimensions.

Now we just need to call _encode and see what happens:

What we can read from this dictionary is, that “Mann” (Key:0) is defined by beeing grownup (1) and having a relation-value of -1

“Frau” (Key:1) is defined by beeing grownup also (1) and having a relation-value of 1.

“Tochter” (Key:2) is defined by beeing not grownup (0) and having a relation-value of 0.

Numpy allows us to do array-wise calculations, so we can do mathematic operations on these word-vectors. Lets find out what happens, when “Mann” and “Frau” are combined. You might have an idea, whats coming next.

This combined word vector contains the values of the combination of the words “Mann” and “Frau” which are the same as the values of the word “Tochter”. We teached this simple, hard coded model an association, that man and women combined point at daughter.

Word embeddings in bpemb

Now we know about the dimension-parameter of the bpemb import. But let some words be said about the vocab_size parameter. Generally, each vocab in NLP should use tokens, that represent rare or unknown word-tokens. Depending on your hardware, the vocab size will restrict you at some point in your model building, since it will run out of memory. Simply said, having a vocab_size of 25.000 word-tokens means having 24.999 real-word tokens and 1 token, that represents the rare ones. (There might also be other tokens, like for special encodings, or other special word parts). Increasing the vocab size to 50.000 will result in more different word_tokens, that would be hidden in the 25.000 vs model in the rare token.

In bpemb, we have way more complex pretrained word vectors, that allow us to do some experiments to understand them better. What happens, when we do the same as above in bpemb?

Stop, lets go through each line to see what happens

embeds1 = bpemb_ger.embed('Mann')
print(embeds1)
[[-0.69423   0.142541  0.367741  0.114064 -0.191817  0.329579 -0.181716
   1.041099 -1.150414  0.57117  -0.039958  0.936779  0.35633  -0.221654
   0.423722  0.433451 -0.308177 -0.793244  0.912121 -0.83934  -0.194813
   0.740967 -0.4088    0.140494  0.922742]]

embeds1 = bpemb_ger.embed('Mann')

print(embeds1)

[[-0.69423 0.142541 0.367741 0.114064 -0.191817 0.329579 -0.181716

1.041099 -1.150414 0.57117 -0.039958 0.936779 0.35633 -0.221654

0.423722 0.433451 -0.308177 -0.793244 0.912121 -0.83934 -0.194813

0.740967 -0.4088 0.140494 0.922742]]

We use the .embed function of our bpemb model, which tokenizes our input ‘Mann’ into subtokens, that are used by bpemb and then returns the corresponding word vector

The result is a list, containing lists of 25 elements (as the dimensions, we choose). Since “Mann” only results in one token, we only have one list element of 25 values.

We can add numpy arrays in python, so the lists are merged

embed_add = embeds1 + embeds2

1	embed_add = embeds1 + embeds2

What we have now, is a new vector containing the following Information: We want a word, that has the information of adding “Mann” and “Frau” together. For this we use .most_similar, with getting the top 6 results

bpemb_ger.most_similar(embed_add, topn=6)
[('▁frau', 0.9576594233512878),
 ('▁mann', 0.9351305365562439),
 ('▁freund', 0.8832379579544067),
 ('▁mutter', 0.8588590621948242),
 ('▁vater', 0.8556250333786011),
 ('▁kind', 0.8479531407356262)]

bpemb_ger.most_similar(embed_add, topn=6)

[('▁frau', 0.9576594233512878),

('▁mann', 0.9351305365562439),

('▁freund', 0.8832379579544067),

('▁mutter', 0.8588590621948242),

('▁vater', 0.8556250333786011),

('▁kind', 0.8479531407356262)]

“_frau” and “_mann” are still near our start vector, but we also find other words like “_freund” (friend) or mother and father. At the 6.th position we find “_kind”.

When we teached our predefined _embed method, which information each dimension contains, the bpemb model had to learn it on his own by alot of wikipedia articles. So how about we try to find out more about the meaning of each dimension? For this, we iterate through the dimensions of our embed_add vector and set the to 1 and print the top 3 results. Lets see if we can “kid” or even “daughter” to the top with this.

for i in range(25):
    new_embed_add= embed_add
    new_embed_add[0][i] = 1
    print(bpemb_ger.most_similar(embed_add, topn=3))

[('▁frau', 0.8940346240997314), ('▁mann', 0.8566429615020752), ('▁freund', 0.808347761631012)]
[('▁frau', 0.8886023759841919), ('▁mann', 0.8520528078079224), ('▁geliebte', 0.8106622695922852)]
[('▁frau', 0.8825910091400146), ('▁mann', 0.853992223739624), ('▁freund', 0.8089397549629211)]
[('▁frau', 0.8841907978057861), ('▁mann', 0.8498778343200684), ('▁freund', 0.8016359806060791)]
[('▁frau', 0.8412380814552307), ('▁mann', 0.8276678323745728), ('▁freund', 0.7763669490814209)]
[('▁frau', 0.8366867899894714), ('▁mann', 0.8289816975593567), ('▁freund', 0.7783395051956177)]
[('▁mann', 0.812994122505188), ('▁frau', 0.7430222630500793), ('▁freund', 0.7360460162162781)]
[('▁mann', 0.7824007272720337), ('felt', 0.7378767132759094), ('▁eigentlich', 0.7248014211654663)]
[('▁klar', 0.6995150446891785), ('▁gerade', 0.659595787525177), ('▁eigentlich', 0.6595355868339539)]
[(',', 0.6670671701431274), ('▁paar', 0.6461077928543091), ('▁klar', 0.6414101123809814)]
[('▁klar', 0.6853453516960144), ('▁gerade', 0.6187617778778076), ('▁enttäus', 0.6111329793930054)]
[('▁klar', 0.6924160718917847), ('▁gerade', 0.6136806011199951), ('paar', 0.6119207143783569)]
[('▁klar', 0.6746844053268433), ('paar', 0.6257284283638), ('▁paar', 0.6127488613128662)]
[('▁klar', 0.6750163435935974), ('▁hinter', 0.6560536026954651), ('▁halb', 0.6483876705169678)]
[('▁klar', 0.6728219985961914), ('▁hinter', 0.6533928513526917), ('▁halb', 0.6446695327758789)]
[('▁klar', 0.6776669025421143), ('▁gerade', 0.6607164740562439), ('▁halb', 0.6388657093048096)]
[('▁seite', 0.6523836851119995), ('▁griff', 0.6437400579452515), ('▁klar', 0.6390032172203064)]
[('▁wirft', 0.6730402112007141), ('▁enttäus', 0.6729576587677002), ('▁schlägt', 0.6632205247879028)]
[('▁enttäus', 0.6794372200965881), ('ähern', 0.6544743776321411), ('▁wirft', 0.64780193567276)]
[('▁griff', 0.6711666584014893), ('▁matt', 0.6624730825424194), ('▁republikaner', 0.6422385573387146)]
[('▁matt', 0.6543236374855042), ('▁griff', 0.6426835060119629), ('undet', 0.6052414178848267)]
[('▁matt', 0.6375241875648499), ('▁vorsprung', 0.6192678213119507), ('▁griff', 0.6146632432937622)]
[('▁calder', 0.6027392745018005), ('▁nelson', 0.599494993686676), ('▁sull', 0.5962843894958496)]
[('▁nelson', 0.5932034850120544), ('▁sull', 0.5895329117774963), ('▁gamb', 0.5844269394874573)]
[('lando', 0.5946722626686096), ('▁gamb', 0.5890830159187317), ('ingo', 0.5865989327430725)]

for i in range(25):

new_embed_add= embed_add

new_embed_add[0][i] = 1

print(bpemb_ger.most_similar(embed_add, topn=3))

[('▁frau', 0.8940346240997314), ('▁mann', 0.8566429615020752), ('▁freund', 0.808347761631012)]

[('▁frau', 0.8886023759841919), ('▁mann', 0.8520528078079224), ('▁geliebte', 0.8106622695922852)]

[('▁frau', 0.8825910091400146), ('▁mann', 0.853992223739624), ('▁freund', 0.8089397549629211)]

[('▁frau', 0.8841907978057861), ('▁mann', 0.8498778343200684), ('▁freund', 0.8016359806060791)]

[('▁frau', 0.8412380814552307), ('▁mann', 0.8276678323745728), ('▁freund', 0.7763669490814209)]

[('▁frau', 0.8366867899894714), ('▁mann', 0.8289816975593567), ('▁freund', 0.7783395051956177)]

[('▁mann', 0.812994122505188), ('▁frau', 0.7430222630500793), ('▁freund', 0.7360460162162781)]

[('▁mann', 0.7824007272720337), ('felt', 0.7378767132759094), ('▁eigentlich', 0.7248014211654663)]

[('▁klar', 0.6995150446891785), ('▁gerade', 0.659595787525177), ('▁eigentlich', 0.6595355868339539)]

[(',', 0.6670671701431274), ('▁paar', 0.6461077928543091), ('▁klar', 0.6414101123809814)]

[('▁klar', 0.6853453516960144), ('▁gerade', 0.6187617778778076), ('▁enttäus', 0.6111329793930054)]

[('▁klar', 0.6924160718917847), ('▁gerade', 0.6136806011199951), ('paar', 0.6119207143783569)]

[('▁klar', 0.6746844053268433), ('paar', 0.6257284283638), ('▁paar', 0.6127488613128662)]

[('▁klar', 0.6750163435935974), ('▁hinter', 0.6560536026954651), ('▁halb', 0.6483876705169678)]

[('▁klar', 0.6728219985961914), ('▁hinter', 0.6533928513526917), ('▁halb', 0.6446695327758789)]

[('▁klar', 0.6776669025421143), ('▁gerade', 0.6607164740562439), ('▁halb', 0.6388657093048096)]

[('▁seite', 0.6523836851119995), ('▁griff', 0.6437400579452515), ('▁klar', 0.6390032172203064)]

[('▁wirft', 0.6730402112007141), ('▁enttäus', 0.6729576587677002), ('▁schlägt', 0.6632205247879028)]

[('▁enttäus', 0.6794372200965881), ('ähern', 0.6544743776321411), ('▁wirft', 0.64780193567276)]

[('▁griff', 0.6711666584014893), ('▁matt', 0.6624730825424194), ('▁republikaner', 0.6422385573387146)]

[('▁matt', 0.6543236374855042), ('▁griff', 0.6426835060119629), ('undet', 0.6052414178848267)]

[('▁matt', 0.6375241875648499), ('▁vorsprung', 0.6192678213119507), ('▁griff', 0.6146632432937622)]

[('▁calder', 0.6027392745018005), ('▁nelson', 0.599494993686676), ('▁sull', 0.5962843894958496)]

[('▁nelson', 0.5932034850120544), ('▁sull', 0.5895329117774963), ('▁gamb', 0.5844269394874573)]

[('lando', 0.5946722626686096), ('▁gamb', 0.5890830159187317), ('ingo', 0.5865989327430725)]

As we can see, we get alot of different results, which are rather more distant to having “kid” in the top three than before. This is because the whole learnt language in the model had to be squashed into 25 dimensions. Thats far too less for each dimension to have a clear context. We just dont know, what each dimension represents.

Lets try something else, restrict the information in the word vector more by adding extra information

Oh look! We just got “_kind” (children) on the second place thorugh adding the word vector of “birth” That is pretty cool, right? But what happens, when we swap “Mann” and “Frau”?

Mathematicans wont be surprised, it is the same. Simply, because 1+2 is the same as 2+1. But, lets see if we can “_kind” even higher through weighting the word_vectors

This is better, but we still did not got “children” to the top. Maybe Adding an extra information about children will raise it more. So lets puts Spielzeug (toys) in it.

We did it! Through adding the information “Spielzeug” we got the model to understand, that we talk about children. What do you think, can we change one word, to get the word “baby” to the top?

Lets try changing “Mann” to Windeln (diapers)

“kind” is still on the top. Maybe the vocabulary is too small, to contain the word “baby”, lets try it again with 50.000 words.

bpemb_ger_mid = BPEmb(lang="de", vs=100000, dim=25)

1	bpemb_ger_mid = BPEmb(lang="de", vs=100000, dim=25)

Weird, isn’t it? Still we don’t get “Baby” to the top. Lets see, what is similar to baby.

And there we find out explanation. The word baby has in this 50k vocab model different meanings but the most important difference is, that baby is seen as an english word, which is basically similar to other english words, which were also collected in the 50k vocab model. This might be a hint, that a model with more dimensions could resolve that, since its data might be better differentiated. The model learned, that the word baby appears only in the context of other english words. So by using german words, we will not the word baby easily with this model.

In this post you learned how to use the bpemb model, to extract word embeddings to analyze them. You also learned how to find other similar words and how bpemb stores them in vectors. I hope you found it interesting! There is alot to tell about word-embeddings, but for beginners, this might be a good way of starting to use them. Why dont you share your thoughts in the comments? Thanks for reading!