Dima's Blog

No More Hand-Waving: How Vector Embeddings Work

You've probably heard about embeddings in the context of LLMs. You've likely even encountered them in discussions of RAG or the famous Word2Vec paper. Almost invariably, people and tutorials will hand-wave the concept, saying something like "Vector embeddings capture semantic relationships and contextual information," while avoiding any deeper explanation—as if it's something highly complex or magical.

I hate tech magic. I try to demystify it whenever possible. The scope of this article is precisely this:

  1. Show you how to experiment with word embeddings to build strong intuition around semantic similarities.
  2. Provide enough resources so that with some basic PyTorch knowledge and access to a few dozen GPUs, you can recreate Word2Vec or similar.
  3. Guide you toward generalizing the concept of vector embeddings to any sequence: playlists, customer browsing patterns in your e-shop, or discovering hidden patterns in your user base.

While you don't need to be a math person, you should be comfortable recalling concepts from high school geometry—think vectors, distances, and angles in multi-dimensional space.

Let's start by building intuition. Word2Vec is famous for the king – man + woman = queen example. Simple addition and subtraction of numerical representations of words (vector embeddings) apparently allow us to travel through conceptual space. This vector arithmetic was an emergent property that wasn't explicitly designed into the model, and it turned out to be one of the most surprising discoveries of the original Word2Vec research. It's easy to say in hindsight, but the arithmetic works because the training process naturally captures semantic and syntactic relationships in the vector space.

Let's start with a slightly simpler example: cat + baby = kitten.

import gensim.downloader as api

# Load the pretrained 300-dimensional Google News vectors (a large one-time download)
wv = api.load('word2vec-google-news-300')

# Add the two word vectors, then look up the words closest to the result
res = wv['cat'] + wv['baby']
closest_words = wv.similar_by_vector(res, topn=5)

print(closest_words)
# [('cat', 0.8430671691894531), ('baby', 0.8285756707191467), 
#  ('kitten', 0.7884738445281982), ('puppy', 0.7541406750679016),
#  ('pup', 0.7308820486068726)]
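
Those similarity scores are cosine similarities: the cosine of the angle between two 300-dimensional vectors, where 1.0 means the vectors point in the same direction. You can query them directly; the word pairs below are my own picks, not from the notebook:

# cosine of the angle between two word vectors
print(wv.similarity('cat', 'kitten'))   # should come out relatively high
print(wv.similarity('cat', 'car'))      # should come out noticeably lower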

Don't hesitate to download my notebook and try it yourself. We're here to build intuition, not just read through my examples.

Now, let's reproduce the "king – man + woman = queen" example. We use the most_similar_cosmul method instead of plain addition and subtraction, but if you dig into the implementation or the original paper, it's nothing more than a well-structured, experimentally validated combination of cosine similarities:

wv.most_similar_cosmul(positive=['king', 'woman'], negative=['man'])
# [('queen', 0.9314123392105103),
#  ('monarch', 0.858533501625061),
#  ('princess', 0.8476566672325134),
#  ('Queen_Consort', 0.8150269985198975),
#  ('queens', 0.8099815249443054),
#  ('crown_prince', 0.808997631072998),
#  ('royal_palace', 0.8027306795120239),
#  ('monarchy', 0.8019613027572632),
#  ('prince', 0.800979733467102),
#  ('empress', 0.7958388328552246)]
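
If you're curious what most_similar_cosmul actually does, here's a rough reimplementation of the 3CosMul objective (Levy and Goldberg, 2014) that gensim follows. The helper names and the epsilon handling are my own sketch, but the score for 'queen' should land close to the 0.93 printed above:

import numpy as np

def cos(a, b):
    # cosine of the angle between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    # 3CosMul: shift each cosine from [-1, 1] into [0, 1], then divide the
    # product of the "positive" similarities by the product of the "negative" ones
    pos = np.prod([(1 + cos(wv[candidate], wv[w])) / 2 for w in positive])
    neg = np.prod([(1 + cos(wv[candidate], wv[w])) / 2 for w in negative])
    return pos / (neg + eps)

print(cosmul_score('queen', positive=['king', 'woman'], negative=['man']))
print(cosmul_score('prince', positive=['king', 'woman'], negative=['man']))
# 'queen' should score higher than 'prince', matching the ranking above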

Unlike Ukrainian or French (and many other languages), English doesn't assign grammatical gender to nouns. Still, I've noticed that people often carry implicit assumptions about a noun's gender. I wanted to explore this idea, and I was able to confirm my intuition. This example is taken from one of the sources I'll provide below:

wv.most_similar_cosmul(positive=['shirt', 'woman'], negative=['man'])
# [('blouse', 0.9350481629371643),
#  ('scarf', 0.8647228479385376),
#  ('T_shirt', 0.8582283854484558),
#  ('sleeveless_blouse', 0.8581035733222961),
#  ('bra', 0.8560400605201721),
#  ('sweater', 0.8505845069885254),
#  ('Herve_Leger_dress', 0.8483014702796936),
#  ('floral_blouse', 0.8475623726844788),
#  ('spaghetti_strap_dress', 0.8438350558280945),
#  ('shirts', 0.8410544395446777)]

There are many tutorials and overviews of varying quality online. I'm not going to write another one, but I'd like to share a single insight: with a properly structured and large enough dataset, the model architecture is surprisingly shallow—it uses just one hidden layer. The hidden layer weights become the word embeddings after training.
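
To make that concrete, here's a minimal PyTorch sketch of the skip-gram variant. The vocabulary size and the input below are made up, and real training would add negative sampling and a proper data pipeline; the point is just how little architecture there is:

import torch
import torch.nn as nn

class SkipGram(nn.Module):
    # one embedding ("hidden") layer and one output projection, no non-linearity
    def __init__(self, vocab_size, embedding_dim=300):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)   # these weights become the word vectors
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, center_word_ids):
        hidden = self.embeddings(center_word_ids)   # look up the center word's vector
        return self.output(hidden)                  # scores over the vocabulary for likely context words

model = SkipGram(vocab_size=10_000)                 # hypothetical vocabulary size
logits = model(torch.tensor([42]))                  # predict context words for word id 42
print(logits.shape)                                 # torch.Size([1, 10000])
# after training, model.embeddings.weight is the embedding matrix you'd ship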

For the rest, I'll point you to "The Illustrated Word2vec" for understanding how to build training datasets and to "Vectoring Words (Word Embeddings)" by Computerphile for building intuition about why conceptually similar words end up in similar areas of hyperspace. The Computerphile video was especially insightful for me: they used an example with slightly modified contrast in images and image embeddings to explain why all these images land in approximately the same area. If this doesn't make sense from reading alone, go watch the video!

I hope you've now arrived at some form of generalization. First, vector embeddings can be produced for any form of sequential data. While natural language is clearly sequential, there are many more areas of our daily lives where sequences emerge and where we'd expect patterns in those sequences: listening to music in the form of playlists, navigating websites when looking for a hotel room or a new book to buy, or even finding similar users based on their behavior.
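
Nothing in the training objective cares that the tokens are English words. With gensim's Word2Vec class (4.x API), you can feed in any list of sequences. The playlists below are toy data I made up, just to show the shape of the idea:

from gensim.models import Word2Vec

# each "sentence" is a playlist: a sequence of track IDs instead of words
playlists = [
    ['track_17', 'track_203', 'track_8', 'track_91'],
    ['track_203', 'track_8', 'track_44'],
    ['track_91', 'track_17', 'track_55'],
]

model = Word2Vec(sentences=playlists, vector_size=32, window=3, min_count=1, sg=1)
print(model.wv.most_similar('track_8', topn=2))
# on toy data the neighbors are noise; with real playlists, tracks that
# co-occur in similar contexts end up with nearby vectors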

You can also think beyond sequences—vector embeddings as representations. A representation is an encoded, typically intermediate form of an input. Such representations try to compress the input into a fixed-length vector of floating-point numbers while discovering and preserving regularities in the input.
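
You can see this directly in the vectors we've been playing with: every word, no matter how abstract, gets squeezed into the same fixed-length array of floats:

vec = wv['cat']
print(vec.shape)   # (300,): the same fixed length for every word
print(vec[:5])     # just floating-point numbers, nothing magical about them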

So grab some data from your own domain and start experimenting—you might be surprised by the patterns hiding in your sequences. 👷🏻‍♂️