Embeddings are a clever technique used to convert diverse types of data, such as text, images, songs, or even user behaviour, into points in a mathematical space. These points are not placed randomly; instead, their positions are calculated so that related items are located closer to each other. For example, in a space created for songs, tunes that sound alike would be neighbours, while in a space for images, visually similar pictures would be grouped together. This makes it easier to find items that are alike, whether you are looking for songs that have the same vibe, images that share visual themes, or customers with comparable shopping patterns.
The real power of embeddings becomes clear when you want to search for things that are similar. Imagine you are dealing with movie plots. By using embeddings, you could quickly find movies with related stories. Or, if you are looking at news articles, you could find other articles that cover related topics. This is not just about finding synonyms for words; it is about uncovering an entire world of related data, from movies to articles, based on the content's deeper meanings and connections.
Moreover, embeddings allow us to measure how similar two pieces of data are by calculating the distance between their points in this space. This similarity measurement can be used in various applications, such as identifying duplicates, recognizing faces, or even correcting typos. It is fascinating because embeddings can even capture complex relationships and nuances in data, such as grammatical and semantic regularities in text, allowing for neat tricks like solving word analogies (for example, "king" - "man" + "woman" lands near "queen"). This approach of mapping data into a structured space unlocks a multitude of possibilities for understanding and interacting with information in more intuitive and meaningful ways.
How can vector embeddings help find similarity in a text?
Let us take two sentences to work through an example:
"The quick brown fox jumps over the lazy dog."
"A fast brown fox leaps over the sleepy dog."
Now, let us break down the process:
Tokenization:
Sentence 1 Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Sentence 2 Tokens: ["A", "fast", "brown", "fox", "leaps", "over", "the", "sleepy", "dog"]
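As a rough illustration, a minimal tokenizer along these lines can be sketched in Python; a real pipeline would normally use the tokenizer that ships with the embedding model rather than this simple regular expression.

```python
import re

def tokenize(sentence):
    # Keep alphabetic word tokens and drop punctuation; production pipelines
    # usually rely on a model-specific tokenizer (e.g. WordPiece for BERT).
    return re.findall(r"[A-Za-z]+", sentence)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(tokenize("A fast brown fox leaps over the sleepy dog."))
# ['A', 'fast', 'brown', 'fox', 'leaps', 'over', 'the', 'sleepy', 'dog']
```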
Conversion to Embeddings:
Each token (word) is mapped to a unique vector in a high-dimensional space. This vector is determined by an embedding layer in a neural network, which has been trained on a large corpus of text to position words with similar meanings close to one another.
For instance, the word "quick" might be represented as a vector [0.5, 1.2, -0.3], while "fast" might be [0.51, 1.19, -0.31] — remarkably similar because the words are synonyms.
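As a sketch of that lookup, a plain dictionary can stand in for the trained embedding layer; the numbers below are only the illustrative values quoted above, not real trained weights.

```python
# Toy lookup table standing in for a trained embedding layer; the numbers
# are the illustrative values from the text, not real trained weights.
embedding_table = {
    "quick": [0.5, 1.2, -0.3],
    "fast":  [0.51, 1.19, -0.31],
}

def embed(token):
    # A trained model does essentially the same thing: it maps a token id
    # to a row of a learned weight matrix.
    return embedding_table[token.lower()]

print(embed("quick"))  # [0.5, 1.2, -0.3]
print(embed("fast"))   # [0.51, 1.19, -0.31]
```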
Calculating Similarity:
The similarity between the two sentence embedding vectors is calculated using cosine similarity, which measures the cosine of the angle between them. The similarity score is 0.6396 on a scale from 0 to 1 (the range for the non-negative count vectors used here), where 1 means the vectors point in exactly the same direction. A score of approximately 0.64 suggests that the sentences are fairly similar to each other, at least in the context of the simple Bag of Words embedding model used here.
The cosine similarity calculation is the dot product of the vectors divided by the product of their magnitudes: cos(θ) = (A · B) / (|A| |B|). It measures the orientation of the two vectors in space rather than their length, and in the context of embeddings this orientation is interpreted as similarity.
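A self-contained sketch of that calculation on simple Bag of Words count vectors is shown below; the exact score depends on tokenization and preprocessing choices, so it will not necessarily reproduce the 0.6396 figure quoted above.

```python
import math
from collections import Counter

tokens_1 = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
tokens_2 = ["A", "fast", "brown", "fox", "leaps", "over", "the", "sleepy", "dog"]

def bag_of_words(tokens):
    # Count how often each (lower-cased) word appears in the sentence.
    return Counter(token.lower() for token in tokens)

def cosine_similarity(vec_a, vec_b):
    # Dot product over the vocabulary, divided by the product of the
    # two vector magnitudes.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
    norm_a = math.sqrt(sum(count ** 2 for count in vec_a.values()))
    norm_b = math.sqrt(sum(count ** 2 for count in vec_b.values()))
    return dot / (norm_a * norm_b)

vec_1 = bag_of_words(tokens_1)
vec_2 = bag_of_words(tokens_2)
print(round(cosine_similarity(vec_1, vec_2), 4))  # about 0.60 with this preprocessing
```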
How are they created?
Vector embeddings are created through a process that involves mapping each token or piece of data to a vector of real numbers. This process is typically performed using neural networks or matrix factorization techniques.
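As one concrete example of the neural-network route, here is a minimal sketch of an (untrained) embedding layer in PyTorch, assuming torch is installed; each token id simply indexes a row of a learnable weight matrix, and training nudges those rows so that related tokens end up close together.

```python
import torch
import torch.nn as nn

vocabulary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
token_to_id = {token: i for i, token in enumerate(vocabulary)}

# One learnable row per vocabulary entry; 4 dimensions keeps the demo small.
embedding_layer = nn.Embedding(num_embeddings=len(vocabulary), embedding_dim=4)

token_ids = torch.tensor([token_to_id["quick"], token_to_id["fox"]])
vectors = embedding_layer(token_ids)  # the vectors are random until trained
print(vectors.shape)  # torch.Size([2, 4])
```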
Vector embeddings are developed through different techniques, each with unique characteristics and suited for various applications.
TF-IDF (Term Frequency-Inverse Document Frequency): This is a statistical measure that evaluates how relevant a word is to a document within a collection of documents. By considering the frequency of words in a document and offsetting this by the number of documents the words appear in, TF-IDF can highlight words that are more specific to a particular document. It is good for identifying important terms within documents, but it does not capture the meaning of words or their semantic relationships, so it is mainly used in tasks like information retrieval and keyword extraction rather than semantic analysis.
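As a hedged illustration, scikit-learn's TfidfVectorizer can compute these weights for the two example sentences (this assumes scikit-learn is installed; it is just one common implementation of the idea).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over the sleepy dog.",
]

# Each document becomes a sparse vector of TF-IDF weights over the vocabulary.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray().round(3))     # one TF-IDF vector per document
```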
Word2Vec: This technique relies on a shallow neural network and comes in two flavours: Continuous Bag-of-Words (CBOW), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word. Both learn meanings and semantic relationships between words from these prediction tasks. Word2Vec is particularly useful for semantic analysis tasks, as it can capture nuanced word associations which are essential in understanding the meaning of texts.
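A small sketch with gensim's Word2Vec implementation (assuming gensim 4.x is installed) shows the idea; a real model would be trained on a far larger corpus than two sentences, so the vectors here are only illustrative.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["a", "fast", "brown", "fox", "leaps", "over", "the", "sleepy", "dog"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["fox"][:5])                   # first few dimensions of "fox"
print(model.wv.similarity("quick", "fast"))  # cosine similarity of two words
```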
GloVe (Global Vectors for Word Representation): Instead of focusing on local context as in Word2Vec, GloVe captures global word-word co-occurrence in a corpus, effectively addressing the limitation of local context. It uses matrix factorization techniques and is particularly strong at tasks like word analogy and named-entity recognition. GloVe often yields comparable or even superior results to Word2Vec in certain semantic analysis tasks.
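In practice, GloVe vectors are usually downloaded pretrained rather than trained from scratch. The sketch below loads them from a local text file; "glove.6B.50d.txt" is a hypothetical path to one of the published downloads, where each line is a word followed by its vector components.

```python
import numpy as np

# Load pretrained vectors: each line is a word followed by its components.
glove = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove[word] = np.array(values, dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(glove["quick"], glove["fast"]))  # synonyms land close together
```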
BERT (Bidirectional Encoder Representations from Transformers): This more advanced technique uses transformers, which employ an attention mechanism to capture contextual information from both directions in a text (left-to-right and right-to-left). BERT is excellent at understanding the context of words in a sentence and, as such, is a powerhouse for language translation, question-answering systems, and even for improving the relevance of search engine results, as seen with its deployment in Google Search.
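A hedged sketch with the Hugging Face transformers library (assuming transformers and torch are installed, and using the commonly available "bert-base-uncased" checkpoint) shows how contextual token vectors can be pooled into a single sentence embedding; mean pooling is just one simple choice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the contextual token vectors into one sentence embedding.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```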
Vector embeddings help AI systems both understand and generate text and images. For text, such as stories or articles, they turn words into numerical vectors that a language model uses to produce writing that makes sense. For images, they help models like DALL·E turn a text description into a picture that matches it.