
What Are Embeddings in Machine Learning?

Embeddings help AI understand words and data, powering search engines, LLMs, and recommendations.

Without embeddings, the AI industry and tech in general would be virtually unrecognizable. LLMs wouldn’t understand you, search engines would have no clue what you’re looking for, and all other recommendation systems would spit out random junk.

Follow along and we’ll explore how embeddings work and their importance in machine learning.

What Are Embeddings?

Machines don’t understand words, but they do understand numbers. When you write code in any programming language, through compilation or interpretation, it eventually ends up as binary machine code, a numerical format the machine can execute.

In AI, particularly with machine learning, the model needs to understand information. This is where embeddings come in. Using embeddings, we can transform words, images and any other type of information into machine readable numbers. This allows AI to find patterns, relationships, and meaning.

Machines understand numbers, not words. Embeddings are the bridge between human data and AI.

Why Embeddings Matter

Imagine a world where you search for a pizza place and get recommendations for tacos. Or imagine asking ChatGPT or Claude for Python tips and receiving instructions on how to take care of a pet python!

Embeddings allow models to understand your intent. Without them, most systems would work by matching your exact text to something in their database.

  • Search Engines: Embeddings help Google understand what you’re actually looking for.
  • LLMs: With embeddings, these models can understand what you’re actually saying. Without them, LLMs would fail to find your meaning… remember the Python tips?
  • Recommendations: Companies like Netflix use them along with filtering and a few other techniques to recommend shows you’ll actually enjoy.

Embeddings allow machines to not just read data, but actually understand it.

Vectors: The Language of Embeddings

In its simplest form, a vector is just a list. Imagine you wish to represent a list of laptops. Each laptop has details like OS, CPU manufacturer, processing cores, and RAM.


If we have two laptops, they might be represented like this:

  • Windows Laptop: ["Windows", "Intel", 4, 8]
  • Chromebook: ["ChromeOS", "Mediatek", 8, 4]
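As a quick sketch in Python, each laptop vector is just a list, with each position holding one trait:

```python
# Two laptops represented as simple vectors (Python lists).
# Each position holds one trait: [OS, CPU manufacturer, cores, RAM in GB].
windows_laptop = ["Windows", "Intel", 4, 8]
chromebook = ["ChromeOS", "Mediatek", 8, 4]

# Positions are meaningful: index 2 is always the core count.
print(windows_laptop[2])  # 4
print(chromebook[3])      # 4
```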

Matrices: Combining Vectors Into Tables

A matrix is a list of lists. Technical purists will correct me here and call it a vector of vectors… but as we established earlier, a vector is just a list. When humans look at a matrix, we view it as a table.

Here is our human readable matrix.

OS         CPU Manufacturer   Processor Cores   RAM (GB)
Windows    Intel              4                 8
ChromeOS   Mediatek           8                 4

Our matrix is a vector of vectors (a list of lists). As you can see, this is tougher for a human to read, but still understandable. For a machine, it’s actually easier to parse than the table above, but we’re still not optimized for machine readability.

[
    ["Windows", "Intel", 4, 8],
    ["ChromeOS", "Mediatek", 8, 4]
]

For it to be truly machine readable, we need to replace words with numbers. We’ll assign a number to represent each of our non-numerical traits.

OS

  • Windows: 0
  • ChromeOS: 1

CPU Manufacturer:

  • Intel: 0
  • Mediatek: 1

At this point, our “table” completely loses human readability. However, machines handle numbers extremely well. This allows machines to efficiently process this data to find relationships.

[
    [0, 0, 4, 8],
    [1, 1, 8, 4]
]

This is perfect for a machine to look at. Machines don’t read words, but they can detect patterns in numbers. In this format, a model can effectively analyze our data and look for patterns.
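The encoding above can be sketched in a few lines of Python:

```python
# Map each non-numerical trait to an integer code.
os_codes = {"Windows": 0, "ChromeOS": 1}
cpu_codes = {"Intel": 0, "Mediatek": 1}

laptops = [
    ["Windows", "Intel", 4, 8],
    ["ChromeOS", "Mediatek", 8, 4],
]

# Replace the string traits with their numeric codes.
encoded = [
    [os_codes[os], cpu_codes[cpu], cores, ram]
    for os, cpu, cores, ram in laptops
]

print(encoded)  # [[0, 0, 4, 8], [1, 1, 8, 4]]
```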

How Embeddings Work

Word Grouping Based On Context

Embeddings go far beyond the numerical encoding we created above. Embeddings allow us to convert large sets of data into more complex matrices that you or I wouldn’t be able to comprehend without extensive analysis.

With embeddings, AI can actually analyze this data and apply formulas to find relationships. “King” and “Queen” are similar concepts, so their vectors end up close together in the embedding space.

With vectors, we can actually perform math. Machines are much better at it than we are. A machine might view their relationship with the formula you see below.

  • King - Man + Woman = Queen
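We can sketch this in Python with toy 3-dimensional vectors, invented here purely for illustration (real embeddings have hundreds of learned dimensions):

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real embeddings (e.g. from Word2Vec) are learned from data.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed element by element...
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# ...lands closest to queen.
closest = max(vectors, key=lambda word: cosine_similarity(result, vectors[word]))
print(closest)  # queen
```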

Supervised and Unsupervised Embeddings

There are two main types of embeddings: Supervised and Unsupervised.

Supervised Embeddings


If we train a model on structured data with labels and mappings, this is called supervised learning, and it produces supervised embeddings. The AI is being explicitly taught by a human.

Common Uses

  • Email: Certain types of email are mapped as either spam or not spam.
  • Images: A model is trained on labeled images of cats and dogs.

With supervised embeddings, humans already know a pattern and teach it to the machine.

Unsupervised Embeddings


Unsupervised embeddings are unstructured and unlabeled. The model scans massive amounts of data. Then it groups together words and characters that commonly appear together. This allows the model to discover patterns rather than learn them directly from a human. With enough discovery, these patterns can lead to prediction.

Common Uses

  • LLMs: Large Language Models are designed to scan large datasets of words and accurately predict how they fit together.
  • Autocomplete and Spellcheck: A more primitive form of the same concept, designed to predict the characters and words you’re about to type.
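As a crude sketch of the prediction idea, here is a tiny next-word predictor built from nothing but word-pair counts (the corpus is a toy invented for illustration):

```python
from collections import Counter, defaultdict

# A crude sketch of the autocomplete idea: count which word follows which,
# then predict the most frequent follower.
corpus = "the cat sat on the mat the cat ate the fish".split()

followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    # Return the word that most often followed `word` in the corpus.
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # cat
```

Real unsupervised models discover far richer patterns, but the principle is the same: frequency of co-occurrence drives prediction.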

How Embeddings Are Created

Steps To Create Embeddings

Embeddings aren’t just assigned by humans; they’re learned. To learn similarities, patterns, and eventually relationships, a model needs to be trained on a massive amount of data.

Step 1: Collecting the Data

A model needs a large dataset to train on. If you train your model using Wikipedia, it will learn facts from Wikipedia and speak like Wikipedia. Our Web Scraper API can help you extract high quality data in real time.

You can train your model on pretty much anything.

  • Text: Books, PDFs, websites etc.
  • Images: Labeled images, pixel relationships
  • User Interactions: Product recommendations, browser behavior

Step 2: Converting the Data Into Vectors

As we learned earlier, machines don’t perform well with human readable data. The data collected from our previous step needs to be converted into numerical vectors.

There are two types of encoding:

  • One-Hot Encoding: This method is more basic. In this format, the model cannot capture relationships in the data.
  • Dense Embeddings: These are more common in modern AI. Closely related objects (King and Queen) are grouped closely together within the matrix.

Step 3: Training the Model

To create embeddings, models use machine learning techniques like the ones outlined below.

  1. Word Co-Occurrence (Word2Vec, GloVe)
    • The model scans massive amounts of text in order to analyze relationships and learn.
    • Words occurring in similar context are grouped closely within the vector.
    • “Paris” is located close to “France” in the vector but far from “Pizza”.
  2. Contextual Learning (BERT, GPT)
    • Transformer models are designed to understand the context of an entire sentence.
    • Models can capture multiple meanings of words based on context.
    • “River bank” has a completely different meaning than “money in the bank” and transformer models understand this.
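The co-occurrence idea behind methods like GloVe can be sketched with simple pair counting (toy sentences invented here; real models use huge corpora and weighted context windows):

```python
from collections import Counter
from itertools import combinations

# Toy sketch of co-occurrence counting: words that keep appearing
# in the same sentences accumulate high counts together.
sentences = [
    "paris is the capital of france",
    "france borders spain and paris is its capital",
    "pizza is an italian dish",
]

co_occurrence = Counter()
for sentence in sentences:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        co_occurrence[(a, b)] += 1

print(co_occurrence[("france", "paris")])  # 2
print(co_occurrence[("paris", "pizza")])   # 0
```

From counts like these, training then places “Paris” and “France” close together in the vector space and “Pizza” far away.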

Step 4: Fine-Tuning

Once a model has been trained, it needs to be fine-tuned. To fine-tune a model, its embeddings are tweaked to fit its purpose for specific tasks.

  • Search engines refine their embeddings to better understand queries.
  • Recommendation systems often adjust their embeddings based on user behavior.
  • LLMs require periodic fine-tuning to adjust their embeddings based on new data.

Conclusion

Embeddings are an integral part of not only the modern AI industry, but the tech industry as a whole. They underpin everything from search results to LLMs. With our datasets, you get access to vast amounts of good data to train your model.

Sign up now and start your free trial, including dataset samples.

No credit card required