Vector Embeddings For Long Documents

Published by Mika Berglund


How do you create vector embeddings for long documents or other long texts? You might find yourself asking this question when working with vector embeddings. This article is a follow-up to my previous article, where I showed you how to create vector embeddings with Azure AI Foundry and store them in Cosmos DB.

Problem Description

Currently, all embedding models in Azure AI Foundry have a limit on how long the input text can be. The limit varies from model to model, but every embedding model has one. You can read more about the details in my previous article. The main point is: if your input text is longer than the limit, you need to split it into chunks and generate the vectors for each chunk separately, as suggested on Microsoft Learn.

This means that your input text will be represented by more than one vector embedding. That might be an acceptable solution in some cases. However, I think that in most cases you would also want a vector embedding that represents the entire input text, regardless of how long that text is.
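To make this more concrete, here is a minimal sketch of the chunk-and-embed step in Python, using the openai package’s AzureOpenAI client. The endpoint, API key, API version and deployment name are placeholders, and the character-based chunking is deliberately naive; a production implementation should chunk by tokens and respect sentence boundaries.

# Minimal sketch of chunking a long text and embedding each chunk.
# Endpoint, key, API version and deployment name are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-02-01",
)

def chunk_text(text, max_chars=4000):
    # Naive fixed-size chunking by character count.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_chunks(text):
    # Returns one (embedding, total_tokens) tuple per chunk.
    results = []
    for chunk in chunk_text(text):
        response = client.embeddings.create(
            input=chunk,
            model="text-embedding-3-small",  # your deployment name
        )
        results.append((response.data[0].embedding,
                        response.usage.total_tokens))
    return results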

Combining Multiple Vectors Into One

One good solution is to create a vector embedding for each chunk of input text, and then combine the chunk vectors into a single vector that represents the entire input text. So how would you combine them?

Remember that the length of a vector generated by an embedding model is constant. For instance, the text-embedding-3-small embedding model generates a vector with 1536 dimensions by default. You can therefore simply take the average of each dimension across your chunk vectors. This already produces quite a good result.
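As a sketch, the plain element-wise average could look like this (all chunk vectors have the same length, since the embedding model always produces vectors of the same dimensionality):

def average_vectors(vectors):
    # Element-wise mean of equal-length vectors.
    count = len(vectors)
    return [sum(dims) / count for dims in zip(*vectors)]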

However, you can make the result even better by computing a weighted average instead. When you use embedding models in Azure AI Foundry, the response contains, in addition to the embedding vector, a usage figure that tells you how many tokens were consumed to produce that vector embedding. This token count is a good candidate for the weight: the more tokens were consumed to generate an embedding vector, the longer and more complex the input text was, so it should also have a bigger influence on the resulting average. Let me explain using an example.

Let’s take the following two short and simple vector embedding responses.

// Vector Embedding #1

{
    "data": [
        {
            "embedding": [1, 2, 1]
        }
    ],
    "usage": {
        "total_tokens": 10
    }
}
// Vector Embedding #2

{
    "data": [
        {
            "embedding": [1, 2, 5]
        }
    ],
    "usage": {
        "total_tokens": 2
    }
}

These two vector embedding responses are heavily simplified for clarity. Each vector has 3 dimensions, so the resulting vector will also have 3 dimensions.

If the dimension value in the first vector is D1, the dimension value in the second vector is D2, the number of tokens consumed for the first vector is T1 and the number consumed for the second vector is T2, then each dimension value of the resulting vector embedding (Dr) is calculated as below.

Dr = ((D1 * T1) + (D2 * T2)) / (T1 + T2)

This would result in the following dimension values using the sample vectors from above.

  1. ((1 * 10) + (1 * 2)) / (10 + 2) = 1
  2. ((2 * 10) + (2 * 2)) / (10 + 2) = 2
  3. ((1 * 10) + (5 * 2)) / (10 + 2) ≈ 1.67

So the resulting vector using the weighted average is [1, 2, 1.67]. A plain average without weights would have produced the vector [1, 2, 3].
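Generalized to any number of chunk vectors, the weighted average could look like the following sketch, with the per-chunk token counts from the usage data as weights:

def weighted_average_vectors(vectors, weights):
    # Element-wise weighted mean; weights are the per-chunk token counts.
    total = sum(weights)
    return [sum(d * w for d, w in zip(dims, weights)) / total
            for dims in zip(*vectors)]

# The example from above:
# weighted_average_vectors([[1, 2, 1], [1, 2, 5]], [10, 2])
# returns [1.0, 2.0, 1.666...]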

Conclusion

Using a weighted average is better than a plain average, since your chunks can vary in both size and complexity. However, this solves only half of the problem: you still have to figure out how to produce the chunks from your long input text.

I am currently working on a library that will help you with both chunking and calculating vector embeddings for chunked text. I will publish a link to that library when it’s ready, so stay tuned.

