I've been experimenting a bit with vector embeddings and semantic search for a proof of concept for a customer of ours.
The way it works in Thinkwise is clear to me, and I think I can get that working.
However, all the examples in the documentation are based on small texts, like an email or a chat history. In my POC I'm dealing with legal documents of more than 80 pages that I need to search through.
From some research it seems that the best approach is to chop the text into chunks, which can be done in various ways.
ChatGPT suggests a standard approach of chopping the text into chunks of 800 tokens with an overlap of 400.
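To make it concrete, this is roughly what I have in mind for the chunking step. It's just a sketch: it uses whitespace-separated words as a rough proxy for tokens (for exact token counts you'd run a real tokenizer such as tiktoken), and the 800/400 numbers are the values mentioned above, not something I've validated:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    """Split text into overlapping chunks.

    Words are used as a stand-in for tokens here; swap in a proper
    tokenizer (e.g. tiktoken) if you need exact token counts.
    """
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks
```

Each chunk would then be embedded separately, and a search hit points you back to the part of the document it came from.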
At the moment I use the vector embeddings of ChatGPT, but this can become quite expensive, so I'd prefer to do as much as possible locally.
I was wondering if someone already has experience with a best practice for doing so. Maybe a Python service that can be called via an API, or something else? Or any other tips/tricks?