
Now that you understand the mechanisms behind ChromaDB, you’re ready to tackle a real-world scenario. Say you have a library of thousands of documents, and you need a way to search through them.

In particular, you want to be able to make queries that point you to relevant documents. For example, if your query is find me documents containing financial information, then you want whatever system you use to point you to a financial document in your library.

How would you design this system? With your knowledge of vectors and embeddings, your first inclination might be to run all of the documents through an embedding algorithm and store the documents and embeddings together. You’d then convert a new query to an embedding and use cosine similarity to find the documents that are most relevant to the query.

While you’re perfectly capable of writing the code for this, you’re sure there has to be something out there to do this for you. Enter vector databases!

What Is a Vector Database?

A vector database is a database that allows you to efficiently store and query embedding data. Vector databases extend the capabilities of traditional relational databases to embeddings. However, the key distinguishing feature of a vector database is that query results aren’t an exact match to the query. Instead, using a specified similarity metric, the vector database returns embeddings that are similar to a query.

As an example use case, suppose you’ve stored company documents in a vector database. This means each document has been embedded and can be compared to other embeddings through a similarity metric like cosine similarity.

The vector database will accept a query like how much revenue did the company make in Q2 2023 and embed the query. It’ll then compare the embedded query to other embeddings in the vector database and return the documents that have embeddings that are most similar to the query embedding.

In this example, perhaps the most similar document says something like Company XYZ reported $15 million in revenue for Q2 2023. The vector database identified the document that had an embedding most similar to how much revenue did the company make in Q2 2023, which likely had a high similarity score based on the document’s semantics.

To make this possible, vector databases are equipped with features that balance the speed and accuracy of query results. Here are the core components of a vector database that you should know about:

  • Embedding function: When using a vector database, oftentimes you’ll store and query data in its raw form, rather than uploading embeddings themselves. Internally, the vector database needs to know how to convert your data to embeddings, and you have to specify an embedding function for this. For text, you can use the embedding functions available in the SentenceTransformers library or any other function that maps raw text to vectors.

  • Similarity metric: To assess embedding similarity, you need a similarity metric like cosine similarity, the dot product, or Euclidean distance. As you learned previously, cosine similarity is a popular choice, but choosing the right similarity metric depends on your application.

  • Indexing: When you’re dealing with a large number of embeddings, comparing a query embedding to every embedding stored in the database is often too slow. To overcome this, vector databases employ indexing algorithms that group similar embeddings together.

    At query time, the query embedding is compared to a smaller subset of embeddings based on the index. Because the embeddings recommended by the index aren’t guaranteed to have the highest similarity to the query, this is called approximate nearest neighbor search.

  • Metadata: You can store metadata with each embedding to help give context and make query results more precise. You can filter your embedding searches on metadata much like you would in a relational database. For example, you could store the year that a document was published as metadata and only look for similar documents that were published in a given year.

  • Storage location: With any kind of database, you need a place to store the data. Vector databases can store embeddings and metadata both in memory and on disk. Keeping data in memory allows for faster reads and writes, while writing to disk is important for persistent storage.

  • CRUD operations: Most vector databases support create, read, update, and delete (CRUD) operations. This means you can maintain and interact with data like you would in a relational database.

There’s a whole lot more detail and complexity that you could explore with vector databases, but these core concepts should be enough to get you going. Next up, you’ll get your hands dirty with ChromaDB, one of the most popular and user-friendly vector databases around.

Meet ChromaDB for LLM Applications

ChromaDB is an open-source vector database designed specifically for LLM applications. ChromaDB offers you both a user-friendly API and impressive performance, making it a great choice for many embedding applications. To get started, activate your virtual environment and run the following command:

If you have any issues installing ChromaDB, take a look at the troubleshooting guide for help.

Because you have a grasp on vectors and embeddings, and you understand the motivation behind vector databases, the best way to get started is with an example. For this example, you’ll store ten documents to search over. To illustrate the power of embeddings and semantic search, each document covers a different topic, and you’ll see how well ChromaDB associates your queries with similar documents.

You’ll start by importing dependencies, defining configuration variables, and creating a ChromaDB client:

You first import chromadb and then import the embedding_functions module, which you’ll use to specify the embedding function. Next, you specify the location where ChromaDB will store the embeddings on your machine in CHROMA_DATA_PATH, the name of the embedding model that you’ll use in EMBED_MODEL, and the name of your first collection in COLLECTION_NAME.

You then instantiate a PersistentClient object that writes your embedding data to CHROMA_DATA_PATH. By doing this, you ensure that the data is stored on disk at CHROMA_DATA_PATH and persists across new clients. Alternatively, you can use chromadb.Client() to instantiate a ChromaDB instance that only writes to memory and doesn't persist to disk.

Next, you instantiate your embedding function and the ChromaDB collection to store your documents in:

You specify an embedding function from the SentenceTransformers library. ChromaDB will use this to embed all your documents and queries. In this example, you’ll continue using the "all-MiniLM-L6-v2" model. You then create your first collection.

A collection is the object that stores your embedded documents along with any associated metadata. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named demo_docs, it uses the "all-MiniLM-L6-v2" embedding function that you instantiated, and it uses cosine distance as its similarity metric, as specified by metadata={"hnsw:space": "cosine"}.

The last step in setting up your collection is to add documents and metadata:

In this block, you define a list of ten documents in documents and specify the genre of each document in genres. You then add the documents and genres using collection.add(). Each document in the documents argument is embedded and stored in the collection. You also have to define the ids argument to uniquely identify each document and embedding in the collection. You accomplish this with a list comprehension that creates a list of ID strings.

The metadatas argument is optional, but most of the time, it’s useful to store metadata with your embeddings. In this case, you define a single metadata field, "genre", that records the genre of each document. When you query a document, metadata provides you with additional information that can be helpful to better understand the document’s contents. You can also filter on metadata fields, just like you would in a relational database query.

With documents embedded and stored in a collection, you’re ready to run some semantic queries:

In this example, you query the demo_docs collection for documents that are most similar to the sentence Find me some delicious food!. You accomplish this using collection.query(), where you pass your queries in query_texts and specify the number of similar documents to find with n_results. In this case, you only asked for the single document that’s most similar to your query.

The results returned by collection.query() are stored in a dictionary with the keys ids, distances, metadatas, embeddings, and documents. This is the same information that you added to your collection at the beginning, but it’s filtered down to match your query. In other words, collection.query() returns all of the stored information about documents that are most similar to your query.

As you can see, the embedding for Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens was most similar to the query Find me some delicious food!. You probably agree that this document is the closest match. You can also see the ID, metadata, and distance associated with the matching document embedding. Here, you’re using cosine distance, which is one minus the cosine similarity between two embeddings.

With collection.query(), you’re not limited to single queries or single results:

Here, you pass two queries into collection.query(), Teach me about history and What’s going on in the world. You also request the two most similar documents for each query by specifying n_results=2. Lastly, by passing include=["documents", "distances"], you ensure that the dictionary only contains the documents and their embedding distances.

Calling query_results["documents"][0] shows you the two most similar documents to the first query in query_texts, and query_results["distances"][0] contains the corresponding embedding distances. As an example, the cosine distance between Teach me about history and Einstein’s theory of relativity revolutionized our understanding of space and time is about 0.627.

Similarly, query_results["documents"][1] shows you the two most similar documents to the second query in query_texts, and query_results["distances"][1] contains the corresponding embedding distances. For this query, the two most similar documents weren’t as strong of a match as in the first query. Recall that cosine distance is one minus cosine similarity, so a cosine distance of 0.80 corresponds to a cosine similarity of 0.20.

Another awesome feature of ChromaDB is the ability to filter queries on metadata. To motivate this, suppose you want to find the single document that’s most related to music history. You might run this query:

Your query is Teach me about music history, and the most similar document is Einstein’s theory of relativity revolutionized our understanding of space and time. While Einstein is a historical figure who was a musician and teacher, this isn’t quite the result that you’re looking for. Because you’re particularly interested in music history, you can filter on the "genre" metadata field to search over more relevant documents:

In this query, you specify in the where argument that you’re only looking for documents with the "music" genre. To apply filters, ChromaDB expects a dictionary where the keys are metadata names and the values are dictionaries specifying how to filter. In plain English, you can interpret {"genre": {"$eq": "music"}} as filter the collection where the "genre" metadata field equals "music".

As you can see, the document about Beethoven’s Symphony No. 9 is the most similar document. Of course, for this example, there’s only one document with the music genre. To make it slightly more difficult, you could filter on both history and music:

This query filters the collection down to documents that have either a music or history genre, as specified by where={"genre": {"$in": ["music", "history"]}}. As you can see, the Beethoven document is still the most similar, while the American Revolution document is a close second. These were straightforward filtering examples on a single metadata field, but ChromaDB also supports other filtering operations that you might need.

If you want to update existing documents, embeddings, or metadata, then you can use collection.update(). This requires you to know the IDs of the data that you want to update. In this example, you’ll update both the documents and metadata for "id1" and "id2":

Here, you replace the documents stored under "id1" and "id2", and you also modify their metadata. To confirm that your update worked, you call collection.get(ids=["id1", "id2"]) and can see that you’ve successfully updated both documents and their metadata. If you’re not sure whether a document already exists for an ID, then you can use collection.upsert(). This works the same way as collection.update(), except it inserts new documents for IDs that don’t exist.

Lastly, if you want to delete any items in the collection, then you can use collection.delete():

In this block, you use collection.delete(ids=["id1", "id2"]) to delete all data associated with "id1" and "id2". You then verify the deletion of these two documents by calling collection.count(), and you can see that collection.get(["id1", "id2"]) has no data.

You’ve now seen many of ChromaDB’s main features, and you can learn more with the getting started guide or API cheat sheet. You used a collection of ten hand-crafted documents that allowed you to get familiar with ChromaDB’s syntax and querying functionality, but this was by no means a realistic use case. In the next section, you’ll see ChromaDB shine while you embed and query over thousands of real-world documents!


