Introduction to Vector Stores and Embeddings with LangChain

In this post, we explore vector stores and embeddings, which are crucial components for building chatbots and performing semantic search on data corpora.

Workflow

Recall the entire workflow of Retrieval Augmented Generation (RAG):

RAG Workflow

We start with documents, create smaller splits of these documents, generate embeddings for these splits, and store them in a vector store. A vector store is a database that allows easy lookup of similar vectors later.

Vector Store

Setup

Set the appropriate environment variables and load the documents we will be working with, the cs229_lectures PDFs:

from dotenv import load_dotenv, find_dotenv
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

_ = load_dotenv(find_dotenv())  # loads OPENAI_API_KEY into the environment
loaders = [
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"), # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture03.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_documents(docs)
print("Length of splits: ", len(splits)) # Length of splits:  209

Embeddings

Now that we have split our documents into smaller, semantically meaningful chunks, it's time to create embeddings for them. An embedding takes a piece of text and produces a numerical vector representation of it, such that texts with similar content end up with similar vectors in this numerical space. This lets us compare vectors to find similar text snippets.

Embeddings

To illustrate, let's try some toy examples:

from langchain_openai import OpenAIEmbeddings
import numpy as np

embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
print(np.dot(embedding1, embedding2))  # 0.9631227500523609
print(np.dot(embedding1, embedding3))  # 0.7703257495981695
print(np.dot(embedding2, embedding3))  # 0.7591627401108028

As expected, the first two sentences about pets have very similar embeddings (dot product 0.96), while the sentence about weather is less similar to both pet-related sentences (dot products 0.77 and 0.76).
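A quick note on the scores above: OpenAI embeddings are returned normalized to unit length, so the dot product of two embeddings equals their cosine similarity. A minimal sketch with hypothetical 2-D unit vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity: dot product divided by the product of the norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical 2-D unit vectors standing in for real embeddings
v1 = np.array([0.6, 0.8])
v2 = np.array([0.8, 0.6])
print(np.dot(v1, v2))             # 0.96
print(cosine_similarity(v1, v2))  # 0.96 -- identical, because both norms are 1
```

For unnormalized vectors the two quantities would differ, which is why cosine similarity is the safer default when comparing embeddings from arbitrary sources.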

Vector Store

Next, we store these embeddings in a vector store, which will enable us to easily look up similar vectors later when trying to find relevant documents for a given question.

Vector Store

In this post, we'll use the Chroma vector store because it's lightweight and can run in-memory (with optional persistence to disk), making it easy to get started:

from langchain_community.vectorstores import Chroma

persist_directory = "docs/chroma/"
# this code is only for ipynb files
# !rm -rf ./docs/chroma  # remove old database files if any 
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)
print(vectordb._collection.count())  # 209

Similarity Search

How similarity search works:

Similarity Search
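Conceptually, similarity search scores the query embedding against every stored embedding and returns the top-k matches. Real vector stores use approximate indexes to avoid a full scan, but a brute-force sketch over hypothetical vectors captures the idea:

```python
import numpy as np

def brute_force_search(query_vec, stored_vecs, k=3):
    # score every stored vector against the query; higher dot product = more similar
    scores = stored_vecs @ query_vec
    top = np.argsort(scores)[::-1][:k]  # indices of the k best scores
    return top, scores[top]

# three hypothetical stored embeddings; the first two are close to the query
store = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])
indices, scores = brute_force_search(query, store, k=2)
print(indices)  # [0 1]
```

Chroma does the equivalent of this scoring for us when we call `similarity_search`: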

question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
print("Length of context docs: ", len(docs))  # 3
print(docs[0].page_content)

Output:

cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So 
rather than sending us email individually, if you send email to this account, it will 
actually let us get back to you maximally quickly with answers to your questions.  
If you're asking questions about homework probl ems, please say in the subject line which 
assignment and which question the email refers to, since that will also help us to route 
your question to the appropriate TA or to me  appropriately and get the response back to 
you quickly.  
Let's see. Skipping ahead — let's see — for homework, one midterm, one open and term 
project. Notice on the honor code. So one thi ng that I think will help you to succeed and 
do well in this class and even help you to enjoy this cla ss more is if you form a study 
group.  
So start looking around where you' re sitting now or at the end of class today, mingle a 
little bit and get to know your classmates. I strongly encourage you to form study groups 
and sort of have a group of people to study with and have a group of your fellow students 
to talk over these concepts with. You can also  post on the class news group if you want to 
use that to try to form a study group.  
But some of the problems sets in this cla ss are reasonably difficult.  People that have 
taken the class before may tell you they were very difficult. And just I bet it would be 
more fun for you, and you'd probably have a be tter learning experience if you form a

This returns the relevant chunk mentioning the cs229-qa@cs.stanford.edu email address for asking questions about the course material.

After this, let's persist the vector database for future use:

vectordb.persist()

Failure Modes

While basic semantic search works well, there can be some edge cases and failure modes. Let's explore some of them.

Duplicate Documents

question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)
print(docs[0].page_content)
# Document(page_content='...', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})
print(docs[1].page_content)
# Document(page_content='...', metadata={'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'})

Note that the first two results are identical. This is because we deliberately loaded the first lecture's PDF twice earlier, so the same content was embedded and indexed twice. Ideally, we want the retriever to return distinct chunks.
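A proper fix comes later, but one naive mitigation is to drop exact-duplicate chunks after retrieval. A sketch, using a minimal stand-in for LangChain's `Document` class:

```python
from dataclasses import dataclass

@dataclass
class Doc:  # minimal stand-in for langchain's Document
    page_content: str

def deduplicate(docs):
    # keep only the first occurrence of each chunk's exact text
    seen, unique = set(), []
    for d in docs:
        if d.page_content not in seen:
            seen.add(d.page_content)
            unique.append(d)
    return unique

docs = [Doc("matlab tip"), Doc("matlab tip"), Doc("regression notes")]
print(len(deduplicate(docs)))  # 2
```

Exact-match deduplication only catches verbatim copies; near-duplicate chunks with minor extraction differences would slip through, which is why more robust retrieval strategies are worth exploring.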

Uncaptured Structured Information

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print(doc.metadata)
# {'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 14, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 0, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
# {'page': 6, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
# {'page': 8, 'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf'}
print(docs[4].page_content)

Output:

into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutori al in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion 
sections will be taught by the TAs, and atte ndance at discussion sections is optional, 
although they'll also be recorded and televi sed. And we'll use the discussion sections 
mainly for two things. For the next two or th ree weeks, we'll use the discussion sections 
to go over the prerequisites to this class or if some of you haven't seen probability or 
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a 
refresher for those of you that want one.

In this case, we expected all retrieved documents to come from the third lecture, as specified in the question. However, the results also include chunks from the first and second lectures. The intuition is that "in the third lecture" is structured information that the semantic embeddings do not capture; the embeddings focus on the concept of regression itself.
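The next post covers proper fixes, but to make the idea concrete: since the lecture is already recorded in each chunk's metadata, we could restrict the candidate set by metadata before (or while) doing semantic ranking. A sketch of that pre-filtering over hypothetical chunk records mirroring the metadata printed above:

```python
# hypothetical chunk records mirroring the metadata printed above
chunks = [
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture02.pdf", "page": 0},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf", "page": 14},
    {"source": "docs/cs229_lectures/MachineLearning-Lecture01.pdf", "page": 8},
]

def filter_by_source(chunks, source):
    # keep only chunks whose metadata matches the requested lecture
    return [c for c in chunks if c["source"] == source]

lecture3 = filter_by_source(chunks, "docs/cs229_lectures/MachineLearning-Lecture03.pdf")
print(len(lecture3))  # 2
```

Chroma also exposes this directly: `similarity_search` accepts a `filter` argument that matches on document metadata, e.g. `filter={"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"}`.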

Summary

In this post, we covered the basics of using vector stores and embeddings for semantic search, along with some edge cases and failure modes that can arise. In the next post, we will discuss how to address these failure modes and enhance our retrieval capabilities, ensuring we retrieve relevant and distinct chunks while incorporating structured information into the search process.

Tags: LangChain Vector Store Embeddings Semantic Search RAG

Posted on Sun, 17 May 2026 07:35:31 +0000 by rallen102