When working with extensive text documents, it's essential to divide them into manageable pieces. While this might appear straightforward, numerous complexities arise. Ideally, we want to keep semantically related text segments together, though what counts as "semantically related" varies with the document type. This article explores several approaches to achieving this goal.
At a fundamental level, text splitters operate through these steps:
- Divide the text into small, semantically meaningful chunks (typically sentences).
- Combine these smaller chunks into larger segments until they reach a target size (as measured by some length function).
- Once the size limit is reached, treat that segment as an independent text fragment, then start creating a new segment with some overlap to maintain context between chunks.
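To make these steps concrete, here is a minimal, illustrative sketch of that loop. Everything in it is a simplification for illustration: the naive_split name is hypothetical, splitting on ". " is a crude stand-in for real sentence segmentation, and chunk size is measured in raw characters. It is not how any particular library implements splitting.

def naive_split(text, chunk_size=200, overlap=20):
    # Step 1: split into small, semantically meaningful pieces (here, "sentences").
    sentences = text.split(". ")
    chunks, current = [], ""
    for sentence in sentences:
        # Step 2: merge pieces until adding the next one would exceed chunk_size.
        if current and len(current) + len(sentence) > chunk_size:
            # Step 3: emit the finished chunk, then seed the next chunk with an
            # overlapping tail so context carries across the boundary.
            chunks.append(current.strip())
            current = current[-overlap:]
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks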
This provides two primary dimensions for customizing text splitters:
- How the text is divided
- How chunk size is measured
The recommended default text splitter is RecursiveCharacterTextSplitter. This splitter is parameterized by a list of characters: it attempts to split on the first character in the list, and if any resulting chunk is still too large, it moves on to the next character, continuing down the list. By default, it attempts to split on "\n\n", then "\n", and so on. Beyond controlling the splitting characters, several additional parameters can be customized:
- length_function: how chunk length is calculated. The default counts characters, but it's common to provide a token counter instead.
- chunk_size: the maximum size of a chunk (as measured by the length function).
- chunk_overlap: the maximum overlap between consecutive chunks. Keeping some overlap preserves continuity between segments (similar to a sliding window approach).
- add_start_index: whether to record each chunk's starting position within the original document in the chunk's metadata.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# This is a lengthy document we can segment.
with open('../../us_constitution.txt') as f:
    constitutional_text = f.read()

document_segmenter = RecursiveCharacterTextSplitter(
    # Setting a small chunk size for demonstration purposes.
    chunk_size=150,
    chunk_overlap=15,
    length_function=len,
    add_start_index=True,
)

segments = document_segmenter.create_documents([constitutional_text])
print(segments[0])
print(segments[1])
This code demonstrates how to split a document into smaller, manageable chunks while maintaining some overlap between segments to preserve context. The add_start_index parameter helps track where each segment originated in the original document, which can be useful for reference or citation purposes.
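As noted above, it's common to measure chunk size in tokens rather than characters. Below is a minimal sketch of that variation. It assumes the tiktoken package is installed, and the cl100k_base encoding is an arbitrary choice for illustration, not a requirement:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumes tiktoken is available; cl100k_base is just an example encoding.
encoding = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    # Measure length in tokens instead of characters.
    return len(encoding.encode(text))

token_segmenter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # now interpreted as 100 tokens
    chunk_overlap=10,  # and 10 tokens of overlap
    length_function=token_count,
    add_start_index=True,
)

segments = token_segmenter.create_documents([constitutional_text])

With this change, chunk_size and chunk_overlap are interpreted in tokens, which is usually what matters when the chunks are destined for a model with a token-based context limit. The splitter also accepts a separators argument if you want to change which characters it splits on.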