Ever tried reading a novel in one go? Neither have I! Just like our brains can only handle so much at once, Large Language Models (LLMs) like GPT-4 have limits on how much text they can process in one shot.
If you are trying to pass lengthy content to a large language model, you just can't feed it in as is. You need a strategy. That's where text splitters step in, and the recursive character text splitter is the recommended one for most cases.
What does recursive character text splitter do?
It takes a list of separators and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. But what does each of these actually mean?
- `"\n\n"`: splitting by paragraphs. Input: "This is the first paragraph.\n\nThis is the second paragraph." → Chunk 1: "This is the first paragraph." Chunk 2: "This is the second paragraph."
- `"\n"`: splitting by lines (single line breaks). Input: "This is the first line.\nThis is the second line." → Chunk 1: "This is the first line." Chunk 2: "This is the second line."
- `" "`: splitting by words. Input: "This is the first paragraph." → Chunks: "This", "is", "the", "first", "paragraph."
- `""`: splitting by individual characters. Input: "Hello" → Chunks: "H", "e", "l", "l", "o"
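To make that separator cascade concrete, here is a simplified, pure-Python sketch of the idea. This is hypothetical illustration code, not the library's actual implementation, and it ignores chunk overlap, which is covered below:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest workable separator; recurse on oversized pieces."""
    # Pick the first separator in the list that actually occurs in the text.
    sep, remaining = separators[-1], ()
    for i, s in enumerate(separators):
        if s == "" or s in text:
            sep, remaining = s, separators[i + 1:]
            break
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Piece is still too big: flush what we have and recurse with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, remaining))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current = current + sep + piece  # greedily merge small pieces back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("one two three four five", 12))
# ['one two', 'three four', 'five']
```

Note how the splitter prefers the coarsest boundary available: paragraphs first, then lines, then words, and only falls back to single characters when nothing else fits.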
Let’s see how the Recursive Character Text Splitter operates in action with a sample text.

Let’s apply the Recursive Character Text Splitter to it:

Let's break down the code:
- The code reads the input file
- Applies the recursive character text splitter to the contents
- The chunks are stored as separate documents and returned
Within the recursive character text splitter we see two parameters. Let's look at what they are.
Chunk Size: LLMs have a limit on how much text they can process at once. By controlling the chunk size, you ensure that each piece of text stays within the model’s token limit. This allows you to process large documents in smaller, manageable segments without overwhelming the model.
Chunk Overlap: When you split a document into chunks, some important context might get cut off between chunks. Overlap ensures that the last part of one chunk is included in the next chunk, maintaining continuity and context, which leads to more coherent outputs from the model.
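A toy illustration of overlap, using a hypothetical helper rather than the library's API: each word-based chunk repeats the last few words of the previous one, so context carries across the boundary.

```python
def chunk_with_overlap(words, words_per_chunk, overlap):
    """Group words into chunks, repeating `overlap` words from the previous chunk."""
    step = words_per_chunk - overlap
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), step)]

print(chunk_with_overlap("the quick brown fox jumps over".split(), 4, 2))
# ['the quick brown fox', 'brown fox jumps over', 'jumps over']
```

Notice "brown fox" appearing at the end of the first chunk and again at the start of the second: that shared tail is what keeps a sentence from being cut off mid-thought.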
Output Example

The first split focuses on paragraph titles, followed by a sentence split that respects the chunk size of 100. The third chunk retains 20 characters from the previous chunk, continuing with the next set of sentences.
Conclusion
In conclusion, the Recursive Character Text Splitter isn’t just a handy tool—it’s a behind-the-scenes hero when it comes to optimizing large language models. By breaking down documents into manageable, meaningful chunks, it ensures that even the most complex texts are processed smoothly. Whether it’s crafting the perfect response or diving deep into your data, this clever technique keeps everything running efficiently. So next time you chat with an AI, remember—somewhere, text is being split just right, keeping your conversation flowing!