Semantic Chunking for RAG

What is Chunking?
To fit within the context window of an LLM, we usually break text into smaller pieces. This process is called chunking.
What is RAG?
Although LLMs can generate text that is both meaningful and grammatically correct, they suffer from a problem called hallucination: they confidently generate wrong answers, presenting made-up content in a way that makes it seem true. This has been a major problem since LLMs were introduced, because it leads to factually incorrect answers. Retrieval-Augmented Generation (RAG) was introduced to address it.
In RAG, we take a set of documents (or chunks of documents) and encode them into numerical representations called vector embeddings, with one embedding per chunk, and store them in a database called a vector store. The models that encode chunks into embeddings are called embedding models or bi-encoders. These encoders are trained on a large corpus of data, which makes them powerful enough to capture the meaning of a chunk in a single vector. At query time, the question is embedded in the same way, the most similar chunks are retrieved, and those chunks are passed to the LLM as context.
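The retrieval step can be illustrated with a minimal sketch. The three-dimensional vectors and the names `toy_store`, `toy_query` and `top_k` are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector store uses an index rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=1):
    """Return the k chunk texts whose embeddings are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical (chunk text, embedding) pairs standing in for a vector store.
toy_store = [
    ("BERT uses bidirectional self-attention.", [0.9, 0.1, 0.0]),
    ("SQuAD is a question answering dataset.", [0.1, 0.9, 0.1]),
    ("ImageNet is used for vision pre-training.", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of a question about BERT's attention mechanism.
toy_query = [0.8, 0.2, 0.1]

print(top_k(toy_query, toy_store, k=1))
```

Whatever ends up in each stored chunk is exactly what the LLM sees as context, which is why the chunking strategy matters so much for answer quality.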
Retrieval quality depends greatly on how the chunks are formed and stored in the vector store. Finding the right chunk size for a given text is a hard problem in general.
Retrieval can be improved with better retrieval methods, but it can also be improved with a better chunking strategy.
Different chunking methods:
- Fixed size chunking
- Recursive Chunking
- Document Specific Chunking
- Semantic Chunking
- Agentic Chunking
Fixed Size Chunking: This is the most common and straightforward approach to chunking: we simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-sized chunking will be the best path in most common cases. Compared to other forms of chunking, fixed-sized chunking is computationally cheap and simple to use since it doesn’t require the use of any NLP libraries.
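As a rough sketch of the idea, assuming whitespace-separated words as a stand-in for model tokens (a real pipeline would count tokens with the model's tokenizer), fixed-size chunking with overlap could look like this. The function name and default sizes are illustrative only.

```python
def fixed_size_chunks(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks

print(fixed_size_chunks("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

The last word of each chunk reappears as the first word of the next, which is the overlap that keeps context from being cut off at chunk boundaries.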
Recursive Chunking: Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn’t produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved. This means that while the chunks aren’t going to be exactly the same size, they’ll still “aspire” to be of a similar size. It keeps the benefits of fixed-size chunking and overlap.
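The recursion can be sketched in a few lines. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`: the separator list and `max_chars` default are assumptions, and overlap is left out to keep the example short.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones when a piece is too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_chars, finer))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaa bbb ccc ddd", max_chars=7, separators=(" ",)))
# ['aaa bbb', 'ccc ddd']
```

Because pieces are merged only while they fit, chunk sizes vary but never exceed `max_chars`.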
Document Specific Chunking: This method takes the structure of the document into consideration. Instead of using a set number of characters or a recursive process, it creates chunks that align with the logical sections of the document, such as paragraphs or subsections. This preserves the author’s organization of the content and keeps the text coherent. It makes the retrieved information more relevant and useful, particularly for structured documents with clearly defined sections. It can handle formats such as Markdown and HTML.
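For Markdown, a minimal sketch of structure-aware splitting might simply start a new chunk at every heading. The function name is hypothetical, and the sketch ignores details a real splitter would handle, such as `#` characters inside code fences.

```python
import re

def split_markdown_by_headings(markdown_text):
    """Return (heading, body) pairs, one per Markdown heading section."""
    sections = []
    heading, body_lines = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body_lines):
            sections.append((heading, "\n".join(body_lines).strip()))

    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s+\S", line):
            flush()
            heading, body_lines = line.lstrip("#").strip(), []
        else:
            body_lines.append(line)
    flush()
    return sections

doc = "# Intro\nHello.\n\n## Method\nWe split.\nBy heading.\n"
for title, body in split_markdown_by_headings(doc):
    print(repr(title), "->", repr(body))
```

Text that appears before the first heading is returned with a heading of `None`, so nothing is silently dropped.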
Semantic Chunking: Semantic chunking considers the relationships within the text. It divides the text into meaningful, semantically complete chunks. This approach preserves the information’s integrity during retrieval, leading to more accurate and contextually appropriate results. It is slower than the previous chunking strategies because it has to compute embeddings for the text.
Agentic Chunking: The hypothesis here is to process documents the way a human would.
- We start at the top of the document, treating the first part as a chunk.
- We continue down the document, deciding if a new sentence or piece of information belongs with the first chunk or should start a new one.
- We keep this up until we reach the end of the document.
This approach is still being tested and isn't quite ready for the big leagues due to the time it takes to process multiple LLM calls and the cost of those calls. There's no implementation available in public libraries just yet.
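The control flow of the steps above can still be sketched without an LLM. In the sketch below, `belongs_with_chunk` is a placeholder for the LLM decision, and `shares_a_keyword` is a deliberately crude stand-in invented for this example so that it runs offline; neither comes from an existing library.

```python
def agentic_chunks(sentences, belongs_with_chunk):
    """Walk the sentences top to bottom, asking a decision function whether each one joins the current chunk."""
    chunks = []
    for sentence in sentences:
        if chunks and belongs_with_chunk(chunks[-1], sentence):
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return [" ".join(chunk) for chunk in chunks]

def _long_words(text):
    """Lower-cased words longer than four characters, with trailing punctuation removed."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if len(w) > 4}

def shares_a_keyword(chunk, sentence):
    """Stand-in for an LLM call: true when the sentence shares a long word with the chunk."""
    chunk_words = set()
    for chunk_sentence in chunk:
        chunk_words |= _long_words(chunk_sentence)
    return bool(chunk_words & _long_words(sentence))

sentences = [
    "BERT is a Transformer encoder.",
    "The encoder uses self-attention.",
    "Paris is the capital of France.",
]
print(agentic_chunks(sentences, shares_a_keyword))
```

In a real setup, each call to the decision function would be an LLM request, which is where the processing time and cost mentioned above come from.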
Here we will experiment with semantic chunking and compare it with recursive (naive) chunking.
Steps for comparing the methods:
- Load the document
- Chunk the document using two methods: semantic chunking and recursive character splitting
- Assess qualitative and quantitative improvements with RAGAS
Semantic Chunks
Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of neighbouring sentences, and grouping sentences whose embeddings are similar into the same chunk.
By focusing on the text’s meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It’s a top-notch choice when maintaining the semantic integrity of the text is vital.
The hypothesis here is that we can use embeddings of individual sentences to make more meaningful chunks. The basic idea is as follows:
- Split the document into sentences based on separators (., ?, !).
- Index each sentence by its position.
- Group: choose how many sentences to include on either side, and add that buffer of neighbouring sentences around each sentence.
- Calculate the distance between consecutive groups of sentences.
- Merge groups based on similarity, i.e. keep similar sentences together.
- Split where neighbouring sentences are not similar.
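The steps above can be sketched end to end in plain Python. To keep the example runnable without a model, `toy_embed` is a bag-of-words stand-in for a real embedding model, and the fixed `threshold` is an assumption; both are illustrative and not how any particular library implements semantic chunking.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Split on '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def semantic_chunks(text, buffer_size=1, threshold=0.8, embed=toy_embed):
    """Group sentences into chunks, splitting where neighbouring groups are far apart."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    # Combine each sentence with `buffer_size` neighbours on either side.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(group) for group in groups]
    # Distance between each group and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:  # neighbours are dissimilar: close the current chunk
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

text = "Cats purr. Cats sleep. Stocks fell. Stocks rose."
print(semantic_chunks(text, buffer_size=0, threshold=0.8))
# ['Cats purr. Cats sleep.', 'Stocks fell. Stocks rose.']
```

With `buffer_size=0` each sentence is compared on its own. A larger buffer smooths the distances because adjacent groups share sentences, so the threshold usually needs to be lower.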
Technology Stack Used
- LangChain: LangChain is an open-source framework designed to simplify the creation of applications using large language models (LLMs). It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
- LLM: Groq’s Language Processing Unit (LPU) is a cutting-edge technology designed to significantly enhance AI computing performance, especially for Large Language Models (LLMs). The primary goal of the Groq LPU system is to provide real-time, low-latency experiences with exceptional inference performance.
- Embedding Model: FastEmbed is a lightweight, fast, Python library built for embedding generation.
- Evaluation: Ragas offers metrics tailored for evaluating each component of your RAG pipeline in isolation.
Code Implementation
Install the required dependencies
!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas chromadb langchain-groq fastembed pypdf openai
Download data
! wget ""
Process the PDF Content
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("1810.04805.pdf")
documents = loader.load()
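Perform Naive Chunking (RecursiveCharacterTextSplitter)
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Assumed values: the splitter arguments are not shown in this write-up.
    chunk_size=1000,
    chunk_overlap=0,
)
naive_chunks = text_splitter.split_documents(documents)
# Print a few chunks to inspect how the text was split.
for chunk in naive_chunks[10:15]:
    print(chunk.page_content + "\n")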
E[CLS] E1 E[SEP] ... ENE1’... EM’
T[SEP] ...
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM
Question Paragraph Start/End Span
E[CLS] E1 E[SEP] ... ENE1’... EM’
T[SEP] ...
[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM
Masked Sentence A Masked Sentence B
Pre-training Fine-Tuning NSP Mask LM Mask LM
Unlabeled Sentence A and B Pair SQuAD
Question Answer Pair NER MNLI Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architec-
tures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize
models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special
symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating ques-
ing and auto-encoder objectives have been used
for pre-training such models (Howard and Ruder,
2018; Radford et al., 2018; Dai and Le, 2015).
2.3 Transfer Learning from Supervised Data
There has also been work showing effective trans-
fer from supervised tasks with large datasets, such
as natural language inference (Conneau et al.,
2017) and machine translation (McCann et al.,
2017). Computer vision research has also demon-
strated the importance of transfer learning from
large pre-trained models, where an effective recipe
is to fine-tune models pre-trained with Ima-
geNet (Deng et al., 2009; Yosinski et al., 2014).
We introduce BERT and its detailed implementa-
tion in this section. There are two steps in our
framework: pre-training and fine-tuning . Dur-
ing pre-training, the model is trained on unlabeled
data over different pre-training tasks. For fine-
tuning, the BERT model is first initialized with
the pre-trained parameters, and all of the param-
eters are fine-tuned using labeled data from the
downstream tasks. Each downstream task has sep-
arate fine-tuned models, even though they are ini-
tialized with the same pre-trained parameters. The
question-answering example in Figure 1 will serve
as a running example for this section.
A distinctive feature of BERT is its unified ar-
chitecture across different tasks. There is mini-mal difference between the pre-trained architec-
ture and the final downstream architecture.
Model Architecture BERT’s model architec-
ture is a multi-layer bidirectional Transformer en-
coder based on the original implementation de-
scribed in Vaswani et al. (2017) and released in
thetensor2tensor library.1Because the use
of Transformers has become common and our im-
plementation is almost identical to the original,
we will omit an exhaustive background descrip-
tion of the model architecture and refer readers to
Vaswani et al. (2017) as well as excellent guides
such as “The Annotated Transformer.”2
In this work, we denote the number of layers
(i.e., Transformer blocks) as L, the hidden size as
H, and the number of self-attention heads as A.3
We primarily report results on two model sizes:
BERT BASE (L=12, H=768, A=12, Total Param-
eters=110M) and BERT LARGE (L=24, H=1024,
A=16, Total Parameters=340M).
BERT BASE was chosen to have the same model
size as OpenAI GPT for comparison purposes.
Critically, however, the BERT Transformer uses
bidirectional self-attention, while the GPT Trans-
former uses constrained self-attention where every
token can only attend to context to its left.4
3In all cases we set the feed-forward/filter size to be 4H,
i.e., 3072 for the H= 768 and 4096 for the H= 1024 .
4We note that in the literature the bidirectional Trans-
Input/Output Representations To make BERT
handle a variety of down-stream tasks, our input
representation is able to unambiguously represent
both a single sentence and a pair of sentences
(e.g.,⟨Question, Answer⟩) in one token sequence.
Throughout this work, a “sentence” can be an arbi-
trary span of contiguous text, rather than an actual
linguistic sentence. A “sequence” refers to the in-
put token sequence to BERT, which may be a sin-
gle sentence or two sentences packed together.
We use WordPiece embeddings (Wu et al.,
2016) with a 30,000 token vocabulary. The first
token of every sequence is always a special clas-
sification token ( [CLS] ). The final hidden state
corresponding to this token is used as the ag-
gregate sequence representation for classification
tasks. Sentence pairs are packed together into a
single sequence. We differentiate the sentences in
two ways. First, we separate them with a special
token ( [SEP] ). Second, we add a learned embed-
Instantiate Embedding Model
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")
Setup the API Key for LLM
from google.colab import userdata
from groq import Groq
from langchain_groq import ChatGroq
groq_api_key = userdata.get("GROQ_API_KEY")
Perform Semantic Chunking
We’re going to use the `percentile` threshold as an example here, but there are three different strategies you could use with semantic chunking:
- `percentile` (default) — In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split.
- `standard_deviation` — In this method, any difference greater than X standard deviations is split.
- `interquartile` — In this method, the interquartile distance is used to split chunks.
NOTE: This method is currently experimental and is not in a stable final form — expect updates and improvements in the coming months
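To make the three strategies concrete, here is a small sketch of how a split threshold could be derived from a list of sentence distances. The default amounts (95th percentile, 3 standard deviations, 1.5 times the interquartile range) and the use of the mean as a baseline are assumptions made for illustration and should be checked against the library version you use.

```python
import statistics

def percentile(values, pct):
    """Linear-interpolation percentile (0 to 100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def breakpoint_threshold(distances, kind="percentile", amount=None):
    """Distance above which two neighbouring sentences are split apart."""
    if kind == "percentile":
        return percentile(distances, 95 if amount is None else amount)
    if kind == "standard_deviation":
        scale = 3 if amount is None else amount
        return statistics.mean(distances) + scale * statistics.pstdev(distances)
    if kind == "interquartile":
        scale = 1.5 if amount is None else amount
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + scale * iqr
    raise ValueError(f"unknown threshold type: {kind}")

# Hypothetical distances between consecutive sentences.
distances = [0.1, 0.2, 0.3, 0.4, 1.0]
threshold = breakpoint_threshold(distances, "percentile")
split_after = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, split_after)
```

In this toy example only the last distance exceeds the 95th-percentile threshold, so the text would be split at that single position.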
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
semantic_chunker = SemanticChunker(embed_model, breakpoint_threshold_type="percentile")
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])
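# Print the semantic chunk that contains a known section heading.
for semantic_chunk in semantic_chunks:
    if "Effect of Pre-training Tasks" in semantic_chunk.page_content:
        print(semantic_chunk.page_content)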
Dev Set
(Acc) (Acc) (Acc) (Acc) (F1)
BERT BASE 84.4 88.4 86.7 92.7 88.5
No NSP 83.9 84.9 86.5 92.6 87.9
LTR & No NSP 82.1 84.3 77.5 92.1 77.8
+ BiLSTM 82.1 84.1 75.7 91.6 84.9
Table 5: Ablation over the pre-training tasks using the
BERT BASE architecture. “No NSP” is trained without
the next sentence prediction task. “LTR & No NSP” is
trained as a left-to-right LM without the next sentence
prediction, like OpenAI GPT. “+ BiLSTM” adds a ran-
domly initialized BiLSTM on top of the “LTR + No
NSP” model during fine-tuning. ablation studies can be found in Appendix C. 5.1 Effect of Pre-training Tasks
We demonstrate the importance of the deep bidi-
rectionality of BERT by evaluating two pre-
training objectives using exactly the same pre-
training data, fine-tuning scheme, and hyperpa-
rameters as BERT BASE :
No NSP : A bidirectional model which is trained
using the “masked LM” (MLM) but without the
“next sentence prediction” (NSP) task. LTR & No NSP : A left-context-only model which
is trained using a standard Left-to-Right (LTR)

Instantiate the Vectorstore
from langchain_community.vectorstores import Chroma
semantic_chunk_vectorstore = Chroma.from_documents(semantic_chunks, embedding=embed_model)
We will “limit” our semantic retriever to k = 1 to demonstrate the power of the semantic chunking strategy while maintaining similar token counts between the semantic and naive retrieved context.
Instantiate Retrieval Step
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
semantic_chunk_retriever.invoke("Describe the Feature-based Approach with BERT?")
[Document(page_content='The right part of the paper represents the\nDev set results. For the feature-based approach,\nwe concatenate the last 4 layers of BERT as the\nfeatures, which was shown to be the best approach\nin Section 5.3. From the table it can be seen that fine-tuning is\nsurprisingly robust to different masking strategies. However, as expected, using only the M ASK strat-\negy was problematic when applying the feature-\nbased approach to NER. Interestingly, using only\nthe R NDstrategy performs much worse than our\nstrategy as well.')]
Instantiate Augmentation Step (for Content Augmentation)
from langchain_core.prompts import ChatPromptTemplate
# The {question} and {context} placeholders match the keys passed by the RAG chain.
rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.
User's Query:
{question}
Context:
{context}
"""
rag_prompt = ChatPromptTemplate.from_template(rag_template)
Instantiate the Generation Step
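# Model name taken from the rate-limit message shown later in this article.
chat_model = ChatGroq(temperature=0, model_name="mixtral-8x7b-32768", groq_api_key=groq_api_key)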
Creating a RAG Pipeline Utilizing Semantic Chunking
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
Ask Question 1
semantic_rag_chain.invoke("Describe the Feature-based Approach with BERT?")
################ RESPONSE ###################################
The feature-based approach with BERT, as mentioned in the context, involves using BERT as a feature extractor for a downstream natural language processing task, specifically Named Entity Recognition (NER) in this case.
To use BERT in a feature-based approach, the last 4 layers of BERT are concatenated to serve as the features for the task. This was found to be the most effective approach in Section 5.3 of the paper.
The context also mentions that fine-tuning BERT is surprisingly robust to different masking strategies. However, when using the feature-based approach for NER, using only the MASK strategy was problematic. Additionally, using only the RND strategy performed much worse than the proposed strategy.
In summary, the feature-based approach with BERT involves using the last 4 layers of BERT as features for a downstream NLP task, and fine-tuning these features for the specific task. The approach was found to be robust to different masking strategies, but using only certain strategies was problematic for NER.
Ask Question 2
semantic_rag_chain.invoke("What is SQuADv2.0?")
################ RESPONSE ###################################
SQuAD v2.0, or Squad Two Point Zero, is a version of the Stanford Question Answering Dataset (SQuAD) that extends the problem definition of SQuAD 1.1 by allowing for the possibility that no short answer exists in the provided paragraph. This makes the problem more realistic, as not all questions have a straightforward answer within the provided text. The SQuAD 2.0 task uses a simple approach to extend the SQuAD 1.1 BERT model for this task, by treating questions that do not have an answer as having an answer span with start and end at the [CLS] token, and comparing the score of the no-answer span to the score of the best non-null span for prediction. The document also mentions that the BERT ensemble, which is a combination of 7 different systems using different pre-training checkpoints and fine-tuning seeds, outperforms all existing systems by a wide margin in SQuAD 2.0, even when excluding entries that use BERT as one of their components.
Ask Question 3
semantic_rag_chain.invoke("What is the purpose of Ablation Studies?")
################ RESPONSE ###################################
Ablation studies are used to understand the impact of different components or settings of a machine learning model on its performance. In the provided context, ablation studies are used to answer questions about the effect of the number of training steps and masking procedures on the performance of the BERT model. By comparing the performance of the model under different conditions, researchers can gain insights into the importance of these components or settings and how they contribute to the overall performance of the model.
Implement a RAG pipeline using Naive Chunking Strategy
naive_chunk_vectorstore = Chroma.from_documents(naive_chunks, embedding=embed_model)
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 5})
naive_rag_chain = (
    {"context" : naive_chunk_retriever, "question" : RunnablePassthrough()}
    | rag_prompt
    | chat_model
    | StrOutputParser()
)
Note: Here we use k = 5 to “make it a fair comparison” between the two strategies.
Ask Question 1
naive_rag_chain.invoke("Describe the Feature-based Approach with BERT?")
The Feature-based Approach with BERT involves extracting fixed features from the pre-trained BERT model, as opposed to the fine-tuning approach where all parameters are jointly fine-tuned on a downstream task. The feature-based approach has certain advantages, such as being applicable to tasks that cannot be easily represented by a Transformer encoder architecture, and providing major computational benefits by pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation. In the context provided, the feature-based approach is compared to the fine-tuning approach on the CoNLL-2003 Named Entity Recognition (NER) task, with the feature-based approach using a case-preserving WordPiece model and including the maximal document context provided by the data. The results presented in Table 7 show the performance of both approaches on the NER task.
Ask Question 2
naive_rag_chain.invoke("What is SQuADv2.0?")
SQuAD v2.0, or the Stanford Question Answering Dataset version 2.0, is a collection of question/answer pairs that extends the SQuAD v1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph. This makes the problem more realistic. The SQuAD v2.0 BERT model is extended from the SQuAD v1.1 model by treating questions that do not have an answer as having an answer span with start and end at the [CLS] token, and extending the probability space for the start and end answer span positions to include the position of the [CLS] token. For prediction, the score of the no-answer span is compared to the score of the best non-null span.
Ask Question 3
naive_rag_chain.invoke("What is the purpose of Ablation Studies?")
Ablation studies are used to evaluate the effect of different components or settings in a machine learning model. In the provided context, ablation studies are used to understand the impact of certain aspects of the BERT model, such as the number of training steps and masking procedures, on the model's performance.
For instance, one ablation study investigates the effect of the number of training steps on BERT's performance. The results show that BERT BASE achieves higher fine-tuning accuracy on MNLI when trained for 1M steps compared to 500k steps, indicating that a larger number of training steps contributes to better performance.
Another ablation study focuses on different masking procedures during pre-training. The study compares BERT's masked language model (MLM) with a left-to-right strategy. The results demonstrate that the masking strategies aim to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol does not appear during the fine-tuning stage. The study also reports Dev set results for both MNLI and Named Entity Recognition (NER) tasks, considering fine-tuning and feature-based approaches for NER.
Ragas Assessment Comparison for Semantic Chunker
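Split documents using RecursiveCharacterTextSplitter
synthetic_data_splitter = RecursiveCharacterTextSplitter(
    # Assumed values: the splitter arguments are not shown in this write-up.
    chunk_size=1000,
    chunk_overlap=0,
)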
synthetic_data_chunks = synthetic_data_splitter.create_documents([d.page_content for d in documents])
Create the Following Datasets
- Questions: synthetically generated (Groq mixtral-8x7b-32768)
- Contexts: created above (synthetic data chunks)
- Ground Truths: synthetically generated (Groq mixtral-8x7b-32768)
- Answers: generated from our semantic RAG chain
questions = []
ground_truths_semantic = []
contexts = []
answers = []
# The placeholders below match the keys passed to each chain's invoke() call.
question_prompt = """\
You are a teacher preparing a test. Please create a question that can be answered by referencing the following context.
Context:
{context}
"""
question_prompt = ChatPromptTemplate.from_template(question_prompt)
ground_truth_prompt = """\
Use the following context and question to answer this question using *only* the provided context.
Question:
{question}
Context:
{context}
"""
ground_truth_prompt = ChatPromptTemplate.from_template(ground_truth_prompt)
question_chain = question_prompt | chat_model | StrOutputParser()
ground_truth_chain = ground_truth_prompt | chat_model | StrOutputParser()
for chunk in synthetic_data_chunks[10:20]:
    questions.append(question_chain.invoke({"context" : chunk.page_content}))
    # Store the context before it is referenced as contexts[-1] below.
    contexts.append([chunk.page_content])
    ground_truths_semantic.append(ground_truth_chain.invoke({"question" : questions[-1], "context" : contexts[-1]}))
    answers.append(semantic_rag_chain.invoke(questions[-1]))
Note: for experimental purposes, we consider only 10 samples.
Format the content generated into HuggingFace Dataset Format
from datasets import load_dataset, Dataset
qagc_list = []
for question, answer, context, ground_truth in zip(questions, answers, contexts, ground_truths_semantic):
    qagc_list.append({
        "question" : question,
        "answer" : answer,
        "contexts" : context,
        "ground_truth" : ground_truth
    })
eval_dataset = Dataset.from_list(qagc_list)
eval_dataset
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})
Implement Ragas metrics and evaluate our created dataset.

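from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate
result = evaluate(
    eval_dataset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
    llm=chat_model,          # assumed: the Groq-hosted model defined earlier
    embeddings=embed_model,  # assumed: the FastEmbed model defined earlier
)
I first tried to run the evaluation with an open-source LLM served by Groq, but hit a rate limit error: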
groq.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `mixtral-8x7b-32768` in organization `org_01htsyxttnebyt0av6tmfn1fy6` on tokens per minute (TPM): Limit 4500, Used 3867, Requested ~1679. Please try again in 13.940333333s. Visit for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}

So I switched the evaluation LLM to OpenAI, which the RAGAS framework uses by default.
Set up OpenAI API keys
import os
from google.colab import userdata
import openai
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
openai.api_key = os.environ['OPENAI_API_KEY']
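from ragas import evaluate
# No llm argument is passed, so RAGAS falls back to its default OpenAI models.
result = evaluate(
    eval_dataset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
)
result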
{'context_precision': 1.0000, 'faithfulness': 0.8857, 'answer_relevancy': 0.9172, 'context_recall': 1.0000}
#Extract the details into a dataframe
results_df = result.to_pandas()

Ragas Assessment Comparison for Naive Chunker
import tqdm
questions = []
ground_truths_semantic = []
contexts = []
answers = []
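for chunk in tqdm.tqdm(synthetic_data_chunks[10:20]):
    questions.append(question_chain.invoke({"context" : chunk.page_content}))
    # Store the context before it is referenced as contexts[-1] below.
    contexts.append([chunk.page_content])
    ground_truths_semantic.append(ground_truth_chain.invoke({"question" : questions[-1], "context" : contexts[-1]}))
    answers.append(naive_rag_chain.invoke(questions[-1]))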
Formulate naive chunking evaluation dataset
qagc_list = []
for question, answer, context, ground_truth in zip(questions, answers, contexts, ground_truths_semantic):
    qagc_list.append({
        "question" : question,
        "answer" : answer,
        "contexts" : context,
        "ground_truth" : ground_truth
    })
naive_eval_dataset = Dataset.from_list(qagc_list)
naive_eval_dataset
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 10
})
Evaluate our created dataset using RAGAS framework
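naive_result = evaluate(
    naive_eval_dataset,
    metrics=[context_precision, faithfulness, answer_relevancy, context_recall],
)
naive_result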
{'context_precision': 1.0000, 'faithfulness': 0.9500, 'answer_relevancy': 0.9182, 'context_recall': 1.0000}
naive_results_df = naive_result.to_pandas()

Here we can see that, on this 10-sample test set, the results of semantic chunking and naive chunking are almost the same. The one exception is faithfulness, where the naive chunker scores 0.95 compared with 0.89 for the semantic chunker, meaning its answers were better supported by the retrieved context.
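The gap per metric can be computed directly from the two score dictionaries reported above. This is only a convenience sketch; the helper name `score_deltas` is made up for this example.

```python
# Scores copied from the two RAGAS runs reported in this article.
semantic_scores = {
    "context_precision": 1.0000,
    "faithfulness": 0.8857,
    "answer_relevancy": 0.9172,
    "context_recall": 1.0000,
}
naive_scores = {
    "context_precision": 1.0000,
    "faithfulness": 0.9500,
    "answer_relevancy": 0.9182,
    "context_recall": 1.0000,
}

def score_deltas(baseline, other):
    """Per-metric difference (other minus baseline), rounded to 4 decimals."""
    return {metric: round(other[metric] - baseline[metric], 4) for metric in baseline}

for metric, delta in score_deltas(semantic_scores, naive_scores).items():
    print(f"{metric}: {delta:+.4f}")
```

A positive value means the naive chunker scored higher on that metric.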
In conclusion, semantic chunking enables the grouping of contextually similar information, allowing for the creation of independent and meaningful segments. This approach enhances the efficiency and effectiveness of large language models by providing them with focused inputs, ultimately improving their ability to comprehend and process natural language data.