Evaluate RAG Pipeline Response Using Python
Introduction
Retrieval-augmented generation (RAG) is a method for enriching LLM prompts with relevant data. Typically, the user prompt is converted into an embedding, matching documents are fetched from a vector store, and the LLM is then called with the matching documents included in the prompt.
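As a rough sketch of that flow (the object and method names below are illustrative placeholders, not the actual pipeline, which is built step by step later in this post):

# Illustrative retrieve-then-generate flow (placeholder objects, for intuition only)
def answer_with_rag(question, vector_store, llm, k=3):
    # 1. Embed the question and fetch the k most similar document chunks
    docs = vector_store.similarity_search(question, k=k)
    # 2. Stuff the retrieved chunks into the prompt as context
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Call the LLM with the enriched prompt
    return llm.invoke(prompt)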
RAG evaluation assesses how well the pipeline performs. It measures the precision and recall of the retrieval phase, and the faithfulness of the generated answer to the retrieved facts, by analyzing the top results produced by the system. This allows us to automatically track and monitor the performance of the pipeline. When designing an evaluation strategy for RAG applications, you should evaluate both steps:
- Document retrieval from the vector store
- LLM output generation
It’s important to evaluate these steps separately, because breaking your RAG pipeline into multiple steps makes it easier to pinpoint issues. There are several criteria used to evaluate RAG applications:
Output-based
- Factuality (also called Correctness): Measures whether the LLM outputs are based on the provided ground truth.
- Answer relevance: Measures how directly the answer addresses the question.
Context-based
- Context adherence (also called Grounding or Faithfulness): Measures whether LLM outputs are based on the provided context.
- Context recall: Measures whether the context contains the correct information, compared to a provided ground truth, in order to produce an answer.
- Context relevance: Measures how much of the context is necessary to answer a given query.
- Custom metrics: You know your application better than anyone else. Create test cases that focus on things that matter to you (examples include: whether a certain document is cited, whether the response is too long, etc.)
Here we will evaluate the RAG pipeline on the context retrieved from the vector store and the response generated by the LLM, using the following metrics:
* BLEU (Bilingual Evaluation Understudy) Score:
BLEU score is a widely used metric for machine translation tasks, where the goal is to automatically translate text from one language to another. It was proposed as a way to assess the quality of machine-generated translations by comparing them to a set of reference translations provided by human translators.
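As a quick, self-contained illustration (the toy sentences below are made up, not taken from the pipeline), BLEU can be computed with sacrebleu, the same library used in the evaluator later in this post:

from sacrebleu import corpus_bleu
# Toy candidate/reference pair for illustration only
candidate = ["LCEL enables composition of arbitrary sequences of components."]
reference = ["LCEL enables the composition of arbitrary sequences."]
# corpus_bleu expects a list of hypotheses and a list of reference streams
bleu = corpus_bleu(candidate, [reference])
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale; higher means more n-gram overlap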
* ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score:
ROUGE score is a set of metrics commonly used for text summarization tasks, where the goal is to automatically generate a concise summary of a longer text. ROUGE was designed to evaluate the quality of machine-generated summaries by comparing them to reference summaries provided by humans.
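A minimal sketch with the rouge-score package (again with made-up sentences) looks like this; note that scorer.score takes the reference first and the candidate second:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
# score(target, prediction): reference text first, generated text second
scores = scorer.score(
    "LCEL enables the composition of arbitrary sequences.",
    "LCEL enables composition of arbitrary sequences of components.",
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)  # unigram and longest-common-subsequence F1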
* BERTScore
BERTScore is a metric for evaluating the quality of text generation models, such as machine translation or summarization systems. It computes pre-trained BERT contextual embeddings for both the generated and reference texts and then calculates the cosine similarity between these embeddings.
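A minimal sketch with the bert-score package (toy sentences again; the underlying model weights are downloaded on first use):

from bert_score import score
candidates = ["LCEL enables composition of arbitrary sequences of components."]
references = ["LCEL enables the composition of arbitrary sequences."]
# Returns precision, recall and F1 tensors, one entry per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en")
print(P.mean().item(), R.mean().item(), F1.mean().item())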
* Perplexity
Perplexity is a measure used in natural language processing and machine learning to evaluate the performance of language models. It measures how well the model predicts the next word or character based on the context provided by the previous words or characters. The lower the perplexity score, the better the model’s ability to predict the next word or character.
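In code, perplexity is simply the exponential of the average negative log-likelihood. A single-pass sketch with GPT-2 is shown below on a made-up sentence (the evaluator later in this post uses a sliding-window variant of the same idea for longer texts):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
text = "LangChain provides composable building blocks for LLM applications."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing the input ids as labels yields the mean cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(torch.exp(loss).item())  # perplexity = exp(average negative log-likelihood)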
* Diversity
Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate more diverse and varied output.
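This is essentially a distinct-2 style metric: the number of unique bigrams divided by the total number of tokens, as in this small sketch:

from nltk.util import ngrams
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()
unique_bigrams = set(ngrams(tokens, 2))
print(len(unique_bigrams) / len(tokens) if tokens else 0)  # closer to 1 means more varied output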
* Racial Bias
Racial bias in natural language processing (NLP) is the unfair treatment of people based on their race. It can occur in several ways, including:
- Language models: Language models can produce biased language and have trouble with dialects and accents that aren’t well represented in their training data. This can lead to discrimination and reinforce racial stereotypes.
- Data: NLP systems can reflect biases in the language data used to train them.
- Annotations: Mismatches between the annotator population and the data can introduce bias.
- Research design: Choices made in research design can also introduce bias.
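One lightweight (and admittedly coarse) way to screen generated text is a zero-shot classification pipeline. The sketch below uses facebook/bart-large-mnli, a generic NLI model chosen here purely for illustration; the evaluator later in this post plugs in a dedicated hate-speech model instead.

from transformers import pipeline
# facebook/bart-large-mnli is an illustrative model choice, not the one used later in this post
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "LangChain has almost 700 integrations, from LLMs to vector stores.",
    candidate_labels=["hate speech", "not hate speech"],
)
print(dict(zip(result["labels"], result["scores"])))  # probability-like score per label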
Code Implementation
The following code was run on a T4 GPU instance in Google Colab.
Install Required Dependencies
!pip install sacrebleu
!pip install rouge-score
!pip install bert-score
!pip install transformers
!pip install nltk
!pip install textblob
!pip install -qU datasets
!pip install -qU langchain
!pip install -qU langchain_community
!pip install -qU chromadb
!pip install -qU langchain-chroma
!pip install -qU langchain-huggingface
!pip install -qU sentence-transformers
!pip install -qU Flashrank
Build A RAG Pipeline Using Chroma and Phi-3-mini-4k-instruct
Instantiate LLM
from langchain_huggingface import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 100,
        "top_k": 50,
        "temperature": 0.1,
    },
)
Instantiate Embedding Model
from langchain_huggingface import HuggingFaceEmbeddings
model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
Download Data
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
"https://blog.langchain.dev/langchain-v0-1-0/"
)
documents = loader.load()
print(len(documents))
Chunk the Documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 1250,
chunk_overlap = 100,
length_function = len,
is_separator_regex = False
)
#
split_docs = text_splitter.split_documents(documents)
print(len(split_docs))
Setup the VectorStore
from langchain_community.vectorstores import Chroma
vectorstore = Chroma(embedding_function=embeddings,
persist_directory="./chromadb1",
collection_name="full_documents")
Add the document chunks
vectorstore.add_documents(split_docs)
vectorstore.persist()
Advanced Retrieval Using FlashRank
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
#
retriever = vectorstore.as_retriever(search_kwargs={"k":10})
#
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
base_retriever=retriever
)
Setup a RetrievalQA Chain
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
chain_type="stuff",
retriever=compression_retriever,
return_source_documents=True
)
Ask Question
result = qa.invoke({"query":"What is mentioned about Composability?"})
print(result)
#########################RESPONSE##############################
{'query': 'What is mentioned about Composability?',
'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nchanges. These can now be reflected on an individual integration basis with proper versioning in the standalone integration package.ObservabilityBuilding LLM applications involves putting a non-deterministic component at the center of your system. These models can often output unexpected results, so having visibility into exactly what is happening in your system is integral. 💡We want to make langchain as observable and as debuggable as possible, whether through architectural decisions or tools we build on the side.We’ve set about this in a few ways.The main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usage and more. Even in private beta, the demand for LangSmith has been overwhelming, and we’re investing a lot in scalability so that we can release a public beta and then make it generally available\n\na lot in scalability so that we can release a public beta and then make it generally available in the coming months. We are also already supporting an enterprise version, which comes with a within-VPC deployment for enterprises with strict data privacy policies.We’ve also tackled observability in other ways. We’ve long had built in verbose and debug modes for different levels of logging throughout the pipeline. We recently introduced methods to visualize the chain you created, as well as get all prompts used.ComposabilityWhile it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as data orchestration tools do for data engineering pipelines (batching, parallelization, fallbacks). It also provides some benefits unique to LLM workloads - mainly LLM-specific observability (covered above), and streaming, covered later in this post.The components for LCEL are in\n\nbug fixes. See more towards the end of this post on our plans for that.While re-architecting the package towards a path to a stable 0.1 release, we took the opportunity to talk to hundreds of developers about why they use LangChain and what they love about it. This input guided our direction and focus. We also used it as an opportunity to bring parity to the Python and JavaScript versions in the core areas outlined below. 💡While certain integrations and more tangential chains may be language specific, core abstractions and key functionality are implemented equally in both the Python and JavaScript packages.We want to share what we’ve heard and our plan to continually improve LangChain. We hope that sharing these learnings will increase transparency into our thinking and decisions, allowing others to better use, understand, and contribute to LangChain. 
After all, a huge part of LangChain is our community – both the user base and the 2000+ contributors – and we want everyone to come along for the journey.\xa0Third Party IntegrationsOne of the things that people most love about LangChain is how easy we make it to get started building on any stack. We have almost 700 integrations, ranging from LLMs to vector stores to tools for agents\n\nQuestion: What is mentioned about Composability?\nHelpful Answer:\nComposability: While it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as",
'source_documents': [
Document(page_content='changes. These can now be reflected on an individual integration basis with proper versioning in the standalone integration package.ObservabilityBuilding LLM applications involves putting a non-deterministic component at the center of your system. These models can often output unexpected results, so having visibility into exactly what is happening in your system is integral. 💡We want to make langchain as observable and as debuggable as possible, whether through architectural decisions or tools we build on the side.We’ve set about this in a few ways.The main way we’ve tackled this is by building LangSmith. One of the main value props that LangSmith provides is a best-in-class debugging experience for your LLM application. We log exactly what steps are happening, what the inputs of each step are, what the outputs of each step are, how long each step takes, and more data. We display this in a user-friendly way, allowing you to identify which steps are taking the longest, enter a playground to debug unexpected LLM responses, track token usage and more. Even in private beta, the demand for LangSmith has been overwhelming, and we’re investing a lot in scalability so that we can release a public beta and then make it generally available', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.9986438}),
Document(page_content='a lot in scalability so that we can release a public beta and then make it generally available in the coming months. We are also already supporting an enterprise version, which comes with a within-VPC deployment for enterprises with strict data privacy policies.We’ve also tackled observability in other ways. We’ve long had built in verbose and debug modes for different levels of logging throughout the pipeline. We recently introduced methods to visualize the chain you created, as well as get all prompts used.ComposabilityWhile it’s helpful to have prebuilt chains to get started, we very often see teams breaking outside of those architectures and wanting to customize their chain - not only customize the prompt, but also customize different parts of the orchestration. 💡Over the past few months, we’ve invested heavily in LangChain Expression Language (LCEL). This enables composition of arbitrary sequences, providing a lot of the same benefits as data orchestration tools do for data engineering pipelines (batching, parallelization, fallbacks). It also provides some benefits unique to LLM workloads - mainly LLM-specific observability (covered above), and streaming, covered later in this post.The components for LCEL are in', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.9979234}),
Document(page_content='bug fixes. See more towards the end of this post on our plans for that.While re-architecting the package towards a path to a stable 0.1 release, we took the opportunity to talk to hundreds of developers about why they use LangChain and what they love about it. This input guided our direction and focus. We also used it as an opportunity to bring parity to the Python and JavaScript versions in the core areas outlined below. 💡While certain integrations and more tangential chains may be language specific, core abstractions and key functionality are implemented equally in both the Python and JavaScript packages.We want to share what we’ve heard and our plan to continually improve LangChain. We hope that sharing these learnings will increase transparency into our thinking and decisions, allowing others to better use, understand, and contribute to LangChain. After all, a huge part of LangChain is our community – both the user base and the 2000+ contributors – and we want everyone to come along for the journey.\xa0Third Party IntegrationsOne of the things that people most love about LangChain is how easy we make it to get started building on any stack. We have almost 700 integrations, ranging from LLMs to vector stores to tools for agents', metadata={'language': 'en', 'source': 'https://blog.langchain.dev/langchain-v0-1-0/', 'title': 'LangChain v0.1.0', 'relevance_score': 0.99791926})
]
}
Code to Evaluate the Response Generated by the RAG Pipeline
Evaluation
answer = result['result']
context = " ".join([d.page_content for d in result['source_documents']])
import torch
from sacrebleu import corpus_bleu
from rouge_score import rouge_scorer
from bert_score import score
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline
import nltk
from nltk.util import ngrams
class RAGEvaluator:
    def __init__(self):
        # GPT-2 is used for perplexity; a zero-shot pipeline is used for the bias score
        self.gpt2_model, self.gpt2_tokenizer = self.load_gpt2_model()
        self.bias_pipeline = pipeline("zero-shot-classification", model="Hate-speech-CNERG/dehatebert-mono-english")

    def load_gpt2_model(self):
        model = GPT2LMHeadModel.from_pretrained('gpt2')
        tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        return model, tokenizer

    def evaluate_bleu_rouge(self, candidates, references):
        # Corpus-level BLEU plus average ROUGE-1 F1 over the candidate/reference pairs
        bleu_score = corpus_bleu(candidates, [references]).score
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        rouge_scores = [scorer.score(ref, cand) for ref, cand in zip(references, candidates)]
        rouge1 = sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores)
        return bleu_score, rouge1

    def evaluate_bert_score(self, candidates, references):
        P, R, F1 = score(candidates, references, lang="en", model_type='bert-base-multilingual-cased')
        return P.mean().item(), R.mean().item(), F1.mean().item()

    def evaluate_perplexity(self, text):
        # Sliding-window perplexity with GPT-2: exp(total negative log-likelihood / token count)
        encodings = self.gpt2_tokenizer(text, return_tensors='pt')
        max_length = self.gpt2_model.config.n_positions
        stride = 512
        lls = []
        for i in range(0, encodings.input_ids.size(1), stride):
            begin_loc = max(i + stride - max_length, 0)
            end_loc = min(i + stride, encodings.input_ids.size(1))
            trg_len = end_loc - i
            input_ids = encodings.input_ids[:, begin_loc:end_loc]
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # only score the last trg_len tokens of the window
            with torch.no_grad():
                outputs = self.gpt2_model(input_ids, labels=target_ids)
                log_likelihood = outputs[0] * trg_len
            lls.append(log_likelihood)
        ppl = torch.exp(torch.stack(lls).sum() / end_loc)
        return ppl.item()

    def evaluate_diversity(self, texts):
        # Unique bigrams divided by total tokens (distinct-2 style diversity)
        all_tokens = [tok for text in texts for tok in text.split()]
        unique_bigrams = set(ngrams(all_tokens, 2))
        diversity_score = len(unique_bigrams) / len(all_tokens) if all_tokens else 0
        return diversity_score

    def evaluate_racial_bias(self, text):
        # Score assigned to the "hate speech" label by the zero-shot classifier
        results = self.bias_pipeline([text], candidate_labels=["hate speech", "not hate speech"])
        bias_score = results[0]['scores'][results[0]['labels'].index('hate speech')]
        return bias_score

    def evaluate_all(self, question, response, reference):
        candidates = [response]
        references = [reference]
        bleu, rouge1 = self.evaluate_bleu_rouge(candidates, references)
        bert_p, bert_r, bert_f1 = self.evaluate_bert_score(candidates, references)
        perplexity = self.evaluate_perplexity(response)
        diversity = self.evaluate_diversity(candidates)
        racial_bias = self.evaluate_racial_bias(response)
        return {
            "BLEU": bleu,
            "ROUGE-1": rouge1,
            "BERT P": bert_p,
            "BERT R": bert_r,
            "BERT F1": bert_f1,
            "Perplexity": perplexity,
            "Diversity": diversity,
            "Racial Bias": racial_bias
        }
Instantiate the Evaluator
evaluator = RAGEvaluator()
question = "What is mentioned about Composability?"
response = answer
reference = context
metrics = evaluator.evaluate_all(question, response, reference)
Format the Metrics Output
for k, v in metrics.items():
    if k == 'BLEU':
        print(f"BLEU measures the overlap between the generated output and reference text based on n-grams. Higher scores indicate a better match. Score obtained: {v}")
    elif k == "ROUGE-1":
        print(f"ROUGE-1 measures the overlap of unigrams between the generated output and reference text. Higher scores indicate a better match. Score obtained: {v}")
    elif k == 'BERT P':
        print("BERTScore evaluates the semantic similarity between the generated output and reference text using BERT embeddings.")
        print(f"\n\n**BERT Precision**: {metrics['BERT P']}")
        print(f"**BERT Recall**: {metrics['BERT R']}")
        print(f"**BERT F1 Score**: {metrics['BERT F1']}")
    elif k == 'Perplexity':
        print(f"Perplexity measures how well a language model predicts the text. Lower values indicate better fluency and coherence. Score obtained: {v}")
    elif k == 'Diversity':
        print(f"Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate more diverse and varied output. Score obtained: {v}")
    elif k == 'Racial Bias':
        print(f"Racial Bias score indicates the presence of biased language in the generated output. Higher scores indicate more bias. Score obtained: {v}")
* BLEU measures the overlap between the generated output and reference text based on n-grams. Higher scores indicate a better match. Score obtained: 84.31369119311947
* ROUGE-1 measures the overlap of unigrams between the generated output and reference text. Higher scores indicate a better match. Score obtained: 0.9166666666666666
* BERTScore evaluates the semantic similarity between the generated output and reference text using BERT embeddings.
**BERT Precision**: 0.9040709137916565
**BERT Recall**: 0.910340428352356
**BERT F1 Score**: 0.9071947932243347
* Perplexity measures how well a language model predicts the text. Lower values indicate better fluency and coherence. Score obtained: 27.312009811401367
* Diversity measures the uniqueness of bigrams in the generated output. Higher values indicate more diverse and varied output. Score obtained: 0.8409090909090909
* Racial Bias score indicates the presence of biased language in the generated output. Higher scores indicate more bias. Score obtained:
Conclusion
Here we have built a RAG response evaluator in Python using standard language evaluation metrics. Moreover, this solution is independent of any dedicated evaluation tools, although it relies on the same underlying metrics.