Build an Advanced Reranking-RAG System Using Llama-Index, Llama 3 and Qdrant

Plaban Nayak
9 min read · Apr 30, 2024


Introduction

LLMs, despite their ability to generate meaningful and grammatically correct text, face a challenge known as hallucination. Hallucination in LLMs refers to their tendency to confidently produce incorrect answers, creating false information that can appear convincing. This issue has been prevalent since the inception of LLMs and often results in inaccurate and factually incorrect outputs.

To address hallucination, Fact Checking is crucial. One approach to prototyping LLMs for Fact Checking involves three methods:

  1. Prompt Engineering
  2. Retrieval Augmented Generation (RAG)
  3. Fine-Tuning

In this article, we will use RAG (Retrieval Augmented Generation) to mitigate hallucination.

What Is RAG ?

RAG = Dense Vector Retrieval (R) + In-Context Learning (AG)

Retrieval: Find references relevant to the question in your documents.

Augmented: Add those references to your prompt.

Generation: Produce an answer grounded in the retrieved references.
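
To make these three steps concrete, here is a small illustrative sketch. The retrieve and generate callables are hypothetical placeholders standing in for a real vector-store query and a real LLM call; the actual stack is built step by step later in the article.

# Illustrative only: `retrieve` and `generate` are hypothetical placeholders
# for a real vector-store query and a real LLM call.
def answer_with_rag(question, retrieve, generate, top_k=3):
    # R: fetch the most relevant document chunks for the question
    context_chunks = retrieve(question, top_k=top_k)
    # A: augment the prompt with the retrieved references
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks) + "\n\n"
        "Question: " + question + "\nAnswer:"
    )
    # G: generate an answer grounded in that context
    return generate(prompt)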

In RAG, we process a collection of textual documents or document segments by encoding them into numerical representations known as vector embeddings. Each vector embedding corresponds to a single document segment and is stored in a database called the vector store. The models responsible for encoding these segments into embeddings are called encoding models or bi-encoders. These models are trained on extensive data sets, giving them the ability to create powerful representations of the document segments in a single vector embedding. To avoid hallucination, RAG leverages factual knowledge sources that are kept separate from the reasoning capabilities of LLMs. This knowledge is stored externally and can be easily accessed and updated.

There are two types of knowledge sources:

  1. Parametric knowledge: This knowledge is acquired during training and is implicitly stored in the neural network’s weights.
  2. Non-parametric knowledge: This type of knowledge is stored in an external source, such as a vector database.
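
The non-parametric store is built from the bi-encoder embeddings described above. As a small illustration (assuming the sentence-transformers package, which is also in our requirements), the snippet below embeds two document segments and a query and compares them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

# Bi-encoder: each text is encoded independently into a single vector
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

segments = [
    "BERT is pre-trained with a masked language modeling objective.",
    "Qdrant stores vectors together with payload metadata.",
]
query = "How is BERT pre-trained?"

segment_embeddings = model.encode(segments)
query_embedding = model.encode(query)

# Higher cosine similarity means the segment is semantically closer to the query
print(util.cos_sim(query_embedding, segment_embeddings))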

Why RAG Before Fine-Tuning (Order of Operation)?

  1. Cheap: No additional training required.
  2. Easier to update with the latest information.
  3. More trustworthy because of fact-checkable references

The optimization workflow gives a summary of the approach that can be used based on the following two factors:

  1. Content Optimization: What the model needs to know.
  2. LLM Optimization: How the model needs to act.

RAG Data Stack

📁 Load Language Data

⚙️ Process Language Data

🤖 Embed Language Data

🗂 Load Vectors into Database

Stages Involved in RAG

The stages involved in RAG are:

  1. Data Loading: This involves retrieving data from various sources such as text files, PDFs, websites, databases, or APIs and integrating it into your pipeline. Llama Hub offers a wide range of connectors for this purpose.
  2. Indexing: This stage focuses on creating a structured format for data querying. For LLMs, indexing typically involves generating vector embeddings, which are numerical representations of the data’s meaning, along with additional metadata strategies to facilitate accurate and contextually relevant data retrieval.
  3. Storage: After indexing, it’s common practice to store the index and associated metadata to avoid the need for repeated indexing in the future.
  4. Querying: There are multiple ways to utilize LLMs and Llama-Index data structures for querying, including sub-queries, multi-step queries, and hybrid strategies, depending on the chosen indexing strategy.
  5. Evaluation: This step is crucial for assessing the effectiveness of the pipeline compared to alternative strategies or when implementing changes. Evaluation provides objective metrics regarding the accuracy, fidelity, and speed of query responses.

Our RAG stack is built using Llama-Index, Qdrant, and Llama 3.
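
Before diving into each component, here is a minimal sketch of how the five stages map onto llama-index calls (the directory names are illustrative, and an embedding model and LLM are assumed to be configured via Settings, as we do later):

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)

# 1. Data Loading
documents = SimpleDirectoryReader("Data").load_data()

# 2. Indexing: chunks are embedded with the configured embedding model
index = VectorStoreIndex.from_documents(documents)

# 3. Storage: persist the index so it does not need to be rebuilt every run
index.storage_context.persist(persist_dir="./storage")

# ...later, reload instead of re-indexing
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# 4. Querying
response = index.as_query_engine().query("What does the paper propose?")
print(response)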

What Is Llama-Index ?

Llama-Index serves as a framework designed for developing LLM applications enriched with context. Context augmentation involves utilizing LLMs with your private or domain-specific data.

Some popular applications of this framework include:

  1. Question-Answering Chatbots (often known as RAG systems, short for “Retrieval-Augmented Generation”)
  2. Document Understanding and Extraction
  3. Autonomous Agents capable of conducting research and taking actions

Llama-Index offers a comprehensive set of tools to facilitate the development of these applications, from initial prototypes to production-ready solutions. These tools enable data ingestion and processing, as well as the implementation of sophisticated query workflows that combine data access with LLM-based prompting.

Here we have used llama-index >= v0.10.

Source: https://www.llamaindex.ai/blog/llamaindex-v0-10-838e735948f8

Major Enhancements

ServiceContext is deprecated: Every LlamaIndex user is familiar with ServiceContext, which had gradually become outdated and cumbersome for managing LLMs, embeddings, chunk sizes, callbacks, and other settings. It is now fully deprecated; you can either specify arguments directly on each component or set global defaults via Settings.
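
For example, defaults are now set once on the global Settings object, and individual components can still be passed directly where needed (a minimal sketch; documents is assumed to be an already-loaded list of documents):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Set global defaults once...
Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512

# ...or pass a component directly to the call that needs it
index = VectorStoreIndex.from_documents(
    documents,  # assumed to be loaded earlier with SimpleDirectoryReader
    embed_model=FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5"),
)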

Revamped Folder Structure:

  1. llama-index-core: This folder encompasses all core Llama-Index abstractions.
  2. llama-index-integrations: This folder includes third-party integrations for 19 Llama-Index abstractions, covering data loaders, LLMs, embedding models, vector stores, and more.
  3. llama-index-packs: Here, you’ll find our collection of 50+ LlamaPacks, which are templates aimed at jumpstarting a user’s application.

LlamaHub will serve as the central hub for all integrations.

Llama 3

Meta’s Llama 3 is the latest version of the open-access Llama series, accessible through Hugging Face. It serves as the language model for response synthesis. Llama 3 is available in two sizes: 8B for streamlined deployment and development on consumer-grade GPUs, and 70B for extensive AI applications. Each size variant offers both base and instruction-tuned versions. Additionally, a new iteration of Llama Guard, fine-tuned on Llama 3 8B, has been introduced as Llama Guard 2.

What Is Qdrant ?

Qdrant is a vector similarity search engine that offers a production-ready service through an easy-to-use API. It specializes in storing, searching, and managing points (vectors) along with additional payload information. It is optimized for efficiently storing and querying high-dimensional vectors. Vector databases like Qdrant leverage specialized data structures and indexing techniques such as Hierarchical Navigable Small World (HNSW) for implementing Approximate Nearest Neighbors and Product Quantization, among others. These optimizations enable fast similarity and semantic search, allowing users to locate vectors that closely match a given query vector based on a specified distance metric. Commonly used distance metrics supported by Qdrant include Euclidean Distance, Cosine Similarity, and Dot Product.
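
Qdrant can also be used directly through the qdrant-client package. The sketch below (in-memory mode with toy 4-dimensional vectors; the collection name is arbitrary) creates a collection using cosine distance, upserts a point with a payload, and runs a nearest-neighbour search:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(location=":memory:")  # in-memory instance for experiments

# Collection of 4-dimensional vectors compared with cosine similarity
client.create_collection(
    collection_name="demo",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Store a vector together with an arbitrary JSON payload
client.upsert(
    collection_name="demo",
    points=[PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "hello"})],
)

# Approximate nearest-neighbour search for the closest vectors
hits = client.search(collection_name="demo", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
print(hits)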

Technology Stack Used

  • Application Framework: Llama-index
  • Embedding Model: BAAI/bge-small-en-v1.5
  • LLM: Meta-Llama-3
  • Vector Store: Qdrant

Code Implementation

Install Required Libraries

%%writefile requirements.txt
llama-index
llama-index-llms-huggingface
llama-index-embeddings-fastembed
fastembed
unstructured[md]
qdrant-client
llama-index-vector-stores-qdrant
einops
accelerate
sentence-transformers

#
!pip install -r requirements.txt

# Key package versions installed (for reproducibility):
accelerate==0.29.3
einops==0.7.0
sentence-transformers==2.7.0
transformers==4.39.3
qdrant-client==1.9.0
llama-index==0.10.32
llama-index-agent-openai==0.2.3
llama-index-cli==0.1.12
llama-index-core==0.10.32
llama-index-embeddings-fastembed==0.1.4
llama-index-legacy==0.9.48
llama-index-llms-huggingface==0.1.4
llama-index-vector-stores-qdrant==0.2.8

Download the Dataset

!mkdir Data
! wget "https://arxiv.org/pdf/1810.04805.pdf" -O Data/arxiv.pdf

Load the Documents

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("/content/Data").load_data()

Instantiate the Embedding Model

from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import Settings
#
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
#
Settings.embed_model = embed_model
#
Settings.chunk_size = 512
#
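
As an optional sanity check, you can embed a sample string and confirm the vector dimensionality (bge-small-en-v1.5 produces 384-dimensional embeddings):

# Optional sanity check on the embedding model
sample_embedding = embed_model.get_text_embedding("Hello, world!")
print(len(sample_embedding))  # expected: 384 for BAAI/bge-small-en-v1.5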

Define the System Prompt

from llama_index.core import PromptTemplate

system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

Instantiate the LLM

Since we are using Llama 3 as the LLM, we need to do the following:

  1. Generate a Hugging Face access token
  2. Request access to the Meta-Llama-3-8B-Instruct model on Hugging Face

from huggingface_hub import notebook_login
notebook_login()

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    # load the model in float16 to reduce GPU memory usage on CUDA
    model_kwargs={"torch_dtype": torch.float16},
)
Settings.llm = llm
Settings.chunk_size = 512
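
Optionally, you can run a quick completion outside of the RAG pipeline to confirm that the model loads and responds (a smoke test only, not part of the pipeline):

# Optional smoke test: a direct completion with no retrieved context
print(llm.complete("What does BERT stand for?"))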

Instantiate the Vector Store and Load the Vector Embeddings

import qdrant_client
from IPython.display import Markdown, display
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
#
client = qdrant_client.QdrantClient(
    # you can use :memory: mode for fast and lightweight experiments;
    # it does not require Qdrant to be deployed anywhere,
    # but requires qdrant-client >= 1.1.1
    location=":memory:"
    # otherwise set the Qdrant instance address with:
    # url="http://<host>:<port>"
    # or with host and port:
    # host="localhost",
    # port=6333,
    # and set an API key for Qdrant Cloud:
    # api_key="<YOUR API KEY>",
)

vector_store = QdrantVectorStore(client=client, collection_name="test")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
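
To verify that the document chunks were actually written to Qdrant, you can count the points in the collection (an optional check):

# Optional: confirm the embedded chunks landed in the "test" collection
print(client.count(collection_name="test", exact=True))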

Instantiate Reranker Module

Source: llama-index

The retrieval model retrieves the top-k documents based on embedding similarity to the query. There are numerous benefits to embedding-based retrieval:

  1. It’s highly efficient, especially in computing dot products, as it doesn’t require any model calls during query-time.
  2. Although not perfect, embeddings can adequately encode the semantics of both the document and the query. This results in a subset of queries where embedding-based retrieval provides highly relevant results.

However, despite these advantages, embedding-based retrieval can sometimes be imprecise and return irrelevant context to the query. This, in turn, diminishes the overall quality of the RAG system, irrespective of the quality of the LLM.

In this approach, we implement a two-stage retrieval process.

  • The first stage employs embedding-based retrieval with a high top-k value to prioritize recall, even at the cost of lower precision.
  • Subsequently, the second stage employs a slightly more computationally intensive process that emphasizes precision over recall. This stage "reranks" the initially retrieved candidates, enhancing the quality of the final results.

from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3)

Instantiate the Query Engine

import time
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])

Ask Question 1

now = time.time()
response = query_engine.query("What is instruction finetuning?",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response Synthesized by the Generation Module in RAG

Response Generated: Instruction fine-tuning is not
explicitly mentioned in the provided context. However,
based on the text, it can be inferred that fine-tuning is a
process where a pre-trained model like BERT is adapted to a
specific task by swapping out the appropriate inputs and
outputs. This process is described as "straightforward" and
allows BERT to model many downstream tasks by fine-tuning
all the parameters end-to-end.
Elapsed: 7.32s

Ask Question 2

now = time.time()
response = query_engine.query("Describe the Feature-based Approach with BERT??",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response Synthesized by the Generation Module in RAG

Response Generated: According to the text, the
Feature-based Approach with BERT involves extracting the
activations from one or more layers of BERT without
fine-tuning any parameters of BERT. These contextual
embeddings are then used as input to a randomly initialized
two-layer 768-dimensional BiLSTM before the classification
layer. This approach is used to ablate the fine-tuning
approach and demonstrate the effectiveness of BERT for both
fine-tuning and feature-based approaches.
Elapsed: 6.78s

Ask Question 3

now = time.time()
response = query_engine.query("What is SQuADv2.0?",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response Synthesized by the Generation Module in RAG

Response Generated: According to the provided context,
SQuAD v2.0 is an extension of the SQuAD 1.1 problem
definition, allowing for the possibility that no short
answer exists in the provided paragraph, making the problem
more realistic.
Elapsed: 4.15s
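
The Evaluation stage mentioned earlier can be prototyped with llama-index's built-in evaluators. As a minimal sketch, a faithfulness check scores whether the synthesized answer is supported by the retrieved context; here the same Llama 3 model acts as the judge, which is convenient but not ideal:

from llama_index.core.evaluation import FaithfulnessEvaluator

# Use the configured LLM as the judge
evaluator = FaithfulnessEvaluator(llm=llm)
eval_result = evaluator.evaluate_response(response=response)
print(eval_result.passing)  # True if the answer is supported by the retrieved context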

Conclusion

Here we have developed an advanced RAG question-answering system that operates on private data. We incorporated LlamaIndex's reranking module to prioritize the most relevant context among the passages returned by the retriever, which improves the factual grounding of the generated response.
