Ask questions about your images using LangChain and Python
--
What is LangChain?
LangChain is an open-source framework, available as Python and JavaScript (TypeScript) packages, that lets AI developers connect Large Language Models (LLMs) such as GPT-4 to external data. On their own, LLMs are limited to the data they were trained on (for GPT models, roughly up to 2021); LangChain works around this constraint by letting LLMs access custom data and computation. With LangChain, developers can let their GPT models incorporate up-to-date information from databases, reports, documents, and websites. This versatility, combined with the capabilities of powerful LLMs, has made LangChain very popular since its introduction, particularly after the release of GPT-4.
Components of LangChain
A LangChain application is built from the following main components (a minimal prompt-and-chain sketch follows the list):
Models (LLM wrappers): used to interact with Large Language Models.
Prompts: a "prompt" is the input to the model. This input is rarely hard-coded; it is usually constructed from multiple components. A PromptTemplate is responsible for building this input, and LangChain provides several classes and functions that make constructing and working with prompts easy.
Chains: a sequence of predetermined steps. They are a good place to start because they give you more control and make it easier to understand what is happening.
Tools: an abstraction around a function that makes it easy for a language model to interact with it. Specifically, a tool's interface has a single text input and a single text output.
Agents: the language model that drives decision making, choosing which tools to call and in what order.
Embeddings and Vector Stores: this is where the custom-data aspect of LangChain comes in. The idea is to break large data into chunks, embed them, and store the embeddings so that the relevant chunks can be retrieved when queried.
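As a small illustration of models, prompts, and chains working together, here is a minimal sketch. It assumes the langchain and openai packages are installed and that an OpenAI API key is available in the environment; it is not part of the image agent built below.
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# A prompt template with a single input variable.
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one sentence.",
)

# The model wrapper (reads OPENAI_API_KEY from the environment).
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# A chain ties the prompt and the model into a single callable step.
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(topic="vector stores"))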
Here we will implement a custom LangChain agent to interact with images. The basic functionality involves:
i. Generating a caption for the image uploaded.
ii. Identifying the objects in the image.
iii. Extracting the bounding boxes for the objects identified.
Step 1: Install the Necessary Packages
!pip install -qU openai langchain transformers tabulate timm
Step 2: Import the Required Libraries
from langchain.tools import BaseTool
from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch
#
import os
from tempfile import NamedTemporaryFile
from langchain.agents import initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
Step 3: Build Custom Tools
class ImageCaptionTool(BaseTool):
    name = "Image captioner"
    description = "Use this tool when given the path to an image that you would like to be described. " \
                  "It will return a simple caption describing the image."

    def _run(self, img_path):
        # Load the image and make sure it is in RGB mode.
        image = Image.open(img_path).convert('RGB')
        model_name = "Salesforce/blip-image-captioning-large"
        device = "cpu"  # change to "cuda" to use a GPU

        # Load the BLIP processor and captioning model.
        processor = BlipProcessor.from_pretrained(model_name)
        model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

        # Preprocess the image and generate a short caption.
        inputs = processor(image, return_tensors='pt').to(device)
        output = model.generate(**inputs, max_new_tokens=20)
        caption = processor.decode(output[0], skip_special_tokens=True)
        return caption

    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")
class ObjectDetectionTool(BaseTool):
    name = "Object detector"
    description = "Use this tool when given the path to an image in which you would like to detect objects. " \
                  "It will return a list of all detected objects. Each element in the list is in the format: " \
                  "[x1, y1, x2, y2] class_name confidence_score."

    def _run(self, img_path):
        image = Image.open(img_path).convert('RGB')

        # Load the DETR processor and object-detection model.
        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        inputs = processor(images=image, return_tensors="pt")
        outputs = model(**inputs)

        # Convert outputs (bounding boxes and class logits) to COCO API format
        # and keep only detections with score > 0.9.
        target_sizes = torch.tensor([image.size[::-1]])
        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

        detections = ""
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
            detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
            detections += ' {}'.format(model.config.id2label[int(label)])
            detections += ' {}\n'.format(float(score))
        return detections

    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")
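Before wiring these tools into an agent, they can be exercised directly. A minimal sanity check, assuming an image already exists at the example path shown:
caption_tool = ImageCaptionTool()
detection_tool = ObjectDetectionTool()

# Call each tool directly with an image path (the path below is just an example).
print(caption_tool.run("/content/Parsons_PR.jpg"))
print(detection_tool.run("/content/Parsons_PR.jpg"))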
Step 4: Define Helper Functions
def get_image_caption(image_path):
    """
    Generates a short caption for the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string representing the caption for the image.
    """
    image = Image.open(image_path).convert('RGB')
    model_name = "Salesforce/blip-image-captioning-large"
    device = "cpu"  # change to "cuda" to use a GPU

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

    inputs = processor(image, return_tensors='pt').to(device)
    output = model.generate(**inputs, max_new_tokens=20)
    caption = processor.decode(output[0], skip_special_tokens=True)
    return caption
def detect_objects(image_path):
    """
    Detects objects in the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string listing the detected objects, one per line, in the format
            '[x1, y1, x2, y2] class_name confidence_score'.
    """
    image = Image.open(image_path).convert('RGB')
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

    # Convert outputs (bounding boxes and class logits) to COCO API format
    # and keep only detections with score > 0.9.
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

    detections = ""
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
        detections += ' {}'.format(model.config.id2label[int(label)])
        detections += ' {}\n'.format(float(score))
    return detections
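These helpers mirror the _run logic of the tool classes above. If you prefer plain functions to BaseTool subclasses, they could instead be wrapped with Tool.from_function; a sketch (not used in the rest of this post):
from langchain.tools import Tool

# Wrap the plain functions as tools (an alternative to subclassing BaseTool).
caption_fn_tool = Tool.from_function(
    func=get_image_caption,
    name="Image captioner",
    description="Generates a short caption for the image at the given path.",
)
detection_fn_tool = Tool.from_function(
    func=detect_objects,
    name="Object detector",
    description="Lists the objects detected in the image at the given path, with bounding boxes and confidence scores.",
)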
Step 5: Main Processing
import openai
from getpass import getpass
# Set the OpenAI API key (entered securely at runtime)
openai_api_key = getpass()

# Initialize the agent with the custom tools, conversational memory, and an OpenAI chat model
tools = [ImageCaptionTool(), ObjectDetectionTool()]
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)

llm = ChatOpenAI(
    openai_api_key=openai_api_key,
    temperature=0,
    model_name="gpt-3.5-turbo"
)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    max_iterations=5,
    verbose=True,
    memory=conversational_memory,
    early_stopping_method='generate'
)
# Download the sample image used in the examples below
!wget https://www.smartcitiesworld.net/AcuCustom/Sitename/DAM/019/Parsons_PR.jpg
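If wget is not available in your environment, the same download can be done in Python; a small sketch using requests (assuming the URL is still reachable):
import requests

url = "https://www.smartcitiesworld.net/AcuCustom/Sitename/DAM/019/Parsons_PR.jpg"

# Download the image and save it to the path used in the examples below.
resp = requests.get(url, timeout=30)
resp.raise_for_status()
with open("/content/Parsons_PR.jpg", "wb") as f:
    f.write(resp.content)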
Ask questions pertaining to the image
Question 1
image_path = "/content/Parsons_PR.jpg"
user_question = "generate a caption for this iamge?"
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)
> Entering new AgentExecutor chain...
{
"action": "Image captioner",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: cars are driving down the street in traffic at a green light
Thought:{
"action": "Final Answer",
"action_input": "The image shows cars driving down the street in traffic at a green light."
}
> Finished chain.
response: The image shows cars driving down the street in traffic at a green light.
Question 2
image_path = "/content/Parsons_PR.jpg"
user_question = "Please tell me what are the items present in the image."
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)
> Entering new AgentExecutor chain...
{
"action": "Object detector",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: [518, 40, 582, 110] car 0.9300937652587891
[188, 381, 311, 469] car 0.9253759384155273
[1068, 223, 1104, 342] person 0.987162172794342
[828, 233, 949, 329] car 0.9450376629829407
[1076, 263, 1106, 347] bicycle 0.9070376753807068
[635, 71, 713, 135] car 0.921174168586731
[0, 433, 100, 603] car 0.9781951308250427
[151, 747, 339, 799] car 0.9839044809341431
[389, 267, 493, 367] car 0.9801359176635742
[192, 478, 341, 633] car 0.995318591594696
[578, 117, 828, 550] traffic light 0.9860804677009583
[802, 666, 1028, 798] car 0.982887327671051
[0, 639, 84, 799] car 0.9630037546157837
[1057, 608, 1199, 766] car 0.9652799367904663
[988, 218, 1031, 347] person 0.9471640586853027
[751, 524, 909, 675] car 0.9911800026893616
[489, 560, 670, 749] car 0.9970000386238098
Thought:{
"action": "Final Answer",
"action_input": "The objects present in the image are: car, car, person, car, bicycle, car, car, car, car, car, traffic light, car, car, car, person, car, car."
}
> Finished chain.
response: "The objects present in the image are: car, car, person, car, bicycle, ca
, car, car, car, car, traffic light, car, car, car, person, car, car."
Question 3
image_path = "/content/Parsons_PR.jpg"
user_question = "Please tell me the bounding boxes of all detected objects in the image."
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)
> Entering new AgentExecutor chain...
{
"action": "Object detector",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: [518, 40, 582, 110] car 0.9300937652587891
[188, 381, 311, 469] car 0.9253759384155273
[1068, 223, 1104, 342] person 0.987162172794342
[828, 233, 949, 329] car 0.9450376629829407
[1076, 263, 1106, 347] bicycle 0.9070376753807068
[635, 71, 713, 135] car 0.921174168586731
[0, 433, 100, 603] car 0.9781951308250427
[151, 747, 339, 799] car 0.9839044809341431
[389, 267, 493, 367] car 0.9801359176635742
[192, 478, 341, 633] car 0.995318591594696
[578, 117, 828, 550] traffic light 0.9860804677009583
[802, 666, 1028, 798] car 0.982887327671051
[0, 639, 84, 799] car 0.9630037546157837
[1057, 608, 1199, 766] car 0.9652799367904663
[988, 218, 1031, 347] person 0.9471640586853027
[751, 524, 909, 675] car 0.9911800026893616
[489, 560, 670, 749] car 0.9970000386238098
Thought:{
"action": "Final Answer",
"action_input": "The detected objects in the image are: \n[518, 40, 582, 110] car \n[188, 381, 311, 469] car \n[1068, 223, 1104, 342] person \n[828, 233, 949, 329] car \n[1076, 263, 1106, 347] bicycle \n[635, 71, 713, 135] car \n[0, 433, 100, 603] car \n[151, 747, 339, 799] car \n[389, 267, 493, 367] car \n[192, 478, 341, 633] car \n[578, 117, 828, 550] traffic light \n[802, 666, 1028, 798] car \n[0, 639, 84, 799] car \n[1057, 608, 1199, 766] car \n[988, 218, 1031, 347] person \n[751, 524, 909, 675] car \n[489, 560, 670, 749] car"
}
> Finished chain.
response: The detected objects in the image are:
[518, 40, 582, 110] car
[188, 381, 311, 469] car
[1068, 223, 1104, 342] person
[828, 233, 949, 329] car
[1076, 263, 1106, 347] bicycle
[635, 71, 713, 135] car
[0, 433, 100, 603] car
[151, 747, 339, 799] car
[389, 267, 493, 367] car
[192, 478, 341, 633] car
[578, 117, 828, 550] traffic light
[802, 666, 1028, 798] car
[0, 639, 84, 799] car
[1057, 608, 1199, 766] car
[988, 218, 1031, 347] person
[751, 524, 909, 675] car
[489, 560, 670, 749] car