Ask questions to your images using LangChain and Python

Plaban Nayak

What is LangChain?

LangChain is an open-source framework, available as Python and JavaScript (TypeScript) packages, that lets AI developers connect Large Language Models (LLMs) such as GPT-4 to external data. On their own, LLMs are limited to the data they were trained on (up to 2021 for GPT-4); LangChain removes this constraint by giving them access to custom data and computation. With it, developers can let their GPT models pull in up-to-date information from databases, reports, documents, and websites. This versatility, combined with the capabilities of powerful LLMs like GPT-4, has made LangChain hugely popular since its introduction, particularly after the release of GPT-4.

Components of LangChain

A LangChain application consists of the following main components:

Models (LLM Wrappers): Used to interact with Large Language Models.

Prompts: A “prompt” is the input to the model. This input is rarely hard-coded; it is usually constructed from multiple components. A PromptTemplate is responsible for building this input, and LangChain provides several classes and functions to make constructing and working with prompts easy (see the short sketch after this list).

Chains: Chains are sequences of predetermined steps, so they are a good place to start: they give you more control and make it easier to understand what is happening.

Tools: A tool is a specific abstraction around a function that makes it easy for a language model to interact with it. Specifically, a tool's interface has a single text input and a single text output.

Agents: The component that uses the language model to drive decision making, for example deciding which tool to call next.

Embeddings and Vector Stores: This is where the custom-data aspect of LangChain comes in. The idea is to break large data into chunks, embed those chunks, and store the embeddings in a vector store so the relevant pieces can be retrieved when needed.
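
To make the Prompts and Chains pieces concrete, here is a minimal, illustrative sketch of a PromptTemplate wired into an LLMChain. The prompt text, example input, and model name are placeholders, not part of the image agent we build below, and it assumes OPENAI_API_KEY is set in your environment.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

# A template with a single input variable; LangChain fills it in at run time.
prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest three names for a company that makes {product}.",
)

# gpt-3.5-turbo is used here only as an example model.
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# An LLMChain is the simplest chain: prompt -> model -> text output.
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("eco-friendly water bottles"))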

Here we will implement a custom LangChain agent to interact with images. The basic functionality involves:

i. Generating a caption for the image uploaded.

ii. Identifying the objects in the image.

iii. Extracting the bounding boxes for the objects identified.

Step 1: Install Necessary Packages

!pip install -qU openai langchain transformers tabulate timm

Step 2: Import Required Libraries

from langchain.tools import BaseTool
from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch
#
import os
from tempfile import NamedTemporaryFile
from langchain.agents import initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

Step 3: Build Custom Tools

class ImageCaptionTool(BaseTool):
    name: str = "Image captioner"
    description: str = "Use this tool when given the path to an image that you would like to be described. " \
                       "It will return a simple caption describing the image."

    def _run(self, img_path):
        image = Image.open(img_path).convert('RGB')

        model_name = "Salesforce/blip-image-captioning-large"
        device = "cpu"  # set to "cuda" if a GPU is available

        # load the BLIP processor and captioning model
        processor = BlipProcessor.from_pretrained(model_name)
        model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

        inputs = processor(image, return_tensors='pt').to(device)
        output = model.generate(**inputs, max_new_tokens=20)

        caption = processor.decode(output[0], skip_special_tokens=True)

        return caption

    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")


class ObjectDetectionTool(BaseTool):
    name: str = "Object detector"
    description: str = "Use this tool when given the path to an image that you would like to detect objects in. " \
                       "It will return a list of all detected objects. Each element in the list is in the format: " \
                       "[x1, y1, x2, y2] class_name confidence_score."

    def _run(self, img_path):
        image = Image.open(img_path).convert('RGB')

        # load the DETR processor and object-detection model
        processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

        inputs = processor(images=image, return_tensors="pt")
        outputs = model(**inputs)

        # convert outputs (bounding boxes and class logits) to COCO API format
        # and keep only detections with score > 0.9
        target_sizes = torch.tensor([image.size[::-1]])
        results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

        detections = ""
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
            detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
            detections += ' {}'.format(model.config.id2label[int(label)])
            detections += ' {}\n'.format(float(score))

        return detections

    def _arun(self, query: str):
        raise NotImplementedError("This tool does not support async")
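
Before handing these tools to an agent, it is worth sanity-checking them on their own. A minimal check (a sketch; the image path is a placeholder for any local image, such as the sample downloaded in Step 5):

# standalone check of the two tools, outside of any agent
caption_tool = ImageCaptionTool()
detection_tool = ObjectDetectionTool()

print(caption_tool.run("/content/Parsons_PR.jpg"))    # short caption
print(detection_tool.run("/content/Parsons_PR.jpg"))  # one line per detected object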

Step 4: Define Helper Functions

def get_image_caption(image_path):
    """
    Generates a short caption for the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string representing the caption for the image.
    """
    image = Image.open(image_path).convert('RGB')

    model_name = "Salesforce/blip-image-captioning-large"
    device = "cpu"  # set to "cuda" if a GPU is available

    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

    inputs = processor(image, return_tensors='pt').to(device)
    output = model.generate(**inputs, max_new_tokens=20)

    caption = processor.decode(output[0], skip_special_tokens=True)

    return caption


def detect_objects(image_path):
    """
    Detects objects in the provided image.

    Args:
        image_path (str): The path to the image file.

    Returns:
        str: A string with all the detected objects. Each object is reported as
        '[x1, y1, x2, y2] class_name confidence_score'.
    """
    image = Image.open(image_path).convert('RGB')

    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)

    # convert outputs (bounding boxes and class logits) to COCO API format
    # and keep only detections with score > 0.9
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

    detections = ""
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        detections += '[{}, {}, {}, {}]'.format(int(box[0]), int(box[1]), int(box[2]), int(box[3]))
        detections += ' {}'.format(model.config.id2label[int(label)])
        detections += ' {}\n'.format(float(score))

    return detections
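
Note that both the tools and these helpers load BLIP and DETR from scratch on every call, which is slow. An optional optimization, not part of the original walkthrough, is to cache the loaders so each model is built only once per process; a sketch using functools.lru_cache:

from functools import lru_cache
from transformers import BlipProcessor, BlipForConditionalGeneration, DetrImageProcessor, DetrForObjectDetection

@lru_cache(maxsize=1)
def load_caption_model(device="cpu"):
    # build the BLIP processor/model once and reuse them on later calls
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to(device)
    return processor, model

@lru_cache(maxsize=1)
def load_detection_model():
    # build the DETR processor/model once and reuse them on later calls
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
    return processor, model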

Step 5: Main Processing

import openai
from getpass import getpass

# set the OpenAI API key
openai_api_key = getpass()

# initialize the agent with the two custom tools
tools = [ImageCaptionTool(), ObjectDetectionTool()]

conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)

llm = ChatOpenAI(
    openai_api_key=openai_api_key,
    temperature=0,
    model_name="gpt-3.5-turbo"
)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    max_iterations=5,
    verbose=True,
    memory=conversational_memory,
    early_stopping_method='generate'
)

# download the sample image
!wget https://www.smartcitiesworld.net/AcuCustom/Sitename/DAM/019/Parsons_PR.jpg
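
The !wget call only works inside a notebook shell. In a plain Python script you can fetch the same image with requests (assuming the requests package is installed):

import requests

# download the sample image to the current directory
url = "https://www.smartcitiesworld.net/AcuCustom/Sitename/DAM/019/Parsons_PR.jpg"
with open("Parsons_PR.jpg", "wb") as f:
    f.write(requests.get(url, timeout=30).content)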

Ask questions pertaining to the image

Question 1

image_path = "/content/Parsons_PR.jpg"
user_question = "generate a caption for this iamge?"
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)
> Entering new AgentExecutor chain...
{
"action": "Image captioner",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: cars are driving down the street in traffic at a green light
Thought:{
"action": "Final Answer",
"action_input": "The image shows cars driving down the street in traffic at a green light."
}

> Finished chain.
response: The image shows cars driving down the street in traffic at a green light.

Question 2:

image_path = "/content/Parsons_PR.jpg"
user_question = "Please tell me what are the items present in the image."
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)
> Entering new AgentExecutor chain...
{
"action": "Object detector",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: [518, 40, 582, 110] car 0.9300937652587891
[188, 381, 311, 469] car 0.9253759384155273
[1068, 223, 1104, 342] person 0.987162172794342
[828, 233, 949, 329] car 0.9450376629829407
[1076, 263, 1106, 347] bicycle 0.9070376753807068
[635, 71, 713, 135] car 0.921174168586731
[0, 433, 100, 603] car 0.9781951308250427
[151, 747, 339, 799] car 0.9839044809341431
[389, 267, 493, 367] car 0.9801359176635742
[192, 478, 341, 633] car 0.995318591594696
[578, 117, 828, 550] traffic light 0.9860804677009583
[802, 666, 1028, 798] car 0.982887327671051
[0, 639, 84, 799] car 0.9630037546157837
[1057, 608, 1199, 766] car 0.9652799367904663
[988, 218, 1031, 347] person 0.9471640586853027
[751, 524, 909, 675] car 0.9911800026893616
[489, 560, 670, 749] car 0.9970000386238098

Thought:{
"action": "Final Answer",
"action_input": "The objects present in the image are: car, car, person, car, bicycle, car, car, car, car, car, traffic light, car, car, car, person, car, car."
}

> Finished chain.

response: "The objects present in the image are: car, car, person, car, bicycle, ca
, car, car, car, car, traffic light, car, car, car, person, car, car."

Question 3

image_path = "/content/Parsons_PR.jpg"
user_question = "Please tell me the bounding boxes of all detected objects in the image."
response = agent.run(f'{user_question}, this is the image path: {image_path}')
print(response)


> Entering new AgentExecutor chain...
{
"action": "Object detector",
"action_input": "/content/Parsons_PR.jpg"
}
Observation: [518, 40, 582, 110] car 0.9300937652587891
[188, 381, 311, 469] car 0.9253759384155273
[1068, 223, 1104, 342] person 0.987162172794342
[828, 233, 949, 329] car 0.9450376629829407
[1076, 263, 1106, 347] bicycle 0.9070376753807068
[635, 71, 713, 135] car 0.921174168586731
[0, 433, 100, 603] car 0.9781951308250427
[151, 747, 339, 799] car 0.9839044809341431
[389, 267, 493, 367] car 0.9801359176635742
[192, 478, 341, 633] car 0.995318591594696
[578, 117, 828, 550] traffic light 0.9860804677009583
[802, 666, 1028, 798] car 0.982887327671051
[0, 639, 84, 799] car 0.9630037546157837
[1057, 608, 1199, 766] car 0.9652799367904663
[988, 218, 1031, 347] person 0.9471640586853027
[751, 524, 909, 675] car 0.9911800026893616
[489, 560, 670, 749] car 0.9970000386238098

Thought:{
"action": "Final Answer",
"action_input": "The detected objects in the image are: \n[518, 40, 582, 110] car \n[188, 381, 311, 469] car \n[1068, 223, 1104, 342] person \n[828, 233, 949, 329] car \n[1076, 263, 1106, 347] bicycle \n[635, 71, 713, 135] car \n[0, 433, 100, 603] car \n[151, 747, 339, 799] car \n[389, 267, 493, 367] car \n[192, 478, 341, 633] car \n[578, 117, 828, 550] traffic light \n[802, 666, 1028, 798] car \n[0, 639, 84, 799] car \n[1057, 608, 1199, 766] car \n[988, 218, 1031, 347] person \n[751, 524, 909, 675] car \n[489, 560, 670, 749] car"
}

> Finished chain.
response: The detected objects in the image are: \n[518, 40, 582, 110] car \n[188, 381, 311, 469] car \n[1068, 223, 1104, 342] person \n[828, 233, 949, 329] car \n[1076, 263, 1106, 347] bicycle \n[635, 71, 713, 135] car \n[0, 433, 100, 603] car \n[151, 747, 339, 799] car \n[389, 267, 493, 367] car \n[192, 478, 341, 633] car \n[578, 117, 828, 550] traffic light \n[802, 666, 1028, 798] car \n[0, 639, 84, 799] car \n[1057, 608, 1199, 766] car \n[988, 218, 1031, 347] person \n[751, 524, 909, 675] car \n[489, 560, 670, 749] car
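
The agent returns the detections as plain text. If you want to reuse the boxes programmatically, for example to draw them on the image with PIL, a small parser over the '[x1, y1, x2, y2] class_name confidence_score' format emitted by the tool is enough. A rough sketch (the regular expression assumes exactly the output format shown above):

import re

def parse_detections(text):
    """Turn lines like '[518, 40, 582, 110] car 0.93' into structured records."""
    pattern = r"\[(\d+), (\d+), (\d+), (\d+)\] (.+?) ([0-9.]+)"
    records = []
    for x1, y1, x2, y2, label, score in re.findall(pattern, text):
        records.append({
            "box": (int(x1), int(y1), int(x2), int(y2)),  # x1, y1, x2, y2
            "label": label,
            "score": float(score),
        })
    return records

# e.g. parse_detections(detect_objects("/content/Parsons_PR.jpg"))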
