Detecting sentiments from Images using Tesseract and TextHero

Plaban Nayak
4 min read · Aug 6, 2020

Detecting sentiments of a quote

Categorize quotes uploaded during Pride Month on the basis of their sentiment:

  • positive,
  • negative, and
  • random.

Task

The goal is to build an engine that combines OCR and NLP: it accepts a .jpg file as input, extracts the text, if any, and classifies its sentiment as positive or negative. If the text sentiment is neutral, or the image file does not contain any text, the image is classified as random.
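The decision rule in the task statement can be sketched as a small function. This only illustrates the rule as described, under the assumption that a polarity score in [-1, 1] is already available for the extracted text; the function name and the `neutral_band` parameter are hypothetical and do not come from the original notebook.

```python
def classify_quote(text, polarity, neutral_band=0.0):
    """Return 'Positive', 'Negative', or 'Random' for one image.

    text: string produced by OCR (may be empty)
    polarity: sentiment score in [-1.0, 1.0]
    neutral_band: hypothetical half-width of the zone treated as neutral
    """
    # An image with no readable text is classified as random
    if not text or not text.strip():
        return "Random"
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    # Neutral sentiment is also classified as random
    return "Random"


print(classify_quote("Love is love", 0.5))
print(classify_quote("", 0.0))
```

Widening `neutral_band` (for example to 0.1) would send weakly polarized quotes to the random class instead of forcing a positive or negative label.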

Dataset

The dataset consists of quotes that are uploaded during Pride Month.

Install TextHero for automatic NLP

!pip install texthero

Install Tesseract for reading text from Images

!sudo apt install tesseract-ocr
!pip install pytesseract
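pytesseract is a wrapper that calls the `tesseract` executable, so it can help to confirm that the binary is on the PATH before running OCR. A minimal check using only the standard library (the helper name is hypothetical):

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if the named executable can be found on the PATH."""
    return shutil.which(binary_name) is not None


if not tesseract_available():
    print("tesseract binary not found; install tesseract-ocr first")
```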

Import corresponding Libraries

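import pytesseract
from PIL import Image  # provides Image.open, used below
import re  # used later for keyword matching
import shutil
import os
import random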

Sample text processing from Images

Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Negative.jpg')
extractedInformation = pytesseract.image_to_string(Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Negative.jpg'))
print(extractedInformation)

Output:

Image.open('/content/lgbt.JPG')
extractedInformation = pytesseract.image_to_string('/content/lgbt.JPG')
extractedInformation
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Random.jpg')
extractedInformation = pytesseract.image_to_string(Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Random.jpg'))
print(extractedInformation)

Output: no text, as the image has no text embedded in it.
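The categorization step further down compares the extracted text against the literal string 'no text'. One way to produce that placeholder is a small normalization helper applied to the raw OCR output. This is a sketch under that assumption; the helper name is hypothetical and the original notebook may have handled empty output differently.

```python
def normalize_ocr_text(raw, placeholder="no text"):
    """Collapse whitespace, lowercase, and substitute a placeholder for empty output."""
    # str.split() with no arguments splits on any whitespace,
    # including newlines and the form feed that OCR output may end with
    cleaned = " ".join(raw.split()).lower()
    return cleaned if cleaned else placeholder


print(normalize_ocr_text("Love\nIs  Love\n"))
print(normalize_ocr_text(" \n\x0c"))
```

Joining the tokens with a single space keeps words on adjacent lines separate, which matters for any later word-level matching.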

Detect polarity of the text detected in the messages

path_train = '/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset'

from textblob import TextBlob
import os
import pandas as pd

def detect_polarity(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    return TextBlob(text).sentiment.polarity

img_txt = []
image_name = []
polarity = []
for images in os.listdir(path_train):
    # Run OCR on each image in the training folder
    extractedInformation = pytesseract.image_to_string(Image.open(path_train + '/' + images))
    # Lowercase each space-separated token and strip embedded newlines
    converted_list = []
    for element in extractedInformation.split(" "):
        converted_list.append(element.replace("\n", "").lower())
    img_txt.append(" ".join(converted_list))
    polarity.append(detect_polarity(extractedInformation))
    image_name.append(images)

train_df = pd.DataFrame({'Image': image_name, 'text': img_txt, 'polarity': polarity})

Identify homophobic comments

import re

homophobic_words = ['gay','straight','homosexual','lesbian','sex','cruel','dead','bad','closeted','flammer','faggot','fuck','homo','lgbt','lgbtq','lgbtqi','lgrt','bisexual','sassy','homophobia','lgbtqia','mlm','pansexual','lgbtq ']

# Match any listed word as a whole word (\b marks a word boundary)
p = re.compile('|'.join(r'\b{}\b'.format(word) for word in homophobic_words))

def detect_category(text):
    if text == 'no text':
        return 'Random'
    elif re.search(p, text):
        return 'Negative'
    else:
        return 'Positive'

train_df['category'] = train_df['text'].apply(detect_category)

import seaborn as sns
sns.countplot(x=train_df['category'])
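The whole-word matching used above relies on the `\b` word-boundary anchor. The sketch below shows that behaviour on a short, made-up word list (the list and the helper name are illustrative assumptions, not the article's list). It also strips stray whitespace from entries and escapes regex metacharacters, which the original pattern does not do.

```python
import re


def build_word_pattern(words):
    """Compile one regex that matches any of the given words as a whole word."""
    # Strip whitespace, lowercase, and drop empty or duplicate entries
    cleaned = sorted({w.strip().lower() for w in words if w.strip()})
    alternatives = "|".join(re.escape(w) for w in cleaned)
    return re.compile(r"\b(?:" + alternatives + r")\b")


pattern = build_word_pattern(["bad", "cruel", "cruel "])
print(bool(pattern.search("a bad day")))      # whole word present
print(bool(pattern.search("a shiny badge")))  # 'bad' only as part of 'badge'
```

Because the pattern is lowercase, the text being searched should be lowercased first, as the extraction loop already does.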

Use texthero for text processing:

import texthero as hero
from texthero import preprocessing

# Create a custom cleaning pipeline
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_diacritics,
                   preprocessing.remove_stopwords,
                   preprocessing.remove_whitespace,
                   preprocessing.stem]

# Clean the training text with the same pipeline that is applied to the test text later
train_df['clean_text'] = hero.clean(train_df['text'], pipeline=custom_pipeline)

tfidf, name = hero.tfidf(train_df['clean_text'], max_features=300, return_feature_names=True)

# Assumption: the step that builds data_copy is not shown in the original.
# Here the TF-IDF vectors are expanded into one column per term, as is done for the test data below.
data_copy = pd.DataFrame(tfidf.values.tolist(), columns=name)
data_copy['category'] = train_df['category'].values
data_copy.head()

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode the category names as integer targets
lb = LabelEncoder()
lb_enc = lb.fit_transform(data_copy['category'])
data_copy['target'] = lb_enc
data_copy.head()
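LabelEncoder assigns integers to the sorted unique class names, so with the three categories used here the mapping would be Negative to 0, Positive to 1, and Random to 2. The following is a plain-Python sketch of that behaviour, not scikit-learn itself; the function names are hypothetical.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


def encode(labels, mapping):
    """Convert label names to their integer codes."""
    return [mapping[label] for label in labels]


def decode(codes, mapping):
    """Convert integer codes back to label names (the inverse transform)."""
    inverse = {index: name for name, index in mapping.items()}
    return [inverse[code] for code in codes]


mapping = fit_label_mapping(["Positive", "Random", "Negative", "Positive"])
codes = encode(["Random", "Negative"], mapping)
print(mapping)
print(codes, decode(codes, mapping))
```

Keeping the fitted mapping around is what makes it possible to turn predicted integers back into category names after prediction.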

Train Test Split

X_train, X_test, y_train, y_test = train_test_split(data_copy.drop(['target', 'category'], axis=1), data_copy['target'], test_size=0.2, random_state=42)

Logistic Regression Model

lr = LogisticRegression(multi_class='ovr')
lr.fit(X_train, y_train)
from sklearn.metrics import recall_score, f1_score, accuracy_score, confusion_matrix

y_pred = lr.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, y_pred))

Accuracy Score (LogisticRegression)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print(accuracy_score(y_test, rf_pred))

Accuracy Score (RandomForestClassifier): 0.916

Hyperparameter Optimization — RandomForestClassifier

from sklearn.model_selection import GridSearchCV
parameters = {
    'n_estimators': [320, 330, 340],
    'max_depth': [8, 9, 10, 11, 12],
    'random_state': [0, 42, 100]
}
clf = GridSearchCV(RandomForestClassifier(), parameters, cv=10, n_jobs=-1)
clf.fit(X_train, y_train)

Best Parameters: {'max_depth': 10, 'n_estimators': 320, 'random_state': 0}
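To get a feel for the cost of this search, the number of model fits can be worked out from the grid itself: every combination of parameter values is trained once per cross-validation fold. A small sketch of that arithmetic, using the grid and fold count from the code above:

```python
from itertools import product

parameters = {
    'n_estimators': [320, 330, 340],
    'max_depth': [8, 9, 10, 11, 12],
    'random_state': [0, 42, 100],
}
cv_folds = 10

# Every combination of one value per parameter is a candidate
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set, which is what `best_estimator_` refers to.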

Confusion Matrix:

Classification Report:
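As a rough illustration of what a confusion matrix tabulates, the sketch below counts (true, predicted) label pairs in plain Python and derives per-class recall from the rows. The labels are made-up illustrative values, not results from this dataset, and the function names are hypothetical. In scikit-learn, `confusion_matrix(y_test, y_pred)` and `classification_report(y_test, y_pred)` from `sklearn.metrics` produce the same kind of summary, with rows indexed by true label and columns by predicted label.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def recall_per_class(matrix, labels):
    """Recall = correctly predicted items of a class / all true items of that class."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


# Illustrative labels only
labels = ["Negative", "Positive", "Random"]
y_true = ["Negative", "Positive", "Positive", "Random", "Negative", "Random"]
y_pred = ["Negative", "Positive", "Negative", "Random", "Negative", "Positive"]

matrix = confusion_counts(y_true, y_pred, labels)
recalls = recall_per_class(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(recalls)
```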

Make predictions on test data :

test = pd.read_csv('/content/drive/My Drive/Detect_Sentiments/Data Files/Test.csv')
tstimg = test.Filename.values.tolist()

img_txt = []
image_name = []
polarity = []
for images in tstimg:
    # Run OCR on each test image (stored in the same folder as the training images)
    extractedInformation = pytesseract.image_to_string(Image.open(path_train + '/' + images))
    converted_list = []
    for element in extractedInformation.split(" "):
        converted_list.append(element.replace("\n", "").lower())
    img_txt.append(" ".join(converted_list))
    polarity.append(detect_polarity(extractedInformation))
    image_name.append(images)

# Attach the extracted text so it can be cleaned below
test['text'] = img_txt
test['clean_text'] = hero.clean(test['text'], pipeline=custom_pipeline)

# Note: TF-IDF is recomputed on the test text here, so its columns may differ from the training features
tfidf_test, name_test = hero.tfidf(test['clean_text'], max_features=300, return_feature_names=True)

# Expand the TF-IDF vectors into one column per term
tf_list_test = []
for item in tfidf_test.values:
    tf_list_test.append(item)
tfidf_test = pd.DataFrame(tf_list_test, columns=name_test)

pred = clf.best_estimator_.predict(tfidf_test)

# Convert the integer predictions back to category names
label_pred = lb.inverse_transform(pred)
#
sub = pd.read_csv('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample_Submission.csv')
print(sub.head())
#
sub[sub['Category'] == 'Negative']
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test1122.jpg')
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test107.jpg')
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test2792.jpg')
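One point worth illustrating is that a model trained on TF-IDF features expects the same columns, in the same order, at prediction time. A common way to guarantee this is to learn the vocabulary and IDF weights from the training text only and reuse them for the test text. The sketch below is a simplified plain-Python version of that idea (smoothed IDF, no vector normalization), not the texthero implementation; the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def fit_tfidf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform_tfidf(docs, vocab, idf):
    """Build one row per document using the fixed training vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Terms outside the training vocabulary are ignored
        rows.append([counts[t] * idf[t] for t in vocab])
    return rows


vocab, idf = fit_tfidf(["love wins", "love is love"])
test_rows = transform_tfidf(["hate wins", "love"], vocab, idf)
print(vocab)
print(test_rows)
```

With scikit-learn, the equivalent pattern would be calling `fit_transform` on the training text and `transform` on the test text with the same `TfidfVectorizer` instance.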

Note: This is an attempt to use pytesseract to read text embedded in pictures and classify the polarity of that text. Better techniques may exist; the sole intention here was to apply OCR and text processing techniques using the texthero and pytesseract packages.
