Detecting Sentiment from Images Using Tesseract and TextHero
Detecting sentiments of a quote
Categorize quotes uploaded during Pride Month by their sentiment:
- positive,
- negative, and
- random.
Task
To build an engine that combines OCR and NLP: it accepts a .jpg file as input, extracts the text, if any, and classifies its sentiment as positive or negative. If the sentiment is neutral, or the image does not contain any text, the image is classified as random.
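The decision rule described above can be sketched as a small function. This is only an illustration, assuming a polarity score in the range [-1, 1] such as the one TextBlob returns; the name `classify_quote` and the cut-off at exactly zero are assumptions, not part of the original notebook.

```python
def classify_quote(text: str, polarity: float) -> str:
    """Map extracted text and its polarity score to one of three labels.

    Hypothetical helper: the name and the zero threshold are assumptions.
    """
    # No readable text in the image: treat it as random
    if not text or not text.strip():
        return "Random"
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    # Exactly neutral polarity also falls back to random
    return "Random"


print(classify_quote("love wins", 0.5))
print(classify_quote("", 0.0))
```

In practice a small tolerance band around zero may work better than an exact comparison, since OCR noise can shift the score slightly.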
Dataset
The dataset consists of quotes that are uploaded during Pride Month.
Install TextHero for automatic NLP
!pip install texthero
Install Tesseract for reading text from Images
!sudo apt install tesseract-ocr
!pip install pytesseract
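pytesseract is a wrapper around the Tesseract executable, so the pip package alone is not enough. A quick way to confirm the binary is reachable, sketched with the standard library only (the helper name `tesseract_available` is an assumption for illustration):

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("tesseract not found on PATH; install tesseract-ocr first")
```

If the binary lives outside PATH, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.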
Import corresponding Libraries
import os
import random
import re
import shutil
import pytesseract
from PIL import Image
Sample text processing from Images
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Negative.jpg')
extractedInformation = pytesseract.image_to_string(Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Negative.jpg'))
print(extractedInformation)
Output:
Image.open('/content/lgbt.JPG')
extractedInformation = pytesseract.image_to_string('/content/lgbt.JPG')
extractedInformation
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Random.jpg')
extractedInformation = pytesseract.image_to_string(Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample Data Files/Sample_Random.jpg'))
print(extractedInformation)
Output: no text, as the image has no embedded text.
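OCR output usually contains line breaks and stray whitespace, and an image without text yields an empty or whitespace-only string. A minimal sketch of a normalizer that collapses whitespace and substitutes a placeholder for empty results; the function name and the `"no text"` placeholder are assumptions chosen to match the check used later in the category step.

```python
def normalize_ocr_text(raw: str, placeholder: str = "no text") -> str:
    """Lowercase OCR output, collapse whitespace, and mark empty results."""
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines and the form feed Tesseract may append
    tokens = raw.lower().split()
    return " ".join(tokens) if tokens else placeholder


print(normalize_ocr_text("Love\nis  LOVE\n\x0c"))
print(normalize_ocr_text(" \n\x0c"))
```

Replacing newlines with spaces, rather than deleting them, keeps the last word of one line from being fused with the first word of the next.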
Detect the polarity of the text extracted from the images
path_train = '/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset'
import os
import pandas as pd
from textblob import TextBlob

def detect_polarity(text):
    # TextBlob polarity is a float between -1.0 (negative) and 1.0 (positive)
    return TextBlob(text).sentiment.polarity

img_txt = []
image_name = []
polarity = []
for images in os.listdir(path_train):
    # Run OCR on each image in the dataset folder
    extractedInformation = pytesseract.image_to_string(Image.open(path_train + '/' + images))
    text = extractedInformation.encode('utf-8')
    # Lowercase each space-separated token and strip embedded newlines
    converted_list = []
    for element in extractedInformation.split(" "):
        converted_list.append(element.replace("\n", "").lower())
    img_txt.append(" ".join(converted_list))
    polarity.append(detect_polarity(extractedInformation))
    image_name.append(images)

train_df = pd.DataFrame({'Image': image_name, 'text': img_txt, 'polarity': polarity})
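The loop above passes every entry of the folder to the OCR step, which fails if the folder also holds non-image files or subfolders. A small sketch of a filter that keeps only JPEG files, using the standard library; the helper name `list_jpg_files` and the accepted extensions are assumptions.

```python
from pathlib import Path


def list_jpg_files(folder):
    """Return the JPEG files directly inside `folder`, sorted by path."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        # Skip directories and compare the extension case-insensitively
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
```

The returned `Path` objects can be passed straight to `Image.open`, and `p.name` gives the bare file name for the `Image` column.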
Identify homophobic comments
import re
import seaborn as sns

homophobic_words = ['gay','straight','homosexual','lesbian','sex','cruel','dead','bad','closeted','flammer','faggot','fuck','homo','lgbt','lgbtq','lgbtqi','lgrt','bisexual','sassy','homophobia','lgbtqia','mlm','pansexual','lgbtq ']

# Compile one pattern that matches any listed word on word boundaries
p = re.compile('|'.join(r'\b{}\b'.format(word) for word in homophobic_words))

def detect_category(text):
    # Images with no extracted text are labeled Random
    if text == 'no text' or not text.strip():
        return 'Random'
    if re.search(p, text):
        return 'Negative'
    return 'Positive'

train_df['category'] = train_df['text'].apply(detect_category)
sns.countplot(x=train_df['category'])
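The keyword check relies on `\b` word boundaries so that a listed word does not match inside a longer word. A minimal sketch of the same idea with two safeguards added: `re.escape` for entries containing regex metacharacters, and stripping of stray spaces. The helper names and the two placeholder words are assumptions for illustration only.

```python
import re


def build_keyword_pattern(words):
    """Compile a single whole-word pattern from a list of keywords."""
    # Strip stray spaces and drop duplicates before building the alternation
    cleaned = sorted({w.strip().lower() for w in words if w.strip()})
    if not cleaned:
        raise ValueError("keyword list is empty")
    # re.escape guards against keywords that contain regex metacharacters
    alternation = "|".join(re.escape(w) for w in cleaned)
    return re.compile(r"\b(?:" + alternation + r")\b")


pattern = build_keyword_pattern(["bad", "cruel", "bad "])  # placeholder words


def contains_keyword(text):
    return pattern.search(text.lower()) is not None


print(contains_keyword("That was a cruel joke"))
print(contains_keyword("badminton is fun"))
```

A keyword match only signals that a word is present; it says nothing about the context in which the word is used, so supportive quotes that mention a listed word are labeled the same way as hostile ones.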
Use texthero for text processing:
import texthero as hero
train_df['clean_text'] = hero.clean(train_df['text'])
from texthero import preprocessing
#create a custom cleaning pipeline
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_diacritics,
                   preprocessing.remove_stopwords,
                   preprocessing.remove_whitespace,
                   preprocessing.stem]
tfidf, name = hero.tfidf(train_df['clean_text'], max_features=300, return_feature_names=True)
# df_train (construction not shown here) is assumed to combine the TF-IDF feature columns with 'clean_text' and 'category'
data_copy = df_train.drop(['clean_text'], axis=1)
data_copy.head()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
lb_enc = lb.fit_transform(data_copy['category'])
data_copy['target'] = lb_enc
data_copy.head()
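The code above uses `df_train` without showing how it is built. One plausible construction, assuming `hero.tfidf` returns a Series whose cells are per-document vectors (as the test-set code later in this post suggests), is to expand those vectors into one column per feature and attach the labels. The helper name `build_feature_frame` and the toy values are assumptions, not the original notebook's code.

```python
import pandas as pd


def build_feature_frame(vectors, feature_names, labels):
    """Expand a Series of per-document vectors into one column per feature."""
    features = pd.DataFrame(
        vectors.tolist(),
        columns=list(feature_names),
        index=vectors.index,
    )
    # Attach the category label so it can be encoded as the target later
    features["category"] = labels
    return features


# Toy stand-ins for the TF-IDF output and the category column
vectors = pd.Series([[0.0, 0.7], [0.5, 0.0]])
names = ["love", "pride"]
labels = pd.Series(["Positive", "Random"])
frame = build_feature_frame(vectors, names, labels)
print(frame)
```

With a frame of this shape, dropping `category` and the encoded target leaves only numeric columns, which is what the classifiers expect.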
Train Test Split
X_train, X_test, y_train, y_test = train_test_split(data_copy.drop(['target', 'category'], axis=1), data_copy['target'], test_size=0.2, random_state=42)
Logistic Regression Model
lr = LogisticRegression(multi_class='ovr')
lr.fit(X_train, y_train)
from sklearn.metrics import recall_score, f1_score, accuracy_score, confusion_matrix
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
print(accuracy_score(rf_pred,y_test))
Accuracy score (RandomForestClassifier): 0.916
Hyperparameter Optimization: RandomForestClassifier
from sklearn.model_selection import GridSearchCV
parameters = {
    'n_estimators': [320, 330, 340],
    'max_depth': [8, 9, 10, 11, 12],
    'random_state': [0, 42, 100]
}
clf = GridSearchCV(RandomForestClassifier(), parameters, cv=10, n_jobs=-1)
clf.fit(X_train,y_train)
Best Parameters: {'max_depth': 10, 'n_estimators': 320, 'random_state': 0}
Confusion Matrix:
Classification Report:
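The confusion matrix and classification report can be produced with scikit-learn's metrics module. The sketch below uses toy labels as placeholders, since the original outputs are not reproduced here; the variable names are assumptions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels standing in for the held-out targets and the model's predictions
y_true = ["Negative", "Positive", "Positive", "Random", "Negative"]
y_hat = ["Negative", "Positive", "Random", "Random", "Negative"]
labels = ["Negative", "Positive", "Random"]

# Rows are true classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true, y_hat, labels=labels)
report = classification_report(y_true, y_hat, labels=labels, zero_division=0)

print(cm)
print(report)
```

In the notebook, `y_true` would correspond to `y_test` and `y_hat` to `clf.best_estimator_.predict(X_test)`; `lb.inverse_transform` can convert the encoded integers back to category names for a readable report.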
Make predictions on test data:
test = pd.read_csv('/content/drive/My Drive/Detect_Sentiments/Data Files/Test.csv')
tstimg = test.Filename.values.tolist()
#
img_txt = []
image_name = []
polarity = []
for images in tstimg:
    # Run OCR on each test image listed in Test.csv
    extractedInformation = pytesseract.image_to_string(Image.open(path_train + '/' + images))
    text = extractedInformation.encode('utf-8')
    converted_list = []
    for element in extractedInformation.split(" "):
        converted_list.append(element.replace("\n", "").lower())
    img_txt.append(" ".join(converted_list))
    polarity.append(detect_polarity(extractedInformation))
    image_name.append(images)
# Store the OCR text so that the cleaning step below can read test['text']
test['text'] = img_txt
#
test['clean_text'] = hero.clean(test['text'], pipeline = custom_pipeline)
#
tfidf_test,name_test = hero.tfidf(test['clean_text'],max_features=300,return_feature_names=True)
#
tf_list_test = list(tfidf_test.values)
#
tfidf_test = pd.DataFrame(tf_list_test,columns=name_test)
#
pred = clf.best_estimator_.predict(tfidf_test)
#
label_pred = lb.inverse_transform(pred)
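Computing TF-IDF separately on the test text produces a vocabulary, and therefore a set of columns, that need not match the one the model was trained on. A common remedy is to fit the vectorizer on the training text only and reuse it for the test text. The sketch below swaps in scikit-learn's `TfidfVectorizer` in place of `hero.tfidf` to show the pattern; the toy sentences and variable names are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents standing in for the cleaned train and test text
train_text = ["love is love", "pride and joy", "bad cruel words"]
test_text = ["love and pride", "unseen tokens only"]

vectorizer = TfidfVectorizer(max_features=300)
# fit_transform learns the vocabulary and IDF weights from training text only
X_train_tfidf = vectorizer.fit_transform(train_text)
# transform reuses that vocabulary, so the columns line up with training
X_test_tfidf = vectorizer.transform(test_text)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Words that appear only in the test text are ignored, so a test document made up entirely of unseen words becomes an all-zero row rather than introducing new columns.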
#
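sub = pd.read_csv('/content/drive/My Drive/Detect_Sentiments/Data Files/Sample_Submission.csv')
# Assumed step, not shown in the original: store the predicted labels in the submission frame
sub['Category'] = label_pred
print(sub.head())
#
sub[sub['Category'] == 'Negative']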
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test1122.jpg')
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test107.jpg')
Image.open('/content/drive/My Drive/Detect_Sentiments/Data Files/Dataset/Test2792.jpg')
Note: This is an attempt to use pytesseract to read text embedded in pictures and classify the polarity of that text. Better techniques may well exist; the sole intent here was to apply OCR and text-processing techniques using pytesseract and the new texthero package.