Transfer Learning for Languages

Plaban Nayak
Jun 12, 2020

The intuition of this article is to illustrate how transfer learning can be applied to NLP tasks.

In a conventional NLP task we generally perform the following steps:

  1. Text Preprocessing
  2. Tokenization
  3. Normalize data using stemming/lemmatization
  4. Create an embedding vector
  5. Develop the final model

Rather than performing all these steps ourselves, is there an easier way, where we simply feed in the raw text and a pretrained model takes care of the subsequent processing?

Here we make use of TensorFlow Hub, which allows pretrained models to be easily loaded into TensorFlow. To install TensorFlow Hub, use the following command:

pip install tensorflow_hub

It is also necessary to install TensorFlow Datasets. Use the following command:

pip install tensorflow_datasets

Import Libraries

Load Amazon Personal Care Appliances review data from TensorFlow Datasets

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon's iconic products. In a period of over two decades since the first review in 1995, millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences regarding products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), amongst others. Accordingly, we are releasing this data to further research in multiple disciplines related to understanding customer product experiences. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.

The dataset contains the following columns:
marketplace - 2 letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews
(also used to group the dataset into coherent parts).
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written

We are concerned with review_body as the predictor, in order to predict the label star_rating.

Consuming the loaded data

Create a function to generate the labels
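One possible sketch of the labelling step (the exact mapping used is an assumption): shift the 1-5 star rating down to a 0-4 integer class index so it can be fed to a sparse categorical loss.

```python
import tensorflow as tf

def to_label(star_rating):
    """Map a 1-5 star rating tensor to a 0-4 integer class label."""
    return tf.cast(star_rating, tf.int64) - 1
```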

Define a function that takes a tensor object as input and retrieves the text and label

Apply the function to the dataset to fetch the review text and corresponding label
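A hedged sketch of these two steps, assuming the nested {"data": {...}} structure of the TFDS examples: `get_text_and_label` pulls out `review_body` and converts `star_rating` to a 0-4 label, and `Dataset.map` applies it across the whole dataset.

```python
import tensorflow as tf

def get_text_and_label(example):
    """Take one TFDS example dict and return (review text, 0-4 label)."""
    text = example["data"]["review_body"]
    label = tf.cast(example["data"]["star_rating"], tf.int64) - 1
    return text, label

# Applied to the loaded dataset:
# train_data = dataset.map(get_text_and_label)
```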

Fetch a pretrained model from TensorFlow Hub

Pretrained model: token-based text embedding trained on the English Google News 200B corpus (https://tfhub.dev/google/nnlm-en-dim128/2)

Steps:

  1. Takes a 1D tensor of strings as input
  2. Splits each input string on spaces
  3. Embeds each word individually, then combines the word embeddings into a single sentence embedding, which it returns

Add top layers to the pretrained model base

Compile the Model

Shuffle the data and create batches of 512 examples

Train the model for 5 epochs
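The batching and training steps can be sketched as follows. A toy in-memory dataset stands in for the mapped review data so the snippet runs on its own, and the fit call on the real model is shown commented:

```python
import tensorflow as tf

# Toy (text, label) pairs standing in for the mapped review dataset.
train_data = tf.data.Dataset.from_tensor_slices((
    tf.constant(["great product", "broke after a week", "works fine"]),
    tf.constant([4, 0, 3], dtype=tf.int64),
))

BUFFER_SIZE = 10_000   # shuffle buffer size (assumed)
BATCH_SIZE = 512       # batches of 512, as in the text

train_batches = (train_data.shuffle(BUFFER_SIZE)
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.AUTOTUNE))

# With the compiled model and a held-out validation split:
# history = model.fit(train_batches, validation_data=valid_batches, epochs=5)
```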

Visualize Training and validation Accuracy per epoch

Visualize Training and validation loss
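A small helper along these lines covers both plots; `history` is the object returned by `model.fit`:

```python
import matplotlib.pyplot as plt

def plot_history(history, metric):
    """Plot a training metric and its validation counterpart per epoch."""
    plt.plot(history.history[metric], label="training " + metric)
    plt.plot(history.history["val_" + metric], label="validation " + metric)
    plt.xlabel("epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.show()

# plot_history(history, "accuracy")
# plot_history(history, "loss")
```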

Evaluate the trained model on the test data

The model does not overfit, as is evident from the validation accuracy of about 85%. Note that we have only built a base model.

Make Predictions

Conclusion

We have seen that transfer learning is applicable to NLP tasks as well, not just image classification.

We did not perform any of the usual text preprocessing before feeding the data to the model.

The text processing part was handled by the pretrained model.

Note: The accuracy on this problem can still be improved by adding a Conv1D layer or introducing an LSTM layer, but the main aim here was to apply transfer learning to text data.

Resources Referred:

Deep learning tutorials by Jeff Heaton

TensorFlow dataset tutorial

Dataset docs

Connect with me
